A Weird Python 3 Unicode Failure

The following code can fail on my system:

from os import listdir
for f in listdir('.'):
    print(f)
Why?

UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 13: surrogates not allowed
What?

I have a file with the name b'Latin1 file: \xe9'. This is a filename with an “é” encoded using Latin-1 (which is byte value \xe9).
Python attempts to decode it using the encoding of the current locale, which is UTF-8. Unfortunately, a lone \xe9 byte is not valid UTF-8, so Python works around this with the surrogateescape error handler (PEP 383): the undecodable byte becomes the lone surrogate character \udce9. So, I get a variable f which can still be used to open the file.
However, I cannot print the value of f: when Python attempts to encode it back to UTF-8 for output, the lone surrogate triggers the error above.
I can understand what is happening, but it’s just a mess. [1]
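To make the mechanism concrete, here is a minimal sketch that reproduces it without touching the filesystem. It assumes UTF-8 as the filesystem encoding (as on my system) and spells out the surrogateescape handler explicitly; the variable names are only for illustration.

```python
# The raw filename as the kernel stores it: "é" encoded in Latin-1.
raw = b"Latin1 file: \xe9"

# Decode the way Python decodes Unix filenames when the filesystem
# encoding is UTF-8: undecodable bytes become lone surrogates.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))  # ascii() escapes the surrogate, so this print is safe

# The mapping is reversible, which is why open(name) still works.
round_trip = name.encode("utf-8", "surrogateescape")

# A strict encode (what print ends up doing) rejects the surrogate.
try:
    name.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as err:
    strict_error = err

print(strict_error)
```

The byte \xe9 maps to \udce9 (0xDC00 + 0xE9), and the failing position is 13 because “Latin1 file: ” is 13 characters long.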

§

Here is a complete example:

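# Create a file whose name is Latin-1 bytes, not valid UTF-8
f = open('Latin1 file: é'.encode('latin1'), 'w')
f.write("My file")
f.close()

from os import listdir
for f in listdir('.'):
    # Fails on the name above when stdout is UTF-8
    print(f)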
On a modern Unix system (i.e., one that uses UTF-8 as its system encoding), this will fail.
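For what it is worth, one way to print such names anyway is to recover the original bytes and decode them leniently. The sketch below is only that, a sketch: display_name is a hypothetical helper, not something from the standard library, and it assumes the filesystem encoding is UTF-8.

```python
from os import listdir


def display_name(name: str) -> str:
    """Hypothetical helper: make a listdir() result safe to print.

    Assumes the filesystem encoding is UTF-8, as in the example above.
    """
    # Undo surrogateescape to get the original bytes back ...
    raw = name.encode("utf-8", "surrogateescape")
    # ... then decode leniently, using U+FFFD for undecodable bytes.
    return raw.decode("utf-8", "replace")


def main() -> None:
    for f in listdir("."):
        print(display_name(f))


if __name__ == "__main__":
    main()
```

The result is for display only: the placeholder loses information, so the original f is still what has to be passed to open.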

§

A good essay on the failure of the Python 3 transition is out there to be written.

[1] ls on the same directory generates a placeholder character, which is a decent fallback.


2 thoughts on “A Weird Python 3 Unicode Failure”

  1. The fundamental problem here is that Python, like lots of people, thinks a filename is a string. That’s true on Windows, and maybe even on the Mac, but on most Unix systems a filename is just an array of bytes, arbitrary except that it doesn’t contain ‘\0’ or ‘/’.

    A Unix filename is really a kind of opaque ID that should not be seen directly by the user. User-facing software must convert it to something nicer, but that conversion operation is not really “bytes.decode(someEncoding)”; rather, it is more of a heuristic that you might call ‘filenameToUIString(bytes)’. In Python, filenames are strings, but they use surrogates to maintain a 1-1 mapping to the underlying byte arrays. So the structure of the problem is the same, and you need a function ‘filenameToUIString(string)’.

    You can still call that a mess, but when you get the philosophy behind it, it seems more reasonable.
