PyPy - Python 3k Status #5 update

saurik · on July 10, 2012

Paths and file names on Unix fundamentally do not have encoding: the filesystem represents them as sequences of bytes, and it is entirely possible to have a directory full of folders where there is no codec that is capable of faithfully or even reasonably decoding all of the names contained (or even that Unicode itself is capable of representing the semantically correct decoding of the filenames if you knew what theoretical codec represented them in the first place).
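To make the "sequences of bytes" point concrete, here is a minimal sketch, assuming a Linux machine whose temporary directory lives on a filesystem that accepts arbitrary bytes in names (ext4, tmpfs and the like). The helper name `create_and_list` is made up for illustration. It creates a file whose name is not valid UTF-8 and reads it back through Python's bytes-based path API:

```python
import os
import tempfile


def create_and_list(raw_name: bytes) -> list:
    """Create an empty file called raw_name in a scratch directory,
    then list that directory using bytes paths (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)  # bytes form of the directory path
        # open() accepts a bytes path and passes it to the OS unchanged.
        with open(os.path.join(tmp_bytes, raw_name), "wb"):
            pass
        # A bytes argument makes os.listdir() return bytes names.
        return os.listdir(tmp_bytes)


# b"caf\xe9" is "café" in ISO-8859-1 and is not valid UTF-8.
names = create_and_list(b"caf\xe9")
print(names)
```

On filesystems that validate names (HFS+ on OS X, for instance) the `open()` call may be rejected instead, which is the platform difference this thread goes on to argue about.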

It is therefore a fundamental mistake and a misinterpretation of the semantics of filesystems by Python 3 to insist that file names are represented by Unicode strings with a locale-sensitive encoding; in fact, I question whether there are interesting security ramifications inherent in this mistake (such as allowing me to change the locale in which a process is running and thereby remap its import path; this coming up, of course, as this article is largely about sys.path and Unicode).

thristian · on July 11, 2012

Well, it's a misinterpretation of the semantics of traditional Unix filesystems. Other filesystems used by other operating systems (such as Windows' NTFS and OS X's HFS+) genuinely do store filenames as Unicode strings, so on those platforms (or on Unix, if you've mounted one of those filesystems) Python 3.x's approach is exactly correct.

That said, recent versions of Python 3 include a workaround for exactly the problem you describe: when the interpreter gets a byte-sequence value from the OS, filenames that decode to Unicode cleanly will be represented as proper Unicode strings, while filenames that can't be decoded will have the raw bytes represented as code-points in Unicode's Private Use Area. That way, even if Python can't decode the contents of a string, you can still, say, get a parameter from the command-line, pass it to open(), and be confident that you'll actually get the file the user intended.

saurik · on July 11, 2012

While I reference the issue of a mixed-codec directory in order to make clear the flaw in the operating assumption, the actual problem I am interested in here, and which I conclude my statements with a reference to, is handling things like sys.path (hence this being a reply to the article).

To respond to your comments, however: actually, the behavior on, let's say, a Linux box, if you mount one of those aforementioned filesystems, will not be "exactly correct". Python 3 will attempt to encode the Unicode string using the current locale and pass the resulting bytes to the underlying Unix open() function, which will then have no clue what to do with them as they hit the filesystem.

In fact, rather than just idly claiming this, I went ahead and set up exactly this test setup on one of my servers. I created a python3 script in a known specific source encoding (UTF-8) and asked it, in each of two different locales, to make a file that included an accented character, while mounted on an HFS+ disk image.

    hfs+# mount | grep hfs
    /.../hfs+.img on /.../hfs+ type hfsplus (rw,force)
    hfs+# cat test.py
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    open("helloä", "w")
    hfs+# LANG=en_US.UTF-8 ./test.py
    hfs+# LANG=fr_FR.ISO-8859-1 ./test.py
    hfs+# LANG=en_US.UTF-8 ls -la
    -rw-r--r-- 1 root root  0 2012-07-11 00:57 hello?
    -rw-r--r-- 1 root root  0 2012-07-11 00:57 helloä
    -rwxr-xr-x 1 root root 64 2012-07-11 00:56 test.py*
    hfs+#

As you can see, the behavior here is really poor for something that claims to support Unicode. What we would expect is that, having opened a file with a Unicode name on a Unicode filesystem, I would actually get the specific Unicode string that I had wanted.

Instead, because Python 3 is only pretending to understand the semantics of filesystems with regards to character sets, and in fact has no way of taking advantage of the Unicode support in HFS+, we ended up with the encoding of the user's locale environment breaking our filenames.
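As a rough sketch of what is going on in the transcript above (assuming, as described, that on Linux the interpreter encodes the name with the locale's codec before calling open(); the helper `filename_bytes` is a made-up name, not anything in Python itself), the same string turns into two different byte sequences:

```python
def filename_bytes(name: str, locale_encoding: str) -> bytes:
    """Approximate the bytes handed to open(2) for a given locale codec
    (hypothetical helper for illustration)."""
    return name.encode(locale_encoding)


name = "hello\u00e4"  # "helloä"

utf8_bytes = filename_bytes(name, "utf-8")         # en_US.UTF-8
latin1_bytes = filename_bytes(name, "iso-8859-1")  # fr_FR.ISO-8859-1

# The two runs of test.py therefore ask for two different files.
print(utf8_bytes, latin1_bytes)

# Read back as UTF-8, the ISO-8859-1 byte is invalid, which is
# consistent with ls printing "hello?" for the second file.
print(latin1_bytes.decode("utf-8", errors="replace"))
```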

This would be akin to me doing JSON.stringify() in a browser, and having a Unicode JavaScript string get converted into different JSON (which is also represented as a Unicode JavaScript string) depending on what language the user's browser is configured to use: that is a miserable Unicode failure, and not a success story.

FWIW, the exact same code does work on Mac OS X (using the correct locale name of fr_FR.ISO8859-1): both versions of the script get the exact same filename. To be honest, I was somewhat surprised that they got that right ;P. I sadly do not have a Windows computer handy, as I'd love to see how it handles NTFS (which is technically UCS-2 and doesn't have the weird canonicalization behavior that HFS+ sometimes does: I can easily imagine broken corner cases with invalid UTF-16 surrogate pairs).

Regardless, I really do not believe that you need to break the behavior on Linux in order to make Mac OS X work correctly. Even if such a tradeoff were required, I question whether it should be resolved with Linux on the losing end. I am not going to say that the correct solution doesn't even use Unicode strings: it just needs to more intelligently handle who is in charge of the encodings and what they semantically mean than Python 3 is prepared to do.

As far as possible solutions go, I certainly do not believe that the private Unicode codepoint solution is either sufficient (as it doesn't solve the de novo filename creation problem) or even remotely reasonable (as if I attempt to then communicate these filenames to other systems, or, heaven forbid, want to store a filename with a private Unicode codepoint in it, I'm now screwed).

(edit: Wow, BTW. They added this "solution" after I had already given up on Python 3k, so I hadn't followed up to see what they did. It seems like they ended up not using "private codepoints" as they were considering in 2008, but are instead using surrogate-halves: they are encoding "malformed UTF-8 sequences as malformed UTF-16 sequences"[1]. Given that you are allowed to store malformed UTF-16 on NTFS, I wonder how they expect that to work. Still doesn't solve either of my complaints, though.)

1: http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043...
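For reference, the surrogate-halves mechanism is exposed in Python 3.1 and later as the `surrogateescape` error handler (PEP 383). A minimal sketch of the round trip, using a made-up filename:

```python
raw = b"hello\xe4"  # ISO-8859-1 bytes for "helloä"; not valid UTF-8

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, so the string cannot be
# written out as ordinary UTF-8 for other systems.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

print(ascii(name), restored, strict_encode_ok)
```

The round trip only helps for names that arrived from the OS as bytes; a name created from scratch as a str is still encoded with whatever codec the locale selects.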

lmm · on July 11, 2012

If you set LANG=fr_FR.ISO-8859-1 you are declaring that you want your filenames encoded as ISO-8859-1; the behaviour sounds like exactly what I'd expect. It's certainly not "broken on linux"; at worst it's "broken on linux when not using a utf-8 locale", and frankly I'd expect most things to break under that circumstance.

saurik · on July 11, 2012

You seem to be missing the point of this exercise: HFS+ is a Unicode-aware filesystem, so the idea that "you want your filenames encoded as ISO-8859-1" is fundamentally invalid and unimplementable. You can't specify an encoding for the files you are saving, as they are semantically "Unicode" and saved by the filesystem as UTF-16 on disk as an implementation detail.

The reason that HFS+ came up is that, as a filesystem that stores filenames as actual "strings" as opposed to "arrays of bytes", you actually get the correct behavior now on OS X with Python 3 (which you also seem to have missed); thristian contended that this behavior would also work correctly if I mounted the HFS+ image on Linux, and it does not.

To explicitly demonstrate this on Mac OS X (where the filename will be sent as Unicode through to the Unicode filesystem API and then saved correctly through HFS+ to disk with the original name, no matter what encoding you happen to have set as part of your locale, as your locale truly should be irrelevant in this specific situation, just as in my JSON analogy):

    mac$ cat test.py
    #!/Library/Frameworks/Python.framework/Versions/3.2/bin/python3
    # -*- coding: utf-8 -*-
    import locale
    print(locale.getpreferredencoding())
    open("testä", "w")
    mac$ LANG=en_US.UTF-8 ./test.py
    UTF-8
    mac$ LANG=fr_FR.ISO8859-1 ./test.py 
    ISO8859-1
    mac$ ls -la
    total 8
    drwxr-xr-x    4 saurik  staff   136 Jul 11 00:36 ./
    drwxr-xr-x+ 278 saurik  staff  9452 Jul 11 00:28 ../
    -rwxr-xr-x    1 saurik  staff   159 Jul 10 18:07 test.py*
    -rw-r--r--    1 saurik  staff     0 Jul 11 00:36 testä
    mac$

lmm · on July 11, 2012

>HFS+ is a Unicode-aware filesystem, so the idea that "you want your filenames encoded as ISO-8859-1" is fundamentally invalid and unimplementable

Sure. But it's still the idea that linux would apply, and how linux APIs (that expect a filename to be a stream of bytes) work. I really think this is a linux/locale problem rather than a python problem.

saurik · on July 11, 2012

I will yet again repeat: this excursion into HFS+ semantics on Linux was prompted by thristian's insistence that Python's behavior would handle HFS+'s Unicode behavior when mounted on Linux in the same correct way it does on Mac OS X. This is, in fact, false. This then nullifies the argument that this is a filesystem-specific issue.

You seem to be refusing to track this conversation's multiple thoughts: there is the underlying argument "Python 3 is making unreasonable assumptions" with a specific argument "these assumptions are reasonable on OS X" followed by an aside "incidentally, this behavior actually is not related to operating systems but is related to filesystems: as proof I cite HFS+ mounted on Linux" with an error pointed out in the aside "no: in fact HFS+ on Linux has the same behavior as any other filesystem on Linux".

I then separately respond to the point about "these semantics work on OS X" (ceding, in fact, albeit explicitly remaining skeptical on Windows), saying that the tradeoff of "works worse on Linux" (which I get to assume, as my earlier arguments that this is the case were not actually challenged: that on Linux the concept of encodings does not apply to filenames and causes problems like locale-specific sys.path) seems like the wrong direction to lean (which is an opinion, of course).

However, to make that claim, I need to defend against a new point that is brought up: that thristian believes that an epicycle added to the algorithm (the PUA "save the problem for later" mechanism) is sufficient to mitigate the Linux problems. I claim that it is not, and I bring up a few reasons why (de novo filenames, interop with non-Python systems, existing usages for PUA): reasons which, incidentally, were also discussed as open problems on the Python mailing list.

Finally, I also explained that the PUA solution isn't even being used anymore, but was actually replaced by UTF-8b. As this solves one of my complaints (existing usages for PUA) I then have to first admit that (although I defend that I believe that invalid surrogate pairs are not invalid on Windows, leading to a similar problem) and then, for clarity, mention that my other arguments are not affected by UTF-8b.

lmm · on July 12, 2012

So, in the interests of being perfectly clear: I am challenging your claim that python 3's approach works worse on Linux; I assert that its semantics under linux are correct (i.e. what a well behaved program running under linux-the-system should do). Conceptually, a program should tell the operating system to save a filename under a given name (unicode string); it is then the operating system's responsibility to translate that to and from bytes on disk.

What you have observed, and demonstrated with your example, is the behaviour of linux running with LANG=fr_FR.ISO-8859-1, which is to represent filenames that contain characters not representable in ISO-8859-1 as ?s. Any well-behaved linux program will exhibit the same behaviour, because it is not program behaviour but OS behaviour. Programs that ignore LANG and do their own filename encoding will appear better under your test, but such programs are misbehaving; by declaring LANG=fr_FR.ISO-8859-1 the user has made an explicit declaration that they wish for their filenames to be encoded as ISO-8859-1, and should expect as much.

That filenames of files stored on HFS+ under linux still have linux semantics despite the filesystem's semantics being different is an interesting accident of history but really neither here nor there. The idea that you want your filenames encoded as ISO-8859-1 may indeed be fundamentally invalid and unimplementable on HFS+, but it remains the semantics of setting LANG=fr_FR.ISO-8859-1 on linux, and as such it should be expected that linux would attempt to follow this behaviour as closely as possible.

Really the whole excursion into filesystems is irrelevant. Python 3 behaves correctly on operating systems which provide unicode filenames, i.e. "OSX" and "Linux with a UTF8 locale", and as well as could be expected on operating systems where filenames are only permitted to be strings in a particular encoding, i.e. "Linux with a non-UTF8 locale".

bcambel · on July 10, 2012

When do you think Py3K will be mainstream? I'm afraid this might be a failing effort. Is there any company out there using Py3K in production?

kisielk · on July 10, 2012

Do we really need to rehash this tired argument over every single post that references Python 3? You can look at any other Python 3 related post in the last couple of years on HN and Reddit to find endless arguments about this topic.

How about a comment thread that actually talks about the post contents for once?