I find the following hierarchy helpful: bytes -> code units -> code points -> ex...

labster · on Dec 11, 2016

In Perl 6:

    $width-in-terminal = $text.chars;
    $codepoints = $text.codes;
    $bytes = $text.encode('utf8').bytes;

The .chars method should be the fastest, because Perl 6 internally uses strings of fully composed characters (normalized form grapheme). It's much better than having to do regex hacks like in Python.
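For comparison, here is a minimal Python sketch of the lower levels of that hierarchy. The helper name `counts` is made up for illustration. Python's built-in `len()` on a `str` counts code points, and the standard library has no grapheme counter, so that level is left out.

```python
def counts(text: str) -> dict:
    """Return the size of `text` at several levels of the hierarchy."""
    return {
        # Bytes in the UTF-8 encoding.
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units: 2 bytes each (the -le codec writes no BOM).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # len() on a Python 3 str counts code points.
        "code_points": len(text),
    }


# "e" + U+0301 COMBINING ACUTE ACCENT, then U+1F600 (outside the BMP).
sample = "e\u0301\U0001F600"
print(counts(sample))
```

A reader would see two characters in `sample`, yet none of the three numbers is 2.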

Someone · on Dec 11, 2016

Given that one can make graphemes that contain many Unicode code points (213 in this extreme example: https://www.reddit.com/r/Unicode/comments/4yie0a/tallest_lon...), I wondered whether that can be correct for any reasonable definition of "fully composed character".

As far as I can tell, it turned out to be correct, though. Perl 6 has its own Unicode normalization variant called NFG (https://design.perl6.org/S15.html#NFG):

"Formally Perl 6 graphemes are defined exactly according to Unicode Grapheme Cluster Boundaries at level "extended" (in contrast to "tailored" or "legacy"), see Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION 3 Grapheme Cluster Boundaries. This is the same as the Perl 5 character class

   \X   Match Unicode "eXtended grapheme cluster"

With NFG, strings start by being run through the normal NFC process, compressing any given character sequences into precomposed characters.

Any graphemes remaining without precomposed characters, such as ậ or नि, are given their own internal designation to refer to them, at least 32 bits in length, in such a way that they avoid clashing with any potential future changes to Unicode. The mapping between these internal designations and graphemes in this form is not guaranteed constant, even between strings in the same process."

From that, I guess Perl 6 extends its mapping between NFG code points (an extension of Unicode code points) and Unicode grapheme clusters whenever it encounters a grapheme cluster it hasn't seen before. Ignoring performance concerns (it might not be bad, but I'm not sure about that), that seems a nice approach.
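A toy Python sketch of that guess, not Perl 6's actual implementation. The class `GraphemeTable` and its methods are hypothetical. It also assumes a much cruder cluster rule than UAX #29: a base character plus any following characters with a nonzero canonical combining class.

```python
import unicodedata


class GraphemeTable:
    """Toy model of NFG-style synthetic code points (illustration only)."""

    def __init__(self):
        self._ids = {}       # cluster string -> synthetic (negative) id
        self._clusters = []  # position -> cluster string

    def _intern(self, cluster):
        # Single code points keep their normal Unicode value.
        if len(cluster) == 1:
            return ord(cluster)
        # Unseen multi-code-point clusters get the next negative id,
        # so they can never clash with a real code point.
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def encode(self, text):
        # Step 1: NFC, so precomposed characters are used where they exist.
        text = unicodedata.normalize("NFC", text)
        out, current = [], ""
        for ch in text:
            # ASSUMPTION: simplified clustering, base + combining marks only.
            if current and unicodedata.combining(ch):
                current += ch
            else:
                if current:
                    out.append(self._intern(current))
                current = ch
        if current:
            out.append(self._intern(current))
        return out

    def decode(self, ids):
        return "".join(
            chr(i) if i >= 0 else self._clusters[-i - 1] for i in ids
        )


table = GraphemeTable()
# "e" + U+0301 composes to U+00E9; "x" + U+0301 has no precomposed form.
ids = table.encode("e\u0301x\u0301")
```

In this sketch the length of `ids` is the grapheme count, and indexing is O(1). The cost is a table that grows with every new cluster seen.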

dom0 · on Dec 11, 2016

    $width-in-terminal = $text.chars;

^- very likely wrong.

wcswidth computes the cell width (or column count) of a Unicode string, which is unrelated to the count of graphemes, EGCs or code points. For example, Latin characters are one cell/column wide, while many CJK characters occupy two cells/columns even though each is still a single EGC.

A typical application is printing CJK things to a terminal mask, progress display or similar.
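A rough Python approximation of the idea, using only the standard `unicodedata` module. This is not wcswidth. The function name `cell_width` is made up, and it assumes that combining marks take zero cells and that only East Asian Wide/Fullwidth characters take two. It ignores control characters, emoji sequences and ambiguous-width characters.

```python
import unicodedata


def cell_width(text: str) -> int:
    """Estimate how many terminal cells `text` occupies (rough sketch)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # ASSUMPTION: combining marks stack on the previous cell.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # Wide or Fullwidth: two cells
        else:
            width += 1  # everything else: one cell
    return width


latin = "abc"
cjk = "\u65e5\u672c\u8a9e"  # three CJK ideographs
```

Both strings are three code points and three graphemes long, but under these assumptions `cell_width(cjk)` is twice `cell_width(latin)`.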

labster · on Dec 11, 2016

Ah right. I guess we were meaning different things by width. And you're probably more correct here.

Manishearth · on Dec 11, 2016

> It has nothing to do with UTF32

Right, my point is that the concept of a code point is seldom useful unless doing storage stuff with utf32 or implementing unicode algorithms. Python may expose an API of code points but that doesn't mean that it's meaningful.

Performance arguments can be made as to why the API should use code points instead of grapheme clusters, so there are legitimate reasons for Python (and Rust, and many other languages) to do so. Sometimes you just need some comparable notion of length and "number of code points" is acceptable.

However, you should be careful when writing code that confers meaning to the concept of a code point. A lot of code does this (using code points when they mean glyphs or grapheme clusters).

eponeponepon · on Dec 11, 2016

  > Right, my point is that the concept of a code point
  > is seldom useful unless doing storage stuff with 
  > utf32 or implementing unicode algorithms.

In XML land, where strings are almost always UTF8, XPath offers a string-to-codepoints() function that returns a sequence of integers, and a corresponding codepoints-to-string(). These two have been invaluable to me on many occasions when doing string manipulation gymnastics.
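For readers outside XML land, a rough Python equivalent of that pair might look like this. The snake_case helper names are made up to mirror the XPath functions, and this is only a sketch of the same idea.

```python
def string_to_codepoints(text: str) -> list:
    """Return the code points of `text` as a list of integers."""
    return [ord(ch) for ch in text]


def codepoints_to_string(codepoints) -> str:
    """Build a string from an iterable of integer code points."""
    return "".join(chr(cp) for cp in codepoints)


points = string_to_codepoints("abc")
# Example of "gymnastics": shift every code point by one and rebuild.
shifted = codepoints_to_string(cp + 1 for cp in points)
```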

Manishearth · on Dec 11, 2016

> when doing string manipulation gymnastics.

What kind of string manipulation gymnastics? I'd be wary of using codepoints for string manipulation for anything other than algorithms where you are explicitly asked to (e.g. algorithms that implement operations from the unicode spec)

ubernostrum · on Dec 11, 2016

On the other hand, there are algorithms embedded in widely-deployed standards which are defined in terms of code points.

For example, one I know quite well from having implemented it in Python: the HTML5 color parsing algorithm (the one that turns even incredible junk strings like "chucknorris" into color values) requires, in step 7 of the parsing process, replacing any code point higher than U+FFFF with the sequence '00' (that's two instances of U+0030 DIGIT ZERO).
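That one step, in isolation, can be sketched in Python like this. The function name `replace_non_bmp` is made up, and the rest of the color parsing algorithm is not shown.

```python
def replace_non_bmp(value: str) -> str:
    """Replace each code point above U+FFFF with the two characters '00'."""
    # Iterating a Python 3 str yields code points, which is exactly the
    # unit this step is defined in.
    return "".join("00" if ord(ch) > 0xFFFF else ch for ch in value)


cleaned = replace_non_bmp("ab\U0001F600c")
```

Doing the same thing over UTF-16 code units would see a surrogate pair instead of one code point, so the unit of iteration matters here.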

And personally I think code points, as the basic atomic units of Unicode, do make sense as the things strings are made up of; I wish Python had better support for identifying graphemes without third-party libraries, but since Unicode encodings all map back to code points it makes sense to me that a Unicode string is a sequence of those rather than a sequence of some more-complex concept.

Manishearth · on Dec 11, 2016

> there are algorithms embedded in widely-deployed standards which are defined in terms of code points.

From my original comment:

> You only care about code points when dealing with UTF32 strings or when implementing operations on unicode text.

These operations fall into the latter category. It's still pretty niche. If an algorithm is defined explicitly in terms of code points, this makes sense. Stuff starts falling apart when people assign meaning to code points and use them as a placeholder for other concepts like glyphs, columns, or grapheme clusters.