I � Unicode [pdf] (seriot.ch)
98 points by beefburger on Oct 26, 2014 | 43 comments


Unicode is just about the most technically successful standard I have ever seen; it's pretty amazing.

The weird and complicated parts are all a result of the weirdness and complexity of the domain -- the universe of human written language. All the solutions are amazingly elegant for the domain they are in -- including solutions to legacy backwards compatibility where possible, which have made Unicode as successful at catching on as it has been. The decisions on compromises between practical legacy compatibility and pure elegance were _just right_.

The only major mis-step was the "UCS-2" mistake, made before they realized more bytes really were going to be needed -- sadly now baked into Java, making proper Unicode support there way harder than it should be.

But in general, if only all our standards dealing with very complex problems could be as elegantly designed and executed as unicode.


Something that often gets missed out of the Unicode story is that originally there were two groups. The first was the Unicode consortium, who wanted to combine all the world's existing character encodings, and had picked 16-bit units as a comfortable representation, which would have been more than enough for their stated goal.

When Unicode 1.0 came out, there were a bunch of people forming the ISO 10646 committee to produce a character encoding that would cover every human-written character ever, even the ones that weren't already part of an existing encoding, but 16 bits would definitely not be sufficient for that. On the other hand, creating two entirely separate standards wouldn't be a great idea either, so they joined forces and created Unicode 2.0 with astral planes and surrogate pairs and all that expansion business.

The point is, we shouldn't blame the Unicode consortium for short-sightedness, we should blame scope-creep.


Hm, except I think the scope broadening was a _great_ idea. A standard that only covered the human-written characters that had encodings created for them at that time would have been much less useful than what we've got.

Thanks for the explanation of the history -- so it wasn't exactly short-sightedness. Still, I wouldn't "blame" scope-creep -- maybe it's just another unusual example of the standards-makers involved managing to make the right decision at almost every point, even when it involved 'competition' between standards bodies.

The UCS-2 leftover stuff is one of the biggest problems in practical Unicode at the moment, alas.


Oh, certainly. In an ideal world we would have had the ISO 10646 scope from the start, combined with maybe UTF-8. I do occasionally come across people "explaining" UTF-16 by saying the Unicode consortium couldn't count, which I feel is unfair even if it's a lie-to-children.


> The only major mis-step was the "UCS-2" mistake, before they realized more bytes really were going to be needed

Unicode 2.0 came out in 1996, increasing the codepoint count from about 65,000 to 2.1 billion, but in 2003 came a bigger misstep: those 2.1 billion were reduced back down to about 1 million. And of that million, only 137,000 are for private use. Neither short-sightedness nor scope-creep is to blame, IMO, but power plays and office politics.


I'm interested in reading more about this -- can you cite a URL with more details about this reduction in codepoint space? I haven't been successful finding info on Google.


http://en.wikipedia.org/wiki/UTF-8#Description describes Ken Thompson's original proposal for UTF-8 with codepoints up to U+7FFFFFFF, and how the upper limit was reduced to U+10FFFF in 2003.
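
For reference, here's a minimal Python sketch (mine, not from the slides) of how many bytes a codepoint takes under Thompson's original UTF-8 design versus the RFC 3629 version that stops at U+10FFFF:

    def utf8_len_original(cp):
        """Bytes needed under the original 1992 UTF-8 design (up to U+7FFFFFFF)."""
        if cp <= 0x7F:       return 1
        if cp <= 0x7FF:      return 2
        if cp <= 0xFFFF:     return 3
        if cp <= 0x1FFFFF:   return 4
        if cp <= 0x3FFFFFF:  return 5
        if cp <= 0x7FFFFFFF: return 6
        raise ValueError("out of range")

    def utf8_len_rfc3629(cp):
        """Bytes needed since RFC 3629 capped the range at U+10FFFF in 2003."""
        if cp > 0x10FFFF:
            raise ValueError("beyond U+10FFFF: no longer encodable")
        return utf8_len_original(cp)  # same byte patterns, smaller range

    print(utf8_len_original(0x7FFFFFFF))  # 6
    print(utf8_len_rfc3629(0x10FFFF))     # 4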

I've actually written a package for Go https://github.com/gavingroovygrover/utf88 that uses the 2 private use planes to implement a surrogate system that restores that upper limit to UTF-8, which could easily also be applied to UTF-16 and UTF-32. It's all part of looking forward to the day UTF-8 as originally specified is reinstated.


Thanks. Do you have a need today for more than 1 million codepoints for some reason? Or you are just predicting that we collectively will eventually? Or you just miss the theoretical elegance? Or what?


Only 137,000 of the 1 million Unicode code points are categorized as Private Use, and the library I wrote addresses the need for more private use codepoints. A suggested path for people needing private use codepoints is (a small sketch of these ranges follows the list):

* Use the 6400 codepoints in the BMP

* Begin using plane 0xF, starting at U+F0000

* After using the 32,000 codepoints in the bottom half of 0xF up to U+F7FFF, if you're certain you won't need more than 137,000 codepoints, continue using plane 0xF from U+F8000 onwards

* Otherwise, skip the top half of plane 0xF, and go straight to plane 0x10 beginning at U+100000, and start using the surrogates when you get to U+110000
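
To put numbers on that path, here's a small Python sketch of the standard Private Use ranges and their sizes (mine, for illustration only; the surrogate scheme beyond U+110000 is specific to the utf88 package and isn't shown):

    # The Private Use code points defined by Unicode. Planes 15 and 16 stop at
    # ...FFFD because U+xFFFE and U+xFFFF are noncharacters.
    PUA_RANGES = [
        ("BMP PUA",         0xE000,   0xF8FF),    #  6,400 codepoints
        ("Plane 0xF, low",  0xF0000,  0xF7FFF),   # 32,768 codepoints
        ("Plane 0xF, high", 0xF8000,  0xFFFFD),   # 32,766 codepoints
        ("Plane 0x10",      0x100000, 0x10FFFD),  # 65,534 codepoints
    ]

    print(sum(hi - lo + 1 for _, lo, hi in PUA_RANGES))  # 137468, the ~137,000 above

    def is_private_use(cp):
        return any(lo <= cp <= hi for _, lo, hi in PUA_RANGES)

    print(is_private_use(0xF0000), is_private_use(0x110000))  # True False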

As for needing more than a million codepoints, I've been looking at whether it's possible to represent CJK Unihan characters compositionally in the 6-byte codepoints (U+4000000 to U+7FFFFFFF) in the same way Korean Hangul has been represented since Unicode 2.0, thus creating many more Unihan characters that don't presently exist. Perhaps there'll be a demand for them one day!
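
For context, the Hangul mechanism being referred to is purely algorithmic: a syllable's codepoint is computed from its leading consonant, vowel, and optional trailing consonant. A quick Python illustration of that existing scheme (any Unihan analogue above U+10FFFF would of course need its own, yet-to-be-defined composition rules):

    # Hangul syllable composition per the Unicode standard: 19 leading consonants,
    # 21 vowels, 28 trailing jamo (including "none"), packed contiguously into
    # U+AC00..U+D7A3 (19 * 21 * 28 = 11,172 syllables).
    S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

    def compose(l, v, t=0):
        """Map (leading, vowel, trailing) indices to the precomposed syllable."""
        return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

    def decompose(syllable):
        """Recover the (l, v, t) indices from a precomposed syllable."""
        i = ord(syllable) - S_BASE
        return i // (V_COUNT * T_COUNT), (i // T_COUNT) % V_COUNT, i % T_COUNT

    print(compose(18, 20, 27))  # '힣' (U+D7A3), the last syllable in the block
    print(decompose('한'))      # (18, 0, 4)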


> sadly now stuck in Java

If only it were only java…

It's also encoded in the JS spec, and throughout the Windows ecosystem… and anything spawned from the latter, which is why bloody UEFI uses UCS-2.


It's a shame they use ISO-8859-5 as an example because it was never used by anyone in practice. It's a stillborn standard.

First, we had IBM-866 and КОИ-8 aka KOI8-R, then painfully switched to WINDOWS-1251, and then Unicode. ISO-8859-5 was never adopted by anyone.


That's funny -- this survey says it's still in use: http://w3techs.com/technologies/details/en-iso885905/all/all Quelle.ru uses ISO-8859-5.


http://w3techs.com/technologies/details/en-windows1251/all/a...

Three orders of magnitude more for the de-facto standard. Which is, according to the site, the third most popular after Unicode and Latin-1.

Of course, it's technically possible to use ISO-8859-5, and some misguided software might even default to it. But it's exceedingly rare and there's no point to it. Even KOI8-R is ten times more widely used.


Here are my thoughts on Unicode:

Options:

1) Use UTF-32 everywhere. When space is an issue, just compress it - especially on disk. If you need random access to a string, use a seekable compression algorithm on it on-the-fly. Alternatively, use a compression algorithm with checkpoints and maintain a sorted list of where checkpoints start and how far along in the associated decompressed text you are. (Effectively rolling your own.) Note that this method doesn't work well with writes. (A rough sketch of this checkpoint idea follows after these options.)

2) Use an interesting variant of a rope. Use a rope, but a) keep track of "logical characters" instead of code points - what Unicode calls graphemes, IIRC - and b) have each node carry an encoding, and require that all characters within a given node have the same width. This allows for pretty much everything being sublinear. If you allow a bit in a node for "special" nodes (e.g. reversed, lazy-loaded, slice of another node, that sort of thing), reversing, among other things, is actually truly O(1). Bunches of optimizations here - you want to fall back to a "node" that's a flat array for small strings, you want to potentially use overlong encodings internally where appropriate (e.g. if you have one 1-byte character in a bunch of 2-byte characters, that sort of thing), you want to have some encodings that aren't fixed-width (for things like reading a bunch of bytes from a file), and you want an encoding that's "unknown" / binary data.
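
A rough Python sketch of what option 1's "compression with checkpoints" could look like, under the simplifying assumption of fixed-size blocks (the block size and function names are made up for illustration): the UTF-32 text is compressed block by block, so random access only decompresses one block.

    import zlib

    BLOCK_CHARS = 4096  # characters per independently compressed block

    def build(text):
        """Compress UTF-32 text block by block; the list index is the checkpoint table."""
        blocks = []
        for i in range(0, len(text), BLOCK_CHARS):
            raw = text[i:i + BLOCK_CHARS].encode("utf-32-le")  # 4 bytes per codepoint
            blocks.append(zlib.compress(raw))
        return blocks

    def char_at(blocks, index):
        """Random access: decompress only the block containing `index`."""
        block, offset = divmod(index, BLOCK_CHARS)
        raw = zlib.decompress(blocks[block])
        return raw[offset * 4:(offset + 1) * 4].decode("utf-32-le")

    blocks = build("I \N{REPLACEMENT CHARACTER} Unicode " * 10000)
    print(char_at(blocks, 2))  # '�'

And a much-simplified sketch of option 2's rope: leaves hold grapheme clusters rather than code points, and internal nodes cache a cluster count plus a "reversed" flag, so length and reversal are O(1). The per-node fixed-width encodings and overlong-encoding tricks are left out; this only shows the cluster-indexing skeleton.

    class Leaf:
        """Flat array of grapheme cluster strings (the small-string fallback)."""
        def __init__(self, clusters):
            self.clusters = list(clusters)
        def __len__(self):
            return len(self.clusters)
        def cluster_at(self, i):
            return self.clusters[i]

    class Node:
        """Concatenation node with a cached count and a lazy 'reversed' flag."""
        def __init__(self, left, right, rev=False):
            self.left, self.right, self.rev = left, right, rev
            self.count = len(left) + len(right)  # cached, so len() is O(1)
        def __len__(self):
            return self.count
        def reversed_view(self):
            return Node(self.left, self.right, not self.rev)  # O(1): no data touched
        def cluster_at(self, i):
            if self.rev:                       # map the index through the reversal
                i = self.count - 1 - i
            if i < len(self.left):
                return self.left.cluster_at(i)
            return self.right.cluster_at(i - len(self.left))

    s = Node(Leaf(["a", "e\u0301"]), Leaf(["ch", "z"]))  # "é" and "ch" as single clusters
    print(len(s), s.cluster_at(1))            # 4 é
    print(s.reversed_view().cluster_at(0))    # z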

Thoughts:

1) Why on earth does any higher-level language still use byte or codepoint counts for length? And why don't lower-level languages at least have a way to count / index by graphemes?

2) I do not like UTF-8 / 16. It's effectively bad Huffman encoding. It's an attempt to save space, but it doesn't even do that well. About the only advantage of UTF-8 is that ASCII maps to it reasonably well. And it has a bunch of disadvantages, chief among them being that if you write a single multibyte character, you potentially have to rewrite the entire string.


> Use UTF-32 everywhere. When space is an issue, just compress it - especially on disk.

> I do not like UTF-8 / 16. It's effectively bad huffman encoding. It's an attempt to save space, but it doesn't even do that well.

UTF-8 + gzip is 32% smaller than UTF-32 + gzip using the HN frontpage as corpus. Even using xz, it's a 13% gain.
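
If you want to reproduce that kind of measurement, a quick Python sketch (the corpus file name is a placeholder; the exact percentages obviously depend on the input):

    import gzip, lzma

    text = open("corpus.html", encoding="utf-8").read()  # e.g. a saved HN frontpage

    for codec in ("utf-8", "utf-32-le"):
        raw = text.encode(codec)
        print(codec, len(raw), len(gzip.compress(raw)), len(lzma.compress(raw)))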

> About the only advantage of UTF-8 is that ASCII maps to it reasonably well.

That's a pretty huge advantage, and a big reason why UTF-8 is actually popular. Another one is that UTF-8, being byte-based, does not care about byte order. UTF-32 is split between BE and LE, and requires either out-of-band byte-order communication or a BOM.
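
The byte-order point in a nutshell (Python; the last line assumes a little-endian machine, since the bare "utf-32" codec writes a BOM plus native order):

    print("A".encode("utf-8"))      # b'A'              -- same bytes everywhere
    print("A".encode("utf-32-le"))  # b'A\x00\x00\x00'
    print("A".encode("utf-32-be"))  # b'\x00\x00\x00A'
    print("A".encode("utf-32"))     # b'\xff\xfe\x00\x00A\x00\x00\x00' -- BOM prepended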

> Why on earth does any higher-level language still use byte or codepoint counts for length?

Because it's easy, and generally O(1) in these languages. It can also be useful to know how much space the string will take when stored, which really is the only useful application of a string length.

> And why don't lower-level languages at least have a way to count / index by graphemes?

Counting graphemes is no more useful than counting bytes or codepoints. You could provide a grapheme cluster count, but:

1. that's O(n), period

2. it serves very little purpose since clusters don't have a fixed width, not even with a fixed-width font

3. clusters can be locale-dependent ("tailored" clusters) although the default set is locale-independent. Now you need to ponder whether you include tailored clusters, don't include them, or optionally include them

4. clusters and glyphs are independent, "ch" is a grapheme cluster in Slovak but two glyphs on-screen, whereas an "fi" ligature is a single glyph but two clusters


> About the only advantage of UTF-8 is that ASCII maps to it reasonably well.

Yep, and it's a HUGE advantage; I think it accounts for much of the success of Unicode adoption.


User-perceived characters are not graphemes, they are grapheme clusters.

You can look at Unicode strings on at least four different levels of abstraction: bytes, code points, code units and grapheme clusters. The only advantage UTF-32 has over the others is that all code units fit into a single code point (at least I think so)

If you want a vector where each user-perceived character and whitespace matches one element of the vector, probably the easiest way is to create a vector where each element is a short Unicode string that matches a grapheme cluster.
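
That vector-of-clusters idea is easy to sketch in Python with the third-party regex module, whose \X pattern matches one default (untailored) grapheme cluster:

    import regex  # pip install regex; the stdlib `re` module has no \X

    s = "cafe\u0301"                        # 'café' with a combining acute accent
    clusters = regex.findall(r"\X", s)      # one element per user-perceived character
    print(len(s), len(clusters), clusters)  # 5 4 ['c', 'a', 'f', 'é']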


> The only advantage UTF-32 has over the others is that all code units fit into a single code point (at least I think so)

All code points can be encoded as a single code unit in UTF-32. Code points are the things like U+0065 LATIN SMALL LETTER E; code units are what you encode code points as in a given Unicode encoding — i.e., octets in UTF-8 and 32-bit integers in UTF-32.
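
A concrete example in Python, measuring one and the same code point in the code units of each encoding (U+1F600 is outside the BMP, hence the surrogate pair in UTF-16):

    cp = "\U0001F600"  # a single code point: U+1F600 GRINNING FACE

    print(len(cp.encode("utf-8")))           # 4 code units (octets)
    print(len(cp.encode("utf-16-le")) // 2)  # 2 code units (a surrogate pair)
    print(len(cp.encode("utf-32-le")) // 4)  # 1 code unit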


Yes. Thank you.

Things that don't necessarily fit into a single UTF-32 code unit: combining character sequences and grapheme clusters.


> You can look at Unicode strings on at least four different levels of abstraction: bytes, code points, code units and grapheme clusters.

There's also glyphs, which is what you get after the final rendering through a font, and which may not map 1:1 to any other level of abstraction.


> 1) Why on earth does any higher-level language still use byte or codepoint counts for length?

For the higher-level languages, I believe both Haskell and Python¹ now return code point lengths when their length function is called on a string.
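
For instance, in Python 3 len() counts code points, regardless of how the string is stored internally:

    s = "\U0001F600\u0301"         # U+1F600 plus a combining acute accent
    print(len(s))                  # 2 -- code points, not bytes or UTF-16 units
    print(len(s.encode("utf-8")))  # 6 -- bytes in the UTF-8 encoding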

> And why don't lower-level languages at least have a way to count / index by graphimes?

Even getting a code point count is difficult in most of those, sadly.

> If you need random access to a string

I really think that random access is not something you greatly need for working with strings, and that most operations are going to scan (linearly) into the string. (For example, splitting on a character requires first finding that character, which is a linear scan that can return an iterator to that position: random indexing is not required.) Sadly, most languages I've worked with, with the exception of C++, do not make great use of the concept of iterators.

¹A recent version of Python 3 is required.


Love Unicode? Then Butts Institute may be for you!

http://butts.institute

But in all seriousness, you may enjoy this thing my friend cooked up.

"With over a million billion codepoints, Unicode offers a vast array of unique characters — perfect for microblogging. [Butts Institute] helps you keep your own personal Unicode character updated, instantly, as often as you like! It's fast, convenient, fun, social, and totally free!"

Just make an options request to:

curl -X OPTIONS http://u.butts.institute

https://gist.githubusercontent.com/shmup/e92dad275bcca9287aa...


Unicode kinda jumped the shark with all the emoji... they might as well encode all frequent words/meanings.

Though I like them. The only emoticon I always miss is a "shrug", something like either "I don't know" or "I don't care".


One of the principles of Unicode is 1:1 round-trip encoding to every other encoding, which is why there are so many pre-composed letters and accents even though the individual letters and accents are available separately.
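
The pre-composed/combining duplication is easy to see with Python's unicodedata module (NFC picks the precomposed form where one exists, NFD splits it apart):

    import unicodedata

    precomposed = "\u00E9"   # é as one code point, U+00E9
    combining   = "e\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

    print(precomposed == combining)                                # False
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == combining)  # True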

All the encoded emoji have been used for decades by Japanese phones; it's certainly not as well-thought-out as the rest of Unicode, but Apple and Google needed to have some common representation of those characters in their Unicode-based software so they could interoperate with Japanese telecommunications infrastructure. Adding emoji to Unicode was the least terrible option.


I've seen Unicode characters for Facebook, Twitter, etc. icons. So far, they've been in user-defined fonts, in user-defined expansion space. But I suspect there will be pressure to put them in the standard.


The original set of emoji used in Japanese phones had ten characters reserved for country flags, but as you can imagine ISO were very much not keen to include one particular set of countries to the exclusion of all the others, or to specify a particular flag design for a particular country. The solution they went with was to add 26 "regional indicator symbol" characters, and if you use those flag-code characters to spell an ISO country-code, rendering software is supposed to look up the current flag of the named country and display it.

If they went to such lengths to avoid adding flags to the spec, imagine how much pushback you'd get for trying to add company logos.
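
For the curious, the mechanism is simple: the 26 regional indicator symbols sit at U+1F1E6..U+1F1FF and map to A..Z, and a pair spelling an ISO 3166 code gets rendered as that country's current flag wherever the font/platform supports it. A small Python sketch (the helper name is mine):

    RIS_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A

    def flag(iso_code):
        """Spell a two-letter ISO 3166 country code with regional indicator symbols."""
        return "".join(chr(RIS_A + ord(c) - ord("A")) for c in iso_code.upper())

    print(flag("FR"))  # 🇫🇷 -- shown as the French flag where supported
    print(flag("CH"))  # 🇨🇭 -- Switzerland; just two symbols if the font has no flag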


Unicode wants to go on for decades. They don't want to become a shrine to obsolete corporate logos.

Private Use is a fine place for logos.


Another missing symbol is the Unicode logo.


Another fun Unicode-related bug in OS X 10.9:

    % printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    Assertion failed: (width > 0), function conv_c, file /SourceCache/shell_cmds/shell_cmds-175/hexdump/conv.c, line 137.
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e 
    zsh: done       printf 'Unicode strike\xcd\x9bs again' | 
    zsh: abort      LANG=en_US.UTF-8 od -tc
I don't know if this is fixed in OS X 10.10 — I filed a bug with Apple a year ago, but it was marked as a duplicate of another bug. The only thing I can see about that other bug is that it's now closed.


Not crashing for me in 10.10

    ~$ printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e    ͛  **
    0000020    s       a   g   a   i   n                                    
    0000027


Looks like it is fixed

    $ printf 'Unicode strike\xcd\x9bs again' | LANG=en_US.UTF-8 od -tc
    0000000    U   n   i   c   o   d   e       s   t   r   i   k   e    ͛
    0000020    s       a   g   a   i   n
    0000027


More details on the Unicode regex problems in JavaScript (slide 62) and how ES6 will solve most of these issues: https://mathiasbynens.be/notes/es6-unicode-regex


To whoever changed the title: it really was supposed to be "I � Unicode", not "I Love Unicode". The "�" symbol is embedded in the document as a raster image, so the author really meant for it to be that; it wasn't just a font rendering issue on your end.


We changed it back.

Edit: Normally we take attention-grabbing Unicode glyphs out of titles since they disrupt the placid bookishness of HN's front page. But this one is so tasteful and content-appropriate that it seems obviously a special case.


Thanks!


Author here. The title was a pun. � is U+FFFD REPLACEMENT CHARACTER which may possibly replace a badly encoded heart.


Is the video of the talk available somewhere?


I'm one of the conf organizers: we were on a budget and didn't record any talks. We'll try to fix that for next year.


Recording a talk only takes a phone these days, no? I know, I know, it won't be a professional recording, but I've had good luck just recording from the front row.


It won't be a recording that you could bear listening to. Even more professional setups/bigger conferences often screw up the audio so that the recording is basically binnable.


Can't you just record the audio via whatever computer the mic is attached to?

Syncing video to audio is always a pain, but I wouldn't mind desync to see it.


Yeah, but for some reason they don't always do this. And even when they do get that part right, they often don't give the people asking questions a mic and don't repeat their questions, so you have no idea what the presenter is responding to.


I agree. A cheap (iPhone) recording is still better than nothing.




