It does not. Your code is using precomposed instead of decomposed chars.
Try:
fn main() {
    let s = String::from("noe\u{0308}l");
    println!("Reversible? {}", s.chars().rev().collect::<String>());
    println!(
        "First three characters? {}",
        s.chars().take(3).collect::<String>()
    );
}
You get l̈eon, which (according to the article) is wrong; likewise "noe" as the first three characters, dropping the diacritic.
1. One code point: U+00EB. This is the "precomposed" form.
2. Two code points: U+0065 U+0308, aka e followed by ¨. This is the "decomposed" form, also known as a "combining sequence" since the diaeresis combines with the base character.
If your string type is a sequence of code points, then reversing a decomposed string will tear the combining sequence, and apply the diaeresis to the wrong character. Most string types are affected by this (or worse), Rust included.
The two forms get rendered identically, so you more or less need a hex editor to figure out which form you've got.
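Short of a hex editor, you can just dump the scalar values. A std-only Rust sketch (the strings here are my own illustration, not from the repo):

```rust
fn main() {
    // Both of these render as "noël", but they are different sequences.
    for s in ["no\u{00EB}l", "noe\u{0308}l"] {
        // Print each Unicode scalar value so the two forms become visible.
        let points: Vec<String> = s
            .chars()
            .map(|c| format!("U+{:04X}", c as u32))
            .collect();
        println!("{} -> {}", s, points.join(" "));
    }
}
```

The precomposed form shows up as U+00EB, the decomposed one as U+0065 U+0308.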
I forked your repo and switched it to decomposed (the diff looks like a noop), and now it produces the wrong output:
One could reasonably conclude that precomposed forms are just better and easier. But they're considered legacy: we can't encode every possible combining sequence into a code point, so we might as well go the other way and decompose whenever possible. That's what Normalization Form D is about.
Much obliged, thank you. I'm adding info about this to the repo, along with a section on precomposed versus decomposed characters. Now I do see the problem that you and the article's author are describing.
The whole problem seems to start with a clear definition of what a string is. Java (IIRC) defines it as a sequence of UTF-16 code units, and its behavior follows from that. It's even "correct" in the sense of correct according to that definition.
So what would the right definition of a string be that implies everything the author would consider "correct" behaviour?
I've actually run into the same question before and so far my conclusion wasn't even that the implementation of strings is wrong on most platforms, but rather that "string" as a concept just doesn't make sense. Including simple things like the length of a string or concatenation.
If somebody knows a useful definition, I'd be interested ;)
Mortoray, I'm inclined to agree with you. Modern strings/Unicode have completely conflated quite different things: storage (bytes), glyphs, typesetting, and language. So we're mixing single characters with multiple characters, with character construction, with non-character emojis.
We are almost getting away with this confusion by brute force ("check out the CPU power of my phone"). We're probably stuck with it for interweb purposes, but I'd like to see a nice clean alternative, or a cleaned-up version of Unicode ... something we could use in small embedded/IoT-scale systems.
Reversing strings is such an odd thing to focus on. I have written dozens of string reversers for school, for interviews, and while following books, but I would be hard pressed to come up with any real case where I needed to reverse a string. And when I did, it was because I was abusing a string to store data that was not really text, at which point I really didn't care whether it correctly reversed Unicode code points.
It's not about reversing, but about checking whether reverse iteration works. This is required to perform proper truncation (for limited length fields) as well as visual editing for a cursor to go through the text.
It can additionally be an indication as to whether regular expression matching works, though that is usually handled by a library, and not the core string types.
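On the truncation point: even a byte-limited field needs care. Here is a minimal std-only Rust sketch (the function name is my own) that backs a cut point up to a scalar-value boundary. Note it is only boundary-safe, not grapheme-safe; respecting grapheme clusters would need a segmentation library such as the unicode-segmentation crate.

```rust
// Truncate to at most `max_bytes` without slicing through a UTF-8
// encoded scalar value. It can still strip a combining mark from its
// base character, because it knows nothing about grapheme clusters.
fn truncate_to_boundary(s: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(s.len());
    // Walk back until we sit on a valid char boundary.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let s = "noe\u{0308}l"; // decomposed "noël", 6 bytes
    // Doesn't panic mid-scalar, but the diaeresis is still dropped:
    println!("{}", truncate_to_boundary(s, 4)); // prints "noe"
}
```

Slicing `&s[..4]` directly would panic here, since byte 4 falls inside the two-byte encoding of U+0308.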
My contention was always that if a "string" does not reverse text, then it's no better than an "array<char>". The existence of a "string" type implies it should do more.
It's all the more meaningless given how broken the naive algorithm is... But shouldn't it at least use those reversed letters Unicode has somewhere? Or is it just a matter of putting the right-to-left mark at its start?
I would be very surprised if rev | cut | rev gave you the correct answer to anything other than "drop the last X lines or words" on a non-Western alphabet.
And even for "drop the last X lines or words", you would do better with a specialized command.
You don't ever work with directories? Fortunately there is a pretty well-defined separator in my directory hierarchies, so that is pretty much always correct ;)
(although if you are taking arbitrary user data I suppose there is nothing stopping you doing something stupid like putting an escaped '/' in a filename, which would break this.)
Any Swift developers who know how it fares on these things? I've heard it does much better, largely by virtue of the Character type being an extended grapheme cluster, but does it handle all these cases right?
I recently did a little survey of string types across programming languages, and found that they display an immense variety of designs for a datatype so fundamental.
Swift stakes out one of the most interesting points in the design space by completely encapsulating the internal representation and requiring all accessors to be explicit in whether they operate on the string's encoding (e.g. as UTF-8), its Unicode code points, or its grapheme clusters. All positions within the string are represented using abstract iterators that don't reveal their numeric quantities.
Diametrically opposite in the design space is Go (close to C and C++), in which a string is simply an array of bytes, conventionally treated as UTF-8. To access code points requires explicit decoding (or a range loop), and to access grapheme clusters requires a text package.
Ruby is another interesting choice: a string is a pair of a (mutable!) byte array and a value indicating how to decode those bytes into code points. This representation works well when you have to deal with strings in obscure encodings, since it preserves the original bytes; for example, you can read file names from a directory and then delete those files.
There are a great many other designs in wide use (Haskell, Java, and Python being the most unnecessarily complicated), but these three seemed the most defensible to me.
Raku also has a specific string type (Str) that is not just a bag of bytes. Strings are normalized (NFC) when they are created, and multi-codepoint graphemes are represented internally using synthetic codepoints (negative integers), but this is all opaque to the user. This also means that string indexing in Raku is O(1).
Raku passes all the tests. The article's author mentions he didn't find any language that upper-cases "baffle" correctly, but Raku does.
Here's my REPL test
> my $n = "noe\x[0308]l"
noël
> $n.chars
4
> $n.flip
lëon
> $n.substr(0, 3)
noë
> my $b = "baffle"
baffle
> $b.substr(2, 1).uniname
LATIN SMALL LIGATURE FFL
> $b.uc
BAFFLE
> "noe\x[0308]l" eq "no\x[00EB]l"
True
HN doesn't handle the cat emojis, so they are not included, but they worked fine.
Technically it is MoarVM that handles graphemes as synthetic codepoints.
We call it NFG for Normalized Form Grapheme.
The JVM and JavaScript backends currently use the built-in string handling of those platforms, which means they are just as broken as the rest of the code running on them.
---
Of further note, you can use ignorecase in regexes to match "baffle".
> Swift stakes out one of the most interesting points in the design space by completely encapsulating the internal representation and requiring all accessors to be explicit in whether they operate on the string's encoding (e.g. as UTF-8), its Unicode code points, or its grapheme clusters. All positions within the string are represented using abstract iterators that don't reveal their numeric quantities.
This is probably the only right way because 1. it admits multiple implementations (like packing into tagged pointers) and 2. a string isn't simply an array of anything. Strings are made of all of bytes/codepoints/grapheme clusters/words/lines/etc at different levels.
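Those levels are easy to see even in Rust's std API (a sketch with my own example string; the grapheme level isn't in std at all and would need a crate like unicode-segmentation):

```rust
fn main() {
    let s = "noe\u{0308}l"; // decomposed "noël"
    println!("bytes:   {}", s.len());           // 6: UTF-8 code units
    println!("scalars: {}", s.chars().count()); // 5: Unicode scalar values
    // Grapheme clusters: 4, but std offers no accessor at that level;
    // unicode-segmentation's graphemes(true) would provide it.
}
```

Three different "lengths" for one short word, which is exactly why "the length of a string" is an underspecified question.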
Haskell just uses a list of Unicode code points, i.e., type String = [Char]. Not a good choice, but not a particularly complicated one either.
What does make it a bit complicated is that there are about four other string types to choose between besides the default String type, which should never be used for anything performance-sensitive.
Now I am curious to look at Swift's strings. I am a big fan of explicit encodings. I have long believed that an ideal text API would not use string as a concrete data type. Instead, encodings themselves (UTF-8 et al.) should be concrete types and string should be nothing more than an abstraction over them for polymorphism.
It passes all of the things I know the answers to off the top of my head, but I'd have to check a few of the others (e.g. the "ffl" ligature).
Though I would bet this mainly has to do with being a language born in the age of the web. It seems like something that would be very hard to back-port (as opposed to designing for it from the beginning), but proper handling of Unicode is practically table-stakes for a new language in today's world. I would guess (I don't know) that Swift, Go, etc also do a good job with this stuff.
.chars() iterates over Unicode scalar values, not actual grapheme clusters, so this code would also be incorrect in many circumstances. Ultimately it is incumbent on the developer to understand their actual Unicode-processing requirements.
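To illustrate the gap, here's a toy reversal (entirely my own sketch, std only) that keeps just one block of combining marks, U+0300 through U+036F, attached to the preceding base character. Real code should use a grapheme-segmentation library such as the unicode-segmentation crate; this handles only enough to fix the noël case.

```rust
// Toy grapheme-ish reversal: glue combining diacritical marks
// (U+0300..=U+036F only) to their base character before reversing.
// Many other cluster rules (emoji ZWJ sequences, Hangul jamo, other
// combining blocks) are deliberately ignored here.
fn reverse_keeping_marks(s: &str) -> String {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        if ('\u{0300}'..='\u{036F}').contains(&c) {
            if let Some(last) = clusters.last_mut() {
                last.push(c); // attach the mark to the previous cluster
                continue;
            }
        }
        clusters.push(c.to_string());
    }
    clusters.reverse();
    clusters.concat()
}

fn main() {
    // Decomposed "noël" now reverses to "lëon" instead of "l̈eon".
    println!("{}", reverse_keeping_marks("noe\u{0308}l"));
}
```

Even this small fix doubles the code size of a naive reverse, which is a decent argument for leaving segmentation to a library.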
It looks like it still misses some of the cases (specifically with decomposed vs precomposed).
As a side note, it looks like the playground editor also doesn't correctly render some of the lengths (the cursor is 1 character off on the decomposed case)
Huh, I'd never heard of decomposed/precomposed before and I kind of skimmed over them in the original article. Looking at your example, I still can't figure out what the difference is between the two (I guess the Playground editor itself just can't represent the difference?). I'll have to go read up on that concept.
Edit: Having read about them, I guess I assumed that if my keyboard inputs an ë it's actually inputting those two underlying characters. Or to put it differently, I assumed there weren't two separate representations for the same character; that the "decomposed" and "precomposed" versions are one and the same at the byte level. Learned something h̶o̶r̶r̶i̶f̶y̶i̶n̶g̶ new!
Table stakes are a required minimum bet in a game (often poker) - the meaning being if you can’t provide at least this you can’t play. So if a new language doesn’t support Unicode well it won’t get any adoption.