The string type is broken (2013) (mortoray.com)
19 points by leephillips on June 23, 2021 | hide | past | favorite | 42 comments


Rust 1.53 seems to handle these correctly - feedback welcome.

    let s = String::from("noël");
    println!("Printable? {}", s);
    println!("Countable? {}", s.chars().count());
    println!("Reversable? {}", s.chars().rev().collect::<String>());
    println!("First three characters? {}", s.chars().take(3).collect::<String>());
Output:

    Printable? noël
    Countable? 4
    Reversable? lëon
    First three characters? noë
Repo for the "noël" example and "cats" example: https://github.com/joelparkerhenderson/demo-rust-string-issu...


It does not. Your code is using precomposed instead of decomposed chars.

Try:

    let s = String::from("noe\u{0308}l");
    println!("Reversable? {}", s.chars().rev().collect::<String>());
    println!(
        "First three characters? {}",
        s.chars().take(3).collect::<String>()
    );
You get "l̈eon", which (according to the article) is wrong; likewise "noe" as the first three chars, dropping the diacritic.
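For reference, the difference between the two forms is visible with plain std Rust. A minimal sketch, using explicit escapes so the forms don't depend on editor normalization:

```rust
fn main() {
    let precomposed = "no\u{00EB}l"; // ë as one scalar value, U+00EB
    let decomposed = "noe\u{0308}l"; // e followed by combining diaeresis
    // The two strings render identically but are different scalar sequences:
    println!("{}", precomposed.chars().count()); // 4
    println!("{}", decomposed.chars().count());  // 5
    println!("{}", precomposed == decomposed);   // false (byte-wise comparison)
}
```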


Although it's not in the standard library, this can be handled properly using the unicode-segmentation crate: https://crates.io/crates/unicode-segmentation


It's a bit disappointing that a relatively new language like Rust doesn't handle these correctly.

There's no value in a String type if it doesn't behave correctly for text. One can simply use an "Array<char>" to convey proper semantic meaning.


Can you explain more?

I updated my post with more info and a link to a code repo.

The code correctly shows me the reverse "lëon" with the umlaut, and the first three characters "noë" with the umlaut.


ë may be represented in two ways:

1. One code point: U+00EB. This is the "precomposed" form.

2. Two code points: U+0065 U+0308, aka e followed by ¨. This is the "decomposed" form, also known as a "combining sequence" since the diaeresis combines with the base character.

If your string type is a sequence of code points, then reversing a decomposed string will tear the combining sequence, and apply the diaeresis to the wrong character. Most string types are affected by this (or worse), Rust included.
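A toy illustration of why keeping the combining sequence intact fixes the reversal. This sketch only recognizes the combining diacritical marks block (U+0300–U+036F), a deliberate simplification; real segmentation should follow UAX #29 grapheme clusters:

```rust
// Toy sketch: reverse a string by (base char + trailing combining marks)
// groups. Only handles combining marks in U+0300..=U+036F; a real
// implementation should use UAX #29 extended grapheme clusters.
fn is_combining(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

fn reverse_clusters(s: &str) -> String {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining(c) && !clusters.is_empty() {
            // Attach the mark to the preceding base character.
            clusters.last_mut().unwrap().push(c);
        } else {
            clusters.push(c.to_string());
        }
    }
    clusters.iter().rev().flat_map(|c| c.chars()).collect()
}

fn main() {
    let s = "noe\u{0308}l"; // decomposed form
    println!("{}", reverse_clusters(s)); // lëon — e and U+0308 stay together
}
```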

The two forms get rendered identically, so you more or less need a hex editor to figure out which form you've got. I forked your repo and switched it to decomposed (the diff looks like a noop), and now it produces the wrong output:

https://github.com/ridiculousfish/demo-rust-string-issues

Another Rust example using explicit Unicode literals: https://play.rust-lang.org/?version=stable&mode=debug&editio...

One could reasonably conclude that precomposed forms are just better and easier. But they're considered legacy: we can't encode every possible combining sequence into a code point, so we might as well go the other way and decompose whenever possible. That's what Normalization Form D is about.


Much obliged, thank you. I'm adding info about this to the repo, and also adding precomposed characters versus decomposed characters. Now I do see the problem you're describing and the article author is describing.



The whole problem seems to start with a clear definition of what a string is. Java (IIRC) defines it as a list of code points, and its behavior follows from that. It's even "correct" in the sense of correct according to that definition.

So what would the right definition of a string be that implies everything the author would consider "correct" behaviour?

I've actually run into the same question before and so far my conclusion wasn't even that the implementation of strings is wrong on most platforms, but rather that "string" as a concept just doesn't make sense. Including simple things like the length of a string or concatenation.

If somebody knows a useful definition, I'd be interested ;)


The author of the article appears to think "a sequence of graphemes" is the right abstraction.


Mortoray, I'm inclined to agree with you. Modern strings/Unicode have conflated quite different things: storage (bytes), glyphs, typesetting, and language. So we're mixing single characters with multiple characters, with character construction, with non-character emojis.

We are almost getting away with this confusion by using brute force - "check out the CPU power of my phone". We're probably stuck with it for interweb purposes, but I'd like to see a nice clean alternative, or a cleaned-up version of Unicode ... something we could use in small embedded/IoT scale systems.


Reversing strings is such an odd thing to focus on. I have written dozens of implementations of string reversers for school and interviews and following books, but I would be hard pressed to come up with any real case where I needed to reverse a string. And then, it was because I was abusing a string to store some data that was not really text, at which point I really didn't care that it textually supported reversing Unicode code points.


It's not about reversing, but about checking whether reverse iteration works. This is required to perform proper truncation (for limited length fields) as well as visual editing for a cursor to go through the text.
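The truncation point is easy to see in Rust, where naive byte indexing into a string isn't even allowed to split a character. A small sketch:

```rust
fn main() {
    let s = "no\u{00EB}l"; // precomposed ë; 5 bytes in UTF-8
    // Byte slicing can panic mid-character:
    // let t = &s[0..3]; // would panic: byte 3 falls inside 'ë'
    assert!(!s.is_char_boundary(3));
    // Truncating by scalar values avoids the panic (though grapheme-aware
    // truncation would still be needed for decomposed text):
    let t: String = s.chars().take(3).collect();
    println!("{}", t); // noë
}
```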

It can additionally be an indication as to whether regular expression matching works, though that is usually handled by a library, and not the core string types.

My contention was always that if a "string" does not reverse text, then it's no better than an "array<char>". The existence of a "string" type implies it should do more.


What is even the meaning of reversing a string?

The more broken a naive algorithm is, the more meaningless the operation seems... But shouldn't it at least use those reversed letters Unicode has somewhere? Or is it just a matter of putting the right-to-left mark at its start?


rev | cut | rev

is a pretty common idiom in the shell when you want to keep "all but the last N chunks" from some lines of input.

At least for me. I'm sure there's probably a better way, and performance of this is definitely notably poor.

And of course it's not like I'm actually writing the utility. ;)


I would be very surprised if rev | cut | rev gave you the correct answer to anything that isn't "drop the last X lines or words" on a non western alphabet.

And for "drop the last X lines or words", you would do better with some specialized command.


you don't ever work with directories? fortunately there is a pretty defined separator in my directory hierarchies, so that is pretty much always correct ;)

(although if you are taking arbitrary user data I suppose there is nothing stopping you doing something stupid like putting an escaped '/' in a filename, which would break this.)


Any Swift developers who know how it fares on these things? I've heard it does much better, largely by virtue of the Character type being an extended grapheme cluster, but does it handle all these cases right?


I recently did a little survey of string types across programming languages, and found that they display an immense variety of designs for a datatype so fundamental.

Swift stakes out one of the most interesting points in the design space by completely encapsulating the internal representation and requiring all accessors to be explicit in whether they operate on the string's encoding (e.g. as UTF-8), its Unicode code points, or its grapheme clusters. All positions within the string are represented using abstract iterators that don't reveal their numeric quantities.

Diametrically opposite in the design space is Go (close to C and C++), in which a string is simply an array of bytes, conventionally treated as UTF-8. To access code points requires explicit decoding (or a range loop), and to access grapheme clusters requires a text package.

Ruby is another interesting choice: a string is a pair of a (mutable!) byte array and a value indicating how to decode those bytes into code points. This representation works well when you have to deal with strings in obscure encodings, as it preserves the original bytes, so, for example, you can read file names from a directory then delete those files.

There are a great many other designs in wide use (Haskell, Java, and Python being the most unnecessarily complicated), but these three designs seemed the most defensible to me.


Raku also has a specific string type (Str) that is not just a bag of bytes. Strings are normalized (NFC) when they are created, and multi-codepoint graphemes are represented internally using synthetic codepoints (negative integers), but this is all opaque to the user. This also means that string indexing in Raku is O(1).

Raku passes all the tests. The article's author mentions he didn't find any language that upper-cases "baffle" correctly, but Raku does.

Here's my REPL test

    > my $n = "noe\x[0308]l"
    noël
    > $n.chars
    4
    > $n.flip
    lëon
    > $n.substr(0, 3)
    noë
    > my $b = "baffle"
    baffle
    > $b.substr(2, 1).uniname
    LATIN SMALL LIGATURE FFL
    > $b.uc
    BAFFLE
    > "noe\x[0308]l" eq "no\x[00EB]l"
    True
HN doesn't handle the cat emojis, so they are not included, but they worked fine.


Technically it is MoarVM that handles graphemes as synthetic codepoints.

We call it NFG for Normalized Form Grapheme.

The JVM and JavaScript backends currently use the built-in string handling features. Which means that they are just as broken as the rest of the code running on them.

---

Of further note, you can use ignorecase in regexes to match "baffle".

    "baffle" ~~ /:ignorecase  BAFfLE  /;


> Swift stakes out one of the most interesting points in the design space by completely encapsulating the internal representation and requiring all accessors to be explicit in whether they operate on the string's encoding (e.g. as UTF-8), its Unicode code points, or its grapheme clusters. All positions within the string are represented using abstract iterators that don't reveal their numeric quantities.

This is probably the only right way because 1. it admits multiple implementations (like packing into tagged pointers) and 2. a string isn't simply an array of anything. Strings are made of all of bytes/codepoints/grapheme clusters/words/lines/etc at different levels.


If my survey revealed anything, it's that there is no one right way.


Haskell just uses a list of Unicode code points, i.e., type String = [Char]. Not a good choice, but not a particularly complicated one either.

What does make it a bit complicated is that there are about four other different string types to choose between, besides the default String type, which should never be used for anything performance-sensitive.


Now I am curious to look at Swift's strings. I am a big fan of explicit encodings. I have long believed that an ideal text API would not use string as a concrete data type. Instead, encodings themselves (UTF-8 et al.) should be concrete types and string should be nothing more than an abstraction over them for polymorphism.


Java's MUTF-8 makes me want to eject it into the Sun. Makes interop with JNI extra fun.


It passes all of the things I know the answers to off the top of my head, but I'd have to check a few of the others (e.g. the "ffl" ligature).


(Addendum: "baffle".uppercased() does correctly return "BAFFLE".)


Rust's string type doesn't have the mentioned problems: https://play.rust-lang.org/?version=stable&mode=debug&editio...

Though I would bet this mainly has to do with being a language born in the age of the web. It seems like something that would be very hard to back-port (as opposed to designing for it from the beginning), but proper handling of Unicode is practically table-stakes for a new language in today's world. I would guess (I don't know) that Swift, Go, etc also do a good job with this stuff.


It does for multi-codepoint characters (such as the rainbow flag): https://play.rust-lang.org/?version=stable&mode=debug&editio...

Elixir gets it right: https://replit.com/@natanbc/reverse


Yeah, another commenter mentioned grapheme-clusters, though apparently there's a readily available crate that adds support: https://stackoverflow.com/questions/58770462/how-to-iterate-...

According to this, they added it to the standard library at one point but then broke it back out because the lookup tables were too large.



.chars() iterates over Unicode scalar values, not actual grapheme clusters, so this code would also be incorrect in many circumstances. Ultimately it is incumbent on the developer to understand what their actual Unicode-processing requirements are.
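The rainbow flag mentioned elsewhere in the thread makes the gap concrete: it renders as a single character but is built from four scalar values, so `.chars()` sees four items. A quick check:

```rust
fn main() {
    // 🏳️‍🌈 = white flag + variation selector + zero-width joiner + rainbow
    let flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
    println!("{}", flag.chars().count()); // 4 scalar values, 1 grapheme cluster
}
```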


I'm not familiar with what exactly falls into each category, though apparently there's an (official?) crate which extends the core str type to support those as well: https://stackoverflow.com/questions/58770462/how-to-iterate-...


This isn't quite set up the way the article described. Here it is set up correctly:

https://play.rust-lang.org/?version=stable&mode=debug&editio...

It looks like it still misses some of the cases (specifically with decomposed vs precomposed).

As a side note, it looks like the playground editor also doesn't correctly render some of the lengths (the cursor is 1 character off on the decomposed case)


Huh, I'd never heard of decomposed/precomposed before and I kind of skimmed over them in the original article. Looking at your example, I still can't figure out what the difference is between the two (I guess the Playground editor itself just can't represent the difference?). I'll have to go read up on that concept.

Edit: Having read about them, I guess I assumed that if my keyboard inputs an ë it's actually inputting those two underlying characters. Or to put it differently, I assumed there weren't two separate representations for the same character; that the "decomposed" and "precomposed" versions are one and the same at the byte level. Learned something h̶o̶r̶r̶i̶f̶y̶i̶n̶g̶ new!


After a little bit of research, it looks like this is how you would get the expected outputs: https://play.rust-lang.org/?version=stable&mode=debug&editio...


What does "table stakes" mean? Never heard that before.


"Baseline expectation". I think it comes from poker or something


Table stakes are a required minimum bet in a game (often poker) - the meaning being if you can’t provide at least this you can’t play. So if a new language doesn’t support Unicode well it won’t get any adoption.


It's a gambling expression meaning the minimum [money] you need to even sit at the table. To be a really competitive player you need more.


By the way, Elixir passes all the tests.



