
https://news.ycombinator.com/item?id=27086928

These various ways to encode Unicode have quite a lot to do with bytes being 8 bits in size!



But Unicode itself doesn't!

Anyway, it doesn't make much sense to define the size of a "byte" as anything other than 8 bits, because that's the smallest addressable memory unit. If you need a 32-bit data type, just use one!


My very point is that we should have increased the size of the smallest addressable memory unit from 8 to 32 bits, increasing it once again, since previous computer architectures used anywhere from 4 to 7 bits per byte. (There might still be e-mail servers around that are directly compatible with "non-padded" 7-bit ASCII?)



But why? Just so that we need to do more bit twiddling and waste memory?

Again, bytes are not foremost about text. We have to deal with all sorts of data, much of which is shorter than 32 bits.

You can always pick a larger data type for your type of work, but not the opposite.
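
To make the memory point concrete, here's a rough Python sketch (the sensor-readings example is mine, just an illustration; array('I') is typically 4 bytes per element):

    from array import array

    # a million values that each fit comfortably in 8 bits
    readings = [i % 256 for i in range(1_000_000)]

    as_bytes = array('B', readings)   # one 8-bit byte per value
    as_words = array('I', readings)   # 32-bit units (typically 4 bytes each)

    print(as_bytes.itemsize * len(as_bytes))  # 1,000,000 bytes
    print(as_words.itemsize * len(as_words))  # ~4,000,000 bytes if the smallest unit were 32 bits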


Because these days it's critical for "basic computer literacy":

https://news.ycombinator.com/item?id=27094663

https://news.ycombinator.com/item?id=27104860

(You'll also notice that caring about not wasting the 8th bit with ASCII has led us into all sorts of issues... and why care so much about it when, as soon as data density becomes important, we can use compression, which AFAIK easily rids us of the padding?)
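
For what it's worth, a quick zlib sketch of that compression point in Python; the exact numbers depend on the text:

    import zlib

    text = "The quick brown fox jumps over the lazy dog. " * 1000

    utf8  = text.encode("utf-8")      # 1 byte per ASCII character
    utf32 = text.encode("utf-32-le")  # 4 bytes per character, mostly zero padding

    print(len(utf8), len(utf32))                                # raw: roughly 1:4
    print(len(zlib.compress(utf8)), len(zlib.compress(utf32)))  # compressed: the gap shrinks a lot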


You're basically arguing against variable-width text encodings, which is ok. But you know, it's entirely possible to use UTF-32. In fact, some programming languages use it by default to represent strings.
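
Python is an example of the code-point-oriented approach: its strings behave like arrays of code points (internally CPython picks 1, 2 or 4 bytes per code point for each string), so indexing is by code point rather than by byte. A small sketch:

    s = "🎉x"

    print(len(s))                  # 2 code points
    print(s[0])                    # '🎉' -- indexing by code point, not by byte
    print(len(s.encode("utf-8")))  # 5 bytes in the UTF-8 serialization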

But again and again, all of this has nothing to do with the size of a byte.

BTW, are you aware that 8-bit microcontrollers are still in widespread use and nowhere near being discontinued?


Fixed-width text encoding + Unicode = you cannot fit a "character" in a single octet, which is currently the default addressable unit of storage/memory.
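
A tiny illustration (sample characters of my choosing):

    for ch in "A", "é", "€", "你":
        print(ch, ord(ch))

    # A 65       fits in one octet
    # é 233      still fits
    # € 8364     doesn't fit in 8 bits
    # 你 20320    doesn't fit in 8 bits either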

Programming microcontrollers isn't considered "mandatory computer literacy" in college, while basic scripting, which involves understanding how text is encoded at the storage/memory level, is.


Again and again and again, a byte is not meant to hold a text character. Also, as the sibling comment has pointed out, fixed-width encoding only gets you so far because it doesn't help with grapheme clusters. That's probably why the world has basically settled on UTF-8: it saves memory and destroys any notion that every abstract text character can somehow be represented by a single number.
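
To put one number on the grapheme cluster point (a ZWJ emoji sequence; any such cluster would do):

    family = "👨\u200d👩\u200d👧"   # man + ZWJ + woman + ZWJ + girl, rendered as one "character"

    print(len(family))                      # 5 code points
    print(len(family.encode("utf-8")))      # 18 bytes
    print(len(family.encode("utf-32-le")))  # 20 bytes -- fixed width still isn't "one unit per character"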

> mandatory computer literacy

I don't understand why you keep bringing up this phrase while ignoring a huge part of real-world computing. College students should simply learn how Unicode works. Are you seriously demanding that CPU designers change their chip designs instead?


UTF-32 is a variable-length encoding if we consider combining characters.
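
E.g. the same visible "é" can be one or two UTF-32 code units depending on normalization:

    import unicodedata

    nfc = "é"                                # U+00E9, precomposed
    nfd = unicodedata.normalize("NFD", nfc)  # U+0065 + U+0301, e + combining acute accent

    print(len(nfc.encode("utf-32-le")))  # 4 bytes
    print(len(nfd.encode("utf-32-le")))  # 8 bytes for the same visible character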


Generally, I think you are conflating/confusing the concept of a "byte" (= the smallest unit of memory) with the concept of a "character", or rather a "code unit" (= the smallest unit of a text encoding). The size of the former depends on the CPU architecture, and on modern systems it's always 8 bits. The size of the latter depends on the specific text encoding.
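
A quick way to see the difference (Python, arbitrary sample string):

    s = "héllo"

    print(len(s))                      # 5 code points
    print(len(s.encode("utf-8")))      # 6 bytes:  1-byte code units, é needs two of them
    print(len(s.encode("utf-16-le")))  # 10 bytes: 2-byte code units
    print(len(s.encode("utf-32-le")))  # 20 bytes: 4-byte code units

In every case the underlying storage is still addressed in 8-bit bytes; only the code unit size changes.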



