
We really should have moved to 32-bit bytes when moving to 64-bit words. Would have simplified Unicode considerably.


Not really. Unicode is a variable width abstract encoding; a single character can be made up of multiple code points.

For Unicode, 32-bit bytes would make for an incredibly wasteful in-memory encoding.


> Unicode is a variable width abstract encoding;

To be a bit more explicit: Unicode is a character encoding, to 20-and-a-half-bit 'bytes', that is variable-width in those 'bytes', even before considering how the 'bytes' are encoded to actual bytes. E.g. "ψ̊" (Greek small psi with ring above) is U+03C8 U+030A (two 'bytes').
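To see it concretely, here's a quick Python 3 sketch (Python's str is indexed by code point, so it lines up with those 'bytes'):

    import unicodedata

    s = "\u03c8\u030a"  # GREEK SMALL LETTER PSI + COMBINING RING ABOVE
    print(len(s))       # 2 -- two code points, even though it renders as one character
    for cp in s:
        print(f"U+{ord(cp):04X}", unicodedata.name(cp))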


Unicode is not a text encoding. UTF8, UTF16, UTF32, etc. are text encodings.

> a single character can be made up of multiple code points.

It's really the other way round...


Or maybe you meant to say: A single abstract character (or code point) can be made up of multiple code units.

Unfortunately, the term “character” alone is ambiguous because depending on the context it can refer to either code points or code units.


To be technical, by "character" I mean "user-perceived character" or (in Unicode speak) "extended grapheme cluster". This is the thing a user will think of as one character when looking at it on their screen.

A code point is the atomic unit of the abstract Unicode encoding. By "abstract" I mean it is not an actual text encoding you can write to a file.

A code unit is the atomic unit of an actual text encoding, such as UTF-8, UTF-16LE or UTF-32LE (and their BE equivalents).

---

So to put it together a "user-perceived character" is made up of one or more "code points". When implemented in an application, each "code point" is encoded using one or more "code units".
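A rough Python 3 sketch of the three layers, using an accented letter as the user-perceived character (the byte counts are only for the specific encodings named):

    s = "e\u0301"                      # one user-perceived character: e + COMBINING ACUTE ACCENT
    print(len(s))                      # 2 code points
    print(len(s.encode("utf-8")))      # 3 bytes = 3 UTF-8 code units
    print(len(s.encode("utf-16-le")))  # 4 bytes = 2 UTF-16 code units
    print(len(s.encode("utf-32-le")))  # 8 bytes = 2 UTF-32 code units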


Now that's a good summary! When talking about Unicode, some extra clarity can never hurt.


One byte = one "character" makes for much easier programming.

Text generally uses a small fraction of memory and storage these days.


> One byte = one "character" makes for much easier programming.

Only if you are naively operating in the Anglosphere / world where the most complex thing you have to handle is larger character sets. In reality, there's ligatures, diacritics, combining characters, RTL, nbsp, locales, and emoji (with skin tones!). Not to mention legacy encoding.

And no, it does not use a "small fraction of memory and storage" in a huge range of applications, to the point where some regions have transcoding proxies still.


This is not about covering ALL of Unicode. This is about starting to cover Unicode.

"Anglosphere" would be just 7(&"8") bit ASCII, and it's the current situation where it takes quite a lot of skill and knowledge just to start learning how to properly deal with Unicode, because it's often not even taught !

IMHO 32-bit bytes would help tremendously with onboarding developers into Unicode, because it would force dumping ASCII-only as the starting point (and sadly, often ending point) for teaching how to deal with text.

And who can blame the teachers? Unicode is already hard enough without the added difficulty of having to explain its multi-byte representation...

Last but not least: this would have forced standardization between the Unix world, now on UTF-8, and the Windows world, which is still stuck on UTF-16 (and Windows-1252 ?!?) for some of the core functions like filenames, which still regularly results in issues when working with files that have non-ASCII names.
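To illustrate the kind of mismatch I mean, with a hypothetical filename (Python 3, just showing the mechanism):

    name = "café.txt"
    raw = name.encode("utf-8")         # the bytes a UTF-8 locale would hand to the filesystem
    print(raw.decode("windows-1252"))  # 'cafÃ©.txt' -- classic mojibake when the other side
                                       # assumes a legacy code page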


Not all user-perceived characters can be represented as a single Unicode codepoint. Hence, Unicode text encodings (almost[1]) always have to be treated as variable length, even UTF-32.

[1] at runtime, you could dynamically assign 'virtual' codepoints to grapheme clusters and get a fixed-length encoding for strings that way
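A toy Python 3 sketch of that idea; cluster detection here is a naive approximation based on combining marks only (a real implementation would follow the UAX #29 segmentation rules, e.g. via the third-party regex module's \X):

    import unicodedata

    def clusters(s):
        # naive: a non-combining code point starts a new cluster
        cur = ""
        for cp in s:
            if cur and unicodedata.combining(cp) == 0:
                yield cur
                cur = cp
            else:
                cur += cp
        if cur:
            yield cur

    table = {}  # grapheme cluster -> 'virtual' codepoint
    def intern(s):
        return [table.setdefault(c, len(table)) for c in clusters(s)]

    print(intern("\u03c8\u030ax"))  # [0, 1] -- one fixed-width slot per user-perceived character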


Even the individual Unicode code points themselves are variable width if we consider that things like CJK and emoji take up more than one monospace cell.
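At least that part you can query from the Unicode data itself, via the East Asian Width property (UAX #11); actual cell width still depends on the terminal and font. A quick Python 3 peek:

    import unicodedata

    for ch in "a字":
        print(ch, unicodedata.east_asian_width(ch))
    # a Na   (narrow)
    # 字 W   (wide -- typically rendered as 2 monospace cells)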


Every time I see one of these threads, my gratitude to only do backend grows. Human behavior is too complex, let the webdevs handle UI, and human languages are too complex, not sure what speciality handles that. Give me out of order packets and parsing code that skips a character if the packet length lines up just so any day.

I am thankful that almost all the Unicode text I see is rendered properly now, farewell the little boxes. Good job lots of people.


I think we really have the iPhone jailbreakers to thank for that. U.S. developers were allergic to, almost offended by, anything that wasn't ASCII, and then someone released an app that unlocked the emoji icons that Apple had originally intended only for Japan. Emoji are defined in the astral planes, so almost nothing at the time was capable of understanding them, yet they were so irresistible that developers worldwide who would otherwise have done nothing to address their cultural biases immediately fixed everything overnight to have them. So thanks to cartoons, we now have a more inclusive world.


I'm pretty sure Unicode was already widespread before the iPhone/emoji popularity.


There's supporting Unicode, and 'supporting' Unicode. If you're only dealing with western languages, it's easy to fall into the trap of only 'supporting' Unicode. Proper emoji handling will put things like grapheme clusters and zero-width joiners on your map.
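Concretely, in Python 3 (the family emoji is a ZWJ sequence; whether it renders as a single glyph depends on the font):

    fam = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # MAN + ZWJ + WOMAN + ZWJ + GIRL
    print(fam)                       # one glyph with a capable font
    print(len(fam))                  # 5 code points
    print(len(fam.encode("utf-8")))  # 18 bytes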


You know, bytes are not only about text, they are also used to represent binary data...

Not to mention that bytes have nothing to do with Unicode. Unicode codepoints can be encoded in many different ways: UTF8, UTF16, UTF32, etc.


https://news.ycombinator.com/item?id=27086928

These various ways to encode Unicode have quite a lot to do with bytes being 8 bits in size!


But Unicode itself doesn't!

Anyway, it doesn't make much sense to define the size of a “byte” as anything other than 8 bits, because that's the smallest addressable memory unit. If you need a 32-bit data type, just use one!


My very point is that we should have increased the size of the smallest addressable memory unit from 8 to 32 bits, increasing it once again, as previous computer architectures used anywhere from 4 to 7 bits per byte. (There might still be e-mail servers around that are directly compatible with "non-padded" 7-bit ASCII?)



But why? Just so that we need to do more bit twiddling and waste memory?

Again, bytes are not foremost about text. We have to deal with all sorts of data, much of which is narrower than 32 bits.

You can always pick a larger data type for your type of work, but not the opposite.


Because these days it's critical for "basic computer literacy":

https://news.ycombinator.com/item?id=27094663

https://news.ycombinator.com/item?id=27104860

(You'll also notice that caring about not wasting the 8th bit with ASCII has led us into all sorts of issues... and why care so much about it when, as soon as data density becomes important, we can use compression, which AFAIK easily rids us of padding?)


You're basically arguing against variable width text encodings - which is ok. But you know, it's entirely possible to use UTF32. In fact, some programming languages use it by default to represent strings.

But again and again, all of this has nothing to do with the size of a byte.

BTW, are you aware that 8-bit microcontrollers are still in widespread use and nowhere near being discontinued?


Fixed-width text encoding + Unicode = cannot fit a "character" in a single octet, which is currently the default addressable unit of storage/memory.

Programming microcontrollers isn't considered to be "mandatory computer literacy" in college, while basic scripting, which involves understanding how text is encoded at the storage/memory level, is.


Again and again and again, a byte is not meant to hold a text character. Also, as the sibling parent has pointed out, fixed-width encoding only gets you so far because it doesn't help with grapheme clusters. That's probably why the world has basically settled on UTF-8: it saves memory and destroys any notion that every abstract text character can somehow be represented by a single number.

> mandatory computer literacy

I don't understand why you keep bringing up this phrase and ignore a huge part of real world computing. College students should simply learn how Unicode works. Are you seriously demanding that CPU designers should change their chip design instead?


UTF32 is a variable length encoding if we consider combining characters.


Generally, I think you are conflating/confusing the concept of “byte” (= smallest unit of memory) with the concept of “character” or rather “code unit” (= smallest unit of a text encoding). The size of the former depends on the CPU architecture, and on modern systems it's always 8 bits. The size of the latter depends on the specific text encoding.


People were holding off on transitioning because pointers use twice as much space on x64. If bytes had quadrupled in size with x64, we would still be using 32-bit software everywhere.


Well, obviously it would have delayed the transition. However, you can only go so far when limited to 4 GB of memory.

And do you have examples of still widely used 8-bit data formats?


I assume you wrote this comment in UTF-8 over HTTP (ASCII-based) and TLS (lots of uint8 fields).


To clarify: I'm more concerned about "final" data formats, less about "transport" ones, which need much longer legacy support.


You can go very far with just 4GB of memory, especially when not using wasteful software.


MIDI (8-bit), 16-bit PCM, 24-bit PCM, and basically any compressed data format (which is always byte-based, because the idea is to save memory). You obviously don't care about memory, but many people do!


But compressed data formats aren't going to care about byte size for this very reason...


Ok, bad example.

But still, 'byte' refers to the smallest addressable unit of memory. There's just no point in arguing over its size...


RGB and Y′CbCr


To start with, RGB (and I assume Y′CbCr?) can be encoded in many different ways. The most common one today (still) uses 8 bits per channel, meaning that a single 1-octet value on its own can only describe monochrome. Therefore 8-bpc RGB is a 24-bit format, not an 8-bit one.

And, by an interesting coincidence, with the arrival of "HDR", 8 bits per channel is slowly becoming obsolete (because insufficient). The next "step" is 10 bits per channel with 3 channels (hence "HDR10(+)"), which should fit quite well in 32 bits?

(However, it would seem that even Dolby's Perceptual Quantizer transfer function might need 12 bits per channel to avoid banding over the "HDR" Rec.2020/2100-sized color gamut..?)
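For illustration, a Python 3 sketch of how three 10-bit channels plus 2 spare/alpha bits pack into a 32-bit word, along the lines of formats like R10G10B10A2 (the field order here is just an assumption for the example):

    def pack_rgb10(r, g, b, a=0):
        # three 10-bit channels + 2 extra bits = exactly 32 bits
        assert all(0 <= c < 1024 for c in (r, g, b)) and 0 <= a < 4
        return (a << 30) | (b << 20) | (g << 10) | r

    print(hex(pack_rgb10(1023, 512, 0)))  # 0x803ff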


We're debating semantics, but if I reshaped an RGB image into component arrays, i.e. u8[yn][xn][3] → u8[3][yn][xn], then would you still view that as a 24-bit format? What if those 24-bit values were Huffman or run-length encoded, would it be an n-bit format? If your Y′CbCr luminance plane has a legal range of 16..235 and the chrominance planes are 16..240, then would it be a 23.40892-bit format?


I'm arguing about non-compressed, possibly padded data types that make learning Unicode (or any other applicable data format) easier because of the equivalence: 1 atomic unit ("character", pixel) = 1 smallest addressable unit of memory (byte). This requires the byte size to be at least as large as the atom size.

And it's particularly important to have this property for text, because not only is data overwhelmingly stored as text (in importance, not by "weight"), but computer programs themselves are written as text.


Can you recommend me a good PC display, at any cost, that has an objectively good gamut, so I can see what you see?


I'm sorry, I'm not sure I understand?


Well, I figured that since you feel strongly about using a type wider than 8 bits for RGB, you must have a really good display that actually lets you perceive the colors it enables you to encode. Most PC displays are garbage, including the expensive ones, because first, sRGB only specifies a very small portion of the light that's perceivable, and secondly, any display maker who builds something better is going to run into complaints about how terrible Netflix looks, because it reveals things like banding (which you mentioned) that otherwise wouldn't be perceivable. So I was hoping you could recommend me a better monitor, so I can get into >8-bit RGB, because I've found it exceedingly difficult to shop around for this kind of thing.


Ok, so you weren't sarcastic and/or misunderstanding my use of "atomic".

Sadly, I kind of gave up on getting a "HDR" display, at least for now, because:

- AFAIK neither Linux nor Windows has good enough "HDR" support yet. (macOS supposedly does, but I'm not interested.)

- I'm happy enough with my HP LP2475w which I got for dirt cheap just before "HDR" became a thing. I consider the 1920x1200 resolution to be perfect for now (as a bonus I can manually scale various old resolutions like 800x600 to be pixel-perfect) - too many programs/OSes still have issues with auto-scaling programs on higher resolution screens (which would come with "HDR"). I'm also particularly fond of the 16:10 ratio, which seems to have gone extinct.

- Maybe I'll be able to run this monitor properly in wide gamuts (though with banding), or maybe even in some kind of "HDR compatibility mode", though it would seem that the current sellers of "HDR" screens aren't going to make that easy. I might be able to get a colorimeter soon to properly calibrate it.


If you have a $200 monitor then it probably struggles to make proper use of 8-bit formats. I have a display that claims to simulate DICOM, but it's not enough; I want more. However, I'm not willing to spend $3000 on a display that doesn't have engineering specs and then send it back because it doesn't work. I don't care about resolution. I care about being able to see the unseen. I care about edge cases like yellow and blue making pink. That was the first significant finding Maxwell reported on when he invented RGB. However, nearly every monitor ever made mixes those two colors wrong, as gray, due to subpixel layout issues. Nearly every scaling algorithm mixes those two colors wrong too, due to the way sRGB was designed. It's amazing how poorly color is modeled on personal computers. https://justine.lol/maxwell.png


Well, when released in 2008 it was a $600 monitor; I got it second-hand for 80€.

I'm not sure what DICOM has to do with color reproduction quality? Also, it seems to be a standard that's quite a bit older than sRGB...

By definition, you can't "see the unseen". "Yellow" and "blue" are opponent "colors", so, by definition, a proper mixture of them is going to give you grey:

https://www.handprint.com/HP/WCL/color2.html#opponentfunctio...

Also, when talking about subtle color effects, you have to consider that personal variation might come into play (for instance red-green "colorblindness" is a spectrum).


It looks like this is the thing I want to buy: https://www.apple.com/pro-display-xdr/ There's plenty of light that is currently unseeable. Look at the chromaticity chart for sRGB. If your definition of color mixes yellow and blue as grey, then you've defined color wrong, because nature has a different definition where it's pink. For example, the CIELAB colorspace will mix the two as pink. Also, I'm not colorblind. If I'm on the spectrum, I would be on the able-to-see-more-color-more-accurately end of it. Although when designing charts I'm very good at choosing colors that accommodate people who are colorblind, while still looking stylish, because I feel inclusive technology is important.


Use Erlang. It has 32-bit char.


Not really. Strings are lists of integers [1], integers are signed and fill a system word, but there's also 4 bits of type information. So you can have a 28-bit signed integer char on a 32-bit system, or a signed 60-bit integer on a 64-bit system.

However, since Unicode is limited to 21 bits by the UTF-16 encoding, a Unicode code point will fit in a small integer.

[1] unless you use binaries, which is often a better choice.



