To be a bit more explicit: Unicode is a character encoding to roughly 21-bit 'bytes' (code points), and it is variable-width in those 'bytes' even before considering how the 'bytes' are encoded into actual bytes. E.g. "ψ̊" (Greek small psi with ring above) is U+03C8 U+030A (two 'bytes').
To be technical, by "character" I mean "user-perceived character" or (in Unicode speak) "extended grapheme cluster". This is the thing a user will think of as one character when looking at it on their screen.
A code point is the atomic unit of the abstract Unicode encoding. By "abstract" I mean it is not an actual text encoding you can write to a file.
A code unit is the atomic unit of an actual text encoding, such as UTF-8, UTF-16LE or UTF-32LE (and their BE equivalents).
---
So to put it together a "user-perceived character" is made up of one or more "code points". When implemented in an application, each "code point" is encoded using one or more "code units".
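To make that concrete, here is a minimal sketch (in Python 3, whose str type indexes code points) of those three layers for the ψ̊ example above:

    # One user-perceived character: PSI (U+03C8) + COMBINING RING ABOVE (U+030A)
    s = "\u03c8\u030a"
    print(len(s))                           # 2 code points
    print(len(s.encode("utf-8")))           # 4 UTF-8 code units (2 bytes per code point here)
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
    print(len(s.encode("utf-32-le")) // 4)  # 2 UTF-32 code units

Note that a user-perceived character count of 1 never falls out of those numbers directly; that needs grapheme cluster segmentation (UAX #29).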
> One byte = one "character" makes for much easier programming.
Only if you are naively operating in the Anglosphere, or a world where the most complex thing you have to handle is a larger character set. In reality, there are ligatures, diacritics, combining characters, RTL, nbsp, locales, and emoji (with skin tones!). Not to mention legacy encodings.
And no, text does not take up a "small fraction of memory and storage" in a huge range of applications; it matters enough that some regions still run transcoding proxies.
This is not about covering ALL of Unicode. This is about starting to cover Unicode.
"Anglosphere" would be just 7(&"8") bit ASCII, and it's the current situation where it takes quite a lot of skill and knowledge just to start learning how to properly deal with Unicode, because it's often not even taught !
IMHO 32-bit bytes would help tremendously with onboarding developers into Unicode, because it would force dumping ASCII-only as the starting point (and sadly, often ending point) for teaching how to deal with text.
And who can blame the teachers? Unicode is already hard enough without the added difficulty of having to explain its multi-byte representations...
Last but not least: this would have forced standardization between the Unix world, now on UTF-8, and the Windows world, which is still stuck on UTF-16 (and Windows-1252?!) for some core functions like filenames, which still regularly results in issues when working with files that have non-ASCII names.
Not all user-perceived characters can be represented as a single Unicode codepoint. Hence, Unicode text encodings (almost[1]) always have to be treated as variable length, even UTF-32.
[1] at runtime, you could dynamically assign 'virtual' codepoints to grapheme clusters and get a fixed-length encoding for strings that way
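A rough sketch of that footnote idea (just one possible way to do it, assuming the third-party Python regex module for its \X grapheme cluster pattern):

    import regex  # third-party; supports \X for extended grapheme clusters

    class GraphemeTable:
        """Interns grapheme clusters so a string becomes a fixed-width array of indices."""
        def __init__(self):
            self.clusters = []   # index -> grapheme cluster
            self.index = {}      # grapheme cluster -> index

        def encode(self, text):
            out = []
            for g in regex.findall(r"\X", text):
                if g not in self.index:
                    self.index[g] = len(self.clusters)
                    self.clusters.append(g)
                out.append(self.index[g])  # one fixed-width unit per user-perceived character
            return out

        def decode(self, codes):
            return "".join(self.clusters[i] for i in codes)

    table = GraphemeTable()
    codes = table.encode("ψ̊ and ψ")
    print(len(codes))           # counts user-perceived characters, not code points
    print(table.decode(codes))  # round-trips back to the original string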
Every time I see one of these threads, my gratitude to only do backend grows. Human behavior is too complex, let the webdevs handle UI, and human languages are too complex, not sure what speciality handles that. Give me out-of-order packets and parsing code that skips a character if the packet length lines up just so, any day.
I am thankful that almost all the Unicode text I see is rendered properly now, farewell the little boxes. Good job lots of people.
I think we really have the iPhone jailbreakers to thank for that. U.S. developers were allergic to, almost offended by, anything that wasn't ASCII, and then someone released an app that unlocked the emoji icons that Apple had originally intended only for Japan. Emoji are defined in the astral planes, so almost nothing at the time was capable of understanding them, yet they were so irresistible that developers worldwide who would otherwise have done nothing to address their cultural biases immediately fixed everything overnight to have them. So thanks to cartoons, we now have a more inclusive world.
There's supporting Unicode, and 'supporting' Unicode. If you're only dealing with western languages, it's easy to fall into the trap of only 'supporting' Unicode. Proper emoji handling will put things like grapheme clusters and zero-width joiners on your map.
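For example (a quick Python illustration, not a full UAX #29 implementation):

    # A family emoji is three emoji stitched together with ZERO WIDTH JOINERs.
    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"   # man + ZWJ + woman + ZWJ + girl
    print(len(family))                  # 5 code points
    print(len(family.encode("utf-8")))  # 18 bytes in UTF-8
    # A grapheme-cluster-aware count (e.g. regex.findall(r"\X", family)) gives 1,
    # which is what the user actually sees on screen.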
Anyway, it doesn't make much sense to define the size of a “byte” as anything other than 8 bits, because that's the smallest addressable memory unit. If you need a 32-bit data type, just use one!
My very point is that we should have increased the size of the smallest addressable memory unit from 8 to 32 bits: increased it again, that is, since earlier computer architectures used anywhere from 4 to 7 bits per byte. (There might still be e-mail servers around that are directly compatible with "non-padded" 7-bit ASCII?)
(You'll also notice that caring about not wasting the 8th bit with ASCII has led us into all sorts of issues... and why care so much about it when, as soon as data density becomes important, we can use compression, which AFAIK easily rids us of padding?)
You're basically arguing against variable-width text encodings, which is ok. But you know, it's entirely possible to use UTF-32. In fact, some programming languages use it by default to represent strings.
But again and again, all of this has nothing to do with the size of a byte.
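For illustration, a short Python sketch of that point: UTF-32 gives you fixed-width code units, but each unit is still just four ordinary 8-bit bytes, so the size of the byte never enters into it:

    import struct

    s = "\u03c8\u030a"                    # the ψ̊ example again: 2 code points
    data = s.encode("utf-32-le")
    print(len(data))                      # 8 bytes
    print([hex(u) for u in struct.unpack("<2I", data)])   # ['0x3c8', '0x30a']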
BTW, are you aware that 8-bit microcontrollers are still in widespread use and nowhere near being discontinued?
Static width text encoding + Unicode = Cannot fit a "character" in a single octet, which currently is the default addressable unit of storage/memory.
Programming microcontrollers isn't considered "mandatory computer literacy" in college, while basic scripting, which involves understanding how text is encoded at the storage/memory level, is.
Again and again and again, a byte is not meant to hold a text character. Also, as the sibling parent has pointed out, fixed-width encoding only gets you so far because it doesn't help with grapheme clusters. That's probably why the world has basically settled on UTF-8: it saves memory and destroys any notion that every abstract text character can somehow be represented by a single number.
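A rough illustration of the memory argument (byte counts will of course vary with the text):

    samples = {
        "mostly ASCII": "def main(): return 42",
        "Greek":        "ψυχή",
        "CJK":          "文字化け",
    }
    for name, text in samples.items():
        utf8 = len(text.encode("utf-8"))
        utf32 = len(text.encode("utf-32-le"))
        print(f"{name}: {utf8} bytes as UTF-8, {utf32} bytes as UTF-32")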
> mandatory computer literacy
I don't understand why you keep bringing up this phrase and ignore a huge part of real world computing. College students should simply learn how Unicode works. Are you seriously demanding that CPU designers should change their chip design instead?
Generally, I think you are conflating/confusing the concept of “byte” (= smallest unit of memory) with the concept of “character” or “code unit” (= smallest unit of a text encoding). The size of the former depends on the CPU architecture, and on modern systems it's always 8 bits. The size of the latter depends on the specific text encoding.
People were holding off on transitioning because pointers use twice as much space in x64. If bytes had quadrupled in size with x64, we would still be using 32-bit software everywhere.
MIDI (8-bit), 16-bit PCM, 24-bit PCM, and basically any compressed data format (which is always byte-based, because the idea is to save memory). You obviously don't care about memory, but many people do!
To start with, RGB (and I assume Y′CbCr?) can be encoded in many different ways. The most common one today (still) uses 8 bits per channel, meaning that a single octet on its own can only describe one channel, i.e. monochrome. Therefore 8-bpc RGB is a 24-bit format, not an 8-bit one.
And, by an interesting coincidence, with the arrival of "HDR", 8 bits per channel is slowly becoming obsolete (because insufficient). The next "step" is 10 bits per channel across 3 channels (hence "HDR10(+)"), which should fit quite well in 32 bits?
(However, it would seem that even Dolby's Perceptual Quantizer transfer function might need 12 bits per channel to avoid banding over the "HDR" Rec.2020/2100-sized color gamut..?)
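As a back-of-the-envelope check of the "32 bits" remark, here is a hypothetical packing in Python, close in spirit to what real R10G10B10A2 pixel formats do:

    def pack_rgb10(r, g, b):
        """Pack three 10-bit channels into one 32-bit word (2 bits left over)."""
        assert all(0 <= c < 1024 for c in (r, g, b))
        return (r << 20) | (g << 10) | b

    def unpack_rgb10(word):
        return (word >> 20) & 0x3FF, (word >> 10) & 0x3FF, word & 0x3FF

    w = pack_rgb10(1023, 512, 0)
    assert unpack_rgb10(w) == (1023, 512, 0)
    print(hex(w), w.bit_length() <= 32)   # fits in a 32-bit word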
We're debating semantics, but if I reshaped an RGB image into component arrays i.e. u8[yn][xn][3] → u8[3][yn][xn] then would you still view that as a 24-bit format? What if those 24-bit values were huffman or run-length encoded would it be an n-bit format? If your Y′CbCr luminance plane has a legal range of 16..235 and the chrominance planes are 16..240, then would it be a 23.40892 bit format?
I'm arguing about non-compressed, possibly padded data types that make learning Unicode (or any other applicable data format) easier because of the equivalence: 1 atomic unit ("character", pixel) = 1 smallest addressable unit of memory (byte). This requires the byte size to be at least as large as the atom size.
And it's particularly important to have this property for text, because not only is data overwhelmingly stored as text (by importance, not by volume), but computer programs themselves are written as text.
Well, I figured that since you feel strongly about using a type wider than 8 bits for RGB, you must have a really good display that actually lets you perceive the colors it enables you to encode. Most PC displays are garbage, including the expensive ones, because first, sRGB only specifies a very small portion of the light that's perceivable, and secondly, any display maker who builds something better is going to run into complaints about how terrible Netflix looks, because it reveals things like banding (which you mentioned) that otherwise wouldn't be perceivable. So I was hoping you could recommend me a better monitor, so I can get into >8-bit RGB, because I've found it exceedingly difficult to shop around for this kind of thing.
Ok, so you weren't sarcastic and/or misunderstanding my use of "atomic".
Sadly, I kind of gave up on getting a "HDR" display, at least for now, because:
- AFAIK neither Linux nor Windows have good enough "HDR" support yet. (MacOS supposedly does, but I'm not interested.)
- I'm happy enough with my HP LP2475w which I got for dirt cheap just before "HDR" became a thing. I consider the 1920x1200 resolution to be perfect for now (as a bonus I can manually scale various old resolutions like 800x600 to be pixel-perfect) - too many programs/OSes still have issues with auto-scaling programs on higher resolution screens (which would come with "HDR"). I'm also particularly fond of the 16:10 ratio, which seems to have gone extinct.
- Maybe I'll be able to run this monitor properly in wide gamuts (though with banding), or maybe even in some kind of "HDR" compatibility mode, though it would seem that the current sellers of "HDR" screens aren't going to make that easy. I might be able to get a colorimeter soon to properly calibrate it.
If you have a $200 monitor then it probably struggles to make proper use of 8-bit formats. I have a display that claims to simulate DICOM, but it's not enough; I want more. However, I'm not willing to spend $3000 on a display that doesn't have engineering specs and then send it back because it doesn't work. I don't care about resolution. I care about being able to see the unseen. I care about edge cases like yellow and blue making pink. That was the first significant finding Maxwell reported on when he invented RGB. However, nearly every monitor ever made mixes those two colors wrong, as gray, due to subpixel layout issues. Nearly every scaling algorithm mixes those two colors wrong too, due to the way sRGB was designed. It's amazing how poorly color is modeled on personal computers. https://justine.lol/maxwell.png
Well, when it was released in 2008 it was a $600 monitor; I got it second-hand for 80€.
I'm not sure what DICOM has to do with color reproduction quality? Also, it seems to be a standard quite a bit older than sRGB...
By definition, you can't "see the unseen". "Yellow" and "blue" are opponent "colors", so, by definition, a proper mixture of them is going to give you grey.
Also, when talking about subtle color effects, you have to consider that personal variation might come into play (for instance red-green "colorblindness" is a spectrum).
It looks like this thing is the thing I want to buy: https://www.apple.com/pro-display-xdr/ There's plenty of light that is currently unseeable; look at the chromaticity chart for sRGB. If your definition of color mixes yellow and blue as grey then you've defined color wrong, because nature has a different definition where it's pink. For example, the CIELAB colorspace will mix the two as pink. Also, I'm not colorblind. If I'm on the spectrum, I would be on the "able to see more color more accurately" end of it. Although when designing charts I'm very good at choosing colors that accommodate people who are colorblind, while still looking stylish, because I feel like inclusive technology is important.
Not really. Strings are a list of integers [1]; integers are signed and fill a system word, but there are also 4 bits of type information. So you can have a 28-bit signed integer char on a 32-bit system, or a signed 60-bit integer on a 64-bit one.
However, since Unicode is limited to 21 bits by the UTF-16 encoding, a Unicode code point will fit in a small integer.
[1] unless you use binaries, which is often a better choice.
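A quick sanity check of the 21-bit claim (Python used here just for the arithmetic):

    MAX_CODE_POINT = 0x10FFFF            # highest code point reachable via UTF-16 surrogate pairs
    print(MAX_CODE_POINT.bit_length())   # 21, so it fits comfortably in a 28-bit (or 60-bit) small integer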