Just realized it's impossible to use UCS-2 (or UTF-16) for passing arguments to Unix programs, because arguments are NUL-terminated, and in UCS-2, almost every other byte is zero...
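A minimal sketch of the failure mode (hypothetical example, not from the original post): any byte-oriented, NUL-terminated API treats the zero high byte of the very first ASCII codepoint as the end of the string.
```
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "hi" in UCS-2 / UTF-16LE: every ASCII codepoint carries a zero
     * high byte, so embedded NULs are everywhere. */
    const char s[] = { 'h', 0, 'i', 0, 0, 0 };
    /* A byte-oriented, NUL-terminated API stops at the first zero: */
    printf("%zu\n", strlen(s)); /* prints 1, not 4 */
    return 0;
}
```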
@Suiseiseki where the fuck did I say I *want* to use UTF-16?
@Suiseiseki @wolf480pl @a1ba "It depends". UTF-16 is definitely faster to decode, because you have fewer loop iterations for the same string (8-bit and 16-bit RAM reads are about the same speed on the CPU).
HOWEVER, especially when all codepoints are ASCII, UTF-16 uses twice the memory bandwidth, and that hurts too.
So it ultimately depends on the character set / language used.
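To make the "loop iterations" point concrete, a minimal sketch (hypothetical helper, not from the thread): a byte-wise UTF-8 scan does one iteration per byte, i.e. up to 4 per codepoint, whereas a UTF-16 scan does at most 2 per codepoint.
```
#include <stddef.h>

/* Count codepoints in a UTF-8 buffer: continuation bytes match
 * 0b10xxxxxx, so every byte that is NOT a continuation byte starts a
 * codepoint. Note: one loop iteration per *byte*, up to 4 per codepoint. */
size_t utf8_codepoints(const unsigned char *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (s[i] & 0xC0) != 0x80;
    return count;
}
```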
@a1ba Honestly I have doubts that UTF-16 is _ever_ the right tool. In almost every situation, either UTF-8 or UTF-32 is gonna outperform it.
Using 4x the RAM for the text in your word processor is likely not even a horrible tradeoff. It's not gonna be bound by it anyway. For handling CJK text that's also full of emoji, it's likely gonna outperform both UTF-16 and UTF-8 - for pure ASCII it's gonna waste I/O bandwidth, but that's gonna be dwarfed by all the _other_ stuff your word processor does anyway.
@a1ba @Suiseiseki @wolf480pl Yeah, the problem with CJK in UTF-16 is that both Chinese and Japanese already have enough codepoints outside the BMP, which need surrogate pairs. So you're gonna get branch prediction failures anyway.
And at this point the branch prediction failures may already outweigh the memory bandwidth cost of UTF-32.
There's really only one good use case for UTF-16 IMHO: when you deal with a LOT of text in different languages, and are also RAM constrained. But even then UTF-8 may perform better.
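For reference, this is the branch in question - a minimal UTF-16 decode sketch, assuming well-formed input (hypothetical helper, not from the thread):
```
#include <stddef.h>
#include <stdint.h>

/* Decode one codepoint from a UTF-16 buffer (n >= 1 units assumed).
 * The surrogate test is a data-dependent branch: it predicts well for
 * pure-BMP text, but text mixing BMP and non-BMP codepoints
 * (CJK + emoji) makes it unpredictable. */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *out) {
    if (n >= 2 && s[0] >= 0xD800 && s[0] <= 0xDBFF
               && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *out = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                          | (s[1] - 0xDC00));
        return 2; /* surrogate pair: two 16-bit units */
    }
    *out = s[0];
    return 1; /* BMP codepoint: one unit */
}
```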
@a1ba @Suiseiseki @wolf480pl Should maybe add: UTF-8 is kind of a compression scheme for UTF-32. So if you compress your data anyway, it might as well be UTF-32?
Example:
```
-rw-r--r-- 1 rpolzer rpolzer 232724 Mar 29 08:04 pg21000.txt
-rw-r--r-- 1 rpolzer rpolzer 70071 Mar 29 08:09 pg21000-utf-8.bz2
-rw-r--r-- 1 rpolzer rpolzer 70496 Mar 29 08:09 pg21000-utf-16.bz2
-rw-r--r-- 1 rpolzer rpolzer 70754 Mar 29 08:09 pg21000-utf-32.bz2
-rw-r--r-- 1 rpolzer rpolzer 77920 Mar 29 08:05 pg21000-utf-8.xz
-rw-r--r-- 1 rpolzer rpolzer 78820 Mar 29 08:05 pg21000-utf-16.xz
-rw-r--r-- 1 rpolzer rpolzer 80640 Mar 29 08:05 pg21000-utf-32.xz
-rw-r--r-- 1 rpolzer rpolzer 87692 Mar 29 08:09 pg21000-utf-8.gz
-rw-r--r-- 1 rpolzer rpolzer 102435 Mar 29 08:09 pg21000-utf-16.gz
-rw-r--r-- 1 rpolzer rpolzer 117090 Mar 29 08:09 pg21000-utf-32.gz
-rw-r--r-- 1 rpolzer rpolzer 90527 Mar 29 08:07 pg21000-utf-8.zstd
-rw-r--r-- 1 rpolzer rpolzer 110167 Mar 29 08:06 pg21000-utf-16.zstd
-rw-r--r-- 1 rpolzer rpolzer 160791 Mar 29 08:06 pg21000-utf-32.zstd
```
This is Faust (in German) from Projekt Gutenberg - basically ASCII with rarely occurring non-ASCII characters.
With `xz -9` and `bzip2 -9`, the differences are negligible, but with `gzip -9` and `zstd` you definitely benefit from the more compact encoding serving as a kind of "pre-compression".
@divVerent @a1ba @Suiseiseki
no, UTF-8 is more than a compression format for UTF-32.
UTF-8 is carefully designed to be backwards-compatible with ASCII. And also self-synchronizing.
Basically, most programs that deal with {ASCII + a language-specific upper half} will work fine with UTF-8 too, as long as you don't try to split / insert characters based on byte offset or length.
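Self-synchronizing means you can resynchronize from an arbitrary byte offset, because continuation bytes are recognizable in isolation. A minimal sketch (hypothetical helper name):
```
#include <stddef.h>

/* Resynchronize backwards from an arbitrary byte offset: continuation
 * bytes are 0b10xxxxxx, and neither ASCII nor lead bytes ever look like
 * one, so backing up past them always lands on a codepoint boundary. */
size_t utf8_sync_back(const unsigned char *s, size_t i) {
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```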
@wolf480pl @a1ba @Suiseiseki Yes, precisely. UTF-8 is more than that, and it was also designed before UTF-32, at a time when people believed we wouldn't need modern-day hieroglyphics (i.e. emoji) and that 65536 codepoints might be enough.
Which was a mistake back then already (see also: Han unification).
Do beware though - the most important ASCII algorithm, word wrapping, does NOT work unmodified with UTF-8, even on a fixed-width terminal - many multibyte sequences are whitespace too, and often 2 or 3 bytes take up just one terminal cell (see the sketch after this post). The real benefit of UTF-8 shows in operating system APIs that see e.g. file names as a string of bytes terminated by a NUL.
Which then some OSes subvert by applying Unicode normalization and case folding to file names... GET RID OF THAT STUFF PLEASE.
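To illustrate the wrapping point above: bytes, codepoints, and terminal cells are three different counts. A minimal sketch, assuming a UTF-8 locale and POSIX wcwidth(3) (hypothetical helper name):
```
#define _XOPEN_SOURCE 700 /* for wcwidth(3) on glibc */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Terminal cells occupied by a UTF-8 string: bytes != codepoints != cells.
 * Assumes a UTF-8 locale; wcwidth(3) is POSIX, not ISO C. */
static int cells(const char *s) {
    mbstate_t st = {0};
    wchar_t wc;
    size_t len;
    int total = 0;
    while ((len = mbrtowc(&wc, s, MB_CUR_MAX, &st)) != 0) {
        if (len == (size_t)-1 || len == (size_t)-2)
            break;                /* invalid or truncated sequence */
        int w = wcwidth(wc);      /* 2 for wide CJK, 0 for combining marks */
        if (w > 0)
            total += w;
        s += len;
    }
    return total;
}

int main(void) {
    setlocale(LC_ALL, "");        /* must select a UTF-8 locale */
    printf("%d\n", cells("日本語")); /* 9 bytes, 3 codepoints, 6 cells */
    return 0;
}
```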
@divVerent @a1ba @Suiseiseki yeah, wtf, file names are 0x00-terminated 0x2f-separated sequences of bytes. The kernel should never worry about character sets
@wolf480pl @a1ba @Suiseiseki TBH I disagree with the 0x2f part - IMHO that separator should be 0x01 or 0xff in internal APIs. Or maybe one of the four ASCII separator control characters, 0x1c to 0x1f (FS, GS, RS, US).
But that ship has long sailed.
@divVerent @a1ba @Suiseiseki
idk, I like binary formats whose separators coincide with printable ASCII characters, such that if your payload is printable ASCII, then the whole file is printable ASCII, but which can store binary payloads too.
For example, bittorrent's bencode.
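For reference, bencode frames strings as `<length>:<bytes>`, so the payload needs no escaping and an ASCII payload stays ASCII. A minimal sketch (hypothetical helper, not from the thread):
```
#include <stdio.h>

/* Bencode a byte string: "<length>:<bytes>". The length prefix makes
 * escaping unnecessary, so binary payloads work, and printable-ASCII
 * payloads yield printable-ASCII output. */
void bencode_str(FILE *f, const char *s, size_t n) {
    fprintf(f, "%zu:", n);
    fwrite(s, 1, n, f);
}

int main(void) {
    bencode_str(stdout, "spam", 4);     /* prints 4:spam */
    putchar('\n');
    bencode_str(stdout, "\x00\xff", 2); /* binary payloads work too */
    putchar('\n');
    return 0;
}
```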
@icedquinn @Suiseiseki @divVerent @a1ba ok but like
Did Unicode contain surrogates and modifier codepoints at the time when UTF-16 was designed?
@icedquinn @Suiseiseki @divVerent @a1ba
So I can't even blame this on Unicode Consortium's scope creep ;_;
@icedquinn @Suiseiseki @wolf480pl @a1ba TBH diacritics are less of an issue - most operations on strings, such as word wrapping, can easily work on a per-codepoint basis - you just need to handle diacritics and other combining codepoints as if they're word characters.
And for stuff like line length computation, you need to take the different per-character width of your font into account anyway.
What's really annoying is string comparison, as you now have to apply normalization first...
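For example, precomposed and decomposed forms of the same text differ byte-wise - a minimal sketch, hard-coding the two UTF-8 encodings of "é":
```
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Both render as "é", but the bytes differ: */
    const char *nfc = "\xC3\xA9";   /* U+00E9, precomposed */
    const char *nfd = "e\xCC\x81";  /* U+0065 + U+0301 combining acute */
    /* Byte-wise comparison says they're different strings, which is why
     * comparing needs Unicode normalization (e.g. via ICU) first. */
    printf("%d\n", strcmp(nfc, nfd) != 0); /* prints 1 */
    return 0;
}
```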
@a1ba I don't see how the kernel would be able to copy such an argv into the new process's address space.
execv doesn't take argv as a (pointer, length) pair, or as an iovec, or anything like that. It's just a bare char**. The only way I can think of for a kernel to deal with it is something like:
```
char *arg;
while ((arg = *argv++)) {   /* argv itself ends at a NULL pointer */
    /* copies bytes up to the first NUL - exactly why UCS-2 arguments,
     * which are full of zero bytes, can't survive this interface */
    strncpy(buf, arg, bufspace);
    /* ... */
}
```