Wolf480pl

Just realized it's impossible to use UCS-2 (UTF-16) for passing arguments to unix programs, because arguments are nul-terminated, and in UCS-2, almost every other byte is zero...
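
A minimal C illustration of the problem (the example string is made up): in UTF-16LE, even plain ASCII text contains a zero byte after every character, so anything that treats arguments as NUL-terminated C strings stops at the first one.

```
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "ls" encoded as UTF-16LE: every other byte is zero */
    const char arg[] = { 'l', 0, 's', 0, 0, 0 };

    /* execv, the kernel's argv copying, strlen, ... all treat this as a
     * NUL-terminated string and see just "l" */
    printf("%zu\n", strlen(arg));  /* prints 1 */
    return 0;
}
```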

@wolf480pl Unix programs? Those are GNU's Not Unix programs sir.

UTF-16 is a useless format, as it's a multibyte encoding that almost doubles the storage size of text, unless all you are encoding is Chinese characters.

Just use UTF-8 - it's ASCII compatible and you can pass it to whatever program and it will work unless the program does something stupid.

If you have some UTF-16 encoded files, you can convert them to UTF-8 with GNU iconv.
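
The usual one-liner for that is `iconv -f UTF-16 -t UTF-8 in.txt > out.txt`. For the curious, here is a rough C sketch of the same conversion through the POSIX iconv(3) API, which the command-line tool is built on; the buffer contents are made up for illustration:

```
#include <iconv.h>
#include <stdio.h>

int main(void) {
    char in[] = { '\xff', '\xfe', 'h', 0, 'i', 0 };  /* "hi" in UTF-16LE, with BOM */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = sizeof(in), outleft = sizeof(out);

    iconv_t cd = iconv_open("UTF-8", "UTF-16");  /* to-encoding, from-encoding */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    printf("%.*s\n", (int)(sizeof(out) - outleft), out);  /* prints "hi" */
    return 0;
}
```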

@Suiseiseki where the fuck did I say I *want* to use UTF-16?

@Suiseiseki @wolf480pl utf-16 is somewhat faster to decode. It doesn't even have to be Chinese; it's true even for Cyrillic text and the other half of Latin-1.

But then it's still double the size on everything that's ASCII.

Just use the right tools to achieve the goal.
@a1ba @wolf480pl >It's double the size for most things (most things are ASCII)
>It's somewhat faster to decode.
????

Thinking about the differences between the variable-length encodings of UTF-8 and UTF-16, I don't see how either is meaningfully faster to decode than the other.

@Suiseiseki @wolf480pl @a1ba "It depends". UTF-16 is definitely faster to decode because you have fewer loop iterations for the same string (8bit and 16bit RAM reads are about the same speed on the CPU).

HOWEVER, especially when all codepoints are ASCII, UTF-16 uses twice the memory bandwidth. And that hurts too.

So, ultimately depends on the character set / language used.
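
A rough sketch of the iteration-count difference being described (hypothetical codepoint counters, assuming already-validated input):

```
#include <stddef.h>
#include <stdint.h>

/* UTF-16: one loop iteration per 16-bit unit, one extra branch only for
 * surrogate pairs (codepoints above U+FFFF). */
size_t count_utf16(const uint16_t *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)  /* high surrogate: its pair follows */
            i++;
        count++;
    }
    return count;
}

/* UTF-8: still a single pass, but non-ASCII codepoints span 2-4 bytes,
 * so the loop runs once per byte rather than once per codepoint. */
size_t count_utf8(const uint8_t *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* count everything except continuation bytes */
            count++;
    }
    return count;
}
```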

@divVerent @Suiseiseki @wolf480pl that's what I meant by "use the right tools".

If you're making a word processor, maybe it makes sense to use UTF-16 just to make decoding text easier on the CPU. I'm not sure about the exact numbers, but while English might be the most used language in the world, it's definitely not the only language, and many people don't even know it.

@a1ba Honestly I have doubts that UTF-16 is _ever_ the right tool. In almost every situation, either UTF-8 or UTF-32 is gonna outperform it.

Using 4x the RAM for the text in your word processor is likely not even a horrible tradeoff. It's not gonna be bound by it anyway. For handling CJK text that's also full of emoji, it's likely gonna outperform both UTF-16 and UTF-8 - for pure ASCII it's gonna waste I/O bandwidth, but it's gonna be dwarfed by all the _other_ stuff your word processor does anyway.

@divVerent well, maybe you're right. UTF-32 is also a thing, yeah
@divVerent @Suiseiseki @wolf480pl also decoding text that's not ASCII will fall into the likely branch. Fewer loop iterations, fewer branches.

@a1ba @Suiseiseki @wolf480pl Yeah, the problem with CJK in UTF-16 is that both already have enough codepoints outside the BMP that need surrogates. So you're gonna get branch prediction fails anyway.

And at this point the branch-prediction failures may already outweigh the memory bandwidth cost of UTF-32.

There's really only one good use case for UTF-16 IMHO: when you deal with a LOT of text in different languages, and are also RAM constrained. But even then UTF-8 may perform better.
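
For reference, the surrogate-pair path that causes those mispredictions looks roughly like this (a sketch assuming well-formed input; the function name is made up):

```
#include <stddef.h>
#include <stdint.h>

/* Decode one codepoint from UTF-16. Codepoints above U+FFFF are stored as a
 * surrogate pair: cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00).
 * This is the branch that CJK- and emoji-heavy text keeps hitting. */
uint32_t utf16_decode(const uint16_t *s, size_t *i) {
    uint16_t hi = s[(*i)++];
    if (hi >= 0xD800 && hi <= 0xDBFF) {   /* high surrogate */
        uint16_t lo = s[(*i)++];          /* low surrogate follows */
        return 0x10000 + (((uint32_t)hi - 0xD800) << 10) + (uint32_t)(lo - 0xDC00);
    }
    return hi;                            /* BMP codepoint, a single 16-bit unit */
}
```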

@a1ba @Suiseiseki @wolf480pl Should maybe add: UTF-8 is kinda a compression scheme for UTF-32. So if you compress your data anyway, it may as well be UTF-32?

Example:

```
-rw-r--r-- 1 rpolzer rpolzer 232724 Mar 29 08:04 pg21000.txt

-rw-r--r-- 1 rpolzer rpolzer 70071 Mar 29 08:09 pg21000-utf-8.bz2
-rw-r--r-- 1 rpolzer rpolzer 70496 Mar 29 08:09 pg21000-utf-16.bz2
-rw-r--r-- 1 rpolzer rpolzer 70754 Mar 29 08:09 pg21000-utf-32.bz2

-rw-r--r-- 1 rpolzer rpolzer 77920 Mar 29 08:05 pg21000-utf-8.xz
-rw-r--r-- 1 rpolzer rpolzer 78820 Mar 29 08:05 pg21000-utf-16.xz
-rw-r--r-- 1 rpolzer rpolzer 80640 Mar 29 08:05 pg21000-utf-32.xz

-rw-r--r-- 1 rpolzer rpolzer 87692 Mar 29 08:09 pg21000-utf-8.gz
-rw-r--r-- 1 rpolzer rpolzer 102435 Mar 29 08:09 pg21000-utf-16.gz
-rw-r--r-- 1 rpolzer rpolzer 117090 Mar 29 08:09 pg21000-utf-32.gz

-rw-r--r-- 1 rpolzer rpolzer 90527 Mar 29 08:07 pg21000-utf-8.zstd
-rw-r--r-- 1 rpolzer rpolzer 110167 Mar 29 08:06 pg21000-utf-16.zstd
-rw-r--r-- 1 rpolzer rpolzer 160791 Mar 29 08:06 pg21000-utf-32.zstd
```

This is Faust (in German) from Projekt Gutenberg - basically text with rarely occurring non-ASCII characters.

With `xz -9` and `bzip2 -9`, the differences are negligible, but with `gzip -9` and `zstd` you definitely benefit from the more compact encodings, serving as a "pre-compression".

@divVerent @a1ba @Suiseiseki
no, UTF-8 is more than a compression format for UTF-32.

UTF-8 is carefully designed to be backwards-compatible with ascii. And also self-synchronizing.

Basically, most programs that deal with {ascii + language-specific upper half} will work fine with utf-8 too, as long as you don't try to split / insert characters based on offset or length.
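
The self-synchronizing part is worth spelling out: continuation bytes always have the form 10xxxxxx, so from any byte offset you can find the start of the current character without decoding from the beginning. A tiny sketch (hypothetical helper):

```
#include <stddef.h>
#include <stdint.h>

/* Back up from an arbitrary byte offset to the first byte of the UTF-8
 * character it falls inside. Lead bytes look like 0xxxxxxx or 11xxxxxx;
 * only continuation bytes look like 10xxxxxx. */
size_t utf8_char_start(const uint8_t *s, size_t pos) {
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```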

@wolf480pl @a1ba @Suiseiseki Yes, precisely. UTF-8 is more than that, and it was also designed before UTF-32, at a time when people believed we wouldn't need modern-day hieroglyphics (i.e. emoji) and that 65536 codepoints might be enough.

Which was a mistake back then already (see also: Han unification).

Do beware though - the most important ASCII algorithm, word wrapping, does NOT work unmodified with UTF-8, even on a fixed-width terminal - as many multibyte sequences are whitespace too, and 2 or 3 bytes often take up just one terminal cell. The real benefit of UTF-8 shows up more in operating system APIs that see e.g. file names as a string of bytes terminated by a NUL.

Which then some OSes subvert by applying Unicode normalization and case folding to file names... GET RID OF THAT STUFF PLEASE.

@divVerent @a1ba @Suiseiseki yeah, wtf, file names are 0x00-terminated 0x2f-separated sequences of bytes. The kernel should never worry about character sets

@wolf480pl @a1ba @Suiseiseki TBH I disagree with the 0x2f - IMHO that separator should be 0x01 or 0xff in internal APIs. Or maybe one of the four separators in ASCII 0x1c to 0x1f.

But that ship has long sailed.

@divVerent @a1ba @Suiseiseki
idk I like binary formats whose separators coincide with printable ASCII characters, such that if your payload is printable ASCII, then the whole file is printable ASCII, but which can store binary payloads too.

For example, bittorrent's bencode.
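
For anyone unfamiliar: bencode frames everything with printable ASCII (digits, ':', 'i', 'l', 'd', 'e'), while string payloads are length-prefixed raw bytes. The list and dictionary examples below are the ones from the BitTorrent spec; the arrow annotations are not part of the format.

```
i42e                      <- the integer 42
4:spam                    <- the 4-byte string "spam" (payload bytes may be arbitrary binary)
l4:spam4:eggse            <- the list ["spam", "eggs"]
d3:cow3:moo4:spam4:eggse  <- the dict {"cow": "moo", "spam": "eggs"}
```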

@divVerent @Suiseiseki @wolf480pl @a1ba utf-16 was propped up by microsoft and sun as a misguided attempt to get out of how unicode turns all string operations from o(1) to o(n). the idea was that if you just use 16-bit cells then you are back to being able to just reach an arbitrary rune.

this is false because diacritics still exist in utf-16. and utf-16 STILL has characters it cannot represent in a single 16-bit unit (the ones outside the basic multilingual plane), so you STILL have to perform local checks to see if you are about to slice directly into a rune at the wrong place.

basically unicode sucks and some corporate coders tried to get around it and made everything suck even more.
@divVerent @Suiseiseki @a1ba @wolf480pl this assumes you care about being correct. if you don't, and evidently companies in current year do not, then SHRUG as long as you normalize the input and confine everything to the BMP then it kind of works.

@icedquinn @Suiseiseki @divVerent @a1ba ok but like

Did Unicode contain surrogates and modifier codepoints at the time when UTF-16 was designed?

@icedquinn @Suiseiseki @divVerent @a1ba
So I can't even blame this on the Unicode Consortium's scope creep ;_;

@icedquinn @Suiseiseki @wolf480pl @a1ba TBH diacritics are less of an issue - most operations on strings can easily work on a per-codepoint basis, such as word wrapping - you just need to handle diacritics and other combining codepoints as if they're a word character.

And for stuff like line length computation, you need to take the different per-character width of your font into account anyway.

What's really annoying is string comparing, as you now have to apply a normalization first...
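
A small C illustration of why byte-wise comparison isn't enough (both strings render as "é"; the byte values are just the standard UTF-8 encodings of the two forms):

```
#include <stdio.h>
#include <string.h>

int main(void) {
    const char nfc[] = "\xc3\xa9";   /* U+00E9, precomposed "é" (NFC) */
    const char nfd[] = "e\xcc\x81";  /* 'e' + U+0301 combining acute (NFD) */

    /* Same text to a human, different byte sequences to strcmp/memcmp,
     * hence the need to normalize before comparing. */
    printf("%d\n", strcmp(nfc, nfd) == 0);  /* prints 0 */
    return 0;
}
```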

@divVerent @Suiseiseki @wolf480pl @a1ba back when i had that code i basically made several accessors and iterators to deal with it. you told it if you were dealing with graphemes, or just code points, and it had heuristics and loops to check the nearest safe split point at a given byte.

i don't think i still have that C code, it's been ages. shame since it would have been neat resume fodder before chatgpt.
@wolf480pl I wonder: if you make a special program and call it with execv, for example, can you pass binary data as arguments?

@a1ba I don't see how the kernel will be able to copy such argv into the new process's address space.

execv doesn't take argv as a pointer, length pair, or as an iovec, or sth. It's just a bare char**. The only way I can think of for a kernel to deal with it is sth like:

```
char *arg;
while ((arg = *argv++)) {
    strncpy(buf, arg, bufspace);
    ...
}
```