
If you were to design the worst possible file format, one which maximizes the number of bugs in its parsers, what features would it include?

Let me start:
Length-prefixed everything. Containers (lists, dicts, etc) have a length field in bytes, but all the items inside them also have their own length fields.

@wolf480pl You basically just described the SSL/TLS encoding…
@minus @wolf480pl Nah, that's for certificates IIRC, I mean the clientHello and serverHello and maybe the rest of the packets.

@lanodan
I've heard ASN.1 does that, too, and that's where I got my inspiration.

But we can do worse than that. Off the top of my head I can think of at least 3 more design decisions that would make the format way more difficult to parse, but I want yall to come up with something too.
@minus

@wolf480pl Python-level whitespace rules. Instead of { you can also plainly write the syntax in English and use synonyms of the words interchangeably, same with " and '. Names for your types will also be automatically converted to the same thing

braceopen
Shape
color colon quotationmark white quotationmark

Shape
colour: "white'
bracketclose

@wolf480pl I'll make it "human readable". No two implementations would parse the file in the same way because the spec would contain a lot of unclear corner cases.

@jerky What do you mean by "human readable"?

As for edge cases - believe me, we're all about them. The question is, what type of edge cases you would include / how you would create them.

@wolf480pl
Like YAML. It aims to be human friendly: you may omit quotes for strings, boolean values can be specified as various English words (true/yes/on, false/no/off), etc. But as a result it is so complex both for humans [1] (ironically), and parsers [2].

[1]: arp242.net/yaml-config.html#su
[2]: matrix.yaml.io/valid.html
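
To make the coercion problem concrete, here is a toy sketch of a resolver for unquoted scalars. It is not a real YAML parser: `resolve_scalar` and the word lists are made up for illustration, loosely following the true/yes/on and false/no/off words mentioned above.

```python
# Toy illustration only: a hypothetical resolver for unquoted scalars.
# It is NOT a YAML implementation; the names here are invented for this sketch.
TRUE_WORDS = {"true", "yes", "on"}
FALSE_WORDS = {"false", "no", "off"}


def resolve_scalar(text):
    """Guess the type of an unquoted scalar the way a 'friendly' format might."""
    lowered = text.lower()
    if lowered in TRUE_WORDS:
        return True
    if lowered in FALSE_WORDS:
        return False
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text  # nothing matched, so keep it as a string


# A list of country codes: Norway silently turns into a boolean.
countries = [resolve_scalar(code) for code in ["se", "no", "dk"]]

# A list of version strings: "1.10" becomes the float 1.1.
versions = [resolve_scalar(v) for v in ["1.9", "1.10", "1.10.1"]]
```

With these rules `countries` ends up as `["se", False, "dk"]` and `versions` as `[1.9, 1.1, "1.10.1"]`, so the type of a value depends on which words and digits it happens to contain.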

@jerky oh, you mean mandating coercion of some errors to non-errors? Sounds fun :D

@wolf480pl What is wrong with length prefixes? They seem easy to implement.

The only thing that I can think of atm is having nested structures. https://github.com/lovasoa/bad_json_parsers

@Mikoto yes, arbitrary nesting! I was waiting for someone to say this.
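
One way a parser can defend itself against arbitrary nesting is to measure the depth iteratively before handing the input to a recursive parser. A minimal sketch for JSON-like text, where `max_nesting_depth`, `check_depth`, and the `MAX_DEPTH` value are hypothetical names and limits chosen for illustration:

```python
MAX_DEPTH = 64  # hypothetical limit; a real format would have to specify one


def max_nesting_depth(text):
    """Return the deepest [ or { nesting in JSON-like text, without recursion."""
    depth = deepest = 0
    in_string = escaped = False
    for ch in text:
        if in_string:
            # Brackets inside string literals do not count as nesting.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest


def check_depth(text, limit=MAX_DEPTH):
    """Raise ValueError if the input nests deeper than the limit."""
    if max_nesting_depth(text) > limit:
        raise ValueError("input is nested too deeply")
```

The scan only counts brackets; it does not validate that they are balanced or matched, so it is a pre-check and not a replacement for the parser's own error handling.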

@Mikoto
Now as for what's wrong with length fields:
If you have nested (even 1-level nested) length fields, like I described, it's unclear what you should do if:
- the sum of inner length fields is smaller than the outer length
- the sum of inner length fields is greater than the outer length

Moreover, a parser may not even check for that. In which case you'll get different results depending on whether it skipped over the outer thing or started traversing the inside of it.

@Mikoto So yeah, this is easy to implement. Easy to implement wrong. And very hard to implement right, especially when the standard doesn't specify what should happen when the lengths don't sum up.
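
Here is a sketch of what "implementing it right" could look like, for a made-up format: each item is a 1-byte tag, a 2-byte big-endian length, then the payload, and tag `0x01` marks a container whose payload is a sequence of items. The layout, the `CONTAINER` tag, and the decision to reject every mismatch are all assumptions for this example; a real spec would have to say what to do.

```python
import struct

CONTAINER = 0x01  # hypothetical tag: payload is a sequence of nested items


class LengthMismatch(ValueError):
    """Raised when inner and outer length fields disagree."""


def parse_item(buf, start, end):
    """Parse one item that must fit inside buf[start:end].

    Returns (value, offset_after_item).
    """
    if end - start < 3:
        raise LengthMismatch("truncated item header")
    tag, length = struct.unpack_from(">BH", buf, start)
    body_start = start + 3
    body_end = body_start + length
    if body_end > end:
        # Inner length claims more bytes than the enclosing item has.
        raise LengthMismatch("item overruns its container")
    if tag != CONTAINER:
        return (tag, bytes(buf[body_start:body_end])), body_end
    children = []
    pos = body_start
    while pos < body_end:
        # Each child is confined to the container's own byte range, so
        # leftover bytes that do not form a complete item are an error too.
        child, pos = parse_item(buf, pos, body_end)
        children.append(child)
    return (tag, children), body_end


def parse(buf):
    """Parse exactly one top-level item and reject trailing bytes."""
    value, pos = parse_item(buf, 0, len(buf))
    if pos != len(buf):
        raise LengthMismatch("trailing bytes after top-level item")
    return value
```

The key design choice is that every recursive call receives the bounds of its parent, so a parser that skips a container and a parser that walks into it always agree on where the container ends. The recursion itself is still unbounded here, so deep nesting would need a separate limit.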

@wolf480pl @Mikoto Capnproto has length fields in it, but doesn't have these problems, since it uses "pointers" for nesting.

@ignaloidas @Mikoto
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA!!!!

@wolf480pl @Mikoto Works pretty good, not gonna lie. It's a schema defined encoding, so usually just checking if pointers are in-bounds is enough.

@ignaloidas @Mikoto
well, tbh. most filesystems work like this... but I'd still avoid pointers unless absolutely necessary.
Are the pointers relative or absolute?

@ignaloidas @Mikoto
so yeah, for the "worst format competition", pointers are definitely a good suggestion, as they can interact in weird ways with other features of the format when not used carefully

@ignaloidas @Mikoto
Like, for example, you could have a pointer into the middle of some structure, so that the same range of bytes can be interpreted in two different ways...

which reminds me of isohybrid

@wolf480pl @Mikoto Relative within a segment, absolute between segments.

@wolf480pl @Mikoto I'd note that this is one of the least cursed things about it. It says that objects with default values should be encoded as all zeroes. And it has default values. So to get the actual value of a field you have to XOR it with the default value.
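
A rough sketch of the XOR-with-default idea for a single 16-bit field. This is not Cap'n Proto's actual API; `encode_u16`, `decode_u16`, and `DEFAULT_PORT` are hypothetical names, and the little-endian layout is an assumption for the example.

```python
import struct

DEFAULT_PORT = 8080  # hypothetical schema default for a 16-bit field


def encode_u16(value, default):
    """Store value XOR default, so the default itself encodes as all zeroes."""
    return struct.pack("<H", value ^ default)


def decode_u16(wire, default):
    """Recover the value by XOR-ing the stored bits with the default again."""
    (raw,) = struct.unpack("<H", wire)
    return raw ^ default
```

Here `encode_u16(8080, DEFAULT_PORT)` yields two zero bytes, and decoding two zero bytes gives back 8080. A reader that does not know the schema default cannot tell what the field actually holds.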

@wolf480pl @Mikoto Exploiting this, the List type says that if all objects have empty space at the end, it can be omitted for tighter packing. There is a neat DoS where you give a list that has MAXNUM 0-length items. Which is completely legal, it's just full of default-value items.

@wolf480pl @Mikoto Oh, also forgot to mention, length is encoded in the pointer. And because all-zeroes objects are legal, and you can cut-off zeroes from the end of objects, pointer to an all-default object points to itself and has 0 length.

@ignaloidas
I hope Sony or Nintendo uses it in their console, so that we can have a C3 talk describing how it was thoroughly pwned.
@Mikoto

@Mikoto @wolf480pl It then can use this neat compression scheme which erases zeros. Pretty neat.

@ignaloidas @Mikoto
bencode has length fields in it, but only for strings, which appear only at the lowest level of nesting. Higher-level structures are delimiter-based.
en.wikipedia.org/wiki/Bencode#
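
A minimal sketch of a bencode decoder shows the split: only byte strings carry a length, while integers, lists, and dictionaries run until a terminating `e`. This version is deliberately lenient (it does not reject leading zeros or unsorted dictionary keys), and `bdecode` is just a name picked for this example.

```python
def bdecode(data, pos=0):
    """Decode one bencoded value from bytes; return (value, next_offset)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        # Integer: i<digits>e, delimiter-based.
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead in (b"l", b"d"):
        # List or dictionary: items until the closing 'e', no length field.
        items = []
        pos += 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        pos += 1
        if lead == b"l":
            return items, pos
        keys, values = items[0::2], items[1::2]
        if len(keys) != len(values):
            raise ValueError("dictionary has a key without a value")
        if not all(isinstance(key, bytes) for key in keys):
            raise ValueError("dictionary keys must be byte strings")
        return dict(zip(keys, values)), pos
    if lead.isdigit():
        # Byte string: <length>:<bytes>, the only length-prefixed type.
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        end = start + length
        if end > len(data):
            raise ValueError("string length overruns the input")
        return data[start:end], end
    raise ValueError(f"unexpected byte at offset {pos}")
```

Because a string's length can only disagree with the total input size, the single bounds check in the string branch is the only place where a length field needs validating.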

@wolf480pl I mostly had in mind https://en.wikipedia.org/wiki/Canonical_S-expressions which only counts the length of atoms (I just realised that it was created by this person lol https://en.wikipedia.org/wiki/Ron_Rivest)

@Mikoto well yeah, if only atoms have length, there's no room for inconsistency. bencode does that, too.

And considering that csexpr were created to overcome problems with X.509, which has length fields on all levels, it all makes sense.

@wolf480pl ini wins again!
microsoft did everything right, how will the world ever recover?!
@wolf480pl note: my instance had some issues and my admin had to rollback the db so my posts in this thread are gone. If there is anyone who can't read my posts check out https://lets.bemoe.online/notice/9vt48zjenBP1aewNVo

@wolf480pl All data is stored in 10-data-bit units. Bytes are for amateurs. :)

@wolf480pl some syntax elements completely change the meaning of what came before them. mix prefix, infix, and postfix, as much as possible.

@wolf480pl and yeah, trying to mimic plain english is always a terrible idea, so it's perfect for this.
includes are also always fun. the less cleanly specified the better.

@wolf480pl also: start out as a superset (*) of a previous format and then grow more and more features on top of it.

*: bonus points if it turns out you missed an edge case right at the beginning so it was never a superset to begin with

@grainloom @wolf480pl Better: start as a subset of a previous format, and add equivalent features that look the same but work slightly differently.

@grainloom oh, like a special sequence that switches the parser to a different mode? Something like <script> or <![CDATA[ ?

@wolf480pl @grainloom Specify that numbers are "floating point with undefined precision" (JSON, I'm looking at you)
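
A small demonstration of what unspecified precision means in practice, using Python's standard `json` module. Python keeps integers exact by default; passing `parse_int=float` is used here only to imitate a parser that stores every number as a 64-bit double.

```python
import json

text = '{"id": 9007199254740993}'  # 2**53 + 1

# Python's default: arbitrary-precision int, value preserved exactly.
exact = json.loads(text)["id"]

# Imitating a parser that maps every number to a 64-bit double.
as_double = json.loads(text, parse_int=float)["id"]

# The two parsers now disagree about what the document says.
disagree = exact != as_double
```

Here `exact` is 9007199254740993 while `as_double` is 9007199254740992.0, so the same bytes yield two different IDs depending on which implementation read them.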

@loke @grainloom
oh...
yeah, length-prefixed binary floats with unspecified precision.
You have to guess where the exponent ends and mantissa starts.

@rick_777 @wolf480pl @grainloom And timestamps specified in local time (without DST specification of course).
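
To see why that hurts, take a local wall-clock time from the night when clocks go back. A sketch using only fixed offsets, where the date and the UTC+2 / UTC+1 offsets are assumptions matching Central European time on 2021-10-31:

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets: summer time (UTC+2) and standard time (UTC+1).
SUMMER = timezone(timedelta(hours=2))
WINTER = timezone(timedelta(hours=1))

# 02:30 local time happened twice that night, once before and once after
# the clocks were set back from 03:00 to 02:00.
stored = datetime(2021, 10, 31, 2, 30)  # what the file contains: no offset

first_reading = stored.replace(tzinfo=SUMMER).astimezone(timezone.utc)
second_reading = stored.replace(tzinfo=WINTER).astimezone(timezone.utc)

ambiguity = second_reading - first_reading
```

Both readings are valid interpretations of the stored value, and they are a full hour apart, so a reader has to guess which one the writer meant.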

@wolf480pl Can’t think of anything worse than C++, which is literally undecidable.

@Mikoto @normandy
ok, but how many times has that led to a CVE, or at least caused the compiler to segfault for reasons other than running out of RAM?

@wolf480pl there's a pulsar data file format whose header consists of keywords and values - the length of the values is implied by the keyword, and each version adds a few new potential keywords. So one new keyword in a file and old software can't read it. Not that there is a formal specification or list of keywords or anything, possibly not even well-defined endianness.

@wolf480pl The format is mostly text-based, but certain parts are binary. NUL bytes are a regular occurrence, but usually fairly late in the file and not near the beginning.

Ints use a variable-length encoding with a mandatory fractional part.

At least one popular implementation wildly fucks up the length prefixes in a way that mandates a certain clamping behavior in all other code that wants to interop.

@riking sounds oddly specific. Almost as if you were describing an existing format...

@wolf480pl

(It's worth noting that I actually did something like this, somewhere... and it worked like a charm)

@Shamar Microsoft did that too with Windows Registry.
