shuppy (@delan) — [archived] 2024-05-03T01:32:55.876Z

bark. bark bark (@bark) — [archived] 2022-12-10T06:01:28.529Z

a little journey

T+0 days A firmware image ✨

It looks like this (kinda - in reality it's one of many blocks of data already yoinked out of a bigger container format, but that's boring).

00000000: 485a 4c01 0000 0d00 a823 f243 80c4 3fd5  HZL......#.C..?.
00000010: d80c 0027 c080 460e 8844 a051 d00c 14a0  ...'..F..D.Q....
00000020: 408a 7454 c416 b703 83f2 c773 75bc dcf3  @.tT.......su...
00000030: a9de a958 2470 ab16 1397 daab cb6b b11e  ...X$p.......k..
00000040: fd13 0fb6 c171 66fa 7934 777b 4cc0 ff7e  .....qf.y4w{L..~
00000050: ff0c 9777 f5ca b37d 2398 ebff d731 56f4  ...w...}#....1V.
00000060: c3b0 d17a efaf 5188 ce46 62d9 3d3c 4a69  ...z..Q..Fb.=.Ji
00000070: cce4 ceb1 5d4a cec3 d1e4 d652 6f57 fcac  ....]J.....RoW..
          ...

Nothing immediately recognisable as data or a known file format. The entropy graph looks like this.

An entropy graph where the value hovers around 0.95 for the entire length of the file. There're some drops, but they're short in both "duration" and don't go lower than about 0.9.

So, probably not encrypted? Encryption usually gives a veeeery flat line at an entropy of 1. From experience, machine code likes to hover around 0.8ish, so this is probably compressed. Now what.

The 485a4c01 in the first bytes of the file maybe gives a hint. HZL\x01 or \x01LZH, and we know LZ is a common compression technique. LZH, turns out, is also a compression format! Let's throw it at a ton of LZ and LZH decompressors (mostly by grabbing horrible code off the internet and forcing the data in, since this is of course not a full archive file with headers and stuff) and see what happens.

T+7 daysThat didn't work.

Maybe we can try reverse engineering this compression format ourselves. After spending a week looking at LZ and LZH I've reinforced a decent knowledge base of compression techniques, so this shouldn't be too hard...

uhhh

Several text editor windows open on top of each other. They're alternately filled with binary and hex data, randomly spaced out and commented, but none of it is really consistent or makes any sense.

T+9^? days We're not there yet. We're throwing a lot of ideas at the wall, with very few ways to check if the results make sense. RE (like all programming) thrives on a short feedback loop; we should be able to make an assumption or prediction and test its validity very quickly. That's quite hard when you're just messing with binary data in a text editor and don't know what "good output" looks like. Let's ignore this and blindly continue (this is not advice).

T+14 days It would be really nice if we knew what this decompressed to, so we could compare the pair and work out which bits are data and which bits are compression information. But then we'd have the data, so there'd be no point reversing the compression format.

but we don't have any known plaintext! none at all!

Wrong! We do! It's a big piece of firmware! Firmware isn't complete without at least one AES S-box, some paths leaking source code file names, debug strings leaking customer lists, and a couple pre-computed CRC lookup tables ^{also there's a checksum in that data so it's a pretty good assumption}. Neither compresses well, so can we look for them as literal data in the compressed file data to give us a hint about the format? Sure!

We'll grab the first bits of a CRC32 lookup table for a super generic everyone-uses-it one with polynomial 0x04c11db7:

  0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, ...

Little endian-ify it:

  00 00 00 00 96 30 07 77 2c 61 0e ee ba 51 09 99 ...

Drop the 00s (they would probably compress well), and hunt for the remaining bytes in the data. Gotta be a little careful, since it's unlikely the entire table would appear in the data with no control bytes in between at random places (to, y'know, tell the compression format that we're about to give it a bunch of literal data). It probably won't be as easy as just Control-F for the bytes and boom.

So, unsurprisingly, the bytes themselves don't appear verbatim in the file. But being a compression format, full bytes are often a little wasteful, so it's useful to use individual bits as control data. A dodgy Python script which takes our original file and converts it to one looong binary string might be useful, then we can search for the CRC32 data as a binary string to see if it's there - just not byte aligned.

Still nothing. Maybe each byte is prefixed by a couple bits? Some garbage in between each byte? We could write something that takes our search string, splits it into bytes, and searches for them as binary appearing in the file with some amount of don't-care bits in between. We could even use regex to make this easy and horrid!

T+14 days+10 mins Oh. We did it.

file:  ...111111001011010011000010000011110111011110010110010110000110000111011110111011011101010101000110000100111001100110001...
crc32:         10010110 00110000 00000111 01110111 00101100 01100001 00001110 11101110 10111010 01010001 00001001 10011001

It's there, with a single bit between each literal byte. The bit before each literal is always a 1, which is pretty common in LZ formats, so if we hit something that starts with 0 it's probably a reference ("go back X bytes and copy Y bytes from there to here").

Now it's just the "simple" affair of working out the format of references, and we'll have our data in no time. And, we know what type of CRC32 the checksum in the file header probably is, so we can check our work :3

T+15 days The reference format is a bit yucky, and I think the H in LZH (like in the actual formats called LZH) is referring to the use of Huffman coding for the values. Thankfully it's a static code.

References start with a 0 bit like predicted, and then an offset and length (both Huffman encoded). The Huffman encoding is as follows:

Huffman-encoded Field	Value
`1xxx`	`xxx`
`01xxxxxx`	`xxxxxx + 8`
`001xxxxxxxxx`	`xxxxxxxxx + 8 + 64`
`0001xxxxxxxxxxxx`	`xxxxxxxxxxxx + 8 + 64 + 512`
`0000xxxxxxxxxxxxxxxx`	`xxxxxxxxxxxxxxxx + 8 + 64 + 512 + 4096`

Spot the pattern? The number of 0s (up to a maximum of 3) before a 1 in the prefix tells you the length of the value, and then the offset added to the value is always the smallest value you can't represent in the previous encoding. Cute!

Now just write some Python or Rust or something. Check the CRC32 matches - it does - and... the real work can begin (you now have 3.7 MiB of firmware to analyse).

today

oh that seemed pretty easy. why'd it take two weeks

Think of all the dead ends. So, so many dead ends.

Do you recognise this format? I don't! Am I dumb? I don't know!

shuppy (@delan) — [archived] 2024-05-03T01:32:55.876Z