Binary Encoding Deep Dive

From key-value pairs to bytes on disk

Side Quest (~20 min)

When you write db.put("apple", "red fruit"), how does that become bytes on disk? The OS doesn't understand "keys" and "values"; it only knows read() and write() of raw bytes.

The Fundamental Problem

If you write two strings back-to-back, how do you know where one ends and the next begins?

// BAD: No boundaries!
write("apple");
write("banana");
// On disk: "applebanana" - where does apple end?

We need a framing protocol - a way to encode boundaries into the byte stream.

Common Solutions

1. Length Prefixing

Write the length before each piece of data. The reader knows exactly how many bytes to read.

[5]apple[6]banana
Read length (5) → Read 5 bytes → Read length (6) → Read 6 bytes
Used by: Protocol Buffers, RocksDB, and many other binary formats
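
A minimal sketch of length prefixing, assuming a fixed 4-byte big-endian length field to keep things simple. The function names frame and unframe are made up for this illustration and do not come from any particular library.

```python
import struct

def frame(chunks):
    """Prefix each chunk with its length as a 4-byte big-endian integer."""
    out = bytearray()
    for chunk in chunks:
        out += struct.pack(">I", len(chunk))  # length first
        out += chunk                          # then the payload
    return bytes(out)

def unframe(data):
    """Split a framed byte string back into the original chunks."""
    chunks = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        chunks.append(data[pos:pos + length])
        pos += length
    return chunks

framed = frame([b"apple", b"banana"])
print(framed)
print(unframe(framed))
```

The reader never has to inspect the payload bytes to find a boundary, so the payload can contain any byte value.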

2. Delimiters

Use a special byte (like \0 or \n) to mark boundaries.

apple\0banana\0
Read until \0 → That's one value → Repeat
Problem: What if the data contains the delimiter? Needs escaping.
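
To make the escaping problem concrete, here is a small sketch that uses \0 as the delimiter and a backslash as the escape byte. The escape scheme (backslash-backslash for a literal backslash, backslash-"0" for a literal zero byte) is an assumption chosen for this example, not a standard.

```python
DELIM = 0x00  # marks the end of each value
ESC = 0x5C    # backslash, used to escape DELIM and itself

def encode_values(values):
    """Write each value followed by DELIM, escaping DELIM and ESC in the data."""
    out = bytearray()
    for value in values:
        for b in value:
            if b == ESC:
                out += b"\\\\"   # literal backslash -> two backslashes
            elif b == DELIM:
                out += b"\\0"    # literal zero byte -> backslash + "0"
            else:
                out.append(b)
        out.append(DELIM)
    return bytes(out)

def decode_values(data):
    """Read until DELIM, undoing the escapes along the way."""
    values = []
    current = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            if i + 1 >= len(data):
                raise ValueError("dangling escape byte")
            nxt = data[i + 1]
            if nxt == ESC:
                current.append(ESC)
            elif nxt == ord("0"):
                current.append(DELIM)
            else:
                raise ValueError("unknown escape sequence")
            i += 2
        elif b == DELIM:
            values.append(bytes(current))
            current = bytearray()
            i += 1
        else:
            current.append(b)
            i += 1
    if current:
        raise ValueError("missing final delimiter")
    return values

encoded = encode_values([b"apple", b"banana"])
tricky = encode_values([b"a\x00b"])
```

Note that the decoder has to look at every byte, whereas a length-prefixed reader can skip straight to the next boundary.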

3. Fixed Size

Every field is exactly N bytes. Pad shorter values, truncate longer ones.

apple___(8 bytes)banana__(8 bytes)
Problem: Wastes space on small values, can't handle large values.
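
A short sketch of fixed-size fields, assuming 8-byte fields padded with "_" as in the example above. The names to_fixed and read_fixed are hypothetical.

```python
FIELD_SIZE = 8
PAD = b"_"

def to_fixed(value, size=FIELD_SIZE):
    """Pad short values with PAD and truncate long ones to exactly `size` bytes."""
    return value[:size].ljust(size, PAD)

def read_fixed(data, size=FIELD_SIZE):
    """Slice the data into `size`-byte fields and strip the trailing padding."""
    return [data[i:i + size].rstrip(PAD) for i in range(0, len(data), size)]

record = to_fixed(b"apple") + to_fixed(b"banana")
print(record)              # two 8-byte fields
print(read_fixed(record))
print(to_fixed(b"pomegranate"))  # longer than 8 bytes, so the tail is lost
```

Both problems show up here: "apple" costs 8 bytes instead of 5, and "pomegranate" loses its last three bytes. A value that genuinely ends in "_" would also lose it when the padding is stripped.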

VarInt: Variable-Length Integers

How do you encode the length itself? If lengths can be 1 to 1,000,000+, using a fixed 4 bytes wastes space for small values.

The VarInt Trick

Use the low 7 bits of each byte for data and the high bit to signal "more bytes follow". The 7-bit groups are written least-significant group first.

Value   Binary               Bytes
1       00000001             1 byte
127     01111111             1 byte
128     10000000 00000001    2 bytes
300     10101100 00000010    2 bytes

The first bit of each byte is the continuation flag: 0 = last byte, 1 = more bytes follow.
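
The encoding can be sketched in a few lines. This is an illustrative implementation of the scheme described above (7 data bits per byte, least-significant group first), with function names chosen for this example.

```python
def encode_varint(n):
    """Encode a non-negative integer as a VarInt."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F        # take the lowest 7 bits
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(low7)         # last byte: high bit stays 0
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode a VarInt starting at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

for value in (1, 127, 128, 300):
    encoded = encode_varint(value)
    print(value, " ".join(f"{b:08b}" for b in encoded))
```

The loop prints the same bit patterns as the table rows, and decode_varint returns the next read position so a parser can continue with whatever follows the length.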

See It In Action

Logical Structure

Field          Size      Contents
Key Length     1 byte    VarInt: 5
Key            5 bytes   UTF-8: "apple"
Value Length   1 byte    VarInt: 9
Value          9 bytes   UTF-8: "red fruit"
Total: 16 bytes

Binary Output

Key Length:    05
Key:           61 70 70 6C 65
Value Length:  09
Value:         72 65 64 20 66 72 75 69 74
Raw bytes:     05 61 70 70 6C 65 09 72 65 64 20 66 72 75 69 74
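
The same bytes can be reproduced with a short script. This is a sketch of the layout shown above (VarInt key length, key, VarInt value length, value); encode_entry and decode_entry are hypothetical helper names, not part of any database API.

```python
def encode_varint(n):
    """Encode a non-negative integer as a VarInt (7 data bits per byte)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def decode_varint(data, pos):
    """Decode a VarInt at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

def encode_entry(key, value):
    """Lay out one entry as: key length, key bytes, value length, value bytes."""
    k = key.encode("utf-8")
    v = value.encode("utf-8")
    return encode_varint(len(k)) + k + encode_varint(len(v)) + v

def decode_entry(data):
    """Read one entry back by following the two length prefixes."""
    key_len, pos = decode_varint(data, 0)
    key = data[pos:pos + key_len].decode("utf-8")
    pos += key_len
    value_len, pos = decode_varint(data, pos)
    value = data[pos:pos + value_len].decode("utf-8")
    return key, value

entry = encode_entry("apple", "red fruit")
hex_dump = entry.hex(" ").upper()  # bytes.hex(sep) needs Python 3.8+
print(hex_dump)
print(decode_entry(entry))
```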

How RocksDB Does It

RocksDB uses length-prefixed encoding with VarInts. Each key-value entry in a data block:

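shared_bytes | unshared_bytes | value_length | key_delta | value
The three length fields are VarInt encoded. Keys use prefix compression: shared_bytes is the number of leading bytes the key shares with the previous key, and key_delta holds the remaining unshared_bytes bytes.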
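
Here is a simplified sketch of that entry layout. It follows the five fields listed above but is not RocksDB's actual code: it leaves out details such as restart points, and the names encode_block and decode_block are invented for this example.

```python
def encode_varint(n):
    """Encode a non-negative integer as a VarInt (7 data bits per byte)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

def decode_varint(data, pos):
    """Decode a VarInt at `pos`; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

def shared_prefix_len(a, b):
    """Count the leading bytes that a and b have in common."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def encode_block(entries):
    """Encode sorted (key, value) pairs, storing only each key's new suffix."""
    out = bytearray()
    prev_key = b""
    for key, value in entries:
        shared = shared_prefix_len(prev_key, key)
        key_delta = key[shared:]
        out += encode_varint(shared)          # shared_bytes
        out += encode_varint(len(key_delta))  # unshared_bytes
        out += encode_varint(len(value))      # value_length
        out += key_delta + value
        prev_key = key
    return bytes(out)

def decode_block(data):
    """Rebuild full keys by reusing the shared prefix of the previous key."""
    entries, prev_key, pos = [], b"", 0
    while pos < len(data):
        shared, pos = decode_varint(data, pos)
        unshared, pos = decode_varint(data, pos)
        value_len, pos = decode_varint(data, pos)
        key = prev_key[:shared] + data[pos:pos + unshared]
        pos += unshared
        value = data[pos:pos + value_len]
        pos += value_len
        entries.append((key, value))
        prev_key = key
    return entries

pairs = [(b"apple", b"red"), (b"apricot", b"orange")]
block = encode_block(pairs)
print(block)
print(decode_block(block))
```

Because "apricot" shares the prefix "ap" with "apple", the second entry stores only "ricot". The savings grow when sorted keys share long prefixes.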

Key Takeaways

  • Length-prefixing is the standard way to frame variable-length data
  • VarInt saves space by using 1 byte for small numbers, more for larger ones
  • Binary format design affects both space efficiency and read/write speed
  • Understanding binary encoding helps you debug corrupt data and optimize storage