Numbers Have Secrets

The number 300 can be encoded as [0x2C, 0x01] (u16 little-endian), [0x01, 0x2C] (u16 big-endian), or [0xAC, 0x02] (varint). Endianness, alignment, and varints.

300 = 0x12C

00000001 00101100

All representations of 300
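As a quick check of the three byte sequences above, here is a minimal sketch. The function names are made up for illustration; only `to_le_bytes` and `to_be_bytes` are standard-library calls. The varint bytes are computed by hand for this one value rather than with a general encoder.

```rust
/// 300 as a little-endian u16: low byte first.
pub fn le_300() -> [u8; 2] {
    300u16.to_le_bytes()
}

/// 300 as a big-endian u16: high byte first.
pub fn be_300() -> [u8; 2] {
    300u16.to_be_bytes()
}

/// 300 as a two-byte varint, worked out by hand for this value only.
pub fn varint_300() -> [u8; 2] {
    let n: u16 = 300;
    let first = (n & 0x7F) as u8 | 0x80; // low 7 bits, continuation bit set
    let second = (n >> 7) as u8; // remaining bits, continuation bit clear
    [first, second]
}
```

The three functions return [0x2C, 0x01], [0x01, 0x2C], and [0xAC, 0x02] respectively.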

The ambiguity problem

The byte sequence 2c 01 has multiple valid interpretations:

  • → as u16 little-endian: 300
  • → as u16 big-endian: 11265
  • → as varint: 44 (0x2c has its high bit clear, so the varint ends after one byte and 0x01 is left over)

The schema must specify: what type? what endianness? Otherwise the bytes are meaningless.
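The three readings listed above can be reproduced in a few lines. This is a sketch: `interpret` is a hypothetical helper, and the varint case only handles the single-byte situation (first byte with its high bit clear), which is all this example needs.

```rust
/// Read the same two bytes as (u16 little-endian, u16 big-endian, varint).
/// The varint slot is `Some` only when the first byte already terminates
/// the varint; longer varints are out of scope for this sketch.
pub fn interpret(bytes: [u8; 2]) -> (u16, u16, Option<u8>) {
    let le = u16::from_le_bytes(bytes);
    let be = u16::from_be_bytes(bytes);
    let varint = if bytes[0] & 0x80 == 0 {
        Some(bytes[0]) // high bit clear: value is the byte itself
    } else {
        None // continuation bit set: more bytes belong to this varint
    };
    (le, be, varint)
}
```

`interpret([0x2C, 0x01])` yields `(300, 11265, Some(44))`. Nothing in the bytes themselves says which of the three is intended.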


Varint (LEB128) algorithm

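// Encode n as a varint, taking 7 bits at a time (low bits first).
fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let mut b = (n & 0x7F) as u8; // `mut` because the high bit may be set below
        n >>= 7;
        if n != 0 { b |= 0x80; }      // continuation bit: more bytes follow
        out.push(b);
        if n == 0 { break; }
    }
}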

The high bit of each byte signals that more bytes follow. The low 7 bits carry the value, least-significant group first.
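The encoding loop has a natural inverse. The following is a minimal decoder sketch, assuming an unsigned 64-bit target. The name `decode_varint` and the choice to return the number of bytes consumed are illustrative, not taken from any particular library.

```rust
/// Decode one unsigned LEB128 varint from the front of `input`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input ends mid-varint or the value does not fit in a u64.
pub fn decode_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &b) in input.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None; // more than 10 bytes cannot fit in a u64
        }
        let group = u64::from(b & 0x7F);
        if shift == 63 && group > 1 {
            return None; // the 10th byte may contribute only one bit
        }
        value |= group << shift; // low 7 bits, least-significant group first
        if b & 0x80 == 0 {
            return Some((value, i + 1)); // high bit clear: last byte
        }
    }
    None // ran out of input while the continuation bit was still set
}
```

`decode_varint(&[0xAC, 0x02])` gives `Some((300, 2))`, while `decode_varint(&[0x2C, 0x01])` gives `Some((44, 1))` and leaves the second byte unread.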

Key Insight

Every encoding of a number is a tradeoff: space vs speed vs range vs complexity.

Fixed-width (u32, u64)

Constant decode time. Random-access. Wastes space for small numbers.
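Random access follows from the fixed width: value number i starts at byte offset i × 4 for a u32. A small sketch, assuming a buffer of packed little-endian u32 values (`read_u32_at` is a hypothetical helper):

```rust
/// Read the `index`-th little-endian u32 from a packed buffer without
/// touching any earlier element. Returns `None` if it is out of bounds.
pub fn read_u32_at(buf: &[u8], index: usize) -> Option<u32> {
    let start = index.checked_mul(4)?;
    let end = start.checked_add(4)?;
    let chunk = buf.get(start..end)?;
    let mut raw = [0u8; 4];
    raw.copy_from_slice(chunk); // chunk is exactly 4 bytes long here
    Some(u32::from_le_bytes(raw))
}
```

The cost is visible in the layout: the value 1 still occupies four bytes.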

Varint (LEB128)

1–5 bytes for a u32 (up to 10 for a u64). Dense for small values. Sequential decode only: reaching the n-th value means decoding every value before it. Used by RocksDB and protobuf.
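The size of a varint depends only on how many significant bits the value has: one byte per 7 bits, with a minimum of one byte. A sketch of that arithmetic (`varint_len` is an illustrative name):

```rust
/// Number of bytes an unsigned LEB128 varint needs for `n`.
pub fn varint_len(n: u64) -> usize {
    let bits = 64 - n.leading_zeros(); // significant bits; 0 when n == 0
    let groups = (bits + 6) / 7; // round up to whole 7-bit groups
    groups.max(1) as usize // zero still takes one byte
}
```

Values up to 127 take one byte, 300 takes two, the largest u32 takes five, and the largest u64 takes ten.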

Little-endian

LSB first. Native byte order on x86 and, in practice, ARM. A narrower read at the same address yields the low bytes of the value.
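One way to read the point about partial reads: if a u32 is stored little-endian, its first two bytes are already the low 16 bits of the value, so a narrower read at the same offset gives the truncated number. With big-endian storage the first two bytes are the high half instead. A sketch with a hypothetical helper:

```rust
/// Serialize `n` as a u32 in the chosen byte order, then read only the
/// first two bytes back as a u16 in that same byte order.
pub fn first_two_bytes_as_u16(n: u32, little_endian: bool) -> u16 {
    if little_endian {
        let b = n.to_le_bytes();
        u16::from_le_bytes([b[0], b[1]]) // low half of n
    } else {
        let b = n.to_be_bytes();
        u16::from_be_bytes([b[0], b[1]]) // high half of n
    }
}
```

For 300 the little-endian prefix reads as 300, while the big-endian prefix reads as 0.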

Big-endian

MSB first. Network byte order. Byte-wise (lexicographic) comparison of unsigned fixed-width values matches numeric order.
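The sorting property is easy to demonstrate: encode unsigned values as fixed-width keys, sort the keys as plain byte arrays, and decode. The helper names below are illustrative. The sketch assumes unsigned u32 values; signed integers would need extra handling that is not shown here.

```rust
/// Sort values by comparing their big-endian byte encodings.
pub fn sort_via_be_keys(values: &[u32]) -> Vec<u32> {
    let mut keys: Vec<[u8; 4]> = values.iter().map(|v| v.to_be_bytes()).collect();
    keys.sort(); // byte arrays compare lexicographically
    keys.into_iter().map(u32::from_be_bytes).collect()
}

/// Same procedure with little-endian keys, for contrast.
pub fn sort_via_le_keys(values: &[u32]) -> Vec<u32> {
    let mut keys: Vec<[u8; 4]> = values.iter().map(|v| v.to_le_bytes()).collect();
    keys.sort();
    keys.into_iter().map(u32::from_le_bytes).collect()
}
```

For the input [300, 2, 256, 1], big-endian keys come back as [1, 2, 256, 300]. Little-endian keys come back as [256, 1, 2, 300], because the comparison starts with the least significant byte.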

With framing (Chapter 2) and number encoding (this chapter), you have all the primitives. Next: how do you combine them into a codec — an encoder and decoder for a complete data structure?