The Page Cache

The OS lies to you. write() returns, but data is still in RAM. Learn about fsync() and durability.

User space
  ↓ write()
Kernel page cache (RAM): volatile, lost on crash
  ↓ OS eventually flushes (dirty writeback)
Disk: persistent

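Sync strategy: durability versus performance

With no explicit sync, write() puts data in the page cache and returns. The OS writes it back eventually (on Linux with default settings, typically within about 30 seconds). This is the fastest option, but anything not yet written back is lost on a crash.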

Key Insight

Durability requires explicit action. When write() returns, your data is in the kernel's page cache, which is RAM. A power cut, kernel panic, or hard reset wipes it, and the file on disk may not have changed at all.

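fdatasync() (sync_data() in Rust) forces the file's dirty pages to the storage device and returns only once the device reports the write as complete. fsync() does the same and also flushes file metadata such as timestamps. For a database, this is the difference between "the transaction committed" and "we think it committed".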
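As a rough illustration, here is a minimal Python sketch of a write that returns only after the data has been pushed to the device. The function name durable_write is hypothetical, and the fallback to os.fsync() on platforms without os.fdatasync() is an assumption made for portability.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path; return only after it was flushed to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # write() may accept fewer bytes than requested, so loop until done.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # At this point the bytes are only in the page cache (RAM).
        # Prefer fdatasync() where the platform offers it, else use fsync().
        sync = getattr(os, "fdatasync", os.fsync)
        sync(fd)
        # Only now has the kernel been asked to push the pages to storage.
    finally:
        os.close(fd)
```

For a newly created file, full durability usually also requires syncing the parent directory so that the directory entry itself survives a crash. The sketch leaves that out to stay short.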

This is why a write-ahead log (WAL) calls fdatasync() before acknowledging a write, and why RocksDB lets you trade durability for performance with sync=false.
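The sync-before-acknowledge rule can be sketched as a toy append-only log. TinyWal, its 4-byte length-prefix record format, and its sync parameter are all hypothetical names invented for this illustration; the parameter only mimics the idea behind RocksDB's option and is not RocksDB's API.

```python
import os


class TinyWal:
    """Toy write-ahead log: append() returning is the acknowledgement."""

    def __init__(self, path: str, sync: bool = True) -> None:
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.sync = sync

    def append(self, record: bytes) -> None:
        # Frame the record with a 4-byte little-endian length prefix.
        frame = memoryview(len(record).to_bytes(4, "little") + record)
        while frame:
            frame = frame[os.write(self.fd, frame):]
        if self.sync:
            # Durable mode: do not acknowledge until the device has the data.
            getattr(os, "fdatasync", os.fsync)(self.fd)
        # With sync=False we return right away; the record sits in the page
        # cache and can be lost if the machine crashes before writeback.

    def close(self) -> None:
        os.close(self.fd)
```

With sync=True a caller that sees append() return can rely on the record surviving a crash, subject to the device honoring the flush. With sync=False the call is much cheaper, but the acknowledgement only means "the kernel has it".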

Every write is slow, especially with sync, because each fsync() has to wait for the storage device. So you want to batch as many writes as possible into a single write() + fsync() pair. That's not optional; it's dictated by the hardware.
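One way to picture this batching is the sketch below, which queues records in memory and makes them durable together. BatchingLog, submit(), commit(), and the newline-delimited record format are hypothetical choices for the example, not a description of any particular database.

```python
import os


class BatchingLog:
    """Toy log that makes many records durable with one write() + fsync()."""

    def __init__(self, path: str) -> None:
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []      # records accepted but not yet durable
        self.sync_calls = 0    # how many times we paid for a device flush

    def submit(self, record: bytes) -> None:
        # Cheap: only queues the record in user-space memory.
        self.pending.append(record)

    def commit(self) -> int:
        """Flush all pending records; return how many became durable."""
        if not self.pending:
            return 0
        # Concatenate the whole batch so it goes out as one buffer.
        buf = memoryview(b"".join(r + b"\n" for r in self.pending))
        while buf:
            buf = buf[os.write(self.fd, buf):]
        os.fsync(self.fd)      # a single flush covers the whole batch
        self.sync_calls += 1
        count = len(self.pending)
        self.pending.clear()
        return count

    def close(self) -> None:
        os.close(self.fd)
```

Callers would be acknowledged only after commit() returns, so the cost of one device flush is shared by every record in the batch. The trade-off is a little extra latency for the first record while the batch fills.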