The Page Cache
The OS lies to you. write() returns, but data is still in RAM. Learn about fsync() and durability.
User space → Kernel page cache (RAM, volatile: lost on crash) → Disk (persistent)
Sync strategy
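Without an explicit sync, write() only copies data into the page cache and returns. The OS flushes dirty pages on its own schedule (on Linux, roughly 30 seconds with default settings). Fastest, but unsafe.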
Key Insight
Durability requires explicit action. When write() returns, your data is in the kernel's page cache, which is RAM. A power cut, kernel panic, or hard reset wipes it, and the file on disk may not have changed at all.
fdatasync() (sync_data() in Rust) blocks until the file's dirty pages have been written to the storage device. For a database, this is the difference between "the transaction committed" and "we think it committed".
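As a minimal sketch of the write-then-sync pattern, the Rust snippet below appends a record and returns only after `sync_data()` has completed. The helper name `durable_append` and the temp-file path are hypothetical, chosen for illustration. The sketch also does not fsync the parent directory, which a newly created file needs before its directory entry is durable.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: appends `record` to the file at `path` and returns
/// only after the data has been flushed to the storage device.
fn durable_append(path: &Path, record: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // After write_all() returns, the bytes are only in the kernel page cache.
    file.write_all(record)?;

    // sync_data() blocks until the dirty pages for this file reach the device
    // (fdatasync() on Linux). Only now is it safe to acknowledge the write.
    file.sync_data()?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Illustrative path; any writable location works.
    let path = std::env::temp_dir().join("page_cache_demo.log");
    durable_append(&path, b"commit txn 1\n")?;
    println!("record synced to {}", path.display());
    Ok(())
}
```

Dropping the `sync_data()` call gives the fast but unsafe behavior: the function returns sooner, and a crash shortly afterward can lose the record.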
This is why a write-ahead log (WAL) calls fdatasync() before acknowledging a write, and why RocksDB lets you trade durability for performance with sync=false.
Now: since every write is slow, and a synced write far slower, you want to batch as many writes as possible into a single write() + fsync() pair so that the fixed cost of the flush is shared across the whole batch. That's not optional: it's dictated by the hardware.
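The batching idea can be sketched as follows: concatenate the pending records in user space, hand them to the kernel in one write, and pay for a single sync. The helper `append_batch` and the record contents are hypothetical. A real WAL would also add framing and checksums, which are omitted here.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

/// Hypothetical helper: appends every record in `records` using one buffered
/// write and one sync. Returns the number of bytes made durable.
fn append_batch(file: &mut File, records: &[&[u8]]) -> io::Result<usize> {
    // Concatenate in user space so the kernel receives one contiguous buffer.
    let mut buf = Vec::new();
    for record in records {
        buf.extend_from_slice(record);
    }

    // Typically a single write() syscall; write_all retries on short writes.
    file.write_all(&buf)?;

    // One sync covers the whole batch instead of one sync per record.
    file.sync_data()?;
    Ok(buf.len())
}

fn main() -> io::Result<()> {
    // Illustrative path; any writable location works.
    let path = std::env::temp_dir().join("page_cache_batch_demo.log");
    let mut file = OpenOptions::new().create(true).append(true).open(&path)?;

    let batch: [&[u8]; 3] = [b"put a=1\n", b"put b=2\n", b"del c\n"];
    let n = append_batch(&mut file, &batch)?;
    println!("{} bytes made durable with one sync", n);
    Ok(())
}
```

With this shape, none of the records in a batch should be acknowledged until `append_batch` returns, since they all become durable together.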