Recording

Commit volatile memory to persistent append-only log

0%

Invariants in LevelDB algorithm

Recently, I write a LevelDB implementation in Go. In this post, I summarize
some invariants in algorithm used by LevelDB implementations.

Sequence Number

Sequence number is a monotonically increasing 56 bits integer value. Every
time a key is written to LevelDB, it is tagged with a sequence number one
larger than sequence number tagged with previous key written to LevelDB.
If two entries in LevelDB have same user level keys, the one with larger
sequence must shadow the other.

Sorted Memory Tables

Writes are first recorded in a mutable memory table. If that memory table is
full, it is marked as immutable and a new memory table is created to record
writes. The memory table marked as immutable is then compacted to a sorted
disk table in level 0 and deleted. Thus we conclude that:

If a key appears in mutable memory table, it is newest. Otherwise, if it
appears in immutable memory table, it is newest.

Compaction

When an immutable memory table is compacted to a sorted table in Level 0, it
is assigned with a file number larger than all existing file numbers in this
Level. Thus we have:

Entries stored in newer file in Level 0 shadow entries with same user keys
in older files.

When there are too many files in Level 0 or too many data in Level 1 or above,
we start level compaction. In level N compaction, we select a file set from
level N, such that no remaining files overlap this file set in user key level.
Then we compact this file set with all overlapping files from level N+1, and
produce sorted tables in level N+1. Thus we conclude to two invariants.

No files in level 1 or above overlap other files in the same level.

Entries stored in level N shadow entries with same user keys above level N.

Snapshot

Snapshot in LevelDB is represented by a sequence number. When retrieving entries
from a snapshot, entries whose sequence number are larger than the snapshot’s
are ignored. In compaction, we record the smallest sequence number among alive
snapshots, and we never delete keys shadowed only by entry with sequence number
larger that this smallest sequence number. After enforcing these restrictions
on retrieve and compaction, we guarantee that:

The state a snapshot captured is frozen in its creation time.

Get

According to invariants we guarantee, we first search mutable memory table,
then immutable memory table, then sorted tables in level 0 from newest to
oldest, then sorted tables in level 1 and above. If we find a write for the
key, we can return the result to caller without further searching.

Iteration

Entries in memory tables and disk level tables may overlap each other. But
files in level 1 and above have no overlapping files in same level. So we
concatenate files in level N to form level iterator for level N. Then we
merge those iterators with iterators from memory tables, level 0 tables to
construct iterator over a whole LevelDB database. In iteration, we filter
out entries shadowed by larger sequence number.

Optimizations

Implementations can make their private optimizations without break those
invariants. One such optimization in offical C++ implementation is that
immutable memory table can be compacted to level 1 or above if it is not
overlapping with younger level. Aggressive optimization, for example,
concurrent compaction is also possible.