Invariants in LevelDB algorithm

Recently, I write a LevelDB implementation in Go. In this post, I summarize some invariants in algorithm used by LevelDB implementations.

Sequence Number

Sequence number is a monotonically increasing 56 bits integer value. Every time a key is written to LevelDB, it is tagged with a sequence number one larger than sequence number tagged with previous key written to LevelDB. If two entries in LevelDB have same user level keys, the one with larger sequence must shadow the other.

Sorted Memory Tables

Writes are first recorded in a mutable memory table. If that memory table is full, it is marked as immutable and a new memory table is created to record writes. The memory table marked as immutable is then compacted to a sorted disk table in level 0 and deleted. Thus we conclude that:

If a key appears in mutable memory table, it is newest. Otherwise, if it appears in immutable memory table, it is newest.

Compaction

When an immutable memory table is compacted to a sorted table in Level 0, it is assigned with a file number larger than all existing file numbers in this Level. Thus we have:

Entries stored in newer file in Level 0 shadow entries with same user keys in older files.

When there are too many files in Level 0 or too many data in Level 1 or above, we start level compaction. In level N compaction, we select a file set from level N, such that no remaining files overlap this file set in user key level. Then we compact this file set with all overlapping files from level N+1, and produce sorted tables in level N+1. Thus we conclude to two invariants.

No files in level 1 or above overlap other files in the same level.

Entries stored in level N shadow entries with same user keys above level N.

Snapshot

Snapshot in LevelDB is represented by a sequence number. When retrieving entries from a snapshot, entries whose sequence number are larger than the snapshot’s are ignored. In compaction, we record the smallest sequence number among alive snapshots, and we never delete keys shadowed only by entry with sequence number larger that this smallest sequence number. After enforcing these restrictions on retrieve and compaction, we guarantee that:

The state a snapshot captured is frozen in its creation time.

Get

According to invariants we guarantee, we first search mutable memory table, then immutable memory table, then sorted tables in level 0 from newest to oldest, then sorted tables in level 1 and above. If we find a write for the key, we can return the result to caller without further searching.

Iteration

Entries in memory tables and disk level tables may overlap each other. But files in level 1 and above have no overlapping files in same level. So we concatenate files in level N to form level iterator for level N. Then we merge those iterators with iterators from memory tables, level 0 tables to construct iterator over a whole LevelDB database. In iteration, we filter out entries shadowed by larger sequence number.

Optimizations

Implementations can make their private optimizations without break those invariants. One such optimization in offical C++ implementation is that immutable memory table can be compacted to level 1 or above if it is not overlapping with younger level. Aggressive optimization, for example, concurrent compaction is also possible.