Commit volatile memory to persistent append-only log


Understanding Spanner

It has been decade since Spanner published, and several Spanner inspired open source projects has been approached to prevailing, says CockroachDB and YugabyteDB in lack of TrueTime. It should be much easier to understand Spanner with these pioneer open source projects today.

What are tablet, paxos group, directory and fragment ?

A directory is the unit of data placement. All data in a directory has the same replication configuration.

a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this decision so that it would be possible to colocate multiple directories that are frequently accessed together.

A directory is also the smallest unit whose geographic-replication properties (or placement, for short) can be specified by an application. The design of our placement-specification language separates responsibilities for managing replication configurations. Administrators control two dimensions: the number and types of replicas, and the geographic placement of those replicas. They create a menu of named options in these two dimensions (e.g., North America, replicated 5 ways with 1 witness). An application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options. For example, an application might store each end-user’s data in its own directory, which would enable user A’s data to have three replicas in Europe, and user B’s data to have five replicas in North America.

For expository clarity we have over-simplified. In fact, Spanner will shard a directory into multiple fragments if it grows too large. Fragments may be served from different Paxos groups (and therefore different servers). Movedir actually moves fragments, and not whole directories, between groups.

I map these to CockroachDB’s concepts.

Spanner CockroachDB What
Tablet Store Colocation container for different fragments/ranges
Paxos group Multi raft group Tablet’s consensus group
Fragment Range Replica Continuous key space
Directory Table, database and other Replication zones Logical container of fragments/ranges for data placement specification

Why long-lived leader ?

Spanner’s Paxos implementation uses timed leases to make leadership long-lived (10 seconds by default). A potential leader sends requests for timed lease votes; upon receiving a quorum of lease votes the leader knows it has a lease. A replica extends its lease vote implicitly on a successful write, and the leader requests lease-vote extensions if they are near expiration. Define a leader’s lease interval as starting when it discovers it has a quorum of lease votes, and as ending when it no longer has a quorum of lease votes (because some have expired). Spanner depends on the following disjointness invariant: for each Paxos group, each Paxos leader’s lease interval is disjoint from every other leader’s.

Writes and their commit timestamps in Spanner and alikes are persistent through consensus group. When leader receives a no stale read only request, it has to choice a read timestamp treadt_{read} which must satisfy tread>=tleadermax_committed=tgroupmax_committedt_{read} >= t_{leader}^{max\_committed} = t_{group}^{max\_committed} to keep result set not stale.

Ideally, it should be able to satisfy this requirement without resort to consensus group as it is the leader. However, in present of leader change, there could be unobserved committed writes after tleadermax_committedt_{leader}^{max\_committed} , eg. tgroupmax_committed>tleadermax_committedt_{group}^{max\_committed} > t_{leader}^{max\_committed} , this breaks no stale requirement. Thus, the leader has to go through consensus group to ensure that no leadership change between treadt_{read} and tleadermax_committedt_{leader}^{max\_committed} . This is likely unacceptable in most situations especially OLTP where read latency is crucial.

With leader lease, the leader could freely pick a treadt_{read} in interval [tleadermax_committed,tleaderlease_expiration][t_{leader}^{max\_committed}, t_{leader}^{lease\_expiration}] without touching consensus as new leader will not commit a write prior to old leader’s lease expiration.

How Spanner satisfy serializability ?

Transactional reads and writes use two-phase locking. As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released. For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write that represents the transaction commit

Reads within read-write transactions use wound-wait to avoid deadlocks. The client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. While a client transaction remains open, it sends keepalive messages to prevent participant leaders from timing out its transaction. When a client has completed all reads and buffered all writes, it begins two-phase commit.

It is a lock-based implementation of write snapshot isolation from A Critique of Snapshot Isolation’s perspective.

CockroachDB implements write snapshot isolation also, but without explicit locking.

  • CockroachDB starts with txread==txcommittx_{read} == tx_{commit} .
  • CockroachDB stores txreadtx_{read} in timestamp cache and prevents writes beneath it. We could treat timestamp cache as optimistic implicit lock facility.
  • txcommittx_{commit} could be pushed to newer timestamp due to more recent reads in timestamp cache and writes in mvcc storage.
  • CockroachDB refreshes and bumps txreadtx_{read} to txcommittx_{commit} in case of no writes in between.
  • Ensure that txcommittx_{commit} resides beneath tleaderlease_expirationt_{leader}^{lease\_expiration} .

Both Spanner and CockroachDB try to align txreadtx_{read} to txcommittx_{commit} by preventing writes to read set to achieve serializability.

Does Spanner allow “read your writes” inside transaction ?

Like Bigtable, writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction’s writes. This design works well in Spanner because a read returns the timestamps of any data read, and uncommitted writes have not yet been assigned timestamps.

No. I think it is a must to write uncommitted writes to server in order to achieve “read your writes” in no trivial read. This enables complex low latency access patterns in server side. Both CockroachDB and YugabyteDB submit writes to server to “read own writes”.

How Spanner satisfy linearizability ?

Spanner enforces the following external-consistency invariant: if the start of a transaction T2T_2 occurs after the commit of a transaction T1T_1, then the commit timestamp of T2T_2 must be greater than the commit timestamp of T1T_1. Define the start and commit events for a transaction TiT_i by eistarte^{start}_i and eicommite^{commit}_i; and the commit timestamp of a transaction TiT_i by sis_i. The invariant becomes tabs(e1commit)<tabs(e2start)s1<s2tabs(e^{commit}_1) < tabs(e^{start}_2) ⇒ s1 < s2. The protocol for executing transactions and assigning timestamps obeys two rules, which together guarantee this invariant, as shown below. Define the arrival event of the commit request at the coordinator leader for a write TiT_i to be eiservere^{server}_i.

Start: The coordinator leader for a write TiT_i assigns a commit timestamp sis_i no less than the value of, computed after eiservere^{server}_i.

Commit Wait: The coordinator leader ensures that clients cannot see any data committed by TiT_i until TT.after(si)TT.after(s_i) is true. Commit wait ensures that sis_i is less than the absolute commit time of TiT_i, or si<tabs(eicommit)s_i < t_{abs}(e_i^{commit}) .

It is commit-wait. Before declaring a transaction TiT_i with commit timestamp sis_i visible, it waits until TT.after(si)TT.after(s_i), aka. si< < This means that tabs(enow)>sit_{abs}(e_{now}) > s_i on all Spanner servers. Hence no transaction will commit before sis_i after observing TiT_i. This prevents “causal reverse anomaly” in CockroachDB’s term.

How transaction commit works ?

If a transaction involves only one Paxos group (as is the case for most transactions), it can bypass the transaction manager, since the lock table and Paxos together provide transactionality.

YugabyteDB supports Single-row ACID transactions ;

If a transaction involves more than one Paxos group, those groups’ leaders coordinate to perform two-phase commit. One of the participant groups is chosen as the coordinator: the participant leader of that group will be referred to as the coordinator leader, and the slaves of that group as coordinator slaves. The state of each transaction manager is stored in the underlying Paxos group (and therefore is replicated).

The choice of coordiantor group could be pretty simple. In CockroachDB, it is the group that owns transaction record and first write key. In YugabyteDB, it is the txn status tablet.

When a client has completed all reads and buffered all writes, it begins two-phase commit. The client chooses a coordinator group and sends a commit message to each participant’s leader with the identity of the coordinator and any buffered writes.

The buffered writes could be divided acoording to Paxos groups’ key spaces.

A non-coordinator-participant leader first acquires write locks. It then chooses a prepare timestamp that must be larger than any timestamps it has assigned to previous transactions (to preserve monotonicity), and logs a prepare record through Paxos. Each participant then notifies the coordinator of its prepare timestamp.

Now, these participant groups could not serve no stale reads with timestamp tread>=si,gpreparet_{read} >= s_{i,g}^{prepare} as it is unknown whether that transaction will be committed or aborted. It must block before TiT_i resolved.

The coordinator leader also first acquires write locks, but skips the prepare phase. It chooses a timestamp for the entire transaction after hearing from all other participant leaders. The commit timestamp ss must be greater or equal to all prepare timestamps, greater than at the time the coordinator received its commit message, and greater than any timestamps the leader has assigned to previous transactions (again, to preserve monotonicity). The coordinator leader then logs a commit record through Paxos (or an abort if it timed out while waiting on the other participants).

The transaction is irrevocably decided after commit/abort record accepted by consensus group, all left are resolving pending writes. CockroachDB and YugabyteDB use similar approach to persist transaction status.

Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until TT.after(s), so as to obey the commit-wait rule. After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.

Notice that commit-wait block all clients but not only committing client from observing transaction committed result. All participants are suspicious and prevented from “causal reverse anomaly”.

I think we could optimize commit-wait by pipelining commit timestamp to all other participant leaders and let all participanting Paxos group leaders commit-wait parallelly to reduce no stale read latency.

Why lock writes in write snapshot isolation ?

According to A Critique of Snapshot Isolation, in write snapshot isolation write-write conflict avoidance is not necessary for serializability. So why both CockroachDB and Spanner lock writes in addition to write snapshot isolation ?

I see difficulties in absent of write-write conflict. When there are multiple writes on single keys, the pending writes resolution could be complicated due to unaligned concurrent transaction commits. In Spanner, participant leader has to wait all commit timestamps from pending transactions to resolve pending writes as it does not know concrete commit timestamp in prepare phase. In CockroachDB, range leaseholder has to persist pending writes in ascending order of commit timestamps.

There are certainly other reasons(eg. conflict rate, other engineering difficulties) aside from what I see.

How Spanner serve read only transactions ?

Every replica tracks a value called safe time tsafet_{safe} which is the maximum timestamp at which a replica is up-to-date. A replica can satisfy a read at a timestamp tt if t<=tsafet <= t_{safe}.

This means that there are no unobserved writes with commit timestamp t<=tsafet <= t_{safe}. This guarantee a consistent snapshot. It is closed timestamp in CockroachDB, and safe timestamp in YugabyteDB.

A read-only transaction executes in two phases: assign a timestamp sreads_{read}, and then execute the transaction’s reads as snapshot reads at sreads_{read}. The snapshot reads can execute at any replicas that are sufficiently up-to-date.

The simple assignment of{read} =, at any time after a transaction starts, preserves external consistency. However, such a timestamp may require the execution of the data reads at sreads_{read} to block if tsafet_{safe} has not advanced sufficiently. To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency.

Spanner requires a scope expression for every read-only transaction, which is an expression that summarizes the keys that will be read by the entire transaction. Spanner automatically infers the scope for standalone queries. If the scope’s values are served by a single Paxos group, then the client issues the read-only transaction to that group’s leader. If the scope’s values are served by multiple Paxos groups, Spanner reads at{read} =

Define tsafe=min(tsafePaxos,tsafeTM)t_{safe} = min(t^{Paxos}_{safe}, t^{TM}_{safe}), where each Paxos state machine has a safe time tsafePaxost^{Paxos}_{safe} and each transaction manager has a safe time tsafeTMt^{TM}_{safe}.

tsafePaxost^{Paxos}_{safe} is simpler: it is the timestamp of the highest-applied Paxos write. Because timestamps increase monotonically and writes are applied in order, writes will no longer occur at or below tsafePaxost^{Paxos}_{safe} with respect to Paxos.

Paxos leader should be able to bump tsafePaxost^{Paxos}_{safe} to other timestamp within its lease to serve sread>tsafePaxoss_{read} > t^{Paxos}_{safe}.

tsafeTMt^{TM}_{safe} is \infty at a replica if there are zero prepared (but not committed) transactions—that is, transactions in between the two phases of two-phase commit.

In case of no prepared transactions, Paxos leader could serve no stale read directly without blocking as long as its leader lease is valid.

If there are any such transactions, participant knows a lower bound on a prepared transaction’s timestamp. Every participant leader (for a group gg) for a transaction TiT_i assigns a prepare timestamp si,gprepares^{prepare}_{i,g} to its prepare record. The coordinator leader ensures that the transaction’s commit timestamp si>=si,gprepares_i >= s^{prepare}_{i,g} over all participant groups gg. Therefore, for every replica in a group gg, over all transactions TiT_i prepared at gg, tsafeTM=mini(si,gprepare)1t^{TM}_{safe} = mini(s^{prepare}_{i,g}) − 1 over all transactions prepared at gg.

In case of prepared transactions, Spanner must wait all existing prepared transactions resolved.

All in all, for no stale read, Spanner may have to wait for prepared transactions to be resolved or tsafe>=sreadt_{safe} >= s_{read}. For low latency read, Spanner provides stale reads.

Notes on TrueTime

  • Valuable due to tight error bound. A 500ms error bound XyzTime is useless.
  • No communication across data centers. This restricts engineering difficulties to single data center.


Spanner is great for global deployment across data centers and continents.

  • It provides linearizability.
  • Commit wait is negligible as it overlap with Paxos communication.