Commit volatile memory to persistent append-only log

Years ago, I wrote a client library for Apache BookKeeper and initially chose Send, !Sync and Clone for the client object.

  1. Clone and Send so the client can be shared among concurrent tasks.
  2. !Sync so that every client can keep its own data without cross-thread synchronization.

This way the client can batch simultaneous requests into a single asynchronous task and serve parallel requests across multiple concurrent asynchronous tasks. But it failed, because holding &self across an .await requires &Self to be Send, which is impossible by definition when Self is !Sync.

I complained about it a lot, with quotes from What shall Sync mean across an .await?. Recently, while developing spawns, I found that many async runtimes provide spawn_local to spawn !Send tasks. This is tedious. I have said before that a future should be Send unless it captures something !Send. Currently, some tasks that are forced to be !Send should actually be Send. This time, I want to go further into what makes a future !Send and how Rust could solve it.

Before continuing, I want to state two points.

Code before .await happens before code after .await.

This is the ground truth in our mental model; otherwise everything falls apart. The same holds for code within a thread or a process.

A Rust future is a combination of state and Future::poll. .await is the call site of Future::poll, which advances that state. The statement above then becomes: Future::poll observes data changes from its last run. In a single-threaded executor, Future::poll is invoked sequentially, so the statement holds. In a multi-threaded executor, the thread that acquires the future will observe changes made by the thread that released it. Multi-threaded executors are considered buggy if they cannot guarantee this.

Futures are self-contained concurrent execution units, just as threads are to multiple cores.

From the above, we know that code in a future executes sequentially: we fear no contention inside a single future, and we are able to run multiple futures concurrently. Additionally, a Future + Send + 'static is self-contained: it holds nothing !Send and no non-'static references to the outside. This is exactly what threads provide us: self-contained sequential execution units that run concurrently with each other. If we are able to use something !Send after thread::sleep, we should be able to do the same after .await.

Read more »

Concurrency is hard, and Java is no exception. In this post, and possibly future posts, I will record traps and pitfalls in Java that I have experienced or heard of.

Nested write in ConcurrentHashMap.compute could deadlock

ConcurrentHashMap uses a bucket-level lock in write operations (e.g. put, compute) to protect bucket nodes. If a nested write targets a key that falls into the same bucket that ConcurrentHashMap.compute is currently serving, it deadlocks. The javadoc of ConcurrentHashMap.compute and its siblings warns about this.

Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this Map.

I encountered this once in production code, where the nested “update” was hidden behind ServiceLoader.

Others have encountered this as well.
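A minimal sketch of the restriction (class name, keys and values are mine, not from the incident above): the deadlock-prone nested write is shown only as a comment, since actually running it could hang forever; the safe variant performs the second update only after compute has returned and released the bucket lock.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ComputePitfall {
    static ConcurrentHashMap<String, Integer> demo() {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);

        // DEADLOCK-PRONE: a nested write from inside compute may block forever
        // if "b" happens to hash into the bucket compute currently holds locked:
        //
        //   map.compute("a", (k, v) -> {
        //       map.put("b", 42);   // nested update of the same map: forbidden
        //       return v + 1;
        //   });

        // Safe variant: keep the computation short and simple, and perform any
        // other update only after compute has released the bucket lock.
        int updated = map.compute("a", (k, v) -> v + 1);
        map.put("b", updated);
        return map;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```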

CompletableFuture.complete will run non-async computations if it completes the future

Actions supplied for dependent completions of non-async methods may be performed by the thread that completes the current CompletableFuture, or by any other caller of a completion method.

I think this is not a good design: it makes CompletableFuture.complete vulnerable to callbacks registered through the then*, whenComplete and handle families of methods. I did see production code exploit this subtlety to build a strong happens-before relation between a whenComplete callback and the code after complete.
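A small sketch of the behavior (all names are mine): when a non-async dependent is registered before completion and only one thread is involved, the dependent action runs inside complete() itself, on the completing thread, which is exactly the subtlety that such happens-before tricks rely on.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class CompletePitfall {
    // Returns true when the non-async dependent action ran on the very
    // thread that called complete().
    static boolean demo() {
        CompletableFuture<String> future = new CompletableFuture<>();
        AtomicReference<Thread> runner = new AtomicReference<>();

        // Non-async dependent registered BEFORE completion.
        future.thenRun(() -> runner.set(Thread.currentThread()));

        // With no other threads racing on this future, complete() executes
        // the pending thenRun action synchronously before returning.
        future.complete("done");
        return runner.get() == Thread.currentThread();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```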

Read more »

It has been a decade since the Spanner paper was published, and several Spanner-inspired open source projects, such as CockroachDB and YugabyteDB, have grown to prominence despite lacking TrueTime. It should be much easier to understand Spanner today with these pioneering open source projects at hand.

What are tablet, Paxos group, directory and fragment?

A directory is the unit of data placement. All data in a directory has the same replication configuration.

a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this decision so that it would be possible to colocate multiple directories that are frequently accessed together.

A directory is also the smallest unit whose geographic-replication properties (or placement, for short) can be specified by an application. The design of our placement-specification language separates responsibilities for managing replication configurations. Administrators control two dimensions: the number and types of replicas, and the geographic placement of those replicas. They create a menu of named options in these two dimensions (e.g., North America, replicated 5 ways with 1 witness). An application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options. For example, an application might store each end-user’s data in its own directory, which would enable user A’s data to have three replicas in Europe, and user B’s data to have five replicas in North America.

For expository clarity we have over-simplified. In fact, Spanner will shard a directory into multiple fragments if it grows too large. Fragments may be served from different Paxos groups (and therefore different servers). Movedir actually moves fragments, and not whole directories, between groups.

I map these to CockroachDB’s concepts.

Spanner       CockroachDB                                 What
Tablet        Store                                       Colocation container for different fragments/ranges
Paxos group   Multi-raft group                            The tablet's consensus group
Fragment      Range/Replica                               Contiguous key space
Directory     Replication zones (table, database, etc.)   Logical container of fragments/ranges for data placement specification

Why a long-lived leader?

Spanner’s Paxos implementation uses timed leases to make leadership long-lived (10 seconds by default). A potential leader sends requests for timed lease votes; upon receiving a quorum of lease votes the leader knows it has a lease. A replica extends its lease vote implicitly on a successful write, and the leader requests lease-vote extensions if they are near expiration. Define a leader’s lease interval as starting when it discovers it has a quorum of lease votes, and as ending when it no longer has a quorum of lease votes (because some have expired). Spanner depends on the following disjointness invariant: for each Paxos group, each Paxos leader’s lease interval is disjoint from every other leader’s.

Read more »

This is my second time reading through the Raft algorithm, and it was hard to
recall what I learned in the first reading. This time I decided to record my thoughts for future
recall. I hope it is useful to newbies in distributed systems like me.

What does “consensus algorithm” mean?

A consensus algorithm is the process used to achieve agreement on shared state among faulty
processes in a distributed system.

Introducing Replicated State Machines

Here, we define a state machine as State' = Machine(State, Input) for simplicity. A state machine
is defined by its start state; it makes progress with a sequence of inputs, producing a sequence
of intermediate states. For example:

State' = Machine(State, Input)
State'' = Machine(State', Input')
State''' = Machine(State'', Input'')

Given a state machine, how can we tell whether it is a replicated state machine?

First, a replicated state machine is deterministic. Given the same start state and the same sequence
of inputs, the state machine always produces the same intermediate states.

Second, two state machines built from the same logic, possibly on different processes or nodes and
possibly even in different languages, must be the same. Here we define sameness based on
determinism: given the same start state and the same sequence of inputs, if two state machines
produce the same sequence of intermediate states, we say these two state machines are the same.
Thus, given multiple copies of the same state machine with the same start state, fed the same
sequence of inputs, they must produce the same sequence of intermediate states.

Third, states, including the start state and intermediate states, and inputs must be self-contained,
so they can be replicated to other processes or nodes with the help of serialization and deserialization.
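The properties above can be sketched in Java (the names and the toy transition function are mine): a pure, deterministic transition function guarantees that two replicas fed the same start state and input sequence produce identical intermediate states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

public class ReplicatedStateMachine {
    // State' = Machine(State, Input): here states and inputs are ints and the
    // machine is a pure function of both, hence deterministic.
    static List<Integer> run(BinaryOperator<Integer> machine, int start, int[] inputs) {
        List<Integer> states = new ArrayList<>();
        int state = start;
        for (int input : inputs) {
            state = machine.apply(state, input);  // advance deterministically
            states.add(state);                    // record intermediate states
        }
        return states;
    }

    public static void main(String[] args) {
        BinaryOperator<Integer> machine = (state, input) -> state * 2 + input;
        int[] inputs = {1, 2, 3};
        // Two replicas with the same start state and the same input sequence...
        List<Integer> replicaA = run(machine, 0, inputs);
        List<Integer> replicaB = run(machine, 0, inputs);
        // ...must produce the same sequence of intermediate states.
        System.out.println(replicaA.equals(replicaB) + " " + replicaA);
    }
}
```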

Read more »

Spring AOP is proxy-based, using either JDK dynamic proxy or CGLIB. Spring’s Cache Abstraction, Transaction Management and Asynchronous Execution are all built upon AOP proxies.

However, a proxy can intercept only external method calls. This means that self-invocation (a method of the target object calling another method of the same object) will not lead to an actual interception at runtime.

Thus, the following code will not function correctly.

public class SomeServiceImpl implements SomeService {

    @Cacheable(value = "SomeCache", key = "#kind + '#' + #id")
    public Object methodBase(String kind, String id) {
        // ...
        return result;
    }

    public Object methodA(String id) {
        return methodBase("a", id);
    }

    public Object methodB(String id) {
        return methodBase("b", id);
    }
}

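The interception gap can be reproduced without Spring, using a plain JDK dynamic proxy as a stand-in for Spring's advice (all names here are mine): the counter plays the role of the caching advice, and the internal methodBase call never passes through the proxy.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

public class SelfInvocationDemo {
    interface SomeService {
        Object methodBase(String kind, String id);
        Object methodA(String id);
    }

    static class SomeServiceImpl implements SomeService {
        public Object methodBase(String kind, String id) { return kind + "#" + id; }
        // Self-invocation: this call goes directly to the target object,
        // never through the proxy.
        public Object methodA(String id) { return methodBase("a", id); }
    }

    // Stand-in for Spring's caching advice: counts intercepted calls.
    static final AtomicInteger intercepted = new AtomicInteger();

    static SomeService proxy(SomeService target) {
        InvocationHandler handler = (p, method, args) -> {
            intercepted.incrementAndGet(); // "advice" runs only on external calls
            return method.invoke(target, args);
        };
        return (SomeService) Proxy.newProxyInstance(
                SomeService.class.getClassLoader(),
                new Class<?>[] {SomeService.class},
                handler);
    }

    static int demo() {
        intercepted.set(0);
        SomeService service = proxy(new SomeServiceImpl());
        service.methodA("1");     // one external call through the proxy...
        return intercepted.get(); // ...so exactly one interception: the inner
                                  // methodBase call was not intercepted
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```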
AspectJ Load-Time Weaving

Spring provides a library named spring-aspects, and AdviceMode.ASPECTJ can be used as the value of the mode field of the @EnableCaching, @EnableTransactionManagement and @EnableAsync annotations to support AspectJ load-time weaving. But that is not enough: an AspectJ load-time weaver is required to weave the aspect into the target class.

The AspectJ load-time weaver weaves the target class by transforming class-file bytecode with a ClassFileTransformer named ClassPreProcessorAgentAdapter from aspectjweaver.jar. This transformation can be performed either at the JVM level, through the Instrumentation interface, or per ClassLoader. A method with a signature similar to void addTransformer(ClassFileTransformer transformer) is required to apply the transformation during the class-loading phase; Instrumentation supports this method natively, while not all class loaders do.

Custom class loader to apply class file transformation

Spring’s @EnableLoadTimeWeaving creates a bean named loadTimeWeaver to inject a ClassFileTransformer, capable of weaving the target class with the desired aspect, into the bean class loader. Unfortunately, Spring Boot does not support this approach.

Because loadTimeWeaver is itself a bean, and class loading happens during the bean-definition-parsing phase, which certainly precedes the bean-creation phase in the same application context, @EnableLoadTimeWeaving should be enabled in an application context that is an ancestor of the one where the target class is located.

Read more »