Frank Yu at QCon San Francisco 2022


During QCon San Francisco 2022, Frank Yu, Senior Director of Engineering at Coinbase, presented Leveraging Determinism, which drew on first-hand experiences and examples from building and running financial exchanges. This talk was part of the Building Modern Backends editorial track.

Frank began his talk by making two assertions about deterministic logic:

  • If you have significant logic, make it deterministic.

  • If you have deterministic logic, don’t be afraid to replay it anywhere (for efficiency and profit).

After a brief introduction to the history of the Coinbase derivatives exchanges, Frank described the work his team does on a trading exchange as “critical code” that must:

  • Be correct. The money flowing through an exchange is orders of magnitude larger than the revenue the exchange expects to earn from it.

  • Have consistent and predictable performance. For Coinbase specifically, that means keeping 99th-percentile response times comfortably under 1 ms.

  • Remember everything for auditability. Regulations require the exchange to be able to reproduce its exact state at any millisecond over the last 7 years.

To keep adding features and evolving the underlying system in a reasonable timeframe without introducing critical bugs into production, Frank pointed out:

“We have to make sure that the core logic is kept simple. We also avoid concurrency at all costs in our core logic.”

A system is deterministic if, given exactly the same inputs in the same order, it produces exactly the same state and the same outputs. In practice, if all sequenced requests are recorded in a log and processed by deterministic logic, you efficiently get replicated state and output events.
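To make the definition concrete, here is a minimal Python sketch of a deterministic state machine driven by a sequenced log. The `Command`, `Ledger`, and `replay` names are hypothetical and are not Coinbase’s or Aeron’s API. Because `apply` depends only on the current state and the command (timestamps come from the log rather than the local clock, and amounts are integers), any two nodes replaying the same log end with identical state and outputs. Replaying only a prefix of the log reconstructs the state as of an earlier point in time, which is the kind of reproduction the auditability requirement calls for.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Command:
    seq: int           # position assigned by the sequencer
    timestamp_ms: int  # taken from the log, never from the local clock
    account: str
    amount: int        # integer minor units, so no float rounding differences


@dataclass
class Ledger:
    """Hypothetical deterministic state machine: state depends only on the commands applied."""
    balances: dict = field(default_factory=dict)
    outputs: list = field(default_factory=list)

    def apply(self, cmd: Command) -> None:
        new_balance = self.balances.get(cmd.account, 0) + cmd.amount
        if new_balance < 0:
            # Rejections are outputs too, and replay reproduces them exactly.
            self.outputs.append(("rejected", cmd.seq))
            return
        self.balances[cmd.account] = new_balance
        self.outputs.append(("accepted", cmd.seq, new_balance))


def replay(log, until_ms=None):
    """Rebuild state by applying the log in order, optionally stopping at a point in time."""
    ledger = Ledger()
    for cmd in log:
        if until_ms is not None and cmd.timestamp_ms > until_ms:
            break
        ledger.apply(cmd)
    return ledger


log = [
    Command(1, 1000, "alice", 500),
    Command(2, 1001, "alice", -200),
    Command(3, 1002, "bob", -50),
    Command(4, 1003, "alice", -400),
]

primary, replica = replay(log), replay(log)
assert primary.balances == replica.balances and primary.outputs == replica.outputs
print(replay(log, until_ms=1000).balances)  # {'alice': 500}
```

Anything that breaks this property, such as reading the wall clock, iterating over an unordered collection, or racing threads, would make the replica diverge, which is one reason to keep the core logic simple and single-threaded.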

Because the Coinbase exchange operates on microsecond timescales, the team needed a very fast consensus implementation. Frank and his team used Aeron Cluster, which supports fault-tolerant services as replicated state machines based on the Raft consensus algorithm.

The asymmetry between input and output events can cause jitter and scaling bottlenecks. A system’s input sizes and rates are usually consistent and predictable, but its output sizes and rates are often hard to predict and hard to validate. Frank illustrated this with an example from the Coinbase system, where a single input event can produce many new events, potentially running into the thundering herd problem and incurring very expensive data transfer costs from their cloud provider.
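The fan-out effect can be illustrated with a deliberately simplified, hypothetical matcher (a single price level, a plain `deque` for the book, and invented names such as `match_buy`). It is not how Coinbase matches orders; it only shows why the number of output events depends on the state of the system rather than on the size of the input message.

```python
from collections import deque


def match_buy(resting_asks: deque, qty: int) -> list:
    """Hypothetical matcher: fill one incoming buy against resting asks, oldest first.

    Assumes a single price level for brevity. Every resting order touched yields a
    fill event, so the output size depends on the book, not on the input message.
    """
    events = []
    while qty > 0 and resting_asks:
        order_id, ask_qty = resting_asks[0]
        traded = min(qty, ask_qty)
        qty -= traded
        events.append(("fill", order_id, traded))
        if traded == ask_qty:
            resting_asks.popleft()
        else:
            resting_asks[0] = (order_id, ask_qty - traded)
    if qty > 0:
        events.append(("rest", qty))  # unfilled remainder would rest on the book
    return events


# The same kind of input message, two very different output volumes.
small = match_buy(deque([("ask-0", 5)]), qty=3)
large = match_buy(deque((f"ask-{i}", 1) for i in range(1_000)), qty=1_000)
print(len(small), len(large))  # 1 1000
```

A single message of a few bytes produced a thousand events in the second case; multiplied across downstream consumers, this is the kind of output burst that is hard to size for in advance.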

Because a deterministic system always produces the same output from the same input and state, a replica that receives the original requests will produce the same outputs as if those outputs had been sent to it by the upstream node. This suggests a different approach: in a deterministic system, rather than replicating data, you can replicate the computation, which yields a more stable and predictable system.
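A minimal sketch of the difference, assuming a hypothetical `process` function that stands in for the exchange logic and JSON that stands in for whatever wire format is really used: with data replication the primary ships every output event, while with computation replication it ships only the sequenced inputs and the downstream node reruns the same deterministic code.

```python
import json


def process(inputs):
    """Hypothetical deterministic service: each input fans out into `n` output events."""
    outputs = []
    for seq, (kind, n) in enumerate(inputs):
        for part in range(n):
            outputs.append({"seq": seq, "kind": kind, "part": part})
    return outputs


inputs = [("order", 1), ("sweep", 500), ("order", 2)]

# Data replication: the primary computes the outputs and ships every one of them.
outputs_on_wire = json.dumps(process(inputs)).encode()

# Computation replication: ship only the sequenced inputs; the replica recomputes.
inputs_on_wire = json.dumps(inputs).encode()
replica_outputs = process(json.loads(inputs_on_wire))

assert replica_outputs == process(inputs)  # same code + same inputs => same outputs
print(len(inputs_on_wire), "bytes of inputs vs", len(outputs_on_wire), "bytes of outputs")
```

The trade-off is that the replica must run exactly the same, well-tested version of the logic; otherwise it silently computes different outputs instead of receiving the correct ones.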

Frank concluded his talk by summarizing what to think about when building a system like this:

  • Replicate only well-tested code, or the bugs will replicate too

  • No drift: replaying old inputs must reproduce the old behavior

  • After deployment, activate new behavior by sending a request to the monolith

  • Use a seed for deterministic pseudo-random outputs

  • Break large chunks of work into steps

  • Everything must be remembered

  • You’d be surprised how much data fits in memory

  • You’d be surprised how much work fits on one CPU core

  • Keep your 99th and 99.9th percentile latencies low

  • Protect your monolith from chatty clients
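Several of these items (seeded pseudo-randomness, no drift, and activating new behavior through a request rather than a deploy) fit together in one small sketch. Everything in it, including the `Engine` class, the `enable_new_fees` message, and the fee numbers, is hypothetical and only illustrates the pattern: the new code path ships dark and is switched on by a message in the input log, so replaying inputs recorded before that message reproduces the old outputs, and a seeded `random.Random` makes any pseudo-random values identical on every replay.

```python
import random


class Engine:
    """Hypothetical service whose outputs depend only on its seed and its input log."""

    def __init__(self, seed: int):
        # Seeded PRNG: the same seed plus the same inputs gives the same "random" values.
        self.rng = random.Random(seed)
        self.new_fees_enabled = False
        self.outputs = []

    def apply(self, msg):
        kind, value = msg
        if kind == "enable_new_fees":
            # The new code path is switched on by a logged request, so replaying
            # inputs recorded before this message reproduces the old behavior.
            self.new_fees_enabled = True
        elif kind == "trade":
            bps = 5 if self.new_fees_enabled else 10  # hypothetical fee schedule
            fee = value * bps // 10_000              # integer math, no float drift
            nonce = self.rng.randrange(1_000_000)    # any pseudo-random value the logic needs
            self.outputs.append((value, fee, nonce))


def run(seed, log):
    engine = Engine(seed)
    for msg in log:
        engine.apply(msg)
    return engine.outputs


log = [("trade", 100_000), ("enable_new_fees", None), ("trade", 100_000)]
assert run(42, log) == run(42, log)  # identical on every replay
print([fee for _, fee, _ in run(42, log)])  # [100, 50]
```

Switching behavior on through the log, rather than through the deploy itself, means the activation point is recorded alongside the inputs it affects, so a later replay knows exactly where the old behavior ends.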

Todd Montgomery previously gave a talk on Aeron Cluster at QCon New York 2018, and Martin Thompson gave a talk on cluster consensus with Aeron at QCon London 2018.

Further discussions on building modern backends will be recorded and made available on InfoQ over the coming months.

