An 80% reduction in read response latency! Evaluating etcd 3.4's new feature (with a history of etcd's read/write mechanism)

Author | Chen Jie (ink seal), Alibaba Cloud development engineer

Overview: As the storage component of a Kubernetes cluster, etcd comes under heavy read and write pressure, and the new feature in etcd 3.4 effectively relieves that pressure. Starting from the mechanisms in the Raft paper, this article walks through the history of etcd's data read/write mechanism and gives an in-depth interpretation of the new etcd 3.4 feature.


In a Kubernetes cluster, etcd stores the cluster's metadata and guarantees the consistency of its distributed components, so etcd's response time often affects the performance of the whole cluster. While operating Kubernetes we found that, beyond routine read/write pressure, some special scenarios put enormous pressure on etcd: for example, when the apiserver restarts, or when components bypass the apiserver cache and read the latest data directly from etcd, etcd receives a flood of expensive read requests (the concept is introduced below), which puts enormous pressure on etcd's reads and writes. Worse, if clients retry on failure or there are many such clients, this kind of request can in severe cases crash etcd.

etcd 3.4 adds a feature called "fully concurrent read" that solves the above problem to a large extent, and it is the focus of this article. We first review the history of etcd's data read/write mechanism, then analyze why this feature significantly improves etcd's read/write performance in expensive-read scenarios, and finally validate its effect with experiments.

A history of etcd's read/write mechanism

etcd v3.0 and earlier

etcd uses the Raft algorithm for strong data consistency, which guarantees linearizable reads. In Raft, a successful write only means the write has been committed to the log; it does not guarantee that the state machine has already applied that log entry. Applying a log entry to the state machine is asynchronous with respect to committing it, so a read issued immediately after a commit may return stale data.
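The commit/apply gap can be made concrete with a toy model (all names here are illustrative, not etcd code): committing an entry does not make it visible to reads until the state machine has applied it.

```go
package main

import "fmt"

// tinyNode models a replicated log where entries are first committed
// (agreed on by a quorum) and only later applied to the state machine.
type tinyNode struct {
	committed    []func(kv map[string]string) // committed log entries
	appliedIndex int                          // how far the state machine has caught up
	kv           map[string]string            // the state machine
}

// commit appends an entry to the committed log and returns its commit index.
func (n *tinyNode) commit(e func(map[string]string)) int {
	n.committed = append(n.committed, e)
	return len(n.committed)
}

// applyNext applies one committed-but-unapplied entry to the state machine.
func (n *tinyNode) applyNext() {
	if n.appliedIndex < len(n.committed) {
		n.committed[n.appliedIndex](n.kv)
		n.appliedIndex++
	}
}

func main() {
	n := &tinyNode{kv: map[string]string{}}
	n.commit(func(kv map[string]string) { kv["a"] = "1" })
	fmt.Println(n.kv["a"] == "") // true: committed but not yet applied -> a stale read
	n.applyNext()
	fmt.Println(n.kv["a"]) // "1": visible only after apply
}
```

A read served between `commit` and `applyNext` sees the old state, which is exactly the stale-read hazard described above.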

To guarantee linearizable reads, early etcd (up to v3.0) sent all read and write requests through the Raft protocol to satisfy strong consistency. In real-world usage, however, read requests account for the vast majority of etcd's traffic, and if every read request went through a full round of Raft, etcd's performance would be very poor.

etcd v3.1

etcd v3.1 therefore optimized read requests (PR #6275) using a simple strategy: record the cluster's commit index at the time of each read, and return the data once the state machine's apply index is greater than or equal to that commit index. Because the state machine has by then applied the log entry corresponding to the commit index observed when the read arrived, the read satisfies linearizability and the result can be returned.

Following Section 6.4 of the Raft paper, etcd's ReadIndex read optimization is built on two guiding principles:

    ReadIndex requests are handled by the leader; the read index the leader obtains is the state machine's commit index, and a follower that receives a read request must forward a ReadIndex request to the leader;

    The leader must confirm it is still the leader, since a network partition may mean it no longer is; to confirm this, the leader broadcasts to the cluster and waits for acknowledgment from a quorum.

ReadIndex thus allows every member of the cluster to respond to read requests. With the ReadIndex method, once a member has ensured that the operation log up to the current read's index has been applied, it can return the value to the client. There are already many articles describing etcd's ReadIndex implementation, so we will not repeat them here.
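The waiting rule at the heart of ReadIndex can be sketched in a few lines (illustrative names, not etcd's actual API): a member holds a read until its apply index catches up with the read index obtained from the leader.

```go
package main

import "fmt"

// canServeRead reports whether a member may answer a linearizable read
// locally: its apply index must have reached the read index, i.e. the
// leader's quorum-confirmed commit index observed when the read arrived.
func canServeRead(appliedIndex, readIndex uint64) bool {
	return appliedIndex >= readIndex
}

func main() {
	readIndex := uint64(5) // commit index obtained from the leader
	fmt.Println(canServeRead(4, readIndex)) // false: must wait for apply to catch up
	fmt.Println(canServeRead(5, readIndex)) // true: the local read is now linearizable
}
```

In real etcd the member blocks (rather than polling) until the apply loop advances past the read index; the predicate above is the condition it waits on.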

etcd v3.2

Although etcd v3.1 optimized read response time with the ReadIndex method and allowed every member to answer reads, if we keep moving down the stack to the boltdb key/value storage layer, each individual member still has read performance issues after obtaining its ReadIndex.

v3.1 used batched write transactions to improve write throughput: all write requests within a fixed period are committed to boltDB together. When a read or write transaction initiated by the upper layer reaches the boltdb layer, it must acquire the transaction lock (see the code fragment below); this coarse-grained transaction lock limits all reads and writes. For small read transactions the lock merely reduces transaction throughput, but a relatively large read transaction (explained in detail later) may block reads, writes, and even the member's heartbeat, so heartbeat timeouts become likely.

// release-3.2: mvcc/kvstore.go
func (s *store) TxnBegin() int64 {
    s.tx = s.b.BatchTx()
    // boltDB transaction lock: every read and write transaction must acquire it
    s.tx.Lock()

To address this performance bottleneck, etcd v3.2 optimized reads at the boltdb layer with two core changes:

    Achieve "N reads or 1 write" parallelism: the coarse-grained lock above is refined into a read-write lock, so all read requests run in parallel with each other;

    Use buffers to increase throughput. In 3.2, both readTx and batchTx gain a buffer: every read transaction reads from the buffer first and goes through the transaction to boltDB only on a miss. Likewise, a write transaction writes its data into batchTx's buffer while writing boltDB, and at the end of the batch commit, batchTx's buffer is written back into readTx's buffer, preventing dirty reads.

// release-3.3: mvcc/kvstore_txn.go
func (s *store) Read() TxnRead {
    tx := s.b.ReadTx()
    // acquire the read transaction's RLock, then perform the read

// release-3.3: mvcc/backend/batch_tx.go
func (t *batchTxBuffered) commit(stop bool) {
    // acquire the read transaction's Lock to ensure all read transactions
    // have been closed before the commit proceeds

Fully concurrent read

etcd v3.2's read/write optimizations solved the performance bottleneck in most scenarios, but if we look from the client's perspective and return to the scenario mentioned at the beginning of this article, there is still a risk of etcd read/write performance degradation.

Here we first introduce the concept of an expensive read in etcd. All client read requests are ultimately transformed into range requests to the KV layer, and we can measure the pressure of a read request by the number of keys and the value sizes a range request touches. Broadly, the more keys a range request covers, and the larger the average value size of those keys, the greater the pressure the request puts on the DB layer. The actual boundary between an expensive read and a cheap read depends on the hardware capability of the etcd cluster.

From the client's perspective, a full query by the apiserver in a large cluster for pods, nodes, PVCs, or other resources can be considered an expensive read. Let us briefly analyze why an expensive read puts pressure on boltDB. As mentioned above, to prevent dirty reads, every write-transaction commit must ensure that all in-flight read transactions have been closed; therefore at each batch-commit point, readTx.lock must be upgraded from RLock() to Lock(), and this read-write lock upgrade can block all read operations.

As shown below (in the following figures, blue bars are read transactions, green bars are write transactions, and red bars are transactions blocked waiting for a lock), a commit is triggered at time t1, but because a read transaction has not yet finished, the commit transaction T5 is blocked on the lock and only proceeds at time t2. Ideally, most write transactions in a batch have already finished by the end of the batch, so each write-transaction commit blocks only a small number of read transactions (in the figure, only transaction T6 is blocked).

However, if etcd receives a very large read request, the read-write lock upgrade becomes blocked frequently. In the figure below, T3 is a very long read transaction spanning multiple commit batches. The end of each commit batch triggers the write-transaction commit as usual, but because the read-write lock cannot be upgraded, write transaction T4 is delayed; similarly, write transaction T7 at commit point t2 is also postponed because it cannot obtain the write lock.

In addition, after a write transaction commits, the bucket information it wrote must be written into the read cache, which also requires upgrading readTx.lock to Lock(). When the upper layer obtains readTx via backend.Read(), it must ensure these bucket caches have been fully written, so it must acquire the read lock readTx.RLock(); if a write transaction is in progress at that moment, the lock cannot be obtained and the read transaction cannot start. In the situation above, during the third batch (t2-t3) the other read transactions cannot proceed because they cannot obtain the read lock.

In summary, expensive reads cause frequent read-write lock upgrades, write-transaction commits keep getting pushed back (a problem usually called head-of-line blocking), and etcd's read/write performance collapses.

etcd v3.4 adds the "fully concurrent read" feature, whose design follows two core points:

    Remove the write lock described above (in fact, refine the lock further), so that read and write operations are no longer frequently blocked by it;

    At each batch interval, no longer reset the read transaction readTx; instead, create a new concurrentReadTx instance to serve new read requests, and close the original readTx once all of its transactions have finished. Each concurrentReadTx instance has its own buffer cache.

Beyond these two changes, fully concurrent read must copy the corresponding buffer from ReadTx when creating a new ConcurrentReadTx instance, which adds some overhead. The community is also considering making this buffer copy lazy, performing it after each write transaction ends or at the end of each batch interval. In our experiments, however, the impact of the copy turned out to be small. The core code changes are shown in the following fragments:

// release-3.4: mvcc/backend/read_tx.go
type concurrentReadTx struct {
    // each concurrentReadTx instance keeps its own buffer, copied from
    // readTx's buffer at creation time
    buf     txReadBuffer

// release-3.4: mvcc/backend/backend.go
func (b *backend) ConcurrentReadTx() ReadTx {
    // copying the buffer out of readTx requires holding the resident
    // readTx's read lock while the concurrentReadTx is created
    b.readTx.RLock()
    defer b.readTx.RUnlock()

// release-3.4: mvcc/backend/read_tx.go
// concurrentReadTx's RLock does nothing, so it no longer blocks read transactions
func (rt *concurrentReadTx) RLock() {}

// release-3.4: mvcc/kvstore_tx.go
func (s *store) Read() TxnRead {
    // calling the Read interface now returns a concurrentReadTx instead of a readTx
    tx := s.b.ConcurrentReadTx()
    tx.RLock() // concurrentReadTx's RLock is a no-op

Let us return to the expensive-read scenario described above. With fully concurrent read, the read/write flow is as shown in the figure below.

First, when mvcc creates the backend, it creates a single resident readTx instance; the write transaction batchTx now has lock conflicts only with this one instance. Every subsequent read request (e.g. T1, T2, T3) gets a new concurrentReadTx instance created to serve it, which requires copying the buffer from readTx. Even though the expensive read transaction T3 is in progress, T4 is no longer blocked and executes normally. T5, meanwhile, waits for T4's commit to finish updating readTx's buffer before taking its buffer copy, so it blocks only briefly. And the write-transaction commits T7 and T8 at times t2 and t3 are no longer blocked and complete smoothly.

In the fully concurrent read/write model, a concurrentReadTx can only block at creation time (because it depends on the buffer copy from readTx); once created, it is never blocked again, so overall read/write throughput improves greatly.
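The copy-on-create idea can be modeled in a few lines (illustrative names; real etcd copies a txReadBuffer while holding the resident readTx's read lock): each read transaction works on its own snapshot of the buffer, so later writes never contend with it.

```go
package main

import "fmt"

type readBuffer map[string]string

// concurrentRead is a toy model of concurrentReadTx: it takes a private
// copy of the shared read buffer at creation time.
type concurrentRead struct {
	buf readBuffer        // private snapshot, taken once at creation
	db  map[string]string // committed state (stands in for boltDB)
}

func newConcurrentRead(shared readBuffer, db map[string]string) *concurrentRead {
	cp := make(readBuffer, len(shared))
	for k, v := range shared {
		cp[k] = v
	}
	return &concurrentRead{buf: cp, db: db}
}

// get checks the private buffer first and falls back to the backing store.
func (c *concurrentRead) get(key string) string {
	if v, ok := c.buf[key]; ok {
		return v
	}
	return c.db[key]
}

func main() {
	shared := readBuffer{"k": "v1"}
	db := map[string]string{"k": "v0"}
	tx := newConcurrentRead(shared, db)
	shared["k"] = "v2" // later writes update the shared buffer...
	fmt.Println(tx.get("k")) // "v1": ...but this read txn still sees its snapshot
}
```

The cost of this design is exactly the copy in `newConcurrentRead`, which is the creation-time overhead discussed above; after that point the read transaction needs no lock at all.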

Read/write performance verification experiments

For the new fully concurrent read feature in etcd v3.4, we ran comparison experiments measuring the cluster's read/write performance before and after the feature. To rule out interference from network factors, we tested a single-node etcd, which is still enough to show the feature's advantages. The validation setup was as follows:

  • Read/write setup

      To simulate a conventional storage cluster, 100k KVs are written in advance; each KV consists of a 128B key and a random value of 1-32KB (16KB on average)

      expensive read: each request ranges over 20k keys, with a concurrency of 1 per second.

      cheap read: each request ranges over 10 keys, with a concurrency of 100 per second.

      write: each request puts 1 key, with a concurrency of 20 per second.

  • Test scenarios

      Ordinary read/write scenario: cheap read + write;

      Scenario with heavy read transactions: cheap read + expensive read + write.

  • Versions compared:

      etcd – ali2019rc2: without the optimization

      etcd – ali2019rc3: with the optimization

  • To guard against outliers, each test case was run 5 times, and the average of the 99th-percentile (P99) response times is reported as that case's result.

The results are shown in the table below. For the ordinary read scenario, the read/write performance of 3.4 and 3.3 is roughly the same; for the scenario with heavy read transactions, 3.4's fully concurrent read feature reduces the response time of the expensive read to some extent. And for cheap reads and writes in that scenario, rc2's write lock makes access very slow, while rc3's fully parallel reads and writes cut the response time to about 1/7 of the original.

| etcd version | cheap read + write: p99 read (ms) | cheap read + write: p99 write (ms) | expensive read + cheap read + write: p99 expensive read (ms) | p99 cheap read (ms) | p99 write (ms) |
|---|---|---|---|---|---|
| 3.3 | 14.1 | 15.1 | | | |
| 3.4 (with FCR) | 16.1 | 14.2 | | | |

In other scenarios the feature also shows its value: in a 5,000-node Kubernetes performance test, P99 write latency under large-scale read pressure dropped by 97.4%.

Summary

etcd's new fully concurrent read feature reduces write response latency by nearly 85% and expensive-read response latency by nearly 80%, while increasing etcd's read/write throughput, solving the performance drop etcd suffers under heavy read pressure. The research and experiments were carried out under the guidance of Xieyu Mu. We have since adopted this new capability from the community after extensive performance and stability testing. Going forward, we will continue to optimize etcd's performance and stability and feed the optimizations and best practices back to the community.


  • Etcd Fully Concurrent Read Design Proposal
  • Strong consistency models
  • Attiya H., Welch J. L. "Sequential consistency versus linearizability." ACM Transactions on Computer Systems (TOCS), 1994, 12(2): 91-122.
  • Ongaro D., Ousterhout J. "In search of an understandable consensus algorithm." 2014 USENIX Annual Technical Conference (USENIX ATC 14), 2014: 305-319. [Raft paper]

"The Alibaba Cloud Native WeChat public account (ID: Alicloudnative) focuses on technical fields such as microservices, Serverless, containers, and Service Mesh, follows popular cloud-native technology trends and large-scale cloud-native production practices, and aims to be the public account that best understands cloud-native developers."
