O^3-LSM: Maximizing Disaggregated LSM Write Performance via Three-Layer Offloading

2026-03-05 · Databases
AI summary

The authors propose O3-LSM, a new design for a key-value storage system that improves how data is written and managed when compute and storage are physically separated. They add three layers of offloading (memtable offloading, flush offloading, and compaction offloading) to make better use of shared disaggregated memory, allowing faster and more efficient data handling. Their design also divides data into smaller pieces (shards) to enable more parallel work, and uses a dedicated caching method to speed up reads. Their tests show O3-LSM achieves substantially higher write and read throughput and lower latency than comparable systems.

Log-Structured Merge-tree (LSM) · Key-Value Store (KVS) · Disaggregated Storage · Memtable · Compaction Offloading · Flush Offloading · Shard-Level Optimization · Disaggregated Memory (DM) · Cache-Enhanced Read Delegation · Range Query
Authors
Qi Lin, Gangqi Huang, Te Guo, Chang Guo, Viraj Thakkar, Zichen Zhu, Jianguo Wang, Zhichao Cao
Abstract
Log-Structured Merge-tree-based Key-Value Stores (LSM-KVS) have been optimized and redesigned for disaggregated storage via techniques such as compaction offloading to reduce the network I/Os between compute and storage. However, the constrained memory space and slow flush at the compute node severely limit the overall write throughput of existing optimizations. In this paper, we propose O3-LSM, a fundamentally new LSM-KVS architecture that leverages the shared Disaggregated Memory (DM) to support three-layer offloading, i.e., memtable Offloading, flush Offloading, and the existing compaction Offloading. Compared to existing disaggregated LSM-KVS with compaction offloading only, O3-LSM maximizes write performance by addressing the above issues. O3-LSM first leverages a novel DM-Optimized Memtable to achieve dynamic memtable offloading, which extends the write buffer while enabling fast, asynchronous, and parallel memtable transmission. Second, we propose Collaborative Flush Offloading, which decouples the flush control plane from execution and supports memtable flush offloading at any node with dedicated scheduling and global optimizations. Third, O3-LSM is further improved with the Shard-Level Optimization, which partitions the memtable into shards based on disjoint key-ranges that can be transferred and flushed independently, unlocking parallelism across shards. In addition, to mitigate slow lookups in the disaggregated setting, O3-LSM employs an adaptive Cache-Enhanced Read Delegation mechanism that combines a compact local cache with DM-assisted delegated memtable reads. Our evaluation shows that O3-LSM achieves up to 4.5X write, 5.2X range query, and 1.8X point lookup throughput improvement, and up to 76% P99 latency reduction compared with Disaggregated-RocksDB, CaaS-LSM, and Nova-LSM.
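The Shard-Level Optimization described above rests on a simple invariant: if shards cover disjoint key ranges, each shard can be transferred and flushed with no cross-shard coordination. A minimal sketch of that idea follows. It is not the paper's implementation; the `ShardedMemtable` class, its boundary scheme, and the `flush_fn` callback are all hypothetical illustrations, with thread-pool parallelism standing in for offloaded flushes on remote nodes.

```python
# Hypothetical sketch (not O3-LSM's actual code): a memtable partitioned into
# shards by disjoint key ranges, so each shard can be flushed independently
# and in parallel -- the property that Shard-Level Optimization exploits.
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

class ShardedMemtable:
    def __init__(self, boundaries):
        # boundaries ["g", "p"] create shards covering
        # (-inf, "g"), ["g", "p"), and ["p", +inf) -- disjoint by construction.
        self.boundaries = boundaries
        self.shards = [dict() for _ in range(len(boundaries) + 1)]

    def _shard_for(self, key):
        # Binary search the boundary list to route a key to its shard.
        return bisect_right(self.boundaries, key)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

    def flush_all(self, flush_fn):
        # Because key ranges are disjoint, shard flushes need no coordination;
        # here a thread pool stands in for flush offloading to remote nodes.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(flush_fn, self.shards))

mt = ShardedMemtable(["g", "p"])
mt.put("apple", 1)
mt.put("kiwi", 2)
mt.put("zebra", 3)
# Each shard flushes to a sorted run (an SSTable-like output) independently.
runs = mt.flush_all(lambda shard: sorted(shard.items()))
```

Because each flush consumes only its own shard, the sketch also shows why sharding unlocks parallelism for the memtable transmission step: shards can be shipped to DM and flushed by different nodes concurrently without merging.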