DNA storage approaching the information-theoretic ceiling

2026-04-22 · Information Theory

AI summary

This work presents a coding scheme for storing data in synthetic DNA that improves how synthesis and sequencing errors are handled. Instead of committing to a single hard call for each DNA base, the decoder keeps the sequencer's confidence in every candidate base at each position, which lets it correct more errors. On simulated channels the approach reaches higher storage density than prior methods and projects centuries of decodable storage. This brings DNA data storage closer to the Shannon bound, the theoretical maximum set by information theory.

synthetic DNA, data storage density, error-correction codes, sequencing errors, posterior distributions, hidden Markov model, decoder, depurination kinetics, Shannon bound
Authors
James L. Banal
Abstract
Synthetic DNA approaches 227.5 exabytes per gram of storage density with stability over millennial timescales. Realising this capacity requires error-correction codes that recover data from substantial synthesis and sequencing errors. Existing codecs convert noisy sequencer output into discrete base calls before error correction, discarding probabilistic information about which positions are reliable. Here we present a coding scheme that retains the sequencer's per-position posterior distributions through an integrated decoder of profile hidden Markov model alignment, log-product fusion across reads, and ordered-statistics decoding. On the DT4DDS channel simulator, the codec recovers 155.8 and 25.9 exabytes per gram of dsDNA under high- and low-fidelity conditions, exceeding the highest prior-art density on each channel by 11 and 52 percent. Under a single-encode-then-degrade protocol mapped to depurination kinetics at 25 °C in the dry state, the codec projects 282 years of decodable storage at 17.1 exabytes per gram. These results place DNA storage density within reach of the Shannon bound of the underlying channel.
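
The abstract names log-product fusion across reads as one stage of the decoder. The following is a minimal sketch of that general idea, not the paper's implementation. It assumes the reads have already been aligned position by position (the profile hidden Markov model step is omitted), that reads are conditionally independent, and that the prior over bases is uniform, so multiplying per-read posteriors and renormalising gives the fused posterior. The function names, the probability floor, and the example numbers are all hypothetical.

```python
import math

BASES = "ACGT"

def fuse_position(posteriors, floor=1e-6):
    """Fuse per-read posteriors over A, C, G, T for one aligned position.

    Assumes independent reads and a uniform prior, so the fused posterior
    is proportional to the product of the per-read posteriors. The product
    is accumulated as a sum of logs to avoid underflow with many reads.
    `floor` is a hypothetical guard against log(0).
    """
    log_scores = [0.0] * len(BASES)
    for dist in posteriors:
        for i, prob in enumerate(dist):
            log_scores[i] += math.log(max(prob, floor))
    # Normalise with the log-sum-exp trick: subtract the peak before exp.
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

def fuse_reads(aligned_reads):
    """aligned_reads[r][j] is read r's posterior over BASES at position j."""
    n_positions = len(aligned_reads[0])
    return [
        fuse_position([read[j] for read in aligned_reads])
        for j in range(n_positions)
    ]

# Three made-up reads covering two positions.
reads = [
    [[0.70, 0.10, 0.10, 0.10], [0.40, 0.30, 0.20, 0.10]],
    [[0.60, 0.20, 0.10, 0.10], [0.10, 0.70, 0.10, 0.10]],
    [[0.25, 0.25, 0.25, 0.25], [0.20, 0.60, 0.10, 0.10]],
]
fused = fuse_reads(reads)
for j, dist in enumerate(fused):
    best = BASES[dist.index(max(dist))]
    print(j, best, [round(x, 3) for x in dist])
```

In this toy input the third read is uninformative at position 0 (a flat distribution), so it contributes nothing to the fused result there, whereas a hard base call would have forced it to vote for some base. The fused distributions remain soft, which is what a downstream soft-decision decoder such as ordered-statistics decoding would consume.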