Quantifying molecular bias in DNA data storage.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
29 06 2020
History:
received: 08 10 2019
accepted: 19 05 2020
entrez: 1 7 2020
pubmed: 1 7 2020
medline: 29 8 2020
Status: epublish

Abstract

DNA has recently emerged as an attractive medium for archival data storage. Recent work has demonstrated proof-of-principle prototype systems; however, very uneven (biased) sequencing coverage has been reported, which indicates inefficiencies in the storage process. Deviations from the average coverage in the sequence copy distribution can cause either wasteful provisioning in sequencing or an excessive number of missing sequences. Here, we use millions of unique sequences from a DNA-based digital data archival system to study the oligonucleotide copy unevenness problem and show that the two paramount sources of bias are the synthesis and amplification (PCR) processes. Based on these findings, we develop a statistical model for each molecular process as well as the overall process. We further use our model to explore the trade-offs between synthesis bias, storage physical density, logical redundancy, and sequencing redundancy, providing insights for engineering efficient, robust DNA data storage systems.
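
The link the abstract draws between uneven copy counts and missing sequences can be illustrated with a minimal sketch. It is not the authors' statistical model: it assumes reads are drawn independently with probability proportional to each sequence's copy number, and it uses a lognormal copy distribution purely for illustration. The function names (expected_dropout, lognormal_pool) and all parameter values are hypothetical.

```python
import math
import random


def expected_dropout(copy_weights, reads):
    """Expected fraction of sequences never observed after `reads` draws.

    Assumes each read picks sequence i with probability p_i = w_i / sum(w),
    so the chance that sequence i is missed by every read is (1 - p_i) ** reads.
    """
    total = sum(copy_weights)
    missed = sum((1.0 - w / total) ** reads for w in copy_weights)
    return missed / len(copy_weights)


def lognormal_pool(n_sequences, sigma, seed=0):
    """Hypothetical biased pool: relative copy numbers drawn from a lognormal."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.0, sigma) for _ in range(n_sequences)]


n = 10_000                             # number of unique sequences (illustrative)
uniform = [1.0] * n                    # perfectly even copy distribution
skewed = lognormal_pool(n, sigma=1.0)  # uneven copy distribution (assumed shape)

for coverage in (5, 10, 20):           # mean reads per sequence
    reads = coverage * n
    print(coverage,
          expected_dropout(uniform, reads),
          expected_dropout(skewed, reads))
```

Under these assumptions, the uneven pool loses more sequences than the even pool at the same mean coverage, so recovering every sequence requires either deeper sequencing or more logical redundancy, which is the trade-off the abstract describes.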

Identifiers

pubmed: 32601272
doi: 10.1038/s41467-020-16958-3
pii: 10.1038/s41467-020-16958-3
pmc: PMC7324401

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

3264

Authors

Yuan-Jyue Chen (YJ)

Microsoft Research, Redmond, Washington, 98052, USA. yuanjc@microsoft.com.

Christopher N Takahashi (CN)

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA.

Lee Organick (L)

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA.

Callista Bee (C)

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA.

Siena Dumas Ang (SD)

Microsoft Research, Redmond, Washington, 98052, USA.

Patrick Weiss (P)

Twist Bioscience, San Francisco, California, 94158, USA.

Bill Peck (B)

Twist Bioscience, San Francisco, California, 94158, USA.

Georg Seelig (G)

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA.
Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, 98195, USA.

Luis Ceze (L)

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington, 98195, USA. luisceze@cs.washington.edu.

Karin Strauss (K)

Microsoft Research, Redmond, Washington, 98052, USA. kstrauss@microsoft.com.
