Quantifying molecular bias in DNA data storage.

Bias; Information Storage and Retrieval; Models, Theoretical; Sequence Analysis, DNA / methods

Journal

Nature communications

ISSN: 2041-1723

Abbreviated title: Nat Commun

Country: England

NLM ID: 101528555

Publication information

Publication date:
29 06 2020

History:

received: 08 10 2019

accepted: 19 05 2020

entrez: 1 7 2020

pubmed: 1 7 2020

medline: 29 8 2020

Status: epublish

Abstract

DNA has recently emerged as an attractive medium for archival data storage. Recent work has demonstrated proof-of-principle prototype systems; however, very uneven (biased) sequencing coverage has been reported, which indicates inefficiencies in the storage process. Deviations from the average coverage in the sequence copy distribution can cause either wasteful over-provisioning of sequencing or an excessive number of missing sequences. Here, we use millions of unique sequences from a DNA-based digital data archival system to study the problem of uneven oligonucleotide copy numbers and show that the two paramount sources of bias are the synthesis and amplification (PCR) processes. Based on these findings, we develop a statistical model for each molecular process as well as for the overall process. We further use our model to explore the trade-offs between synthesis bias, physical storage density, logical redundancy, and sequencing redundancy, providing insights for engineering efficient, robust DNA data storage systems.
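The link the abstract draws between uneven copy numbers and missing sequences can be illustrated with a minimal sketch. This is not the authors' statistical model: the lognormal copy distribution, the Poisson read-sampling assumption, and the names `expected_missing_fraction` and `lognormal_weights` are all illustrative assumptions introduced here. Under Poisson sampling, a sequence with relative abundance w and mean coverage c is unobserved with probability exp(-c·w), so the expected missing fraction is the average of that quantity over all sequences.

```python
import math
import random


def expected_missing_fraction(weights, mean_coverage):
    """Expected fraction of sequences that receive zero reads.

    Assumes (hypothetically) that reads per sequence are Poisson with
    rate mean_coverage * w / mean(w), where w is the sequence's copy number.
    """
    mean_w = sum(weights) / len(weights)
    return sum(math.exp(-mean_coverage * w / mean_w) for w in weights) / len(weights)


def lognormal_weights(n, sigma, seed=0):
    """Illustrative skewed copy numbers; the lognormal shape is an assumption."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.0, sigma) for _ in range(n)]


uniform = [1.0] * 10_000                       # perfectly even pool
skewed = lognormal_weights(10_000, sigma=1.0)  # uneven (biased) pool

for coverage in (5, 10, 20):
    print(
        coverage,
        expected_missing_fraction(uniform, coverage),
        expected_missing_fraction(skewed, coverage),
    )
```

At equal mean coverage the skewed pool always loses at least as many sequences as the even pool (a consequence of Jensen's inequality), which is why a biased pool needs either deeper sequencing or more logical redundancy to recover the same data.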

Identifiers

DOI: 10.1038/s41467-020-16958-3 PMID: 32601272 PMC: PMC7324401

pubmed: 32601272

doi: 10.1038/s41467-020-16958-3

pii: 10.1038/s41467-020-16958-3

pmc: PMC7324401

Publication types

Journal Article

Languages

eng

Pagination

3264

References

Zhirnov, V., Zadegan, R. M., Sandhu, G. S., Church, G. M. & Hughes, W. L. Nucleic acid memory. Nat. Mater. 15, 366–370 (2016).

doi: 10.1038/nmat4594

Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019).

doi: 10.1038/s41576-019-0125-3

Cox, J. P. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).

doi: 10.1016/S0167-7799(01)01671-7

Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).

doi: 10.1038/nmeth.2918

Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).

doi: 10.1126/science.1226355

Goldman, N. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013).

doi: 10.1038/nature11875

Grass, R. N., Heckel, R., Puddu, M., Paunescu, D. & Stark, W. J. Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed. 54, 2552–2555 (2015).

doi: 10.1002/anie.201411378

Organick, L. et al. Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242–248 (2018).

doi: 10.1038/nbt.4079

Yazdi, S. M. H. T., Yuan, Y., Ma, J., Zhao, H. & Milenkovic, O. A rewritable, random-access DNA-based storage system. Sci. Rep. 5, 14138 (2015).

doi: 10.1038/srep14138

Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017).

doi: 10.1126/science.aaj2038

Bornholt, J. et al. A DNA-based archival storage system. ACM SIGOPS Oper. Syst. Rev. 50, 637–649 (2016).

doi: 10.1145/2954680.2872397

Yazdi, S. M. H. T., Gabrys, R. & Milenkovic, O. Portable and error-free DNA-based data storage. Sci. Rep. 7, 5011 (2017).

doi: 10.1038/s41598-017-05188-1

Heckel, R., Mikutis, G. & Grass, R. N. A characterization of the DNA data storage channel. Sci. Rep. 9, 9663 (2019).

doi: 10.1038/s41598-019-45832-6

Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat. Methods 9, 72–74 (2012).

doi: 10.1038/nmeth.1778

Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).

Dabney, J. & Meyer, M. Length and GC-biases during sequencing library amplification: a comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques 52, 87–94 (2012).

doi: 10.2144/000113809

Aird, D. et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12, R18 (2011).

Jagers, P. & Klebaner, F. Random variation and concentration effects in PCR. J. Theor. Biol. 224, 299–304 (2003).

doi: 10.1016/S0022-5193(03)00166-8

Stolovitzky, G. & Cecchi, G. Efficiency of DNA replication in the polymerase chain reaction. Proc. Natl Acad. Sci. USA 93, 12947–12952 (1996).

doi: 10.1073/pnas.93.23.12947

Hassibi, A., Kakavand, H. & Lee, T. A stochastic model and simulation algorithm for polymerase chain reaction (PCR) systems. In Proc. of IEEE Workshop on Genomics Signal Processing and Statistics (IEEE, 2004).

Piau, D. Confidence intervals for nonhomogeneous branching processes and polymerase chain reactions. Ann. Probab. 33, 674–702 (2005).

doi: 10.1214/009117904000000775

Lalam, N., Jacob, C. & Jagers, P. Modelling the PCR amplification process by a size-dependent branching process and estimation of the efficiency. Adv. Appl. Probab. 36, 602–615 (2004).

doi: 10.1239/aap/1086957587

Peccoud, J. & Jacob, C. Theoretical uncertainty of measurements using quantitative polymerase chain reaction. Biophys. J. 71, 101–108 (1996).

doi: 10.1016/S0006-3495(96)79205-6

Kebschull, J. M. & Zador, A. M. Sources of PCR-induced distortions in high-throughput sequencing data sets. Nucleic Acids Res. 43, e143 (2015).

doi: 10.1093/nar/gku1263

Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

doi: 10.1093/bioinformatics/btp698

Quail, M. A. et al. Optimal enzymes for amplifying sequencing libraries. Nat. Methods 9, 10–11 (2012).

doi: 10.1038/nmeth.1814

Chen, Y., Liu, T., Yu, C., Chiang, T. & Hwang, C. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS ONE 8, e62856 (2013).

doi: 10.1371/journal.pone.0062856

Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).

doi: 10.1093/nar/gks001

Organick, L. et al. Probing the physical limits of reliable DNA data retrieval. Nat. Commun. 11, 1–7 (2020).

Authors

Yuan-Jyue Chen (YJ)

Christopher N Takahashi (CN)

Lee Organick (L)

Callista Bee (C)

Siena Dumas Ang (SD)

Patrick Weiss (P)

Bill Peck (B)

Georg Seelig (G)

Luis Ceze (L)

Karin Strauss (K)
