A deep learning model for predicting next-generation sequencing depth from DNA sequence.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
19 07 2021
History:
received: 23 06 2020
accepted: 17 06 2021
entrez: 20 7 2021
pubmed: 21 7 2021
medline: 3 8 2021
Status: epublish

Abstract

Targeted high-throughput DNA sequencing is a primary approach for genomics and molecular diagnostics, and has more recently been used as a readout for DNA information storage. Oligonucleotide probes used to enrich gene loci of interest have different hybridization kinetics, resulting in non-uniform coverage that increases sequencing costs and decreases sequencing sensitivities. Here, we present a deep learning model (DLM) for predicting next-generation sequencing (NGS) depth from DNA probe sequences. Our DLM includes a bidirectional recurrent neural network that takes as input both the DNA nucleotide identities and the calculated probability of each nucleotide being unpaired. We apply our DLM to three different NGS panels: a 39,145-plex panel for human single nucleotide polymorphisms (SNP), a 2000-plex panel for human long non-coding RNA (lncRNA), and a 7373-plex panel targeting non-human sequences for DNA information storage. In cross-validation, our DLM predicts sequencing depth to within a factor of 3 with 93% accuracy for the SNP panel, and 99% accuracy for the non-human panel. In independent testing, the DLM predicts the lncRNA panel with 89% accuracy when trained on the SNP panel. The same model is also effective at predicting the measured single-plex kinetic rate constants of DNA hybridization and strand displacement.
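
The abstract describes two things that are easy to make concrete: the per-nucleotide input (nucleotide identity plus an unpaired probability) and the "within a factor of 3" accuracy measure. The Python sketch below illustrates one plausible reading of each and is not the authors' implementation. The function names `encode_probe` and `within_factor_accuracy`, the one-hot channel order A/C/G/T, and the example values are all assumptions made for illustration; the unpaired probabilities are assumed to come from an external secondary-structure calculation and are not computed here.

```python
import math

# Assumed channel order for the one-hot encoding (not stated in the abstract).
NUCLEOTIDES = "ACGT"


def encode_probe(sequence, unpaired_prob):
    """Build one feature vector per nucleotide.

    Each vector has 5 entries: 4 one-hot channels (A, C, G, T) followed by
    the probability that the nucleotide is unpaired. A sequence model such as
    a bidirectional RNN could consume this list position by position.
    """
    if len(sequence) != len(unpaired_prob):
        raise ValueError("sequence and unpaired_prob must have the same length")
    features = []
    for base, prob in zip(sequence.upper(), unpaired_prob):
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        one_hot = [1.0 if base == n else 0.0 for n in NUCLEOTIDES]
        features.append(one_hot + [float(prob)])
    return features


def within_factor_accuracy(predicted, observed, factor=3.0):
    """Fraction of probes whose predicted depth is within `factor`-fold of
    the observed depth, i.e. observed / factor <= predicted <= observed * factor.

    Depths must be strictly positive because the comparison is done in log space.
    """
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("predicted and observed must be non-empty and equal in length")
    log_factor = math.log(factor)
    hits = 0
    for p, o in zip(predicted, observed):
        if p <= 0 or o <= 0:
            raise ValueError("depths must be strictly positive")
        if abs(math.log(p / o)) <= log_factor:
            hits += 1
    return hits / len(predicted)


if __name__ == "__main__":
    # Hypothetical 3-nt probe with made-up unpaired probabilities.
    print(encode_probe("ACG", [0.9, 0.1, 0.5]))
    # Hypothetical depths: two of the four predictions fall within 3-fold.
    print(within_factor_accuracy([100, 50, 10, 400], [120, 200, 25, 100]))
```

Working in log space makes the criterion symmetric, so a 3-fold overestimate and a 3-fold underestimate are treated the same way. Under this reading, the reported 93% for the SNP panel would mean that 93% of probes satisfy the condition.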

Identifiers

pubmed: 34282137
doi: 10.1038/s41467-021-24497-8
pii: 10.1038/s41467-021-24497-8
pmc: PMC8290051

Chemical substances

DNA Probes 0
DNA 9007-49-2

Publication types

Journal Article
Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

4387

Grants

Agency: NHGRI NIH HHS
ID: R01 HG008752
Country: United States
Agency: NCI NIH HHS
ID: U01 CA233364
Country: United States

Comments and corrections

Type: CommentIn

Copyright information

© 2021. The Author(s).

Authors

Jinny X Zhang (JX)

Department of Bioengineering, Rice University, Houston, TX, USA.
Systems, Synthetic, and Physical Biology, Rice University, Houston, TX, USA.

Boyan Yordanov (B)

Microsoft Research, Cambridge, UK.
Scientific Technologies, London, UK.

Alexander Gaunt (A)

Microsoft Research, Cambridge, UK.

Michael X Wang (MX)

Department of Bioengineering, Rice University, Houston, TX, USA.

Peng Dai (P)

Department of Bioengineering, Rice University, Houston, TX, USA.

Yuan-Jyue Chen (YJ)

Microsoft Research, Seattle, WA, USA.

Kerou Zhang (K)

Department of Bioengineering, Rice University, Houston, TX, USA.

John Z Fang (JZ)

Department of Bioengineering, Rice University, Houston, TX, USA.

Neil Dalchau (N)

Microsoft Research, Cambridge, UK.

Jiaming Li (J)

Department of Bioengineering, Rice University, Houston, TX, USA.
Systems, Synthetic, and Physical Biology, Rice University, Houston, TX, USA.

Andrew Phillips (A)

Microsoft Research, Cambridge, UK. andrew.phillips@microsoft.com.

David Yu Zhang (DY)

Department of Bioengineering, Rice University, Houston, TX, USA. dyz1@rice.edu.
Systems, Synthetic, and Physical Biology, Rice University, Houston, TX, USA. dyz1@rice.edu.
