EvoLSTM: context-dependent models of sequence evolution using a sequence-to-sequence LSTM.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
1 July 2020
History:
entrez: 14 July 2020
pubmed: 14 July 2020
medline: 9 March 2021
Status: ppublish

Abstract

Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood. We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes. Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM. Supplementary data are available at Bioinformatics online.
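To make the notion of a context-dependent mutation model concrete, here is a minimal Python sketch. It is not EvoLSTM and does not reproduce the authors' code: the trained sequence-to-sequence LSTM is replaced by a hand-written lookup, the context is reduced to one flanking nucleotide per side, only substitutions are modelled, and every rate is invented for illustration. The names `mutation_probs` and `simulate` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutation_probs(left, base, right):
    """Toy distribution over the descendant base, given the local context.

    All rates are made up. `left` is accepted to show where upstream
    context would enter, but this toy rule does not use it.
    """
    # Assumed rule: a C followed by a G (a CpG site) turns into T more often.
    if base == "C" and right == "G":
        return {"C": 0.70, "T": 0.24, "A": 0.03, "G": 0.03}
    # Default: keep the base with probability 0.97, else mutate uniformly.
    probs = {b: 0.01 for b in ALPHABET}
    probs[base] = 0.97
    return probs


def simulate(seq, rng):
    """Sample one descendant sequence, position by position."""
    out = []
    for i, base in enumerate(seq):
        # "N" marks a missing neighbour at either end of the sequence.
        left = seq[i - 1] if i > 0 else "N"
        right = seq[i + 1] if i + 1 < len(seq) else "N"
        probs = mutation_probs(left, base, right)
        bases = list(probs)
        weights = [probs[b] for b in bases]
        out.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(out)


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
ancestor = "ACGTACGCGT"
descendant = simulate(ancestor, rng)
```

In a learned model such as the one the abstract describes, the role of `mutation_probs` would be played by the network's per-position output, conditioned on a much wider window of flanking nucleotides.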

Identifiers

pubmed: 32657367
pii: 5870475
doi: 10.1093/bioinformatics/btaa447
pmc: PMC7355264

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

i353-i361

Copyright information

© The Author(s) 2020. Published by Oxford University Press.

Authors

Dongjoon Lim

School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada.

Mathieu Blanchette

School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada.
