Species-aware DNA language models capture regulatory elements and their evolution.


Journal

Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660

Publication information

Publication date:
02 Apr 2024
History:
received: 14 Aug 2023
accepted: 20 Mar 2024
medline: 3 Apr 2024
pubmed: 3 Apr 2024
entrez: 2 Apr 2024
Status: epublish

Abstract

The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery. Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.

Identifiers

pubmed: 38566111
doi: 10.1186/s13059-024-03221-x
pii: 10.1186/s13059-024-03221-x

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

83

Grants

Agency: Bundesministerium für Bildung und Forschung
ID: 031L0174A

Copyright information

© 2024. The Author(s).

Authors

Alexander Karollus (A)

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
Munich Center for Machine Learning, Munich, Germany.

Johannes Hingerl (J)

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.

Dennis Gankin (D)

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.

Martin Grosshauser (M)

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.

Kristian Klemon (K)

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.

Julien Gagneur (J)

School of Computation, Information and Technology, Technical University of Munich, Garching, Germany. gagneur@in.tum.de.
Munich Center for Machine Learning, Munich, Germany.
Institute of Human Genetics, School of Medicine and Health, Technical University of Munich, Munich, Germany.
Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany.
Munich Data Science Institute, Technical University of Munich, Garching, Germany.

MeSH classifications