Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers.
Deep learning
Enhancer
Gene expression
Promoter
Transcription
Variant effect
Journal
Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660
Publication information
Publication date:
27 03 2023
History:
received: 14 09 2022
accepted: 16 03 2023
medline: 29 03 2023
entrez: 27 03 2023
pubmed: 28 03 2023
Status:
epublish
Abstract
BACKGROUND
The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative: during training, the models are exposed solely to the sequence variation between human genes that arose through evolution, which calls into question the extent to which they capture genuine causal signals.
RESULTS
Here we confront predictions of state-of-the-art models of transcription regulation with data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large captures causal determinants of human promoters. However, the models fail to capture the causal effects of enhancers on expression, notably at medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression is small, and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases.
CONCLUSIONS
Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights, and we provide practical guidance on how to use them. Moreover, we foresee that training models that accurately account for distal elements will require significantly more data, and particularly new kinds of data.
Identifiers
pubmed: 36973806
doi: 10.1186/s13059-023-02899-9
pii: 10.1186/s13059-023-02899-9
pmc: PMC10045630
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
56
Copyright information
© 2023. The Author(s).