Benchmarking and building DNA binding affinity models using allele-specific and allele-agnostic transcription factor binding data.
Allele-specific binding
Biophysically interpretable machine learning
CTCF, EBF1, PU.1/SPI1
ChIP-seq, ChIP-exo, CUT&Tag
Gene expression regulation
Motif discovery
Non-coding variants
Statistical modeling
Transcription factors
Journal
Genome biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660
Publication information
Publication date: 31 Oct 2024
History:
received: 15 Dec 2023
accepted: 17 Oct 2024
medline: 1 Nov 2024
pubmed: 1 Nov 2024
entrez: 1 Nov 2024
Status: epublish
Abstract
Transcription factors (TFs) bind to DNA in a highly sequence-specific manner. This specificity manifests itself in vivo as differences in TF occupancy between the two alleles at heterozygous loci. Genome-scale assays such as ChIP-seq currently are limited in their power to detect allele-specific binding (ASB) both in terms of read coverage and representation of individual variants in the cell lines used. This makes prediction of allelic differences in TF binding from sequence alone desirable, provided that the reliability of such predictions can be quantitatively assessed. We here propose methods for benchmarking sequence-to-affinity models for TF binding in terms of their ability to predict allelic imbalances in ChIP-seq counts. We use a likelihood function based on an over-dispersed binomial distribution to aggregate evidence for allelic preference across the genome without requiring statistical significance for individual variants. This allows us to systematically compare predictive performance when multiple binding models for the same TF are available. To facilitate the de novo inference of high-quality models from paired-end in vivo binding data such as ChIP-seq, ChIP-exo, and CUT&Tag without read mapping or peak calling, we introduce an extensible reimplementation of our biophysically interpretable machine learning framework named PyProBound. Explicitly accounting for assay-specific bias in DNA fragmentation rate when training on ChIP-seq yields improved TF binding models. Moreover, we show how PyProBound can leverage our threshold-free ASB likelihood function to perform de novo motif discovery using allele-specific ChIP-seq counts. Our work provides new strategies for predicting the functional impact of non-coding variants.
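The threshold-free likelihood idea described in the abstract can be illustrated with a minimal sketch. It scores a binding model by summing beta-binomial (over-dispersed binomial) log-likelihoods of allelic read counts over all heterozygous variants, with no per-variant significance cutoff. Everything named here is an assumption made for illustration and is not taken from the paper or from the PyProBound API: the logistic link from predicted log-affinity difference to expected reference-allele fraction, the `scale` and `rho` parameters, and the toy counts.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function, computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, p, rho):
    """Log-probability of k reference reads out of n total reads under a
    beta-binomial with mean p and over-dispersion rho, 0 < rho < 1.
    (Hypothetical parameterization: a + b = (1 - rho) / rho.)"""
    a = p * (1.0 - rho) / rho
    b = (1.0 - p) * (1.0 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def asb_log_likelihood(variants, scale=1.0, rho=0.05):
    """Sum the log-likelihood over variants without thresholding any of them.

    Each variant is (ref_count, alt_count, delta_log_affinity), where
    delta_log_affinity is a model's predicted log(K_ref / K_alt).
    The logistic link below is an assumption of this sketch.
    """
    total = 0.0
    for ref_count, alt_count, delta in variants:
        p_ref = 1.0 / (1.0 + math.exp(-scale * delta))
        total += betabinom_logpmf(ref_count, ref_count + alt_count, p_ref, rho)
    return total


# Toy data (invented): allelic read counts and predicted affinity differences.
variants = [(12, 4, 1.1), (3, 9, -0.8), (7, 6, 0.0)]
ll_model = asb_log_likelihood(variants, scale=1.0)  # model-informed preference
ll_null = asb_log_likelihood(variants, scale=0.0)   # no allelic preference (p = 0.5)
print(ll_model - ll_null)  # log-likelihood gain of the model over the null
```

Under these assumptions, two candidate models for the same TF could be compared by computing this summed log-likelihood with each model's predicted affinity differences on the same set of variants; low-coverage variants still contribute, just with less weight.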
Identifiers
pubmed: 39482734
doi: 10.1186/s13059-024-03424-2
pii: 10.1186/s13059-024-03424-2
Chemical substances
Transcription Factors (registry number: 0)
DNA (registry number: 9007-49-2)
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
284
Grants
Agency: NIMH NIH HHS
ID: R01MH106842
Country: United States
Copyright information
© 2024. The Author(s).