Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome.
Journal
Genome Biology
ISSN: 1474-760X
Abbreviated title: Genome Biol
Country: England
NLM ID: 100960660
Publication information
Publication date: 30 March 2020
History:
received: 14 May 2019
accepted: 26 February 2020
entrez: 2 April 2020
pubmed: 2 April 2020
medline: 24 February 2021
Status: epublish
Abstract
The human epigenome has been experimentally characterized by thousands of measurements for every basepair in the human genome. We propose a deep neural network tensor factorization method, Avocado, that compresses this epigenomic data into a dense, information-rich representation. We use this learned representation to impute epigenomic data more accurately than previous methods, and we show that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, replication timing, and an element of 3D chromatin architecture.
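As a rough illustration of the idea in the abstract, the sketch below shows the forward pass of a neural-network tensor factorization over a (cell type, assay, genomic position) tensor. It is a toy under stated assumptions, not the Avocado implementation: the tensor sizes, the latent dimension, the three position scales, the single hidden layer, and the name `predict` are all hypothetical, and training (fitting the factors and weights to observed entries by gradient descent) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen only for illustration (assumptions, not values from the paper).
n_cell_types, n_assays, n_positions = 4, 3, 200
scales = (1, 10, 100)  # hypothetical bin widths for multi-scale position factors
dim, hidden = 8, 16    # latent dimension per factor, hidden layer width

# One table of latent factors per tensor axis; positions get one table per scale.
cell_factors = rng.normal(scale=0.1, size=(n_cell_types, dim))
assay_factors = rng.normal(scale=0.1, size=(n_assays, dim))
position_factors = [
    rng.normal(scale=0.1, size=(-(-n_positions // s), dim))  # ceil division
    for s in scales
]

# A small feed-forward network takes the place of the inner product
# used in classical tensor factorization.
n_inputs = dim * (2 + len(scales))
W1 = rng.normal(scale=0.1, size=(n_inputs, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 1))
b2 = np.zeros(1)


def predict(cell, assay, pos):
    """Predict signal for index arrays of cell type, assay, and position."""
    cell, assay, pos = np.broadcast_arrays(cell, assay, pos)
    features = [cell_factors[cell], assay_factors[assay]]
    # Each position is looked up once per scale, in the bin that contains it.
    features += [table[pos // s] for table, s in zip(position_factors, scales)]
    x = np.concatenate(features, axis=-1)
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return (h @ W2 + b2)[..., 0]


# Imputation amounts to evaluating the model at every index triple,
# including combinations that were never measured experimentally.
c, a, p = np.meshgrid(
    np.arange(n_cell_types), np.arange(n_assays), np.arange(n_positions),
    indexing="ij",
)
imputed = predict(c, a, p)
print(imputed.shape)
```

In this sketch the concatenated position factors play the role of the learned latent representation: once fitted, they could be passed as features to a separate model for a downstream task such as gene expression prediction.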
Identifiers
pubmed: 32228704
doi: 10.1186/s13059-020-01977-6
pii: 10.1186/s13059-020-01977-6
pmc: PMC7104480
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
81
Grants
Agency: NHGRI NIH HHS
ID: U01 HG009395
Country: United States
Agency: NIDDK NIH HHS
ID: U54 DK107979
Country: United States
Agency: NHGRI NIH HHS
ID: U41 HG007000
Country: United States
Comments and corrections
Type: ErratumIn