Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping.


Journal

Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
2020-07-15
History:
received: 2020-02-11
accepted: 2020-06-13
entrez: 2020-07-17
pubmed: 2020-07-17
medline: 2020-09-10
Status: epublish

Abstract

Predicting effects of gene regulatory elements (GREs) is a longstanding challenge in biology. Machine learning may address this, but requires large datasets linking GREs to their quantitative function. However, experimental methods to generate such datasets are either application-specific or technically complex and error-prone. Here, we introduce DNA-based phenotypic recording as a widely applicable, practicable approach to generate large-scale sequence-function datasets. We use a site-specific recombinase to directly record a GRE's effect in DNA, enabling readout of both sequence and quantitative function for extremely large GRE-sets via next-generation sequencing. We record translation kinetics of over 300,000 bacterial ribosome binding sites (RBSs) in >2.7 million sequence-function pairs in a single experiment. Further, we introduce a deep learning approach employing ensembling and uncertainty modelling that predicts RBS function with high accuracy, outperforming state-of-the-art methods. DNA-based phenotypic recording combined with deep learning represents a major advance in our ability to predict function from genetic sequence.
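The recording idea in the abstract, where a recombinase writes a GRE's effect into DNA so that sequencing reads out both sequence and function, can be illustrated with a minimal sketch. It assumes, hypothetically, that each sequencing read has already been reduced to a pair of (variant sequence, whether the recombinase substrate was found in its recombined state) and that the per-variant fraction of recombined reads serves as the functional readout. The function name `flipped_fractions`, the `min_reads` filter, and the toy reads are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def flipped_fractions(reads, min_reads=1):
    """Aggregate per-read recombination states into one value per variant.

    reads: iterable of (variant_sequence, is_flipped) tuples, where
           is_flipped is True if the read shows the recombined state.
    min_reads: hypothetical coverage filter; variants with fewer reads
               are dropped from the result.
    Returns a dict mapping variant_sequence -> fraction of flipped reads.
    """
    counts = defaultdict(lambda: [0, 0])  # variant -> [unflipped, flipped]
    for variant, is_flipped in reads:
        counts[variant][1 if is_flipped else 0] += 1

    fractions = {}
    for variant, (unflipped, flipped) in counts.items():
        total = unflipped + flipped
        if total >= min_reads:
            fractions[variant] = flipped / total
    return fractions


# Toy input: two made-up variants with four reads each.
example_reads = [
    ("AGGAGG", True), ("AGGAGG", True), ("AGGAGG", False), ("AGGAGG", True),
    ("TTTCCC", False), ("TTTCCC", False), ("TTTCCC", True), ("TTTCCC", False),
]
fractions = flipped_fractions(example_reads)
print(fractions)
```

In a real experiment the pairs would come from parsing sequencing reads, and a coverage threshold such as `min_reads` would likely matter because fractions estimated from few reads are noisy.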

Identifiers

pubmed: 32669542
doi: 10.1038/s41467-020-17222-4
pii: 10.1038/s41467-020-17222-4
pmc: PMC7363850

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

3551

Authors

Simon Höllerer (S)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.

Laetitia Papaxanthos (L)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.
Swiss Institute of Bioinformatics, 4058, Basel, Switzerland.

Anja Cathrin Gumpinger (AC)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.
Swiss Institute of Bioinformatics, 4058, Basel, Switzerland.

Katrin Fischer (K)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.

Christian Beisel (C)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland.

Karsten Borgwardt (K)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland. karsten.borgwardt@bsse.ethz.ch.
Swiss Institute of Bioinformatics, 4058, Basel, Switzerland. karsten.borgwardt@bsse.ethz.ch.

Yaakov Benenson (Y)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland. kobi.benenson@bsse.ethz.ch.

Markus Jeschek (M)

Department of Biosystems Science and Engineering, ETH Zurich, 4058, Basel, Switzerland. markus.jeschek@bsse.ethz.ch.
