Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping.
Binding Sites / genetics
Computational Biology / methods
Datasets as Topic
Deep Learning
Escherichia coli / genetics
Gene Knockout Techniques
Genome, Bacterial / genetics
High-Throughput Nucleotide Sequencing
Molecular Sequence Annotation / methods
Phenotype
Regulatory Sequences, Nucleic Acid / genetics
Ribosomes / metabolism
Sequence Analysis, DNA / methods
Journal
Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date: 15 07 2020
History:
received: 11 02 2020
accepted: 13 06 2020
entrez: 17 7 2020
pubmed: 17 7 2020
medline: 10 9 2020
Status: epublish
Abstract
Predicting effects of gene regulatory elements (GREs) is a longstanding challenge in biology. Machine learning may address this, but requires large datasets linking GREs to their quantitative function. However, experimental methods to generate such datasets are either application-specific or technically complex and error-prone. Here, we introduce DNA-based phenotypic recording as a widely applicable, practicable approach to generate large-scale sequence-function datasets. We use a site-specific recombinase to directly record a GRE's effect in DNA, enabling readout of both sequence and quantitative function for extremely large GRE-sets via next-generation sequencing. We record translation kinetics of over 300,000 bacterial ribosome binding sites (RBSs) in >2.7 million sequence-function pairs in a single experiment. Further, we introduce a deep learning approach employing ensembling and uncertainty modelling that predicts RBS function with high accuracy, outperforming state-of-the-art methods. DNA-based phenotypic recording combined with deep learning represents a major advance in our ability to predict function from genetic sequence.
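The abstract mentions ensembling and uncertainty modelling without giving details. As a minimal sketch of how such predictions are commonly combined (the generic deep-ensembles rule, not necessarily the authors' implementation), the snippet below merges per-model Gaussian predictions for one RBS into a single mean and variance. The function name `combine_ensemble` and the example numbers are hypothetical.

```python
from statistics import fmean


def combine_ensemble(means, variances):
    """Merge per-model Gaussian predictions into one mean and variance.

    Assumption: each ensemble member outputs a predicted mean and a
    predicted variance for the same input, and the ensemble is treated
    as a uniform mixture of those Gaussians.
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal in length")

    # Mixture mean: plain average of the member means.
    mu = fmean(means)

    # Mixture variance: average second moment minus squared mixture mean.
    # This adds the members' own variances to the spread between members.
    second_moment = fmean([v + m * m for m, v in zip(means, variances)])
    var = max(second_moment - mu * mu, 0.0)  # guard against rounding below zero
    return mu, var


# Hypothetical predictions from two models for a single sequence.
mu, var = combine_ensemble([0.2, 0.4], [0.01, 0.01])
```

The combined variance grows both when individual models are unsure and when they disagree with each other, which is why it can serve as a confidence signal for a predicted sequence-function value.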
Identifiers
pubmed: 32669542
doi: 10.1038/s41467-020-17222-4
pii: 10.1038/s41467-020-17222-4
pmc: PMC7363850
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
3551