Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping.

Binding Sites / genetics; Computational Biology / methods; Datasets as Topic; Deep Learning; Escherichia coli / genetics; Gene Knockout Techniques; Genome, Bacterial / genetics; High-Throughput Nucleotide Sequencing; Molecular Sequence Annotation / methods; Phenotype; Regulatory Sequences, Nucleic Acid / genetics; Ribosomes / metabolism; Sequence Analysis, DNA / methods

Journal

Nature communications

ISSN: 2041-1723

Abbreviated title: Nat Commun

Country: England

NLM ID: 101528555

Publication information

Publication date:
15 July 2020

History:

received: 11 February 2020

accepted: 13 June 2020

entrez: 17 July 2020

pubmed: 17 July 2020

medline: 10 September 2020

Status: epublish

Abstract

Predicting effects of gene regulatory elements (GREs) is a longstanding challenge in biology. Machine learning may address this, but requires large datasets linking GREs to their quantitative function. However, experimental methods to generate such datasets are either application-specific or technically complex and error-prone. Here, we introduce DNA-based phenotypic recording as a widely applicable, practicable approach to generate large-scale sequence-function datasets. We use a site-specific recombinase to directly record a GRE's effect in DNA, enabling readout of both sequence and quantitative function for extremely large GRE-sets via next-generation sequencing. We record translation kinetics of over 300,000 bacterial ribosome binding sites (RBSs) in >2.7 million sequence-function pairs in a single experiment. Further, we introduce a deep learning approach employing ensembling and uncertainty modelling that predicts RBS function with high accuracy, outperforming state-of-the-art methods. DNA-based phenotypic recording combined with deep learning represents a major advance in our ability to predict function from genetic sequence.
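As a rough illustration of the readout idea in the abstract, the sketch below turns sequencing reads into one quantitative value per variant. It assumes each read has already been classified as recombined or not, and it takes the fraction of recombined reads per variant as the recorded signal. The function name `flipped_fractions`, the data layout, and the toy reads are hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
from collections import defaultdict


def flipped_fractions(reads):
    """Return the fraction of recombined reads for each variant.

    reads: iterable of (variant_sequence, is_recombined) pairs,
    one pair per sequencing read (hypothetical input layout).
    """
    # variant -> [recombined read count, total read count]
    counts = defaultdict(lambda: [0, 0])
    for variant, is_recombined in reads:
        counts[variant][0] += int(is_recombined)
        counts[variant][1] += 1
    return {v: flipped / total for v, (flipped, total) in counts.items()}


# Toy reads, made up purely for illustration.
example_reads = [
    ("AGGAGG", True), ("AGGAGG", True), ("AGGAGG", False), ("AGGAGG", True),
    ("TTTACA", False), ("TTTACA", False), ("TTTACA", True), ("TTTACA", False),
]
fractions = flipped_fractions(example_reads)
print(fractions)
```

Because the sequence and the recorded state sit on the same DNA molecule, a single read yields both the variant identity and one observation of its function, which is why a simple per-variant tally suffices in this sketch.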

Identifiers

DOI: 10.1038/s41467-020-17222-4 PMID: 32669542 PMC: PMC7363850

pubmed: 32669542

doi: 10.1038/s41467-020-17222-4

pii: 10.1038/s41467-020-17222-4

pmc: PMC7363850

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

3551

References

Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).

pubmed: 27184599 doi: 10.1038/nrg.2016.49

Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nat. Methods 11, 499–507 (2014).

pubmed: 24781323 pmcid: 7098426 doi: 10.1038/nmeth.2918

Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

pubmed: 22609971 pmcid: 3374032 doi: 10.1038/nbt.2205

Mutalik, V. K. et al. Precise and reliable gene expression via standard transcription and translation initiation elements. Nat. Methods 10, 354–360 (2013).

pubmed: 23474465 doi: 10.1038/nmeth.2404

Fowler, D. M. et al. High-resolution mapping of protein sequence-function relationships. Nat. Methods 7, 741–746 (2010).

pubmed: 20711194 pmcid: 2938879 doi: 10.1038/nmeth.1492

Atwal, G. S. & Kinney, J. B. Learning quantitative sequence-function relationships from massively parallel experiments. J. Stat. Phys. 162, 1203–1243 (2016).

doi: 10.1007/s10955-015-1398-3

Raad, M., Modavi, C., Sukovich, D. J. & Anderson, J. C. Observing biosynthetic activity utilizing next generation sequencing and the DNA linked enzyme coupled assay. ACS Chem. Biol. 12, 191–199 (2017).

pubmed: 28103681 doi: 10.1021/acschembio.6b00652

Hertzberg, R. P. & Pope, A. J. High-throughput screening: new technology for the 21st century. Curr. Opin. Chem. Biol. 4, 445–451 (2000).

pubmed: 10959774 doi: 10.1016/S1367-5931(00)00110-1

Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, Cambridge, MA, 2016).

Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

pubmed: 26301843 pmcid: 4768299 doi: 10.1038/nmeth.3547

Park, Y. & Kellis, M. Deep learning for regulatory genomics. Nat. Biotechnol. 33, 825–826 (2015).

pubmed: 26252139 doi: 10.1038/nbt.3313

Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

pubmed: 26213851 doi: 10.1038/nbt.3300

Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).

pubmed: 30504886 pmcid: 6289068 doi: 10.1038/s41592-018-0229-2

Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).

pubmed: 30478442 doi: 10.1038/s41588-018-0295-5

Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

pubmed: 31942072 doi: 10.1038/s41586-019-1923-7

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

doi: 10.1038/nature14539 pubmed: 26017442

Camacho, D. M., Collins, K. M., Powers, R. K., Costello, J. C. & Collins, J. J. Next-generation machine learning for biological networks. Cell 173, 1581–1592 (2018).

pubmed: 29887378 doi: 10.1016/j.cell.2018.05.015

Kinney, J. B., Murugan, A., Callan, C. G. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).

pubmed: 20439748 doi: 10.1073/pnas.1004290107 pmcid: 2889059

Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).

pubmed: 23924614 doi: 10.1073/pnas.1301301110 pmcid: 3752251

Cambray, G., Guimaraes, J. C. & Arkin, A. P. Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nat. Biotechnol. 36, 1005–1015 (2018).

pubmed: 30247489 doi: 10.1038/nbt.4238

de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2019).

pubmed: 31792407 pmcid: 6954276 doi: 10.1038/s41587-019-0315-8

Peterman, N. & Levine, E. Sort-seq under the hood: implications of design choices on large-scale characterization of sequence-function relations. BMC Genomics 17, 206 (2016).

pubmed: 26956374 pmcid: 4784318 doi: 10.1186/s12864-016-2533-5

Sample, P. J. et al. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nat. Biotechnol. 37, 803–809 (2019).

pubmed: 31267113 pmcid: 7100133 doi: 10.1038/s41587-019-0164-5

Yus, E., Yang, J. S., Sogues, A. & Serrano, L. A reporter system coupled with high-throughput sequencing unveils key bacterial transcription and translation determinants. Nat. Commun. 8, 368 (2017).

pubmed: 28848232 pmcid: 5573727 doi: 10.1038/s41467-017-00239-7

Cuperus, J. T. et al. Deep learning of the regulatory grammar of yeast 5′ untranslated regions from 500,000 random sequences. Genome Res. 27, 2015–2024 (2017).

pubmed: 29097404 pmcid: 5741052 doi: 10.1101/gr.224964.117

Parekh, S., Ziegenhain, C., Vieth, B., Enard, W. & Hellmann, I. The impact of amplification on differential expression analyses by RNA-seq. Sci. Rep. 6, 25533 (2016).

pubmed: 27156886 pmcid: 4860583 doi: 10.1038/srep25533

Katayama, S. et al. Guide for library design and bias correction for large-scale transcriptome studies using highly multiplexed RNAseq methods. BMC Bioinformatics 20, 418 (2019).

pubmed: 31409293 pmcid: 6693229 doi: 10.1186/s12859-019-3017-9

Orban, P. C., Chui, D. & Marth, J. D. Tissue- and site-specific DNA recombination in transgenic mice. Proc. Natl Acad. Sci. USA 89, 6861–6865 (1992).

pubmed: 1495975 doi: 10.1073/pnas.89.15.6861 pmcid: 49604

Kaczmarczyk, S. J. & Green, J. E. A single vector containing modified cre recombinase and LOX recombination sequences for inducible tissue-specific amplification of gene expression. Nucleic Acids Res. 29, E56–E56 (2001).

pubmed: 11410679 pmcid: 55755 doi: 10.1093/nar/29.12.e56

Altier, C. & Suyemoto, M. A recombinase-based selection of differentially expressed bacterial genes. Gene 240, 99–106 (1999).

pubmed: 10564816 doi: 10.1016/S0378-1119(99)00427-8

Buchholz, F. & Stewart, A. F. Alteration of Cre recombinase site specificity by substrate-linked protein evolution. Nat. Biotechnol. 19, 1047–1052 (2001).

pubmed: 11689850 doi: 10.1038/nbt1101-1047

Kim, A. I. et al. Mycobacteriophage Bxb1 integrates into the Mycobacterium smegmatis groEL1 gene. Mol. Microbiol. 50, 463–473 (2003).

pubmed: 14617171 doi: 10.1046/j.1365-2958.2003.03723.x

Xu, Z. Y. et al. Accuracy and efficiency define Bxb1 integrase as the best of fifteen candidate serine recombinases for the integration of DNA into the human genome. BMC Biotechnol. 13, 87 (2013).

doi: 10.1186/1472-6750-13-87

Jusiak, B. et al. Comparison of integrases identifies Bxb1-GA mutant as the most efficient site-specific integrase system in mammalian cells. ACS Synth. Biol. 8, 16–24 (2019).

pubmed: 30609349 doi: 10.1021/acssynbio.8b00089

Lobner-Olesen, A., Skovgaard, O. & Marinus, M. G. Dam methylation: coordinating cellular processes. Curr. Opin. Microbiol. 8, 154–160 (2005).

pubmed: 15802246 doi: 10.1016/j.mib.2005.02.009

Southall, T. D. et al. Cell-type-specific profiling of gene expression and chromatin binding without cell isolation: assaying RNA Pol II occupancy in neural stem cells. Dev. Cell 26, 101–112 (2013).

pubmed: 23792147 pmcid: 3714590 doi: 10.1016/j.devcel.2013.05.020

Egan, S. M. & Schleif, R. F. A regulatory cascade in the induction of rhaBAD. J. Mol. Biol. 234, 87–98 (1993).

pubmed: 8230210 doi: 10.1006/jmbi.1993.1565

Laursen, B. S., Sorensen, H. P., Mortensen, K. K. & Sperling-Petersen, H. U. Initiation of protein synthesis in bacteria. Microbiol. Mol. Biol. Rev. 69, 101–123 (2005).

pubmed: 15755955 pmcid: 1082788 doi: 10.1128/MMBR.69.1.101-123.2005

Jeschek, M., Gerngross, D. & Panke, S. Combinatorial pathway optimization for streamlined metabolic engineering. Curr. Opin. Biotechnol. 47, 142–151 (2017).

pubmed: 28750202 doi: 10.1016/j.copbio.2017.06.014

Jervis, A. J. et al. Machine learning of designed translational control allows predictive pathway optimization in Escherichia coli. ACS Synth. Biol. 8, 127–136 (2019).

pubmed: 30563328 doi: 10.1021/acssynbio.8b00398

Salis, H. M., Mirsky, E. A. & Voigt, C. A. Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950 (2009).

pubmed: 19801975 pmcid: 2782888 doi: 10.1038/nbt.1568

Na, D. & Lee, D. RBSDesigner: software for designing synthetic ribosome binding sites that yields a desired level of protein expression. Bioinformatics 26, 2633–2634 (2010).

pubmed: 20702394 doi: 10.1093/bioinformatics/btq458

Seo, S. W. et al. Predictive design of mRNA translation initiation region to control prokaryotic translation efficiency. Metab. Eng. 15, 67–74 (2013).

pubmed: 23164579 doi: 10.1016/j.ymben.2012.10.006

Borujeni, A. E., Channarasappa, A. S. & Salis, H. M. Translation rate is controlled by coupled trade-offs between site accessibility, selective RNA unfolding and sliding at upstream standby sites. Nucleic Acids Res. 42, 2646–2659 (2014).

doi: 10.1093/nar/gkt1139

Farasat, I. et al. Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria. Mol. Syst. Biol. 10, 731 (2014).

Jeschek, M., Gerngross, D. & Panke, S. Rationally reduced libraries for combinatorial pathway optimization minimizing experimental effort. Nat. Commun. 7, 11163 (2016).

pubmed: 27029461 pmcid: 4821882 doi: 10.1038/ncomms11163

Reeve, B., Hargest, T., Gilbert, C. & Ellis, T. Predicting translation initiation rates for designing synthetic biology. Front. Bioeng. Biotechnol. 2, 1–6 (2014).

pubmed: 25152877 pmcid: 4126478 doi: 10.3389/fbioe.2014.00001

Vigar, J. R. J. & Wieden, H. J. Engineering bacterial translation initiation—do we have all the tools we need? Biochim. Biophys. Acta 1861, 3060–3069 (2017).

doi: 10.1016/j.bbagen.2017.03.008

He, K. M., Zhang, X. Y., Ren, S. Q. & Sun, J. in Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).

Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500 (2017).

LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551 (1989).

doi: 10.1162/neco.1989.1.4.541

Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2 (Springer, New York, 2009).

doi: 10.1007/978-0-387-84858-7

Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46, 175–185 (1992).

Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

doi: 10.1023/A:1010933404324

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).

doi: 10.1214/aos/1013203451

Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inform. Process. Syst. 30, 6402–6413 (2017).

Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. Proc. 34th Int. Conf. Mach. Learn. 70, 3319–3328 (2017).

Lapique, N. & Benenson, Y. Genetic programs can be compressed and autonomously decompressed in live cells. Nat. Nanotechnol. 13, 309–315 (2018).

pubmed: 29133926 doi: 10.1038/s41565-017-0004-z

Roquet, N., Soleimany, A. P., Ferris, A. C., Aaronson, S. & Lu, T. K. Synthetic recombinase-based state machines in living cells. Science 353, aad8559 (2016).

pubmed: 27463678 doi: 10.1126/science.aad8559

Kudla, G., Murray, A. W., Tollervey, D. & Plotkin, J. B. Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258 (2009).

pubmed: 19359587 pmcid: 3902468 doi: 10.1126/science.1170160

Jeschek, M. et al. Biotin-independent strains of Escherichia coli for enhanced streptavidin production. Metab. Eng. 40, 33–40 (2017).

pubmed: 28062280 doi: 10.1016/j.ymben.2016.12.013

Martinez-Garcia, E., Aparicio, T., Goni-Moreno, A., Fraile, S. & de Lorenzo, V. SEVA 2.0: an update of the Standard European Vector Architecture for de-/re-construction of bacterial functionalities. Nucleic Acids Res. 43, D1183–D1189 (2015).

pubmed: 25392407 doi: 10.1093/nar/gku1114

Datsenko, K. A. & Wanner, B. L. One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proc. Natl Acad. Sci. USA 97, 6640–6645 (2000).

pubmed: 10829079 doi: 10.1073/pnas.120163297 pmcid: 18686

Perez-Cruz, F. Estimation of information theoretic measures for continuous random variables. Adv. Neural Inform. Process. Syst. 21, 1257–1264 (2009).

Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

pubmed: 29588361 pmcid: 5932613 doi: 10.1101/gr.227819.117

Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc. 32nd Int. Conf. Mach. Learn. 37, 448–456 (2015).

Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2015).

Abadi, M. et al. in Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (2016). https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).

Authors

Simon Höllerer (S)

Laetitia Papaxanthos (L)

Anja Cathrin Gumpinger (AC)

Katrin Fischer (K)

Christian Beisel (C)

Karsten Borgwardt (K)

Yaakov Benenson (Y)

Markus Jeschek (M)
