Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree.


Journal

Nature biotechnology
ISSN: 1546-1696
Abbreviated title: Nat Biotechnol
Country: United States
NLM ID: 9604648

Publication information

Publication date:
20 Apr 2023
History:
received: 18 04 2022
accepted: 16 03 2023
medline: 21 4 2023
pubmed: 21 4 2023
entrez: 20 04 2023
Status: aheadofprint

Abstract

Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10-100 times faster than assembly-based approaches and in most cases more accurate, the exception being when sequencing coverage is high and reference species are very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.

Identifiers

pubmed: 37081138
doi: 10.1038/s41587-023-01753-4
pii: 10.1038/s41587-023-01753-4

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: UM1HG008898
Agency: U.S. Department of Health & Human Services | NIH | National Institute of Allergy and Infectious Diseases (NIAID)
ID: 1U19AI144297
Agency: Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Swiss National Science Foundation)
ID: 183723
Agency: Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Swiss National Science Foundation)
ID: 205085

Comments and corrections

Type: UpdateOf

Copyright information

© 2023. The Author(s).

Authors

David Dylus (D)

Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
F. Hoffmann-La Roche Ltd, Immunology, Infectious Disease, and Ophthalmology (I2O), Roche Pharmaceutical Research and Early Development (pRED), Basel, Switzerland.

Adrian Altenhoff (A)

SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Department of Computer Science, ETH, Zurich, Switzerland.

Sina Majidian (S)

Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.

Fritz J Sedlazeck (FJ)

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA. Fritz.Sedlazeck@bcm.edu.
Department of Computer Science, Rice University, Houston, TX, USA. Fritz.Sedlazeck@bcm.edu.

Christophe Dessimoz (C)

Department of Computational Biology, University of Lausanne, Lausanne, Switzerland. Christophe.Dessimoz@unil.ch.
SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland. Christophe.Dessimoz@unil.ch.
Department of Computer Science, University College London, London, UK. Christophe.Dessimoz@unil.ch.
Centre for Life's Origins and Evolution, Department of Genetics, Evolution and Environment, University College London, London, UK. Christophe.Dessimoz@unil.ch.

MeSH classifications