Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree.
Journal
Nature biotechnology
ISSN: 1546-1696
Titre abrégé: Nat Biotechnol
Pays: United States
ID NLM: 9604648
Informations de publication
Date de publication:
20 Apr 2023
20 Apr 2023
Historique:
received:
18
04
2022
accepted:
16
03
2023
medline:
21
4
2023
pubmed:
21
4
2023
entrez:
20
04
2023
Statut:
aheadofprint
Résumé
Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10-100 times faster than assembly-based approaches and in most cases more accurate-the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.
Identifiants
pubmed: 37081138
doi: 10.1038/s41587-023-01753-4
pii: 10.1038/s41587-023-01753-4
doi:
Types de publication
Journal Article
Langues
eng
Sous-ensembles de citation
IM
Subventions
Organisme : U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID : UM1HG008898
Organisme : U.S. Department of Health & Human Services | NIH | National Institute of Allergy and Infectious Diseases (NIAID)
ID : 1U19AI144297
Organisme : Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Swiss National Science Foundation)
ID : 183723
Organisme : Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Swiss National Science Foundation)
ID : 205085
Commentaires et corrections
Type : UpdateOf
Informations de copyright
© 2023. The Author(s).
Références
Woese, C. R. & Fox, G. E. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
pubmed: 270744
pmcid: 432104
doi: 10.1073/pnas.74.11.5088
Ciccarelli, F. D. et al. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287 (2006).
pubmed: 16513982
doi: 10.1126/science.1123061
Williams, T. A., Foster, P. G., Cox, C. J. & Embley, T. M. An archaeal origin of eukaryotes supports only two primary domains of life. Nature 504, 231–236 (2013).
pubmed: 24336283
doi: 10.1038/nature12779
Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
pubmed: 27572647
doi: 10.1038/nmicrobiol.2016.48
Abbosh, C. et al. Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature 545, 446–451 (2017).
pubmed: 28445469
pmcid: 5812436
doi: 10.1038/nature22364
McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).
pubmed: 27229144
pmcid: 4967023
doi: 10.1126/science.aaf7907
Hadfield, J. et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34, 4121–4123 (2018).
pubmed: 29790939
pmcid: 6247931
doi: 10.1093/bioinformatics/bty407
Eisen, J. A. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8, 163–167 (1998).
pubmed: 9521918
doi: 10.1101/gr.8.3.163
Gaudet, P., Livstone, M. S., Lewis, S. E. & Thomas, P. D. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Brief. Bioinform. 12, 449–462 (2011).
pubmed: 21873635
pmcid: 3178059
doi: 10.1093/bib/bbr042
Zeng, L. et al. Resolution of deep angiosperm phylogeny using conserved nuclear genes and estimates of early divergence times. Nat. Commun. 5, 4956 (2014).
pubmed: 25249442
doi: 10.1038/ncomms5956
Delsuc, F., Tsagkogeorga, G., Lartillot, N. & Philippe, H. Additional molecular support for the new chordate phylogeny. Genesis 46, 592–604 (2008).
pubmed: 19003928
doi: 10.1002/dvg.20450
Telford, M. J., Bourlat, S. J., Economou, A., Papillon, D. & Rota-Stabelli, O. The evolution of the Ecdysozoa. Philos. Trans. R. Soc. Lond. B 363, 1529–1537 (2008).
doi: 10.1098/rstb.2007.2243
Philippe, H., Lartillot, N. & Brinkmann, H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. 22, 1246–1253 (2005).
pubmed: 15703236
doi: 10.1093/molbev/msi111
Fernández, R., Edgecombe, G. D. & Giribet, G. Exploring phylogenetic relationships within myriapoda and the effects of matrix composition and occupancy on phylogenomic reconstruction. Syst. Biol. 65, 871–889 (2016).
pubmed: 27162151
pmcid: 4997009
doi: 10.1093/sysbio/syw041
Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
pubmed: 27184599
doi: 10.1038/nrg.2016.49
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
pubmed: 34050336
pmcid: 8161719
doi: 10.1038/s41576-021-00367-3
Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).
pubmed: 32424311
doi: 10.1038/s41576-020-0233-0
Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).
pubmed: 29599501
doi: 10.1038/s41576-018-0003-4
Lewin, H. A. et al. Earth BioGenome Project: sequencing life for the future of life. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).
pubmed: 29686065
pmcid: 5924910
doi: 10.1073/pnas.1720115115
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
pubmed: 23329690
pmcid: 3603318
doi: 10.1093/molbev/mst010
Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2017).
pmcid: 5850278
doi: 10.1093/molbev/msx319
Altenhoff, A. M., Schneider, A., Gonnet, G. H. & Dessimoz, C. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res. 39, D289–D294 (2011).
pubmed: 21113020
doi: 10.1093/nar/gkq1238
Altenhoff, A. M. et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43, D240–D249 (2015).
pubmed: 25399418
doi: 10.1093/nar/gku1158
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
pubmed: 25371430
doi: 10.1093/molbev/msu300
Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).
pubmed: 33397413
pmcid: 7780692
doi: 10.1186/s13059-020-02229-3
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Luo, R. et al. Erratum: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 4, 30 (2015).
pubmed: 26161257
pmcid: 4496933
doi: 10.1186/s13742-015-0069-2
Altenhoff, A. M. et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 29, 1152–1163 (2019).
pubmed: 31235654
pmcid: 6633268
doi: 10.1101/gr.243212.118
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
pubmed: 27323842
pmcid: 4915045
doi: 10.1186/s13059-016-0997-x
Shen, X.-X. et al. Tempo and mode of genome evolution in the budding yeast subphylum. Cell https://doi.org/10.1016/j.cell.2018.10.023 (2018).
doi: 10.1016/j.cell.2018.10.023
pubmed: 30500530
pmcid: 6291210
Stavrou, A. A., Mixão, V., Boekhout, T. & Gabaldón, T. Misidentification of genome assemblies in public databases: the case of Naumovozyma dairenensis and proposal of a protocol to correct misidentifications. Yeast 35, 425–429 (2018).
pubmed: 29320804
doi: 10.1002/yea.3303
Stavrou, A. A., Mixão, V., Boekhout, T. & Gabaldón, T. Misidentification of genome assemblies in public databases: the case of Naumovozyma dairenensisand proposal of a protocol to correct misidentifications. Yeast 35, 425–429 (2018).
Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020).
pubmed: 32015507
pmcid: 7095418
doi: 10.1038/s41586-020-2012-7
Li, B. et al. Discovery of bat coronaviruses through surveillance and probe capture-based next-generation sequencing. mSphere 5, e00807–e00819 (2020).
pubmed: 31996413
pmcid: 6992374
Kwok, K. T. T. et al. Genome sequence of a Minacovirus strain from a farmed mink in the Netherlands. Microbiol. Resour. Announc. 10, e01451–20 (2021).
pubmed: 33632868
pmcid: 7909093
doi: 10.1128/MRA.01451-20
Wu, F. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020).
pubmed: 32015508
pmcid: 7094943
doi: 10.1038/s41586-020-2008-3
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
pubmed: 20224823
pmcid: 2835736
doi: 10.1371/journal.pone.0009490
Woo, P. C. Y., Lau, S. K. P., Huang, Y. & Yuen, K.-Y. Coronavirus diversity, phylogeny and interspecies jumping. Exp. Biol. Med. 234, 1117–1127 (2009).
doi: 10.3181/0903-MR-94
Hodcroft, E. B. et al. Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature 591, 30–33 (2021).
pubmed: 33649511
doi: 10.1038/d41586-021-00525-x
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
pubmed: 33526886
pmcid: 7961889
doi: 10.1038/s41592-020-01056-5
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
pubmed: 32686750
pmcid: 7483855
doi: 10.1038/s41587-020-0503-6
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
pubmed: 33911273
pmcid: 8081667
doi: 10.1038/s41586-021-03451-0
Miga, K. H. & Wang, T. The need for a human pangenome reference sequence. Annu. Rev. Genomics Hum. Genet. 22, 81–102 (2021).
pubmed: 33929893
pmcid: 8410644
doi: 10.1146/annurev-genom-120120-081921
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
pubmed: 35357919
pmcid: 9186530
doi: 10.1126/science.abj6987
Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360, eaar6343 (2018).
pubmed: 29880660
pmcid: 6178954
doi: 10.1126/science.aar6343
Choi, B. et al. Identifying genetic markers for a range of phylogenetic utility—from species to family level. PLoS ONE 14, e0218995 (2019).
pubmed: 31369563
pmcid: 6675087
doi: 10.1371/journal.pone.0218995
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
Fernández, R., Gabaldon, T. & Dessimoz, C. Orthology: definitions, prediction, and impact on species phylogeny inference. Phylogenetics in the Genomic Era 1–568, 78-2-9575069-0-3. hal-02535070v3; https://hal.science/hal-02535070v3/file/book_hyperef_v2_ISBN.pdf (2020).
Natsidis, P., Kapli, P., Schiffer, P. H. & Telford, M. J. Systematic errors in orthology inference and their effects on evolutionary analyses. iScience 24, 102110 (2021).
pubmed: 33659875
pmcid: 7892920
doi: 10.1016/j.isci.2021.102110
Kapli, P. et al. Lack of support for Deuterostomia prompts reinterpretation of the first Bilateria. Sci. Adv. 7, eabe2741 (2021).
pubmed: 33741592
pmcid: 7978419
doi: 10.1126/sciadv.abe2741
Graham, E. D., Heidelberg, J. F. & Tully, B. J. BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation. PeerJ 5, e3035 (2017).
pubmed: 28289564
pmcid: 5345454
doi: 10.7717/peerj.3035
Lu, Y. Y., Chen, T., Fuhrman, J. A. & Sun, F. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge. Bioinformatics 33, 791–798 (2017).
pubmed: 27256312
doi: 10.1093/bioinformatics/btw290
Popic, V., Kuleshov, V., Snyder, M. & Batzoglou, S. Fast metagenomic binning via hashing and Bayesian clustering. J. Comput. Biol. 25, 677–688 (2018).
pubmed: 29658784
doi: 10.1089/cmb.2017.0250
DeMaere, M. Z. & Darling, A. E. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes (MAGs). Genome Biol. 20, 46 (2019).
pubmed: 30808380
pmcid: 6391755
doi: 10.1186/s13059-019-1643-1
Marbouty, M., Baudry, L., Cournac, A. & Koszul, R. Scaffolding bacterial genomes and probing host-virus interactions in gut microbiome by proximity ligation (chromosome capture) assay. Sci. Adv. 3, e1602105 (2017).
pubmed: 28232956
pmcid: 5315449
doi: 10.1126/sciadv.1602105
Xu, Y. & Zhao, F. Single-cell metagenomics: challenges and applications. Protein Cell 9, 501–510 (2018).
pubmed: 29696589
pmcid: 5960468
doi: 10.1007/s13238-018-0544-5
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. TimeTree: a resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 34, 1812–1819 (2017).
pubmed: 28387841
doi: 10.1093/molbev/msx116
Sedlazeck, F. J., Rescheneder, P. & von Haeseler, A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics 29, 2790–2791 (2013).
pubmed: 23975764
doi: 10.1093/bioinformatics/btt468
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
pubmed: 33590861
pmcid: 7931819
doi: 10.1093/gigascience/giab008
Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 49, D373–D379 (2021).
pubmed: 33174605
doi: 10.1093/nar/gkaa1007
Dylus, D., Altenhoff, A. & Majidian, S. Jupyter notebooks and scripts for the Read2Tree paper. GitHub https://github.com/dvdylus/read2tree_paper (2023).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).
pubmed: 21572440
pmcid: 3571712
doi: 10.1038/nbt.1883
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
pubmed: 26921390
pmcid: 4868116
doi: 10.1093/molbev/msw046
Galili, T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31, 3718–3720 (2015).
pubmed: 26209431
pmcid: 4817050
doi: 10.1093/bioinformatics/btv428
Robinson, O., Dylus, D. & Dessimoz, C. Phylo.io: interactive viewing and comparison of large phylogenetic trees on the web. Mol. Biol. Evol. 33, 2163–2166 (2016).
pubmed: 27189561
pmcid: 4948708
doi: 10.1093/molbev/msw080
Dalquen, D. A., Anisimova, M., Gonnet, G. H. & Dessimoz, C. ALF—a simulation framework for genome evolution. Mol. Biol. Evol. 29, 1115–1123 (2011).
pubmed: 22160766
pmcid: 3341827
doi: 10.1093/molbev/msr268
Huang, W., Li, L., Myers, J. R. & Marth, G. T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
pubmed: 22199392
doi: 10.1093/bioinformatics/btr708
Simonsen, M., Mailund, T. & Pedersen, C. N. S. in Algorithms in Bioinformatics 113–122 (Springer Berlin Heidelberg, 2008)
Dylus, D., Altenhoff, A. & Majidian, S. Read2Tree: a tool for inferring species tree from sequencing reads. GitHub https://github.com/DessimozLab/read2tree (2023).