Does clustering of DNA barcodes agree with botanical classification directly at high taxonomic levels? Trees in French Guiana as a case study.
French Guianan trees
Ward method
barcoding
clustering
stochastic block model
taxonomy
Journal
Molecular ecology resources
ISSN: 1755-0998
Abbreviated title: Mol Ecol Resour
Country: England
NLM ID: 101465604
Publication information
Publication date:
Jul 2022
History:
received: 4 Feb 2021
accepted: 16 Dec 2021
pubmed: 8 Jan 2022
medline: 9 Jun 2022
entrez: 7 Jan 2022
Status: ppublish
Abstract
Characterizing biodiversity is one of the main challenges for the coming decades. Most diversity has not been morphologically described, and barcoding now complements morphology-based taxonomy to further develop inventories. Both approaches have been cross-validated at the level of species and OTUs. However, many known species are not listed in reference databases. One way to speed up barcoding-based inventories is to identify individuals directly at coarser taxonomic levels. We therefore studied, for plant barcoding, whether morphology-based and molecular-based classifications agree at the genus, family and order levels. We used Agglomerative Hierarchical Clustering (with Ward, Complete and Single Linkage) and Stochastic Block Models (SBM), with two dissimilarity measures (one based on Smith-Waterman scores, one on k-mers). In most cases, the agreement between morphology-based and molecular-based classifications ranges from good to very good at taxonomic levels above species, although it decreases as the taxonomic level rises and when the tetramer-based distance is used. Agreement is correlated with the entropy of the morphology-based classification and with the ratio of the mean within-group to the mean between-group dissimilarity. Overall, the Ward method gives the best agreement, whereas Single Linkage can perform poorly. SBM provides a useful tool to test whether the dissimilarities are structured by the botanical groups. These results suggest that automatic clustering and group identification at taxonomic levels above species are possible in barcoding.
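The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy sequences and group labels are invented, the Euclidean distance between tetramer frequency profiles is an assumed stand-in for the k-mer dissimilarity, and the naive agglomeration covers only Complete and Single Linkage (Ward and SBM are left out for brevity). All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def kmer_profile(seq, k=4):
    """Relative frequencies of the overlapping k-mers of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_dissimilarity(p, q):
    """Euclidean distance between two k-mer profiles (assumed choice)."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def dissimilarity_matrix(seqs, k=4):
    """Symmetric matrix of pairwise k-mer dissimilarities."""
    profiles = [kmer_profile(s, k) for s in seqs]
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = kmer_dissimilarity(profiles[i], profiles[j])
    return d


def agglomerate(d, n_clusters, linkage=max):
    """Naive agglomerative clustering: linkage=max is Complete, min is Single."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage dissimilarity.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(
                d[i][j] for i in clusters[ab[0]] for j in clusters[ab[1]]
            ),
        )
        merged = clusters.pop(b)  # a < b, so index a stays valid
        clusters[a].extend(merged)
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def entropy(labels):
    """Shannon entropy (natural log) of a classification."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def within_between_ratio(d, labels):
    """Mean within-group dissimilarity divided by mean between-group one."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(d[i][j])
    return (sum(within) / len(within)) / (sum(between) / len(between))


# Invented toy data: two AT-rich and two GC-rich sequences, two made-up groups.
seqs = ["ATATATATATTA", "ATATTATATATA", "GCGCGCGGCGCC", "GCGGCGCGCGCG"]
genera = ["A", "A", "B", "B"]

d = dissimilarity_matrix(seqs, k=4)
labels_complete = agglomerate(d, n_clusters=2, linkage=max)
labels_single = agglomerate(d, n_clusters=2, linkage=min)
print(labels_complete, labels_single, entropy(genera), within_between_ratio(d, genera))
```

A ratio well below 1 means the groups are compact relative to their separation, which is the situation in which the abstract reports that clustering agrees best with the botanical classification.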
Identifiers
pubmed: 34995403
doi: 10.1111/1755-0998.13579
Databases
RefSeq
KX247940, KX249593
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
1746-1761
Grants
Organization: INRAE
Organization: Agence Nationale de la Recherche
ID: ANR-10-LABX-25-01
Organization: Institut national de recherche en informatique et en automatique (INRIA)
Copyright information
© 2022 John Wiley & Sons Ltd.