Combining natural language processing and metabarcoding to reveal pathogen-environment associations.


Journal

PLoS neglected tropical diseases
ISSN: 1935-2735
Abbreviated title: PLoS Negl Trop Dis
Country: United States
NLM ID: 101291488

Publication information

Publication date:
04 2021
History:
received: 27 08 2020
accepted: 09 03 2021
revised: 19 04 2021
pubmed: 8 4 2021
medline: 29 7 2021
entrez: 7 4 2021
Status: epublish

Abstract

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths, mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen, we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to perform de novo topic modeling on sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach (searching single-locus metagenetic data, gathering papers connected to the datasets, determining the topics and the number of topics de novo, and assigning articles to the topics) illustrates how such an analysis pipeline can harness large-scale datasets that are published and available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assess potential evidence of environmental associations.
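
The abstract states that the number of topics was chosen from the model coherence score but does not say which coherence measure was used. As an illustration only, the sketch below computes the UMass coherence (one common choice, assumed here) for hand-written candidate topics over a toy corpus and picks the topic count with the highest mean score. The corpus, the candidate word lists, and the names `umass_coherence`, `docs`, and `candidates` are hypothetical and are not taken from the paper; in a real pipeline the word lists would come from fitted Latent Dirichlet Allocation models.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs with l < m,
    where D counts the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # guarantees l < m
        w_l, w_m = top_words[l], top_words[m]
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(w_m, w_l) + 1) / d_l)
    return score


# Toy tokenized corpus (hypothetical, for illustration only).
docs = [
    ["soil", "wood", "decay", "fungi"],
    ["soil", "wood", "fungi"],
    ["marine", "plankton", "water"],
    ["marine", "water", "sediment"],
]

# Hypothetical top-word lists for models with 2 and 3 topics.
candidates = {
    2: [["soil", "wood", "fungi"], ["marine", "water", "plankton"]],
    3: [
        ["soil", "water", "fungi"],
        ["marine", "wood", "plankton"],
        ["decay", "sediment", "water"],
    ],
}

# Mean coherence per candidate topic count; higher is better.
mean_scores = {
    k: sum(umass_coherence(t, docs) for t in topics) / len(topics)
    for k, topics in candidates.items()
}
best_k = max(mean_scores, key=mean_scores.get)
```

Topics whose top words co-occur in the same documents score higher, so on this toy data the two-topic candidate is preferred over the three-topic one.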

Identifiers

pubmed: 33826634
doi: 10.1371/journal.pntd.0008755
pii: PNTD-D-20-01541
pmc: PMC8055023

Chemical substances

RNA, Ribosomal, 18S 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e0008755

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

David C Molik (DC)

Department of Biological Sciences, University of Notre Dame, Notre Dame, Indiana, United States of America.
Navari Center for Digital Scholarship, University of Notre Dame, Notre Dame, Indiana, United States of America.

DeAndre Tomlinson (D)

Science-Computing Program, University of Notre Dame, Notre Dame, Indiana, United States of America.

Shane Davitt (S)

Department of Biological Sciences, University of Notre Dame, Notre Dame, Indiana, United States of America.

Eric L Morgan (EL)

Navari Center for Digital Scholarship, University of Notre Dame, Notre Dame, Indiana, United States of America.

Matthew Sisk (M)

Navari Center for Digital Scholarship, University of Notre Dame, Notre Dame, Indiana, United States of America.

Benjamin Roche (B)

Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America.

Natalie Meyers (N)

Navari Center for Digital Scholarship, University of Notre Dame, Notre Dame, Indiana, United States of America.

Michael E Pfrender (ME)

Department of Biological Sciences, University of Notre Dame, Notre Dame, Indiana, United States of America.
