Systematic tissue annotations of genomics samples by modeling unstructured metadata.


Journal

Nature communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555

Publication information

Publication date:
2022-11-08
History:
received: 2021-06-09
accepted: 2022-10-25
entrez: 2022-11-08
pubmed: 2022-11-09
medline: 2022-11-11
Status: epublish

Abstract

There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples in this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
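
The general idea described in the abstract (turn each free-text sample description into a numerical vector, then train a supervised classifier per tissue term) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy descriptions are invented, TF-IDF weighting and plain logistic regression stand in for whatever representation and classifier the authors actually use, and none of the function names come from the txt2onto repository.

```python
import math
import re
from collections import Counter

# Invented toy descriptions (NOT from the paper's data); label 1 = "heart".
TRAIN = [
    ("total RNA from left ventricle biopsy of heart failure patient", 1),
    ("cardiac tissue, atrial appendage, adult donor", 1),
    ("heart muscle sample collected during surgery", 1),
    ("hepatocytes isolated from liver resection", 0),
    ("whole blood from healthy volunteer", 0),
    ("frontal cortex brain tissue, post mortem", 0),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def embed(text, idf):
    """Unit-length sparse TF-IDF vector; tokens unseen in training are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, weights, bias):
    """Linear score of a sparse vector under the current model."""
    return bias + sum(weights.get(t, 0.0) * v for t, v in x.items())

def train(samples, idf, epochs=200, lr=0.5):
    """Binary logistic regression fitted by stochastic gradient descent."""
    weights, bias = {}, 0.0
    feats = [(embed(text, idf), y) for text, y in samples]
    for _ in range(epochs):
        for x, y in feats:
            err = sigmoid(score(x, weights, bias)) - y
            bias -= lr * err
            for t, v in x.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias

def predict_proba(text, idf, weights, bias):
    """Estimated probability that a description refers to the positive tissue."""
    return sigmoid(score(embed(text, idf), weights, bias))

IDF = fit_idf([text for text, _ in TRAIN])
WEIGHTS, BIAS = train(TRAIN, IDF)

if __name__ == "__main__":
    for query in ("RNA-seq of human heart left ventricle",
                  "liver resection hepatocytes"):
        print(query, round(predict_proba(query, IDF, WEIGHTS, BIAS), 3))
```

In this sketch one binary model corresponds to one tissue term, so annotating a sample against many terms would mean training one such model per term and reporting the terms whose predicted probability is high. Because the input is only text, the same model applies regardless of the experiment type that produced the sample.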

Identifiers

pubmed: 36347858
doi: 10.1038/s41467-022-34435-x
pii: 10.1038/s41467-022-34435-x
pmc: PMC9643451

Publication types

Journal Article; Research Support, U.S. Gov't, Non-P.H.S.; Research Support, N.I.H., Extramural

Languages

eng

Citation subsets

IM

Pagination

6736

Grants

Agency: NIGMS NIH HHS
ID: R35 GM128765
Country: United States

Copyright information

© 2022. The Author(s).

Authors

Nathaniel T Hawkins (NT)

Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.

Marc Maldaver (M)

Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.

Anna Yannakopoulos (A)

Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.

Lindsay A Guare (LA)

Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA.
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA.
Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, 48824, USA.

Arjun Krishnan (A)

Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA. arjun.krishnan@cuanschutz.edu.
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA. arjun.krishnan@cuanschutz.edu.
Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA. arjun.krishnan@cuanschutz.edu.
