Systematic tissue annotations of genomics samples by modeling unstructured metadata.
Journal
Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date:
08 11 2022
History:
received: 09 06 2021
accepted: 25 10 2022
entrez: 8 11 2022
pubmed: 9 11 2022
medline: 11 11 2022
Status:
epublish
Abstract
There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto .
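The two steps described in the abstract (turn each free-text sample description into a numerical representation, then train a supervised classifier on it) can be illustrated with a minimal sketch. This is not the authors' code, which is available at the txt2onto repository linked above. It makes several assumptions: normalized bag-of-words counts stand in for the text representation, a small logistic regression without a bias term stands in for the classifier, and the sample descriptions and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    """Assign a feature index to every token seen in the training texts."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def featurize(text, vocab):
    """Sparse bag-of-words vector {index: relative frequency}.

    Tokens outside the vocabulary are ignored (an assumption of this sketch).
    """
    counts = Counter(tok for tok in tokenize(text) if tok in vocab)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vocab[tok]: c / total for tok, c in counts.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, weights):
    """Predicted probability of the positive class for a sparse vector."""
    return sigmoid(sum(weights[i] * v for i, v in x.items()))


def train_logreg(rows, labels, n_features, epochs=200, lr=1.0):
    """Logistic regression fitted by stochastic gradient descent.

    The bias term is omitted to keep the sketch short.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = score(x, weights) - y
            for i, v in x.items():
                weights[i] -= lr * err * v
    return weights


def predict_proba(text, vocab, weights):
    return score(featurize(text, vocab), weights)


# Hypothetical sample descriptions; label 1 = "liver", 0 = any other tissue.
TRAIN = [
    ("RNA-seq of hepatocytes isolated from adult liver biopsy", 1),
    ("liver tissue, hepatocellular carcinoma, male donor", 1),
    ("primary hepatocyte culture treated with vehicle", 1),
    ("whole blood collected from healthy volunteers", 0),
    ("cortical neurons derived from brain tissue", 0),
    ("lung epithelial cell line A549 untreated", 0),
]

texts = [t for t, _ in TRAIN]
labels = [y for _, y in TRAIN]
vocab = build_vocab(texts)
rows = [featurize(t, vocab) for t in texts]
weights = train_logreg(rows, labels, len(vocab))

liver_score = predict_proba("liver hepatocyte sample", vocab, weights)
blood_score = predict_proba("whole blood draw", vocab, weights)
print(f"liver-like text: {liver_score:.2f}, blood-like text: {blood_score:.2f}")
```

In this toy setup, one binary model is trained per tissue term, so annotating a sample with several candidate tissues would mean training and scoring one such model for each term.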
Identifiers
pubmed: 36347858
doi: 10.1038/s41467-022-34435-x
pii: 10.1038/s41467-022-34435-x
pmc: PMC9643451
Publication types
Journal Article
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, N.I.H., Extramural
Languages
eng
Citation subsets
IM
Pagination
6736
Grants
Agency: NIGMS NIH HHS
ID: R35 GM128765
Country: United States
Copyright information
© 2022. The Author(s).