Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition.
Journal
Database : the journal of biological databases and curation
ISSN: 1758-0463
Abbreviated title: Database (Oxford)
Country: England
NLM ID: 101517697
Publication information
Publication date:
29 04 2021
History:
received: 14 08 2020
revised: 11 03 2021
accepted: 16 04 2021
entrez: 29 04 2021
pubmed: 30 04 2021
medline: 29 10 2021
Status:
ppublish
Abstract
High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
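The abstract does not say how the phrase classifier is applied to the longer TITLE attribute. As a minimal sketch, assuming the TITLE is split into word n-grams and each n-gram is scored independently, the extraction step could look like the following. The names `ngrams`, `extract_metadata` and `toy_scorer`, the `threshold` parameter, and the category labels are all hypothetical; `toy_scorer` is a lookup table standing in for the trained recurrent network, not the authors' model.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical interface: a scorer maps a short phrase to {category: confidence}.
Scorer = Callable[[str], Dict[str, float]]


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every contiguous word n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n])


def extract_metadata(
    title: str, scorer: Scorer, max_n: int = 3, threshold: float = 0.5
) -> Dict[str, Tuple[str, float]]:
    """Keep, per category, the highest-scoring n-gram at or above the threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for phrase in ngrams(title.split(), max_n):
        for category, score in scorer(phrase).items():
            current = best.get(category, ("", 0.0))[1]
            if score >= threshold and score > current:
                best[category] = (phrase, score)
    return best


def toy_scorer(phrase: str) -> Dict[str, float]:
    """Stand-in for the trained classifier (assumption: illustrative values only)."""
    lookup = {
        "homo sapiens": ("genus/species", 0.99),
        "breast cancer": ("condition/disease", 0.95),
    }
    hit = lookup.get(phrase.lower())
    return {hit[0]: hit[1]} if hit else {}


result = extract_metadata("RNA-seq of Homo sapiens breast cancer tissue", toy_scorer)
```

Categories for which no n-gram reaches the threshold are simply absent from the result, which mirrors the "lack of predictions" the abstract reports for some categories.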
Identifiers
pubmed: 33914028
pii: 6259052
doi: 10.1093/database/baab021
pmc: PMC8083811
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Grants
Agency: NIH HHS
ID: DP5 OD017937
Country: United States
Agency: NIGMS NIH HHS
ID: P41 GM103504
Country: United States
Agency: NIGMS NIH HHS
ID: T32 GM008806
Country: United States
Agency: CIHR
ID: FL-000655
Country: Canada
Copyright information
© The Author(s) 2021. Published by Oxford University Press.