Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition.


Journal

Database : the journal of biological databases and curation
ISSN: 1758-0463
Abbreviated title: Database (Oxford)
Country: England
NLM ID: 101517697

Publication information

Publication date:
29 04 2021
History:
received: 14 08 2020
revised: 11 03 2021
accepted: 16 04 2021
entrez: 29 4 2021
pubmed: 30 4 2021
medline: 29 10 2021
Status: ppublish

Abstract

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
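The two-stage idea in the abstract (a classifier trained on short phrases, then applied to longer TITLE text) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe how TITLEs are split or scored, so the n-gram windowing, the confidence threshold, and the names `ngrams`, `toy_classifier` and `extract_metadata` are all assumptions made for illustration. The lookup table stands in for the trained recurrent network; only the category names come from the abstract.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every contiguous span of 1..max_n tokens as a phrase."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def toy_classifier(phrase: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the trained short-phrase classifier.

    Returns (category, confidence). A real model would score every phrase;
    this lookup only recognises three hard-coded examples.
    """
    lookup = {
        "homo sapiens": ("Genus/Species", 0.99),
        "breast cancer": ("Condition/Disease", 0.97),
        "c57bl/6": ("Strain", 0.95),
    }
    return lookup.get(phrase.lower(), ("None", 0.0))


def extract_metadata(
    title: str,
    classify: Callable[[str], Tuple[str, float]],
    max_n: int = 3,        # assumed maximum phrase length in tokens
    threshold: float = 0.9,  # assumed minimum confidence to accept a phrase
) -> Dict[str, str]:
    """Keep, per category, the highest-scoring phrase found in the title."""
    best: Dict[str, Tuple[str, float]] = {}
    for phrase in ngrams(title.split(), max_n):
        category, score = classify(phrase)
        if score >= threshold and score > best.get(category, ("", 0.0))[1]:
            best[category] = (phrase, score)
    return {category: phrase for category, (phrase, _) in best.items()}


result = extract_metadata(
    "RNA-seq of Homo sapiens breast cancer biopsy", toy_classifier
)
print(result)
```

Categories for which no phrase clears the threshold are simply absent from the result, which loosely mirrors the "lack of predictions" the abstract reports for some categories.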

Identifiers

pubmed: 33914028
pii: 6259052
doi: 10.1093/database/baab021
pmc: PMC8083811

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Grants

Agency: NIH HHS
ID: DP5 OD017937
Country: United States
Agency: NIGMS NIH HHS
ID: P41 GM103504
Country: United States
Agency: NIGMS NIH HHS
ID: T32 GM008806
Country: United States
Agency: CIHR
ID: FL-000655
Country: Canada

Copyright information

© The Author(s) 2021. Published by Oxford University Press.

Authors

Adam Klie (A)

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.

Brian Y Tsui (BY)

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.

Shamim Mollah (S)

Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.
Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA.
Department of Genetics, Washington University in St. Louis, St. Louis, MO 63130, USA.

Dylan Skola (D)

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.

Michelle Dow (M)

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA.

Chun-Nan Hsu (CN)

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
Department of Neurosciences, University of California San Diego, La Jolla, CA 92093, USA.

Hannah Carter (H)

Department of Medicine, Division of Medical Genetics, University of California San Diego, La Jolla, CA 92093, USA.
