Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition.

Deep Learning; High-Throughput Nucleotide Sequencing; Metadata; Reproducibility of Results; Software

Journal

Database : the journal of biological databases and curation

ISSN: 1758-0463

Abbreviated title: Database (Oxford)

Country: England

NLM ID: 101517697

Publication information

Publication date:
29 04 2021

History:

received: 14 08 2020

revised: 11 03 2021

accepted: 16 04 2021

entrez: 29 4 2021

pubmed: 30 4 2021

medline: 29 10 2021

Status: ppublish

Abstract

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.
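The step of applying a short-phrase classifier to a longer TITLE can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes candidate phrases are taken as word n-grams, and it replaces the trained recurrent network with a hypothetical lookup table (`TOY_LEXICON`) so the example runs on its own. The function names, the 0.5 threshold and the example title are all illustrative assumptions; only the category names come from the abstract.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a trained phrase classifier.
# The real model is a recurrent neural network; this table only
# mimics its interface: phrase -> (category, confidence score).
TOY_LEXICON: Dict[str, Tuple[str, float]] = {
    "homo sapiens": ("Genus/Species", 0.99),
    "breast cancer": ("Condition/Disease", 0.95),
}


def candidate_phrases(title: str, max_len: int = 3) -> List[str]:
    """Return all word n-grams of the title with 1 <= n <= max_len."""
    words = title.split()
    phrases = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases


def toy_classifier(phrase: str) -> Tuple[str, float]:
    """Look the phrase up in the toy lexicon; unknown phrases score 0."""
    return TOY_LEXICON.get(phrase.lower(), ("none", 0.0))


def extract_metadata(
    title: str,
    classify: Callable[[str], Tuple[str, float]] = toy_classifier,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
    max_len: int = 3,
) -> Dict[str, str]:
    """Keep, for each category, the highest-scoring phrase above threshold."""
    best: Dict[str, Tuple[str, float]] = {}
    for phrase in candidate_phrases(title, max_len):
        category, score = classify(phrase)
        if score < threshold:
            continue
        if category not in best or score > best[category][1]:
            best[category] = (phrase, score)
    return {category: phrase for category, (phrase, _) in best.items()}


if __name__ == "__main__":
    print(extract_metadata("RNA-seq of Homo sapiens breast cancer biopsy"))
```

Passing a different `classify` callable is the intended extension point: any model that maps a short phrase to a category and a score could be substituted for the lookup table.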

Identifiers

DOI: 10.1093/database/baab021 PMID: 33914028 PMC: PMC8083811

pii: 6259052

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't

Languages

eng

Grants

Agency: NIH HHS

ID: DP5 OD017937

Country: United States

Agency: NIGMS NIH HHS

ID: P41 GM103504

Country: United States

Agency: NIGMS NIH HHS

ID: T32 GM008806

Country: United States

Agency: CIHR

ID: FL-000655

Country: Canada

Authors

Adam Klie (A)

Brian Y Tsui (BY)

Shamim Mollah (S)

Dylan Skola (D)

Michelle Dow (M)

Chun-Nan Hsu (CN)

Hannah Carter (H)
