Detecting and correcting misclassified sequences in the large-scale public databases.
Journal
Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944
Publication information
Publication date:
15 09 2020
History:
received: 02 04 2020
revised: 10 06 2020
accepted: 16 06 2020
pubmed: 25 6 2020
medline: 4 3 2021
entrez: 25 6 2020
Status:
ppublish
Abstract
As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission, a process that is prone to user error. Unfortunately, most public databases, such as the non-redundant (NR) database, have no methods for identifying errors in the provided metadata, which allows errors to propagate. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified.
We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at 95% similarity. We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.
Supplementary data are available at Bioinformatics online.
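The abstract names the ingredients of the heuristic (annotation provenance, annotation frequency, and clusters at 95% similarity) without giving the procedure itself. The following is only a minimal illustrative sketch of how such ingredients could be combined, not the authors' implementation: the provenance weights, the `min_share` threshold, the flat genus-level labels, and all function names are assumptions made for this example. It also shows how precision and recall would be computed on simulated data where the true misclassifications are known.

```python
from collections import defaultdict

# Hypothetical provenance weights (an assumption, not from the paper):
# manually curated sources count more than computationally derived ones.
PROVENANCE_WEIGHT = {"manual": 2.0, "computational": 1.0}


def consensus_taxon(annotations):
    """Return (taxon, share) for the highest provenance-weighted score.

    `annotations` is a list of (taxon, provenance) pairs.
    """
    scores = defaultdict(float)
    for taxon, provenance in annotations:
        scores[taxon] += PROVENANCE_WEIGHT[provenance]
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def flag_misclassified(cluster, min_share=0.75):
    """Flag members that disagree with a strong cluster-wide consensus.

    `cluster` maps a protein ID to its (taxon, provenance) pairs; all
    members are assumed to belong to one 95%-similarity cluster.
    Returns {protein_id: suggested_taxon} for flagged members only.
    """
    pooled = [pair for pairs in cluster.values() for pair in pairs]
    best, share = consensus_taxon(pooled)
    if share < min_share:
        return {}  # consensus too weak to call anything misclassified
    flagged = {}
    for protein_id, pairs in cluster.items():
        own, _ = consensus_taxon(pairs)
        if own != best:
            flagged[protein_id] = best
    return flagged


def precision_recall(flagged_ids, truly_misclassified_ids):
    """Precision and recall of the flagged set against known truth."""
    flagged, truth = set(flagged_ids), set(truly_misclassified_ids)
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy cluster with made-up IDs and labels, for illustration only.
example_cluster = {
    "P1": [("Escherichia", "manual"), ("Escherichia", "computational")],
    "P2": [("Escherichia", "computational")],
    "P3": [("Escherichia", "manual")],
    "P4": [("Homo", "computational")],
}
flagged = flag_misclassified(example_cluster)
```

In this toy cluster only `P4` disagrees with the weighted consensus, so it is the single flagged member. A real pipeline would need to compare assignments along the taxonomic tree rather than as flat labels; the repository linked in the abstract is the place to look for the actual method.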
Identifiers
pubmed: 32579213
pii: 5862012
doi: 10.1093/bioinformatics/btaa586
pmc: PMC7821992
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
4699-4705
Copyright information
© The Author(s) 2020. Published by Oxford University Press.