CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats.
Journal
Database : the journal of biological databases and curation
ISSN: 1758-0463
Abbreviated title: Database (Oxford)
Country: England
NLM ID: 101517697
Publication information
Publication date:
01 01 2020
History:
received: 21 05 2020
revised: 07 09 2020
accepted: 10 09 2020
entrez: 18 11 2020
pubmed: 19 11 2020
medline: 29 10 2021
Status:
ppublish
Abstract
The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.
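The kind of check the abstract recommends can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the k-mer length, and the thresholds are all assumptions chosen for illustration. It flags a protein sequence when the same short peptide recurs several times at a near-constant, non-overlapping spacing, which is the pattern expected when a CRISPR array (identical repeats separated by variable spacers) is translated.

```python
from collections import defaultdict


def find_spaced_repeats(seq, k=8, min_copies=3, max_jitter=3):
    """Return {k-mer: positions} for k-mers recurring at near-constant spacing.

    All parameters are illustrative assumptions, not values from the paper.
    """
    # Index every k-mer by its start positions in the protein sequence.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    hits = {}
    for kmer, pos in positions.items():
        if len(pos) < max(min_copies, 2):
            continue
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        # Overlapping copies indicate low-complexity runs, not spaced repeats.
        if min(gaps) < k:
            continue
        # Spaced repeats recur at an almost constant distance.
        if max(gaps) - min(gaps) <= max_jitter:
            hits[kmer] = pos
    return hits


def looks_like_translated_crispr(seq, **kwargs):
    """Heuristic flag: True if any regularly spaced repeat is found."""
    return bool(find_spaced_repeats(seq, **kwargs))
```

A flagged sequence is only a candidate for manual review; genuine repeat-containing proteins can trigger the same heuristic, so a nucleotide-level CRISPR detector run on the source DNA would be a sensible confirmation step.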
Identifiers
pubmed: 33206958
pii: 5989497
doi: 10.1093/database/baaa088
pmc: PMC7673337
Chemical substances
Proteins
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2020. Published by Oxford University Press.