Propagation, detection and correction of errors using the sequence database network.
Annotations
Error detection
Network analysis
Propagation
Sequence
Journal
Briefings in bioinformatics
ISSN: 1477-4054
Abbreviated title: Brief Bioinform
Country: England
NLM ID: 100912837
Publication information
Publication date:
19 11 2022
History:
received: 10 05 2022
revised: 31 07 2022
accepted: 28 08 2022
pubmed: 21 10 2022
medline: 24 11 2022
entrez: 20 10 2022
Status:
ppublish
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and, more broadly, to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Identifiers
pubmed: 36266246
pii: 6764545
doi: 10.1093/bib/bbac416
pmc: PMC9677457
Publication types
Review
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Grants
Agency: Australian Research Council Discovery Project
ID: DP190101350
Copyright information
© The Author(s) 2022. Published by Oxford University Press.