Protein profiles: Biases and protocols.

Keywords: Evaluation protocols; Evolutionary information; Profile-based prediction

Journal

Computational and Structural Biotechnology Journal
ISSN: 2001-0370
Abbreviated title: Comput Struct Biotechnol J
Country: Netherlands
NLM ID: 101585369

Publication information

Publication date:
2020
History:
received: 14 05 2020
revised: 14 08 2020
accepted: 15 08 2020
entrez: 30 9 2020
pubmed: 1 10 2020
medline: 1 10 2020
Status: epublish

Abstract

The use of evolutionary profiles to predict protein secondary structure, as well as other protein structural features, has been standard practice since the 1990s. Using profiles in the input of such predictors, in place of or in addition to the sequence itself, leads to significantly more accurate predictions. While profiles can enhance structural signals, their role remains somewhat surprising, as proteins do not use profiles when folding in vivo. Furthermore, the same sequence-based redundancy reduction protocols initially derived to train and evaluate sequence-based predictors have been applied to train and evaluate profile-based predictors. This can lead to unfair comparisons, since profiles may facilitate the bleeding of information between training and test sets. Here we use the extensively studied problem of secondary structure prediction to better evaluate the role of profiles and show that: (1) high levels of profile similarity between training and test proteins are observed when using standard sequence-based redundancy protocols; (2) the gain in accuracy for profile-based predictors over sequence-based predictors strongly relies on these high levels of profile similarity between training and test proteins; and (3) the overall accuracy of a profile-based predictor on a given protein dataset provides a

Identifiers

pubmed: 32994887
doi: 10.1016/j.csbj.2020.08.015
pii: S2001-0370(20)30368-8
pmc: PMC7486441

Publication types

Journal Article

Languages

eng

Pagination

2281-2289

Copyright information

© 2020 The Author(s).

Authors

Gregor Urban (G)

Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA.

Mirko Torrisi (M)

UCD Institute for Discovery, University College Dublin, Dublin, 4, Ireland.

Christophe N Magnan (CN)

Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA.

Gianluca Pollastri (G)

UCD Institute for Discovery, University College Dublin, Dublin, 4, Ireland.

Pierre Baldi (P)

Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA.

MeSH classifications