Estimating probabilistic context-free grammars for proteins using contact map constraints.

Contrastive estimation; Maximum-likelihood estimator; Probabilistic context-free grammar; Protein contact map; Protein sequence; Structural constraints; Syntactic tree

Journal

PeerJ
ISSN: 2167-8359
Abbreviated title: PeerJ
Country: United States
NLM ID: 101603425

Publication information

Publication date:
2019
History:
received: 2018-07-26
accepted: 2019-02-03
entrez: 2019-03-29
pubmed: 2019-03-29
medline: 2019-03-29
Status: epublish

Abstract

Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions should be taken into account when modeling collections of proteins. Yet profile Hidden Markov Models remain the most popular representation of sets of related protein sequences. Because they model the distributions of the conserved columns of an underlying multiple sequence alignment independently, these models cannot capture dependencies between protein residues. Non-local interactions can be represented with more expressive grammatical models, but learning such grammars is difficult. In this work, we propose using information on protein contacts to facilitate the training of probabilistic context-free grammars that represent families of protein sequences. We develop the theory behind introducing contact constraints into maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs and compared with learning without contact constraints. The evaluation shows high fidelity of the grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that this work is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.
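To make the idea of contact-constrained parsing concrete, the following is a minimal sketch, not the authors' implementation. It computes the inside probability of a sequence under a toy probabilistic context-free grammar, optionally keeping only the parse trees that are consistent with a set of contacts. Everything in it is an assumption made for illustration: the two-letter alphabet (`h`, `p`), the nonterminals `S` and `L`, the rule probabilities, the `inside` function, and the consistency criterion itself (each listed contact must be emitted by a single pair rule of the form A -> x B y). The formulation in the paper may differ.

```python
from functools import lru_cache

# Toy grammar; all symbols and probabilities are illustrative assumptions.
# Rule kinds:
#   ("lex", x)         A -> x
#   ("bin", B, C)      A -> B C
#   ("pair", x, B, y)  A -> x B y   (x and y are treated as a contacting pair)
RULES = {
    "S": [(0.4, ("pair", "h", "S", "h")),
          (0.3, ("bin", "L", "S")),
          (0.3, ("lex", "p"))],
    "L": [(0.5, ("lex", "h")),
          (0.5, ("lex", "p"))],
}


def inside(seq, contacts=frozenset(), start="S"):
    """Sum the probabilities of all parse trees of `seq` rooted at `start`.

    `contacts` is a set of 0-based index pairs (a, b) with a < b. A tree is
    counted only if every listed contact is emitted by one pair rule.
    """
    contacts = frozenset(contacts)

    def cut_separates(i, j, k):
        # True if splitting span [i, j] between k and k+1 separates a contact.
        return any(i <= a <= k < b <= j for a, b in contacts)

    def peel_breaks(i, j):
        # True if emitting positions i and j together would consume exactly
        # one end of some contact lying inside [i, j].
        return any((a == i) != (b == j)
                   for a, b in contacts if i <= a and b <= j)

    @lru_cache(maxsize=None)
    def prob(symbol, i, j):
        total = 0.0
        for p, rule in RULES[symbol]:
            kind = rule[0]
            if kind == "lex":
                if i == j and seq[i] == rule[1]:
                    total += p
            elif kind == "bin":
                _, left, right = rule
                for k in range(i, j):
                    if not cut_separates(i, j, k):
                        total += p * prob(left, i, k) * prob(right, k + 1, j)
            elif kind == "pair":
                _, x, inner, y = rule
                if j - i >= 2 and seq[i] == x and seq[j] == y:
                    if not peel_breaks(i, j):
                        total += p * prob(inner, i + 1, j - 1)
        return total

    return prob(start, 0, len(seq) - 1)


if __name__ == "__main__":
    print(inside("hhph"))            # all parse trees
    print(inside("hhph", {(1, 3)}))  # only trees pairing positions 1 and 3
```

In this toy grammar the sequence `hhph` has two parse trees of equal probability, one pairing positions 0 and 3 and one pairing positions 1 and 3, so supplying either contact roughly halves the unconstrained value (about 0.036 versus about 0.018). Restricting the sum in this way is one plausible reading of how contact information could narrow the set of derivations considered during training.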

Identifiers

pubmed: 30918754
doi: 10.7717/peerj.6559
pii: 6559
pmc: PMC6428041

Publication types

Journal Article

Languages

eng

Pagination

e6559

Conflict of interest statement

The authors declare there are no competing interests.

Authors

Witold Dyrka (W)

Wydział Podstawowych Problemów Techniki, Katedra Inżynierii Biomedycznej, Politechnika Wrocławska, Wrocław, Poland.

Mateusz Pyzik (M)

Wydział Podstawowych Problemów Techniki, Katedra Inżynierii Biomedycznej, Politechnika Wrocławska, Wrocław, Poland.

François Coste (F)

Univ Rennes, Inria, CNRS, IRISA, Rennes, France.

Hugo Talibart (H)

Univ Rennes, Inria, CNRS, IRISA, Rennes, France.
