Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
12 Mar 2021
History:
received: 25 Oct 2020
accepted: 09 Feb 2021
entrez: 13 Mar 2021
pubmed: 14 Mar 2021
medline: 13 Apr 2021
Status: epublish

Abstract

BACKGROUND
The identification of protein families is of great practical importance for in silico protein annotation and underlies several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is time-consuming and may suffer from a bias introduced by the hand-curation itself, which is often guided by the available experimental evidence.
RESULTS
We introduce a procedure that aims to identify putative protein families automatically. The procedure is based on Density Peak Clustering and uses only local pairwise alignments between protein sequences as input. In the experiment presented here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. These were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach is unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results.
CONCLUSIONS
The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
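
For readers unfamiliar with Density Peak Clustering, the following is a minimal sketch of the generic algorithm applied to a precomputed distance matrix. It is not the authors' pipeline: the abstract states only that local pairwise alignments are the input, so how alignments become distances is left open here, and the cutoff-based density, the function name `density_peak_clustering`, and the parameters `d_c` and `n_clusters` are illustrative assumptions. The toy data are made-up 1D points, not protein sequences.

```python
import numpy as np

def density_peak_clustering(dist, d_c, n_clusters):
    """Generic Density Peak Clustering on a symmetric distance matrix.

    Assumptions (not taken from the paper): density is a simple neighbor
    count within cutoff d_c, and the number of clusters is given by hand.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Local density rho_i: number of other points closer than d_c
    # (subtract 1 to exclude the point itself, since dist[i, i] = 0).
    rho = (dist < d_c).sum(axis=1) - 1

    # Visit points from highest to lowest density (stable for ties).
    order = np.argsort(-rho, kind="stable")

    # delta_i: distance to the nearest point of higher density.
    delta = np.zeros(n)
    nearest_denser = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, use its largest distance.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        nearest_denser[i] = j

    # Cluster centers combine high density and large delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Each remaining point inherits the label of its nearest denser
    # neighbor, which has already been labeled in this visiting order.
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers

# Toy example: two well-separated groups of points on a line.
points = np.array([0.0, 1.0, 2.0, 50.0, 51.0, 52.0])
dist = np.abs(points[:, None] - points[None, :])
labels, centers = density_peak_clustering(dist, d_c=1.5, n_clusters=2)
print(labels, centers)
```

In practice the cutoff and the number of centers are usually chosen by inspecting the density and delta values rather than fixed in advance; they are hard-coded here only to keep the sketch short.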

Identifiers

pubmed: 33711918
doi: 10.1186/s12859-021-04013-x
pii: 10.1186/s12859-021-04013-x
pmc: PMC7955657

Chemical substances

Proteins 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

121

Grants

Agency: Wellcome Trust
ID: 105104/Z/14/Z
Country: United Kingdom

Authors

Elena Tea Russo (ET)

SISSA, 34136, Trieste, Italy.

Alessandro Laio (A)

SISSA, 34136, Trieste, Italy. laio@sissa.it.

Marco Punta (M)

Centre for Evolution and Cancer, The Institute of Cancer Research, London, SM2 5NG, UK.
Center for Omics Sciences, IRCCS San Raffaele Hospital, 20132, Milan, Italy.
