DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets.


Journal

PLoS computational biology
ISSN: 1553-7358
Abbreviated title: PLoS Comput Biol
Country: United States
NLM ID: 101238922

Publication information

Publication date:
10 2022
History:
received: 15 02 2022
accepted: 26 09 2022
revised: 31 10 2022
pubmed: 20 10 2022
medline: 3 11 2022
entrez: 19 10 2022
Status: epublish

Abstract

Proteins known only at the sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for sequences that are not yet annotated. Existing domain family resources typically rely on at least some manual curation: they grow slowly over time and leave a large fraction of protein sequence space unclassified. Here we describe the automatic clustering, by Density Peak Clustering, of UniRef50 v. 2017_07, a protein sequence database of approximately 23 million sequences. We radically re-implemented a pipeline we had previously developed so that it can handle millions of sequences and data volumes on the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds approximately 45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-to-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters composed of protein regions with no Pfam annotation, which are therefore candidates for novel protein families. These results are made available to the scientific community through a dedicated repository.
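
The abstract names Density Peak Clustering as the core method without detailing it. As a rough illustration only, the sketch below applies the generic algorithm to toy 2D points: each point gets a local density (the number of neighbours within a cutoff) and a distance to its nearest denser point, the points where both are large become cluster centres, and every other point inherits the label of its nearest denser neighbour. This is not the DPCfam pipeline, which works on protein sequence data at a much larger scale. The function name, the cutoff `d_c`, the fixed `n_clusters`, and the toy points are all assumptions made for the example.

```python
import math


def density_peak_clustering(points, d_c, n_clusters):
    """Toy Density Peak Clustering on a small list of coordinate tuples.

    Illustrative sketch only: d_c (density cutoff) and n_clusters are
    assumed inputs, and Euclidean distance stands in for whatever
    distance a real application would define.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points strictly closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank_of = {i: r for r, i in enumerate(order)}

    delta = [0.0] * n            # distance to the nearest denser point
    nearest_denser = [-1] * n    # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour; by convention
            # it takes its largest distance to any other point.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j

    # Centres: largest rho * delta; ties favour the denser-ranked point,
    # so the densest point is always selected.
    centers = sorted(range(n),
                     key=lambda i: (-rho[i] * delta[i], rank_of[i]))[:n_clusters]

    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining point copies the label of its nearest denser point,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels


# Two well-separated toy groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = density_peak_clustering(points, d_c=1.0, n_clusters=2)
print(labels)
```

Fixing the number of clusters in advance is a simplification for this sketch; in practice centres are usually chosen by inspecting how density and distance are distributed across the points.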

Identifiers

pubmed: 36260616
doi: 10.1371/journal.pcbi.1010610
pii: PCOMPBIOL-D-22-00229
pmc: PMC9621593

Chemical substances

Proteins 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e1010610

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Elena Tea Russo (ET)

SISSA, Trieste, Italy.
AREA SCIENCE PARK, Trieste, Italy.

Federico Barone (F)

SISSA, Trieste, Italy.
AREA SCIENCE PARK, Trieste, Italy.
Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy.

Alex Bateman (A)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom.

Stefano Cozzini (S)

AREA SCIENCE PARK, Trieste, Italy.

Marco Punta (M)

Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy.
Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, IRCCS San Raffaele Scientific Institute, Milan, Italy.

Alessandro Laio (A)

SISSA, Trieste, Italy.
ICTP, Trieste, Italy.
