Clustering protein functional families at large scale with hierarchical approaches.
CATH
FunFams
domain classification
embeddings
functional classification
Journal
Protein science : a publication of the Protein Society
ISSN: 1469-896X
Abbreviated title: Protein Sci
Country: United States
NLM ID: 9211750
Publication information
Publication date:
Sep 2024
History:
revised: 2024-07-22
received: 2024-03-27
accepted: 2024-07-24
medline: 2024-08-15
pubmed: 2024-08-15
entrez: 2024-08-15
Status:
ppublish
Abstract
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
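As an illustration of the general idea of forming a relationship tree from a distance matrix, the following is a minimal sketch only, not the CATH-eMMA implementation. The abstract does not specify the distance measure or the linkage rule, so Euclidean distance and average linkage are assumptions here, and the function names (`distance_matrix`, `average_linkage_tree`), domain labels, and toy two-dimensional "embeddings" are hypothetical.

```python
import math
from itertools import combinations


def distance_matrix(vectors):
    """Pairwise Euclidean distances (an assumed metric) between vectors."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = math.dist(vectors[i], vectors[j])
    return dist


def average_linkage_tree(dist, labels):
    """Agglomerative clustering with average linkage (an assumed rule).

    Returns the tree as nested tuples of labels, plus the list of merges
    as (left_subtree, right_subtree, average_distance).
    """
    # Each cluster id maps to (subtree, indices of its member items).
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for a, b in combinations(sorted(clusters), 2):
            members_a, members_b = clusters[a][1], clusters[b][1]
            total = sum(dist[i][j] for i in members_a for j in members_b)
            avg = total / (len(members_a) * len(members_b))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        avg, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        merges.append((tree_a, tree_b, avg))
        next_id += 1
    root_tree, _ = next(iter(clusters.values()))
    return root_tree, merges


# Hypothetical toy data: four domains with made-up 2-D vectors.
labels = ["dom_A", "dom_B", "dom_C", "dom_D"]
vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 2.0]]

tree, merges = average_linkage_tree(distance_matrix(vectors), labels)
print(tree)  # (('dom_A', 'dom_B'), ('dom_C', 'dom_D'))
```

In this sketch the distance matrix is the only input to the tree-building step, so the same function could in principle take distances derived from any source, such as embedding distances or structure-comparison scores, provided they are supplied as a symmetric matrix.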
Chemical substances
Proteins
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
e5140
Grants
Agency: Biotechnology and Biological Sciences Research Council
Country: United Kingdom
Agency: Wellcome Trust
Country: United Kingdom
Copyright information
© 2024 The Author(s). Protein Science published by Wiley Periodicals LLC on behalf of The Protein Society.