Uncovering new families and folds in the natural protein universe.


Journal

Nature
ISSN: 1476-4687
Abbreviated title: Nature
Country: England
NLM ID: 0410462

Publication information

Publication date:
Oct 2023
History:
received: 24 03 2023
accepted: 07 09 2023
medline: 23 10 2023
pubmed: 14 9 2023
entrez: 13 9 2023
Status: ppublish

Abstract

We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database.

Identifiers

pubmed: 37704037
doi: 10.1038/s41586-023-06622-3
pii: 10.1038/s41586-023-06622-3
pmc: PMC10584680

Chemical substances

Proteins 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

646-653

Copyright information

© 2023. The Author(s).

Authors

Janani Durairaj (J)

Biozentrum, University of Basel, Basel, Switzerland.
SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland.

Andrew M Waterhouse (AM)

Biozentrum, University of Basel, Basel, Switzerland.
SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland.

Toomas Mets (T)

Institute of Technology, University of Tartu, Tartu, Estonia.
Department of Experimental Medical Science, Lund University, Lund, Sweden.

Tetiana Brodiazhenko (T)

Institute of Technology, University of Tartu, Tartu, Estonia.

Minhal Abdullah (M)

Institute of Technology, University of Tartu, Tartu, Estonia.
Department of Experimental Medical Science, Lund University, Lund, Sweden.

Gabriel Studer (G)

Biozentrum, University of Basel, Basel, Switzerland.
SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland.

Gerardo Tauriello (G)

Biozentrum, University of Basel, Basel, Switzerland.
SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland.

Mehmet Akdel (M)

VantAI, New York, NY, USA.

Antonina Andreeva (A)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK.

Alex Bateman (A)

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK.

Tanel Tenson (T)

Institute of Technology, University of Tartu, Tartu, Estonia.

Vasili Hauryliuk (V)

Institute of Technology, University of Tartu, Tartu, Estonia.
Department of Experimental Medical Science, Lund University, Lund, Sweden.
Science for Life Laboratory, Lund, Sweden.
Virus Centre, Lund University, Lund, Sweden.

Torsten Schwede (T)

Biozentrum, University of Basel, Basel, Switzerland. torsten.schwede@unibas.ch.
SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland. torsten.schwede@unibas.ch.

Joana Pereira (J)

Biozentrum, University of Basel, Basel, Switzerland. joana.pereira@unibas.ch.
SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland. joana.pereira@unibas.ch.
