Uncovering new families and folds in the natural protein universe.
Journal
Nature
ISSN: 1476-4687
Abbreviated title: Nature
Country: England
NLM ID: 0410462
Publication information
Publication date:
Oct 2023
History:
received: 2023-03-24
accepted: 2023-09-07
medline: 2023-10-23
pubmed: 2023-09-14
entrez: 2023-09-13
Status:
ppublish
Abstract
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database.
Identifiers
pubmed: 37704037
doi: 10.1038/s41586-023-06622-3
pii: 10.1038/s41586-023-06622-3
pmc: PMC10584680
Chemical substances
Proteins
0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
646-653
Copyright information
© 2023. The Author(s).
References
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
pubmed: 34791371
doi: 10.1093/nar/gkab1061
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2020).
pmcid: 7779014
doi: 10.1093/nar/gkaa913
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
doi: 10.1093/nar/gkac1052
Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).
pubmed: 36477304
doi: 10.1093/nar/gkac1080
Boutet, E. et al. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Methods Mol. Biol. 1374, 23–54 (2016).
pubmed: 26519399
doi: 10.1007/978-1-4939-3167-5_2
Paysan-Lafosse, T. et al. InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2023).
pubmed: 36350672
doi: 10.1093/nar/gkac993
Levitt, M. Nature of the protein universe. Proc. Natl Acad. Sci. USA 106, 11079–11084 (2009).
pubmed: 19541617
pmcid: 2698892
doi: 10.1073/pnas.0905029106
Bienert, S. et al. The SWISS-MODEL Repository—new features and functionality. Nucleic Acids Res. 45, D313–D319 (2017).
pubmed: 27899672
doi: 10.1093/nar/gkw1132
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
pubmed: 34039967
pmcid: 8155034
doi: 10.1038/s41467-021-23303-9
Gane, A. et al. ProtNLM: model-based natural language protein annotation. Preprint at https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf (2022).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
pubmed: 25398609
doi: 10.1093/bioinformatics/btu739
Leinonen, R. et al. UniProt archive. Bioinformatics 20, 3236–3237 (2004).
pubmed: 15044231
doi: 10.1093/bioinformatics/bth191
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
pubmed: 29035372
doi: 10.1038/nbt.3988
Rismondo, J., Percy, M. G. & Gründling, A. Discovery of genes required for lipoteichoic acid glycosylation predicts two distinct mechanisms for wall teichoic acid glycosylation. J. Biol. Chem. 293, 3293–3306 (2018).
pubmed: 29343515
pmcid: 5836110
doi: 10.1074/jbc.RA117.001614
Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960 (2005).
pubmed: 15531603
doi: 10.1093/bioinformatics/bti125
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01773-0 (2023).
doi: 10.1038/s41587-023-01773-0
pubmed: 37156916
Kelleher, D. J. & Gilmore, R. An evolving view of the eukaryotic oligosaccharyltransferase. Glycobiology 16, 47R–62R (2006).
pubmed: 16317064
doi: 10.1093/glycob/cwj066
Szymanski, C. M. & Wren, B. W. Protein glycosylation in bacterial mucosal pathogens. Nat. Rev. Microbiol. 3, 225–237 (2005).
Pereira, J. GCsnap: interactive snapshots for the comparison of protein-coding genomic contexts. J. Mol. Biol. 433, 166943 (2021).
pubmed: 33737026
doi: 10.1016/j.jmb.2021.166943
Gotfredsen, M. & Gerdes, K. The Escherichia coli relBE genes belong to a new toxin-antitoxin gene family. Mol. Microbiol. 29, 1065–1076 (1998).
pubmed: 9767574
doi: 10.1046/j.1365-2958.1998.00993.x
Jurėnas, D., Fraikin, N., Goormaghtigh, F. & Van Melderen, L. Biology and evolution of bacterial toxin-antitoxin systems. Nat. Rev. Microbiol. 20, 335–350 (2022).
pubmed: 34975154
doi: 10.1038/s41579-021-00661-1
Kurata, T. et al. A hyperpromiscuous antitoxin protein domain for the neutralization of diverse toxin domains. Proc. Natl Acad. Sci. USA 119, e2102212119 (2022).
pubmed: 35121656
pmcid: 8832971
doi: 10.1073/pnas.2102212119
Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. https://doi.org/10.1145/3571730 (2023).
doi: 10.1145/3571730
Akroyd, J. E., Clayson, E. & Higgins, N. P. Purification of the gam gene-product of bacteriophage Mu and determination of the nucleotide sequence of the gam gene. Nucleic Acids Res. 14, 6901–6914 (1986).
pubmed: 2945162
pmcid: 311707
doi: 10.1093/nar/14.17.6901
Nakae, S. et al. Structure of the EndoMS-DNA complex as mismatch restriction endonuclease. Structure 24, 1960–1971 (2016).
pubmed: 27773688
doi: 10.1016/j.str.2016.09.005
Aggarwal, A. K. Structure and function of restriction endonucleases. Curr. Opin. Struct. Biol. 5, 11–19 (1995).
pubmed: 7773740
doi: 10.1016/0959-440X(95)80004-K
Pingoud, A. & Jeltsch, A. Structure and function of type II restriction endonucleases. Nucleic Acids Res. 29, 3705–3727 (2001).
pubmed: 11557805
pmcid: 55916
doi: 10.1093/nar/29.18.3705
Klein, P., Somorjai, R. L. & Lau, P. C. Distinctive properties of signal sequences from bacterial lipoproteins. Protein Eng. 2, 15–20 (1988).
pubmed: 3253732
doi: 10.1093/protein/2.1.15
Hayashi, S. & Wu, H. C. Lipoproteins in bacteria. J. Bioenerg. Biomembr. 22, 451–471 (1990).
pubmed: 2202727
doi: 10.1007/BF00763177
Bateman, A. et al. Phospholipid scramblases and Tubby-like proteins belong to a new superfamily of membrane tethered transcription factors. Bioinformatics 25, 159–162 (2009).
pubmed: 19010806
doi: 10.1093/bioinformatics/btn595
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
pubmed: 36927031
doi: 10.1126/science.ade2574
Nepomnyachiy, S., Ben-Tal, N. & Kolodny, R. Global view of the protein universe. Proc. Natl Acad. Sci. USA 111, 11691–11696 (2014).
pubmed: 25071170
pmcid: 4136566
doi: 10.1073/pnas.1403395111
Alva, V., Remmert, M., Biegert, A., Lupas, A. N. & Söding, J. A galaxy of folds. Protein Sci. 19, 124–130 (2010).
pubmed: 19937658
doi: 10.1002/pro.297
Bordin, N. et al. AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun. Biol. 6, 160 (2023).
pubmed: 36755055
pmcid: 9908985
doi: 10.1038/s42003-023-04488-9
Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
pubmed: 36344848
pmcid: 9663297
doi: 10.1038/s41594-022-00849-w
Barrio-Hernandez, I. et al. Clustering predicted structures at the scale of the known protein universe. Nature https://doi.org/10.1038/s41586-023-06510-w (2023).
Kaminski, K., Ludwiczak, J., Alva, V. & Dunin-Horkawicz, S. pLM-BLAST—distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics, btad579 (2023).
Pantolini, L., Studer, G., Pereira, J., Durairaj, J. & Schwede, T. Embedding-based alignment: combining protein language models and alignment approaches to detect structural similarities in the twilight-zone. Preprint at bioRxiv https://doi.org/10.1101/2022.12.13.520313 (2022).
Lomize, A. L., Todd, S. C. & Pogozheva, I. D. Spatial arrangement of proteins in planar and curved membranes by PPM 3.0. Protein Sci. 31, 209–220 (2022).
pubmed: 34716622
doi: 10.1002/pro.4219
Berisio, R. & Delogu, G. PGRS domain structures: doomed to sail the mycomembrane. PLoS Pathog. 18, e1010760 (2022).
pubmed: 36048802
pmcid: 9436101
doi: 10.1371/journal.ppat.1010760
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
pubmed: 34265844
pmcid: 8371605
doi: 10.1038/s41586-021-03819-2
Hagberg, A. A., Schult, D. A. & Swart, P. J. Exploring network structure, dynamics, and function using NetworkX. In Proc. 7th Python in Science Conf. (eds Varoquaux, G. et al.) 11–15 (2008).
Sehnal, D. et al. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res. 49, W431–W437 (2021).
pubmed: 33956157
pmcid: 8262734
doi: 10.1093/nar/gkab314
Durairaj, J., Akdel, M., de Ridder, D. & van Dijk, A. D. J. Geometricus represents protein structures as shape-mers derived from moment invariants. Bioinformatics 36, i718–i725 (2020).
Flusser, J., Boldys, J. & Zitova, B. Moment forms invariant to rotation and blur in arbitrary number of dimensions. IEEE Trans. Pattern Anal. Machine Intell. 25, 234–246 (2003).
Flusser, J., Suk, T. & Zitová, B. 2D and 3D Image Analysis by Moments (Wiley, 2016).
Mamistvalov, A. G. n-dimensional moment invariants and conceptual mathematical theory of recognition n-dimensional solids. IEEE Trans. Pattern Anal. Machine Intell. 20, 819–831 (1998).
Hattne, J. & Lamzin, V. S. A moment invariant for evaluating the chirality of three-dimensional objects. J. R. Soc. Interface 8, 144–151 (2011).
pubmed: 20685692
doi: 10.1098/rsif.2010.0297
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8024–8035 (2019).
Das, S. et al. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics 32, 2889–2889 (2016).
Zhang, C., Shine, M., Pyle, A. M. & Zhang, Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat. Methods 19, 1109–1115 (2022).
pubmed: 36038728
doi: 10.1038/s41592-022-01585-1
AlQuraishi, M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinf. 20, 311 (2019).
doi: 10.1186/s12859-019-2932-0
Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Ling. 5, 135–146 (2017).
Rehurek, R. & Sojka, P. Gensim–python framework for vector space modelling. In Proc. LREC 2010 workshop New Challenges for NLP Frameworks, 45–50 (2010).
Holm, L. Using Dali for protein structure comparison. Methods Mol. Biol. 2112, 29–42 (2020).
pubmed: 32006276
doi: 10.1007/978-1-0716-0270-6_3
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
pubmed: 15849316
pmcid: 1084323
doi: 10.1093/nar/gki524
Mavridis, L. & Ritchie, D. W. 3D-blast: 3D protein structure alignment, comparison, and classification using spherical polar Fourier correlations. Pac. Symp. Biocomput. 2010, 281–292 (2010).
Wang, S. & Zheng, W.-M. CLePAPS: fast pair alignment of protein structures based on conformational letters. J. Bioinform. Comput. Biol. 6, 347–366 (2008).
pubmed: 18464327
doi: 10.1142/S0219720008003461
Liu, F. T., Ting, K. M. & Zhou, Z.-H. Isolation forest. In Proc. 2008 Eighth IEEE International Conference on Data Mining, 413–422 (2008).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Gabler, F. et al. Protein sequence analysis using the MPI Bioinformatics Toolkit. Curr. Protoc. Bioinformatics 72, e108 (2020).
pubmed: 33315308
doi: 10.1002/cpbi.108
Pereira, J. & Alva, V. How do I get the most out of my protein sequence using bioinformatics tools? Acta Crystallogr. D Struct. Biol. 77, 1116–1126 (2021).
pubmed: 34473083
pmcid: 8411974
doi: 10.1107/S2059798321007907
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2021).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
pubmed: 9254694
pmcid: 146917
doi: 10.1093/nar/25.17.3389
Frickey, T. & Lupas, A. CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20, 3702–3704 (2004).
pubmed: 15284097
doi: 10.1093/bioinformatics/bth444
Wang, Y., Huang, H., Rudin, C. & Shaposhnik, Y. Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. J. Mach. Learn. Res. 22.1, 9129–9201 (2021).
Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. Second International Conference on Knowledge Discovery and Data Mining 226–231 (AAAI Press, 1996).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
pubmed: 22039361
pmcid: 3197634
doi: 10.1371/journal.pcbi.1002195
Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175 (2011).
pubmed: 22198341
doi: 10.1038/nmeth.1818
Quan, J. & Tian, J. Circular polymerase extension cloning for high-throughput cloning of complex and combinatorial DNA libraries. Nat. Protoc. 6, 242–251 (2011).
pubmed: 21293463
doi: 10.1038/nprot.2010.181
Guzman, L. M., Belin, D., Carson, M. J. & Beckwith, J. Tight regulation, modulation, and high-level expression by vectors containing the arabinose PBAD promoter. J. Bacteriol. 177, 4121–4130 (1995).
pubmed: 7608087
pmcid: 177145
doi: 10.1128/jb.177.14.4121-4130.1995
Jaskólska, M. & Gerdes, K. CRP-dependent positive autoregulation and proteolytic degradation regulate competence activator Sxy of Escherichia coli. Mol. Microbiol. 95, 833–845 (2015).
pubmed: 25491382
doi: 10.1111/mmi.12901
Neidhardt, F. C., Bloch, P. L. & Smith, D. F. Culture medium for enterobacteria. J. Bacteriol. 119, 736–747 (1974).
pubmed: 4604283
pmcid: 245675
doi: 10.1128/jb.119.3.736-747.1974