Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures.
Journal
Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604
Publication information
Publication date:
18 Sep 2024
History:
received: 03 Jan 2024
accepted: 09 Aug 2024
medline: 19 Sep 2024
pubmed: 19 Sep 2024
entrez: 18 Sep 2024
Status:
aheadofprint
Abstract
Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link these disparate data types: to 'map' variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores), as well as protein structures not available in databases, to establish the connection between genomics and proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.
Identifiers
pubmed: 39294369
doi: 10.1038/s41592-024-02409-0
pii: 10.1038/s41592-024-02409-0
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Grants
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: UM1HG011969
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: RM1HG010461
Copyright information
© 2024. The Author(s).