Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures.


Journal

Nature methods
ISSN: 1548-7105
Abbreviated title: Nat Methods
Country: United States
NLM ID: 101215604

Publication information

Publication date:
18 Sep 2024
History:
received: 03 Jan 2024
accepted: 09 Aug 2024
medline: 19 Sep 2024
pubmed: 19 Sep 2024
entrez: 18 Sep 2024
Status: aheadofprint

Abstract

Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link these disparate data types: to 'map' variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal (https://g2p.broadinstitute.org/): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores) as well as protein structures beyond those in databases, to establish the connection between genomics and proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.

Identifiers

pubmed: 39294369
doi: 10.1038/s41592-024-02409-0
pii: 10.1038/s41592-024-02409-0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: UM1HG011969
Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)
ID: RM1HG010461

Copyright information

© 2024. The Author(s).

Authors

Seulki Kwon (S)

Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Jordan Safer (J)

Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Duyen T Nguyen (DT)

PATTERN, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

David Hoksza (D)

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic.

Patrick May (P)

Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg.

Jeremy A Arbesfeld (JA)

The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children's Hospital, Columbus, OH, USA.

Alan F Rubin (AF)

Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia.
Department of Medical Biology, University of Melbourne, Parkville, Victoria, Australia.

Arthur J Campbell (AJ)

Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Alex Burgin (A)

Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Sumaiya Iqbal (S)

Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. sumaiya@broadinstitute.org.
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. sumaiya@broadinstitute.org.
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA. sumaiya@broadinstitute.org.
Cancer Data Sciences, Dana-Farber/Harvard Cancer Center, Boston, MA, USA. sumaiya@broadinstitute.org.

MeSH classifications