Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures.

Journal

Nature methods

ISSN: 1548-7105

Abbreviated title: Nat Methods

Country: United States

NLM ID: 101215604

Publication information

Publication date:
18 Sep 2024

History:

received: 03 01 2024

accepted: 09 08 2024

medline: 19 9 2024

pubmed: 19 9 2024

entrez: 18 9 2024

Status: aheadofprint

Abstract

Recent advances in AI-based methods have revolutionized the field of structural biology. Concomitantly, high-throughput sequencing and functional genomics have generated genetic variants at an unprecedented scale. However, efficient tools and resources are needed to link disparate data types: to 'map' variants onto protein structures, to better understand how the variation causes disease, and thereby to design therapeutics. Here we present the Genomics 2 Proteins portal ( https://g2p.broadinstitute.org/ ): a human proteome-wide resource that maps 20,076,998 genetic variants onto 42,413 protein sequences and 77,923 structures, with a comprehensive set of structural and functional features. Additionally, the Genomics 2 Proteins portal allows users to interactively upload protein residue-wise annotations (for example, variants and scores) as well as protein structures beyond those in databases, to establish the connection from genomics to proteins. The portal serves as an easy-to-use discovery tool for researchers and scientists to hypothesize the structure-function relationship between natural or synthetic variations and their molecular phenotypes.

Identifiers

DOI: 10.1038/s41592-024-02409-0 PMID: 39294369

pubmed: 39294369

doi: 10.1038/s41592-024-02409-0

pii: 10.1038/s41592-024-02409-0

Publication types

Journal Article

Languages

eng

Grants

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

ID: UM1HG011969

Agency: U.S. Department of Health & Human Services | NIH | National Human Genome Research Institute (NHGRI)

ID: RM1HG010461

Authors

Seulki Kwon (S)

Jordan Safer (J)

Duyen T Nguyen (DT)

David Hoksza (D)

Patrick May (P)

Jeremy A Arbesfeld (JA)

Alan F Rubin (AF)

Arthur J Campbell (AJ)

Alex Burgin (A)

Sumaiya Iqbal (S)
