GPAD: a natural language processing-based application to extract the gene-disease association discovery information from OMIM.

Gene discovery; Gene-disease relationship; Mendelian disorder; NLP; Rare disease gene; Trends in gene discovery

Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
27 Feb 2024
History:
received: 15 Aug 2023
accepted: 09 Feb 2024
medline: 28 Feb 2024
pubmed: 28 Feb 2024
entrez: 27 Feb 2024
Status: epublish

Abstract

Thousands of genes have been associated with different Mendelian conditions. One valuable source for tracking these gene-disease associations (GDAs) is the Online Mendelian Inheritance in Man (OMIM) database. However, most of the information in OMIM is textual and heterogeneous (e.g. summarized by different experts), which complicates automated reading and understanding of the data. Here, we used Natural Language Processing (NLP) to build a tool, Gene-Phenotype Association Discovery (GPAD), that can syntactically process OMIM text and extract the data of interest. GPAD applies a series of language-based techniques to the text obtained from the OMIM API to extract GDA discovery-related information. GPAD can report when a particular gene was associated with a specific phenotype, as well as the type of validation (whether through model organisms or cohort-based patient-matching approaches) for such an association. GPAD-extracted data were validated against published reports and compared with the output of a large language model. Using GPAD's extracted data, we analysed trends in GDA discoveries, noting a significant increase in their rate after the introduction of exome sequencing, rising from an average of about 150 to about 250 discoveries each year. Contrary to hopes of resolving most GDAs for Mendelian disorders by now, our data indicate a substantial decline in discovery rates over the past five years (2017-2022). This decline appears to be linked to the increasing need for larger cohorts to substantiate GDAs. We also observed rising use of zebrafish and Drosophila as model organisms providing evidential support for GDAs. GPAD's real-time analysis capability offers an up-to-date view of GDA discovery and could help in planning and managing research strategies. In the future, this solution can be extended or modified to capture other information in OMIM and the scientific literature.
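
The kind of extraction the abstract describes (finding when a gene-disease association was first reported and whether a model organism supported it) can be illustrated with a small sketch. This is not GPAD's implementation: the keyword list, the citation pattern, the function names and the sample text are all hypothetical, and it uses only regular expressions rather than a full NLP pipeline.

```python
import re

# Hypothetical keyword list for illustration; not taken from GPAD.
MODEL_ORGANISMS = ("zebrafish", "drosophila", "mouse", "mice", "yeast")

# Matches a citation year written as "(2012)", as in "Doe et al. (2012)".
CITATION_YEAR = re.compile(r"\(((?:19|20)\d{2})\)")


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def extract_discovery_info(text):
    """Return the earliest cited year in sentences that mention a mutation
    or variant, plus any model organisms named anywhere in the text."""
    years = []
    organisms = set()
    for sentence in split_sentences(text):
        lower = sentence.lower()
        if "mutation" in lower or "variant" in lower:
            years.extend(int(y) for y in CITATION_YEAR.findall(sentence))
        for organism in MODEL_ORGANISMS:
            if re.search(rf"\b{organism}\b", lower):
                organisms.add(organism)
    return {
        "first_report_year": min(years) if years else None,
        "model_organisms": sorted(organisms),
    }


# Made-up text written in the style of an OMIM entry.
SAMPLE = (
    "Doe et al. (2012) identified a homozygous mutation in the GENE1 gene "
    "in 2 sibs. Roe et al. (2015) showed that knockdown of the zebrafish "
    "ortholog caused a similar defect."
)

print(extract_discovery_info(SAMPLE))
```

A keyword-and-regex approach like this is brittle against the heterogeneous wording the abstract mentions, which is presumably why the authors rely on syntactic processing instead.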

Identifiers

pubmed: 38413851
doi: 10.1186/s12859-024-05693-x
pii: 10.1186/s12859-024-05693-x

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

84

Grants

Agency: CIHR
ID: GP1-155868
Country: Canada

Copyright information

© 2024. The Author(s).

Authors

K M Tahsin Hassan Rahit (KMTH)

Departments of Biochemistry, Molecular Biology and Medical Genetics, Cumming School of Medicine, University of Calgary, Calgary, AB, T2N 4N1, Canada.
Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada.

Vladimir Avramovic (V)

Departments of Biochemistry, Molecular Biology and Medical Genetics, Cumming School of Medicine, University of Calgary, Calgary, AB, T2N 4N1, Canada.
Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada.

Jessica X Chong (JX)

Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, 98195, USA.
Brotman-Baty Institute, Seattle, WA, 98195, USA.

Maja Tarailo-Graovac (M)

Departments of Biochemistry, Molecular Biology and Medical Genetics, Cumming School of Medicine, University of Calgary, Calgary, AB, T2N 4N1, Canada. maja.tarailograovac@ucalgary.ca.
Alberta Children's Hospital Research Institute, University of Calgary, Calgary, AB, T2N 4N1, Canada. maja.tarailograovac@ucalgary.ca.

MeSH classifications