Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI).
Artificial intelligence
Biocuration
Knowledge graphs
Large language models
Ontologies
Ontology engineering
Journal
Journal of biomedical semantics
ISSN: 2041-1480
Abbreviated title: J Biomed Semantics
Country: England
NLM ID: 101531992
Publication information
Publication date:
17 Oct 2024
History:
received: 3 Jun 2024
accepted: 8 Sep 2024
medline: 17 Oct 2024
pubmed: 17 Oct 2024
entrez: 16 Oct 2024
Status:
epublish
Abstract
Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and extensive collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources. We assessed the performance of DRAGON-AI on de novo term construction across ten diverse ontologies, making use of extensive manual evaluation of results. Our method has high precision for relationship generation, although slightly lower than that of logic-based reasoning. Our method is also able to generate definitions deemed acceptable by expert evaluators, but these scored worse than human-authored definitions. Notably, evaluators with the highest level of confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated the ability of DRAGON-AI to incorporate natural language instructions in the form of GitHub issues. These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.
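The retrieval-augmented idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-term ontology, the function names (`retrieve`, `build_prompt`), and the prompt layout are all hypothetical, and simple word-overlap cosine similarity stands in for whatever search a real system would use. The sketch only shows the general pattern of retrieving similar existing terms and placing them in a prompt as examples, leaving the new term's fields for a language model to complete.

```python
import math
import re
from collections import Counter

# Hypothetical mini-ontology: (label, definition, is_a parent).
# Illustrative data only; not taken from the paper or any real ontology release.
TERMS = [
    ("hepatocyte", "An epithelial cell of the liver.", "epithelial cell"),
    ("cardiac muscle cell", "A muscle cell of the heart.", "muscle cell"),
    ("liver sinusoidal endothelial cell",
     "An endothelial cell that lines the sinusoids of the liver.",
     "endothelial cell"),
]


def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for a text embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, k=2):
    """Return the k existing terms most similar to the query label."""
    q = bag_of_words(query)
    ranked = sorted(
        TERMS,
        key=lambda t: cosine(q, bag_of_words(t[0] + " " + t[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(new_label, k=2):
    """Assemble a few-shot prompt: retrieved terms first, new term last."""
    lines = ["Complete the ontology term using the examples as a guide.", ""]
    for label, definition, parent in retrieve(new_label, k):
        lines += [f"label: {label}",
                  f"definition: {definition}",
                  f"is_a: {parent}",
                  ""]
    lines += [f"label: {new_label}", "definition:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("liver stellate cell"))
```

The resulting prompt text would then be sent to a language model, and, as the abstract stresses, any generated definition or relationship would still need review by an expert curator.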
Identifiers
pubmed: 39415214
doi: 10.1186/s13326-024-00320-3
pii: 10.1186/s13326-024-00320-3
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
19
Copyright information
© 2024. The Author(s).