Evaluation of large language models for discovery of gene set function.
Journal
ArXiv
ISSN: 2331-8422
Abbreviated title: ArXiv
Country: United States
NLM ID: 101759493
Publication information
Publication date:
07 Sep 2023
History:
pubmed: 21 Sep 2023
medline: 21 Sep 2023
entrez: 21 Sep 2023
Status:
epublish
Abstract
Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI's GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in 'omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.
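The abstract describes two steps that a short sketch can make concrete: prompting a model to name a gene set, and comparing the proposed name with a reference name such as a Gene Ontology term. The Python below is an illustration under stated assumptions, not the authors' pipeline. The prompt wording, the function names `build_naming_prompt` and `token_jaccard`, and the word-overlap score are all hypothetical; in particular, the overlap score is a crude stand-in and is not claimed to be the similarity measure used in the paper. No model is called here.

```python
def build_naming_prompt(genes):
    """Assemble a hypothetical prompt asking an LLM to name a gene set.

    The wording is an assumption for illustration only.
    """
    # De-duplicate and sort so the same gene set always yields the same prompt.
    gene_list = ", ".join(sorted(set(genes)))
    return (
        "Propose a concise name for the most prominent biological process "
        "shared by the following genes, then justify it briefly.\n"
        f"Genes: {gene_list}"
    )


def token_jaccard(name_a, name_b):
    """Jaccard overlap of lowercase word tokens between two names.

    A simple stand-in for a name-similarity score (assumption).
    """
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    print(build_naming_prompt(["TP53", "BRCA1", "TP53"]))
    print(token_jaccard("DNA repair", "DNA replication"))
```

A word-overlap score misses synonyms and paraphrases, so a real benchmark would need a measure that captures meaning rather than shared tokens.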
Publication types
Preprint
Languages
eng
Grants
Agency: NCI NIH HHS
ID: U54 CA274502
Country: United States
Agency: NHGRI NIH HHS
ID: U24 HG012107
Country: United States
Agency: NIH HHS
ID: OT2 OD032742
Country: United States
Agency: NIMH NIH HHS
ID: U01 MH115747
Country: United States
Agency: NCI NIH HHS
ID: U24 CA269436
Country: United States