Evaluation of large language models for discovery of gene set function.

Journal

ArXiv

ISSN: 2331-8422

Abbreviated title: ArXiv

Country: United States

NLM ID: 101759493

Publication information

Publication date:
07 Sep 2023

History:

pubmed: 21 Sep 2023

medline: 21 Sep 2023

entrez: 21 Sep 2023

Status: epublish

Abstract

Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI's GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in 'omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.

Identifiers

PMID: 37731657 PMC: PMC10508824

pubmed: 37731657

pii: 2309.04019

pmc: PMC10508824

Publication types

Preprint

Languages

eng

Grants

Agency: NCI NIH HHS

ID: U54 CA274502

Country: United States

Agency: NHGRI NIH HHS

ID: U24 HG012107

Country: United States

Agency: NIH HHS

ID: OT2 OD032742

Country: United States

Agency: NIMH NIH HHS

ID: U01 MH115747

Country: United States

Agency: NCI NIH HHS

ID: U24 CA269436

Country: United States

Authors

Mengzhou Hu (M)

Sahar Alkhairy (S)

Ingoo Lee (I)

Rudolf T Pillich (RT)

Robin Bachelder (R)

Trey Ideker (T)

Dexter Pratt (D)

MeSH classifications