Evaluation of large language models for discovery of gene set function.

Journal

Research square

Abbreviated title: Res Sq

Country: United States

NLM ID: 101768035

Publication information

Publication date:
18 Sep 2023

History:

pubmed: 4 Oct 2023

medline: 4 Oct 2023

entrez: 4 Oct 2023

Status: epublish

Abstract

Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI's GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in 'omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.

Identifiers

DOI: 10.21203/rs.3.rs-3270331/v1 PMID: 37790547 PMC: PMC10543283

Publication types

Preprint

Languages

eng

Authors

Mengzhou Hu (M)

Sahar Alkhairy (S)

Ingoo Lee (I)

Rudolf T Pillich (RT)

Robin Bachelder (R)

Trey Ideker (T)

Dexter Pratt (D)

MeSH classifications