Evaluating language models for mathematics through interactions.

AI human–computer interaction language models theorem proving

Journal

Proceedings of the National Academy of Sciences of the United States of America
ISSN: 1091-6490
Titre abrégé: Proc Natl Acad Sci U S A
Pays: United States
ID NLM: 7505876

Informations de publication

Date de publication:
11 Jun 2024
Historique:
medline: 3 6 2024
pubmed: 3 6 2024
entrez: 3 6 2024
Statut: ppublish

Résumé

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, among other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations, may constitute better assistants. Humans should inspect LLM output carefully given their current shortcomings and potential for surprising fallibility.

Identifiants

pubmed: 38830100
doi: 10.1073/pnas.2318124121
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

e2318124121

Subventions

Organisme : Alan Turing Institute (ATI)
ID : EP/N510129/1
Organisme : Alan Turing Institute (ATI)
ID : EP/V025279/1
Organisme : EU TAILOR
ID : 952215
Organisme : Leverhulme Trust
ID : ECF-2021-429
Organisme : ERC Advanced Grant ALEXANDRIA
ID : Project GA 742178
Organisme : EPSRC
ID : EP/T019603/1

Déclaration de conflit d'intérêts

Competing interests statement:K.M.C. participated in a part-time Student Researcher Position at Google DeepMind during part of the work; however, this work does not represent any of her work at Google nor the views of Google and was conducted entirely outside of the position. A.Q.J. similarly has been a part-time researcher at Mistral AI during part of the work. The study was formulated prior to either authors’ involvement and does not connect to their work. Y.W. joined xAI after the conception of this work; xAI is not involved.

Auteurs

Katherine M Collins (KM)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Albert Q Jiang (AQ)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Simon Frieder (S)

University of Oxford, Oxford OX1 4BH, United Kingdom.

Lionel Wong (L)

Massachusetts Institute of Technology, Cambridge, MA 02139.

Miri Zilka (M)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Umang Bhatt (U)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.
The Alan Turing Institute, London NW1 2DB, United Kingdom.
New York University, New York, NY 10011.

Thomas Lukasiewicz (T)

University of Oxford, Oxford OX1 4BH, United Kingdom.
Vienna University of Technology, Vienna 1040, Austria.

Yuhuai Wu (Y)

x.AI, New York, NY 10038.

Joshua B Tenenbaum (JB)

Massachusetts Institute of Technology, Cambridge, MA 02139.

William Hart (W)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Timothy Gowers (T)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.
Collége de France, Paris 75001, France.

Wenda Li (W)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Adrian Weller (A)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.
The Alan Turing Institute, London NW1 2DB, United Kingdom.

Mateja Jamnik (M)

University of Cambridge, Cambridge CB2 1TN, United Kingdom.

Articles similaires

[Redispensing of expensive oral anticancer medicines: a practical application].

Lisanne N van Merendonk, Kübra Akgöl, Bastiaan Nuijen
1.00
Humans Antineoplastic Agents Administration, Oral Drug Costs Counterfeit Drugs

Smoking Cessation and Incident Cardiovascular Disease.

Jun Hwan Cho, Seung Yong Shin, Hoseob Kim et al.
1.00
Humans Male Smoking Cessation Cardiovascular Diseases Female
Humans United States Aged Cross-Sectional Studies Medicare Part C
1.00
Humans Yoga Low Back Pain Female Male

Classifications MeSH