Evaluating language models for mathematics through interactions

Collins, Katherine M.; Jiang, Albert Q.; Frieder, Simon; Wong, Lionel; Zilka, Miri; Bhatt, Umang; Lukasiewicz, Thomas; Wu, Yuhuai; Tenenbaum, Joshua B.; Hart, William; Gowers, Timothy; Li, Wenda; Weller, Adrian; Jamnik, Mateja

doi:10.1073/pnas.2318124121

Proc. Natl. Acad. Sci. U.S.A.

2024

Abstract: There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs; this is insufficient for making an informed decision about which LLMs are best to use in an interactive setting, and how that varies by setting. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype …

Cited by 7 publications

References 31 publications

(18 reference statements)

Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies

Deng, Jiang, et al. 2024. Computers & Education.

Building machines that learn and think with people

Collins, Sucholutsky, Bhatt, et al. 2024. Nat Hum Behav.

Beyond Mere Algorithm Aversion: Are Judgments About Computer Agents More Variable?

Buder, Becker, Bareiß, et al. 2024. Communication Research.

Several studies have reported algorithm aversion, reflected in harsher judgments about computers that commit errors, compared to humans who commit the same errors. Two online studies (N = 67, N = 252) tested whether similar effects can be obtained with a referential communication task. Participants were tasked with identifying Japanese kanji characters based on written descriptions allegedly coming from a human or an AI source. Crucially, descriptions were either flawed (ambiguous) or not. Both concurrent measures during experimental trials and pre-post questionnaire data about the source were captured. Study 1 revealed patterns of algorithm aversion but also pointed at an opposite effect of “algorithm benefit”: ambiguous descriptions by an AI (vs. human) were evaluated more negatively, but non-ambiguous descriptions were evaluated more positively, suggesting the possibility that judgments about AI sources exhibit larger variability. Study 2 tested this prediction. While human and AI sources did not differ regarding concurrent measures, questionnaire data revealed several patterns that are consistent with the variability explanation.
