2023
DOI: 10.48550/arxiv.2303.05398
Preprint

MathPrompter: Mathematical Reasoning using Large Language Models

Abstract: Large Language Models (LLMs) have limited performance when solving arithmetic reasoning tasks and often provide incorrect answers. Unlike natural language understanding, math problems typically have a single correct answer, making the task of generating accurate solutions more challenging for LLMs. To the best of our knowledge, no LLMs indicate their level of confidence in their responses, which fuels a trust deficit in these models and impedes their adoption. To address this deficiency, …

Cited by 16 publications (17 citation statements)
References 17 publications
“…This approach successfully solves, explains, and generates math problems at the university level. MathPrompter [66] employs a zero-shot chain-of-thought prompting technique to generate multiple algebraic expressions or Python functions that solve the same math problem in varied ways, increasing confidence in the output results. PAL [44] introduces an innovative approach to bolster the performance of pre-trained language models (PLMs) in mathematical problem-solving.…”
Section: Tool-based Methods (mentioning)
confidence: 99%
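To make the multi-path idea in the statement above concrete, here is a minimal runnable sketch of the consensus check, not the authors' code. The two solution forms (an algebraic expression and a Python function) are hard-coded stand-ins for what MathPrompter would obtain from zero-shot chain-of-thought prompts, and the variable names are purely illustrative.

    import random

    # Stand-ins for LLM outputs: in MathPrompter both forms are generated by
    # the model from the same templated question; here they are canned so the
    # cross-checking logic runs on its own.
    expr = "price * quantity - discount"
    code = "def solve(price, quantity, discount):\n    return price * quantity - discount"

    namespace = {}
    exec(code, namespace)            # safe only because this code is our own stub
    solve = namespace["solve"]

    def consensus(variables, n_checks=5):
        # Evaluate both solution forms on random variable assignments;
        # repeated agreement is the proxy for confidence in the answer.
        for _ in range(n_checks):
            sample = {k: random.randint(1, 100) for k in variables}
            if eval(expr, {}, sample) != solve(**sample):
                return None          # the two derivations disagree: abstain
        return solve(**variables)    # answer on the actual values

    print(consensus({"price": 12, "quantity": 5, "discount": 7}))   # -> 53

Agreement across several random draws is what lets this style of method attach a confidence signal to its final answer rather than returning a single unverified completion.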
“…Recent research has highlighted the growing capabilities of LLMs in the field of mathematical word problem solving, emphasizing the trend toward more nuanced and sophisticated AI-driven mathematical analysis. MathPrompter [66] uses the GPT-3 DaVinci LLM to solve MWPs with excellent results, demonstrating the potential of LLMs not only to explain but also to generate complex mathematical reasoning, reflecting a human-like understanding of complex problem sets.…”
Section: Math Problem Solving (mentioning)
confidence: 99%
“…Previous work, such as that of Frieder et al. (2023), has shown that advanced LLMs, specifically ChatGPT, tend to be highly inconsistent on mathematics tasks. Similarly, Imani et al. (2023) found that hallucinations tend to be amplified when models attempt mathematical reasoning. We believe that equations and puzzles are useful testing grounds because the quality of the model's answers can be objectively evaluated.…”
Section: Introduction (mentioning)
confidence: 94%
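The "objectively evaluated" point in the statement above is simple to illustrate: a math answer can be scored by exact comparison rather than by judgment. The sketch below is illustrative only; the problems and model answers are made up, not data from any cited evaluation.

    from fractions import Fraction

    # Toy gold answers and model outputs, purely for illustration.
    gold = {"2*x = 10, x = ?": Fraction(5), "1/2 + 1/3 = ?": Fraction(5, 6)}
    model_answers = {"2*x = 10, x = ?": "5", "1/2 + 1/3 = ?": "0.83"}

    def is_correct(predicted: str, expected: Fraction) -> bool:
        try:
            return Fraction(predicted) == expected   # exact arithmetic, no float tolerance
        except ValueError:
            return False                             # unparseable output counts as wrong

    score = sum(is_correct(model_answers[q], a) for q, a in gold.items()) / len(gold)
    print(f"accuracy: {score:.0%}")                  # -> accuracy: 50%

Here the rounded "0.83" is marked wrong under exact comparison, which is precisely the kind of unambiguous verdict that makes equations a useful testing ground.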
“…CoT summarization is related to several techniques that ask the LLM to outline its "thinking" before arriving at a final implementation (Wei et al. 2022; Jiang et al. 2023; Zheng et al. 2023). A number of recent works also use programs as prompts (i.e., a structured chain of thought) in an attempt to help LLMs perform mathematical reasoning (Gao et al. 2022; Imani, Du, and Shrivastava 2023). Related to our automated debugging, Xia and Zhang (2023a) consider a related paradigm, but where feedback comes from humans rather than automated checks.…”
Section: Introduction (mentioning)
confidence: 99%
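For the "programs as prompts" idea mentioned in this statement (a PAL-style structured chain of thought), a minimal sketch follows. It is not code from any cited paper: generate(prompt) is a hypothetical stand-in for a real LLM call, replaced here by a canned completion so the execution step is runnable.

    # Few-shot exemplar that shows the model a worked problem *as a program*.
    EXEMPLAR = '''Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?
    # solution as a program
    initial = 5
    bought = 2 * 3
    answer = initial + bought
    '''

    def generate(prompt: str) -> str:
        # Hypothetical LLM call; a canned completion is returned so the
        # sketch runs end to end without a model.
        return "loaves_start = 200\nloaves_sold = 93 + 39\nanswer = loaves_start - loaves_sold"

    def pal_answer(question: str):
        program = generate(EXEMPLAR + f"Q: {question}\n# solution as a program\n")
        scope = {}
        exec(program, scope)   # run the generated program instead of trusting free-text reasoning
        return scope["answer"]

    print(pal_answer("A baker had 200 loaves, sold 93 in the morning and 39 in the afternoon. How many remain?"))
    # -> 68

The design point is that the arithmetic is delegated to the interpreter: the model only has to produce a correct program, and the executed result is the answer.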