Balancing Cost and Quality: An Exploration of Human-in-the-Loop Frameworks for Automated Short Answer Scoring

Funayama, Hiroaki; Sato, Tasuku; Matsubayashi, Yuichiroh; Mizumoto, Tomoya; Suzuki, Jun; Inui, Kentaro

doi:10.1007/978-3-031-11644-5_38

Cited by 11 publications

(4 citation statements)

References 17 publications


“…Another approach, complementary to ours, is to use human-in-the-loop for validating low confidence outputs [24].…”

Section: Related Work (mentioning)

confidence: 99%

Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models

Phung, Cambronero, Gulwani et al. 2023

Preprint


Large language models trained on code (LLMCs), such as Codex, hold great promise in enhancing programming education by automatically generating feedback for students. We investigate using LLMCs to generate feedback for fixing syntax errors in Python programs, a key scenario in introductory programming. More concretely, given a student's buggy program, our goal is to generate feedback comprising a fixed program along with a natural language explanation describing the errors/fixes, inspired by how a human tutor would give feedback. While using LLMCs is promising, the critical challenge is to ensure high precision in the generated feedback, which is imperative before deploying such technology in classrooms. The main research question we study is: Can we develop LLMCs-based feedback generation techniques with a tunable precision parameter, giving educators quality control over the feedback that students receive? To this end, we introduce PyFiXV, our technique to generate high-precision feedback powered by Codex. The key idea behind PyFiXV is to use a novel run-time validation mechanism to decide whether the generated feedback is suitable for sharing with the student; notably, this validation mechanism also provides a precision knob to educators. We perform an extensive evaluation using two real-world datasets of Python programs with syntax errors and show the efficacy of PyFiXV in generating high-precision feedback.



“…The endeavor to automate the scoring of self-explanation quality has seen the integration of NLP tools and cutting-edge neural network architectures [20]. Techniques like latent semantic analysis (LSA) and recurrent neural network (RNN) interfaced with machine learning underscore the capabilities of automated systems, often outshining traditional manual evaluation in both effectiveness and efficiency [14,20–24]. Furthermore, semi-supervised learning techniques, which capitalize on abundant unlabeled data, have exhibited the potential to refine scoring accuracy [25].…”

Section: Automated Scoring Of Self-explanations: the Imperative For R... (mentioning)

confidence: 99%

Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach

Nakamoto, Flanagan, Yamauchi et al. 2023

Computers


In the realm of mathematics education, self-explanation stands as a crucial learning mechanism, allowing learners to articulate their comprehension of intricate mathematical concepts and strategies. As digital learning platforms grow in prominence, there are mounting opportunities to collect and utilize mathematical self-explanations. However, these opportunities are met with challenges in automated evaluation. Automatic scoring of mathematical self-explanations is crucial for preprocessing tasks, including the categorization of learner responses, identification of common misconceptions, and the creation of tailored feedback and model solutions. Nevertheless, this task is hindered by the dearth of ample sample sets. Our research introduces a semi-supervised technique using the large language model (LLM), specifically its Japanese variant, to enrich datasets for the automated scoring of mathematical self-explanations. We rigorously evaluated the quality of self-explanations across five datasets, ranging from human-evaluated originals to ones devoid of original content. Our results show that combining LLM-based explanations with mathematical material significantly improves the model’s accuracy. Interestingly, there is an optimal limit to how many synthetic self-explanation data can benefit the system. Exceeding this limit does not further improve outcomes. This study thus highlights the need for careful consideration when integrating synthetic data into solutions, especially within the mathematics discipline.


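“…16 No. 4 In recent years, more abstract forms of explanation have also been considered, such as structured information like graphs (4) and natural language text (5). Furthermore, studies have examined the form of the explanation itself, including explanations by training examples that strongly influenced the prediction (6), contrastive explanations that contrast two predictions (7), and counterfactual explanations that present edits to the input that would overturn the prediction (8). The second axis is the "method of explanation generation." First, there are post-hoc methods, which take the NLP system as given and attempt to explain its predictions after the fact. Various techniques have been proposed, such as estimating the importance of each input word for the output label by slightly perturbing the input (3), and estimating the importance of input words by differentiating the prediction function with respect to the input (9). The second direction is self-explain methods, which design an interpretable system from the outset. These include approaches that explicitly incorporate a module for identifying the important parts of the input and then make predictions in a pipeline fashion (10), (11), and approaches that generate explanations from the hidden layers of the prediction model using a language generator (5) … (13), (14), as well as meta-analyses that integrate the results of previous studies (15). Written corrective feedback comes in various types, including direct correction, indirect correction, metalinguistic explanation, focused feedback, electronic feedback, and reformulation (16) … (17), how … is involved in feedback (18), and approaches that let learners choose the feedback they wish to receive from their teacher (19) … accuracy has improved dramatically (22), (23), and models showing accuracy comparable to that of human raters have appeared (24). Recently, as in other areas of NLP, Transformer-based models have been actively developed for SAS as well (25), (26). Thus, the central research problem in this field has long been almost exclusively the improvement of model prediction accuracy (27) … (32), the generation of counterfactual correct answers (33), and the use of confidence scores that represent the reliability of predictions (34)…”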

Section: 電子情報通信学会基礎・境界ソサイエティ (IEICE Engineering Sciences Society) (unclassified)

Frontiers in Explainable Automated Writing Evaluation

Inui, Ishii, Matsubayashi et al. 2023

IEICE Fundamentals Review

Self Cite


Explainability is a crucial component of natural language processing (NLP) systems. Explanation is communication, and research on explainability is expected to address notions such as communicative goals and common grounding. Automated Writing Evaluation (AWE) is an ideal field to contribute to such research. AWE refers to NLP tasks designed to support human learners by evaluating the quality of text produced in educational contexts, such as written answers to questions and argumentative essays, and providing constructive feedback. Since explanations are vital in educational assessment, pedagogical literature is rich with insights that can facilitate research on explainability in AWE systems. In this paper, what style of explanations are pedagogically suitable and technologically feasible at each layer of language production, ranging from content planning to surface realization, is explored and open research issues are identified to encourage researchers to enter this emerging field.
