For some students, standardized tests serve as a conduit to disclose sensitive issues of harm or distress that may otherwise go unreported. By detecting this writing, known as crisis papers, testing programs have a unique opportunity to assist in mitigating the risk of harm to these students. The use of machine learning to automatically detect such writing is necessary in the context of online tests and automated scoring. To achieve a detection system that is accurate, humans must first consistently label the data that are used to train the model. This paper argues that the existing guidelines are not sufficient for this task and proposes a three-level rubric to guide the collection of the training data. In showcasing the fundamental machine learning procedures for creating an automatic text classification system, the following evidence emerges in support of the operational use of this rubric. First, hand-scorers largely agree with one another in assigning labels to text according to the rubric. Additionally, when these labeled data are used to train a baseline classifier, the model exhibits promising performance. Recommendations are made for improving the hand-scoring training process, with the ultimate goal of quickly and accurately assisting students in crisis.
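The two pieces of evidence described above, inter-scorer agreement on rubric labels and a baseline classifier trained on those labels, can be illustrated with a minimal sketch. The example responses, label scheme, and model choice (TF-IDF features with logistic regression in scikit-learn) are assumptions for illustration, not the paper's actual data or pipeline.

```python
# A minimal sketch, not the paper's pipeline: (1) agreement between two
# hypothetical hand-scorers applying a three-level rubric, and (2) a simple
# baseline text classifier trained on the labeled responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

# Hypothetical student responses labeled by two hand-scorers with an assumed
# three-level rubric (0 = no concern, 1 = possible concern, 2 = crisis alert).
responses = [
    "I had fun at recess and played soccer with my friends.",
    "Sometimes I feel like nobody would notice if I was gone.",
    "My stepdad hits me and I am scared to go home after school.",
]
rater_a = [0, 1, 2]
rater_b = [0, 2, 2]

# (1) Inter-rater agreement on the rubric labels.
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

# (2) Baseline classifier trained on one set of labels
# (rater_a stands in for an adjudicated label here).
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(responses, rater_a)
print(baseline.predict(["I do not feel safe at home anymore."]))
```

In practice the agreement check would be run over many scorer pairs and the classifier would be evaluated on a held-out set; the sketch only shows the shape of the two computations.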
Background: Deep learning methods, in which models do not use explicit features and instead rely on implicit features estimated during model training, suffer from an explainability problem. In text classification, saliency maps that reflect the importance of words in a prediction are one approach toward explainability. However, little is known about whether the salient words agree with those identified by humans as important.

Objectives: The current study examines in-line annotations from human annotators and saliency-map annotations from a deep learning model (an ELECTRA transformer) to understand how well both humans and machines provide evidence for their assigned label.

Methods: Data were responses to test items across a mix of United States subjects, states, and grades. Humans were trained to annotate responses to justify a crisis alert label, and two model interpretability methods (LIME and Integrated Gradients) were used to obtain engine annotations. Human inter-annotator agreement and engine agreement with the human annotators were computed and compared.

Results and Conclusions: Human annotators agreed with one another at rates similar to those observed in the literature on comparable tasks. Annotations derived using Integrated Gradients (IG) agreed with the human annotators at higher rates than LIME on most metrics; however, both methods underperformed relative to the human annotators.

Implications: Saliency-map-based engine annotations show promise as a form of explanation but do not reach human annotation agreement levels. Future work should examine the appropriate unit for annotation (e.g., word, sentence), other gradient-based methods, and approaches for mapping continuous saliency values to Boolean annotations.
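A minimal sketch of the Integrated Gradients path only (the LIME path is omitted), assuming a Hugging Face ELECTRA checkpoint and Captum's LayerIntegratedGradients. The checkpoint name, "crisis alert" class index, thresholding rule, and example text are illustrative assumptions, not the study's configuration.

```python
# Sketch: token-level saliency from Integrated Gradients on an ELECTRA
# classifier, then a simple cutoff to turn continuous saliency scores into
# Boolean engine annotations.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "google/electra-small-discriminator"  # stand-in for the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()

def crisis_logit(input_ids, attention_mask):
    # Attribute with respect to the logit of the assumed "crisis alert" class (index 1).
    return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

text = "Sometimes I feel like nobody would notice if I was gone."
enc = tokenizer(text, return_tensors="pt")
baseline_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

# Integrate gradients over the embedding layer, from an all-[PAD] baseline
# to the actual input.
lig = LayerIntegratedGradients(crisis_logit, model.electra.embeddings)
attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(enc["attention_mask"],),
    n_steps=50,
)

# Collapse per-dimension attributions to one saliency score per token, scale to
# [-1, 1], then apply a simple cutoff -- one possible answer to the open
# question of how to map saliency values to Boolean annotations.
scores = attributions.sum(dim=-1).squeeze(0)
scores = scores / scores.abs().max()
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
engine_annotation = [(tok, bool(s > 0.5)) for tok, s in zip(tokens, scores)]
print(engine_annotation)
```

Agreement with the human annotations can then be computed over the resulting Boolean token vectors, using the same metrics applied to human inter-annotator agreement.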