Validation on machine reading comprehension software without annotated labels: a property-based method

Chen, Songqiang; Jin, Shuo; Xie, Xiaoyuan

doi:10.1145/3468264.3468569

Cited by 21 publications

(4 citation statements)

References 41 publications

“…As introduced above, QA software has been widely used in daily human life, thus there is an urgent demand to assure the quality of its returned answers and reveal its undisclosed defects. But currently, almost all the NLP models, including the core models in QA software, are mainly tested in the reference-based paradigm (Ribeiro et al. 2020; Chen et al. 2021b). As explained in Section 1, using this test paradigm, the testers must obtain a well-annotated benchmark dataset at first, which means that the manually annotated reference answers are mandatory during testing QA software.…”

Section: Motivation (mentioning)

confidence: 99%

qaAskeR+: a novel testing method for question answering software via asking recursive questions

2023

Self Cite

Question Answering (QA) is an attractive and challenging area in NLP community. With the development of QA technique, plenty of QA software has been applied in daily human life to provide convenient access of information retrieval. To investigate the performance of QA software, many benchmark datasets have been constructed to provide various test cases. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases are mandatory to be annotated with much human effort before testing. As a result, neither the just-in-time test during usage nor the extensible test on massive unlabeled real-life data is feasible, which keeps the current testing of QA software from being flexible and sufficient. In this work, we propose a novel testing method, qaAskeR+, with five new Metamorphic Relations for QA software. qaAskeR+ does not refer to the annotated labels of test cases. Instead, based on the idea that a correct answer should imply a piece of reliable knowledge that always conforms with any other correct answer, qaAskeR+ tests QA software by inspecting its behaviors on multiple recursively asked questions that are relevant to the same or some further enriched knowledge. Experimental results show that qaAskeR+ can reveal quite a few violations that indicate actual answering issues on various mainstream QA software without using any pre-annotated labels.
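
The label-free idea in this abstract can be illustrated with a minimal Python sketch. It is not one of the paper's five Metamorphic Relations: the relation used here (turning a "Where was X born?" answer into a "Who was born in ...?" follow-up), the helper names `relation_holds`, `birthplace_followup`, and `make_lookup_qa`, and the lookup-table QA systems are all assumptions made for illustration. The point is only that a violation can be detected by comparing two answers from the same system, with no annotated reference answer.

```python
from typing import Callable, Dict, Tuple

# A QA system is modelled as a function (context, question) -> answer.
QA = Callable[[str, str], str]


def normalize(answer: str) -> str:
    """Lower-case, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def relation_holds(
    qa: QA,
    context: str,
    question: str,
    make_followup: Callable[[str, str], Tuple[str, str]],
) -> bool:
    """Ask a source question, derive a follow-up question from the returned
    answer, and check that the follow-up answer matches what the first
    answer implies. No reference label is consulted."""
    source_answer = qa(context, question)
    followup_question, implied_answer = make_followup(question, source_answer)
    followup_answer = qa(context, followup_question)
    return normalize(followup_answer) == normalize(implied_answer)


def birthplace_followup(question: str, answer: str) -> Tuple[str, str]:
    """Hypothetical relation: 'Where was X born?' answered with A implies
    that 'Who was born in A?' should be answered with X."""
    subject = question[len("Where was "):-len(" born?")]
    return f"Who was born in {answer}?", subject


def make_lookup_qa(table: Dict[str, str]) -> QA:
    """Toy stand-in for real QA software: answers come from a fixed table."""
    def qa(context: str, question: str) -> str:
        return table.get(question, "")
    return qa


CONTEXT = "Marie Curie was born in Warsaw."
SOURCE_QUESTION = "Where was Marie Curie born?"

consistent_qa = make_lookup_qa({
    "Where was Marie Curie born?": "Warsaw",
    "Who was born in Warsaw?": "Marie Curie",
})
inconsistent_qa = make_lookup_qa({
    "Where was Marie Curie born?": "Warsaw",
    "Who was born in Warsaw?": "Pierre Curie",
})

consistent_result = relation_holds(
    consistent_qa, CONTEXT, SOURCE_QUESTION, birthplace_followup)
inconsistent_result = relation_holds(
    inconsistent_qa, CONTEXT, SOURCE_QUESTION, birthplace_followup)
```

A `False` result flags a violation worth inspecting; it does not by itself say which of the two answers is wrong.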

“…Besides, we propose to validate the machine reading comprehension (MRC) DL models with MT in our previous work Chen et al (2021b). It aims to provide the MRC models with one systematic and extensible assessment of language understanding capabilities against required linguistic properties.…”

Section: Metamorphic Testing For Deep Learning Software (mentioning)

confidence: 99%

qaAskeR+: a novel testing method for question answering software via asking recursive questions

2023

Self Cite

“…Constructing proper oracles has long been difficult for testing DNNs [93]. Metamorphic or differential testing has been used extensively to overcome the difficulty of explicitly establishing testing oracles [24,68,73,84]. Consequently, DNNs are considered incorrect if they produce inconsistent results.…”

Section: Related Work (mentioning)

confidence: 99%

MDPFuzz: Testing Models Solving Markov Decision Processes

Pang, Yuan, Wang

2021

Preprint

The Markov decision process (MDP) provides a mathematical framework for modeling sequential decision-making problems, many of which are crucial to security and safety, such as autonomous driving and robot control. The rapid development of artificial intelligence research has created efficient methods for solving MDPs, such as deep neural networks (DNNs), reinforcement learning (RL), and imitation learning (IL). However, these popular models for solving MDPs are neither thoroughly tested nor rigorously reliable. We present MDPFuzzer, the first blackbox fuzz testing framework for models solving MDPs. MDPFuzzer forms testing oracles by checking whether the target model enters abnormal and dangerous states. During fuzzing, MDPFuzzer decides which mutated state to retain by measuring if it can reduce cumulative rewards or form a new state sequence. We design efficient techniques to quantify the "freshness" of a state sequence using Gaussian mixture models (GMMs) and dynamic expectation-maximization (DynEM). We also prioritize states with high potential of revealing crashes by estimating the local sensitivity of target models over states. MDPFuzzer is evaluated on five state-of-the-art models for solving MDPs, including supervised DNN, RL, IL, and multi-agent RL. Our evaluation includes scenarios of autonomous driving, aircraft collision avoidance, and two games that are often used to benchmark RL. During a 12-hour run, we find over 80 crash-triggering state sequences on each model. We show inspiring findings that crash-triggering states, though look normal, induce distinct neuron activation patterns compared with normal states. We further develop an abnormal behavior detector to harden all the evaluated models and repair them with the findings of MDPFuzzer to significantly enhance their robustness without sacrificing accuracy.
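
The retention rule described in this abstract (keep a mutated state if it lowers the cumulative reward or produces a new state sequence) can be sketched on a toy problem. This is a simplified illustration, not MDPFuzzer itself: the one-dimensional environment, the policy, and every function name below are hypothetical, and the GMM/DynEM freshness estimate is replaced by a plain set-membership check on coarsely rounded state traces. The sensitivity-based prioritization is omitted.

```python
import random
from typing import Callable, List, Tuple

Step = Callable[[float, float], Tuple[float, float, bool]]


def toy_policy(state: float) -> float:
    """Hypothetical policy: move halfway back towards the origin."""
    return -0.5 * state


def toy_step(state: float, action: float) -> Tuple[float, float, bool]:
    """Hypothetical environment: reward is the negative distance from the
    origin; ending a step more than 5 units away counts as a crash."""
    new_state = state + action
    return new_state, -abs(new_state), abs(new_state) > 5.0


def rollout(policy, step: Step, start: float, horizon: int = 20):
    """Run one episode; return (cumulative reward, state trace, crashed)."""
    state, total, trace = start, 0.0, []
    for _ in range(horizon):
        state, reward, crashed = step(state, policy(state))
        total += reward
        trace.append(round(state, 1))  # coarse trace used for novelty
        if crashed:
            return total, tuple(trace), True
    return total, tuple(trace), False


def fuzz(policy, step: Step, seeds: List[float],
         iterations: int = 200, seed: int = 0):
    """Mutate initial states; retain a mutant if it reduces the cumulative
    reward relative to its parent or yields an unseen state trace."""
    rng = random.Random(seed)
    corpus, seen, crashes = [], set(), []
    for start in seeds:
        total, trace, crashed = rollout(policy, step, start)
        corpus.append((start, total))
        seen.add(trace)
        if crashed:
            crashes.append(start)
    for _ in range(iterations):
        parent, parent_reward = rng.choice(corpus)
        child = parent + rng.uniform(-0.5, 0.5)  # small random mutation
        total, trace, crashed = rollout(policy, step, child)
        if crashed:
            crashes.append(child)
        if total < parent_reward or trace not in seen:
            corpus.append((child, total))
            seen.add(trace)
    return corpus, crashes


corpus, crashes = fuzz(toy_policy, toy_step, seeds=[0.0, 1.0])
```

The oracle here is the crash flag returned by the environment, which mirrors the abstract's notion of checking for abnormal states rather than comparing against labelled outputs.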

“…To discover erroneous behaviors in NLP software, researchers have designed various software testing techniques [10,25,40,63,81]. A test case for NLP software is in the form of a text (e.g., a sentence) and its label, where the label is the expected correct output of the NLP software.…”

Section: Introduction (mentioning)

confidence: 99%

AEON: A Method for Automatic Evaluation of NLP Test Cases

Huang, Zhang, Wang et al.

2022

Preprint

Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus, the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., grammar errors), which leads to a high false alarm rate and unnatural test cases. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade NLP software when utilized in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns the best with human judgment. In particular, AEON achieves the best average precision in detecting semantic inconsistent test cases, outperforming the best baseline metric by 10%. In addition, AEON also has the highest average precision of finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON's potential in improving NLP software. CCS Concepts: Software and its engineering → Software testing and debugging.
