Question Answering (QA) is an attractive and challenging area in the NLP community. With the development of QA techniques, much QA software has been deployed in daily life to provide convenient access to information. To investigate the performance of QA software, many benchmark datasets have been constructed to provide various test cases. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases must be annotated with considerable human effort before testing. As a result, neither just-in-time testing during usage nor extensible testing on massive unlabeled real-life data is feasible, which keeps current testing of QA software from being flexible and sufficient. In this work, we propose a novel testing method, qaAskeR+, with five new Metamorphic Relations for QA software. qaAskeR+ does not rely on the annotated labels of test cases. Instead, based on the idea that a correct answer should imply a piece of reliable knowledge that conforms with any other correct answer, qaAskeR+ tests QA software by inspecting its behavior on multiple recursively asked questions that are relevant to the same or some further enriched knowledge. Experimental results show that qaAskeR+ reveals numerous violations that indicate actual answering issues in various mainstream QA software, without using any pre-annotated labels.
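To make the label-free idea concrete, the following is a minimal sketch of one possible metamorphic check in this spirit: ask a question, build a follow-up question from the returned answer, and flag an inconsistency between the two answers as a violation. The `answer` function is a hypothetical stub standing in for a real QA system, and the relation shown (a factoid answer substituted into a yes/no confirmation question) is a simplified illustration, not necessarily one of qaAskeR+'s five Metamorphic Relations.

```python
def answer(question: str) -> str:
    """Hypothetical stub for a real QA system under test."""
    knowledge = {
        "Who wrote Hamlet?": "Shakespeare",
        "Did Shakespeare write Hamlet?": "yes",
    }
    return knowledge.get(question, "unknown")


def metamorphic_check(question: str) -> bool:
    """Test the QA system without any pre-annotated label.

    The first answer is folded into a follow-up confirmation
    question; the two answers must be mutually consistent.
    Returns True if the relation holds, False on a violation.
    """
    first = answer(question)  # e.g. "Shakespeare"
    # Turn "Who wrote X?" into a yes/no confirmation question.
    follow_up = question.replace("Who wrote", f"Did {first} write")
    second = answer(follow_up)  # e.g. "yes"
    # A "no" (or any non-affirmative answer) signals a likely
    # answering issue, with no reference label ever consulted.
    return second.lower() == "yes"


print(metamorphic_check("Who wrote Hamlet?"))  # True: answers agree
```

Because the oracle is consistency between the system's own answers rather than a gold label, such a check can in principle run on unlabeled real-life questions at test time.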