DialFact: A Benchmark for Fact-Checking in Dialogue

Gupta, Prakhar; Wu, Chien-Sheng; Liu, Wenhao; Xiong, Caiming

doi:10.48550/arxiv.2110.08222

Cited by 8 publications

(10 citation statements)

References 29 publications


“…Honovich et al [72] present a trainable metric for the KGD task, which also applies NLI. It is also noteworthy that Gupta et al [66] propose datasets that can benefit fact-checking systems specialized for dialogue systems. Conv-FEVER corpus [154] is a factual consistency detection dataset, which is created by adapting from Wizard-of-Wikipedia dataset [31].…”

Section: Hallucination Metrics For Generation-based Dialogue Systems ... (mentioning)

confidence: 99%

“…Fact-checking in dialogue systems. In addition to the factual consistency in responses from knowledge grounded dialogue systems, fact-checking in dialogue systems is a future direction of dealing with the hallucination problem in dialogue system [66]. The dialogue fact-checking involves verifiable claim detection, which is an important line of distinguishing hallucination-prone dialogue, and evidence retrieval from an external source.…”

Section: Future Directions In Dialogue Generation (mentioning)

confidence: 99%


Survey of Hallucination in Natural Language Generation

Ji, Lee, Frieske et al. 2022. Preprint.

Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent natural language generation, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended texts, which degrades the system performance and fails to meet user expectations in many real-world scenarios. In order to address this issue, there have been studies in measuring and mitigating hallucinated texts. However, there has not been a comprehensive review of the state-of-the-art in hallucination detection and mitigation. In this survey, we provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; (2) an overview of task-specific research progress for hallucinations in a large set of downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.


“…Hallucination Evaluation. Recently introduced benchmarks can serve as testbeds for knowledge grounding in dialogue systems, such as BEGIN (Dziri et al, 2021b), DialFact (Gupta et al, 2021), and Attributable to Identified Sources (AIS) framework (Rashkin et al, 2021a). Meanwhile, a recent study has reopened the question of the most reliable metric for automatic evaluation of hallucination-free models, with the Q² metric (Honovich et al, 2021) showing performance comparable to human annotation.…”

Section: Related Work (mentioning)

confidence: 99%

FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Dziri, Kamalloo, Milton et al. 2022. Preprint.

The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. Dziri et al. (2022)'s investigation of hallucinations has revealed that existing knowledge-grounded benchmarks are contaminated with hallucinated responses at an alarming level (>60% of the responses) and models trained on this data amplify hallucinations even further (>80% of the responses). To mitigate this behavior, we adopt a data-centric solution and create FAITHDIAL, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WOW) benchmark. We observe that FAITHDIAL is more faithful than WoW while also maintaining engaging conversations. We show that FAITHDIAL can serve as training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts the performance by 21.1 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FAITHDIAL generalize to zero-shot transfer on other datasets, such as CMU-DOG and TOPICALCHAT. Finally, human evaluation reveals that responses generated by models trained on FAITHDIAL are perceived as more interpretable, cooperative, and engaging.


“…This enables a more well-defined task, since determining the truthfulness of a fact w.r.t a general "real world" is subjective and depends on the knowledge, values and beliefs of the subject (Heidegger, 2001). This definition follows similar strictness in Textual Entailment, Question Answering, Summarization and other tasks where comprehension is based on a given grounding text, irrespective of contradiction with other world knowledge.…”

Dataset table embedded in the excerpt:

Task / Dataset                        # Examples   Open Test   Cons.
Summarization
  FRANK (Pagnoni et al, 2021)                671       +       33.2%
  SummEval (Fabbri et al, 2021a)           1,600       -       81.6%
  MNBM (Maynez et al, 2020)                2,500       -       10.2%
  QAGS-CNNDM                                 235       -       48.1%
  QAGS-XSum                                  239       -       48.5%
Dialogue
  BEGIN (Dziri et al, 2021)                  836       +       33.7%
  Q² (Honovich et al, 2021)                1,088       -       57.7%
  DialFact (Gupta et al, 2021)             8,689       +       38.5%
Fact Verification
  FEVER (Thorne et al, 2018)              18,209       -       35.1%
  VitaminC (Schuster et al, 2021)         63,054       +       49.9%
Paraphrasing
  PAWS (Zhang et al, 2019)                 8,000       +       44.2%

Section: Definitions and Terminology (mentioning)

confidence: 99%

“…DialFact Gupta et al (2021) introduced the task of fact-verification in dialogue and constructed a dataset of conversational claims paired with pieces of evidence from Wikipedia. They define three tasks: (1) detecting whether a response contains verifiable content (2) retrieving relevant evidence and (3) predicting whether a response is supported by the evidence, refuted by the evidence or if there is not enough information to determine.…”

Section: Dialogue Generation (mentioning)

confidence: 99%
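
The three DialFact tasks named in the excerpt above (verifiable-content detection, evidence retrieval, and claim verification) can be read as stages of one pipeline. The following is a minimal sketch of that structure only, not the authors' method: every function name, the word lists, and the heuristics (opinion-marker prefix check, word-overlap ranking, negation mismatch) are hypothetical stand-ins for the trained models a real system would use.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List


class Label(Enum):
    SUPPORTED = "SUPPORTED"
    REFUTED = "REFUTED"
    NOT_ENOUGH_INFO = "NOT ENOUGH INFO"


@dataclass
class Verdict:
    verifiable: bool
    evidence: List[str]
    label: Label


# Hypothetical word lists, chosen only to keep the toy example small.
_STOPWORDS = {"a", "an", "the", "is", "was", "are", "in", "of", "to", "and", "it", "not", "i"}
_OPINION_MARKERS = ("i think", "i love", "i like", "i feel")
_NEGATIONS = {"not", "never", "no"}


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def _content_words(text: str) -> set:
    return set(_tokens(text)) - _STOPWORDS


def is_verifiable(response: str) -> bool:
    """Task 1 (toy): treat responses that open with an opinion marker as non-verifiable."""
    return not response.lower().strip().startswith(_OPINION_MARKERS)


def retrieve_evidence(response: str, corpus: List[str], k: int = 1) -> List[str]:
    """Task 2 (toy): rank corpus sentences by content-word overlap, keep the top k with any overlap."""
    scored = [(len(_content_words(response) & _content_words(doc)), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def verify(response: str, evidence: List[str]) -> Label:
    """Task 3 (toy): no evidence -> NOT_ENOUGH_INFO; negation mismatch -> REFUTED; else SUPPORTED."""
    if not evidence:
        return Label.NOT_ENOUGH_INFO
    response_negated = bool(set(_tokens(response)) & _NEGATIONS)
    evidence_negated = bool(set(_tokens(" ".join(evidence))) & _NEGATIONS)
    return Label.REFUTED if response_negated != evidence_negated else Label.SUPPORTED


def fact_check(response: str, corpus: List[str]) -> Verdict:
    """Chain the three stages; non-verifiable responses skip retrieval and verification."""
    if not is_verifiable(response):
        return Verdict(False, [], Label.NOT_ENOUGH_INFO)
    evidence = retrieve_evidence(response, corpus)
    return Verdict(True, evidence, verify(response, evidence))


corpus = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
]
print(fact_check("The Eiffel Tower is not in Paris.", corpus).label)
```

Keeping the stages as separate functions mirrors the excerpt's task split, so any one placeholder could be swapped for a learned component without touching the other two.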

TRUE: Re-evaluating Factual Consistency Evaluation

Honovich, Aharoni, Herzig et al. 2022. Preprint.

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.
