SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

Lee, Harrison; Gupta, Raghav; Rastogi, Abhinav; Cao, Yuan; Zhang, Bin; Wu, Yonghui

doi:10.1609/aaai.v36i10.21341

Cited by 7 publications

(25 citation statements)

References 32 publications


“…Since SGSAcc uses schema information to construct candidate references, we also validated that SGSAcc is robust to different schema writing styles, as shown by the consistently high F1-score (>0.95) on distinguishing faithful and unfaithful utterances with the rephrased schema in the SGD-X extension (Lee et al, 2021) of SGD (See Table 2 in Appendix A).…”

Section: SGSAcc Evaluation (mentioning)

confidence: 61%

“…Since SGSAcc uses the slot description in service schema to construct entailment reference, we check its robustness to different schema writing styles so that it can be used to evaluate a variety of services with heterogeneous interfaces. We use the SGD-X dataset (Lee et al, 2021), which contains five versions of schema rephrased from the original SGD to test whether SGSAcc is sensitive to writing styles.…”

Section: A Robustness Against Schema Writing Styles (mentioning)

confidence: 99%


Schema-Guided Semantic Accuracy: Faithfulness in Task-Oriented Dialogue Response Generation

Chen, Lin, Byrne

2023

Preprint


Ensuring that generated utterances are faithful to dialogue actions is crucial for Task-Oriented Dialogue Response Generation. Slot Error Rate (SER) only partially measures generation quality in that it solely assesses utterances generated from non-categorical slots whose values are expected to be reproduced exactly. Utterances generated from categorical slots, which are more variable, are not assessed by SER. We propose Schema-Guided Semantic Accuracy (SGSAcc) to evaluate utterances generated from both categorical and non-categorical slots by recognizing textual entailment. We show that SGSAcc can be applied to evaluate utterances generated from a wide range of dialogue actions in the Schema Guided Dialogue (SGD) dataset with good agreement with human judgment. We also identify a previously overlooked weakness in generating faithful utterances from categorical slots in unseen domains. We show that prefix tuning applied to T5 generation can address this problem. We further build an ensemble of prefix-tuning and fine-tuning models that achieves the lowest SER reported and high SGSAcc on the SGD dataset.


“…A task-specific training (e.g., reserving a table) is performed in the first phase. Task-specific training datasets are generally available for a wide range of tasks in many domains [25,62], whereas personalized counterparts are practically impossible to obtain. To overcome this challenge, we employ the unsupervised personalization phase.…”

Section: Task-specific Training (mentioning)

confidence: 99%

Personalizing Task-oriented Dialog Systems via Zero-shot Generalizable Reward Function

Siddique, Maqbool, Taywade et al. 2022

Proceedings of the 31st ACM International Conference on Information & Knowledge Management


Task-oriented dialog systems enable users to accomplish tasks using natural language. State-of-the-art systems respond to users in the same way regardless of their personalities, although personalizing dialogues can lead to higher levels of adoption and better user experiences. Building personalized dialog systems is an important, yet challenging endeavor and only a handful of works took on the challenge. Most existing works rely on supervised learning approaches and require laborious and expensive labeled training data for each user profile. Additionally, collecting and labeling data for each user profile is virtually impossible. In this work, we propose a novel framework, P-ToD, to personalize task-oriented dialog systems capable of adapting to a wide range of user profiles in an unsupervised fashion using a zero-shot generalizable reward function. P-ToD uses a pre-trained GPT-2 as a backbone model and works in three phases. Phase one performs task-specific training. Phase two kicks off unsupervised personalization by leveraging the proximal policy optimization algorithm that performs policy gradients guided by the zero-shot generalizable reward function. Our novel reward function can quantify the quality of the generated responses even for unseen profiles. The optional final phase fine-tunes the personalized model using a few labeled training examples. We conduct extensive experimental analysis using the personalized bAbI dialogue benchmark for five tasks and up to 180 diverse user profiles. The experimental results demonstrate that P-ToD, even when it had access to zero labeled examples, outperforms state-of-the-art supervised personalization models and achieves competitive performance on BLEU and ROUGE metrics when compared to a strong fully-supervised GPT-2 baseline.


“…Large LMs are often sensitive to the choice of prompt (Zhao et al, 2021b; Reynolds and McDonell, 2021). To this end, we evaluate SDT-seq on the SGD-X (Lee et al, 2021b) benchmark, comprising 5 variants with paraphrased slot names and descriptions for every schema (Appendix Figure 4). Note that SDT-seq only makes use of slot names, so variations in description have no effect on it.…”

Section: Robustness (mentioning)

confidence: 99%

“…Also, descriptions only provide indirect supervision about how to interact with a service compared to an example. Furthermore, Lee et al (2021b) showed that schema-guided DST models are not robust to variations in schema descriptions, causing significant quality drops.…”

Section: Introduction (mentioning)

confidence: 99%

Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue

Gupta, Lee, Zhao et al. 2022

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua

Self Cite


Building universal dialogue systems that operate across multiple domains/APIs and generalize to new ones with minimal overhead is a critical challenge. Recent works have leveraged natural language descriptions of schema elements to enable such systems; however, descriptions only indirectly convey schema semantics. In this work, we propose Show, Don't Tell, which prompts seq2seq models with a labeled example dialogue to show the semantics of schema elements rather than tell the model through descriptions. While requiring similar effort from service developers as generating descriptions, we show that using short examples as schema representations with large language models results in state-of-the-art performance on two popular dialogue state tracking benchmarks designed to measure zero-shot generalization: the Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.
