Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua 2021
DOI: 10.18653/v1/2021.naacl-industry.38

Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations

Abstract: Intent detection is a key component of modern goal-oriented dialog systems that accomplish a user task by predicting the intent of users' text input. There are three primary challenges in designing robust and accurate intent detection models. First, typical intent detection models require a large amount of labeled data to achieve high accuracy. Unfortunately, in practical scenarios it is more common to find small, unbalanced, and noisy datasets. Secondly, even with large training data, the intent detection mod…

Cited by 11 publications (16 citation statements)
References 10 publications
“…For the three intent classification datasets, in addition to the original evaluation data, we also evaluate on a difficult subset of each test set described in (Qi et al 2021). The difficult subsets are constructed by comparing the TF-IDF vector of each test example to that of the training examples for a given intent.…”
Section: Contrastive Learning (mentioning)
confidence: 99%
“…We also experimented with generating our own difficult subsets in a similar manner using BERT-based sentence encoders, and compare each test example with the mean-pooling of the training examples for that intent. Result shows that the TF-IDF method yields a more challenging subset, thus we report results on the original subsets from Qi et al (2021). The evaluation metric for all intent classification datasets is accuracy.…”
Section: Contrastive Learning (mentioning)
confidence: 99%
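The TF-IDF comparison described in the statements above can be sketched roughly as follows. This is a minimal illustration and not the procedure from Qi et al (2021): the whitespace tokenizer, the smoothed IDF formula, the use of the maximum cosine similarity to same-intent training examples, the `threshold` value, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency computed on training texts only.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # Terms unseen in training are dropped from the vector.
    tf = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def difficult_subset(train, test, threshold=0.5):
    """Keep test examples that look unlike every training example of their intent.

    train and test are lists of (text, intent) pairs. The threshold is a
    hypothetical cut-off, not a value taken from the cited papers.
    """
    idf = fit_idf([text for text, _ in train])
    vectors_by_intent = {}
    for text, intent in train:
        vectors_by_intent.setdefault(intent, []).append(tfidf(text, idf))
    hard = []
    for text, intent in test:
        vec = tfidf(text, idf)
        best = max((cosine(vec, tv) for tv in vectors_by_intent.get(intent, [])), default=0.0)
        if best < threshold:
            hard.append((text, intent))
    return hard
```

Under these assumptions, a test utterance that shares few weighted terms with the training utterances of its own intent ends up in the difficult subset, while near-paraphrases of training data are filtered out.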
“…Query-document pairs are concatenated and sent through Transformer-based encoders, an additional layer on top of the encoded representation is adopted to produce a relevance score of the document to the query, which is then used for ranking. Arora et al (2020) and Qi et al (2021) benchmark intent detection models on intent detection datasets such as CLINC150 (Larson et al, 2019) where sufficient training examples exist for each intent. On the other hand, our use case focuses on the scenarios where answer text is available but training examples are insufficient.…”
Section: Related Work (mentioning)
confidence: 99%
“…We also create the few-shot version of these datasets to evaluate the models' performance on small datasets. Additionally, after observing the close accuracy results among the models, we follow Arora et al (2020) and Qi et al (2021) to create the TF*IDF and jaccard based difficult testing set to differentiate them better. Overall, our benchmark generates about 1000 data points, including accuracy and training time in default, few-shot training, and difficult testing settings.…”
Section: Introduction (mentioning)
confidence: 99%
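A Jaccard-based variant of the difficult test set mentioned in the last statement might look like the sketch below. The citing text does not give the procedure, so the token-set definition, the comparison against same-intent training examples, the `threshold` value, and the function names are assumptions for illustration only.

```python
def jaccard(a, b):
    # Jaccard similarity: size of the intersection over size of the union.
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def jaccard_difficult_subset(train, test, threshold=0.3):
    """Keep test examples with low token overlap with same-intent training data.

    train and test are lists of (text, intent) pairs. The threshold is a
    hypothetical cut-off chosen for the example.
    """
    tokens_by_intent = {}
    for text, intent in train:
        tokens_by_intent.setdefault(intent, []).append(text.lower().split())
    hard = []
    for text, intent in test:
        tokens = text.lower().split()
        best = max(
            (jaccard(tokens, train_tokens) for train_tokens in tokens_by_intent.get(intent, [])),
            default=0.0,
        )
        if best < threshold:
            hard.append((text, intent))
    return hard
```

Unlike the TF-IDF comparison, this measure treats every token equally, so frequent function words count as much as rare content words when overlap is computed.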