Proceedings of the First Workshop on NLP for Conversational AI 2019
DOI: 10.18653/v1/w19-4101

A Repository of Conversational Datasets

Abstract: Progress in Machine Learning is often driven by the availability of large datasets, and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation procedure for conversational response selection models using 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering…
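The 1-of-100 evaluation the abstract refers to is easy to make concrete. Below is a minimal NumPy sketch, assuming a generic `score_fn(context, response)` of our own naming (the repository's actual baselines are implemented differently): each context in a batch of 100 test examples is scored against all 100 responses, and an example counts as correct when its true response ranks first.

```python
import numpy as np

def one_of_100_accuracy(score_fn, contexts, responses):
    """1-of-100 accuracy over a single batch of 100 (context, response) pairs.

    responses[i] is the true response for contexts[i]; the remaining 99
    responses in the batch serve as distractors. `score_fn` is a
    hypothetical callable returning a scalar relevance score.
    """
    assert len(contexts) == 100 and len(responses) == 100
    # scores[i, j] = relevance of response j to context i
    scores = np.array([[score_fn(c, r) for r in responses] for c in contexts])
    predictions = scores.argmax(axis=1)  # best-scoring response per context
    return float((predictions == np.arange(100)).mean())
```

Averaging this quantity over all test batches yields the reported metric; ties are broken arbitrarily by argmax.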

Cited by 67 publications (89 citation statements)
References 40 publications

“…The final layer is linear and maps the text into the final l-dimensional (l = 512) representation: h_c and h_r. Other standard and more sophisticated encoder models can also be used to provide the final encodings h_c and h_r, but the current architecture shows a good trade-off between speed and efficacy, with strong and robust performance in our empirical evaluations on the response retrieval task using Reddit (Al-Rfou et al., 2016), OpenSubtitles (Lison and Tiedemann, 2016), and AmazonQA (Wan and McAuley, 2016) conversational test data; see Henderson et al. (2019a) for further details. In training, the constant C is constrained to lie between 0 and √l. Following Henderson et al. (2017), the scoring function in the training objective aims to maximise the similarity score of context-reply pairs that go together, while minimising the score of random pairings: negative examples.…”
Section: PolyResponse: Conversational Search
Mentioning confidence: 99%
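The objective this excerpt describes can be sketched compactly. The following is an illustrative NumPy version under our assumptions about naming and shapes (it is not the authors' implementation): contexts and their true replies are encoded into matrices H_c and H_r of shape (batch, l) with l = 512, the score is the dot product, and the other replies in the batch supply the random negative pairings.

```python
import numpy as np

def batch_softmax_loss(H_c, H_r):
    """Softmax loss over in-batch negatives, in the style of Henderson et al. (2017).

    H_c, H_r: (batch, l) context and reply encodings, e.g. l = 512.
    Row i of the similarity matrix scores context i against every reply
    in the batch; the diagonal holds the true context-reply pairs.
    """
    sims = H_c @ H_r.T                             # (batch, batch) dot-product scores
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -float(np.diag(log_softmax).mean())     # negative log-likelihood of true pairs
```

Minimising this loss raises the score of each true pair relative to the random pairings in the same batch, which is exactly the trade-off the excerpt describes.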
“…The pretrained model is downloaded from TensorFlow Slim. (Henderson et al., 2019a). We preprocess the dataset to remove uninformative and long comments by retaining only sentences containing more than 8 and less than 128 word tokens.…”
Section: PolyResponse: Conversational Search
Mentioning confidence: 99%
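The length filter quoted in this excerpt reduces to a one-line predicate. A minimal sketch, assuming whitespace tokenisation and a function name of our choosing:

```python
def keep_comment(text: str) -> bool:
    """Retain only comments with more than 8 and less than 128 word tokens,
    mirroring the filtering rule quoted above (whitespace tokenisation is
    our assumption, not necessarily the authors')."""
    n_tokens = len(text.split())
    return 8 < n_tokens < 128
```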
“…We hope that this paper will inform future development of response-based task-oriented dialogue. Training and test datasets, described in more detail by Henderson et al. (2019), are available at: github.com/PolyAI-LDN/conversational-datasets.…”
Section: Input Candidate Responses
Mentioning confidence: 99%
“…Reddit Data. Our pretraining method is based on the large Reddit dataset compiled and made publicly available recently by Henderson et al. (2019). This dataset is suitable for response selection pretraining for multiple reasons, as discussed by Al-Rfou et al. (2016).…”
Section: Step 1: Response Selection Pretraining
Mentioning confidence: 99%