2020
DOI: 10.1609/aaai.v34i05.6258

How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions

Abstract: We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting (MQR) dataset is constructed from human contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points ac…

Cited by 11 publications (7 citation statements)
References 17 publications
“…The Quora dataset contains similar yet differently expressed question pairs from the Quora online Q&A forum. The multi-domain question rewriting (MQR) (Chu et al 2020) dataset consists of ill-formed and well-formed question pairs, for example, "Spaghetti carbonara, mixing" is paired with "How to mix a spaghetti carbonara?". Lastly, we leverage an internal dataset of 300k query logs produced by human agents in the financial QA setting for out-of-sample experiments.…”
Section: Methodology Datasets (mentioning)
confidence: 99%
“…Then the MQR dataset is used to further fine-tune the model. The MQR work (Chu et al 2020) suggests that Quora and MQR create a good combination for improving query qualities. This fits our purpose of transforming noisy queries into more fluent reformulations.…”
Section: T5 Framework (mentioning)
confidence: 99%
“…Faruqui and Das annotate the Paralex dataset (Fader et al, 2013) on the well-formedness of the questions. The majority of research efforts have been aimed at reformulating user queries to elicit the best possible answer from the QA system (Yang et al, 2014; Buck et al, 2017; Chu et al, 2019). A complementary line of work uses hate speech detection techniques (Gupta et al, 2020) to filter questions that incite hate on the basis of race, religion, etc.…”
Section: Related Work (mentioning)
confidence: 99%
“…We construct the document-category pair dataset by pairing question titles or descriptions with their corresponding subareas. Question titles, descriptions and subareas are available from Chu et al (2020). Many Stack Exchange subareas have their own corresponding "meta" sites.…”
Section: Natcat Dataset (mentioning)
confidence: 99%