In this paper, we investigate the novel problem of automatic question identification in the microblog environment. It contains two steps: detecting tweets that contain questions (we call them "interrogative tweets") and extracting the tweets which really seek information or ask for help (so called "qweets") from interrogative tweets. To detect interrogative tweets, both traditional rule-based approach and state-of-the-art learning-based method are employed. To extract qweets, context features like short urls and Tweetspecific features like Retweets are elaborately selected for classification. We conduct an empirical study with sampled one hour's English tweets and report our experimental results for question identification on Twitter.
Abstract-Entity disambiguation with a knowledge base becomes increasingly popular in the NLP community. In this paper, we employ Freebase as the knowledge base, which contains significantly more entities than Wikipedia and others. While huge in size, Freebase lacks context for most entities, such as the descriptive text and hyperlinks in Wikipedia, which are useful for disambiguation. Instead, we leverage two features of Freebase, namely the naturally disambiguated mention phrases (aka aliases) and the rich taxonomy, to perform disambiguation in an iterative manner. Specifically, we explore both generative and discriminative models for each iteration. Experiments on 2, 430, 707 English sentences and 33, 743 Freebase entities show the effectiveness of the two features, where 90% accuracy can be reached without any labeled data. We also show that discriminative models with proposed split training strategy is robust against overfitting problem, and constantly outperforms the generative ones.
Q&A sites continue to flourish as a large number of users rely on them as useful substitutes for incomplete or missing search results. In this paper, we present our experience with developing Confucius, a Google Q&A service launched in 21 countries and four languages by the end of 2009. Confucius employs six data mining subroutines to harness synergy between web search and social networks. We present these subroutines' design goals, algorithms, and their effects on service quality. We also describe techniques for and experience with scaling the subroutines to mine massive data sets.
We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting (MQR) dataset is constructed from human contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points across three aspects. We train sequence-to-sequence neural models on the constructed dataset and obtain an improvement of 13.2% in BLEU-4 over baseline methods built from other data resources. We release the MQR dataset to encourage research on the problem of question rewriting.1
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.