Creating the DISEQuA Corpus: A Test Set for Multilingual Question Answering

Magnini, Bernardo; Romagnoli, Simone; Vallin, Alessandro; Herrera, Jesús; Peñas, Anselmo; Peinado, Víctor; Verdejo, Felisa; Rijke, Maarten de

doi:10.1007/978-3-540-30222-3_47

Cited by 13 publications

(8 citation statements)

References 1 publication


“…Corpora developed for multilingual and crosslingual question-answering (QA), information retrieval (IR), and information extraction (IE) tasks are typically compilations of documents on related subjects written in different languages. Documents in such corpora rarely have counterparts in all the languages presented in the corpus (CLEF, 2000; Magnini et al, 2003).…”

Section: Related Work (mentioning)

confidence: 99%

Directions for exploiting asymmetries in multilingual Wikipedia

Filatova

2009

Proceedings of the Third International Workshop on Cross Lingual Information Access Addressing the Information Need of Multilin


Multilingual Wikipedia has been used extensively for a variety of Natural Language Processing (NLP) tasks. Many Wikipedia entries (people, locations, events, etc.) have descriptions in several languages. These descriptions, however, are not identical. On the contrary, descriptions in different languages created for the same Wikipedia entry can vary greatly in terms of description length and information choice. Keeping these peculiarities in mind is necessary while using multilingual Wikipedia as a corpus for training and testing NLP applications. In this paper we present preliminary results on quantifying Wikipedia multilinguality. Our results support the observation about the substantial variation in descriptions of Wikipedia entries created in different languages. However, we believe that asymmetries in multilingual Wikipedia do not make Wikipedia an undesirable corpus for NLP applications training. On the contrary, we outline research directions that can utilize multilingual Wikipedia asymmetries to bridge the communication gaps in multilingual societies.


“…The data set used in this work consists of the questions provided in the DISEQuA Corpus [10]. Such corpus was made up of simple, mostly short, straightforward and factual queries that sound naturally spontaneous, and arisen from a real desire to know something about a particular event or situation.…”

Section: Data Sets (mentioning)

confidence: 99%

Question Classification in Spanish and Portuguese

Solorio, Pérez-Coutiño, Montes-y-Gómez, et al. 2005

Computational Linguistics and Intelligent Text Processing


“…The data set used in this work consists of the questions provided in the DISEQuA Corpus (Magnini et al, 2003). Such corpus was made up of simple, mostly short, straightforward and factual queries that sound naturally spontaneous, and arisen from a real desire to know something about a particular event or situation.…”

Section: Data Sets (mentioning)

confidence: 99%

A language independent method for question classification

Solorio, Pérez-Coutiño, Montes-y-Gómez, et al. 2004

Proceedings of the 20th International Conference on Computational Linguistics - COLING '04


Previous works on question classification are based on complex natural language processing techniques: named entity extractors, parsers, chunkers, etc. While these approaches have proven to be effective they have the disadvantage of being targeted to a particular language. We present here a simple approach that exploits lexical features and Internet to train a classifier, in particular a Support Vector Machine. The main feature of this method is that it can be applied to different languages without requiring major adaptation changes. Experimental results of this method on English, Italian and Spanish show that this approach can be a practical tool for question answering systems reaching classification accuracy as high as 88.92%.
