Abstract. Esfinge is a general-domain Portuguese question answering system that tries to apply simple techniques to large amounts of text. Esfinge participated last year in the monolingual QA track, but the results were compromised by several basic errors. This year's participation was intended to correct those errors and to take part, for the first time, in the multilingual QA track.
Esfinge overview
The sphinx in Egyptian/Greek mythology was a demon of destruction that sat outside Thebes and asked riddles of all passers-by. She strangled everyone unable to answer [1], but times have changed and now Esfinge has to answer questions herself. Fortunately, CLEF's organization is much more benevolent when analysing the results of the QA task.
Esfinge (http://acdc.linguateca.pt/Esfinge/) is a question answering system developed for Portuguese, based on the architecture proposed by Eric Brill [2]. Brill suggests that it is possible to get state-of-the-art results by applying simple techniques to large quantities of data.
Esfinge starts by converting a question into patterns of plausible answers. These patterns are queried in several text collections (the CLEF text collections and the Web) to obtain snippets of text where the answers are likely to be found.
Then, the system harvests these snippets for word N-grams. The N-grams are later ranked according to their frequency, their length and the patterns used to retrieve the snippets in which they were found (these patterns are scored a priori). Several simple techniques are then used to discard N-grams or to enhance their scores. Finally, the answer is the top-ranked N-gram, or NIL if none of the N-grams passes all the filters.
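The harvesting and ranking step described above can be illustrated with a minimal sketch. It is not Esfinge's actual implementation: the function names, the whitespace tokenization and, in particular, the scoring formula (each occurrence adds the pattern score multiplied by the N-gram length) are assumptions chosen only to combine the three factors the text mentions, namely frequency, length and pattern score.

```python
from collections import defaultdict

def harvest_ngrams(snippets, max_n=3):
    """Score word N-grams found in snippets.

    snippets: list of (text, pattern_score) pairs, where pattern_score is the
    a priori score of the pattern that retrieved the snippet.
    """
    scores = defaultdict(float)
    for text, pattern_score in snippets:
        tokens = text.split()  # assumed: plain whitespace tokenization
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                # Assumed weighting: summing over occurrences captures
                # frequency; multiplying by n favours longer N-grams.
                scores[ngram] += pattern_score * n
    return dict(scores)

def answer(snippets, filters=(), max_n=3):
    """Return the best-scored N-gram that passes every filter, else 'NIL'."""
    scores = harvest_ngrams(snippets, max_n)
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for ngram, _ in ranked:
        if all(keep(ngram) for keep in filters):
            return ngram
    return "NIL"
```

Here each filter is a hypothetical predicate that returns True for N-grams worth keeping, for example one that rejects candidates made up only of words from the question.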
Strategies for CLEF 2005
During last year's participation, several problems compromised the results. The main objectives for this year were to correct these problems and to participate in the multilingual tasks.
This year, in addition to the European Portuguese text collection (Público), the organization also provided a Brazilian Portuguese collection (Folha). This new collection helped Esfinge: one of the problems encountered last year was precisely that the document collection only had texts written in the European variant, while some of the answers discovered by the system were in the Brazilian variant and were therefore difficult to justify [3].
Corpus Workbench [4] was used again to encode the document collections. Each document was divided into sets of three sentences. Last year other text unit sizes were tried (namely 50 contiguous words and one sentence), but the results using three-sentence sets were slightly better. Sentence segmentation and tokenization were done using the Perl module Lingua::PT::PLNbase, developed at Linguateca and freely available at CPAN. For the English documents, the sentence segmentation and tokenization programs used by DISPARA in the COMPARA project [5] were used.
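As an illustration of the three-sentence text units, the following sketch splits a document into sentences and groups them in sets of three. It rests on two assumptions that the text does not settle: the sets are taken to be non-overlapping, and a naive regular-expression splitter stands in for the Lingua::PT::PLNbase segmenter, which handles abbreviations and other cases this sketch does not. The function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def three_sentence_units(text, size=3):
    """Group consecutive sentences into non-overlapping sets of `size`."""
    sentences = split_sentences(text)
    # The last unit may hold fewer than `size` sentences.
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), size)]
```

Each resulting unit would then be encoded as one retrievable text unit in the corpus, so that a query pattern returns a three-sentence snippet rather than a whole document.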
Pre-processing