Evaluating CETEMPúblico, a free resource for Portuguese

Santos, Diana; Rocha, Paulo

doi:10.3115/1073012.1073070

Cited by 32 publications

(17 citation statements)

References 0 publications


“…In the empirical method, a collection of text attributes is selected being agreed upon by its users. This can be done to increase reference efficiency or for other reasons [6]. One way to evaluate the quality of the corpora is to examine the results of their use in the application, for example, we can use them in a translation machine to evaluate bilinguals, and then we can evaluate the results of the translation using the corpus.…”

Section: The Review Of the Related Literature

mentioning

confidence: 99%

The introduction of criteria for assessing an aligned parallel Persian-English corpus at the sentence level

Mashayekhi¹,

Analoui²,

Minaei-Bidgoli³

2019

IJET


Bilingual corpora are collections of texts that exemplify the relationship between two languages for linguistic and translation applications. Examining the effectiveness of corpora is an essential requirement for working with them, since the validity of corpus-based work depends on their quality. Researchers have identified four main attributes of corpora: representativeness, limited size, machine-readable form, and standard reference. To evaluate a corpus, we need to evaluate these four properties. Limited size and machine readability are a given for electronic corpora, because they would otherwise be unusable. Representativeness means including in the corpus a sample of the language variation of the language in question, so that the corpus has linguistic diversity. To evaluate this property, we examine the complexity and diversity of the corpus and compute its degree of compliance with Zipf's law. For the standard-reference property of each sentence pair, we combine several of the following characteristics: alignment, translation, command, punctuation, separation, characterization. Finally, a fuzzy system takes the evaluations of these criteria as fuzzy inputs and applies a fuzzy rule base to obtain a fuzzy result for the quality of the corpus.
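The abstract does not say how compliance with Zipf's law is measured. One common way to approximate it, shown here only as a minimal sketch and not as the authors' method, is to fit a straight line to log frequency against log rank and check how close the slope is to -1. The function name `zipf_slope` and the use of an ordinary least-squares fit are assumptions made for illustration.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Under an idealised Zipf distribution the slope is close to -1.
    This is an illustrative measure, not the one used in the paper.
    """
    # Frequencies sorted from most to least frequent word type.
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct word types")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

A slope far from -1 on a large sample would suggest that the word distribution is unusual, for example because of heavy duplication or a very narrow domain; on small samples the fit is noisy and should be read with caution.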


“…3 It is possible, of course, to evaluate these different parts separately, especially when corpus texts are distributed as raw text, without a corpus encoding system or dedicated interface. In Santos and Rocha (2001) and Santos and Gasperin (2002) some preliminary evaluation of bare corpora is presented, but without special focus on usability. 4 These tools were developed in the context of the AC/DC project (Santos and Sarmento, 2003) and are currently in use to display visits to the Linguateca site.…”

Section: Higher-level Research Questions and The User Class Issue

mentioning

confidence: 99%

The corpus, its users and their needs

Santos

Frankenberg‐Garcia

2007

IJCL

Self Cite


COMPARA is a bidirectional parallel corpus of English and Portuguese, currently with 3 million words. The corpus was launched in 2000 and at present it is possibly the largest edited parallel corpus publicly available on the Web, with roughly 6,000 corpus queries per month. This paper summarizes an analysis of six years of corpus use. We begin by looking at user studies for language resources, especially corpora, and then we provide a snapshot of COMPARA's users and their behaviour based on log analysis. Particular emphasis is given to the language interface preferred by users (Portuguese and English are possible), the choice between the Simple and Complex Search modes, the reasons underlying null-results and behaviour after truncated output. The data has pointed us to cases where COMPARA's Web interface can be improved, and provided insights about our users and the problems they face, although further studies that distinguish between different kinds of users remain necessary.


“…For example in the query "o" "ditador" "cubano" "antes" "da" "revolução" (the Cuban dictator before the revolution), the words o and da are discarded while in the query "o ditador cubano antes da revolução" (phrase pattern) they are not discarded. Last year the 22 most frequent words in the CETEMPúblico corpus [8] were discarded. This year in addition to those, some other words were discarded.…”

Section: Onde ([^\S?]*) ([^?]*)\??/"$2 $1"/20

mentioning

confidence: 99%

20th Century Esfinge (Sphinx) Solving the Riddles at CLEF 2005

Costa

2006

Accessing Multilingual Information Repositories


Abstract. Esfinge is a general domain Portuguese question answering system. It tries to apply simple techniques to large amounts of text. Esfinge participated last year in the monolingual QA track, but the results were compromised by several basic errors. This year, participation was intended to correct the basic errors of last year and work for the first time in the multilingual QA track.

Esfinge overview

The sphinx in Egyptian/Greek mythology was a demon of destruction that sat outside Thebes and asked riddles of all passers-by. She strangled all the people unable to answer [1], but times have changed and now Esfinge has to answer questions herself. Fortunately, CLEF's organization is much more benevolent when analysing the results of the QA task.

Esfinge (http://acdc.linguateca.pt/Esfinge/) is a question answering system developed for Portuguese which is based on the architecture proposed by Eric Brill [2]. Brill suggests that it is possible to get state of the art results by applying simple techniques to large quantities of data.

Esfinge starts by converting a question into patterns of plausible answers. These patterns are queried in several text collections (CLEF text collections and the Web) to obtain snippets of text where the answers are likely to be found.

Then, the system harvests these snippets for word N-grams. The N-grams are later ranked according to their frequency, length and the patterns used to recover the snippets where the N-grams were found (these patterns are scored a priori). Several simple techniques are used to discard or enhance the score of each of the N-grams. Finally, the answer is the top ranked N-gram, or NIL if none of the N-grams passes all the filters.

Strategies for CLEF 2005

During last year's participation, several problems compromised the results. The main objectives for this year were to correct these problems and to participate in the multilingual tasks.

This year, in addition to the European Portuguese text collection (Público), the organization also provided a Brazilian Portuguese collection (Folha). This new collection helped Esfinge, since one of the problems encountered last year was precisely that the document collection only had texts written in the European variant and some of the answers discovered by the system were in the Brazilian variant, and therefore difficult to justify [3]. Corpus Workbench [4] was used again to encode the document collections. Each document was divided into sets of three sentences. Last year other text unit sizes were tried (namely 50 contiguous words and one sentence), but the results using three-sentence sets were slightly better. The sentence segmentation and tokenization was done using the Perl module Lingua::PT::PLNbase, developed at Linguateca and freely available at CPAN. For the English documents, the sentence segmentation and tokenization programs used by DISPARA in the COMPARA project [5] were used.
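The harvesting and ranking steps described in this abstract can be sketched roughly as follows. This is not Esfinge's actual code: the function names, the scoring formula (pattern score multiplied by N-gram length, summed over occurrences) and the single stopword filter are all assumptions chosen to illustrate ranking by frequency, length and a priori pattern score, with NIL as the fallback.

```python
from collections import defaultdict


def harvest_ngrams(snippets, max_n=3):
    """Collect word N-grams from (text, pattern_score) snippets.

    Each occurrence adds pattern_score * n to the N-gram's score, so
    frequent, longer N-grams from well-scored patterns rank higher.
    The formula is a hypothetical stand-in for the real one.
    """
    scores = defaultdict(float)
    for text, pattern_score in snippets:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                ngram = " ".join(words[i:i + n])
                scores[ngram] += pattern_score * n
    return scores


def best_answer(snippets, stopwords, max_n=3):
    """Return the top ranked N-gram, or "NIL" if none passes the filter."""
    scores = harvest_ngrams(snippets, max_n)
    # Illustrative filter: drop N-grams made up only of stopwords.
    candidates = {
        ngram: score
        for ngram, score in scores.items()
        if not all(word in stopwords for word in ngram.split())
    }
    if not candidates:
        return "NIL"
    # Ties are broken alphabetically so the result is deterministic.
    return max(candidates, key=lambda ngram: (candidates[ngram], ngram))
```

With the toy snippets `[("fulgencio batista", 1.0), ("o fulgencio batista", 1.0)]` and the stopword set `{"o"}`, the bigram "fulgencio batista" accumulates the highest score because it occurs in both snippets and is longer than either single word.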
