Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 2001
DOI: 10.3115/1073012.1073070

Evaluating CETEMPúblico, a free resource for Portuguese

Abstract: In this paper we present a thorough evaluation of a corpus resource for Portuguese, CETEMPúblico, a 180 million word newspaper corpus free for R&D in Portuguese processing. We provide information that should be useful to those using the resource, and that should lead to considerable improvement in later versions. In addition, we think that the procedures presented can be of interest for the larger NLP community, since corpus evaluation and description is unfortunately not a common exercise.

Introduction: CETEMPúblico is a large c…

Cited by 32 publications
(17 citation statements)
“…In the empirical method, a collection of text attributes is selected being agreed upon by its users. This can be done to increase reference efficiency or for other reasons [6]. One way to evaluate the quality of the corpora is to examine the results of their use in the application, for example, we can use them in a translation machine to evaluate bilinguals, and then we can evaluate the results of the translation using the corpus.…”
Section: The Review of the Related Literature (mentioning)
confidence: 99%
“…3 It is possible, of course, to evaluate these different parts separately, especially when corpus texts are distributed as raw text, without a corpus encoding system or dedicated interface. In Santos and Rocha (2001) and Santos and Gasperin (2002) some preliminary evaluation of bare corpora is presented, but without special focus on usability. 4 These tools were developed in the context of the AC/DC project (Santos and Sarmento, 2003) and are currently in use to display visits to the Linguateca site.…”
Section: Higher-level Research Questions and the User Class Issue (mentioning)
confidence: 99%
“…For example in the query "o" "ditador" "cubano" "antes" "da" "revolução" (the Cuban dictator before the revolution), the words o and da are discarded while in the query "o ditador cubano antes da revolução" (phrase pattern) they are not discarded. Last year the 22 most frequent words in the CETEMPúblico corpus [8] were discarded. This year in addition to those, some other words were discarded.…”
Section: Onde ([^\S?]*) ([^?]*)\??/"$2 $1"/20 (mentioning)
confidence: 99%
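
The stopword handling described in the last citation statement can be sketched as follows. This is a minimal illustration, not the citing authors' implementation: the function name `filter_query` and the two-word `STOPWORDS` set are assumptions made here (the statement names only "o" and "da" and does not list the 22 most frequent CETEMPúblico words). Single quoted words that are stopwords are discarded, while a quoted multi-word phrase is kept intact.

```python
import re

# Hypothetical stopword set: only "o" and "da" are named in the quoted
# statement; the actual 22-word list is not given there.
STOPWORDS = {"o", "da"}


def filter_query(query, stopwords=STOPWORDS):
    """Return the quoted items of a query, dropping single-word stopwords.

    A quoted item containing more than one word is treated as a phrase
    pattern and is kept unchanged, stopwords included.
    """
    items = re.findall(r'"([^"]*)"', query)
    kept = []
    for item in items:
        if not item.strip():
            continue  # ignore empty quoted items
        is_phrase = len(item.split()) > 1
        if is_phrase or item.lower() not in stopwords:
            kept.append(item)
    return kept


# Word-by-word query: "o" and "da" are discarded.
print(filter_query('"o" "ditador" "cubano" "antes" "da" "revolução"'))
# → ['ditador', 'cubano', 'antes', 'revolução']

# Phrase pattern: nothing is discarded.
print(filter_query('"o ditador cubano antes da revolução"'))
# → ['o ditador cubano antes da revolução']
```

Treating any multi-word quoted item as a phrase is a simplification chosen for this sketch; a real system would follow whatever query syntax it defines.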