2006
DOI: 10.1007/11751984_8

A Golden Resource for Named Entity Recognition in Portuguese

Abstract: This paper presents a collection of texts manually annotated with named entities in context, which was used for HAREM, the first evaluation contest for named entity recognizers for Portuguese. We discuss the options taken and the originality of our approach compared with previous evaluation initiatives in the area. We document the choice of categories, their quantitative weight in the overall collection and how we deal with vagueness and underspecification.

citations
Cited by 15 publications
(11 citation statements)
references
References 12 publications
“…The datasets are crucial for the success of any Machine Learning work, but the NER task for the Portuguese language presents several problems due to the lack of training and testing datasets. The only freely available Portuguese dataset annotated with classes of entities was the one developed for the HAREM events [24]. One other Portuguese dataset is the SIGARRA News Corpus, annotated for named entities, consisting of a set of 905 news manually annotated (https://hdl.handle.net/10216/106094), which was taken from the SIGARRA information system at the University of Porto (https://sigarra.up.pt).…”
Section: Experiments and Results
mentioning
confidence: 99%
“…We studied a neural network approach, in which a Bidirectional-LSTM was chosen for the different approaches implemented, and the used corpora were HAREM golden Collection [24] and SIGARRA News Corpus [25].…”
mentioning
confidence: 99%
“…Using linguistic rules, it aims to recognize relations of identity, inclusion and occurrence between previously identified entities. SeRELeP and its auxiliary package SeRELeP Tools were influenced by HAREM 5 directives [10,11]. The relations it intends to extract are some of the ones proposed by HAREM, and so are the rules for them to be extracted.…”
Section: Serelep
mentioning
confidence: 99%
“…In HAREM we took the classification process a step further by including a morphological task where NEs were assigned their respective gender and number in context. Another feature worth noting is that semantic classification was divided in two conceptual steps (categories and types), in order to more precisely pin down the intended meaning of the NE (see [4] for details). relevant in evaluation contests as are baselines.…”
Section: Introduction
mentioning
confidence: 99%
“…relevant in evaluation contests as are baselines. Therefore, we were extremely careful in maintaining vagueness during manual annotation of the golden collection either by allowing entities to encompass several semantic categories or by employing the ALT tag which permitted alternative delimitations of the NEs (see [4] for details). Another interesting note is that we also allowed participants to choose a set of categories and types that they wanted to be evaluated in (the selective scenario).…”
Section: Introduction
mentioning
confidence: 99%