A Golden Resource for Named Entity Recognition in Portuguese

Santos, Diana; Cardoso, Nuno

doi:10.1007/11751984_8

Cited by 15 publications

(11 citation statements)

References 12 publications


“…The datasets are crucial for the success of any Machine Learning work, but the NER task for the Portuguese language presents several problems due to the lack of training and testing datasets. The only freely available Portuguese dataset annotated with classes of entities was the one developed for the HAREM events [24]. One other Portuguese dataset is the SIGARRA News Corpus, annotated for named entities, consisting of a set of 905 news manually annotated (https://hdl.handle.net/10216/106094), which was taken from the SIGARRA information system at the University of Porto (https://sigarra.up.pt).…”

Section: Experiments and Results

mentioning

confidence: 99%


Named Entity Recognition for Sensitive Data Discovery in Portuguese

et al. 2020


The protection of sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but, in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques, such as rule-based/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a set of specific classes. For the remaining classes of entities, two statistical models were tested, Conditional Random Fields and Random Forest, and, finally, a Bidirectional-LSTM approach was tested. Among the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were HAREM Golden Collection, SIGARRA News Corpus, and DataSense NER Corpus.


Section: Experiments and Results

mentioning

confidence: 99%

“…We studied a neural network approach, in which a Bidirectional-LSTM was chosen for the different approaches implemented, and the used corpora were HAREM golden Collection [24] and SIGARRA News Corpus [25].…”

mentioning

confidence: 99%

Named Entity Recognition for Sensitive Data Discovery in Portuguese

et al. 2020


“…Using linguistic rules, it aims to recognize relations of identity, inclusion and occurrence between previously identified entities. SeRELeP and its auxiliary package SeRELeP Tools were influenced by HAREM 5 directives [10,11]. The relations it intends to extract are some of the ones proposed by HAREM, and so are the rules for them to be extracted.…”

Section: Serelep

mentioning

confidence: 99%

SeRELeP-Olympics

Bruckschen, Vieira, Rigo

2008

Companion Proceedings of the XIV Brazilian Symposium on Multimedia and the Web


“…In HAREM we took the classification process a step further by including a morphological task where NEs were assigned their respective gender and number in context. Another feature worth noting is that semantic classification was divided in two conceptual steps (categories and types), in order to more precisely pin down the intended meaning of the NE (see [4] for details).…”

Section: Introduction

mentioning

confidence: 99%

“…relevant in evaluation contests as are baselines. Therefore, we were extremely careful in maintaining vagueness during manual annotation of the golden collection either by allowing entities to encompass several semantic categories or by employing the ALT tag which permitted alternative delimitations of the NEs (see [4] for details). Another interesting note is that we also allowed participants to choose a set of categories and types that they wanted to be evaluated in (the selective scenario).…”

Section: Introduction

mentioning

confidence: 99%