2017
DOI: 10.48550/arxiv.1708.06025
Preprint

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks

Abstract: Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants. We trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. We evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and senten…
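For readers unfamiliar with the intrinsic test mentioned in the abstract, the sketch below answers a single analogy question by vector offset (the 3CosAdd method). It is a minimal illustration, not code from the paper; gensim and the embedding file name are assumptions.

```python
# Minimal sketch of the vector-offset (3CosAdd) analogy test;
# gensim and the embedding file name are assumptions.
from gensim.models import KeyedVectors

# Load vectors stored in the standard word2vec text format.
vectors = KeyedVectors.load_word2vec_format("pt_embeddings_300d.txt")  # hypothetical path

# "Berlim" is to "Alemanha" as ? is to "França": rank words by cosine
# similarity to v("Berlim") - v("Alemanha") + v("França").
result = vectors.most_similar(positive=["Berlim", "França"],
                              negative=["Alemanha"], topn=1)
print(result)  # ideally [("Paris", <cosine score>)]
```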

Cited by 19 publications (32 citation statements)
References 9 publications
“…Such datasets are usually composed of all the possible combinations of pairs such as Paris : France, Berlin : Germany or Beijing : China. In our evaluation, we use the dataset of Svoboda and Brychcin (2016) for Czech, that of Köper et al. (2015) for German, that of Cardellino (2016) for Spanish, that of Venekoski and Vankka (2017) for Finnish, that of Berardi et al. (2015) for Italian, the European variant of the dataset proposed by Hartmann et al. (2017) for Portuguese, and that of Chen et al. (2015) for Chinese.…”
Section: Evaluation Datasets
confidence: 99%
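Benchmarks like the ones cited above are normally scored in bulk rather than one question at a time. A hedged sketch, assuming the test set is stored in the word2vec "questions-words" format (": section" headers followed by "a b c d" lines) that gensim's built-in scorer expects; both file names are placeholders:

```python
# Sketch of scoring a whole analogy benchmark with gensim;
# both file names are hypothetical placeholders.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pt_embeddings_300d.txt")

# evaluate_word_analogies returns (overall_accuracy, per_section_details).
accuracy, sections = vectors.evaluate_word_analogies("analogies_pt.txt")
print(f"overall accuracy: {accuracy:.3f}")
for section in sections:
    total = len(section["correct"]) + len(section["incorrect"])
    if total:
        print(section["section"], f"{len(section['correct'])}/{total}")
```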
“…To do so, we relied upon pretrained language models. Specifically, since the tweets in our dataset were written in Brazilian Portuguese, we downloaded several pretrained Brazilian Portuguese embedding models [6], including models trained with two variants of the Word2Vec algorithm, a) Continuous Bag-of-Words (CBOW) and b) Skip-Gram with Negative Sampling (SGNS), as well as GloVe [7]. All models used 300 dimensions.…”
Section: Methods
confidence: 99%
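A hedged sketch of how such pretrained 300-dimensional variants are typically loaded side by side; the file names are hypothetical, and the word2vec text format is an assumption (raw GloVe text files lack the header line and may need no_header=True in gensim 4+):

```python
# Sketch: loading several pretrained 300-d Portuguese embedding
# variants; all file names are hypothetical placeholders.
from gensim.models import KeyedVectors

files = {
    "cbow": "cbow_s300.txt",    # Word2Vec CBOW
    "sgns": "skip_s300.txt",    # Word2Vec Skip-Gram with Negative Sampling
    "glove": "glove_s300.txt",  # GloVe
}
models = {name: KeyedVectors.load_word2vec_format(path)
          for name, path in files.items()}
for name, kv in models.items():
    assert kv.vector_size == 300  # all variants share 300 dimensions
    print(name, kv.vector_size, len(kv.index_to_key))
```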
“…The specific words used to calculate purity were derived from the Brazilian Portuguese Moral Foundations Dictionary for Fake News classification [11]. We retained all words from this dictionary that were present in our pretrained Brazilian Portuguese embeddings [6]; words were excluded only if they did not appear in the pretrained vocabulary or in the corresponding tweets.…”
Section: Methods
confidence: 99%
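The filtering step described above amounts to intersecting the dictionary with the embedding vocabulary. A minimal sketch under that reading; the word list is an illustrative stand-in for the actual Moral Foundations Dictionary entries:

```python
# Sketch of the word-filtering step: keep only dictionary words that
# the pretrained embedding vocabulary covers. The word list is an
# illustrative stand-in, not the actual dictionary.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pt_embeddings_300d.txt")  # hypothetical

dictionary_words = ["justiça", "traição", "pureza", "autoridade"]  # illustrative
retained = [w for w in dictionary_words if w in vectors.key_to_index]
dropped = sorted(set(dictionary_words) - set(retained))
print(f"retained {len(retained)} of {len(dictionary_words)}; dropped: {dropped}")
```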
“…A large Portuguese corpus, including both Brazilian and European variants, was gathered and described in Hartmann et al. (2017). It was used for training and evaluating different word embedding models (FastText, GloVe, Wang2Vec and Word2Vec).…”
Section: Related Work
confidence: 99%
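On the training side, gensim can fit two of the model families compared in the paper given a tokenized corpus; the two-sentence corpus below is a toy placeholder, and the hyperparameters are illustrative rather than the paper's:

```python
# Sketch of training two of the compared model families with gensim;
# the corpus and hyperparameters are toy placeholders.
from gensim.models import Word2Vec, FastText

corpus = [
    ["o", "rei", "governa", "o", "reino"],
    ["a", "rainha", "governa", "o", "reino"],
]

w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, sg=1)  # skip-gram
ft = FastText(sentences=corpus, vector_size=300, window=5, min_count=1)

# FastText composes vectors from character n-grams, so it can embed
# words never seen in training.
print(ft.wv["governar"].shape)  # (300,), even though "governar" never occurred
```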
“…However, the great majority of such models have been developed for English corpora. It is only in recent years that the research community has also been focusing on other languages with rich morphology and different syntaxes (Hartmann et al., 2017; Rodrigues et al., 2016; Sun et al., 2016; Svoboda & Beliga, 2017; Svoboda & Brychcin, 2016; Turian, Ratinov & Bengio, 2010). Moreover, large datasets are required to achieve good performance.…”
Section: Introduction
confidence: 99%