2017
DOI: 10.11649/cs.1468
Testing word embeddings for Polish

Abstract: Distributional Semantics postulates the representation of word meaning in the form of numeric vectors derived from the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of models based on lemmas and on word forms, created with the Continuous Bag of Words (CBOW) and skipgram approaches and trained on different Polish corpora. For the purposes of this comparison, the results of two typical tasks solved with the …
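
The CBOW/skipgram comparison described in the abstract can be illustrated with a minimal sketch using gensim. The corpus file name and hyperparameters below are assumptions for illustration, not the paper's actual settings; the same input file could hold either surface forms or lemmas, matching the paper's two model variants.

    # Minimal sketch: training CBOW and skipgram Word2Vec models with gensim.
    # Corpus file and hyperparameters are illustrative, not the paper's settings.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # One tokenized sentence per line; may contain word forms or lemmas.
    sentences = LineSentence("polish_corpus_tokenized.txt")  # hypothetical file

    cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=0)
    skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)

    # Compare nearest neighbours of the same word under the two architectures.
    print(cbow.wv.most_similar("kot", topn=5))
    print(skipgram.wv.most_similar("kot", topn=5))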


Cited by 13 publications (26 citation statements: 2 supporting, 24 mentioning, 0 contrasting). References: 26 publications.
“…Considering also the dynamic nature of news content and changing user interests, we verify this claim and use a Word2Vec model trained on a corpus consisting of Polish Wikipedia and the National Corpus of Polish Language (NKJP) [25] to compare with our models built on a much smaller custom corpus with the same model parameters. As the text preprocessing tools used by [3] are different from ours, we compare models trained on all word forms.…”
Section: Data Description (mentioning)
confidence: 90%
“…The goal is to verify whether using a Word2Vec representation pre-trained on a large external corpus results in better performance than one trained on a much smaller custom text collection. We compare the results for the model published by [25], trained on a large corpus of the Polish language, with a custom representation trained on our corpus with the same model parameters.…”
Section: Methods (mentioning)
confidence: 99%
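
A sketch of the comparison this statement describes, under stated assumptions: the file names, dimensionality, and training parameters are hypothetical, and the pre-trained vectors merely stand in for the published model of [25].

    # Load pre-trained Polish vectors and train a custom model with the
    # same hyperparameters, then plug either lookup into the same pipeline.
    from gensim.models import KeyedVectors, Word2Vec
    from gensim.models.word2vec import LineSentence

    # Pre-trained vectors in word2vec text format (hypothetical path).
    pretrained = KeyedVectors.load_word2vec_format("nkjp_wiki_forms.vec", binary=False)

    # Custom model trained on a much smaller domain corpus, same parameters.
    custom = Word2Vec(
        LineSentence("news_corpus_tokenized.txt"),  # hypothetical file
        vector_size=pretrained.vector_size,
        window=5,
        min_count=5,
        sg=1,
    )

    # A downstream task would compare performance with each representation,
    # keeping everything else in the pipeline fixed.
    word = "wiadomość"
    print(word in pretrained, word in custom.wv)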
“…Such a space can capture words' “meanings” or “styles”, depending on the details of how the embeddings are produced. Following common practice, we use off-the-shelf embeddings made from a large national corpus of literary and official documents [38]. Consequently, each sample Y gets translated into a matrix Z_{30×100} of embeddings of the first thirty title words.…”
Section: E. Title Words Embedding Model (mentioning)
confidence: 99%
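
A minimal sketch of the title-embedding step quoted above, assuming gensim KeyedVectors. The model file name is hypothetical; the 30-word and 100-dimension figures mirror the quoted description.

    import numpy as np
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format("polish_100d.vec")  # hypothetical file

    def title_matrix(title, max_words=30, dim=100):
        # Zero-pad titles shorter than max_words; skip out-of-vocabulary words.
        Z = np.zeros((max_words, dim), dtype=np.float32)
        for i, word in enumerate(title.lower().split()[:max_words]):
            if word in kv:
                Z[i] = kv[word]
        return Z  # one row per title word, shape (30, 100)

    Z = title_matrix("Przykładowy tytuł dokumentu")
    print(Z.shape)  # (30, 100)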