Proceedings of the 12th Workshop on Multiword Expressions 2016
DOI: 10.18653/v1/w16-1806

Accounting ngrams and multi-word terms can improve topic models

Abstract: The paper presents an empirical study of integrating ngrams and multi-word terms into topic models, while maintaining similarities between them and words based on their component structure. First, we adapt the PLSA-SIM algorithm to the more widespread LDA model and ngrams. Then we propose a novel algorithm LDA-ITER that allows the incorporation of the most suitable ngrams into topic models. The experiments of integrating ngrams and multiword terms conducted on five text collections in different languages and d…

Cited by 14 publications (11 citation statements)
References 15 publications
“…As it was found before [15,17], the addition of ngrams without accounting relations between their components considerably worsens the perplexity because of the vocabulary growth (for perplexity the less is the better) and practically does not change other automatic quality measures (Table 2).…”
Section: Use Of Automatic Measures To Assess Combined Models (mentioning)
confidence: 52%
“…At the preprocessing step, documents were processed by morphological analyzers. Also, we extracted noun groups as described in [17]. As baselines, we use the unigram LDA topic model and LDA topic model with added 1000 ngrams with maximal NCvalue [21] extracted from the collection under analysis.…”
Section: Use Of Automatic Measures To Assess Combined Models (mentioning)
confidence: 99%
“…Only bigrams were considered in their study. Nokel and Loukachevitch [14] followed the collocation extraction approach. They modified the parameter estimation method LDA to such that bigrams and unigrams belong to the same topics more often.…”
Section: Related Work (mentioning)
confidence: 99%
“…More detailed knowledge and syntax related information is given in [13][14][15]. In [1] an empirical study of integrating n-grams and multi-word terms into topic models is presented by the authors while maintaining similarities between them and words based on their component structure with the help of LDA-ITER algorithm by which the most suitable n-grams and multiword terms are incorporated. In [4], the author thoroughly examined various types of MWE encountered in Hindi from machine translation viewpoint.…”
Section: Related Work (mentioning)
confidence: 99%