2021
DOI: 10.48550/arxiv.2108.10755
Preprint

More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

Jin Cheevaprawatdomrong, Alexandra Schofield, Attapol T. Rutherford

Abstract: Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries, such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared (χ²) test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The χ², t, and WPE tokenizers are trained on Wikipedia text to look for words …
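The χ² and t-statistic scores the abstract refers to are the standard bigram collocation tests (as described in Manning and Schütze's textbook treatment). The sketch below is a minimal illustration of that idea, not the authors' code: the toy corpus, the threshold, and the merge criterion are assumptions added for demonstration. It scores adjacent word pairs against an independence baseline and flags high-scoring pairs as candidate collocation tokens to feed into LDA.

```python
# Minimal sketch of chi-squared and t-statistic bigram collocation scoring.
# Corpus, threshold, and merge step are illustrative assumptions, not the
# paper's implementation.
from collections import Counter
from math import sqrt

corpus = [
    ["new", "york", "is", "in", "new", "york", "state"],
    ["she", "moved", "to", "new", "york", "last", "year"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
N = sum(bigrams.values())  # total number of bigram positions

def chi_squared(pair):
    """Pearson's chi-squared statistic over the 2x2 bigram contingency table."""
    o11 = bigrams[pair]
    o12 = unigrams[pair[0]] - o11  # w1 followed by something else (approx.)
    o21 = unigrams[pair[1]] - o11  # something else followed by w2 (approx.)
    o22 = N - o11 - o12 - o21
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return N * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

def t_statistic(pair):
    """t-test score: observed bigram probability vs. independence assumption."""
    p_pair = bigrams[pair] / N
    p_indep = (unigrams[pair[0]] / N) * (unigrams[pair[1]] / N)
    return (p_pair - p_indep) / sqrt(p_pair / N) if p_pair else 0.0

# Pairs scoring above the critical value become single collocation tokens
# (e.g. "new_york") before the text is passed to the LDA model.
THRESHOLD = 3.841  # chi-squared critical value at p < 0.05, 1 d.o.f.
collocations = {p for p in bigrams if chi_squared(p) > THRESHOLD}
print(sorted(collocations))
```

On this toy corpus the pair ("new", "york") scores χ² = 12.0, well above the threshold, so it would be merged into one token. WPE, the third tokenizer mentioned, works differently: analogous to byte pair encoding, it iteratively merges the most frequent adjacent word pair rather than applying a statistical test.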

Cited by 0 publications
References 8 publications