Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining 2019
DOI: 10.1145/3289600.3291023

Crosslingual Document Embedding as Reduced-Rank Ridge Regression

Abstract: There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia…
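
As the title suggests, the method learns a linear map, ridge-regularized and rank-constrained, from a document's sparse word-count representation to a shared concept space. The following is a rough NumPy sketch of the generic reduced-rank ridge estimator, not the authors' implementation; the function name, data shapes, penalty `lam`, and the Poisson toy data are all assumptions for illustration:

```python
import numpy as np

def reduced_rank_ridge(X, Y, rank, lam):
    """Sketch: W = argmin ||Y - XW||_F^2 + lam * ||W||_F^2, s.t. rank(W) <= rank.
    Standard closed form: solve the full ridge problem, then project onto the
    top singular directions of the fitted values X @ W_ridge."""
    d = X.shape[1]
    # Full-rank ridge solution
    W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Rank reduction via SVD of the fitted values
    _, _, Vt = np.linalg.svd(X @ W_ridge, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]  # projector onto the top-`rank` subspace
    return W_ridge @ P

# Toy usage: map sparse bag-of-words rows to a shared 300-d target space.
rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(500, 2000)).astype(float)  # documents x vocabulary
Y = rng.normal(size=(500, 300))                        # target concept vectors
W = reduced_rank_ridge(X, Y, rank=50, lam=10.0)
assert np.linalg.matrix_rank(W) <= 50
```

The rank constraint is what turns the regression into an embedding: projecting the ridge solution onto the top singular directions of its fitted values yields a low-dimensional map rather than a full-dimensional one.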

Cited by 11 publications (13 citation statements) · References 34 publications

Citation statements (ordered by relevance):
“…We further note that the relatively good performance of ridge regression and random forest classifiers on the development data without any hyperparameter fine-tuning may possibly be explained by the relative robustness of both methods to sparseness and collinearity effects (which severely affect other types of parametric and non-parametric predictors in our experiment), as previously analyzed in detail by Tomaschek et al (2018) and observed by others in a typological setting (Burdick et al, 2020) and elsewhere (Josifoski et al, 2019). We briefly reevaluate these assumptions on the released test set in Section 4.6.…”
Section: Model Selection Using Development Set (supporting)
confidence: 65%
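
The robustness claim above is easy to see in a toy setting. The sketch below is not from any of the cited papers; the data and parameters are invented. It contrasts ordinary least squares with ridge regression on two nearly collinear predictors: the OLS coefficients blow up with opposite signs, while an untuned ridge penalty keeps them small and stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Two nearly collinear predictors carrying the same signal
X = np.column_stack([x, x + 1e-6 * rng.normal(size=200)])
y = x + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # default-strength penalty, no tuning

print("OLS coefficients:  ", ols.coef_)    # typically huge, opposite-signed
print("Ridge coefficients:", ridge.coef_)  # stable, roughly 0.5 each
```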
“…We presented Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a novel crosslingual topic model for Wikipedia. Our human evaluation showed that the topics learned from 28 languages are as coherent as those learned from English alone, and more coherent than those from text-based LDA on English, a noteworthy finding, given that other crosslingual tasks have suffered by adding languages to the training set (Josifoski et al, 2019). We demonstrated WikiPDA's practical utility in several example applications and highlighted its capability for zero-shot language transfer.…”
Section: Discussion (mentioning)
confidence: 63%
“…Comparing model classes 1 and 3, we find that the dense WikiPDA model for 28 languages performed indistinguishably from the dense model for English only; i.e., adding more languages did not make the topics less coherent. This outcome is noteworthy, since on other crosslingual tasks (e.g., document retrieval), performance on a fixed testing language decreased when adding languages to the training set (Josifoski et al, 2019).…”
Section: Topic Modeling (mentioning)
confidence: 91%