Proceedings of the Web Conference 2021 (2021)
DOI: 10.1145/3442381.3449805

Crosslingual Topic Modeling with WikiPDA

Abstract: We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using m…
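
The bag-of-links idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup table, the concept IDs (placeholders, not real Wikidata QIDs), the article titles, and the function name bag_of_links are all assumptions made for illustration. It shows only why two articles in different languages end up with the same representation once their links are mapped to shared concept IDs; the densification and topic-model steps are not covered.

```python
from collections import Counter

# Hypothetical lookup from (language, linked article title) to a concept ID.
# The IDs are placeholders for illustration, not real Wikidata QIDs.
TITLE_TO_CONCEPT = {
    ("en", "Poblano"): "Q_POBLANO",
    ("es", "Chile poblano"): "Q_POBLANO",
    ("en", "Mexico"): "Q_MEXICO",
    ("es", "México"): "Q_MEXICO",
}


def bag_of_links(lang, linked_titles):
    """Count outgoing links by concept ID, skipping titles with no known mapping."""
    bag = Counter()
    for title in linked_titles:
        concept = TITLE_TO_CONCEPT.get((lang, title))
        if concept is not None:
            bag[concept] += 1
    return bag


# Two toy articles with the same links, one per language.
english = bag_of_links("en", ["Poblano", "Mexico", "Poblano"])
spanish = bag_of_links("es", ["Chile poblano", "México", "Chile poblano"])

# Both bags use the same keys, so they can be compared or modeled jointly.
same_representation = english == spanish
```

Because the keys are concept IDs rather than words, a downstream topic model would never see language-specific vocabulary in this representation.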

Cited by 6 publications (3 citation statements)
References 37 publications
“…When models are trained for Wikipedia-specific tasks, however, there are many language-agnostic features that can potentially be used to boost performance in lower-resourced languages. For example, in training topic models for Wikipedia, Piccardi & West (2021) and Johnson et al (2021a) relied not on the words in an article but the links. Article links can be mapped to language-agnostic Wikidata IDs such that e.g., a link on English Wikipedia to the article for poblano peppers will be represented identically as a link to chiles poblanos[36].…”
Section: Situated Researchers
Citation type: mentioning
confidence: 99%
“…[6] Examples of TM applications in digital social sciences and humanities include finding geographic themes from GPS-associated documents on social media platforms such as Flickr and Twitter,[7] selecting news articles on opposition to Euro currency from Financial Times data,[8] identifying paragraphs on epistemological concerns in English and German novels,[9] tracking research trends in different disciplines,[10] and revealing dominant themes in newspapers,[11] governance literature,[12] and Wikipedia entries.[13] Topic modeling was applied in addition to text mining to enhance access to large digital collections by providing minimal description and enriching metadata, including subject headings.[14] Also, a possibility of using topic modeling to determine the subject headings for books on Project Gutenberg was explored.…”
Section: Topic Modeling and Its Applications
Citation type: mentioning
confidence: 99%
“…Some approaches use the word-aligned corpus where the topic model is achieved by optimizing the semantic distribution of words [22, 23]. The disadvantage is that it is limited by multilingual vocabulary alignment resources [24]. Other studies are focusing on the document alignment corpus, which utilize large aligned corpora effectively and map multilingual documents to corresponding topic distributions through training [25, 26, 27, 28].…”
Section: Related Work
Citation type: mentioning
confidence: 99%