Proceedings of the 25th International Conference Companion on World Wide Web - WWW '16 Companion 2016
DOI: 10.1145/2872518.2891111

Cleansing Wikipedia Categories using Centrality

Abstract: We propose a novel general technique aimed at pruning and cleansing the Wikipedia category hierarchy, with a tunable level of aggregation. Our approach is endogenous, since it does not use any information coming from Wikipedia articles, but it is based solely on the user-generated (noisy) Wikipedia category folksonomy itself. We show how the proposed techniques can help reduce the level of noise in the hierarchy and discuss how alternative centrality measures can differently impact on the result.

Citations: cited by 18 publications (17 citation statements)
References: 15 publications
“…Each language chapter can define own structure and hierarchy of categories. As a result in some language versions that structure is often too fine-grained to be directly analyzed [65], which may make it difficult to determine the number of possible topics to deal with.…”
Section: Main Topic Classifications (mentioning)
confidence: 99%
“…As mentioned before, the category structure is a complex and ever-changing, as it can be edited by any person - users can add or change a category assignment to other category. The resulting category structure is noisy [64], sparse and it contains duplications and oversights [65]. So, we can also face the situation that categories are repeated at different levels of the tree, in which the root can be a different other main categories (one of the 27 considered).…”
mentioning
confidence: 99%
“…Concretely, we extract from DBPedia, for each article, its top-level type in the DBPedia type hierarchy (there are 55 top-level types). For each category, we then construct a type histogram, which summarizes the DBPedia types of the articles contained in the category, and model the homogeneity, or purity, of the category as the Gini coefficient of its type histogram. A low Gini coefficient means that a histogram distributes its probability mass more evenly over the 55 DBPedia types, which indicates an impure, non-ontological category.…”
Section: Cleaning the Category Network (mentioning)
confidence: 99%
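
The purity measure described in the statement above can be illustrated with a minimal Python sketch. It assumes the standard discrete Gini formula over a fixed list of types, since the quoted passage does not give the exact formula used. The function names `gini` and `category_purity` and the four sample type names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts.

    0.0 means the mass is spread perfectly evenly over all bins;
    values near 1.0 mean it is concentrated in very few bins.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def category_purity(article_types, all_types):
    """Build the type histogram of one category and return its Gini coefficient.

    article_types: the top-level type of each article in the category.
    all_types: the full list of top-level types (55 in the quoted text;
    a short hypothetical list is used in the example below).
    Types not listed in all_types are ignored.
    """
    hist = Counter(article_types)
    counts = [hist.get(t, 0) for t in all_types]
    return gini(counts)


# Hypothetical stand-in for the real list of top-level types.
TYPES = ["Person", "Place", "Work", "Species"]

pure = ["Person"] * 8                   # every article has the same type
mixed = ["Person", "Place", "Work", "Species"] * 2  # evenly spread types

pure_score = category_purity(pure, TYPES)
mixed_score = category_purity(mixed, TYPES)
```

With these inputs the single-type category scores higher than the evenly mixed one, which matches the reading in the quoted text that a low Gini coefficient signals an impure category.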
“…and hierarchy of categories. Moreover, in some language versions that structure is often too fine-grained to be directly analyzed [15]. All this may make it difficult to determine the number of possible topics to deal with.…”
Section: Introduction (mentioning)
confidence: 99%
“…As mentioned before, the category structure is a complex and ever-changing, as it can be edited by any person - users can add or change a category assignment to another category. The resulting category structure is noisy [14], sparse and it contains duplications and oversights [15]. So, we can also face the situation that categories are repeated at different levels of the tree, in which the root can be another main category (one of the 27 considered).…”
Section: Introduction (mentioning)
confidence: 99%