Proceedings of the 2013 KDD Cup 2013 Workshop, 2013
DOI: 10.1145/2517288.2517295

Effective string processing and matching for author disambiguation

Abstract: Track 2 in KDD Cup 2013 aims at determining duplicated authors in a data set from Microsoft Academic Search. This type of problem appears in many large-scale applications that compile information from different sources. This paper describes our solution developed at National Taiwan University to win the first prize of the competition. We propose an effective name matching framework and realize two implementations. An important strategy in our approach is to consider Chinese and non-Chinese names separately be…

Cited by 20 publications (18 citation statements)
References: 11 publications
“…It is not surprising, because biographical features are highly effective for name disambiguation. For instance, a set of recent works [6,15] report around 99% accuracy on a data mining challenge dataset prepared by Microsoft research. These works use supervised setup with many biographical features; for instance, one of the above works even predict whether the author is Chinese or not, so that more customized model can be applied for these cases.…”
Section: Related Work (mentioning)
confidence: 99%
“…The task was to identify which authors in a large bibliographic database correspond to the same person. The winning solution [6] used string similarity measures and an ensemble classifier for two concurrent matcher implementations, as well as processed Chinese and non-Chinese names separately. A recent Multilingual Web Person Name Disambiguation shared task consisted of clustering Web search results for a person name query accounting for different real-world persons [30].…”
Section: Related Work (mentioning)
confidence: 99%
“…In general, author disambiguation includes two main steps, measuring the similarity and clustering similar records [7]. The main challenge is the identification of whether two authors in the same or different DLs have the same identity or not.…”
Section: Related Work (mentioning)
confidence: 99%
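
The two steps named in the statement above (measure similarity, then cluster similar records) can be sketched roughly as follows. This is an illustrative sketch only, not the method of the cited paper or of the competition entry: the `similarity` function uses Python's standard `difflib.SequenceMatcher` as a stand-in measure, and `cluster_records`, the union-find grouping, and the `0.85` threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: case-insensitive SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_records(names, threshold=0.85):
    """Group names whose pairwise similarity reaches the (assumed) threshold.

    Clusters are the connected components of the "similar enough" graph,
    tracked with a small union-find structure.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 1: measure similarity for every pair of records.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            # Step 2: merge the two records into one cluster.
            parent[find(j)] = find(i)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())
```

Calling `cluster_records(["John Smith", "john smith", "Maria Garcia"])` on these made-up names groups the first two together and leaves the third alone. Comparing every pair is quadratic in the number of records, so a data set of realistic size would need some form of blocking before this step.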
“…The data are represented as vector space model where the distance between vectors represents the similarity. Such algorithms include the Cosine Similarity (CS) with TF-IDF, Jaccard Similarity, Jaro-Winkler, and Levenshtein algorithms [7,9,12-14].…”
Section: Related Work (mentioning)
confidence: 99%
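
Two of the measures named in the statement above, Levenshtein distance and Jaccard similarity, are small enough to sketch directly. The code below is a minimal illustration under stated assumptions, not the implementation used in any of the cited works: the function names are hypothetical, and the choice to compute Jaccard over lowercase whitespace-separated tokens is an assumption (character n-grams are an equally common choice).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over lowercase tokens (assumed tokenization)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Levenshtein counts character edits and so tolerates typos, while token-level Jaccard ignores word order, which is why the two are often used together for names that may appear as "Given Family" in one source and "Family Given" in another.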