Effective string processing and matching for author disambiguation

Chin, Wei-Sheng; Juan, Yu-Chin; Zhuang, Yun; Wu, Felix; Tung, Hsiao-Yu Fish; Yu, Tong; Wang, Jui-pin; Chang, Chun-Ti; Yang, Chun‐Pai; Chang, Wei‐Cheng; Huang, Kuan-Hao; Kuo, Tsun-Cheng; Lin, Shan-Wei; Lin, Young-San; Lu, Yuchen; Su, Yu-Chuan; Wei, Cheng-Kuang; Yin, Tu-Chun; Li, Chun-Liang; Lin, Ting-Wei; Tsai, Chia-Liang; Lin, Shou-De; Lin, Hung‐Mo

doi:10.1145/2517288.2517295

Cited by 20 publications

(18 citation statements)

References 11 publications

“…It is not surprising, because biographical features are highly effective for name disambiguation. For instance, a set of recent works [6,15] report around 99% accuracy on a data mining challenge dataset prepared by Microsoft research. These works use supervised setup with many biographical features; for instance, one of the above works even predict whether the author is Chinese or not, so that more customized model can be applied for these cases.…”

Section: Related Work (mentioning)

confidence: 99%

Name disambiguation from link data in a collaboration graph using temporal and topological features

Saha

Zhang

Hasan

2015

Soc. Netw. Anal. Min.

In a social community, multiple persons may share the same name, phone number or some other identifying attributes. This, along with other phenomena, such as name abbreviation, name misspelling, and human error leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, database integration, and more importantly, improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each decomposed partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, the attempt of collecting biographical or external data sustains the risk of privacy violation. In this work, we propose a method for solving entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.

“…4 The task was to identify which authors in a large bibliographic database correspond to the same person. The winning solution [6] used string similarity measures and an ensemble classifier for two concurrent matcher implementations, as well as processed Chinese and non-Chinese names separately. A recent Multilingual Web Person Name Disambiguation shared task 5 consisted of clustering Web search results for a person name query accounting for different real-world persons [30].…”

Section: Related Work (mentioning)

confidence: 99%

Personal Names Popularity Estimation and Its Application to Record Linkage

Zhagorina

Braslavski

Gusev

2018

Communications in Computer and Information Science

This study deals with a fairly simply formulated problem: how to estimate the number of people bearing the same full name in a large population. Estimation of name popularity can leverage personal name matching in databases and be of interest for many other domains. A distinctive feature of large collections of names is that they contain a large number of unique items, which is challenging for statistical modeling. We investigate a number of statistical techniques and also propose a simple yet effective method aimed at obtaining more accurate count estimates. In our experiments we use a dataset containing about 20 million name occurrences that correspond to about 13 million real-world persons. We perform a thorough evaluation of the name count estimation methods and a record linkage experiment guided by name popularity estimates. Obtained results suggest that theoretically informed approaches outperform simple heuristics and can be useful in a variety of applications.

“…In general, author disambiguation includes two main steps, measuring the similarity and clustering similar records [7]. The main challenge is the identification of whether two authors in the same or different DLs have the same identity or not.…”

Section: Related Work (mentioning)

confidence: 99%

“…The data are represented as vector space model where the distance between vectors represents the similarity. Such algorithms include the Cosine Similarity (CS) with TF-IDF, Jaccard Similarity, Jaro Winkler, and Levenshtein algorithms [7,9,12-14].…”

Section: Related Work (mentioning)

confidence: 99%

Author Profile Enrichment for Cross-Linking Digital Libraries

Hajra

Radevski

Tochtermann

2015

Research and Advanced Technology for Digital Libraries

This work aims at enriching author profiles with additional information to better support search and retrieval of publications across different digital libraries. To achieve this objective we exploit concepts for cross-linking data to identify correlations between one author and other authors, publications or other related information. We will introduce a profile enrichment approach which adds additional information (e.g. biographic information) from different sources to existing author profiles. Within this context, the linked open data repository DBpedia serves as a valuable source for our profile enrichment approach. Still, one of several challenges in this context is the identification of the same author in different sources. To address this challenge we will exploit VIAF (virtual authority file) for author identification. Technically we apply data mining and clustering techniques to uniquely identify authors.
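
As an illustration of two of the string similarity measures named in the citation statement above (Levenshtein and Jaccard), here is a minimal Python sketch. It is not taken from the cited paper or from any of the citing works; the function names `levenshtein`, `jaccard`, and `name_similarity`, the token-based Jaccard variant, and the equal 0.5/0.5 weighting are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty names are treated as identical
    return len(ta & tb) / len(ta | tb)


def name_similarity(a: str, b: str) -> float:
    """Hypothetical combined score in [0, 1]; the weighting is an assumption."""
    longest = max(len(a), len(b)) or 1
    edit_sim = 1.0 - levenshtein(a.lower(), b.lower()) / longest
    return 0.5 * edit_sim + 0.5 * jaccard(a, b)
```

In a similarity-then-clustering pipeline of the kind the citation statements describe, a score like `name_similarity` would be computed for candidate pairs of author records, and pairs above some chosen threshold would be passed on to the clustering step. Token-based Jaccard ignores word order, so it tolerates "given-name family-name" versus "family-name given-name" variants, while the edit-distance term catches small spelling differences.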
