Efficient Name Disambiguation for Large-Scale Databases

Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an analysis on existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or GMD) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F1 measure can be directly computed using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.

“…• cF1 [7]: Combines the fraction of clusters from R that are also in S and the fraction of clusters from S in R.…”

Section: Existing Measures (mentioning)

confidence: 99%

“…The cluster F1 measure [7,2] counts clusters that exactly match and is defined as the harmonic mean of the cluster precision and cluster recall. The cluster precision is defined as |R∩S|/|R| while the cluster recall is defined as |R∩S|/|S|…”

Section: A2 Cluster-level Comparison (mentioning)

confidence: 99%
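
The cluster F1 definition quoted above can be sketched in a few lines of Python. This is only an illustration, not code from the cited paper: the function name `cluster_f1`, the record labels, and the reading of R as the produced clustering and S as the reference clustering are assumptions made for the example.

```python
def cluster_f1(r, s):
    """Cluster-level precision, recall, and F1 for two clusterings.

    r and s are iterables of clusters; each cluster is an iterable of
    hashable record identifiers (hypothetical labels in this sketch).
    """
    # Frozensets let whole clusters be compared for exact equality.
    r_set = {frozenset(c) for c in r}
    s_set = {frozenset(c) for c in s}

    # |R ∩ S|: clusters that appear identically in both clusterings.
    exact = len(r_set & s_set)

    precision = exact / len(r_set) if r_set else 0.0
    recall = exact / len(s_set) if s_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Harmonic mean of cluster precision and cluster recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one of three produced clusters matches the reference.
result = [["a1", "a2"], ["a3"], ["a4", "a5"]]
gold = [["a1", "a2"], ["a3", "a4", "a5"]]
prec, rec, f1 = cluster_f1(result, gold)
```

In this made-up example only the cluster {a1, a2} matches exactly, so the precision is 1/3 and the recall is 1/2. Because the measure counts exact matches only, a cluster that is nearly right contributes nothing.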

Evaluating entity resolution results

2010

“…They use a mix of techniques. While some use similarity functions [2,7,12,18,21,27,30], others use learning techniques [1,14,16,28,32,35], heuristics [17,19,20,24], classifiers [9,10,34] and clustering methods [11,31].…”

Section: Background and Related Work (mentioning)

confidence: 99%

Detecting referential inconsistencies in electronic CV datasets

Rubim

Braganholo

2017

J Braz Comput Soc

One way to measure the scientific progress of a country is to evaluate the curriculum vitae (CV) of its researchers. In Brazil, this is not different. The Lattes Platform is an information system whose primary objective is to provide a single repository to store the CV of the Brazilian researchers. This system is increasingly acquiring expressiveness as the main source of information regarding the Brazilian community of researchers, students, managers, and other actors in the national system of science, technology, and innovation. However, the integrity of this important tool for gauging the national bibliographic production may be affected by ambiguities or referential inconsistencies in coauthoring citations. A first step towards solving this problem lies in identifying such inconsistencies. For that, we propose a heuristic-based approach that uses similarity search to match papers from coauthors of CV. We then use this technique to analyze over 2000 curricula of researchers from a given institution recovered from the Lattes Platform. The results indicate that 18.98% of the analyzed publications present referential inconsistencies, which is a significant amount for a dataset that is supposed to be correct and trustworthy.

“…For example, even if author A and author B are classified as one person, and author B and author C are also classified as one person, author A and author C may be classified as two different persons. By applying density-based spatial clustering of applications with noise (DBSCAN), a clustering algorithm based on the density reachability of data points, CiteSeerX resolves most of these inconsistent cases (Huang, Ertekin, and Giles 2006). The remaining small portion of ambiguous cases are those located at cluster boundaries.…”

Section: Author Disambiguation (mentioning)

confidence: 99%
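
The A/B/C inconsistency described in the quoted passage is a violation of transitivity among pairwise decisions. The sketch below only detects such violations. It is not DBSCAN and not CiteSeerX's method, and the function name `intransitive_triples`, the record labels, and the `same` set of matched pairs are hypothetical.

```python
from itertools import combinations


def intransitive_triples(records, same):
    """Return triples (x, y, z) where x~y and y~z are matched but x~z is not.

    `same` is a set of frozenset pairs that a pairwise classifier has
    (hypothetically) labeled as referring to the same person.
    """
    def matched(x, y):
        return frozenset((x, y)) in same

    found = []
    for a, b, c in combinations(records, 3):
        # Try each record of the triple as the shared middle element.
        for left, mid, right in ((a, b, c), (b, a, c), (a, c, b)):
            if matched(left, mid) and matched(mid, right) and not matched(left, right):
                found.append((left, mid, right))
    return found


# The situation from the quoted passage: A~B and B~C, but A and C are split.
records = ["A", "B", "C"]
same = {frozenset(("A", "B")), frozenset(("B", "C"))}
violations = intransitive_triples(records, same)
```

Here `violations` holds the single triple (A, B, C), with B as the shared record. Checking every triple costs cubic time in the number of records, so a sketch like this is only practical within small candidate blocks.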