Abstract. Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selecti…
“…• cF1 [7]: Combines the fraction of clusters from R that are also in S and the fraction of clusters from S in R.…”
Section: Existing Measures (mentioning)
confidence: 99%
“…The cluster F1 measure [7,2] counts clusters that exactly match and is defined as the harmonic mean of the cluster precision and cluster recall. The cluster precision is defined as |R∩S| / |R| while the cluster recall is defined as |R∩S| / |S|…”
Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad-hoc fashion without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we conduct an analysis on existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called "generalized merge distance" or GMD) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F1 measure can be directly computed using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.
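The cluster F1 definition quoted above can be illustrated with a small sketch. It assumes R is the ER result and S the reference clustering (the roles that the precision and recall formulas imply), and it reads |R∩S| as the number of clusters that appear identically in both. The function names and the toy records are hypothetical; the pairwise F1 helper uses the standard pair-counting definition and is included only to show how the two measures can score the same result differently.

```python
from itertools import combinations


def cluster_f1(result, reference):
    """Cluster F1: harmonic mean of cluster precision and cluster recall.

    Precision = |R ∩ S| / |R| and recall = |R ∩ S| / |S|, where R ∩ S is
    the set of clusters that match exactly in both clusterings.
    """
    r = {frozenset(c) for c in result}
    s = {frozenset(c) for c in reference}
    matched = len(r & s)
    if matched == 0:
        return 0.0
    precision = matched / len(r)
    recall = matched / len(s)
    return 2 * precision * recall / (precision + recall)


def _pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(result, reference):
    """Pairwise F1 over co-clustered record pairs (standard definition)."""
    rp, sp = _pairs(result), _pairs(reference)
    common = len(rp & sp)
    if common == 0:
        return 0.0
    precision = common / len(rp)
    recall = common / len(sp)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the result splits {c, d, e} into {c} and {d, e}.
R = [{"a", "b"}, {"c"}, {"d", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]

print(cluster_f1(R, S))   # one exact match out of 3 result and 2 reference clusters
print(pairwise_f1(R, S))  # two of the four reference pairs are recovered
```

On this toy input the two measures give different scores for the same result, which is the kind of disagreement the abstract describes.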
“…They use a mix of techniques. While some use similarity functions [2,7,12,18,21,27,30], others use learning techniques [1,14,16,28,32,35], heuristics [17,19,20,24], classifiers [9,10,34] and clustering methods [11,31].…”
One way to measure the scientific progress of a country is to evaluate the curricula vitae (CVs) of its researchers, and Brazil is no exception. The Lattes Platform is an information system whose primary objective is to provide a single repository to store the CVs of Brazilian researchers. This system is gaining prominence as the main source of information regarding the Brazilian community of researchers, students, managers, and other actors in the national system of science, technology, and innovation. However, the integrity of this important tool for gauging the national bibliographic production may be affected by ambiguities or referential inconsistencies in coauthoring citations. A first step towards solving this problem lies in identifying such inconsistencies. For that, we propose a heuristic-based approach that uses similarity search to match papers from coauthors' CVs. We then use this technique to analyze over 2000 curricula of researchers from a given institution recovered from the Lattes Platform. The results indicate that 18.98% of the analyzed publications present referential inconsistencies, which is a significant amount for a dataset that is supposed to be correct and trustworthy.
“…For example, even if author A and author B are classified as one person, and author B and author C are also classified as one person, author A and author C may be classified as two different persons. By applying density-based spatial clustering of applications with noise (DBSCAN), a clustering algorithm based on the density reachability of data points, CiteSeerX resolves most of these inconsistent cases (Huang, Ertekin, and Giles 2006). The remaining small portion of ambiguous cases are those located at cluster boundaries.…”
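The transitivity problem described in this statement can be sketched with a minimal, generic DBSCAN over a precomputed distance matrix. This is not the CiteSeerX implementation or the learned distance from the cited paper: the function name, the `eps` and `min_pts` values, and the distances between the four hypothetical records A, B, C, and D are all made up for illustration.

```python
from collections import deque

NOISE = -1


def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN on a symmetric distance matrix (list of lists).

    A point is a core point if at least min_pts points (itself included)
    lie within eps of it. Clusters grow by density reachability.
    """
    n = len(dist)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


# Hypothetical distances: A~B and B~C are close, A and C are not, D is far.
#        A    B    C    D
dist = [
    [0.0, 0.2, 0.6, 0.9],  # A
    [0.2, 0.0, 0.2, 0.9],  # B
    [0.6, 0.2, 0.0, 0.9],  # C
    [0.9, 0.9, 0.9, 0.0],  # D
]

labels = dbscan(dist, eps=0.3, min_pts=2)
print(labels)  # A, B, and C share a cluster label; D is marked as noise
```

A pairwise classifier with the same 0.3 threshold would link A with B and B with C but not A with C. Here C is density-reachable from A through B, so the three records end up in one cluster.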