Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data
DOI: 10.1145/1559845.1559870
Entity resolution with iterative blocking

Abstract: Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the res…
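The cost contrast the abstract describes can be made concrete with a minimal sketch. The toy records and the blocking key below are illustrative assumptions, not taken from the paper: exhaustive ER compares all n*(n-1)/2 pairs, while blocking only compares records that share a block.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; the field names and values are illustrative assumptions.
records = [
    {"id": 1, "name": "Ann Smith"},
    {"id": 2, "name": "Anne Smith"},
    {"id": 3, "name": "Bob Jones"},
    {"id": 4, "name": "Robert Jones"},
]

def blocking_key(record):
    # One deliberately crude key choice: first letter of the name.
    return record["name"][0].lower()

blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Exhaustive ER compares every pair: n*(n-1)/2 comparisons.
exhaustive_pairs = list(combinations(records, 2))
# Blocking only compares records within the same block.
blocked_pairs = [p for b in blocks.values() for p in combinations(b, 2)]

print(len(exhaustive_pairs), len(blocked_pairs))  # 6 1
```

Blocking cuts six comparisons down to one here, but this crude key also places "Bob Jones" and "Robert Jones" in different blocks, so their comparison is never made — the recall loss that the citing passages below discuss.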

Cited by 173 publications
(116 citation statements)
References 21 publications
“…For example, blocking techniques [14] are commonly used to make ER scalable by dividing the data into (possibly overlapping) blocks and only comparing records within the same block, assuming that records in different blocks are unlikely to match. Since blocking techniques may miss matching records, their results are compared with an "exhaustive" ER solution without blocking, which is considered the gold standard [15]. While a large exhaustive ER result may be very expensive to generate, it need only be generated once, whereas the computation of the distance measure will be performed multiple times for a diverse set of blocking algorithms and parameters.…”
Section: Computing Measures
confidence: 99%
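The comparison against an exhaustive gold standard described above is commonly quantified with two measures: pair completeness (recall of true matching pairs among the blocking candidates) and the reduction ratio (fraction of pairwise comparisons avoided). A minimal sketch, with made-up pair sets rather than data from the paper:

```python
def blocking_measures(candidate_pairs, gold_pairs, n_records):
    """Pair completeness (recall w.r.t. the gold standard) and reduction ratio."""
    all_pairs = n_records * (n_records - 1) // 2
    pair_completeness = len(candidate_pairs & gold_pairs) / len(gold_pairs)
    reduction_ratio = 1 - len(candidate_pairs) / all_pairs
    return pair_completeness, reduction_ratio

# Assumed toy result: an exhaustive run found matches (1,2) and (3,4),
# while the blocking scheme only proposed the candidate pair (1,2).
gold = {(1, 2), (3, 4)}
candidates = {(1, 2)}

pc, rr = blocking_measures(candidates, gold, n_records=4)
print(pc, rr)  # pair completeness 0.5; reduction ratio 1 - 1/6
```

The trade-off the quote points at is exactly this pair: a scheme that generates fewer candidate pairs scores a higher reduction ratio but risks lower pair completeness.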
“…To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore. …gold standard is generated by a group of human experts.…”
Section: Introduction
confidence: 99%
“…To correctly assess the impact of a researcher in a research field, correct attribution of research works is essential, so entity disambiguation has been extensively addressed by researchers in information retrieval and data mining. Note that a related problem considers the task of merging multiple name references into a single entity, where the records belonging to a single person have been erroneously partitioned into multiple name references [2,3,20,27,28]. This task is more popularly known as entity deduplication or record linkage, and it is not the focus of this work.…”
confidence: 99%
“…Implementations may use sorting or hashing on the key. Overlapping methods may result in overlapping blocks of entities; implementations include the (multi-pass) sorted neighborhood approach [33], bi-gram indexing [4], canopy clustering [40] and iterative blocking [59]. These methods can require an entity to be matched against multiple blocks (increased overhead) but may lead to a better recall than disjoint methods.…”
Section: Blocking Methods
confidence: 99%
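The multi-pass sorted neighborhood approach mentioned in the quote can be sketched briefly: each pass sorts the records on a different key and pairs up records inside a sliding window, and the union of the passes yields overlapping candidate sets. The records, keys, and window size below are illustrative assumptions.

```python
def sorted_neighborhood(records, key, window=2):
    """One pass: sort on a key, pair up records within a sliding window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            a, b = left["id"], right["id"]
            pairs.add((min(a, b), max(a, b)))
    return pairs

# Toy records with two sortable attributes (assumed for illustration).
records = [
    {"id": 1, "name": "Ann Smith",    "zip": "94305"},
    {"id": 2, "name": "Anne Smith",   "zip": "94304"},
    {"id": 3, "name": "Bob Jones",    "zip": "10001"},
    {"id": 4, "name": "Robert Jones", "zip": "10001"},
]

# Multi-pass: take the union of candidates from two different sort keys.
by_name = sorted_neighborhood(records, key=lambda r: r["name"])
by_zip = sorted_neighborhood(records, key=lambda r: r["zip"])
candidates = by_name | by_zip
```

Here the zip-code pass contributes the pair (2, 4), which the name pass never generates — illustrating how overlapping passes trade a few extra comparisons (records matched against multiple blocks) for better recall, as the quoted passage notes.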