Efficient entity resolution for large heterogeneous information spaces

Papadakis, George; Ioannou, Ekaterini; Niederée, Claudia; Fankhauser, Péter

doi:10.1145/1935826.1935903

Cited by 74 publications

(103 citation statements)

References 24 publications

Mentioning: 103

“…Therefore, we are not only required to establish the map between 1453505_a_at and EIF2AK3, we also need to generate the transitive map between 1453505_a_at and PERK via EIF2AK3. In other words, the traditional record linkage [19], [20], [21], [22], [23], [24], [25] and mapping techniques largely do not help, and a higher level mapping method is warranted.…”

Section: Information Aggregation Using Id Conversion (mentioning)

confidence: 99%

“…BioFlow also supports two other statements called combine and link to facilitate entity based [19] (as opposed to tuple based) union and join respectively, which can be used to collect IDs from different sources, and for implementing the extend statement despite representation heterogeneities. The syntax for the two statements are presented below.…”

Section: Query Translation (mentioning)

confidence: 99%

Improving Integration Effectiveness of ID Mapping Based Biological Record Linkage

Jamil 2015

IEEE/ACM Trans. Comput. Biol. and Bioinf.

Traditionally, biological objects such as genes, proteins, and pathways are represented by a convenient identifier, or ID, which is then used to cross reference, link and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor quality of data, inadvertently leading to erroneous conclusions. In this paper, we discuss this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc and restriction free ID mapping of arbitrary types using online resources. We also propose two declarative statements for ID conversion and data integration based on ID mapping on-the-fly.

“…These techniques deploy a variety of different methodologies and directions. These includes string similarity metrics [8,9] for computing the matching being the given textual representations of entities, the use of the available inner-relationships between the entities [17,27], clustering [7], and blocking techniques [30,31] for reducing the required execution time. However, as already explained in Sect.…”

Section: Related Work (mentioning)

confidence: 99%

Embracing Uncertainty in Entity Linking

Ioannou, Nejdl, Niederée, et al. 2012

Semantic Search Over the Web

Self Cite

“…Entity resolution (ER), also known as record linkage, entity reconciliation, or merge/purge, is the procedure of identifying a group of entities (records) representing the same realworld entity [1][2][3]. Generally speaking, ER has become the first step of data processing and widely used in many application domain, such as digital libraries, smart city, financial transactions, and social networks.…”

Section: Introduction (mentioning)

confidence: 99%

A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

Zhu, Jiang, et al. 2018

Journal of Sensors

In data integration, entity resolution is an important technique to improve data quality. Existing researches typically assume that the target dataset only contain string-type data and use single similarity metric. For larger high-dimensional dataset, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER-resolving method using a hybrid approach, including type-based multiblocks, varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the searching space for entity pairs by the constraint of redundant attributes and matching likelihood. We develop a reference implementation of our proposed approach and validate its performance using real-life dataset from one Internet of Things project. We evaluate the data processing system using five standard metrics including effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could be feasibly applied in real-world data cleaning for large datasets.
