Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that correspond to the same real-world object. Due to its inherently quadratic complexity, a series of techniques accelerate it so that it scales to voluminous data. In this survey, we review a large number of relevant works under two different but related frameworks: Blocking and Filtering. The former restricts comparisons to entity pairs that are more likely to match, while the latter identifies quickly entity pairs that are likely to satisfy predetermined similarity thresholds. We also elaborate on hybrid approaches that combine different characteristics. For each framework we provide a comprehensive list of the relevant works, discussing them in the greater context. We conclude with the most promising directions for future work in the field.
Entity Resolution (ER) is the task of detecting different entity profiles that describe the same real-world objects. To facilitate its execution, we have developed JedAI, an open-source system that puts together a series of state-of-the-art ER techniques that have been proposed and examined independently, targeting parts of the ER end-to-end pipeline. This is a unique approach, as no other ER tool brings together so many established techniques. Instead, most ER tools merely convey a few techniques, those primarily developed by their creators. In addition to democratizing ER techniques, JedAI goes beyond the other ER tools by offering a series of unique characteristics: (i) It allows for building and benchmarking millions of ER pipelines. (ii) It is the only ER system that applies seamlessly to any combination of structured and/or semi-structured data. (iii) It constitutes the only ER system that runs seamlessly both on stand-alone computers and clusters of computers-through the parallel implementation of all algorithms in Apache Spark. (iv) It supports two different end-to-end workflows for carrying out batch ER (i.e., budget-agnostic), a schema-agnostic one based on blocks, and a schema-based one relying on similarity joins. (v) It adapts both end-to-end workflows to budget-aware (i.e., progressive) ER. We present in detail all features of JedAI, stressing the core characteristics that enhance its usability, and boost its versatility and effectiveness. We also compare it to the state-of-the-art in the field, qualitatively and quantitatively, demonstrating its state-of-the-art performance over a variety of large-scale datasets from different domains.
Congested traffic networks impose great difficulties upon scheduling material transportation in industrial settings. Dispatching decisions become complex when, in addition to conflict-free routes, storage location must also be considered. The present paper integrates the dispatch and conflict-free routing of capacitated vehicles with the storage allocation problem and proposes a heuristic approach for minimising the total completion time of box pickup, delivery and relocation requests in a warehouse environment. The developed formulation models a warehouse setting in which boxes are stored in stacks and considers different forward and reverse speeds for the vehicles, real-world movement restrictions and a LIFO loading policy. The proposed approach is evaluated using two different metaheuristics and the results are compared against existing schedules prepared by an industry partner. Experimental results demonstrate the effectiveness of the proposed method for large-scale instances and highlight how improved storage allocation decisions can reduce service times.
We present JedAI, a new open-source toolkit for endto- end Entity Resolution. JedAI is domain-agnostic in the sense that it does not depend on background expert knowledge, applying seamlessly to data of any domain with minimal human intervention. JedAI is also structure-agnostic, as it can process any type of data, ranging from structured (relational) to semi-structured (RDF) and un-structured (free-text) entity descriptions. JedAI consists of two parts: (i) JedAI-core is a library of numerous state-of-the-art methods that can be mixed and matched to form (thousands of) end-to-end workflows, allowing for easily benchmarking their relative performance. (ii) JedAI-gui is a user-friendly desktop application that facilitates the composition of complex workflows via a wizard-like interface. It is suitable for both lay and power users, offering concrete guidelines and automatic configuration, as well as manual configuration options, visual exploration, and detailed statistics for each method's performance. In this paper, we also delve into the new features of JedAI's latest version (2.1), and demonstrate its performance experimentally.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.