Accurate and efficient entity resolution (ER) has been a problem in data analysis and data mining projects for decades. In our work, we are interested in developing ER methods to handle big data. Good public datasets are restricted in this area and usually small in size. Simulation is one technique for generating datasets for testing. Existing simulation tools have problems of complexity, scalability and limitations of resampling. We address these problems by introducing a better way of simulating testing data for big data ER. Our proposed simulation model is simple, inexpensive and fast. We focus on avoiding the detail-level simulation of records using a simple vector representation. In this paper, we will discuss how to simulate simple vectors that approximate the properties of names (commonly used as identification keys).
Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects requiring integrating and processing massive data collections. It is becoming increasingly important in real-world applications to develop ER solutions that produce prompt responses for entity queries on large-scale databases. Some of these applications demand entity query matching against large-scale reference databases within a short time. We define this as the query matching problem in ER in this work. Indexing or blocking techniques reduce the search space and execution time in the ER process. However, approximate indexing techniques that scale to very large-scale datasets remain open to research. In this paper, we investigate the query matching problem in ER to propose an indexing method suitable for approximate and efficient query matching.We first use spatial mappings to embed records in a multidimensional Euclidean space that preserves the domain-specific similarity. Among the various mapping techniques, we choose multidimensional scaling. Then using a Kd-tree and the nearest neighbour search, the method returns a block of records that includes potential matches for a query. Our method can process queries against a large-scale dataset using only a fraction of the data ๐ฟ (given the dataset size is ๐ ), with a ๐ (๐ฟ 2 ) complexity where ๐ฟ โช ๐ . The experiments conducted on several datasets showed the effectiveness of the proposed method.
The recent rapid growth of the dimension of many datasets means that many approaches to dimension reduction (DR) have gained significant attention. High-performance DR algorithms are required to make data analysis feasible for big and fast data sets. However, many traditional DR techniques are challenged by truly large data sets. In particular multidimensional scaling (MDS) does not scale well. MDS is a popular group of DR techniques because it can perform DR on data where the only input is a dissimilarity function. However, common approaches are at least quadratic in memory and computation and, hence, prohibitive for large-scale data.We propose an out-of-sample embedding (OSE) solution to extend the MDS algorithm for large-scale data utilising the embedding of only a subset of the given data. We present two OSE techniques: the first based on an optimisation approach and the second based on a neural network model. With a minor trade-off in the approximation, the out-of-sample techniques can process large-scale data with reasonable computation and memory requirements. While both methods perform well, the neural network model outperforms the optimisation approach of the OSE solution in terms of efficiency. OSE has the dual benefit that it allows fast DR on streaming datasets as well as static databases.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citationsโcitations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright ยฉ 2024 scite LLC. All rights reserved.
Made with ๐ for researchers
Part of the Research Solutions Family.