Distributed representations of tuples for entity resolution

Ebraheem, Muhammad; Thirumuruganathan, Saravanan; Joty, Shafiq; Ouzzani, Mourad; Tang, Nan

doi:10.14778/3236187.3269461

Cited by 85 publications

(157 citation statements)

References 33 publications

Supporting

Mentioning

154

Contrasting

Unclassified

“…Entity matching methods can broadly be divided into rule-based, crowd-based, and machine learning-based methods [5,6,14]. Since 2018, an increasing number of neural network-based matching methods [13,23,30] have been proposed and have pushed the state-of-the-art performance especially for textual entity matching tasks [1]. We include Deepmatcher [23] into our experiments as an example of one of the initial neural network based matching systems.…”

Section: Related Work (mentioning)

confidence: 99%

Dual-objective fine-tuning of BERT for entity matching

Peeters

Bizer

2021

Proc. VLDB Endow.

An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. 
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
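The dual-objective idea described in the abstract above can be illustrated with a small loss computation. This is only a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the three loss terms are summed with equal weight (the abstract does not state a weighting), and plain lists of logits stand in for the outputs of a BERT model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Multi-class loss: negative log-probability of the true entity identifier."""
    return -math.log(softmax(logits)[target_index])

def binary_cross_entropy(match_logit, is_match):
    """Binary loss for the match / non-match decision."""
    p = 1.0 / (1.0 + math.exp(-match_logit))
    return -math.log(p) if is_match else -math.log(1.0 - p)

def dual_objective_loss(match_logit, is_match,
                        left_id_logits, left_id,
                        right_id_logits, right_id):
    """Sum of the matching loss and one identifier loss per description.

    Equal weighting of the three terms is an assumption of this sketch.
    """
    return (binary_cross_entropy(match_logit, is_match)
            + cross_entropy(left_id_logits, left_id)
            + cross_entropy(right_id_logits, right_id))
```

With uninformative logits, for example `dual_objective_loss(0.0, True, [0.0, 0.0], 0, [0.0, 0.0], 1)`, each of the three terms contributes log 2, so the model is penalised both for an uncertain match decision and for uncertain identifier predictions.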

“…Distributed representation of records (DeepER). This is a recently proposed approach which applies a distributed representation of words (Ebraheem et al., 2018) for constructing a distributed representation of records. For each token (word) within an attribute value its distributed representation is obtained from one of the pre-trained embedding dictionaries.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%
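The token-level lookup described in the statement above can be sketched as follows. The statement only says that each token's vector comes from a pre-trained embedding dictionary; everything else here is an assumption made for illustration: the toy 3-dimensional `EMBEDDINGS` dictionary, the whitespace tokenizer, skipping out-of-vocabulary tokens, averaging token vectors per attribute, and concatenating the attribute vectors into a record vector.

```python
# Hypothetical toy dictionary standing in for a pre-trained embedding file.
EMBEDDINGS = {
    "apple": [1.0, 0.0, 0.0],
    "iphone": [0.0, 1.0, 0.0],
    "black": [0.0, 0.0, 1.0],
}
DIM = 3

def tokenize(value):
    """Lower-case whitespace tokenizer (an assumption of this sketch)."""
    return value.lower().split()

def attribute_vector(value, embeddings=EMBEDDINGS, dim=DIM):
    """Look up each token and average the vectors that were found."""
    vectors = [embeddings[t] for t in tokenize(value) if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def record_vector(record, attributes):
    """Concatenate the per-attribute vectors in a fixed attribute order."""
    out = []
    for name in attributes:
        out.extend(attribute_vector(record.get(name, "")))
    return out
```

For instance, `record_vector({"title": "Apple iPhone", "color": "black"}, ["title", "color"])` yields a 6-dimensional vector: three components for the averaged title tokens followed by three for the color.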

“…In particular, the application of machine learning (ML) offers a promising approach, which can be applied as an alternative to manual rule building (Köpcke et al., 2010). However, the existing ML-based approaches to RL are based on the assumption that the data obtained from different sources is structured and represented by overlapping sets of attributes (Ebraheem et al., 2018; Elfeky et al., 2002; Jurek et al., 2017; Kejriwal and Miranker, 2015; Ngomo and Lyko, 2013; Schneider et al., 2018; Sherif et al., 2017; Wang et al., 2015). This is very restrictive in terms of real world applications, given the increasing number of unstructured data sources such as social media channels, for example.…”

Section: Introduction (mentioning)

confidence: 99%

Deep learning based approach to unstructured record linkage

Jurek-Loughrey

2021

IJWIS

Purpose: In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is a task of identifying and linking records from multiple sources that describe the same real world object (e.g. person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched based on the comparison of their corresponding values. Most of the existing RL techniques assume that records across different data sources are structured and represented by the same scheme (i.e. set of attributes). Given the increasing amount of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data.

Design/methodology/approach: In the previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work originally presented at iiWAS2020 [16] by exploring new architectures of the Siamese Neural Network, which improves the generalisation of the RL model and makes it less sensitive to parameter selection.

Findings: The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results in comparison to the Siamese Multilayer Perceptron model proposed in (Jurek et al., 2020). Better results have been achieved in three out of four data sets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of the parameter selection.

Originality/value: To address the problem of unstructured RL, this paper presents a new deep learning based approach to improve the generalisation of the Siamese Multilayer Perceptron model and make it less sensitive to parameter selection.

“…Then the matching step determines if each pair in the candidate set is a match. To our knowledge, as of November 2018 there have been only two published work on entity matching using deep learning: DeepER [33] and DeepMatcher [72]. We now describe both.…”

Section: Entity Matching (mentioning)

confidence: 99%

“…Finally, given a set of tuples (e.g., the union of the two tables to be matched), we pass each tuple through all L hash tables to obtain a list of blocks. Then the candidate set for matching consists of all tuple pairs that appear together in at least one block (there are pruning strategies to further reduce the candidate set size, see [33]).…”

Section: Entity Matching (mentioning)

confidence: 99%
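The blocking step quoted above (pass each tuple through all L hash tables, then pair up tuples that share a block) can be sketched as follows. The quoted text does not say which hash family is used; this sketch assumes random-hyperplane hashing over tuple vectors, and the function names, the `bits_per_table` parameter, and the fixed seed are all hypothetical. The pruning strategies mentioned in the statement are not shown.

```python
import random
from collections import defaultdict
from itertools import combinations

def make_hash_tables(num_tables, bits_per_table, dim, seed=0):
    """Create L tables, each defined by a list of random hyperplanes."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_table)]
        for _ in range(num_tables)
    ]

def hash_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        sum(v * h for v, h in zip(vector, plane)) >= 0.0
        for plane in hyperplanes
    )

def candidate_pairs(vectors, tables):
    """Return all pairs of tuple ids that share a block in at least one table."""
    candidates = set()
    for hyperplanes in tables:
        blocks = defaultdict(list)
        for tuple_id, vector in vectors.items():
            blocks[hash_key(vector, hyperplanes)].append(tuple_id)
        for members in blocks.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates
```

Here `vectors` maps a tuple id to its vector, for example the union of both tables to be matched. Using more tables raises the chance that similar tuples collide at least once, while more bits per table makes each block smaller.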

Deep Learning for Semantic Matching: A Survey

Han

Govind

Mudgal

et al. 2021

JCC

Semantic matching finds certain types of semantic relationships among schema/data constructs. Examples include entity matching, entity linking, coreference resolution, schema/ontology matching, semantic text similarity, textual entailment, question answering, tagging, etc. Semantic matching has received much attention in the database, AI, KDD, Web, and Semantic Web communities. Recently, many works have also applied deep learning (DL) to semantic matching. In this paper we survey this fast growing topic. We define the semantic matching problem, categorize its variations into a taxonomy, and describe important applications. We describe DL solutions for important variations of semantic matching. Finally, we discuss future R&D directions.
