Nowadays, data integration must often manage noisy data that also contain attribute values written in natural language, such as product descriptions or book reviews. In the data integration process, Entity Linkage is responsible for identifying records that refer to the same real-world object. To reduce the size of the problem, modern Entity Linkage methods partition the initial search space into “blocks” of records that can be considered similar according to some metric, and then compare only the records belonging to the same block, thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, unlike traditional methods, aim at capturing the semantic properties of data by means of recent deep learning frameworks. In a first phase, both methods exploit recent research on tuple and sentence embeddings to transform database records into real-valued vectors; in a second phase, to arrange the tuples into blocks, one of them adopts approximate nearest-neighbour algorithms, while the other uses dimensionality reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus and then apply them directly to new datasets in an unsupervised fashion. This choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performance against five traditional blocking algorithms. The results demonstrate that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.
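As a concrete illustration of the two-phase scheme described above, the sketch below embeds each record with a pre-trained sentence encoder and then groups it with its nearest neighbours. The model name, the neighbourhood size k, and the use of scikit-learn's (exact) k-NN index as a stand-in for a dedicated approximate nearest-neighbour library are illustrative assumptions, not the implementation evaluated in the paper.

```python
# Illustrative sketch of embedding-based blocking (assumptions noted above).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

def build_blocks(records, k=3):
    """records: strings obtained by concatenating each tuple's attribute values."""
    # Phase 1: transform every record into a real-valued vector.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")            # assumed encoder
    vectors = encoder.encode(records, normalize_embeddings=True)

    # Phase 2: retrieve each record's k nearest neighbours; a record together
    # with its neighbours forms a candidate block (an approximate index such as
    # FAISS or Annoy would replace this exact k-NN search at scale).
    index = NearestNeighbors(n_neighbors=k, metric="cosine").fit(vectors)
    _, neighbours = index.kneighbors(vectors)
    return [set(block) for block in neighbours]

blocks = build_blocks(["iPhone 13 128GB black", "Apple iPhone13 128 GB", "Samsung Galaxy S21"])
```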
Decisions based on algorithms and systems generated from data have become essential tools that pervade all aspects of our daily lives; for these advances to be reliable, their results should not only be accurate but should also respect all the facets of data equity [11]. In this context, the concepts of Fairness and Diversity have become relevant topics of discussion within the field of Data Science Ethics and, more generally, in Data Science. Although data equity is desirable, reconciling this property with accurate decision making is a critical trade-off, because applying a repair procedure to restore equity might modify the original data in such a way that the final decision becomes inaccurate w.r.t. the ultimate objective of the analysis. In this work we propose E-FAIR-DB, a novel solution that exploits the notion of Functional Dependency, a type of data constraint, to restore data equity by discovering and mitigating discrimination in datasets. The proposed solution is implemented as a pipeline that first mines functional dependencies to detect and evaluate fairness and diversity in the input dataset, and then, based on these findings and on the objective of the data analysis, mitigates data bias while minimizing the number of modifications. Through the mined dependencies, our tool can identify the attributes of the database that encode discrimination (e.g. gender, ethnicity or religion); then, based on these dependencies, it determines the smallest amount of data that must be added and/or removed to mitigate such bias. We evaluate our proposal both through theoretical considerations and through experiments on two real-world datasets.
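To make the dependency-based bias check more tangible, the following sketch measures the confidence of an approximate functional dependency from a protected attribute to a decision attribute. The column names ("gender", "hired"), the toy data and the 0.6 threshold are hypothetical and only illustrate the kind of signal such a pipeline mines; this is not the E-FAIR-DB algorithm itself.

```python
# Illustrative check of an approximate functional dependency lhs -> rhs:
# the fraction of rows that agree with the majority rhs value of their lhs group.
import pandas as pd

def dependency_confidence(df, lhs, rhs):
    agreeing = df.groupby(lhs)[rhs].agg(lambda s: s.value_counts().iloc[0]).sum()
    return agreeing / len(df)

# Hypothetical toy dataset.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "hired":  ["no", "no", "yes", "yes", "yes", "no"],
})

conf = dependency_confidence(df, ["gender"], "hired")
if conf > 0.6:   # assumed threshold
    print(f"gender -> hired holds with confidence {conf:.2f}: possible bias")
```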
The ever-increasing number of systems based on semantic text analysis is making natural language understanding a fundamental task: embedding-based language models are used for a variety of applications, such as resume parsing or improving web search results. At the same time, despite their popularity and widespread use, concern is rapidly growing because these models display social bias and lack transparency. In particular, they exhibit a large amount of gender bias, favouring the consolidation of social stereotypes. Recently, sentence embeddings have been introduced as a novel and powerful technique to represent entire sentences as vectors. We propose a new metric to estimate gender bias in sentence embeddings, named bias score. Our solution leverages the semantic importance of words and previous research on bias in word embeddings, and it is able to discern between neutral and biased gender information at the sentence level. Experiments on a real-world dataset demonstrate that our novel metric can identify gender-stereotyped sentences. Furthermore, we employ bias score to detect, and then remove or compensate for, the most stereotyped entries in the text corpora used to train sentence encoders, improving their degree of fairness. Finally, we show that models retrained on fairer corpora are less prone to making stereotypical associations than their original counterparts, while preserving accuracy on natural language understanding tasks. Additionally, we compare our experiments with traditional methods for reducing bias in embedding-based language models.
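The sketch below shows one simple way to realise a sentence-level score in the spirit described above: each word's bias is measured as the magnitude of its projection onto a gender direction in a word-embedding space and weighted by an importance score. The he/she direction, the importance weights, the toy vectors and the aggregation are assumptions made for illustration and are not the paper's exact definition of bias score.

```python
# Illustrative sentence-level gender bias score (assumptions noted above).
import numpy as np

def gender_direction(word_vecs):
    # A common simple choice: the normalized difference of "he" and "she" vectors.
    d = word_vecs["he"] - word_vecs["she"]
    return d / np.linalg.norm(d)

def bias_score(sentence, word_vecs, importance, direction):
    words = [w for w in sentence.lower().split() if w in word_vecs]
    if not words:
        return 0.0
    # Word-level bias = |projection on the gender direction|, weighted by importance.
    scores  = [abs(word_vecs[w] @ direction) * importance.get(w, 1.0) for w in words]
    weights = [importance.get(w, 1.0) for w in words]
    return sum(scores) / sum(weights)

# Toy 3-d "embeddings" and importance weights, purely for illustration.
vecs = {"he": np.array([1., 0., 0.]), "she": np.array([-1., 0., 0.]),
        "nurse": np.array([-0.6, 0.4, 0.2]), "is": np.array([0., 0.1, 0.9]),
        "a": np.array([0., 0., 1.])}
imp = {"nurse": 2.0, "is": 0.2, "a": 0.1}
print(bias_score("she is a nurse", vecs, imp, gender_direction(vecs)))
```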