One challenge for modern recommendation systems is the Tyranny of the Majority: generated recommendations are often optimized for mainstream trends, so minority preference groups remain discriminated against. Moreover, most modern recommendation techniques are black-box systems. Given a lack of understanding of the dataset characteristics and insufficient diversity of represented individuals, such approaches inevitably amplify hidden data biases and existing disparities. In this research, we address this problem by proposing a novel approach to detecting and describing potentially discriminated user groups for a given recommendation algorithm. We propose a Bias-Aware Hierarchical Clustering algorithm that clusters users based on latent embeddings constructed by a black-box recommender in order to identify users whose needs are not met by the given recommendation method. Next, a post-hoc explainer model is applied to reveal the most important descriptive features that characterize these user segments. Our method is model-agnostic and does not require any a priori information about existing disparities or sensitive attributes. An experimental evaluation on a synthetic dataset and two real-world datasets from different domains shows that, compared with other clustering methods and arbitrarily selected user groups, our method is capable of identifying underperforming segments for different recommendation algorithms and detects more severe disparities.
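The idea of searching the embedding space for an underperforming user segment can be illustrated with a minimal sketch. It is not the Bias-Aware Hierarchical Clustering algorithm itself, whose details the abstract does not give. It is a simplified stand-in that repeatedly bisects users with a deterministic 2-means and follows the child cluster with the lower mean recommendation quality. The function names, the `min_size` and `min_gap` parameters, and the toy data are all assumptions made for illustration.

```python
from statistics import mean

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, iters=20):
    """Split points into two groups with a deterministically seeded 2-means."""
    c = centroid(points)
    a = max(points, key=lambda p: dist2(p, c))   # farthest from the centroid
    b = max(points, key=lambda p: dist2(p, a))   # farthest from the first seed
    centers = [a, b]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = centroid(members)
    return labels

def worst_segment(embeddings, scores, min_size=2, min_gap=0.05):
    """Descend the bisection tree toward the cluster with the lowest mean score.

    embeddings: per-user latent vectors taken from some recommender (assumed given)
    scores:     per-user recommendation quality, e.g. a ranking metric (assumed given)
    """
    ids = list(range(len(embeddings)))
    while len(ids) >= 2 * min_size:
        labels = two_means([embeddings[i] for i in ids])
        children = [[i for i, l in zip(ids, labels) if l == k] for k in (0, 1)]
        if min(len(ch) for ch in children) < min_size:
            break  # the split would create a segment that is too small
        worst = min(children, key=lambda ch: mean(scores[i] for i in ch))
        gap = mean(scores[i] for i in ids) - mean(scores[i] for i in worst)
        if gap < min_gap:
            break  # no child is clearly worse than its parent
        ids = worst
    return ids

# Toy data: six well-served users near the origin, four poorly served users far away.
embeddings = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8],
              [5, 5], [5, 6], [6, 5], [6, 6]]
scores = [0.8, 0.9, 0.7, 0.8, 0.9, 0.8, 0.2, 0.3, 0.1, 0.2]
segment = worst_segment(embeddings, scores)
print(segment)
```

In this toy setting the returned indices are the four distant users with low scores. A post-hoc explainer could then be trained to separate that segment from the remaining users on their descriptive features.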
The aim of this research is to construct meaningful user profiles that best describe user interests in the context of the media content they browse. We use two distinct state-of-the-art numerical text-representation techniques: LDA topic modeling and Word2Vec word embeddings. We train our models on a collection of news articles in Polish and compare them with a model built on a general language corpus. We compare the performance of these algorithms on two practical tasks. First, we perform a qualitative analysis of the semantic relationships for similar article retrieval, and then we evaluate the predictive performance of distinct feature combinations for user gender classification. We apply the algorithms to a real-world dataset from the Polish news service Onet. Our results show that the choice of text representation depends on the task: Word2Vec is more suitable for text comparison, especially for short texts such as titles. In the gender classification task, the best performance is obtained with a combination of features: topics from the article text and word embeddings from the title.
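Similar-title retrieval with word embeddings can be sketched as follows. The example assumes that a title is represented by the average of its word vectors and that titles are compared by cosine similarity, which is one common choice and not necessarily the procedure used in the study. The two-dimensional vectors and the English titles are made up for illustration; a real setup would load vectors from a trained Word2Vec model.

```python
import math

def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query_title, titles, word_vectors):
    """Rank candidate titles by similarity to the query title."""
    q = average_vector(query_title.lower().split(), word_vectors)
    ranked = []
    for title in titles:
        v = average_vector(title.lower().split(), word_vectors)
        if q is not None and v is not None:
            ranked.append((cosine(q, v), title))
    return [title for _, title in sorted(ranked, reverse=True)]

# Hypothetical toy embeddings: first axis roughly "sport", second roughly "politics".
toy_vectors = {
    "football": [1.0, 0.0],
    "match": [0.9, 0.1],
    "goal": [0.8, 0.2],
    "election": [0.0, 1.0],
    "vote": [0.1, 0.9],
}
ranking = most_similar("football match", ["election vote", "goal match"], toy_vectors)
print(ranking)
```

Averaging discards word order, which tends to matter less for short texts such as titles than for full article bodies.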
The cold-start scenario is a critical problem for recommendation systems, especially in dynamically changing domains such as online news services. In this research, we address the cold-start situation by adapting an unsupervised neural User2Vec method to represent new users and articles in a multidimensional space. Toward this goal, we propose an extension of the Doc2Vec model that can represent users with unknown history by building embeddings of their metadata labels along with item representations. We evaluate the proposed approach with respect to different parameter configurations on three real-world recommendation datasets with different characteristics. Our results show that this approach may be applied as an efficient alternative to the factorization machine-based method when user and item metadata are used, and hence can be applied in the cold-start scenario for both new users and new items. Additionally, because our solution represents the user and item labels in the same vector space, we can analyze the spatial relations among these labels to reveal latent interest features of the audience groups as well as possible data biases and disparities.
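Because metadata labels and items share one vector space, a user with no history can still be placed in that space through their labels. The sketch below assumes the label and item vectors have already been trained, represents a new user as the mean of their label vectors, and ranks items by cosine similarity. The aggregation rule, the label names, the item names, and the vectors are assumptions for illustration and are not taken from the paper.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend_for_new_user(user_labels, label_vectors, item_vectors, top_n=2):
    """Rank items for a user known only through metadata labels."""
    known = [label_vectors[l] for l in user_labels if l in label_vectors]
    if not known:
        return []  # nothing to build a user vector from
    user_vec = mean_vector(known)
    ranked = sorted(item_vectors,
                    key=lambda item: cosine(user_vec, item_vectors[item]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical pre-trained vectors living in one shared 2-D space.
label_vectors = {
    "age:18-24": [0.9, 0.1],
    "device:mobile": [0.7, 0.3],
    "age:55+": [0.1, 0.9],
}
item_vectors = {
    "esports_recap": [1.0, 0.0],
    "pension_reform": [0.0, 1.0],
    "tech_review": [0.8, 0.4],
}
recommended = recommend_for_new_user(["age:18-24", "device:mobile"],
                                     label_vectors, item_vectors)
print(recommended)
```

The same similarity function can be applied between two label vectors, which is one way to inspect which audience labels sit close to each other in the learned space.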