Word embedding models have gained considerable traction in the Natural Language Processing community; however, they suffer from unintended demographic biases. Most approaches to evaluating these biases rely on vector-space metrics such as the Word Embedding Association Test (WEAT). While these approaches offer geometric insight into unintended biases in the embedding vector space, they fail to offer an interpretable account of how the embeddings could cause discrimination in downstream NLP applications. In this work, we present a transparent framework and metric for evaluating discrimination across protected groups with respect to their word embedding bias. Our metric, Relative Negative Sentiment Bias (RNSB), measures fairness in word embeddings via the relative negative sentiment associated with demographic identity terms from various protected groups. We show that our framework and metric enable useful analysis of the bias in word embeddings.
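To make the idea of "relative negative sentiment" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a sentiment classifier (not shown) has already assigned each identity term a probability of negative sentiment, normalises those probabilities into a distribution over terms, and measures how far that distribution is from uniform using KL divergence. The function names, the choice of KL divergence, and the example scores are all assumptions made for illustration.

```python
import math


def relative_negative_sentiment(neg_prob):
    """Normalise per-term negative-sentiment probabilities so they sum to 1.

    neg_prob maps an identity term to a (hypothetical) classifier output in [0, 1].
    """
    total = sum(neg_prob.values())
    if total <= 0:
        raise ValueError("at least one term needs a positive score")
    return {term: p / total for term, p in neg_prob.items()}


def divergence_from_uniform(dist):
    """KL divergence between dist and the uniform distribution over its terms.

    Zero means every term carries the same share of negative sentiment;
    larger values mean the negative sentiment is concentrated on fewer terms.
    """
    uniform = 1.0 / len(dist)
    return sum(p * math.log(p / uniform) for p in dist.values() if p > 0)


# Hypothetical classifier outputs for three placeholder identity terms.
balanced = {"group_a": 0.30, "group_b": 0.30, "group_c": 0.30}
skewed = {"group_a": 0.10, "group_b": 0.10, "group_c": 0.70}

balanced_score = divergence_from_uniform(relative_negative_sentiment(balanced))
skewed_score = divergence_from_uniform(relative_negative_sentiment(skewed))
```

Under these assumptions, `balanced_score` is (numerically) zero and `skewed_score` is strictly larger, which is the behaviour one would want from a fairness score of this kind.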
The paralinguistic information in a speech signal includes clues to the geographical and social background of the speaker. This paper is concerned with recognition of the 14 regional accents of British English. For Accent Identification (AID), acoustic methods exploit differences between the distributions of sounds, while phonotactic approaches exploit the sequences in which these sounds occur. We demonstrate that these methods complement each other well and use their confusion matrices for further analysis. Our relatively simple i-vector and phonotactic fused system, with a recognition accuracy of 84.87%, outperforms the i-vector fused results reported in the literature by 4.7%. Further analysis of the distribution of British English accents has been carried out by analyzing the low-dimensional representation of the i-vector AID feature space. Index terms: Accent identification, I-vector, Phonotactic, British English regional accents. 'short passages' (SPA, SPB and SPC), the 'short sentences' and the 'short phrases'. These are described below: • SPA, SPB and SPC are short paragraphs, of lengths 92, 92 and 107 words, respectively, which together form the accent-diagnostic 'sailor passage' (When a sailor in a
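The abstract above does not say how the acoustic (i-vector) and phonotactic systems are fused. As one common possibility, the sketch below combines per-accent scores from the two systems with a weighted sum and picks the highest-scoring accent. The weighting scheme, the function names, and the example scores are assumptions for illustration only, not the method used in the paper.

```python
def fuse_scores(acoustic, phonotactic, weight=0.5):
    """Linearly combine two per-accent score dictionaries.

    weight is the share given to the acoustic system; the remainder goes
    to the phonotactic system. Both systems must score the same accents.
    """
    if acoustic.keys() != phonotactic.keys():
        raise ValueError("both systems must score the same set of accents")
    return {
        accent: weight * acoustic[accent] + (1.0 - weight) * phonotactic[accent]
        for accent in acoustic
    }


def predict(scores):
    """Return the accent label with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores for one utterance over two placeholder accent labels.
acoustic_scores = {"accent_1": 0.55, "accent_2": 0.45}
phonotactic_scores = {"accent_1": 0.20, "accent_2": 0.80}

fused = fuse_scores(acoustic_scores, phonotactic_scores, weight=0.5)
decision = predict(fused)
```

In this made-up example the acoustic system alone would choose `accent_1`, while the fused scores favour `accent_2`, which illustrates how a second, complementary system can change the decision.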
As a continuation of our efforts towards tackling the problem of spoken Dialect Identification (DID) for Arabic languages, we present the QCRI-MIT Advanced Dialect Identification System (QMDIS). QMDIS is an automatic spoken DID system for Dialectal Arabic (DA). In this paper, we report a comprehensive study of the three main components used in the spoken DID task: phonotactic, lexical and acoustic. We use Support Vector Machines (SVMs), Logistic Regression (LR) and Convolutional Neural Networks (CNNs) as backend classifiers throughout the study. We perform all our experiments on a publicly available dataset and present new state-of-the-art results. QMDIS discriminates between the five most widely used dialects of Arabic: Egyptian, Gulf, Levantine, North African, and Modern Standard Arabic (MSA). We report approximately 73% accuracy for system combination. All the data and the code used in our experiments are publicly available for research.