Objective: Machine learning is used to understand and track influenza-related content on social media. Because these systems are used at scale, they have the potential to adversely impact the people they are built to help. In this study, we explore the biases of different machine learning methods for the specific task of detecting influenza-related content. We compare the performance of each model on tweets written in Standard American English (SAE) vs African American English (AAE). Materials and Methods: Two influenza-related datasets are used to train 3 text classification models (support vector machine, convolutional neural network, bidirectional long short-term memory) with different feature sets. The datasets match real-world scenarios in which there is a large imbalance between SAE and AAE examples. The proportion of AAE examples in each class ranges from 2% to 5% in both datasets. We also evaluate each model's performance using a balanced dataset created via undersampling. Results: We find that all of the tested machine learning methods are biased on both datasets. The difference in false positive rates between SAE and AAE examples ranges from 0.01 to 0.35. The difference in false negative rates ranges from 0.01 to 0.23. We also find that the neural network methods generally produce more unfair results than the linear support vector machine on the chosen datasets. Conclusions: The models that result in the most unfair predictions may vary from dataset to dataset. Practitioners should be aware of the potential harms related to applying machine learning to health-related social media data. At a minimum, we recommend evaluating fairness along with traditional evaluation metrics.
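The group-wise comparison described in this abstract can be illustrated with a minimal sketch. It assumes binary labels (1 = influenza-related) and a per-example dialect tag; the function names, the toy data, and the use of an absolute difference are illustrative assumptions, not the authors' code or their exact fairness definition.

```python
from collections import defaultdict


def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate) for 0/1 labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    # Guard against groups that contain no negatives or no positives.
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def rate_gaps(y_true, y_pred, groups, group_a="SAE", group_b="AAE"):
    """Absolute FPR and FNR differences between two dialect groups."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    fpr_a, fnr_a = error_rates(*by_group[group_a])
    fpr_b, fnr_b = error_rates(*by_group[group_b])
    return abs(fpr_a - fpr_b), abs(fnr_a - fnr_b)


# Hypothetical toy data: four SAE tweets followed by four AAE tweets.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]
groups = ["SAE"] * 4 + ["AAE"] * 4

fpr_gap, fnr_gap = rate_gaps(y_true, y_pred, groups)
print(f"FPR gap: {fpr_gap:.2f}, FNR gap: {fnr_gap:.2f}")
```

In this toy case the classifier is perfect on the SAE examples and makes one false positive and one false negative on the AAE examples, so both gaps are 0.50. Overall accuracy alone would not reveal which group the errors fall on.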
Customer-agent conversations (i.e., call transcripts) are an invaluable source of information for companies, as they directly convey customers' implicit and explicit behaviour. Identifying customer-related events from call transcripts is an important task in customer services. However, call centers produce a vast number of transcripts, which makes manual or semi-manual processing of such raw datasets quite challenging. Furthermore, customer-agent call transcripts tend not to explicitly denote events that might be beneficial to customer services. Although event detection has been highly researched across multiple domains in the literature, event detection from call transcripts, especially implicit life event detection, has not been well examined due to the lack of a proper large-scale dataset. In this research, we propose a novel deep learning approach based on latent topic modeling and deep recurrent neural networks with memory units to automatically detect implicit events from a customer's history of call transcripts. These implicit events are detected prior to the report date of the event, so the transcripts do not contain any explicit topic or feature. We provide a case study on a real-life, large-scale dataset of more than 800K call transcripts from a large financial services company in the U.S. to examine the practical features and challenges of this problem. The evaluation results demonstrate the potential applicability of our implicit life event detection, as it achieves a macro-recall score of 53 (macro-F1 of 47.5) on a highly imbalanced test set in which negative samples make up 95% of the data. Our model outperforms the state-of-the-art text classification benchmarks by 5.6 in macro-F1 and 8.8 in macro-recall on average, and performs better than the ensemble of all single-document and sequential classification benchmarks despite being significantly less complex.
The comparison results show the importance of capturing the mutual information across a sequence of call transcripts, as well as our model's capability of doing so, in detecting implicit life events.
INDEX TERMS Implicit event discovery, call transcripts, deep learning, recurrent neural network, machine learning, natural language processing, text classification, topic modeling, event detection.
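To see why macro-averaged metrics are reported for a test set that is 95% negative, the following minimal sketch computes macro-recall and macro-F1 from scratch. The function name and the toy data are hypothetical, and scores are on a 0 to 1 scale, whereas the abstract appears to report them on a 0 to 100 scale.

```python
def macro_recall_f1(y_true, y_pred):
    """Return (macro-recall, macro-F1): unweighted means of per-class scores."""
    labels = sorted(set(y_true) | set(y_pred))
    recalls, f1s = [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Undefined ratios (zero denominators) are treated as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(labels), sum(f1s) / len(labels)


# Hypothetical test set: 19 negative examples and 1 positive (95% negative).
y_true = [0] * 19 + [1]
# A trivial baseline that always predicts the majority (negative) class.
y_pred = [0] * 20

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall, macro_f1 = macro_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} "
      f"macro-recall={macro_recall:.2f} macro-F1={macro_f1:.3f}")
```

The always-negative baseline reaches 0.95 accuracy but only 0.50 macro-recall, because each class contributes equally to the average regardless of its size.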
Public health surveillance and virus tracking via social media can be a useful digital tool for contact tracing and preventing the spread of a virus. Nowadays, large volumes of COVID-19 tweets can be processed quickly in real time to offer information to researchers. Nonetheless, due to the absence of labeled data for COVID-19, preliminary supervised classifiers or semi-supervised self-labeled methods will not handle non-spherical data with adequate accuracy. Because seasonal influenza and the novel coronavirus share many symptoms, we propose using few-shot learning to fine-tune a semi-supervised model built on unlabeled COVID-19 data and a previously labeled influenza dataset, which can provide insights into COVID-19 that have not yet been investigated. The experimental results show the efficacy of the proposed model, which identifies COVID-19-related discussion in recently collected tweets with an accuracy of 86%.
While pre-trained word embeddings have been shown to improve the performance of downstream tasks, many questions remain regarding their reliability: Do the same pre-trained word embeddings result in the best performance with slight changes to the training data? Do the same pre-trained embeddings perform well with multiple neural network architectures? Do imputation strategies for unknown words impact reliability? In this paper, we introduce two new metrics to understand the downstream reliability of word embeddings. We find that the downstream reliability of word embeddings depends on multiple factors, including the evaluation metric, the handling of out-of-vocabulary words, and whether the embeddings are fine-tuned.