We propose VADEC, a multi-task framework that exploits the correlation between the categorical and dimensional models of emotion representation for better subjectivity analysis. Focusing primarily on the effective detection of emotions from tweets, we jointly train multi-label emotion classification and multi-dimensional emotion regression, thereby utilizing the inter-relatedness between the tasks. Co-training especially helps in improving the performance of the classification task as we outperform the strongest baselines with 3.4%, 11%, and 3.9% gains in Jaccard Accuracy, Macro-F1, and Micro-F1 scores respectively on the AIT dataset [17]. We also achieve state-of-the-art results with 11.3% gains averaged over six different metrics on the SenWave dataset [27]. For the regression task, VADEC, when trained with SenWave, achieves 7.6% and 16.5% gains in Pearson Correlation scores over the current state-of-the-art on the EMOBANK dataset [5] for the Valence (V) and Dominance (D) affect dimensions respectively. We conclude our work with a case study on COVID-19 tweets posted by Indians that further helps in establishing the efficacy of our proposed solution.
CCS CONCEPTS • Information systems → Sentiment analysis.
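The Jaccard Accuracy named above is commonly computed for multi-label classification as the per-example overlap between gold and predicted label sets, averaged over examples. The following is a minimal sketch under that assumption; the function name, the toy labels, and the convention for two empty label sets are illustrative and are not taken from the VADEC paper.

```python
from typing import List, Set


def jaccard_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Mean over examples of |gold ∩ pred| / |gold ∪ pred|."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of examples")
    if not gold:
        raise ValueError("at least one example is required")
    total = 0.0
    for g, p in zip(gold, pred):
        union = g | p
        # Assumed convention: two empty label sets count as a perfect match.
        total += 1.0 if not union else len(g & p) / len(union)
    return total / len(gold)


# Hypothetical emotion labels for three tweets.
gold = [{"joy", "optimism"}, {"anger"}, set()]
pred = [{"joy"}, {"anger", "disgust"}, set()]
score = jaccard_accuracy(gold, pred)
print(f"Jaccard accuracy: {score:.3f}")
```

Unlike exact-match accuracy, this metric gives partial credit when only some of a tweet's emotion labels are predicted correctly.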
Background Adverse drug events (ADEs) are unintended side effects of drugs that cause substantial clinical and economic burdens globally. Not all ADEs are discovered during clinical trials; therefore, postmarketing surveillance, called pharmacovigilance, is routinely conducted to find unknown ADEs. A wealth of information, which facilitates ADE discovery, lies in the growing body of biomedical literature. Knowledge graphs (KGs) encode information from the literature, where the vertices and the edges represent clinical concepts and their relations, respectively. The scale and unstructured form of the literature necessitates the use of natural language processing (NLP) to automatically create such KGs. Previous studies have demonstrated the utility of such literature-derived KGs in ADE prediction. Through unsupervised learning of the representations (features) of clinical concepts from the KG, which are used in machine learning models, state-of-the-art results for ADE prediction were obtained on benchmark data sets. Objective Due to the use of NLP to infer literature-derived KGs, there is noise in the form of false positive (erroneous) and false negative (absent) nodes and edges. Previous representation learning methods do not account for such inaccuracies in the graph. NLP algorithms can quantify the confidence in their inference of extracted concepts and relations from the literature. Our hypothesis, which motivates this work, is that by using such confidence scores during representation learning, the learned embeddings would yield better features for ADE prediction models. Methods We developed methods to use these confidence scores on two well-known representation learning methods—DeepWalk and Translating Embeddings for Modeling Multi-relational Data (TransE)—to develop their weighted versions: Weighted DeepWalk and Weighted TransE. 
These methods were used to learn representations from a large literature-derived KG, the Semantic MEDLINE Database, which contains more than 93 million clinical relations. They were compared with Embedding of Semantic Predications, which, to our knowledge, is the best reported representation learning method using the Semantic MEDLINE Database with state-of-the-art results for ADE prediction. Representations learned from different methods were used (separately) as features of drugs and diseases to build classification models for ADE prediction using benchmark data sets. The methods were compared rigorously over multiple cross-validation settings. Results The weighted versions we designed were able to learn representations that yielded more accurate predictive models than the corresponding unweighted versions of both DeepWalk and TransE, as well as Embedding of Semantic Predications, in our experiments. There were performance improvements of up to 5.75% in the F1-score and 8.4% in the area under the receiver operating characteristic curve value, thus advancing the state of the art in ADE prediction from literature-derived KGs. Conclusions Our classification models can be used to aid pharmacovigilance teams in detecting potentially new ADEs. Our experiments demonstrate the importance of modeling inaccuracies in the inferred KGs for representation learning.
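One plausible way to use confidence scores in a DeepWalk-style method is to bias the random walk so that high-confidence edges are followed more often. The sketch below illustrates only that idea. The abstract does not say how Weighted DeepWalk uses the scores, so the sampling rule, the function name, and the toy graph are all assumptions.

```python
import random
from typing import Dict, List, Tuple

# Adjacency list: node -> list of (neighbour, confidence score in (0, 1]).
Graph = Dict[str, List[Tuple[str, float]]]


def weighted_walk(graph: Graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Random walk whose next step is sampled in proportion to edge confidence."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], [])
        if not neighbours:
            break  # dead end: stop the walk early
        nodes = [node for node, _ in neighbours]
        weights = [weight for _, weight in neighbours]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph; the concepts and scores are invented for illustration.
toy_graph: Graph = {
    "drug_x": [("nausea", 0.9), ("headache", 0.1)],
    "nausea": [("drug_x", 0.9)],
    "headache": [("drug_x", 0.1)],
}
walk = weighted_walk(toy_graph, "drug_x", 5, random.Random(0))
print(walk)
```

In a DeepWalk-style pipeline, many such walks would then be treated as sentences and passed to a skip-gram model to learn the node embeddings.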
Named Entity Recognition (NER) is important in the cybersecurity domain. It helps researchers extract cyber threat information from unstructured text sources. The extracted cyber entities or key expressions can be used to model a cyber-attack described in an open-source text. A large number of general-purpose NER algorithms have been published that work well in text analysis. These algorithms do not perform well when applied to the cybersecurity domain. In the field of cybersecurity, the open-source text available varies greatly in complexity and in the underlying structure of the sentences. General-purpose NER algorithms can misrepresent domain-specific words, such as "malicious" and "javascript". In this paper, we compare recent deep learning-based NER algorithms on a cybersecurity dataset. We created a cybersecurity dataset collected from various sources, including "Microsoft Security Bulletin" and "Adobe Security Updates". Some of the approaches proposed in the literature have not previously been applied to cybersecurity. Others are innovations proposed by us. This comparative study helps us identify the NER algorithms that are robust and can work well on sentences taken from a large number of cybersecurity sources. We tabulate their performance on the test set and identify the best NER algorithm for a cybersecurity corpus. We also discuss the different embedding strategies that aid in the process of NER for the chosen deep learning algorithms.
BACKGROUND Adverse Drug Events (ADEs) are unintended side-effects of drugs that cause substantial clinical and economic burden globally. Not all ADEs are discovered during clinical trials and so post-marketing surveillance, called pharmacovigilance, is routinely conducted to find unknown ADEs. A wealth of information, which facilitates ADE discovery, lies in the enormous and continuously growing body of biomedical literature. Knowledge graphs (KGs) encode information from the literature, where vertices and edges represent clinical concepts and their relations respectively. The scale and unstructured form of the literature necessitate the use of natural language processing (NLP) to automatically create such KGs. Previous studies have demonstrated the utility of such literature-derived KGs in ADE prediction. Through unsupervised learning of representations (features) of clinical concepts from the KG, which are used in machine learning models, state-of-the-art results for ADE prediction were obtained on benchmark datasets. OBJECTIVE In literature-derived KGs there is 'noise' in the form of false positive (erroneous) and false negative (absent) nodes and edges due to limitations of the NLP techniques used to infer the KGs. Previous representation learning methods do not account for such inaccuracies in the graph. NLP algorithms can quantify the confidence in their inference of extracted concepts and relations from the literature. Our hypothesis, which motivates this work, is that by utilizing such confidence scores during representation learning, the learnt embeddings would yield better features for ADE prediction models. METHODS We develop methods to utilize these confidence scores on two well-known representation learning methods – DeepWalk and TransE – to develop their 'weighted' versions – Weighted DeepWalk and Weighted TransE.
These methods are used to learn representations from a large literature-derived KG, SemMedDB, containing more than 93 million clinical relations. They are compared with Embedding of Semantic Predications (ESP), which, to our knowledge, is the best reported representation learning method on SemMedDB with state-of-the-art results for ADE prediction. Representations learnt from different methods are used (separately) as features of drugs and diseases to build classification models for ADE prediction using benchmark datasets. The classification performance of all the methods is compared rigorously over multiple cross-validation settings. RESULTS The 'weighted' versions we design are able to learn representations that yield more accurate predictive models than the corresponding unweighted versions of both DeepWalk and TransE, as well as ESP, in our experiments. Performance improvements are up to 5.75% in F1 score and 8.4% in AUC, thus advancing the state-of-the-art in ADE prediction from literature-derived KGs. Implementations of our new methods and all experiments are available at https://bitbucket.org/cdal/kb_embeddings. CONCLUSIONS Our classification models can be used to aid pharmacovigilance teams in detecting potentially new ADEs. Our experiments demonstrate the importance of modelling inaccuracies in the inferred KGs for representation learning, which may also be useful in other predictive models that utilize literature-derived KGs.
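TransE scores a triple (head, relation, tail) by the distance between head + relation and tail, and trains with a margin ranking loss against corrupted triples. One plausible way to add confidence scores is to scale each triple's loss by its confidence. The sketch below shows only that idea. The abstract does not give the Weighted TransE formulation, so the scaling rule, the function names, and the vectors are assumptions.

```python
import math
from typing import Sequence


def transe_distance(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """L2 distance ||h + r - t||, the standard TransE dissimilarity."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def weighted_margin_loss(pos_dist: float, neg_dist: float,
                         confidence: float, margin: float = 1.0) -> float:
    """Margin ranking loss scaled by the triple's confidence (assumed weighting)."""
    return confidence * max(0.0, margin + pos_dist - neg_dist)


# Hypothetical 2-d embeddings for one extracted triple and one corrupted tail.
head, relation, tail = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
corrupted_tail = [0.0, 1.0]

pos = transe_distance(head, relation, tail)            # true triple
neg = transe_distance(head, relation, corrupted_tail)  # corrupted triple

loss_confident = weighted_margin_loss(pos, neg, confidence=1.0, margin=2.0)
loss_uncertain = weighted_margin_loss(pos, neg, confidence=0.5, margin=2.0)
print(loss_confident, loss_uncertain)
```

With this scaling, a triple extracted with low confidence contributes a proportionally smaller gradient, so likely false-positive edges shape the embeddings less.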