Microblogging sites contain a huge amount of textual data and their classification is an imperative task in many applications, such as information filtering, user profiling, topical analysis, and content tagging. Traditional machine learning approaches mainly use a bag of words or n-gram techniques to generate feature vectors as text representation to train classifiers and perform considerably well for many text information processing tasks. Since short texts, such as tweets, contain a very limited number of words, the traditional machine learning approaches suffer from data sparsity and curse of dimensionality problems due to feature representation using a bag of words or n-grams techniques. Nowadays, the use of feature vectors, such as word embeddings, as an input to neural networks for text classification and clustering has shown a remarkable performance gain. In this paper, we present the different neural network models for multi-label classification of microblogging data. The proposed models are based on convolutional neural network (CNN) architectures, which utilize pre-trained word embeddings from generic and domain-specific textual data sources. The word embeddings are used individually and in various combinations through different channels of CNN to predict class labels. We also present a comparative analysis of the proposed CNN models with traditional machine learning models and one of the existing CNN architectures. The proposed models are evaluated over a real Twitter dataset, and the experimental results establish their efficacy to classify microblogging texts with improved accuracy in comparison with the traditional machine learning approaches and the existing CNN models.INDEX TERMS Social network analysis, machine learning, deep learning, multi-label classification, word embedding, convolution neural network.
A number of techniques such as information extraction, document classification, document clustering and information visualization have been developed to ease extraction and understanding of information embedded within text documents. However, knowledge that is embedded in natural language texts is difficult to extract using simple pattern matching techniques and most of these methods do not help users directly understand key concepts and their semantic relationships in document corpora, which are critical for capturing their conceptual structures. The problem arises due to the fact that most of the information is embedded within unstructured or semi-structured texts that computers can not interpret very easily. In this paper, we have presented a novel Biomedical Knowledge Extraction and Visualization framework, BioKEVis to identify key information components from biomedical text documents. The information components are centered on key concepts. BioKEVis applies linguistic analysis and Latent Semantic Analysis (LSA) to identify key concepts. The information component extraction principle is based on natural language processing techniques and semantic-based analysis. The system is also integrated with a biomedical named entity recognizer, ABNER, to tag genes, proteins and other entity names in the text. We have also presented a method for collating information extracted from multiple sources to generate semantic network. The network provides distinct user perspectives and allows navigation over documents with similar information components and is also used to provide a comprehensive view of the collection. The system stores the extracted information components in a structured repository which is integrated with a query-processing module to handle biomedical queries over text documents. We have also proposed a document ranking mechanism to present retrieved documents in order of their relevance to the user query.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.