Short text classification is an important branch of Natural Language Processing. Although CNN and RNN have achieved satisfactory results in the text classification tasks, they are difficult to apply to the Chinese short text classification because of the data sparsity and the homophonic typos problems of them. To solve the above problems, word-level and Pinyin-level based Chinese short text classification model is constructed. Since homophones have the same Pinyin, the addition of Pinyin-level features can solve the homophonic typos problem. In addition, due to the introduction of more features, the data sparsity problem of short text can be solved. In order to fully extract the deep hidden features of the short text, a deep learning model based on BiLSTM, Attention and CNN is constructed, and the residual network is used to solve the gradient disappearance problem with the increase of network layers. Additionally, considering that the complex deep learning network structure will increase the text classification time, the Text Center is constructed. When there is a new text input, the text classification task can be quickly realized by calculating the Manhattan distance between the embedding vector of it and the vectors stored in the Text Center. The Accuracy, Precision, Recall and F1 of the proposed model on the simplifyweibo_4_moods dataset are 0.9713, 0.9627, 0.9765 and 0.9696 respectively, and those on the online_shopping_10_cats dataset are 0.9533, 0.9416, 0.9608 and 0.9511 respectively, which are better than that of the baseline method. In addition, the classification time of the proposed model on simplifyweibo_4_moods and online_shopping_10_cats is 0.0042 and 0.0033 respectively, which is far lower than that of the baseline method.
Network attacks using Command and Control (C&C) servers have increased significantly. To hide their C&C servers, attackers often use Domain Generation Algorithms (DGA), which automatically generate domain names for C&C servers. Researchers have constructed many unique feature sets and detected DGA domains through machine learning or deep learning models. However, due to the limited features contained in the domain name, the DGA detection results are limited. In order to overcome this problem, the domain name features, the Whois features and the N-gram features are extracted for DGA detection. To obtain the N-gram features, the domain name whitelist and blacklist substring feature sets are constructed. In addition, a deep learning model based on BiLSTM, Attention and CNN is constructed. Additionally, the Domain Center is constructed for fast classification of domain names. Multiple comparative experiment results prove that the proposed model not only gets the best Accuracy, Precision, Recall and F1, but also greatly reduces the detection time.
Short text classification is an important task in Natural Language Processing (NLP). The classification result for Chinese short text is always not ideal due to the sparsity problem of them. Most of the previous classification models for Chinese short text are based on word or character, considering that Chinese radical can also represent the meaning individually, so word, character and radical are all used to build a Chinese short text classification model in this paper, which solves the data sparsity problem of short text. In addition, in the process of segmenting sentences into words, considering that jieba will cause the loss of key information and ngram will generate noise words, both jieba and ngram are used to construct a six-granularity (i.e. word-jieba, word-jieba-radical, word-ngram, word-ngram-radical, character and character-radical) based Chinese short text classification (SGCSTC) model. Additionally, different weights are assigned to the six granularities and are automatically updated in the process of backpropagation using cross-entropy loss due to the different influence of them on the classification results. The classification Accuracy, Precision, Recall and F1 of SGCSTC in THUCNews-S dataset are 93.36%, 94.47%, 94.15% and 94.31% respectively, and that in CNT dataset are 92.67%, 92.38%, 93.15% and 92.76% respectively, and multiple comparative experiment results on THUCNews-S and CNT datasets show that SGCSTC outperforms the state-of-the-art text classification models.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.