Effective DGA-Domain Detection and Classification with TextCNN and Additional Features

Hwang, Chanwoong; Kim, Hyo-Sik; Lee, Hooki; Lee, Tae-Jin

doi:10.3390/electronics9071070

Cited by 11 publications

(4 citation statements)

References 18 publications


“…TextCNN [52] is a text classification model based on CNN proposed by Yoon Kim in 2014. Due to its extraordinary ability in extracting text-related regions and features from image components, [53] this model has been widely applied in many different research fields, such as feature extraction, [54, 55] classification, [56] program detection, [57] and etc. The TextCNN model performs convolutional operations on the word vectors by three convolutional kernels in the convolutional layer to generate feature vectors, performs maximum pooling of the convolved feature vectors in the pooling layer, and finally outputs the features in the fully connected layer.…”

Section: Methods (mentioning)

confidence: 99%
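
The pipeline described in the statement above (convolution with three kernels, max pooling, then a fully connected layer) can be sketched in plain Python. This is a toy illustration only: the function names `conv1d_valid` and `textcnn_forward`, the single filter per kernel size, and the ReLU activation are assumptions made for the example, not details taken from the cited papers.

```python
def conv1d_valid(seq, kernel):
    """Slide a (k x dim) kernel over a (length x dim) sequence.

    Returns one activation per window position (no padding).
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = 0.0
        for offset in range(k):
            row = seq[start + offset]
            total += sum(a * b for a, b in zip(row, kernel[offset]))
        out.append(max(total, 0.0))  # ReLU activation (an assumption)
    return out


def textcnn_forward(seq, kernels, fc_weights, fc_bias):
    """Convolve with each kernel, max-pool over positions, apply a dense layer.

    Assumes the sequence is at least as long as the largest kernel.
    """
    pooled = [max(conv1d_valid(seq, kernel)) for kernel in kernels]
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]


# Toy input: 4 positions, 2-dimensional vectors per position.
seq = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernels = [
    [[1, 1]],                  # kernel spanning 1 position
    [[1, 0], [0, 1]],          # kernel spanning 2 positions
    [[1, 1], [1, 1], [1, 1]],  # kernel spanning 3 positions
]
fc_weights = [[1, 0, 0], [0.5, 0.5, 0.5]]
fc_bias = [0, 1]
output = textcnn_forward(seq, kernels, fc_weights, fc_bias)
```

Max pooling over positions gives one value per kernel regardless of input length, which is why the dense layer can have a fixed input size.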

Chronic disease diagnosis model based on convolutional neural network and ensemble learning method

Zhou, Zhang, Zou et al. 2023

DIGITAL HEALTH


Introduction: Chronic diseases have become one of the main causes of premature death all around the world in recent years. The diagnosis of chronic diseases is time-consuming and costly. Therefore, timely diagnosis and prediction of chronic diseases are very necessary. Methods: In this paper, a new method for chronic disease diagnosis is proposed by combining convolutional neural network (CNN) and ensemble learning. This method utilizes random forest (RF) as the base classifier to improve classification performance and diagnostic accuracy, and then combines AdaBoost to successfully replace the Softmax layer of CNN to generate multiple accurate base classifiers while determining their optimal attributes, achieving high-quality classification and prediction of chronic diseases. Results: To verify the effectiveness of the proposed method, real-world Electronic Medical Records dataset (C-EMRs) was used for experimental analysis. The results show that compared with other traditional machine learning methods such as CNN, K-Nearest Neighbor, and RF, the proposed method can effectively improve the accuracy of diagnosis and reduce the occurrence of missed diagnosis and misdiagnosis. Conclusions: This study will provide effective information for the diagnosis of chronic diseases, assist doctors in making clinical decisions, develop targeted intervention measures, and reduce the probability of misdiagnosis.


“…Hwang et al [27] used 10 context-free features and in addition they extracted 100 features using a TextCNN. The TextCNN takes as input a 70 × 100 matrix for each domain name, constructed by taking 100 characters from the domain name (using truncation for longer domain names and padding for shorter domain names) and one-hot encoding with a dictionary of 70 characters.…”

Section: Context-free Features (mentioning)

confidence: 99%
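
The input encoding described in the statement above (100 character positions, truncation or padding, one-hot encoding over a 70-character dictionary) might look roughly like the sketch below. The statement does not list the 70 characters, so `ALPHABET` is a placeholder of the same size; lowercasing the name, leaving unknown characters as all-zero columns, and zero padding are likewise assumptions made for the example.

```python
import string

# Hypothetical 70-symbol dictionary (26 letters + 10 digits + 32 punctuation
# marks + space and tab). The actual dictionary is not given in the source.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " \t"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 100  # number of character positions, per the quoted description


def one_hot_domain(domain, max_len=MAX_LEN):
    """Return a len(ALPHABET) x max_len matrix (list of rows) for one domain."""
    chars = domain.lower()[:max_len]  # truncate names longer than max_len
    matrix = [[0] * max_len for _ in ALPHABET]
    for pos, ch in enumerate(chars):
        row = CHAR_INDEX.get(ch)
        if row is not None:  # characters outside the dictionary stay all-zero
            matrix[row][pos] = 1
    return matrix  # columns past the end of the name remain zero (padding)


encoded = one_hot_domain("example.com")
```

Each column holds at most a single 1, marking which dictionary character occupies that position in the domain name.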

Detection of DGA-Generated Domain Names with TF-IDF

Vranken, Alizadeh 2022

Electronics


Botnets often apply domain name generation algorithms (DGAs) to evade detection by generating large numbers of pseudo-random domain names of which only few are registered by cybercriminals. In this paper, we address how DGA-generated domain names can be detected by means of machine learning and deep learning. We first present an extensive literature review on recent prior work in which machine learning and deep learning have been applied for detecting DGA-generated domain names. We observe that a common methodology is still missing, and the use of different datasets causes that experimental results can hardly be compared. We next propose the use of TF-IDF to measure frequencies of the most relevant n-grams in domain names, and use these as features in learning algorithms. We perform experiments with various machine-learning and deep-learning models using TF-IDF features, of which a deep MLP model yields the best results. For comparison, we also apply an LSTM model with embedding layer to convert domain names from a sequence of characters into a vector representation. The performance of our LSTM and MLP models is rather similar, achieving 0.994 and 0.995 AUC, and average F1-scores of 0.907 and 0.891 respectively.
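
As a rough illustration of the TF-IDF idea mentioned in the abstract above, the sketch below weights character n-grams of domain names. It uses one common textbook variant (relative term frequency times `log(N / df)`); the exact weighting, n-gram sizes, and feature selection used in the cited paper are not stated here, and the helper names `char_ngrams` and `tfidf` are made up for the example.

```python
import math
from collections import Counter


def char_ngrams(domain, n=2):
    """All overlapping character n-grams of a domain name."""
    return [domain[i:i + n] for i in range(len(domain) - n + 1)]


def tfidf(domains, n=2):
    """Return one {ngram: weight} dict per domain, using tf * log(N / df)."""
    grams = [Counter(char_ngrams(d, n)) for d in domains]
    # Document frequency: in how many domains each n-gram occurs.
    df = Counter(g for counts in grams for g in counts)
    total_docs = len(domains)
    weights = []
    for counts in grams:
        total = sum(counts.values()) or 1  # guard against names shorter than n
        weights.append({
            g: (c / total) * math.log(total_docs / df[g])
            for g, c in counts.items()
        })
    return weights


weights = tfidf(["google.com", "xkqzvw.com"])
```

With this variant, n-grams shared by every domain (such as those from a common suffix) get weight zero, while n-grams specific to a single name get the highest weights.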


“…However, they create a whole detection chain by not only detecting the domains through semantic similarity, but also embedding the domains in case the semantic similarity did not trigger the alarms. On the same line, (Hwang, 2020) presents a method to detect and classify DGA by extracting features and passing them to a CNN-based model that labels the domains as DGA or legit. More deep learning based DGA detection research works can be found, such as (Tuan, 2022) where LSTM based techniques are used, or (Aravamudu, 2022) where various ML classifiers are tested against this task.…”

Section: Related Work (mentioning)

confidence: 99%

Siamese Neural Network and Machine Learning for DGA Classification

Segurola-Gil, Egüés, Zola et al. 2022

eccws


Domain Generation Algorithms (DGA) are systems used to create immediate multiple and varying domain names. Such “artificial” domains can be then used for siting command and control servers which in turn oversee recruiting/infecting devices, and finally turning them into new resources to be exploited. In this sense, identifying DGA domain names can be crucial, to avoid cyberattacks like Phishing, Spam sending, Bitcoin mining, and many other. Usually, domain names generated by DGAs, are comprised by illegible character strings, but new “intelligent” DGAs tend to generate names using combination of words in dictionaries making its detection a challenging task. For this reason, in this work, we propose to address this problem using a combination of Machine Learning algorithms for improving the classification of DGAs domains. In particular, we propose to combine Siamese Neural Networks and traditional supervised Machine Learning algorithms in order to expand the input domain into separable n-dimensional data points and then achieve the domain classification. The proposed approach can be separated into 3 phases. In a first phase, domain names are encoded, by a one-hot encoder and a variation of this, named probabilistic one-hot encoder, which are implemented separately. Then, in the second phase, Long Short-Term Memory and Convolutional Siamese embedders are tested and compared. In particular, the first one is combined with the one-hot, while the Convolution algorithm is applied with the probabilistic one-hot encoded data. In the final step, five Machine Learning algorithms are tested using the two ways embedded data. Both embedder approaches reach very high results in terms of F1-score and Accuracy (about 91%) depending on the implemented classifier. 
The promising results obtained by the application of the proposed method shows that it is possible to perform DGA domain classification uniquely over the domain names, without considering external information such as DNS packets features.
