An efficient incremental learning mechanism for tracking concept drift in spam filtering

Sheu, Jyh-Jian; Chu, Ko-Tsung; Li, Nien-Feng; Lee, Cheng‐Chi

doi:10.1371/journal.pone.0171518

Cited by 20 publications

(14 citation statements)

References 26 publications


“…There are four commonly used techniques for spam classification namely, a) Use of blacklist [14] b) Protocol-based approach c) Use of keywords or content filtering d) Header based [20], [28], [21], [5], [36], [13] In the first case, a list of email the network administrator maintains addresses or domain name databases. The classifier matches new record with blacklisted database and simply rejects some mails and puts them onto the spam folder.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Spam classification helps us to filter the unwanted emails from the email Inbox. There have been various attempts to classify the spam email based on using email header [20], [21], [5], [36], [37], [38], [13], [4], using email body [3], [41], [35], [29], [27], [30], [7], [31], [32], [33], [34] and also using both body and header [18], [23], [21], [15], [42] and statistical features [19], [25]. The email header classification is performed using techniques such as Naïve Bayes (NB), Decision Tree (DT) [40] [43], and Support Vector Machine (SVM) [23], [24], [20], [13], [26] Random Forest (RF) [4], [13].…”

Section: Literature Review (mentioning)

confidence: 99%


Effect of Header-based Features on Accuracy of Classifiers for Spam Email Classification

Kulkarni, Jatinderkumar, Acharya. 2020. IJACSA.


Emails are an integral part of communication in today's world. But spam emails are a hindrance, leading to reduced efficiency, security threats, and wasted bandwidth. Hence, they need to be filtered at the first filtering station, so that employees are spared the drudgery of handling them. Most earlier approaches focus on building content-based filters using the body of an email message. Using selected header features to filter spam is a better strategy, one that only a few researchers have explored. In this context, our research aims to find the minimum number of features required to classify spam and ham emails. A set of experiments was conducted with three datasets and five feature selection techniques, namely Chi-square, Correlation, Relief Feature Selection, Information Gain, and Wrapper. Five classification algorithms (Naïve Bayes, Decision Tree, NBTree, Random Forest, and Support Vector Machine) were used. In most of the approaches, a trade-off exists between improper filtering and the number of features. Hence, arriving at an optimum set of features is a challenge. Our results show that satisfactory filtering requires a minimum of 5 and a maximum of 14 features.


“…Recently, the batch algorithms are being continuously improved. Some of them try to choose different base-classifiers, such as Decision Tree, Fuzzy Rule, K-nearest neighbor and so on [6][7][8]. Some of them try to choose different window error thresholds to improve the classification accuracy [9].…”

Section: Relevant Algorithms (mentioning)

confidence: 99%

A fast learn++.NSE classification algorithm based on weighted moving average

Shen, Zhu, Du, et al. 2018. Filomat.


Current research on incremental classification learning algorithms mainly focuses on learning from data in a stationary environment. Incremental learning in a non-stationary environment (NSE), where the underlying data probability distribution changes over time, has received much less attention, even though many real applications have generated long-term, cumulative big data in NSE. Thus, incremental learning in NSE has gradually attracted extensive attention. Nevertheless, the currently popular incremental classification learning algorithms for NSE, such as SEA and DWM, generally place strict restrictions on the changes. These algorithms can only deal with gradual, non-cyclical drift in which no new categories appear. Therefore, it is highly necessary to develop a novel, efficient incremental classification learning algorithm for gradually accumulating big data in complex NSE. The recently proposed Learn++.NSE algorithm is an important research achievement in this field. However, the vote weight of each base classifier in Learn++.NSE depends on its error rates across all the environments it has experienced. Therefore, the classification learning efficiency of Learn++.NSE can be further improved. This paper presents a novel fast Learn++.NSE algorithm based on a weighted moving average (WMA-Learn++.NSE), which computes the weighted average of error rates using a sliding window to optimize the weight calculation. By using only the recent classification error rates of each base classifier inside the sliding window to calculate the vote weight, WMA-Learn++.NSE accelerates the computation of vote weights and improves the efficiency of classification learning. Verification experiments and performance analyses on both synthetic and real data sets are presented. The experimental results show that WMA-Learn++.NSE achieves higher execution efficiency than Learn++.NSE while reaching an equivalent classification accuracy.
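
The sliding-window weighting described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `WindowedVoteWeight`, the linearly increasing window weights, and the log-odds vote weight are all assumptions chosen for illustration, since the abstract does not give the exact formulas.

```python
from collections import deque
import math


class WindowedVoteWeight:
    """Hypothetical helper: vote weight of one base classifier,
    computed only from the error rates inside a sliding window."""

    def __init__(self, window_size=3):
        # deque with maxlen drops the oldest error rate automatically
        self.errors = deque(maxlen=window_size)

    def record(self, error_rate):
        # Clamp away from 0 and 1 so the logarithm below stays finite
        eps = 1e-6
        self.errors.append(min(max(error_rate, eps), 1.0 - eps))

    def weighted_error(self):
        # Assumed weighting: oldest entry gets weight 1, newest gets weight n
        n = len(self.errors)
        if n == 0:
            raise ValueError("no error rates recorded yet")
        weights = range(1, n + 1)
        total = sum(w * e for w, e in zip(weights, self.errors))
        return total / sum(weights)

    def vote_weight(self):
        # Assumed log-odds form: lower recent error gives a larger vote
        e = self.weighted_error()
        return math.log((1.0 - e) / e)


tracker = WindowedVoteWeight(window_size=3)
for err in (0.5, 0.4, 0.1, 0.1):  # the first value falls out of the window
    tracker.record(err)
print(tracker.weighted_error(), tracker.vote_weight())
```

Because the window has a fixed size, each update touches at most `window_size` error rates regardless of how many environments the classifier has seen, which is the efficiency argument the abstract makes. Under this assumed log-odds form, a classifier whose windowed error exceeds 0.5 receives a negative weight; a real ensemble would need a policy for that case.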


“…Denkowski et al (2014) describe a framework for building adaptive MT systems that learn from post-editor feedback, and 3) incremental learning for spam filtering, e.g. Sheu et al (2017) use a window-based technique to estimate for the condition of concept drift for each incoming new email.…”

Section: Introduction (mentioning)

confidence: 99%

Demonstrating Par4Sem - A Semantic Writing Aid with Adaptive Paraphrasing

Yimam, Biemann. 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.


In this paper, we present PAR4SEM, a semantic writing aid tool based on adaptive paraphrasing. Unlike many annotation tools that are primarily used to collect training examples, PAR4SEM is integrated into a real-world application, in this case a writing aid tool, in order to collect training examples from usage data. PAR4SEM is a tool that supports an adaptive, iterative, and interactive process in which the underlying machine learning models are updated at each iteration using new training examples from usage data. After motivating the use of ever-learning tools in NLP applications, we evaluate PAR4SEM by adapting it to a text simplification task through mere usage.
