Technological advances have made individuals and organizations more dependent on e-mail to communicate and share information. The growing use of e-mail has led to a growing volume of unsolicited commercial messages, known as spam. Spam classification systems that adapt themselves over time, with no human intervention, are rare. Adaptation matters because spam changes over time as senders adopt different message-masking techniques. Classification models that can handle large volumes of data are also essential. Evolving intelligent systems adapt both their parameters and their structure according to the data stream. This study applies the evolving methods TEDA (Typicality and Eccentricity based Data Analytics) and FBeM (Fuzzy Set-Based Evolving Modeling) to online unsupervised classification of spam. TEDA and FBeM are compared in terms of accuracy, model compactness, and processing time. For dimensionality reduction, a non-parametric feature selection method based on Spearman correlation is employed. The dataset contains 25,745 samples, comprising 7,830 spam messages and 17,915 legitimate e-mails. Each sample is described by 711 features extracted from an e-mail server.
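As an illustration of the dimensionality-reduction step, the following is a minimal sketch of Spearman-correlation-based feature ranking. The abstract does not state the selection criterion actually used, so the choices here are assumptions: features are scored by the absolute Spearman correlation with a binary label and the `top_k` highest-scoring ones are kept. The function names `average_ranks`, `spearman`, and `select_features` are hypothetical and do not come from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        # A constant column carries no rank information (assumed score: 0).
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_features(samples, labels, top_k):
    """Return indices of the top_k features by |Spearman correlation| with labels.

    Assumption: ranking against the label is one plausible criterion,
    not necessarily the one used in the study.
    """
    n_features = len(samples[0])
    scores = [
        abs(spearman([row[j] for row in samples], labels))
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k]
```

Because the score depends only on ranks, it makes no assumption about the distribution of the features, which is what makes the approach non-parametric.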