Classification of Textual E‐Mail Spam Using Data Mining Techniques
Journal
Applied Computational Intelligence and Soft Computing
ISSN
1687-9732
Date Issued
2011-01
Author(s)
Saadat A. Nazirova
Institute of Information and Technology
Editor(s)
Sebastian Ventura
Abstract
A new method is proposed for clustering the spam messages collected in the databases of an antispam system. A genetic algorithm is developed to solve the clustering problem. The objective function maximizes the similarity between messages within clusters, where similarity is defined by the k-nearest neighbor algorithm. Applying a genetic algorithm to constrained problems raises the problem of constantly maintaining feasible chromosomes, which slows convergence. Therefore, to accelerate convergence of the genetic algorithm, a penalty function is used that prevents infeasible chromosomes from occurring when the fitness function values are ranked. After classification, knowledge extraction is applied to obtain information about the classes. A multidocument summarization method is used to build an information portrait of each cluster of spam messages. By classifying and parametrizing spam templates, it will also be possible to determine how message themes depend on geographic origin (e.g., which subjects prevail in spam messages sent from certain countries). Thus, the proposed system will be able to reveal purposeful information attacks if they occur. By analyzing the origins of the spam messages in the collection, it is possible to identify and uncover organized social networks of spammers.
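The penalty idea in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract gives no formulas, so several things here are assumptions made for illustration. The similarity matrix `sim` is taken as precomputed (the paper derives similarity from k-nearest neighbors, which is not reproduced here), an "infeasible" chromosome is assumed to be one that leaves a cluster empty, and the names `penalized_fitness` and `evolve`, the penalty weight, and the selection, crossover, and mutation choices are all hypothetical.

```python
import random


def penalized_fitness(assign, sim, n_clusters, penalty=10.0):
    """Within-cluster similarity minus a penalty per empty cluster.

    assign[i] is the cluster label of message i. Treating an empty
    cluster as the infeasibility condition is an assumption.
    """
    n = len(assign)
    within = sum(
        sim[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if assign[i] == assign[j]
    )
    empty = n_clusters - len(set(assign))
    return within - penalty * empty


def evolve(sim, n_clusters, pop_size=30, generations=60,
           mutation_rate=0.1, seed=0):
    """Toy genetic algorithm; expects at least two messages."""
    rng = random.Random(seed)
    n = len(sim)

    def score(chromosome):
        return penalized_fitness(chromosome, sim, n_clusters)

    # Each chromosome is a list of cluster labels, one per message.
    pop = [[rng.randrange(n_clusters) for _ in range(n)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by penalized fitness; infeasible chromosomes sink to the bottom.
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: occasionally reassign a message to a random cluster.
            child = [rng.randrange(n_clusters)
                     if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)


# Hypothetical toy data: messages 0 and 1 are similar, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
best = evolve(sim, n_clusters=2)
```

On this toy matrix, placing all four messages in one cluster gives the largest raw within-cluster similarity, so the penalty term is what steers the ranking toward assignments that actually use both clusters. Folding the constraint into the fitness value, rather than repairing or discarding chromosomes, is the design choice this sketch is meant to show.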
