Spam filtering: bibliography
In the machine learning and pattern recognition literature, the spam filtering problem was originally formulated as a text categorization problem, since spam e-mails were mainly textual messages that used no tricks to obfuscate their text.
Spam filtering techniques based on text categorization algorithms were proposed and investigated by several authors (Sahami et al., 1998; Drucker et al., 1999; Graham, 2004). These techniques were named "Bayesian filters", since they were based on or derived from the widely used Naive Bayes text classifier. Variants of these techniques were also proposed, based for instance on on-line learning methods (to take into account the variability of the characteristics of spam e-mails due to new spam campaigns or to new spammers' tricks), or targeted against specific kinds of spammers' tricks (see for instance the work by Jorgensen et al., 2008), or based on adaptive statistical data compression models (Bratko et al., 2006).
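To make the idea of a "Bayesian filter" concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is not the implementation of any of the cited works: the whitespace tokenizer, the function names, and the four-message toy corpus are assumptions made purely for illustration, and the sketch assumes that both classes appear in the training data.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for this illustration only.
TOY_CORPUS = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

def tokenize(text):
    # Lowercase and split on whitespace; real filters use richer tokenizers.
    return text.lower().split()

def train(messages):
    """Count tokens and messages per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def spam_log_odds(text, model):
    """Return log P(spam | text) - log P(ham | text) under Naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Class prior (assumes each class has at least one training message).
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore tokens never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return scores["spam"] - scores["ham"]

def is_spam(text, model):
    # Classify as spam when the posterior log-odds favour the spam class.
    return spam_log_odds(text, model) > 0.0
```

A call such as `is_spam("buy cheap pills", train(TOY_CORPUS))` classifies a message by comparing the two class scores. Deployed filters differ in tokenization, feature selection, and decision threshold, which is typically biased to avoid misclassifying legitimate mail.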
Some authors also proposed methods against image spam not based on text categorization techniques, but instead on low-level image features and features related to the presence of embedded text, like the relative area of the image occupied by text (Aradhye et al., 2005; Wu et al., 2005; Dredze et al., 2007).
Some recent works in the machine learning literature address spam filtering as an instance of adversarial classification (like intrusion detection in computer networks and biometric authentication and verification), namely a problem in which an intelligent, adaptive adversary exploits knowledge of a classifier to modify samples with the aim of evading the classifier itself (Dalvi et al., 2004; Graham-Cumming, 2004; Androutsopoulos et al., 2005).
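The adversarial setting described above can be illustrated with a toy "good word" attack, in which the adversary appends legitimate-looking words to a spam message until the filter's score drops below its threshold. This is only a sketch under strong assumptions: the linear token-weight filter, the weights in `TOKEN_WEIGHTS`, the threshold, and the assumption that the adversary knows all weights are hypothetical and are not taken from the cited papers.

```python
# Hypothetical per-token weights of a linear filter
# (positive = indicative of spam, negative = indicative of legitimate mail).
TOKEN_WEIGHTS = {
    "viagra": 3.0,
    "cheap": 1.5,
    "offer": 1.0,
    "meeting": -1.5,
    "report": -1.0,
    "thanks": -0.5,
}
THRESHOLD = 0.0  # messages scoring above this value are labelled spam

def score(tokens):
    # Presence-based score: each distinct known token contributes once.
    return sum(TOKEN_WEIGHTS.get(t, 0.0) for t in set(tokens))

def is_spam(tokens):
    return score(tokens) > THRESHOLD

def good_word_attack(tokens, max_added=10):
    """Greedily append the most 'legitimate' words until the filter is evaded.

    Assumes the adversary has full knowledge of TOKEN_WEIGHTS.
    Returns the modified token list; the attack may fail if too few
    good words are available.
    """
    evasive = list(tokens)
    candidates = sorted(
        (t for t, w in TOKEN_WEIGHTS.items() if w < 0 and t not in tokens),
        key=TOKEN_WEIGHTS.get,  # most negative weight first
    )
    for word in candidates[:max_added]:
        if not is_spam(evasive):
            break  # already evades the filter
        evasive.append(word)
    return evasive
```

For instance, `good_word_attack(["cheap", "offer"])` keeps the original spam tokens and pads the message with good words. Countermeasures such as the multiple instance learning strategy of Jorgensen et al. (2008) are designed to resist this kind of padding.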
Given that spam filtering is an arms race evolving at a very fast pace, many resources are available on-line in addition to published papers. We just mention here the Web site by J. Graham-Cumming, http://www.jgc.org, where the author maintained an updated list of tricks used by spammers, named "The Spammers' Compendium" (it is now maintained by Virus Bulletin).
For a comprehensive and detailed overview of spam filtering techniques, we refer the reader to the recent work by G.V. Cormack (2007).
References
I. Androutsopoulos, E. F. Magirou, and D. K. Vassilakis, "A game theoretic model of spam e-mailing," in Proc. of 2nd Conf. on Email and Anti-Spam (CEAS 2005), 2005.
H. B. Aradhye, G. K. Myers, and J. A. Herson, "Image analysis for efficient categorization of image-based spam e-mail," in Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), 2005.
A. Bratko, G. V. Cormack, B. Filipic, T. R. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, vol. 7, pp. 2673-2698, December 2006.
G. V. Cormack, "Email spam filtering: A systematic review," Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335-455, 2006.
N. N. Dalvi, P. M. Domingos, S. K. Sanghai, and D. Verma, "Adversarial classification," in Proc. Int. Conf. on Knowledge and Data Discovery (W. Kim, R. Kohavi, J. Gehrke, and W. DuMouchel, eds.), pp. 99-108, 2004.
M. Dredze, R. Gevaryahu, and A. Elias-Bachrach, "Learning fast classifiers for image spam," in Proc. Third Conf. on Email and Anti Spam (CEAS 2007), 2007.
H. Drucker, D. Wu, and V. N. Vapnik, "Support vector machines for spam categorization," IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
P. Graham, "Better Bayesian filtering," http://www.paulgraham.com/better.html, 2004.
J. Graham-Cumming, "How to beat an adaptive spam filter," in Proc. of MIT Spam Conference, 2004.
Z. Jorgensen, Y. Zhou, and M. Inge, "A multiple instance learning strategy for combating good word attacks on spam filters," Journal of Machine Learning Research, vol. 9, pp. 1115-1146, 2008.
M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," in Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, AAAI Technical Report WS-98-05, 1998.
C.-T. Wu, K.-T. Cheng, Q. Zhu, and Y.-L. Wu, "Using visual features for anti-spam filtering," in Proc. Int. Conf. on Image Processing (ICIP 2005), 2005.