Spam filtering: data sets

Spam Data Sets (e-mails)

TREC 2007 Spam Track Data Set:
http://plg.uwaterloo.ca/~gvcormac/treccorpus07/

 

Our personal spam data set and others can be found in our spam repository:
http://prag.diee.unica.it/public/datasets/spam/

 

Image Spam Data Sets (only attached images extracted from e-mails)

Our personal image ham and spam data sets and others can be found in our spam repository:
http://prag.diee.unica.it/public/datasets/imageSpam/

 

1) Image spam/ham dataset used in:

M. Dredze, R. Gevaryahu, A. Elias-Bachrach, "Learning fast classifiers for image spam",
Fourth Conference on Email and Anti-Spam (CEAS 2007), Mountain View,
California, August 2-3, 2007 (paper available at http://www.ceas.cc/).

Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Improving Image Spam Filtering Using Image Text Features", Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA, USA, August 21, 2008.

F. Gargiulo, C. Sansone, "Visual and OCR-based Features for detecting Image Spam", in A. Juan-Císcar and G. Sanchez-Albaladejo (Eds.), Pattern Recognition in Information Systems, INSTICC Press, pp. 154-163, 2008.  

Image Ham

2006 images attached to ham emails collected by M. Dredze during 2007. We have removed some corrupted images.

Image Spam

3297 images attached to spam emails collected by M. Dredze during 2007. We have removed some corrupted images.

 

2) Image spam dataset used in:  

Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Improving Image Spam Filtering Using Image Text Features", Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA, USA, August 21, 2008.

PRA Group Personal Image Spam Corpus

8549 images attached to spam emails received by mail.diee.unica.it during 2004-2007.

 

3) Another corpus, used in:


F. Gargiulo, C. Sansone, "Visual and OCR-based Features for detecting Image Spam", in A. Juan-Císcar and G. Sanchez-Albaladejo (Eds.), Pattern Recognition in Information Systems, INSTICC Press, pp. 154-163, 2008. 

F. Gargiulo, A. Penta, A. Picariello, C. Sansone, "A Behaviour-Knowledge Space Approach for Spam Detection",
in Proceedings of the 2nd Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications
(SUEMA 2008), Patras, Greece, July 21-22, 2008, pp. 16-20.

F. Gargiulo, A. Penta, A. Picariello, C. Sansone, "Using heterogeneous features for anti-spam filters",
in Proceedings of the 3rd International Workshop on Flexible Database and Information System
Technology (FlexDBIST-08), Turin, September 1-5, IEEE Computer Society Press, 2008 (in press).

F. Gargiulo, C. Sansone, "Combining visual and textual features for filtering spam emails",
in Proceedings of the 19th International Conference on Pattern Recognition, Tampa, USA, December 8-11,
IEEE Computer Society Press, 2008 (in press).

Image Ham

151 images attached to legitimate emails received by mailserverstudenti.unina.it during the period 2005-2007.

Image Spam Emails

20292 spam emails with images received by mailserverstudenti.unina.it during the period 2005-2007.