Spam filtering

Spam e-mails are unsolicited, unwanted e-mails sent indiscriminately by a sender who has no current relationship with the recipient. The goal is usually to advertise a product (often an illegal one), to perpetrate a fraud (for instance, so-called phishing e-mails, which aim to steal users' personal information), or to deliver computer malware. Legal, economic and technological countermeasures against spam have been proposed or adopted worldwide. The technological ones consist of spam filters, namely software that discriminates between legitimate and spam e-mails. E-mails recognised as spam are usually labelled as such by e-mail servers and moved to a separate "junk" folder by e-mail clients. Pattern recognition techniques are widely used by spam filters for content-based analysis and classification of e-mail text and attachments. This is, however, a challenging task, since spammers have devised several tricks to "obfuscate" the spam content so that it goes undetected by automatic analysis but remains readable by humans.

The PRA Lab's effort in spam filtering focused on “image-based spam” (image spam for short). This trick consists of embedding the spam message into attached images to prevent its detection by text-based filters. Sometimes the text embedded into the images is itself obfuscated, to undermine OCR-based filters.

We developed image spam filtering techniques based on the analysis of both the text in the e-mail's body and the text extracted from attached images by OCR tools; we also developed techniques aimed at detecting the typical artefacts used to obfuscate text embedded into images.
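The first idea above (classifying the body text together with the text recovered from images) can be illustrated with a minimal Python sketch. It is not the code of our plug-ins: the class NaiveBayesTextFilter, the function is_image_spam, the threshold parameter and the ocr callable are all hypothetical names chosen for this illustration, and a small naive Bayes classifier stands in for whatever text classifier a real filter would use. The OCR step is passed in as a function so that any OCR tool could be plugged in.

```python
import math
import re
from collections import Counter
from typing import Callable, Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextFilter:
    """Illustrative multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Add one labelled e-mail text ("spam" or "ham") to the model."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Return log P(spam | text) - log P(ham | text); positive means spam.

        Assumes at least one training example per class.
        """
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in vocab]  # ignore unseen words
        score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            log_prob = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                log_prob += math.log((count + 1) / (total_words + len(vocab)))
            score[label] = log_prob
        return score["spam"] - score["ham"]


def is_image_spam(
    body: str,
    images: Iterable[bytes],
    ocr: Callable[[bytes], str],
    text_filter: NaiveBayesTextFilter,
    threshold: float = 0.0,
) -> bool:
    """Classify the body text together with the text that OCR extracts from images."""
    combined = " ".join([body] + [ocr(image) for image in images])
    return text_filter.spam_log_odds(combined) > threshold
```

In this sketch an e-mail with an innocuous body is still flagged when the OCR output of its attachment contains words the classifier has learned to associate with spam. Obfuscated images that defeat OCR would not be caught this way, which is why the detection of obfuscation artefacts is treated as a separate technique.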

We integrated these techniques into two plug-ins (BayesOCR and Image Cerberus) for the widely used open-source SpamAssassin spam filter.