Spam filtering

Junk e-mails, or spam, can be defined as unsolicited, unwanted e-mails that are sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. The goal is usually to advertise some product (often illegal), to commit fraud (for instance, the so-called phishing e-mails, aimed at stealing users' personal information, such as the username and password used to access their bank accounts), or to spread malware. The growth of the spam phenomenon is essentially due to economic reasons: sending e-mails is so cheap that even a very small fraction of replies (as few as one in several thousand targeted recipients) makes it profitable. As a consequence, the great majority of e-mail traffic to date is spam. This poses several problems, including security problems for end users and for the network infrastructure.
Among the different kinds of countermeasures proposed or adopted worldwide against spam (mainly legal, economic and technological), the technological ones consist of spam filters, namely software aimed at detecting spam e-mails either at e-mail servers (preventing them from being delivered to the recipients) or at e-mail clients (allowing spam e-mails to be stored in a "junk" folder distinct from the user's other folders).
Spam filtering can be defined as the task of discriminating between legitimate and spam e-mails on the basis of their content. This can be formulated as a pattern classification task, and indeed most current spam filters include machine learning techniques. However, spam filtering is not a traditional pattern classification task, since spammers are intelligent, adaptive "adversaries" who continuously introduce new tricks specifically to evade spam filters. This kind of classification problem is named adversarial classification, and is common to many security applications, like intrusion detection in computer networks and biometric authentication and verification.
Our interest in spam filtering started in 2004, when spammers introduced a new trick to evade spam filters. Until about 2004, the spam message was written as text in the e-mail's body, often with tricks like misspelling typical "spammy" words or using complex HTML content to hide such words, with the aim of preventing spam filters from detecting them. At that time spam filters analyzed only the textual content of e-mails. Spammers then introduced a new trick, known as "image-based spam", or image spam for short, which consists of embedding the spam message into attached images, which were not analyzed by spam filters, and often adding "bogus" words (looking like legitimate text) to the e-mail's body to circumvent the text classifiers used in many spam filters (known as "Bayesian" filters, as they were based on the well-known Naive Bayes text classifier algorithm). Moreover, spammers often applied obfuscation techniques to the text embedded into images, to prevent it from being read by OCR tools. This raised the issue of improving spam filters to make them capable of analyzing not only the textual content of e-mails, but also other kinds of content through which the spam message can be conveyed, and in particular images, which requires the use of computer vision and pattern recognition techniques.
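To make the effect of "bogus" words concrete, the following Python sketch trains a tiny multinomial Naive Bayes text classifier and compares the spam log-odds of a message with and without legitimate-looking padding. It is only an illustration under stated assumptions: the class name `NaiveBayesFilter`, the tokenizer and the four training messages are invented for this example and do not reproduce any specific spam filter.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical toy class, written only for this illustration.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text); > 0 favours spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_p = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            value = math.log(self.doc_counts[label] / total_docs)
            for tok in tokenize(text):
                count = self.word_counts[label][tok]
                value += math.log((count + 1) / (total_words + len(vocab)))
            log_p[label] = value
        return log_p["spam"] - log_p["ham"]


# Invented toy training data (assumption, not a real corpus).
nb = NaiveBayesFilter()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches online now", "spam")
nb.train("meeting agenda for the project review", "ham")
nb.train("please review the attached project report", "ham")

plain = "buy cheap pills now"
padded = plain + " project review meeting agenda report"  # bogus words added

print("plain :", nb.spam_log_odds(plain))
print("padded:", nb.spam_log_odds(padded))
```

Each added word that is more frequent in legitimate training messages than in spam contributes a negative term to the log-odds, so the padded message scores as less "spammy" than the plain one even though the spam content is unchanged.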
Our first contribution to spam filtering was an investigation of the feasibility and effectiveness of detecting image spam by applying text categorization techniques both to the text in the e-mail's body and to the text extracted by OCR techniques from attached images, if any (see our 2006 paper in JMLR). This approach proved to be effective on clean images, such as the one in the example below.
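The general idea of this approach can be sketched in a few lines of Python: the text recognized in each attached image is appended to the body text, and the resulting tokens are passed to whatever text classifier the filter already uses. This is a minimal sketch, not the method of the paper: the function `tokens_for_classifier` is hypothetical, and the OCR engine is passed in as a callable, with `fake_ocr` acting as a stand-in that simply decodes the bytes it receives.

```python
import re
from typing import Callable, Iterable, List


def tokens_for_classifier(
    body: str,
    images: Iterable[bytes],
    ocr: Callable[[bytes], str],
) -> List[str]:
    """Merge the body text with the text recognized in attached images.

    `ocr` is any function mapping raw image bytes to recognized text
    (hypothetical interface; a real filter would wrap an OCR engine here).
    """
    texts = [body]
    for image in images:
        texts.append(ocr(image))
    # Same tokenization for both sources, so one text classifier can be used.
    return re.findall(r"[a-z0-9]+", " ".join(texts).lower())


def fake_ocr(image_bytes: bytes) -> str:
    """Stand-in for an OCR engine: pretends the bytes are the embedded text."""
    return image_bytes.decode("ascii")


tokens = tokens_for_classifier("Hello there", [b"BUY cheap pills"], fake_ocr)
print(tokens)
```

With this arrangement, bogus words in the body no longer appear alone: the words embedded in the image, when the OCR step can read them, reach the text classifier as well.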
Fig.1: Example of a clean image attached to a real spam e-mail taken from our personal mailbox.
We then investigated the possibility of recognizing image spam with obfuscated images (like those in the examples below), in which the embedded text can be very difficult for OCR tools to read, by detecting in attached images the presence of text with artifacts that denote an adversarial attempt to obfuscate it.
    
 
Fig.2: Examples of obfuscated images attached to real spam e-mails taken from our personal mailbox.
Finally, we investigated the effectiveness of generic low-level image features (like number of colours, prevalent colour coverage, image aspect ratio, text area) in discriminating images attached to legitimate e-mails from images attached to spam e-mails (with either clean or obfuscated text). Note, however, that here the task does not consist of labelling an attached image as spam or legitimate: this is an ill-posed task, since such labels can be assigned only to the whole e-mail, and an image is not spam or legitimate by itself, but only in relation to its context (the e-mail it is attached to). Accordingly, the low-level information coming from images has to be integrated with information coming from the header and body to assign a label to the whole e-mail (as an example, if I send a friend of mine an e-mail with an attached image taken from a spam e-mail, that e-mail would be a legitimate one for my friend). A possible architecture of a spam filter is thus the following:
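As a rough illustration of this kind of integration, the Python sketch below computes three of the low-level features mentioned above (number of colours, prevalent colour coverage, aspect ratio) from a grid of RGB pixels, turns them into an image score, and adds that score to header and body scores before a single decision is taken on the whole e-mail. Everything specific in it is an assumption made for the example: the function names, the hand-set rules in `image_score`, the additive combination and the threshold of 5.0 are not taken from our plug-ins, and the text area feature is omitted because it would require a text detector.

```python
from collections import Counter


def image_features(pixels):
    """Compute low-level features from a list of rows of (r, g, b) tuples."""
    height = len(pixels)
    width = len(pixels[0])
    counts = Counter(pixel for row in pixels for pixel in row)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "num_colours": len(counts),
        "prevalent_colour_coverage": most_common_count / (width * height),
        "aspect_ratio": width / height,
    }


def image_score(features):
    """Hypothetical hand-set rules; a real module would use a trained classifier."""
    score = 0.0
    if features["num_colours"] <= 8:  # few colours: flat, text-like image
        score += 1.0
    if features["prevalent_colour_coverage"] >= 0.6:  # large uniform background
        score += 1.0
    return score


def classify_email(header_score, body_score, image_scores, threshold=5.0):
    """Label the whole e-mail by summing the module scores (assumed scheme)."""
    total = header_score + body_score + max(image_scores, default=0.0)
    label = "spam" if total >= threshold else "legitimate"
    return label, total


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
toy_image = [
    [WHITE, WHITE, WHITE, BLACK],
    [WHITE, WHITE, BLACK, WHITE],
]
features = image_features(toy_image)
img = image_score(features)

# Same image, different context: only the whole e-mail gets a label.
print(classify_email(header_score=1.5, body_score=2.0, image_scores=[img]))
print(classify_email(header_score=0.0, body_score=0.5, image_scores=[img]))
```

The two calls use the same image score but different header and body scores, so the first e-mail ends up above the threshold and the second below it, which mirrors the example of forwarding a spam image to a friend.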

 

 

As by-products of our work, we developed two plug-ins against image spam for the widely used open source SpamAssassin spam filter (http://spamassassin.apache.org), one based on a text classifier applied to the text extracted from attached images by OCR tools, and one based on low-level image features (see the Products page of this section). We also collected a large corpus of spam e-mails received in our personal mailboxes (see the Data sets section). Finally, we developed a tool to generate artificial spam images (see the Prototypes page), which implements different kinds of text obfuscation techniques with a chosen degree of obfuscation, and is useful for carrying out extensive experiments on techniques against image spam.

 

People working on this topic: 

Battista Biggio
Giorgio Fumera
Ignazio Pillai
Fabio Roli
Riccardo Satta