Plug-in for the Spamassassin© spam filter

Bayes OCR Plug-in is designed to perform a content-based analysis of images attached to e-mails. Spammers commonly embed the text of their message in an image so that it remains easy for a human to read but is very difficult for common OCR software to extract correctly. Our goal is to help Spamassassin catch these spam messages as well.
Please bear in mind that Bayes OCR Plug-in is still in beta!
Other plug-ins using an OCR system.
There is a wide choice of third-party software that extends the functionality of Spamassassin, and some of these plug-ins perform a content-based analysis of images through an OCR. We mention two important examples here.
OCR Plugin is a simple plug-in that extracts the text from the image and performs a plain keyword search to decide whether the content is spam or ham. The only evaluation made of the text is a count of the words known as "spam words": if the count is above a given threshold, a hit is returned to Spamassassin for that e-mail.
The weakness of this method is that, in most spam messages, the text inside the image is obscured by noisy elements (such as transversal lines) that heavily degrade the text extracted by an OCR. As a result, the OCR text contains random errors, the keywords no longer match and new spam messages are not blocked, so a more complex analysis is needed.
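As a rough illustration of the keyword-count approach and of its weakness, consider the following minimal Python sketch. The word list, the threshold and the function names are hypothetical and are not taken from OCR Plugin itself.

```python
import re

# Hypothetical word list and threshold, chosen only for illustration.
SPAM_WORDS = {"viagra", "cialis", "pharmacy"}
THRESHOLD = 2


def count_spam_words(ocr_text, spam_words=SPAM_WORDS):
    """Count the tokens of the OCR output that exactly match a spam word."""
    tokens = re.findall(r"[a-z0-9@]+", ocr_text.lower())
    return sum(1 for token in tokens if token in spam_words)


def is_hit(ocr_text, threshold=THRESHOLD):
    """Report a hit only when the count is above the threshold."""
    return count_spam_words(ocr_text) > threshold


# Text as a clean OCR run might return it.
clean_text = "Cheap viagra and cialis from our online pharmacy"
# The same text with the kind of random errors that image noise causes.
noisy_text = "Cheap vi@gra and c1alis from our online pharmacy"

print(is_hit(clean_text), is_hit(noisy_text))
```

With exact matching, two corrupted characters are enough to push the second message below the threshold, which is the weakness described above.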
Fuzzy OCR Plugin takes a slightly different approach. Its keyword matching algorithm counts not only the words that exactly match the known "spam words" but also the words that are close enough to them (for instance, viagra and vi @gra). The major improvement of this plug-in comes from the concept of distance between words, measured as the Levenshtein edit distance: if the distance between two words is smaller than a given value, they are considered the same word.
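The idea of matching by edit distance can be sketched in a few lines of Python. This is only an illustration of the Levenshtein distance itself: the word list, the value of MAX_DISTANCE and the helper names are assumptions, and the actual Fuzzy OCR Plugin may compute and apply the distance differently.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# Hypothetical word list and distance bound, chosen only for illustration.
SPAM_WORDS = {"viagra", "cialis"}
MAX_DISTANCE = 3


def fuzzy_match(word, spam_words=SPAM_WORDS, max_distance=MAX_DISTANCE):
    """True when the word is closer than max_distance to any spam word."""
    return any(levenshtein(word, spam) < max_distance for spam in spam_words)


print(fuzzy_match("vi@gra"), fuzzy_match("vi @gra"), fuzzy_match("meeting"))
```

The bound trades recall for precision: a larger value tolerates more OCR errors but also makes it more likely that a legitimate word is counted as a spam word.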
Why use Bayes OCR Plug-in?
We believe the major weakness of the plug-ins mentioned above is the keyword search on the OCR text. This method has no generalization capability and recognizes only those e-mails that contain words from the keyword list. Being strictly tied to that list, it suffers from the same weakness as a simple keyword search on the standard text of e-mails.
In order to get a more flexible analysis, Bayes OCR Plug-in tries to extend the Bayesian analysis (which is so far commonly performed only upon the text content of the e-mail) to the text extracted from the images through an OCR.
In more detail, the text is extracted from the image (if present) by an OCR and passed to a Bayesian classifier, which performs the same analysis that is made on the standard text of the e-mail.
So far, Bayes OCR Plug-in uses the Naive Bayes classifier integrated in Spamassassin. This means that the Bayesian classifier is trained on the standard text and needs no further training, which is reasonable because the spam text contained in images is commonly similar to the standard text.
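To make the idea concrete, the sketch below trains a toy classifier on ordinary message text and then applies it, unchanged, to text that an OCR might have extracted from an image. It is a simplified multinomial Naive Bayes with add-one smoothing written for illustration only; it is not Spamassassin's classifier, and the class name, the training sentences and the sample OCR output are all invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyNaiveBayes:
    """Illustrative classifier; assumes each class has been trained at least once."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for token in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[token] + 1) / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Turn the two log scores into a probability in a numerically stable way.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


classifier = ToyNaiveBayes()
# Training uses only the standard text of messages.
classifier.train("buy cheap viagra online now", "spam")
classifier.train("cheap pills discount pharmacy online", "spam")
classifier.train("meeting agenda for the project review", "ham")
classifier.train("please find the quarterly report attached", "ham")

# Invented OCR outputs, scored with the classifier trained on standard text.
spam_like = classifier.spam_probability("cheap viagra discount online")
ham_like = classifier.spam_probability("project meeting report")
print(spam_like > ham_like)
```

Because the score is built from per-word statistics rather than a fixed keyword list, text that shares vocabulary with known spam is rated as likely spam even when no single keyword rule would fire.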
Our initial tests (carried out on our local database of about a hundred e-mails) show that Bayes OCR Plug-in can contribute effectively to spam classification when an image is attached. They show a significant drop in the number of false negatives (spam not being hit by Spamassassin), while the number of false positives (legitimate messages recognised as spam) stays at zero.
We decided not to report numerical results for two reasons. The first is that no standard database is available in the literature as a common performance reference for this kind of task. The second is that we are currently working to build a more representative database.
The tests mentioned above were performed on a dataset of real spam e-mails that include images, to which we added a set of artificially generated legitimate e-mails. The legitimate e-mails were generated by attaching legitimate images containing text to a subset of the Enron dataset (a standard for scientific publications in the field of mail classification).
A quantitative analysis of the technique we implemented for Bayes OCR Plug-in and further considerations can be found in our academic publications about spam filtering.
- Download the two files from the button on the right
- Copy the two files into the local configuration folder of Spamassassin
- Restart Spamassassin to start working with Bayes OCR Plug-in
- If needed, edit the configuration file to set up your custom score. Remember to restart Spamassassin after any change
Bayes OCR Plug-in only needs working versions of Spamassassin, convert (ImageMagick), identify (ImageMagick) and gocr.
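Before installing, it can be useful to check that these programs are reachable from the PATH. The short Python sketch below does that with the standard library; it assumes the executables are named spamassassin, convert, identify and gocr on your system, which may differ between distributions.

```python
import shutil

# Assumed executable names; adjust them if your system uses different ones.
REQUIRED_TOOLS = ("spamassassin", "convert", "identify", "gocr")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

This only confirms that the programs exist; it does not verify that each one is a working version.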
This software is released under the Apache License, Version 2.0. Every improvement and redistribution is approved and warmly encouraged.
Bayes OCR Plug-in is provided "as is" without warranty of any kind. We do not assume any responsibility for its performance or for any damage arising out of the use of the software. Use it at your own risk!