Bayes OCR Plugin
What is it?
Bayes OCR Plugin is a plugin for the Spamassassin© spam filter, developed entirely by the P.R.A. Group (Pattern Recognition and Applications Group) of the Department of Electrical and Electronic Engineering (D.I.E.E.) of the University of Cagliari.
Bayes OCR Plugin is designed to perform a content-based analysis of images attached to emails.
Spammers commonly embed the text of their message in an image so that it is easy for a human to read but very difficult for ordinary OCR software to extract correctly. Our goal is to help Spamassassin catch these spam messages as well.
Please remember that Bayes OCR Plugin is still in beta!
I know other plugins that use OCR!
There is a wide choice of third-party software that extends the functionality of Spamassassin and, among them, some plugins do perform a content-based analysis of images through OCR. We mention two important examples here.
OCR Plugin is a simple plugin that extracts the text from the image and performs a straight keyword search to decide whether the content is spam or ham. The only evaluation made of the text is a count of the words known as "spam words": if the count is above a given threshold, a hit is returned to Spamassassin for that email.
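The keyword-count idea described above can be sketched in a few lines of Python. This is only an illustration, not the code of OCR Plugin: the word list, the threshold and the function names are hypothetical. The second sample shows how a couple of OCR misreads are enough to drop the count below the threshold.

```python
import re

# Hypothetical keyword list and threshold, chosen only for illustration.
SPAM_WORDS = {"viagra", "cialis", "mortgage", "lottery"}
THRESHOLD = 2


def count_spam_words(ocr_text, spam_words=SPAM_WORDS):
    """Count the tokens of the OCR text that exactly match a known spam word."""
    tokens = re.findall(r"[a-z0-9@]+", ocr_text.lower())
    return sum(1 for token in tokens if token in spam_words)


def is_hit(ocr_text, threshold=THRESHOLD):
    """Return True when the number of exact matches is above the threshold."""
    return count_spam_words(ocr_text) > threshold


clean = "Buy viagra and cialis now, win the lottery"
noisy = "Buy v1agra and cia1is now, win the lottery"  # simulated OCR errors

print(count_spam_words(clean), is_hit(clean))
print(count_spam_words(noisy), is_hit(noisy))
```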
The weakness of this method is that, in most spam messages, the text inside the image is obscured by noisy elements (such as transversal lines), which heavily degrade the text extracted by an OCR. As a result, the OCR text contains random errors that defeat exact keyword matching, and new spam messages are not blocked, so a more complex analysis is needed.
Fuzzy OCR Plugin (see: http://fuzzyocr.own-hero.net/wiki/WhatisFuzzyOcr) takes a slightly different approach. Its keyword matching algorithm counts not only the words that exactly match the known "spam words" but also the words that are close enough to them (for instance, viagra and vi@gra). The major improvement of this plugin comes from the concept of distance between words, computed as the "Levenshtein edit distance": if the distance between two words is below a given value, they are considered the same word.
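To make the notion of word distance concrete, here is a minimal Python sketch of the Levenshtein edit distance and of a fuzzy match built on it. It is not taken from Fuzzy OCR Plugin: the MAX_DISTANCE value and the fuzzy_match helper are assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# Hypothetical limit: two words count as the same word below this distance.
MAX_DISTANCE = 2


def fuzzy_match(word, spam_words, max_distance=MAX_DISTANCE):
    """Return True if word is close enough to any known spam word."""
    return any(levenshtein(word.lower(), spam) < max_distance
               for spam in spam_words)


print(levenshtein("viagra", "vi@gra"))
print(fuzzy_match("vi@gra", {"viagra"}))
```

With these settings, vi@gra is one substitution away from viagra and is therefore treated as the same word, while an unrelated word is not.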
Why use Bayes OCR Plugin?
We believe the major weakness of the plugins mentioned above is the keyword search on the OCR text. This method has no generalization capability and recognizes only those emails that contain words from the keyword list. Being strictly tied to the keyword list, it suffers from the same weakness as a simple keyword search on the standard text of emails.
To get a more flexible analysis, Bayes OCR Plugin extends the Bayesian analysis (which so far is commonly performed only on the text content of the email) to the text extracted from the images through an OCR.
In detail, the text is extracted from the image (if present) by an OCR and passed to a Bayesian classifier, which performs the same analysis that is made on the standard text of the email.
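The extraction step can be pictured with the following Python sketch. It is not the plugin's Perl code: it only assumes that the attached image is first converted to PNM with ImageMagick's convert and then read by gocr with its -i option. The function names and the injectable run parameter are hypothetical and are there only to keep the example easy to test.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(image_path, pnm_path):
    """Return the two command lines of the assumed pipeline."""
    return [
        # ImageMagick picks the output format from the .pnm extension.
        ["convert", str(image_path), str(pnm_path)],
        # gocr reads the PNM file and writes the recognised text to stdout.
        ["gocr", "-i", str(pnm_path)],
    ]


def extract_text(image_path, run=subprocess.run):
    """Convert the image to PNM, run gocr on it and return the OCR text."""
    with tempfile.TemporaryDirectory() as workdir:
        pnm_path = Path(workdir) / "image.pnm"
        convert_cmd, gocr_cmd = build_commands(image_path, pnm_path)
        run(convert_cmd, check=True)
        result = run(gocr_cmd, check=True, capture_output=True, text=True)
        return result.stdout
```

On a machine where convert and gocr are installed, extract_text("attachment.gif") would return the raw OCR text, which is then what gets handed to the Bayesian classifier.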
So far, Bayes OCR Plugin uses the Naive Bayes classifier integrated in Spamassassin. This means that the Bayesian classifier is trained on the standard text and needs no further training, which is reasonable because the spam text contained in images is commonly similar to the standard text.
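The idea of reusing a classifier trained on ordinary email text can be illustrated with a toy Naive Bayes in Python. This is a textbook multinomial model with add-one smoothing, not Spamassassin's own Bayes implementation, and the training sentences are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Toy multinomial Naive Bayes with two classes, spam and ham."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        counts = self.word_counts[label]
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(counts.values())
        # Class prior, then one smoothed likelihood term per token.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / (total + len(vocabulary)))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


# Train on ordinary email body text only (invented samples).
classifier = NaiveBayes()
classifier.train("cheap pills online buy now", "spam")
classifier.train("buy cheap watches now", "spam")
classifier.train("meeting agenda for monday", "ham")
classifier.train("please review the attached report", "ham")

# Classify text that, in the real setting, would come from the OCR.
print(classifier.classify("buy cheap pills"))
print(classifier.classify("monday meeting report"))
```

The point of the sketch is that no image-specific training is involved: the OCR text is scored with the statistics already learned from the body text.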
Our initial tests (carried out on our local database of about a hundred emails) show that Bayes OCR Plugin can contribute effectively to spam classification when an image is attached. They show a significant drop in the number of false negatives (spam not being hit by Spamassassin), while the number of false positives (legitimate messages recognised as spam) is kept at zero.
We decided not to report numerical results for two reasons. The first is that no standard database is available in the literature as a common reference for performance on this kind of task. The second is that we are currently working to get a more representative database.
The tests mentioned above were performed on a dataset of real spam emails that include images, to which we added a set of artificially generated legitimate emails. The legitimate emails were generated by attaching legitimate images containing text to a subset of the Enron dataset (a standard for scientific publications in the field of mail classification).
A quantitative analysis of the technique we implemented for Bayes OCR Plugin and further considerations can be found in our academic publications about spam filtering.
Is it possible to contribute?
Our plugin is still under heavy development, and the version available here is a beta release. Further developments have already been planned and, more importantly, we still have to perform major tests.
Every single contribution to the test phase is extremely (;-)) welcome, whether to the Perl code or to the performance tests. Just feel free to volunteer!
Regarding the performance tests, we would be very interested in testing on an English-language set containing a large number of legitimate emails with images attached, because most spam comes in English. If you know where to get such a dataset, please let us know. If you have one but it is not publicly available, we would be happy just to know the performance of Bayes OCR Plugin on it.
Suggestions at any level and any other contribution as an end user are warmly welcome.
Download the files BayesOCR_PLG.cf and BayesOCR_PLG.pm from here: download.
Then just copy the two files into the local configuration folder of Spamassassin. After that, you only have to restart Spamassassin to start working with Bayes OCR Plugin.
If needed, edit the configuration file BayesOCR_PLG.cf to set up your custom score. Remember to restart Spamassassin after any change.
Bayes OCR Plugin only needs working versions of Spamassassin, convert (ImageMagick), identify (ImageMagick) and gocr.
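Before restarting Spamassassin, it can be useful to check that the required programs are reachable on the PATH. The following Python sketch is not part of the plugin; it assumes the executables are installed under the names spamassassin, convert, identify and gocr, and the missing_tools helper is hypothetical.

```python
import shutil

# Executable names assumed from the requirements listed above.
REQUIRED_TOOLS = ("spamassassin", "convert", "identify", "gocr")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required executables that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found.")
```

Finding an executable on the PATH does not prove that it is a working version, so this is only a first sanity check.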
This software is released under the Apache Software License (version 2.0). Every improvement and redistribution is approved and warmly encouraged.
Bayes OCR Plugin is provided "as is" without warranty of any kind. We do not assume any responsibility for the performance of the software or for any possible damage arising out of its use. Use it at your own risk!