Document categorization is a broad research field that encompasses several tasks (such as classification, filtering, retrieval, text extraction and recognition) related to the content-based management and processing of several kinds of documents in digital form, such as document images acquired through a scanner (books, journals, invoices, etc.), web pages and e-mails. Machine learning and pattern recognition techniques are widely used in document categorization tasks.
Our interests in this topic focus on text categorization (labeling text documents written in natural language with thematic categories from a predefined set) and on spam filtering. We are also working on a project on automatic text extraction and recognition from scanned images of document forms such as invoices and tax payment receipts (see the Project section of this page).
People working on this topic:
- Battista Biggio
- Giorgio Fumera
- Ignazio Pillai
- Fabio Roli
- Riccardo Satta
Publications on Document Categorisation
Journal Article
Giorgio Fumera, Ignazio Pillai, Fabio Roli,
"Spam filtering based on the analysis of text information embedded into images",
Journal of Machine Learning Research (special issue on Machine Learning in Computer Security), vol. 7, pp. 2699-2720, 12/2006.
Conference Paper
Battista Biggio, Giorgio Fumera, Fabio Roli,
"Adversarial Pattern Classification Using Multiple Classifiers and Randomisation",
12th Joint IAPR International Workshop on Structural and Syntactic Pattern Recognition (SSPR 2008), Orlando, Florida, USA, Springer-Verlag, 04/12/2008.
Battista Biggio, Giorgio Fumera, Fabio Roli,
"Evade Hard Multiple Classifier Systems",
Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (SUEMA 2008), Patras, Greece, 21/07/2008.
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli,
"Image Spam Filtering Using Visual Information",
14th Int. Conf. on Image Analysis and Processing (ICIAP 2007), Modena, Italy, IEEE Computer Society, pp. 105-110, 10/09/2007.
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli,
"Image Spam Filtering by Content Obscuring Detection",
Fourth Conference on Email and Anti-Spam (CEAS 2007), Microsoft Research Silicon Valley, Mountain View, California, 02/08/2007.
Fabio Roli, Battista Biggio, Giorgio Fumera, Ignazio Pillai, Riccardo Satta,
"Image Spam Filtering by Detection of Adversarial Obfuscated Text",
Workshop on Neural Information Processing Systems (NIPS), Whistler, British Columbia, Canada, 08/12/2007.
Giorgio Fumera, Ignazio Pillai, Fabio Roli,
"A Two-Stage Classifier with Reject Option for Text Categorisation",
5th Int. Workshop on Statistical Techniques in Pattern Recognition (SPR 2004), vol. 3138, Lisbon, Portugal, Springer, pp. 771-779, 18/08/2004.
Giorgio Fumera, Ignazio Pillai, Fabio Roli,
"Classification with Reject Option in Text Categorisation Systems",
12th International Conference on Image Analysis and Processing (ICIAP 2003), Mantova, IEEE Computer Society, pp. 582-587, 17/09/2003.
Thesis
Miscellaneous