Document Categorisation

Document categorization is a broad research field that encompasses several tasks (such as classification, filtering, retrieval, and text extraction and recognition) related to the content-based management and processing of many kinds of documents in digital form, such as document images acquired through a scanner (books, journals, invoices, etc.), web pages, and e-mails. Machine learning and pattern recognition techniques are widely used in document categorization tasks.

Our interests in this topic focus on text categorization (labeling text documents written in natural language with thematic categories from a predefined set) and on spam filtering. We are also working on a project on automatic text extraction and recognition from scanned images of document forms such as invoices and tax payment receipts (see the Project section of this page).

People working on this topic:

  • Battista Biggio
  • Giorgio Fumera
  • Ignazio Pillai
  • Fabio Roli
  • Riccardo Satta

Publications on Document Categorisation

Journal Article
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, "A survey and experimental evaluation of image spam filtering techniques", Pattern Recognition Letters, vol. 32, issue 10, pp. 1436-1446, 2011.
Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Spam filtering based on the analysis of text information embedded into images", Journal of Machine Learning Research (special issue on Machine Learning in Computer Security), vol. 7, pp. 2699-2720, 12/2006.
Conference Paper
Ignazio Pillai, Giorgio Fumera, Fabio Roli, "Classifier Selection Approaches for Multi-label Problems", 10th Int. Workshop on Multiple Classifier Systems (MCS 2011), Naples, Springer, 15/06/2011.
Ignazio Pillai, Giorgio Fumera, Fabio Roli, "A Classification Approach with a Reject Option for Multi-label Problems", 16th Int. Conf. on Image Analysis and Processing (ICIAP 2011), Ravenna, Italy, 14/09/2011.
Ignazio Pillai, Riccardo Satta, Giorgio Fumera, Fabio Roli, "Exploiting Depth Information for Indoor-Outdoor Scene Classification", 16th Int. Conf. on Image Analysis and Processing (ICIAP 2011), Ravenna, Italy, 14/09/2011.
Jun-Ming Xu, Giorgio Fumera, Fabio Roli, Zhi-Hua Zhou, "Training SpamAssassin with Active Semi-supervised Learning", 6th Conference on Email and Anti-Spam (CEAS 2009), Mountain View, CA, USA, 16/07/2009.
Battista Biggio, Giorgio Fumera, Fabio Roli, "Adversarial Pattern Classification Using Multiple Classifiers and Randomisation", 12th Joint IAPR International Workshop on Structural and Syntactic Pattern Recognition (SSPR 2008), Orlando, Florida, USA, Springer-Verlag, 04/12/2008.
Battista Biggio, Giorgio Fumera, Fabio Roli, "Evade Hard Multiple Classifier Systems", Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (SUEMA 2008), Patras, Greece, 21/07/2008.
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Improving Image Spam Filtering Using Image Text Features", Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA, USA, 21/08/2008.
Giorgio Fumera, Ignazio Pillai, Fabio Roli, Battista Biggio, "Image spam filtering using textual and visual information", MIT Spam Conference 2007, Cambridge, MA, USA, 30/03/2007.
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Image Spam Filtering Using Visual Information", 14th Int. Conf. on Image Analysis and Processing (ICIAP 2007), Modena, Italy, IEEE Computer Society, pp. 105-110, 10/09/2007.
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Image Spam Filtering by Content Obscuring Detection", Fourth Conference on Email and Anti-Spam (CEAS 2007), Microsoft Research Silicon Valley, Mountain View, California, 02/08/2007.
Fabio Roli, Battista Biggio, Giorgio Fumera, Ignazio Pillai, Riccardo Satta, "Image Spam Filtering by Detection of Adversarial Obfuscated Text", Workshop on Neural Information Processing Systems (NIPS), Whistler, British Columbia, Canada, 08/12/2007.
Giorgio Fumera, Ignazio Pillai, Fabio Roli, "A Two-Stage Classifier with Reject Option for Text Categorisation", 5th Int. Workshop on Statistical Techniques in Pattern Recognition (SPR 2004), vol. 3138, Lisbon, Portugal, Springer, pp. 771-779, 18/08/2004.
Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Classification with Reject Option in Text Categorisation Systems", 12th International Conference on Image Analysis and Processing (ICIAP 2003), Mantova, IEEE Computer Society, pp. 582-587, 17/09/2003.
Thesis
Ignazio Pillai, "High Reliability Text Categorisation Systems", DIEE, Cagliari (Italy), pp. 90, 2007.
Miscellaneous
Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli, Riccardo Satta, "Evading SpamAssassin with obfuscated text images", Virus Bulletin, 11/2007.