Pattern Analysis and Machine Intelligence
The IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) is published monthly. Its Editorial Board strives to publish papers that present important research results within PAMI's scope. These include statistical and structural pattern recognition; image analysis; computational models of vision; computer vision systems; enhancement, restoration, segmentation, feature extraction, shape and texture analysis; applications of pattern analysis in medicine, industry, government, and the arts and sciences; artificial intelligence, knowledge representation, logical and probabilistic inference, learning, speech recognition, character and text recognition, syntactic and semantic processing, understanding natural language, expert systems, and specialized architectures for such processing.
This paper proposes a new approach to correcting geometric distortion and reducing space- and time-varying blur, capable of restoring a single high-quality image from an image sequence distorted by atmospheric turbulence. The approach reduces the space- and time-varying deblurring problem to a shift-invariant one. It first registers each frame to suppress geometric deformation using B-spline based non-rigid registration. Next, a temporal regression process produces an image from the registered frames that can be viewed as being convolved with a space-invariant, near-diffraction-limited blur. Finally, a blind deconvolution algorithm deblurs the fused image to generate the final output. Experiments on real data show that this approach effectively alleviates blur and distortion, recovers scene detail, and significantly improves visual quality.
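A minimal sketch of the fuse-then-deblur idea follows, assuming pre-registered grayscale frames, an assumed Gaussian point-spread function, and a plain Wiener filter standing in for the paper's blind deconvolution:

```python
# Minimal sketch of the fuse-then-deblur idea, not the paper's algorithm.
# Assumes frames are already registered; uses an assumed Gaussian PSF and a
# simple Wiener filter in place of true blind deconvolution.
import numpy as np

def gaussian_psf(shape, sigma=2.0):
    """Centered Gaussian point-spread function, normalized to unit sum."""
    h, w = shape
    y, x = np.mgrid[:h, :w]
    g = np.exp(-(((y - h // 2) ** 2 + (x - w // 2) ** 2) / (2 * sigma ** 2)))
    return g / g.sum()

def fuse_and_deblur(registered_frames, sigma=2.0, nsr=1e-2):
    """Average registered frames, then Wiener-deconvolve the fused image."""
    fused = np.mean(registered_frames, axis=0)          # temporal fusion
    psf = gaussian_psf(fused.shape, sigma)
    H = np.fft.fft2(np.fft.ifftshift(psf))              # PSF spectrum
    G = np.fft.fft2(fused)
    wiener = np.conj(H) / (np.abs(H) ** 2 + nsr)        # Wiener filter
    return np.real(np.fft.ifft2(wiener * G))
```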
PrePrint: A Novel Bayesian Framework for Discriminative Feature Extraction in Brain-Computer Interfaces
As the learning load has shifted from the human subject to the computer, machine learning has come to be regarded as a useful tool for Brain-Computer Interfaces (BCIs). In this paper, we propose a novel Bayesian framework for discriminative feature extraction for motor imagery classification in an EEG-based BCI, in which the class-discriminative frequency bands and the corresponding spatial filters are optimized by probabilistic and information-theoretic means. In our framework, the problem of simultaneous spatio-spectral filter optimization is formulated as the estimation of an unknown posterior pdf representing the probability that a single-trial EEG of predefined mental tasks can be discriminated in a given state. To estimate this posterior pdf, we propose a particle-based approximation method that extends a factored-sampling technique with a diffusion process. An information-theoretic observation model is also devised to measure the discriminative power of features between classes. From the viewpoint of classifier design, the proposed method naturally allows us to construct a spectrally weighted label decision rule by linearly combining the outputs of multiple classifiers. We demonstrate the feasibility and effectiveness of the proposed method on three public databases.
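To make the particle approximation concrete, here is a generic factored-sampling loop with a diffusion step (Condensation-style), using a toy Gaussian likelihood in place of the paper's information-theoretic observation model; all names and values are illustrative:

```python
# Condensation-style factored sampling with a diffusion step, sketched to show
# the particle-approximation idea; the observation model here is a toy
# Gaussian likelihood, not the paper's information-theoretic one.
import numpy as np

rng = np.random.default_rng(4)

def particle_step(particles, weights, likelihood, noise=0.05):
    """Resample by weight, diffuse, then reweight with the observation model."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    moved = particles[idx] + noise * rng.normal(size=particles.shape)  # diffusion
    w = likelihood(moved)
    return moved, w / w.sum()

# Toy posterior over a 1-D "spectral band center" peaking at 0.3.
like = lambda x: np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2).ravel()
particles = rng.uniform(0, 1, size=(500, 1))
weights = np.full(500, 1 / 500)
for _ in range(20):
    particles, weights = particle_step(particles, weights, like)
print(float((particles.ravel() * weights).sum()))   # posterior mean, ~0.3
```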
PrePrint: Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses
Detecting objects in cluttered scenes and estimating articulated human body parts from 2D images are two challenging problems in computer vision. The difficulty is particularly pronounced in activities involving human-object interactions (e.g., playing tennis), where the relevant objects tend to be small or only partially visible, and the human body parts are often self-occluded. We observe, however, that objects and human poses can serve as mutual context to each other: recognizing one facilitates the recognition of the other. In this paper we propose a mutual context model to jointly model objects and human poses in human-object interaction activities. In our approach, object detection provides a strong prior for better human pose estimation, while human pose estimation improves the accuracy of detecting the objects that interact with the human. On a six-class sports dataset and a 24-class dataset of people interacting with musical instruments, we show that our mutual context model outperforms the state of the art in detecting very difficult objects and estimating human poses, as well as in classifying human-object interaction activities.
Whole-book recognition is a document image analysis strategy that operates on the complete set of a book's page images, using automatic adaptation to improve accuracy. The algorithm is initialized with approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally imperfect) dictionaries, and then, guided entirely by evidence internal to the test set, corrects the models, which in turn yields higher recognition accuracy. It detects "disagreements" by measuring the cross entropy between (1) the posterior probability distribution of character classes and (2) the posterior probability distribution of word classes. We show how disagreements can identify candidates for model corrections at both the character and word levels. Experiments on passages up to 180 pages long show that when a candidate model adaptation reduces whole-book disagreement, it is also likely to correct recognition errors. Moreover, the longer the passage operated on by the algorithm, the more reliable this adaptation policy becomes, and the lower the error rate achieved. The best results occur when the iconic and linguistic models mutually correct one another. We have observed recognition error rates driven down by nearly an order of magnitude, fully automatically, without supervision or any user intervention.
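The disagreement measure named above is a cross entropy between two posteriors over the same interpretations; a minimal sketch with placeholder distributions (not the paper's models):

```python
# Hedged sketch of the disagreement measure described above: the cross entropy
# between two posterior distributions over the same set of interpretations.
# The distributions here are illustrative placeholders.
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i log2 q_i, in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q + eps))

# Posterior over word classes from the iconic (character) model vs. the
# linguistic (dictionary) model; a large gap flags a candidate correction.
p_iconic = [0.70, 0.20, 0.10]
p_linguistic = [0.10, 0.80, 0.10]
print(cross_entropy(p_iconic, p_linguistic))
```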
Hashing-based approximate nearest neighbor (ANN) search in huge databases has become popular owing to its computational and memory efficiency. Popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions from random or principal projections; the resulting hashes are either not very accurate or inefficient. Moreover, these methods are designed for a given metric similarity, whereas semantic similarity is usually given in terms of pairwise labels of samples. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set together with an information-theoretic regularizer over both the labeled and unlabeled sets. Based on this framework, we present three semi-supervised hashing methods: orthogonal hashing, non-orthogonal hashing, and sequential hashing. In particular, the sequential hashing method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm extends to unsupervised settings where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.
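For orientation, here is the projection-based hashing baseline such methods build on, with random projections standing in for learned ones (SSH additionally learns the projection matrix from labeled pairs; everything here is illustrative):

```python
# Minimal sketch of linear projection hashing of the kind the paper builds on
# (hypothetical random weights; SSH learns W from labeled pairs instead).
import numpy as np

def hash_codes(X, W):
    """Binary codes h(x) = sign(Wx), packed as a 0/1 matrix."""
    return (X @ W.T > 0).astype(np.uint8)

def hamming_search(query_code, db_codes, k=5):
    """Indices of the k database codes nearest in Hamming distance."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))          # database features
W = rng.normal(size=(16, 64))            # random projections (LSH baseline)
codes = hash_codes(X, W)
print(hamming_search(hash_codes(X[:1], W)[0], codes))
```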
We consider the family of total Bregman divergences (tBDs) as an efficient and robust "distance" measure for quantifying the dissimilarity between shapes. The tBD-based L1-norm center, called the t-center, is used as the representative of a set of shapes. We then prove that for any tBD there exists a distribution belonging to the lifted exponential family of distributions, and we show that finding the MAP estimate of the parameters of this family is equivalent to minimizing the tBD to find the t-centers. This leads to a new clustering technique, the total Bregman soft clustering algorithm. We evaluate the tBD, the t-center, and the soft clustering algorithm on shape retrieval applications. Our shape retrieval framework is composed of three steps: (1) extraction of the shape boundary points, (2) affine alignment of the shapes and representation of the aligned boundaries with a Gaussian mixture model (GMM), and (3) comparison of the GMMs using the tBD to find the best matches for a given query shape. To further speed up shape retrieval, we cluster the shapes hierarchically using the tBD soft clustering algorithm. We evaluate our method on various public-domain 2D and 3D databases and demonstrate comparable or better results than state-of-the-art retrieval techniques.
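For reference, the total Bregman divergence for a convex generator f takes the following form (transcribed from the tBD literature, so treat it as a reference rather than the paper's exact notation): the numerator is the ordinary Bregman divergence, and the gradient-dependent denominator, which measures the divergence orthogonally to the tangent rather than vertically, is what yields the robustness claimed above.

```latex
% Total Bregman divergence for a convex generator f (standard form from the
% tBD literature, transcribed here for reference).
\delta_f(x, y) \;=\;
  \frac{f(x) - f(y) - \langle x - y,\, \nabla f(y) \rangle}
       {\sqrt{1 + \|\nabla f(y)\|^{2}}}
```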
This article proposes a novel similarity measure between vector sequences. We work in the framework of model-based approaches, in which each sequence is first mapped to a Hidden Markov Model (HMM) and a probabilistic measure of similarity is then computed between the HMMs. We propose to model sequences with semi-continuous HMMs (SC-HMMs), a particular type of HMM whose emission probabilities in each state are mixtures of shared Gaussians. This crucial constraint provides two major benefits. First, the a priori information contained in the common set of Gaussians leads to a more accurate estimate of the HMM parameters. Second, the computation of a probabilistic similarity between two SC-HMMs can be simplified to a Dynamic Time Warping (DTW) between their mixture weight vectors, which significantly reduces the computational cost. Experiments are carried out on a handwritten word retrieval task over three datasets: an in-house dataset of real handwritten letters, the George Washington dataset, and the IFN/ENIT dataset of Arabic handwritten words. These experiments show that the proposed similarity outperforms both the traditional DTW between the original sequences and a model-based approach using ordinary continuous HMMs. We also show that this increase in accuracy can be traded against a significant reduction in computational cost.
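The simplified similarity reduces to a standard DTW over per-state mixture-weight vectors; a minimal sketch, with synthetic weight vectors standing in for trained SC-HMM parameters:

```python
# Classic dynamic time warping, sketched to show the simplified similarity the
# paper exploits: with shared Gaussians, comparing two SC-HMMs reduces to a
# DTW over their per-state mixture-weight vectors (values here are synthetic).
import numpy as np

def dtw(seq_a, seq_b):
    """DTW alignment cost between two sequences of vectors (Euclidean cost)."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two HMMs with 4 and 5 states over a shared pool of 8 Gaussians: each state
# is summarized by its mixture-weight vector.
rng = np.random.default_rng(1)
weights_a = rng.dirichlet(np.ones(8), size=4)
weights_b = rng.dirichlet(np.ones(8), size=5)
print(dtw(weights_a, weights_b))
```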
Categorizing videos of dynamic textures (DTs), i.e., nonrigid dynamical objects such as fire and water, is an extremely challenging problem because of their continuous change in shape and appearance. State-of-the-art DT categorization methods have been successful at classifying videos taken from the same viewpoint and scale by using a linear dynamical system (LDS) to model each video and metrics between LDSs to classify them. However, these methods perform poorly when the videos are taken from different viewpoints or scales. In this paper, we propose a novel DT categorization framework that can handle these changes by modeling DTs with a collection of LDSs, each describing a small spatiotemporal patch extracted from the video. This Bag-of-Systems (BoS) representation is analogous to the Bag-of-Features (BoF) representation for object recognition, except that we use LDSs as feature descriptors. The space of LDSs, however, is not Euclidean, so methods for computing codewords of LDSs need to be developed. Our framework uses nonlinear dimensionality reduction, clustering techniques, and distances for LDSs to tackle this issue. Our experiments show that our approach can categorize DTs in challenging scenarios that existing methods cannot handle.
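Structurally, a BoS video signature is a codeword histogram; the sketch below illustrates that final step only, with a placeholder Euclidean distance on vectorized descriptors where the paper uses proper distances on the space of LDSs:

```python
# Bag-of-Systems term-frequency representation, sketched: each spatiotemporal
# patch is assigned to its nearest codeword (placeholder Euclidean distance on
# vectorized descriptors; the paper uses distances on the LDS space), and the
# video becomes a normalized codeword histogram.
import numpy as np

def bos_histogram(patch_descriptors, codewords):
    """Histogram of nearest-codeword assignments, normalized to sum to 1."""
    d = np.linalg.norm(patch_descriptors[:, None, :] - codewords[None], axis=2)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codewords)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(5)
codewords = rng.normal(size=(16, 10))      # K=16 codewords in descriptor space
patches = rng.normal(size=(200, 10))       # 200 patch descriptors from a video
print(bos_histogram(patches, codewords))
```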
Despite the wide range of feature detectors developed in the computer vision community over the years, directly applying these techniques to surgical navigation has proven difficult because of the paucity of reliable salient features, coupled with free-form tissue deformation and the changing visual appearance of surgical scenes. The aim of this paper is to propose a novel probabilistic framework for tracking affine-invariant anisotropic regions under markedly different visual appearances during Minimally Invasive Surgery (MIS). The theoretical background of the affine-invariant anisotropic feature detector is presented, and a real-time implementation exploiting the computational power of the GPU is proposed. An Extended Kalman Filter (EKF) parameterisation scheme adaptively adjusts the optimal templates of the detected regions, enabling accurate identification and matching of the tracked features. For effective tracking verification, spatial context and region similarity are also incorporated; they boost the EKF prediction and recover from tracking failures caused by drift or false positives. The proposed framework is compared to existing methods, and their respective performance is evaluated on in vivo video sequences recorded from robot-assisted MIS procedures, as well as on real-world scenes.
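For readers unfamiliar with the filtering machinery referenced above, here is the generic Kalman predict/update step in its linear form; the paper's state (region template parameters) and models are more elaborate, so treat this as a schematic:

```python
# Generic Kalman filter predict/update step, included only to make the EKF
# machinery concrete; in the EKF, F and H are Jacobians of nonlinear models.
import numpy as np

def kf_predict(x, P, F, Q):
    """Propagate state estimate x and covariance P through dynamics F."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    """Correct the prediction with measurement z under observation model H."""
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```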
PrePrint: Consensus Clustering Based on a New Probabilistic Rand Index with Application to Subtopic Retrieval
We introduce a probabilistic version of the well-known Rand Index for measuring the similarity between two partitions, called the Probabilistic Rand Index (PRI), in which agreements and disagreements at the object-pair level are weighted according to the probability of their occurring by chance. We then cast consensus clustering as the optimization of the PRI between a target partition and a set of given partitions, and experiment with a simple and very efficient stochastic optimization algorithm. Remarkable performance gains over the input partitions, as well as over existing related methods, are demonstrated across a range of applications, including a new use of consensus clustering to improve subtopic retrieval.
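For reference, the plain (unweighted) Rand Index that the PRI generalizes counts pair-level agreements between two partitions; this minimal version omits the chance-probability weighting that defines the PRI:

```python
# Plain (unweighted) Rand index between two partitions, sketched for reference;
# the PRI above additionally weights each object pair by the probability of
# chance agreement, which this minimal version omits.
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which the two partitions agree."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8
```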
Hough transform based methods for detecting multiple objects use non-maximum suppression or mode-seeking to locate and distinguish peaks in Hough images. Such postprocessing requires tuning many parameters and is often fragile, especially when objects are located spatially close to one another. In this paper, we develop a new probabilistic framework for object detection related to the Hough transform. It shares the simplicity and wide applicability of the Hough transform while bypassing the problem of multiple peak identification in Hough images, permitting detection of multiple objects without invoking non-maximum suppression heuristics. Our experiments demonstrate that this method yields a significant improvement in detection accuracy both for the classical task of straight line detection and for a more modern category-level (pedestrian) detection problem.
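The classical baseline being replaced looks like this: votes accumulate in (rho, theta) space and peaks must then be found by the very non-maximum suppression the paper avoids (a textbook sketch, not the paper's method):

```python
# Classical Hough transform for straight lines, sketched as the baseline the
# probabilistic framework above replaces.
import numpy as np

def hough_lines(edge_points, shape, n_theta=180):
    """Accumulate line votes in (rho, theta) space from edge pixel coords."""
    diag = int(np.ceil(np.hypot(*shape)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta), dtype=np.int32)
    for y, x in edge_points:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    return acc, thetas

# A perfect diagonal line y = x produces one dominant peak in the accumulator.
pts = [(i, i) for i in range(50)]
acc, thetas = hough_lines(pts, (50, 50))
rho_idx, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
print(acc.max(), np.degrees(thetas[theta_idx]))      # 50 votes at 135 degrees
```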
PrePrint: Estimating Information from Image Colors: An Application to Digital Cameras and Natural Scenes
The colors present in an image of a scene provide information about its constituent elements, but the amount of information depends on the imaging conditions and on how the information is calculated. This work had two aims. The first was to derive explicit estimators of the information available, and the information retrieved, from the color values at each point in images of a scene under different illuminations. The second was to apply these estimators to simulations of images obtained with five sets of sensors used in digital cameras and with the cone photoreceptors of the human eye. Estimates were obtained for 50 hyperspectral images of natural scenes under daylight illuminants with correlated color temperatures of 4000 K, 6500 K, and 25000 K. Depending on the sensor set, the mean estimated information available across images with the largest illumination difference varied from 15.5 to 18.0 bits, and the mean estimated information retrieved after optimal linear processing varied from 13.2 to 15.5 bits (each about 85% of the corresponding information available). With the best sensor set, 390% more points could be identified per scene than with the worst. Capturing scene information from image colors thus depends crucially on the choice of camera sensors.
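As a rough illustration of what "information in bits" means here, a generic Gaussian estimator of the mutual information between color responses of the same points under two illuminants is sketched below; this assumes joint Gaussianity and is not the paper's estimator:

```python
# Hedged sketch: a Gaussian estimate of the mutual information (in bits)
# between color responses of the same scene points under two illuminants.
# A generic estimator assuming joint Gaussianity, not the paper's.
import numpy as np

def gaussian_mi_bits(X, Y):
    """I(X;Y) = 0.5 * log2( det(Sx) det(Sy) / det(Sxy) ) for Gaussians."""
    Sxy = np.cov(np.hstack([X, Y]).T)
    d = X.shape[1]
    Sx, Sy = Sxy[:d, :d], Sxy[d:, d:]
    return 0.5 * np.log2(np.linalg.det(Sx) * np.linalg.det(Sy)
                         / np.linalg.det(Sxy))

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3))            # RGB responses under illuminant 1
Y = X + 0.1 * rng.normal(size=(5000, 3))  # correlated responses under illum. 2
print(gaussian_mi_bits(X, Y))
```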
PrePrint: A Closed-form Solution to Intrinsic Image Decomposition with Retinex and Non-local Texture Constraints
We propose a method for intrinsic image decomposition based on Retinex theory and texture analysis. While most previous methods approach this problem by analyzing local gradient properties, our technique additionally identifies distant pixels with the same reflectance through texture analysis and uses these non-local reflectance constraints to significantly reduce the ambiguity of the decomposition. We formulate the decomposition as the minimization of a quadratic function that incorporates both the Retinex constraint and our non-local texture constraint. This optimization has a closed-form solution, which we compute with the standard conjugate gradient algorithm. Extensive experiments validate our method against previous techniques in terms of both decomposition accuracy and runtime efficiency.
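Concretely, minimizing such a quadratic amounts to solving a (typically sparse) symmetric positive-definite linear system; a minimal conjugate gradient solver of the kind usable for such systems is sketched below (a generic solver, not the paper's exact formulation):

```python
# Minimal conjugate gradient solver for A x = b with symmetric positive-
# definite A, the kind of system a quadratic objective reduces to.
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # SPD toy system
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))            # matches np.linalg.solve(A, b)
```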
PrePrint: Discriminative Multi-Manifold Analysis for Face Recognition from a Single Training Sample per Person
Conventional appearance-based face recognition methods usually assume that multiple samples per person (MSPP) are available during the training phase for discriminative feature extraction. In many practical face recognition applications, such as law enforcement, e-passports, and ID card identification, this assumption may not hold, as only a single sample per person (SSPP) is enrolled or recorded in these systems. Many popular face recognition methods fail in this scenario because there are not enough samples for discriminant learning. To address this problem, we propose a novel discriminative multi-manifold analysis (DMMA) method that learns discriminative features from image patches. First, we partition each enrolled face image into several non-overlapping patches to form an image set for each enrolled sample. Then, we formulate SSPP face recognition as a manifold-manifold matching problem and learn multiple DMMA feature spaces that maximize the manifold margins between different persons. Finally, we propose a reconstruction-based manifold-manifold distance to identify the unlabeled subjects. Experimental results on three widely used face databases demonstrate the efficacy of the proposed approach.
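The first step above, turning one enrolled image into an image set, is simple to picture; a sketch of that step only, with an assumed patch size:

```python
# Sketch of the first DMMA step only: partitioning an enrolled face image into
# non-overlapping patches that together form an image set for that sample
# (the patch size is an assumed parameter).
import numpy as np

def to_patch_set(image, patch=8):
    """Split an HxW image into flattened, non-overlapping patch vectors."""
    h, w = image.shape
    h, w = h - h % patch, w - w % patch          # crop to a multiple of patch
    blocks = image[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

face = np.random.default_rng(6).random((64, 64))
print(to_patch_set(face).shape)                  # (64, 64): 64 patches of 8x8
```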
Learning from multi-view data is important in many applications, such as image classification, retrieval, and annotation. Standard predictive methods, such as support vector machines, that are built with all available variables without taking into consideration the presence of distinct views sacrifice predictive performance and may also be incapable of view-level analysis. In this paper, we present a statistical method that learns a predictive subspace representation shared by multiple views when supervising side information is provided, and that performs view-level predictions. Our approach is based on a multi-view latent subspace Markov network (MN) that fulfills a weak conditional independence assumption: multi-view observations and response variables are conditionally independent given a set of latent variables. To learn the latent subspace multi-view MN, we develop a large-margin approach that jointly maximizes the data likelihood and minimizes a prediction loss on training data. The learning and inference problems are solved efficiently with a contrastive divergence method. Finally, we extensively evaluate the large-margin multi-view latent subspace MN on real TRECVID video, Flickr web image, and hotel review datasets for classification, regression, image annotation, and retrieval. Our results demonstrate that the large-margin approach achieves significant improvements in prediction performance and in discovering predictive latent subspace representations.
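Contrastive divergence is easiest to see on the simplest latent-variable Markov network; the CD-1 sketch below uses a tiny generic binary RBM standing in for the paper's multi-view latent subspace MN, so treat it purely as an illustration of the learning rule:

```python
# Contrastive divergence (CD-1) on a tiny binary RBM, a generic stand-in for
# the paper's multi-view latent subspace Markov network (biases omitted).
import numpy as np

rng = np.random.default_rng(9)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, v0, lr=0.1):
    """One CD-1 gradient step on the weight matrix W."""
    ph0 = sigmoid(v0 @ W)                        # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T)                      # one-step reconstruction
    ph1 = sigmoid(pv1 @ W)
    grad = v0.T @ ph0 - pv1.T @ ph1              # positive - negative phase
    return W + lr * grad / len(v0)

W = 0.01 * rng.normal(size=(6, 3))               # 6 visible, 3 hidden units
v = (rng.random((32, 6)) < 0.5).astype(float)    # a batch of binary data
for _ in range(100):
    W = cd1_step(W, v)
print(W.shape)
```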
We present a study of in-camera image processing through an extensive analysis of more than 10,000 images from over 30 cameras. The goal of this work is to investigate whether image values can be transformed to physically meaningful values and, if so, when and how this can be done. From our analysis, we identify a major limitation of the imaging model employed in conventional radiometric calibration methods and propose a new in-camera imaging model that fits today's cameras well. With the new model, we present associated calibration procedures that allow us to convert sRGB images back to their original CCD RAW responses significantly more accurately than existing methods. Additionally, we show how this new imaging model can be used to build an image correction application that converts an sRGB input image captured with the wrong camera settings into the sRGB output image that would have been recorded under the correct settings of a specific camera.
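One fixed, well-known piece of undoing in-camera processing is the standard sRGB decoding nonlinearity; the paper's calibrated model goes much further (camera-specific color and tone transforms back to RAW), so the sketch below covers only the textbook gamma portion:

```python
# Standard sRGB decoding curve, mapping display-referred sRGB values back
# toward linear-light values; only one small, fixed step of the larger
# sRGB-to-RAW problem the paper addresses.
import numpy as np

def srgb_to_linear(s):
    """Invert the sRGB encoding nonlinearity (s in [0, 1])."""
    s = np.asarray(s, dtype=float)
    return np.where(s <= 0.04045, s / 12.92, ((s + 0.055) / 1.055) ** 2.4)

print(srgb_to_linear([0.0, 0.5, 1.0]))   # -> [0.0, ~0.214, 1.0]
```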
We propose an automatic approximation of the intrinsic manifold for general semi-supervised learning problems. Defining an optimization function to obtain optimal hyperparameters is not trivial: pure cross-validation is usually considered, but it does not necessarily scale up, and discrete grid search incurs suboptimality and overfitting. We therefore develop an ensemble manifold regularization (EMR) framework that approximates the intrinsic manifold by combining several initial guesses. Algorithmically, EMR is designed so that it (a) learns the composite manifold and the semi-supervised learner jointly, (b) learns the intrinsic-manifold hyperparameters implicitly and fully automatically, (c) is conditionally optimal for intrinsic manifold approximation under a mild and reasonable assumption, and (d) is scalable, in both time and space, to a large number of candidate manifold hyperparameters. Furthermore, we prove that EMR converges to the deterministic matrix at a root-n rate. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of the proposed framework.
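The combination step is a convex mixture of candidate graph Laplacians; a miniature sketch with hand-fixed weights (EMR learns the weights jointly with the learner):

```python
# The EMR idea in miniature: approximate the intrinsic-manifold regularizer as
# a convex combination of candidate graph Laplacians. Weights are fixed by
# hand here; EMR learns them jointly with the semi-supervised learner.
import numpy as np

def composite_laplacian(laplacians, mu):
    """L = sum_j mu_j * L_j with mu on the simplex (mu >= 0, sum mu = 1)."""
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)
    return np.tensordot(mu, laplacians, axes=1)

def manifold_penalty(f, L):
    """Smoothness penalty f^T L f of a labeling f on the composite graph."""
    return float(f @ L @ f)

rng = np.random.default_rng(7)
W = rng.random((5, 5))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L1 = np.diag(W.sum(1)) - W                  # Laplacian of one candidate graph
L2 = np.eye(5) - np.ones((5, 5)) / 5        # Laplacian of a complete graph
L = composite_laplacian(np.stack([L1, L2]), [0.7, 0.3])
print(manifold_penalty(rng.random(5), L))
```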
We propose a novel robust estimation algorithm, the generalized projection based M-estimator (gpbM), which does not require the user to specify any scale parameters. The algorithm is general and can handle heteroscedastic data with multiple linear constraints, for single- and multi-carrier problems. The gpbM has three distinct stages: scale estimation, robust model estimation, and inlier/outlier dichotomy. In contrast, in its predecessor pbM, each model hypothesis was associated with a different scale estimate. For data containing multiple inlier structures with generally different noise covariances, the estimator iteratively determines one structure at a time. The model estimation can be further optimized using Grassmann manifold theory. We present results on several homoscedastic and heteroscedastic synthetic and real-world computer vision problems with single and multiple carriers.
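One ingredient of the scale-free claim is easy to illustrate in isolation: robust scale can be estimated from residuals rather than supplied by the user. The sketch below uses the median absolute deviation, a generic robust statistic, not the gpbM procedure itself:

```python
# MAD-based robust scale estimate from residuals, a generic illustration of
# estimating scale from the data instead of asking the user for it.
import numpy as np

def mad_scale(residuals):
    """MAD sigma estimate, consistent for Gaussians (factor 1.4826)."""
    r = np.asarray(residuals, dtype=float)
    return 1.4826 * np.median(np.abs(r - np.median(r)))

rng = np.random.default_rng(8)
r = np.concatenate([rng.normal(0, 1.0, 900), rng.uniform(-50, 50, 100)])
print(mad_scale(r))   # close to 1 despite 10% gross outliers
```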
Spectral matching is a computationally efficient approach to approximating the solution of pairwise matching problems that are NP-hard. In this work we present a probabilistic interpretation of spectral matching schemes and derive a novel probabilistic matching scheme that is shown to outperform previous approaches. We show that spectral matching can be interpreted as a maximum likelihood estimate of the assignment probabilities, and that the Graduated Assignment algorithm can be cast as a maximum a posteriori estimator. Based on this analysis, we derive a scheme for ranking spectral matchings by their reliability, and propose a novel iterative probabilistic matching algorithm that relaxes some of the implicit assumptions made in prior work. We show experimentally that our approach outperforms previous schemes on exhaustive synthetic tests as well as on real image sequences.
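For readers new to the technique being analyzed, here is spectral matching in miniature: the principal eigenvector of a pairwise affinity matrix scores candidate assignments, and a greedy pass enforces one-to-one constraints (affinities below are synthetic placeholders):

```python
# Minimal spectral matching: the principal eigenvector of a pairwise affinity
# matrix scores candidate assignments, then a greedy pass enforces one-to-one
# constraints. Affinities here are synthetic placeholders.
import numpy as np

def spectral_match(M, candidates):
    """M[a, b]: pairwise compatibility of candidate assignments a and b."""
    vals, vecs = np.linalg.eigh(M)
    v = np.abs(vecs[:, -1])                    # principal eigenvector scores
    chosen, used_i, used_j = [], set(), set()
    for a in np.argsort(-v):                   # greedy discretization
        i, j = candidates[a]
        if i not in used_i and j not in used_j:
            chosen.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return chosen

# Three model points vs. three scene points: all 9 candidate assignments.
candidates = [(i, j) for i in range(3) for j in range(3)]
rng = np.random.default_rng(3)
M = rng.random((9, 9))
M = (M + M.T) / 2                              # symmetric affinities
print(spectral_match(M, candidates))
```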
It is quite common for multiple human observers to attend to a single static interest point; this is known as a mutual awareness event (MAWE). A preferred way to monitor these situations is with a camera that captures the human observers, using existing face detection and head pose estimation algorithms. The current work studies the underlying geometric constraints of MAWEs and reformulates them in terms of image measurements. The constraints are then used in a method that (1) detects whether such an interest point exists, (2) determines where it is located, (3) identifies who was attending to it, and (4) reports where and when each observer was while attending to it. The method also applies to the event in which a single moving human observer fixates on a single static interest point. It can handle the general case of an uncalibrated camera in a general environment, in contrast to other work on similar problems that inherently assumes a known environment or a calibrated camera. The method was tested on about 75 images from various scenes, most of them found by searching the Internet, and it robustly detects MAWEs and estimates their related attributes.
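The geometric core of a MAWE can be pictured as gaze rays meeting at a point; the sketch below estimates that point by least squares from known 3-D positions and gaze directions, which is a simplification: the paper works from uncalibrated image measurements instead.

```python
# Least-squares intersection of observers' gaze rays, a simplified geometric
# picture of a mutual awareness event (positions and directions are synthetic;
# the paper derives its constraints from uncalibrated image measurements).
import numpy as np

def gaze_intersection(origins, directions):
    """Point minimizing summed squared distance to rays (a_i, d_i)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for a, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)          # projector orthogonal to d
        A += P
        b += P @ a
    return np.linalg.solve(A, b)

target = np.array([1.0, 2.0, 3.0])
origins = np.array([[0, 0, 0], [4, 0, 0], [0, 5, 1]], dtype=float)
directions = target - origins                   # all observers look at target
print(gaze_intersection(origins, directions))   # ~ [1, 2, 3]
```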