Multi-Class Object Localization by Combining Local Contextual Interactions. Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet. Computer Science and Engineering Department, Electrical and Computer Engineering Department, University of California, San Diego.
Outline • Introduction • Multi-Class Multi-Kernel Approach • Contextual Interaction • Experiment & Results • Conclusion
Introduction • Using contextual cues for object localization can greatly improve accuracy over models that use appearance features alone. • Context considers information from the area neighboring an object, such as pixel, region, and object interactions.
Introduction • In this work, we present a novel framework for object localization that efficiently and effectively combines different levels of interaction. • We develop a multiple kernel learning algorithm to integrate appearance features with pixel and region interaction data, resulting in a unified similarity metric that is optimized for nearest neighbor classification. • Object level interactions are modeled by a conditional random field (CRF) to produce the final label prediction.
Outline • Introduction • Multi-Class Multi-Kernel Approach • Contextual Interaction • Experiment & Results • Conclusion
Multi-Class Multi-Kernel Approach • Large Margin Nearest Neighbor • Multiple Kernel Extension • Spatial Smoothing by Segment Merging • Contextual Conditional Random Field
Multi-Class Multi-Kernel Approach • In our model, each training image $I$ is partitioned into segments $s_i$ by using ground truth information. • Each segment $s_i$ corresponds to exactly one object of class $c_i \in C$, where $C$ is the set of all object labels. • These segments are collected into the training set $S$. • For each segment $s_i \in S$, we extract several types of features, where the $p$th feature space is characterized by a kernel function $k_p(\cdot,\cdot)$ and inner product matrix $K^p$: $K^p_{ij} = k_p(s_i, s_j)$
Multi-Class Multi-Kernel Approach • From this collection of kernels, we learn a unified similarity metric over $S$ and a corresponding embedding function $g(\cdot)$ that maps the training set to the learned space. • To provide more representative examples for nearest neighbor prediction, we augment the training set $S$ with additional segments, obtained by running a segmentation algorithm multiple times on the training images [24]. • Because ground-truth segmentations are not available at test time, the test image must be segmented automatically.
Multi-Class Multi-Kernel Approach • Multiple Kernel Extension: several different features are extracted; segments are mapped into a unified space and a soft label prediction is computed. • Spatial Smoothing by Segment Merging. • Contextual Conditional Random Field: predicts the final labeling of each segment.
Multi-Class Multi-Kernel Approach • Large Margin Nearest Neighbor • Multiple Kernel Extension • Spatial Smoothing by Segment Merging • Contextual Conditional Random Field
Large Margin Nearest Neighbor • Our classification algorithm is based on k-nearest neighbor prediction. • We apply the Large Margin Nearest Neighbor (LMNN) algorithm to optimally distort the features for nearest neighbor prediction [35]. • Neighbors are selected by using the learned Mahalanobis distance metric $W$: $d(x_i, x_j) = (x_i - x_j)^\top W (x_i - x_j)$ • $W$ is a positive semidefinite (PSD) matrix.
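As a minimal sketch of neighbor selection under a Mahalanobis metric, the snippet below ranks training points by $(x_i - x_j)^\top W (x_i - x_j)$. The function names, the toy matrix `W`, and the data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, W):
    """Squared distance (x_i - x_j)^T W (x_i - x_j) under a PSD matrix W."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ W @ diff)

def k_nearest(query, X, W, k):
    """Indices of the k rows of X closest to `query` under the metric W."""
    dists = np.array([mahalanobis_sq(query, x, W) for x in X])
    return np.argsort(dists, kind="stable")[:k]

# Toy data (hypothetical): W stretches the first axis and shrinks the second.
W = np.diag([4.0, 0.25])
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
query = np.zeros(2)
neighbors = k_nearest(query, X, W, k=2)
```

With the identity matrix the first two points would tie; the learned-style weighting in `W` breaks the tie in favor of the point that differs only along the down-weighted axis.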
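Large Margin Nearest Neighbor • $W$ is trained so that, for each training segment, neighboring segments (in feature space) with differing labels are pushed away by a large margin. • This is achieved by solving the following semidefinite program: $\min_{W \succeq 0} \sum_i \sum_{j \in i^+} d(x_i, x_j) + \beta \sum_i \sum_{j \in i^+,\, k \in i^-} \xi_{ijk}$ subject to $d(x_i, x_k) - d(x_i, x_j) \ge 1 - \xi_{ijk}$ and $\xi_{ijk} \ge 0$, where $d$ is the Mahalanobis distance defined by $W$, $i^+$ and $i^-$ are the neighbors of segment $s_i$ with similar and dissimilar labels, $\beta$ is the slack parameter, and $\xi_{ijk}$ are slack variables.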
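The large-margin idea can be illustrated by evaluating the standard LMNN objective for a fixed $W$: a pull term over same-label neighbors plus a hinge penalty whenever a differently labeled neighbor comes within unit margin. This sketch assumes the usual LMNN formulation; the names `targets`, `impostors`, and `beta` are hypothetical, and it only evaluates the objective rather than solving the semidefinite program.

```python
import numpy as np

def lmnn_objective(X, W, targets, impostors, beta=1.0):
    """Evaluate pull + beta * push for a fixed metric W.

    targets:   (i, j) pairs where j is a same-label neighbor of i.
    impostors: (i, j, k) triples where k has a different label than i.
    """
    def d(a, b):
        diff = X[a] - X[b]
        return float(diff @ W @ diff)

    pull = sum(d(i, j) for i, j in targets)
    # Hinge slack: zero once k is at least one unit farther from i than j is.
    push = sum(max(0.0, 1.0 + d(i, j) - d(i, k)) for i, j, k in impostors)
    return pull + beta * push

# Toy data (hypothetical): point 1 shares a label with point 0, point 2 does not.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]])
obj_identity = lmnn_objective(X, np.eye(2), [(0, 1)], [(0, 1, 2)])
obj_shrunk = lmnn_objective(X, 0.25 * np.eye(2), [(0, 1)], [(0, 1, 2)])
```

Under the identity metric the margin is satisfied and only the pull term contributes; shrinking the metric reduces the pull term but violates the margin, so the hinge term becomes active.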
Large Margin Nearest Neighbor • A linear projection matrix $L$ can be recovered from $W$ by its spectral decomposition, so that $W = L^\top L$: $W = V \Lambda V^\top$ and $L = \Lambda^{1/2} V^\top$, where $V$ contains the eigenvectors of $W$ and $\Lambda$ is a diagonal matrix containing the eigenvalues.
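The recovery of $L$ from $W$ can be sketched with NumPy's symmetric eigendecomposition. The function name and the toy matrix are assumptions for illustration; the check at the end confirms that the distance under $W$ equals the squared Euclidean distance after projecting with $L$.

```python
import numpy as np

def projection_from_metric(W):
    """Return L with L.T @ L == W, using W = V diag(lam) V.T."""
    lam, V = np.linalg.eigh(W)          # eigh handles symmetric matrices
    lam = np.clip(lam, 0.0, None)       # clear tiny negative round-off
    return np.sqrt(lam)[:, None] * V.T  # L = diag(sqrt(lam)) @ V.T

# Toy PSD matrix (hypothetical).
W = np.array([[2.0, 1.0], [1.0, 2.0]])
L = projection_from_metric(W)

x, y = np.array([1.0, 0.0]), np.array([0.0, 3.0])
d_metric = (x - y) @ W @ (x - y)
d_embedded = np.sum((L @ x - L @ y) ** 2)
```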
Large Margin Nearest Neighbor • Although the learned projection is linear, the algorithm can be kernelized [28] to effectively learn non-linear feature transformations. • After kernelizing the algorithm, each segment $s_i$ is represented by its corresponding column $K_i$ in the kernel matrix, and a regularization term $\mathrm{tr}(WK)$ is introduced. • The embedding function then takes the form: $g(s) = L K_s$, where $K_s$ is the vector of kernel values between $s$ and the training segments.
Multi-Class Multi-Kernel Approach • Large Margin Nearest Neighbor • Multiple Kernel Extension • Spatial Smoothing by Segment Merging • Contextual Conditional Random Field
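Multiple Kernel Extension • To effectively integrate different types of feature descriptions, we learn a linear projection from each kernel’s feature space. • Define the combined distance between two points by summing the distance in each (transformed) space. For $m$ kernels, this is expressed algebraically as: $d(s_i, s_j) = \sum_{p=1}^{m} (K^p_i - K^p_j)^\top W^p (K^p_i - K^p_j)$ • The regularization term $\mathrm{tr}(WK)$ is similarly extended to the sum $\sum_{p=1}^{m} \mathrm{tr}(W^p K^p)$ • The multiple-kernel embedding function then takes the form $g(s) = [L^p K^p_s]_{p=1}^{m}$, the concatenation of the per-kernel projections.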
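A small sketch of the summed distance, assuming each kernel $p$ has its own PSD matrix $W^p = (L^p)^\top L^p$ acting on kernel-matrix columns. The linear kernels, random projections, and function names are illustrative choices only; the final check shows that the summed distance matches the squared Euclidean distance between concatenated per-kernel embeddings.

```python
import numpy as np

def combined_distance(K_list, W_list, i, j):
    """Sum over kernels p of (K^p_i - K^p_j)^T W^p (K^p_i - K^p_j)."""
    total = 0.0
    for K, W in zip(K_list, W_list):
        diff = K[:, i] - K[:, j]
        total += float(diff @ W @ diff)
    return total

def embed(K_cols, L_list):
    """Concatenate the per-kernel projections L^p K^p_s into one vector."""
    return np.concatenate([L @ k for L, k in zip(L_list, K_cols)])

# Toy setup (hypothetical): two feature types for five segments.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
K_list = [X1 @ X1.T, X2 @ X2.T]            # linear kernels for simplicity
L_list = [rng.normal(size=(2, 5)), rng.normal(size=(2, 5))]
W_list = [L.T @ L for L in L_list]         # PSD by construction

d = combined_distance(K_list, W_list, 0, 1)
g0 = embed([K[:, 0] for K in K_list], L_list)
g1 = embed([K[:, 1] for K in K_list], L_list)
```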
Multiple Kernel Extension • Multiple Kernel LMNN (MKLMNN) algorithm:
Multiple Kernel Extension • The probability distribution over the labels for a test segment $s'$ is computed by using its $k$ nearest neighbors in the training set, weighted according to their distance from $g(s')$; each neighbor $s_i$ contributes to its own label $c_i$. • To simplify the process, we restrict each $W^p$ to be diagonal, which can be interpreted as learning weightings over $S$ in each feature space.
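The soft label prediction can be sketched as a distance-weighted vote among the $k$ nearest embedded training segments. The slides only say the neighbors are weighted by distance, so the `exp(-distance)` weighting used here is an assumption, as are the function name and the toy data.

```python
import numpy as np

def soft_labels(g_query, G_train, labels, n_classes, k):
    """Distance-weighted label distribution from the k nearest neighbors."""
    d = np.sum((G_train - g_query) ** 2, axis=1)
    nn = np.argsort(d, kind="stable")[:k]
    weights = np.exp(-d[nn])               # assumed weighting, not from the slides
    probs = np.zeros(n_classes)
    np.add.at(probs, labels[nn], weights)  # accumulate votes per class
    return probs / probs.sum()

# Toy embedded training set (hypothetical) with three classes.
G_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
labels = np.array([0, 1, 1, 2])
probs = soft_labels(np.zeros(2), G_train, labels, n_classes=3, k=3)
```

Classes with no representative among the $k$ neighbors receive zero probability, which is one reason a later smoothing step can help.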
Multi-Class Multi-Kernel Approach • Large Margin Nearest Neighbor • Multiple Kernel Extension • Spatial Smoothing by Segment Merging • Contextual Conditional Random Field
Spatial Smoothing by Segment Merging • Because objects may be represented by multiple segments at test time, some of those segments will contain only partial information from the object, resulting in less reliable label predictions. • We smooth a segment’s label distribution by incorporating information from segments that are likely to come from the same object, resulting in an updated label distribution.
Spatial Smoothing by Segment Merging • Using the extra segments, we train an SVM classifier to predict when two segments belong to the same object. • By using the ground truth object annotation, we know when a pair of training segments came from the same object. • Given two segments $s_i$ and $s_j$, we compute: • pixel and region interaction features. • overlap between segment masks. • normalized segment centroids. • number of segments obtained in the segmentation. • Euclidean distance between the two segment centroids.
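A few of the geometric pair features listed above can be sketched from binary segment masks. This is an illustration under assumptions: overlap is measured as intersection over union, centroids are normalized by image width and height, and the distance is taken between the normalized centroids. The interaction features and segment count are omitted, and both masks are assumed non-empty.

```python
import numpy as np

def pair_features(mask_a, mask_b):
    """Overlap, normalized centroids, and centroid distance for two masks."""
    h, w = mask_a.shape

    def centroid(mask):
        ys, xs = np.nonzero(mask)
        return np.array([xs.mean() / w, ys.mean() / h])

    ca, cb = centroid(mask_a), centroid(mask_b)
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    overlap = inter / union                # assumed overlap measure (IoU)
    return np.concatenate([[overlap], ca, cb, [np.linalg.norm(ca - cb)]])

# Toy 4x4 masks (hypothetical): two vertical strips sharing one column.
mask_a = np.zeros((4, 4), dtype=bool)
mask_b = np.zeros((4, 4), dtype=bool)
mask_a[:, 0:2] = True
mask_b[:, 1:3] = True
features = pair_features(mask_a, mask_b)
```

A feature vector like this, computed for labeled training pairs, is the kind of input a binary classifier such as an SVM could be trained on.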
Spatial Smoothing by Segment Merging • We construct an undirected graph where each vertex is a segment, and edges are added between pairs that the classifier predicts should be merged; each group of connected segments is merged into a new object segment $o$. • The smoothed label distribution is the geometric mean of the segment distribution and its corresponding object’s distribution: $\hat{P}(s' = c) \propto \sqrt{P(s' = c)\, P(o = c)}$
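The merging and smoothing steps can be sketched as connected components over the predicted merge pairs, followed by a per-class geometric mean. Renormalizing the result to sum to one is an assumption, and the object's distribution is passed in directly here rather than predicted from the merged segment's features. All names are hypothetical.

```python
import numpy as np

def merge_components(n_segments, merge_pairs):
    """Label connected components of the merge graph with union-find."""
    parent = list(range(n_segments))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in merge_pairs:
        parent[find(a)] = find(b)
    return [find(a) for a in range(n_segments)]

def smooth(p_segment, p_object):
    """Geometric mean of two label distributions, renormalized (assumed)."""
    q = np.sqrt(np.asarray(p_segment) * np.asarray(p_object))
    return q / q.sum()

# Toy example: segments 0, 1, 2 are predicted to belong together; 3 stands alone.
components = merge_components(4, [(0, 1), (1, 2)])
smoothed = smooth([0.5, 0.5], [0.9, 0.1])
```

In the toy case, an uninformative segment distribution is pulled toward the more confident object-level distribution.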
Multi-Class Multi-Kernel Approach • Large Margin Nearest Neighbor • Multiple Kernel Extension • Spatial Smoothing by Segment Merging • Contextual Conditional Random Field
Contextual Conditional Random Field • Pixel and region interactions can be described by low-level features, but object interactions require a high-level description, e.g., the object's label. • We follow the soft label prediction with a conditional random field (CRF) that encodes high-level object interactions.
Contextual Conditional Random Field • We learn potential functions from object co-occurrences, capturing long-distance dependencies between whole regions of the image and across classes. • Our CRF model treats the image as a bag of segments $S_I$, with $c$ representing the vector of labels for the segments in $S_I$. • The final label vector is the value of $c$ that maximizes the CRF probability.
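To illustrate how co-occurrence potentials can change a labeling, the sketch below scores each joint labeling as the sum of log soft-label probabilities plus a pairwise potential `phi` for every pair of segments, and picks the best by brute force. This scoring form is an assumption made for illustration, not the model from the slides, and exhaustive search is only feasible for a handful of segments.

```python
import itertools
import numpy as np

def map_labels(P_hat, phi):
    """Brute-force the labeling maximizing unary plus pairwise scores."""
    n_seg, n_cls = P_hat.shape
    log_p = np.log(P_hat + 1e-12)
    best, best_score = None, -np.inf
    for c in itertools.product(range(n_cls), repeat=n_seg):
        score = sum(log_p[i, c[i]] for i in range(n_seg))
        score += sum(phi[c[i], c[j]]
                     for i in range(n_seg) for j in range(i + 1, n_seg))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy classes (hypothetical): 0 = cow, 1 = grass, 2 = boat.
P_hat = np.array([[0.45, 0.05, 0.50],   # segment 0: boat slightly ahead of cow
                  [0.05, 0.90, 0.05]])  # segment 1: clearly grass
phi = np.array([[0.0, 1.0, -1.0],       # cow and grass co-occur often
                [1.0, 0.0, -1.0],       # boat rarely appears with either
                [-1.0, -1.0, 0.0]])
best = map_labels(P_hat, phi)
independent = tuple(int(v) for v in np.argmax(P_hat, axis=1))
```

Appearance alone labels the first segment as a boat; adding the pairwise term flips it to cow because cow and grass are compatible.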
Outline • Introduction • Multi-Class Multi-Kernel Approach • Contextual Interaction • Experiment & Results • Conclusion
Contextual Interactions • In this part, we describe the features we use to characterize each level of contextual interaction. • Pixel level interaction. • Region level interaction. • Object level interaction.
Pixel Level Interaction • Pixel level interactions can implicitly capture background contextual information as well as information about object boundaries. • We use a new type of contextual source, boundary support.
Pixel Level Interaction • Boundary support is encoded by computing a histogram over the LAB color values of pixels up to a fixed number of pixels away from the object’s boundary. • Compute the $\chi^2$-distance between boundary support histograms $H$. • The pixel interaction kernel is defined from these distances.
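A common way to turn histogram distances into a kernel is the exponentiated $\chi^2$ form sketched below. The factor of 1/2 in the distance, the `exp(-gamma * distance)` kernel, and the `gamma` parameter are assumptions; the slides do not preserve the exact definition.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between two non-negative histograms."""
    h1, h2 = np.asarray(h1, dtype=float), np.asarray(h2, dtype=float)
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def chi2_kernel(H, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * chi2(H[i], H[j])) (assumed form)."""
    n = len(H)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-gamma * chi2_distance(H[i], H[j]))
    return K

# Toy normalized histograms (hypothetical).
H = np.array([[0.5, 0.5, 0.0],
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])
K = chi2_kernel(H)
```

Identical histograms give a kernel value of 1, and histograms with disjoint support give the smallest values.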
Region Level Interaction • By using large windows around an object, known as contextual neighborhoods [7], regions encode probable geometrical configurations, and capture information from neighboring (parts of) objects.
Region Level Interaction • The contextual neighborhood is computed by dilating the bounding box around the object by using a disk. • We model region interactions by computing the gist [31] of a contextual neighborhood, $G_i$. • Our region interactions are represented by a kernel computed from these gist descriptors.
Object Level Interactions • To train the object interaction CRF, we derive semantic context from the co-occurrence of objects within each training image. • These counts are stored in a co-occurrence matrix $A$. • $A(i,j)$ counts the times an object with label $c_i$ appears in a training image with an object with label $c_j$. • Diagonal entries correspond to the frequency of the object in the training set.
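Building the co-occurrence matrix can be sketched as a pass over the label sets of the training images. Counting each label at most once per image is an assumption (the slides do not say whether repeated instances count separately), and the function name and toy labels are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(image_labels, n_classes):
    """A[i, j]: number of images containing both label i and label j.

    Diagonal entries A[i, i] count the images in which label i appears.
    """
    A = np.zeros((n_classes, n_classes), dtype=int)
    for labels in image_labels:
        present = sorted(set(labels))  # assumed: one count per image
        for i in present:
            for j in present:
                A[i, j] += 1
    return A

# Toy annotations (hypothetical): object labels per training image.
image_labels = [[0, 1], [0, 1, 2], [2]]
A = cooccurrence_matrix(image_labels, n_classes=3)
```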
Outline • Introduction • Multi-Class Multi-Kernel Approach • Contextual Interaction • Experiment & Results • Conclusion
Experiments • Datasets: MSRC and PASCAL 2007 • Appearance features: • SIFT • Self-similarity (SSIM) • LAB histogram • Pyramid of Histogram of Oriented Gradients (PHOG). • Context features: • GIST • LAB color
Result • Object localization: • Mean accuracy results
Result MSRC presents more co-occurrences of object classes per image than PASCAL, providing more information to the object interaction model.
Result • Feature combination: • Learning the optimal embedding
Result • Learned kernel weights
Result • Comparison to other models: • MSRC • PASCAL 07
Outline • Introduction • Multi-Class Multi-Kernel Approach • Contextual Interaction • Experiment & Results • Conclusion
Conclusion • We have introduced a novel framework that efficiently and effectively combines different levels of local context interactions. • Our multiple kernel learning algorithm integrates appearance features with pixel and region interaction data. • We obtain significant improvement over current state-of-the-art contextual frameworks. • By adding another object interaction type, such as spatial context [8], localization accuracy could be improved further.