170 likes | 247 Views
Tell Me What You See and I will Show You Where It Is. Jia Xu 1 Alexander G. Schwing 2 Raquel Urtasun 2,3 1 U niversity of Wisconsin- M adison 2 U niversity o f Toronto 3 TTI Chicago jiaxu@cs.wisc.edu {faschwing, urtasung}@cs.toronto.edu CVPR 2014 Accepted Poster. Outline. Introduction
E N D
Tell Me What You See and I will Show You Where It Is Jia Xu1Alexander G. Schwing2 Raquel Urtasun2,3 1University of Wisconsin-Madison 2University of Toronto 3TTI Chicago jiaxu@cs.wisc.edu {faschwing, urtasung}@cs.toronto.edu CVPR 2014 Accepted Poster
Outline • Introduction • Related Work • Weakly Labeled Semantic Segmentation • Experimental Evaluation • Conclusion
Introduction • Motivation : densely labeling images is still a very expensive process, particularly since multiple annotators are typically employed to label each image. Furthermore, a quality control process is frequently required in order to sanitize the annotations. • Figure 1. Our approach takes labels in the form of which classes are present in the scene during training, and learns a segmentation model, even though no annotations at the pixel-wise are available. sky, building, tree
Introduction • The latent aspect model, which employs probabilistic latent semantic analysis(PLSA) to model each image as a finite mixture of latent classes also referred to as aspects. • Markov random field (MRF) e.g., leverage information between multiple images. • Conditional random field (CRF) e.g., Target detection, Identification, • SIFT-flow dataset e.g., object recognition, 3D modeling, gesture recognition. Advantage in this paper : • General setting • Improvements of 7% in terms of mean class accuracy over the state-of-the-art.
Weakly Labeled Semantic Segmentation - Semantic segmentation from tags • Figure 2. Graphical Model: (Left) Graphical model for learning as well as inference when the tags are provided at test time. (Right) Graphical model for inference when the tags are not provided at test time. • = exp • h = (h1,…, hN) to be the set of segmentation variables for all superpixels within one image. • y = (y1 ,…, yC) the set of binary variables indicating for all classes their presence/absence. sky, building, tree
Semantic segmentation from tags • = + + • Appearance model : We utilize the superpixelfeatures which include texture/SIFT, color, shape, location and GIST. • Presence/Absence : In the training, this potential is built from the ground truth, i.e., pres(yi,x) = [1;-1] if class i is absent, while pres(yi,x) = [-1; 1] if class iis present. • Compatibility : () = is a big number (105 in our experiments).
Weakly Labeled Semantic Segmentation- Learning in the weakly labeled setting • The training set D consists of |D| image-tag pairs (y; x), i.e., D = (y, x)i. • During learning, a loss function lis commonly included to bias the algorithm, i.e., • Loss function : The distribution of class presence as well as the distribution of pixel-wise labelings follows a power law distribution(i.e., many classes occur very rarely).l • li= is the ground truth label, and iis the prediction for the i-th class.
Learning in the weakly labeled setting • Weight vector ω, which minimizes the sum of the negative (loss-augmented) marginal log-posterior of the training data D and a regularization term which originates from a prior distribution on ω. • We employ the concave-convex procedure (CCCP), which is a generalization of expectation maximization(EM) to minimize Eq. (2). • CCCP : At each iteration we linearize the concave part at the current iterate w and solve the remaining convex objective augmented by a linear term to update the weight vector w.
Learning in the weakly labeled setting • Given this expectation, we solve a fully supervised objective with modified empirical means. it is crucial to notice that in our graphical model we can trivially solve the ‘latent variable prediction problem’ given the bi-partite model of the weakly labeled segmentation task. For the second step we need to solve a fully supervised learning task.
Experimental Evaluation • We perform our experiments using the SIFT-flow segmentation dataset, which contains 2688 images and C = 33 classes.
Experimental Evaluation • First experiment : Our first experiment utilizes tags only at training. We utilize an image-tag classifier which leverages deep learning in order to construct at test time. • Comparison to the state-of-the-art : We observe that our approach performs well for classes which have very distinctive and consistent appearance, e.g., sand, sun, staircases. We missed a few classes, e.g., bus, crosswalk, bird, due to their largely varying appearance and small training set size.
Experimental Evaluation • Quality of image-tag prediction : • Our CNN-Tag predictor predicts tags with an accuracy of 93.7%. • Tag prediction errors do not correlate well with segmentation errors, e.g., crosswalk and bird tags are predicted with very high accuracy, but segmentation accuracy is very low for both classes.
Experimental Evaluation • Qualitative results :
Experimental Evaluation • Tags given at both training and testing : In our second setting, tags are given both at training and testing.
Experimental Evaluation • Partial tags given at test time : We conduct our experiments using four different sample ratios {0.1, 0.2, 0.3, 0.5}.
Conclusion • We have presented an approach to semantic segmentation which is able to exploit weak labels in the form of image tags when no pixel-wise labeling are available. • Unlike existing approaches, this allowed us to leverage standard algorithms with good theoretical guarantees. • We have demonstrated the effectiveness of our approach and showed improvements of 7% over the state-of-the-art in this task.