ICCV & CVPR paper reading 池 晨@jdl.ac.cn 2009.11.27
CVPR09, #2128, Recognizing Indoor Scenes
Recognizing Indoor Scenes Ariadna Quattoni & Antonio Torralba • A. Quattoni, X. Carreras, M. Collins, T. Darrell, An Efficient Projection for L1,Infinity Regularization, ICML 2009. • A. Quattoni, A. Torralba, Recognizing Indoor Scenes, CVPR 2009. • A. Quattoni, M. Collins, T. Darrell, Transfer Learning for Image Classification with Sparse Prototype Representations, CVPR 2008. • A. Quattoni, M. Collins, T. Darrell, Learning Visual Representations using Images with Captions, CVPR 2007. • A. Quattoni, S. Wang, L.P. Morency, M. Collins, and T. Darrell, Hidden-state Conditional Random Fields, IEEE PAMI, 2007. Ariadna Quattoni, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Recognizing Indoor Scenes Ariadna Quattoni & Antonio Torralba • L.P. Morency, A. Quattoni, T. Darrell, Latent-Dynamic Discriminative Models for Continuous Gesture Recognition, CVPR 2007. • S. Wang, A. Quattoni, L.P. Morency, D. Demirdjian, T. Darrell, Hidden Conditional Random Fields for Gesture Recognition, CVPR 2006. • A. Quattoni, M. Collins, T. Darrell, Incorporating Semantic Constraints into a Discriminative Categorization and Labeling Model, Workshop on Semantic Knowledge in Vision, ICCV 2005. • A. Quattoni, M. Collins, and T. Darrell, Conditional Random Fields for Object Recognition, Proceedings of NIPS, 2004. Ariadna Quattoni, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Recognizing Indoor Scenes Ariadna Quattoni & Antonio Torralba • Research Interests • Computer vision • Machine learning • Human visual perception • Scene and object recognition. Antonio Torralba, Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Recognizing Indoor Scenes Ariadna Quattoni & Antonio Torralba • LabelMe: online image annotation and applications. A. Torralba, B. C. Russell, and J. Yuen, MIT CSAIL Technical Report, 2009. • How many pixels make an image? A. Torralba, Visual Neuroscience, volume 26, issue 01, pp. 123-131, 2009. • Small codes and large databases for recognition. A. Torralba, R. Fergus, Y. Weiss, CVPR, 2008. • 80 million tiny images: a large dataset for non-parametric object and scene recognition. A. Torralba, R. Fergus, W. T. Freeman, IEEE Transactions on PAMI, vol. 30(11), pp. 1958-1970, 2008. • Sharing visual features for multiclass and multiview object detection. A. Torralba, K. P. Murphy, and W. T. Freeman, PAMI, 2007. Antonio Torralba, Associate Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
? Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain. Fig. 1. Comparison of spatial SIFT and Gist features for a scene recognition task. Both sets of features show a strong correlation in performance across the 15 scene categories. Average performance for the different features: Gist 73.0%, pyramid matching 73.4%, bag of words 64.1%, and color pixels (SSD) 30.6%. In all cases an SVM is used.
Abstract • Indoor scene recognition is a challenging open problem. • Should scenes be characterized by global spatial properties or by the objects they contain? • A prototype-based model that can successfully combine both sources of information. • A dataset of 67 indoor scene categories. • Good results.
Prototype Image. What is a prototype image?
A Prototype Based Model. For each scene category, a set of prototype images is selected; each prototype image T_p combines global spatial properties with the objects it contains, represented as a set of regions of interest (ROI 1 ... ROI m_k).
Image Descriptor. Each prototype image T is described by its global spatial properties and by its contained objects (ROI 1 ... ROI m_k). How to represent global spatial properties? Using the Gist descriptor. How to represent each ROI? Using a spatial pyramid of visual words.
Gist (1/2). The original image is decomposed by a bank of multiscale oriented filters (several scales and orientations). The magnitude of each filter output is taken and the local average response is computed over 4x4 windows; PCA is then applied to the sampled filter outputs to obtain the Gist feature.
Gist (2/2). The Gist feature coarsely encodes the edge and texture information of the original image. Top row: original images. Bottom row: noise images coerced to have the same global features (N = 64) as the target image.
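The Gist computation described above can be sketched as follows, assuming a grayscale input image and using skimage's Gabor filters as the multiscale oriented filter bank; the scales, orientations, and 4x4 averaging grid are illustrative choices, not the authors' exact implementation.

# A minimal sketch of a Gist-like descriptor (assumptions: grayscale input,
# skimage Gabor filters as the oriented filter bank; parameters are illustrative).
import numpy as np
from skimage.filters import gabor
from skimage.transform import resize

def gist_like_descriptor(image, frequencies=(0.1, 0.2, 0.3, 0.4),
                         n_orientations=8, grid=4):
    # Work on a fixed-size grayscale image.
    image = resize(image, (128, 128), anti_aliasing=True)
    cell = image.shape[0] // grid
    features = []
    for freq in frequencies:                  # filter scales
        for k in range(n_orientations):       # filter orientations
            theta = k * np.pi / n_orientations
            real, imag = gabor(image, frequency=freq, theta=theta)
            magnitude = np.hypot(real, imag)  # magnitude of the filter output
            # Local average response over a grid x grid set of windows.
            for i in range(grid):
                for j in range(grid):
                    block = magnitude[i * cell:(i + 1) * cell,
                                      j * cell:(j + 1) * cell]
                    features.append(block.mean())
    # PCA (e.g. sklearn.decomposition.PCA) can then reduce the dimensionality.
    return np.asarray(features)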
Image Descriptor (continued). How to represent each ROI? Using a spatial pyramid of visual words.
ROI Descriptor. The visual words are obtained by vector-quantizing SIFT descriptors: k-means is applied to descriptors from a random subset of images. Each ROI is then represented by a spatial pyramid of visual words. In the figure, the color of each pixel indicates the visual word to which it was assigned.
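A minimal sketch of building such a spatial pyramid of visual words for one region, assuming SIFT descriptors and their (x, y) locations have already been extracted (e.g. with OpenCV); the vocabulary size and pyramid depth are illustrative choices.

# A minimal sketch: k-means vocabulary plus a spatial pyramid histogram.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(sift_descriptors, n_words=200):
    # Vector-quantize SIFT descriptors from a random subset of images with k-means.
    return KMeans(n_clusters=n_words, n_init=10).fit(sift_descriptors)

def spatial_pyramid_histogram(descriptors, locations, region_size, vocabulary, levels=2):
    # Assign each descriptor to its visual word, then accumulate histograms
    # over increasingly fine spatial grids (1x1, 2x2, 4x4).
    words = vocabulary.predict(descriptors)
    n_words = vocabulary.n_clusters
    width, height = region_size
    histograms = []
    for level in range(levels + 1):
        cells = 2 ** level
        hist = np.zeros((cells, cells, n_words))
        for (x, y), word in zip(locations, words):
            cx = min(int(x / width * cells), cells - 1)
            cy = min(int(y / height * cells), cells - 1)
            hist[cy, cx, word] += 1
        histograms.append(hist.ravel())
    return np.concatenate(histograms)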
Model Formulation • Given: a training set D of n pairs of labeled images, and a set S of p segmented images which we call prototypes. • Goal: use D and S to learn a mapping h : X → R.
Model Formulation: contained object information. The mapping should capture the fact that images containing similar objects should have similar scene labels, and that some objects are more important than others in defining a scene's identity. Distances between two regions are computed using histogram intersection, where t_kj denotes the j-th ROI of the k-th prototype image and x_s denotes the segment of image x most similar to t_kj.
Searching Strategy. Given a new image, how do we find regions in it that are similar to the ROIs of a prototype image T? Candidate regions are compared with the histogram intersection function, searching within a small window around each ROI's original location in prototype T.
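A minimal sketch of this search, assuming each candidate region is summarized by a visual-word histogram; the window size, step, and the candidate_hist_fn callback are hypothetical placeholders, not part of the paper.

# A minimal sketch: histogram intersection plus a local window search.
import numpy as np

def histogram_intersection(h1, h2):
    # Similarity between two visual-word histograms: sum of element-wise minima.
    return np.minimum(h1, h2).sum()

def best_matching_region(roi_hist, roi_box, candidate_hist_fn, image_size,
                         window=40, step=10):
    # Scan candidate windows around the ROI's original location in the prototype
    # and keep the one with the highest histogram intersection.
    x0, y0, w, h = roi_box
    best_score, best_box = -np.inf, None
    for dx in range(-window, window + 1, step):
        for dy in range(-window, window + 1, step):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0 or x + w > image_size[0] or y + h > image_size[1]:
                continue
            score = histogram_intersection(roi_hist, candidate_hist_fn(x, y, w, h))
            if score > best_score:
                best_score, best_box = score, (x, y, w, h)
    return best_box, best_score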
Searching Strategy Figure 5. Example of detection of similar image patches. The top three images correspond to the query patterns. For each query region, the algorithm tries to detect the corresponding region in new images. The next three rows show the top three matches for each region; the last row shows the three worst-matching regions.
Model Formulation: global spatial information. For some scene categories, global image information can be very important. The global term is computed as the L2 norm (Euclidean distance) between the Gist representation of image x and the Gist representation of prototype k.
Model Formulation: parameters. The score of an image against a prototype combines the global spatial information term and the contained object information term, weighted by three sets of parameters: • the importance of global features when considering the k-th prototype; • how relevant the similarity to prototype k is for predicting the scene label; • the importance of a particular ROI inside a given prototype.
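A minimal sketch of how such a prototype-based score could be assembled, assuming Gist vectors and per-ROI similarities are precomputed; the names beta, w, and lam and the exact combination are illustrative, not the paper's precise parameterization.

# A minimal sketch of a prototype-based score combining the two terms.
import numpy as np

def prototype_score(x_gist, roi_similarities, prototypes):
    # prototypes[k] is a dict with:
    #   'gist': Gist vector of prototype k
    #   'beta': importance of global features for prototype k
    #   'w'   : relevance of similarity to prototype k for the scene label
    #   'lam' : per-ROI weights (importance of each ROI inside prototype k)
    # roi_similarities[k][j]: similarity between ROI j of prototype k and the
    # most similar region found in the image.
    score = 0.0
    for k, proto in enumerate(prototypes):
        global_term = -np.linalg.norm(x_gist - proto['gist'])   # L2 Gist distance
        local_term = float(np.dot(proto['lam'], roi_similarities[k]))
        score += proto['w'] * (proto['beta'] * global_term + local_term)
    return score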
Model Formulation: learning. How do we estimate the model parameters from the training set D? The objective combines regularization terms, whose constants C_b and C_l dictate the amount of regularization in the model, with a loss function measuring the error the classifier incurs on the training examples in D.
Model Formulation: learning. The parameters are estimated from the training set D with a gradient-based method, where Δ is the set of indices of examples in D that attain non-zero loss.
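A minimal sketch of a gradient-based update with a hinge loss and L2 regularization, only to illustrate the "update on examples with non-zero loss" idea; the regularization constant, learning rate, and feature map are placeholders, not the authors' optimizer.

# A minimal sketch of regularized, loss-driven gradient updates.
import numpy as np

def train(features, labels, reg=0.01, lr=0.01, epochs=100):
    # features: (n, d) array of per-image model features; labels in {-1, +1}.
    n, d = features.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        margins = labels * (features @ theta)
        delta = margins < 1                    # examples that attain non-zero loss
        grad = reg * theta                     # regularization gradient
        if delta.any():
            grad -= (labels[delta, None] * features[delta]).sum(axis=0) / n
        theta -= lr * grad
    return theta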
Model Formulation. The number in parentheses is the classification confidence.
Indoor Database Figure 2. Summary of the 67 indoor scene categories used in our study. To make the variety of scene categories easier to see, they are organized into 5 big scene groups. The database contains 15,620 images; all images have a minimum resolution of 200 pixels on the smallest axis. • Compared with the state of the art: • The largest dataset available: 67 categories, 15,620 images. • More difficult: high in-class variability.
Results (1/3). Four variations of the model are compared, crossing the ROI source (automatically segmented ROIs vs. manually annotated ROIs) with the feature set (local features only vs. both local and global features).
Results (1/3) Four variations of the model. • Both local and global information are useful for the indoor scene recognition task. • Using automatic segmentations instead of manual segmentations causes only a small drop in performance.
Results (2/3) Figure 7. The 67 indoor categories sorted by multiclass average precision (training with 80 images per class, testing on 20 images per class).
Results (3/3) How is the performance of the proposed model affected by the number of prototypes used? We observed a logarithmic growth of the average precision as a function of the number of prototypes; exploiting more prototypes might further improve the performance.
Conclusion (1/3) The model combines global spatial properties with contained object information: each prototype image T is described by its global features and by a set of regions of interest (ROI 1 ... ROI m_k).
Conclusion (2/3) The scoring function combines a global spatial information term with a contained object information term.
ICCV09, Learning to Predict Where Humans Look
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Education Background • Massachusetts Institute of Technology, Cambridge, MA • Ph.D. candidate in Computer Science (Graphics), expected graduation June 2010 • Master of Science, Computer Science, Jan 2007 • Bachelor of Science in Mathematics, June 2003 • École Polytechnique, Palaiseau, France • International Program, Computer Science Major, Sept 2003 to April 2004 • Cambridge University, Cambridge, England • Junior Year Abroad, Read Part IB Mathematics Tripos, Sept 2001 to June 2002. Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Research Interests • Computer Graphics • Computational Photography • Image Processing • Perception • Non-Photorealistic Rendering. Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Judd, T., Ehinger, K., Durand, F., Torralba, A. Learning to Predict Where Humans Look, ICCV 2009. • Judd, T., Durand, F., Adelson, T. Apparent Ridges for Line Drawing. Proceedings of ACM SIGGRAPH 2007. • Judd, Tilke. Apparent Ridges for Line Drawing. Masters Thesis, Computer Science, MIT, Jan 2007. • Ju, W., R. Hurwitz, T. Judd, B. Lee. CounterActive: An Interactive Cookbook for the Kitchen Counter. Proceedings of SIGCHI 2001, Short Papers and Abstracts, Seattle WA, April 2001, p. 269. • Ju, W., L. Bonanni, R. Fletcher, R. Hurwitz, T. Judd, J. Yoon, E.R. Post, M. Reynolds. Origami Desk. Exhibited at SIGGRAPH 2001, Los Angeles CA. SIGGRAPH Conference Abstracts and Applications, August 2001, p. 280. • Judd, Tilke. The JPEG Compression Algorithm. The MIT Undergraduate Mathematics Journal, Vol 5, p. 119. Tilke Judd, Ph.D. student, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Education Background • University of Edinburgh, Edinburgh, UK, 2007, B.Sc. Psychology • California Institute of Technology, Pasadena, CA, USA, 2003, B.S. Engineering & Applied Science. Krista Ehinger, Graduate Student, Department of Brain & Cognitive Sciences at MIT
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Education Background • Received his PhD from Grenoble University, France, in 1999. • From 1999 to 2002, he was a post-doc in the MIT Computer Graphics Group. Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Research Interests • Synthetic image generation • Computational photography. Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT
Learning to Predict Where Humans Look Tilke Judd, Krista Ehinger, Fredo Durand, Antonio Torralba • Co-organized the first Symposium on Computational Photography and Video in 2005. • Co-organized the first International Conference on Computational Photography in 2009. • Was on the advisory board of the Image and Meaning 2 conference. • Received an inaugural Eurographics Young Researcher Award in 2004. • Received an NSF CAREER award in 2005. • Received an inaugural Microsoft Research New Faculty Fellowship in 2005. • Received a Sloan fellowship in 2006. • Received a Spira award for distinguished teaching in 2007. Frédo Durand, Associate Professor, Computer Graphics Group, CSAIL, MIT
? How can we understand where humans look in a scene without an eye tracker? Figure 2. Current saliency models do not accurately predict human fixations. In row one, the low-level model selects bright spots of light as salient while viewers look at the human. In row two, the low-level model selects the building's strong edges and windows as salient while viewers fixate on the text.
Abstract • For many applications in graphics, design, and human-computer interaction, it is essential to understand where humans look in a scene. • Models of saliency can be used to predict fixation locations. • A saliency model based on both top-down and bottom-up information. • A large eye tracking database.
Database of Eye Tracking Data 15 viewers, 1003 random images; free viewing, 3 seconds per image; the gaze path of each viewer is recorded.
Database of Eye Tracking Data Collect each viewer's fixations; convolve a Gaussian filter across the fixation locations and average over all viewers to obtain a continuous saliency map; select the top n percent most salient locations to generate a binary map.
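A minimal sketch of these steps, assuming fixation coordinates have already been collected for each viewer; the Gaussian width and the top-n-percent threshold are illustrative choices.

# A minimal sketch: continuous and binary saliency maps from fixations.
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_maps(fixations_per_viewer, image_shape, sigma=25, top_percent=20):
    # fixations_per_viewer: list of (x, y) fixation arrays, one per viewer.
    accum = np.zeros(image_shape, dtype=float)
    for fixations in fixations_per_viewer:
        hits = np.zeros(image_shape, dtype=float)
        for x, y in fixations:
            hits[int(y), int(x)] += 1.0           # mark each fixation location
        accum += gaussian_filter(hits, sigma)      # convolve a Gaussian over them
    continuous = accum / len(fixations_per_viewer) # average across viewers
    threshold = np.percentile(continuous, 100 - top_percent)
    binary = continuous >= threshold               # top n percent most salient pixels
    return continuous, binary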
Analysis of Dataset • For some images, all viewers fixate on the same locations, while in other images viewers' fixations are dispersed all over the image. • The fixations in the database have a strong bias towards the center. • Fixations from the database are often on animals, cars, and human body parts like eyes and hands. • There is a certain size of region of interest (ROI) that a person fixates on.