COMPUTER VISION: SOME CLASSICAL PROBLEMS

ADWAY MITRA
MACHINE LEARNING LABORATORY
COMPUTER SCIENCE AND AUTOMATION
INDIAN INSTITUTE OF SCIENCE
June 24, 2013

WHAT IS COMPUTER VISION and WHY IS IT DIFFICULT?
• Computer Vision aims to build computers that can see!
• In other words, it deals with analyzing and understanding images and videos by computer
• The aim of the analysis is either to find known patterns in images (Detection) or to match images with known patterns (Recognition)
• To analyze an image we first need a representation for it
• An image is stored in a computer as a 2- or 3-dimensional matrix, each element being a pixel
• A single pixel carries very little, if any, semantic information!

Representation with Features
• For most applications of machine learning, the first and foremost step is to find features
• Features are used to represent the data
• Features should be such that we can have a metric space for them; usually they are vectors
• Very elaborate (high-dimensional) features need to be avoided for computational reasons
[Figure labels: Representation; Feature Vector (difficult to process); Dimensionality Reduction; Smaller Feature Vector]
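The dimensionality-reduction step can be sketched with principal component analysis, which is one common choice and not necessarily the one intended on the slide. The minimal sketch below finds only the single strongest direction of variation, using power iteration in plain Python. The function names and the toy data are assumptions made for illustration.

```python
import math

def top_principal_component(vectors, iterations=100):
    """Return (mean, direction) for the strongest direction of variation."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Covariance matrix of the centered data (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Power iteration; the all-ones start is an assumption that works unless
    # it happens to be orthogonal to the leading eigenvector.
    direction = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        direction = [x / norm for x in nxt]
    return mean, direction

def project(vector, mean, direction):
    """Reduce a d-dimensional vector to one number along the direction."""
    return sum((vector[j] - mean[j]) * direction[j] for j in range(len(vector)))

# Toy 2-D features lying on the line y = 2x, reduced to 1-D.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, direction = top_principal_component(data)
reduced = [project(v, mean, direction) for v in data]
```

Keeping several components instead of one follows the same pattern, with each new direction computed after removing the variation already explained.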

Features for Computer Vision
• Pixel values can serve as features, but are often not very meaningful
• Groups of pixels can carry more meaning, but how should such groups be formed?
• Groups of pixels (sub-images) at a large number of scales and positions
• Image gradients and edges
• Various filter outputs have also been explored
• These are difficult to interpret semantically, but have been found to work well in certain applications
• Finding concise, semantically meaningful features is still a major issue in Computer Vision
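As a small illustration of the gradient features mentioned above, the sketch below computes gradient magnitude and orientation with central differences on a grayscale image stored as a list of rows. The function name and the toy image are assumptions. Border pixels are simply left at zero.

```python
import math

def gradient_magnitude_and_orientation(image):
    """Central-difference gradients for the interior pixels of a 2-D image."""
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    orientation = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0  # horizontal change
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0  # vertical change
            magnitude[y][x] = math.hypot(gx, gy)
            orientation[y][x] = math.atan2(gy, gx)  # radians
    return magnitude, orientation

# Toy image with a vertical step edge between columns 2 and 3.
step_image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
magnitude, orientation = gradient_magnitude_and_orientation(step_image)
```

The magnitude is non-zero only next to the edge, which is why gradients are a natural starting point for edge-based features.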

SIFT Interest Points
• A filter is an operator that processes a signal and removes some undesired components
• Difference-of-Gaussian (DoG) filters are a popular choice for images
• Positions of local extrema (maxima and minima) of this filter output are the interest points
• Some interest points, like those on edges, are discarded
• At each interest point, a feature vector is computed from image gradients and their orientations inside small windows around the interest point
• This feature is invariant to the orientation and scale of the image
• SIFT: Scale-Invariant Feature Transform
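The Difference-of-Gaussian step can be sketched as follows. This is a simplified single-scale version in plain Python, not the full SIFT detector, which also searches across scales and discards edge responses. The sigma values, the threshold, and all function names are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian weights, truncated at 3 sigma and normalised to sum to 1."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma):
    """Separable Gaussian blur; pixels outside the image repeat the border."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    offsets = range(-radius, radius + 1)

    def clamp(value, size):
        return min(max(value, 0), size - 1)

    rows = [[sum(kernel[k + radius] * image[y][clamp(x + k, width)] for k in offsets)
             for x in range(width)] for y in range(height)]
    return [[sum(kernel[k + radius] * rows[clamp(y + k, height)][x] for k in offsets)
             for x in range(width)] for y in range(height)]

def difference_of_gaussians(image, sigma_small=1.0, sigma_large=1.6):
    """Subtract a strongly blurred copy from a lightly blurred copy."""
    fine, coarse = blur(image, sigma_small), blur(image, sigma_large)
    return [[f - c for f, c in zip(fine_row, coarse_row)]
            for fine_row, coarse_row in zip(fine, coarse)]

def local_extrema(response, threshold=0.03):
    """Interior pixels that are strict maxima or minima of their 8 neighbours."""
    points = []
    for y in range(1, len(response) - 1):
        for x in range(1, len(response[0]) - 1):
            value = response[y][x]
            neighbours = [response[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            is_extremum = value > max(neighbours) or value < min(neighbours)
            if abs(value) > threshold and is_extremum:
                points.append((x, y))
    return points

# Toy image: a single bright spot in the middle of a dark 11 x 11 image.
image = [[0.0] * 11 for _ in range(11)]
image[5][5] = 1.0
dog = difference_of_gaussians(image)
interest_points = local_extrema(dog)
```

On this toy input the bright spot produces a strong positive response at its centre, which the extremum test picks up as an interest point.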

[Figure: SIFT interest points]

FACE DETECTION-PROBLEM
• Given an image, find the faces in it
• Used in many places, such as digital cameras and photo-sharing albums, including Facebook
• Given a rectangular region in an image, say whether it is a face or not
• Repeat this process for every location and every size of the rectangular region
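The exhaustive search over locations and sizes can be sketched as a sliding-window generator. The window is assumed to be square, and `is_face` is a hypothetical classifier callback standing in for a trained face model. The minimum size, scale step, and stride are illustrative assumptions.

```python
import math

def sliding_windows(image_width, image_height, min_size=24,
                    scale_step=1.25, stride_fraction=0.25):
    """Yield (left, top, size) for square windows at many positions and scales."""
    size = min_size
    while size <= min(image_width, image_height):
        stride = max(1, int(size * stride_fraction))
        for top in range(0, image_height - size + 1, stride):
            for left in range(0, image_width - size + 1, stride):
                yield left, top, size
        # Grow the window; max() guards against a step that fails to grow it.
        size = max(size + 1, math.ceil(size * scale_step))

def detect_faces(image_width, image_height, is_face, **window_options):
    """Return every window that the (hypothetical) classifier accepts."""
    return [window
            for window in sliding_windows(image_width, image_height, **window_options)
            if is_face(*window)]

# Toy run on a 48 x 48 image with a stand-in classifier that accepts one window.
windows = list(sliding_windows(48, 48, min_size=24, scale_step=2.0,
                               stride_fraction=0.5))
toy_classifier = lambda left, top, size: (left, top, size) == (12, 12, 24)
detections = detect_faces(48, 48, toy_classifier, min_size=24,
                          scale_step=2.0, stride_fraction=0.5)
```

Even this tiny image yields ten candidate windows, which hints at why the per-window classifier has to be very fast in practice.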

FACE DETECTION-GENERAL APPROACH
• Basically a binary classification problem
• Requires building a model for faces
• Needs training samples, both positive and negative
• Positive samples are face images; negative samples are non-face images
• The learning algorithm finds a boundary between face and non-face images
[Figure labels: NON-FACE images; FACE images; Candidate]

FACE DETECTION- BENCHMARK and EVALUATION
• Standard face-detection benchmark datasets are available
• FDDB (Face Detection Data Set and Benchmark): a face detection dataset for unconstrained settings
• Performance is usually measured using Precision and Recall
• Precision: Of the reported face detections, how many were actually faces?
• Recall: Of the faces actually present, how many were detected?
• F-score: Harmonic mean of precision and recall
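The three measures can be written out directly from counts of correct detections, false alarms, and missed faces. The function name and the example counts below are assumptions for illustration. How a detection is matched to a true face is decided by the benchmark and is not covered here.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Precision, recall and F-score from detection counts."""
    reported = true_positives + false_positives   # all reported detections
    present = true_positives + false_negatives    # all faces actually present
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / present if present else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2.0 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

# Example: 10 detections reported, 8 of them real faces, 16 faces present.
precision, recall, f_score = precision_recall_f(8, 2, 8)
```

Here precision is 8/10 and recall is 8/16. The harmonic mean sits closer to the lower of the two, so a detector cannot score well by maximising only one of them.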

FACE RECOGNITION-PROBLEM
• Consists of a training phase and a testing phase
• In the training phase we are given many face images, each marked with the identity of the person
• In the testing phase we are given a new face image, belonging to one of these persons
• The task is to find out the identity of the person
• This is a simple classification problem in Machine Learning
• First, suitable features and representations have to be found

FACE RECOGNITION-PROBLEM
• One approach is to build a model for each person, using the training images provided for that person
• A second approach is to compare the test image to each of the training images and find the closest match
• It may be observed that not every part of a face image helps in recognition: certain things about faces are common to everyone
• A good strategy is to find the features that are most distinctive and represent images only by them
• Eigenfaces (1991) uses the last two strategies
• Recognition accuracy is the obvious evaluation criterion
• A good recognition algorithm should work well with few training images
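The second approach, matching a test image against every training image, amounts to a nearest-neighbour search. The sketch below assumes each face has already been turned into a short feature vector, for example by keeping only its most distinctive components. The function name, the names, and the vectors are made up for illustration.

```python
def nearest_neighbour_identity(test_features, training_set):
    """Return the identity whose training vector is closest to the test vector.

    training_set is a list of (identity, feature_vector) pairs.
    """
    best_identity, best_distance = None, float("inf")
    for identity, features in training_set:
        # Squared Euclidean distance; the square root does not change the ranking.
        distance = sum((a - b) ** 2 for a, b in zip(test_features, features))
        if distance < best_distance:
            best_identity, best_distance = identity, distance
    return best_identity

# Hypothetical 3-dimensional feature vectors for two people.
training_set = [
    ("person_a", [0.9, 0.1, 0.3]),
    ("person_a", [1.0, 0.2, 0.2]),
    ("person_b", [0.1, 0.8, 0.7]),
]
predicted = nearest_neighbour_identity([0.2, 0.9, 0.6], training_set)
```

The quality of the match depends almost entirely on the features: with raw pixels the same search is far less reliable.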

FACE RECOGNITION-CURRENT STATUS
• Face recognition has traditionally been done with well-cropped, focussed face images (Controlled Environment)
• In that setting it is considered a solved problem
• Nowadays face recognition is being revisited for semi-controlled or uncontrolled environments
• LFW (Labeled Faces in the Wild): a dataset of face images taken in such settings, and a new benchmark

OBJECT RECOGNITION-PROBLEM
• A classification task, like face recognition
• Practically much more complex
• A large number of images is given from many object categories
• Classify a test image into one of these categories
• The problem is made very difficult by intra-class variations

OBJECT RECOGNITION-GENERAL APPROACH
• Once again the idea is to build models for different objects
• No single feature may be enough for classification
• Some objects may have a distinctive color, others a distinctive shape
• Multiple Kernel Learning: a sophisticated machine learning formulation, generally considered the best approach for this problem
• Caltech-101: a dataset of 101 object categories
• Close to 80% accuracy has been obtained by Multiple Kernel Learning
• Caltech-256: a dataset of 256 object categories, where an accuracy of 50% is considered good!
• Intra-class variations continue to pose a significant challenge and even invite scepticism: is it a valid problem at all?

OBJECT DETECTION
• Given an image, find all the birds, trees, and cars in it!
• Requires building models for each of these objects
• Once again, search the entire image at multiple positions and scales
• Part-based models of objects are considered efficient
• Instead of modelling the whole object, model different parts separately
• Helps to handle occlusion and perhaps intra-class variations

IMAGE SEGMENTATION
• Given an image, divide it such that each segment contains an object
• Basically a clustering problem
• Does not require features and is done purely with pixel values
• Has inspired advanced clustering techniques like spectral clustering
• Graph-based method: models the image as a graph with each pixel representing a node and adjacent pixels connected by edges
• Each edge is given a weight according to the similarity of the corresponding pixel values
• Requires the number of segments to be specified
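The graph construction can be sketched as below. To keep the example short, the clustering step is a simple greedy merge of the most similar neighbouring pixels until the requested number of segments remains. This is a simpler stand-in, not spectral clustering. The function name and the toy image are assumptions.

```python
def segment_image(image, num_segments):
    """Label each pixel of a grayscale image with a segment id."""
    height, width = len(image), len(image[0])

    def node(y, x):
        return y * width + x

    # One edge per pair of adjacent pixels. The stored weight is the absolute
    # difference in pixel value, so a LOW weight means HIGH similarity.
    edges = []
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                edges.append((abs(image[y][x] - image[y][x + 1]),
                              node(y, x), node(y, x + 1)))
            if y + 1 < height:
                edges.append((abs(image[y][x] - image[y + 1][x]),
                              node(y, x), node(y + 1, x)))
    edges.sort()

    # Union-find over pixels: merge along the most similar edges first.
    parent = list(range(height * width))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    remaining = height * width
    for _, a, b in edges:
        if remaining <= num_segments:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            remaining -= 1
    return [[find(node(y, x)) for x in range(width)] for y in range(height)]

# Toy image: a dark region on the left and a bright region on the right.
toy_image = [[10, 11, 200, 201],
             [10, 12, 199, 200]]
labels = segment_image(toy_image, num_segments=2)
```

As on the slide, the number of segments is an input: the merge loop stops as soon as that many groups remain.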

IMAGE SEGMENTATION
• Segmentation is evaluated with respect to a gold-standard segmentation
• Every pair of pixels that falls in the same segment in the gold standard should also be in the same segment in the reported segmentation
• (and similarly, every pair of pixels in different segments should stay in different segments)
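This pair-counting criterion can be turned into a score by measuring the fraction of pixel pairs on which the two segmentations agree, a quantity commonly known as the Rand index. The sketch below takes flattened label lists. The function name and the toy labels are assumptions, and the quadratic loop is only practical for small inputs.

```python
from itertools import combinations

def pairwise_agreement(gold_labels, predicted_labels):
    """Fraction of pixel pairs treated the same way by both segmentations."""
    pairs = list(combinations(range(len(gold_labels)), 2))
    if not pairs:
        return 1.0
    agreements = 0
    for i, j in pairs:
        same_in_gold = gold_labels[i] == gold_labels[j]
        same_in_predicted = predicted_labels[i] == predicted_labels[j]
        if same_in_gold == same_in_predicted:
            agreements += 1
    return agreements / len(pairs)

# Four pixels: the reported segmentation wrongly splits the second gold segment.
score = pairwise_agreement([0, 0, 1, 1], [0, 0, 1, 2])
```

Because only pairs are compared, the actual label values do not matter: relabelling the segments leaves the score unchanged.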

Video Problems
• Videos are collections of images taken over an interval of time; successive images are quite similar
• Having to handle several images rather than one may make video problems tougher
• But the temporal continuity of videos provides a way out
• Joint modelling of multiple similar images can, in fact, give better performance than modelling a single image
• For video tasks, additional motion-based features like optical flow can be used
• The concept of interest points for images is extended to Space-Time Interest Points for videos
• Face Recognition, Face Detection, etc. can also be done in videos, often more effectively than in images

OBJECT TRACKING-PROBLEM
• Given a video which shows a person or object moving
• Need to find it in each frame
• Naive approach: reduce it to an object detection problem in every frame
• If the object is at position (x, y) in frame t, it will be very close to (x, y) in frame t + 1
• So if we know the position at time t, we need to search only around that same position
• Reduces the search space greatly!
• The main idea is to build an appearance model for the object
• The appearance may change over time due to variations in size, illumination, viewpoint, etc.
• The appearance model must be adaptive, and recomputed throughout the video
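The local search can be sketched with a deliberately simple appearance model: a fixed template of pixel values compared by sum of squared differences. The template model, the search radius, and the function names are assumptions. An adaptive tracker would also refresh the template as the video progresses, which this sketch leaves out.

```python
def patch_difference(frame, template, left, top):
    """Sum of squared differences between the template and a frame patch."""
    return sum((frame[top + dy][left + dx] - template[dy][dx]) ** 2
               for dy in range(len(template))
               for dx in range(len(template[0])))

def track_step(frame, template, previous_position, search_radius=2):
    """Find the best match near the position found in the previous frame."""
    previous_left, previous_top = previous_position
    template_height, template_width = len(template), len(template[0])
    frame_height, frame_width = len(frame), len(frame[0])
    best_position, best_cost = previous_position, None
    # Only positions within search_radius of the previous one are examined.
    for top in range(max(0, previous_top - search_radius),
                     min(frame_height - template_height,
                         previous_top + search_radius) + 1):
        for left in range(max(0, previous_left - search_radius),
                          min(frame_width - template_width,
                              previous_left + search_radius) + 1):
            cost = patch_difference(frame, template, left, top)
            if best_cost is None or cost < best_cost:
                best_position, best_cost = (left, top), cost
    return best_position

# Toy frame: a bright 2 x 2 object that moved one pixel right of (2, 2).
frame = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4):
        frame[y][x] = 9
template = [[9, 9], [9, 9]]
new_position = track_step(frame, template, previous_position=(2, 2))
```

With a radius of 2 the search examines at most 25 positions per frame, instead of every position and scale in the image.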

OBJECT TRACKING- BENCHMARK and EVALUATION
• Performance is measured with respect to a gold standard, in which a bounding box is provided for each frame
• The score is the proportion of overlap between the areas of the gold-standard and reported bounding boxes
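One common way to make the overlap score concrete is to divide the area of the intersection of the two boxes by the area of their union. Whether a given benchmark uses exactly this ratio is an assumption here. Boxes are assumed to be `(left, top, width, height)` tuples, and the function name is illustrative.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area divided by union area of two axis-aligned boxes."""
    left_a, top_a, width_a, height_a = box_a
    left_b, top_b, width_b, height_b = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    overlap_width = max(0, min(left_a + width_a, left_b + width_b)
                        - max(left_a, left_b))
    overlap_height = max(0, min(top_a + height_a, top_b + height_b)
                         - max(top_a, top_b))
    intersection = overlap_width * overlap_height
    union = width_a * height_a + width_b * height_b - intersection
    return intersection / union if union else 0.0

# Gold-standard box versus a reported box shifted by half its size.
score = overlap_ratio((0, 0, 10, 10), (5, 5, 10, 10))
```

Averaging this ratio over all frames gives a single number for the whole video. A per-frame threshold on the ratio can likewise be used to count a frame as a hit or a miss.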

OBJECT TRACKING-CURRENT STATUS
• Considered a solved problem under controlled illumination and background
• Current research aims to handle occlusion of the object, and sudden changes in background and illumination
• Tracking multiple objects at the same time is another important problem
• Tracking is a real-time application, and efforts are under way to process as many frames per second as possible
• To adapt or not to adapt remains the fundamental problem in vision
• A single miss can make the whole tracking go wrong
• Detecting and correcting a miss is an important problem to solve

ACTION RECOGNITION IN VIDEOS
• Surveillance cameras are nowadays available at many sensitive public locations
• The aim is to record the activities of people
• Requires the use of dynamic features, which make use of the motion in videos
• Some image-based features can be extended to videos, like space-time interest points
• These can be used by viewing the video as a space-time volume
• The features can also be in the form of time series

ACTION RECOGNITION IN VIDEOS
• In the presence of a benign background, a static camera, and a single actor, the problem is considered solved
• Current research aims to handle complex environments, like crowded places, where persons frequently get occluded
• Multi-person interaction recognition is another recent offshoot of the problem
