
People Detection



Presentation Transcript


  1. People Detection Ali Taalimi, Prof. Abidi, 11/19/2012

  2. Outline • Object Detection • Macro Scheme • Main Component • People Detection • Crowd • Suggested Approach

  3. Macro Scheme

  4. Macro Scheme Bottom–Up Scheme (diagram: Frames → Patch List → Object List → Identified Object)

  5. Macro Scheme Top–Down Scheme (diagram labels: Frames, Mask, Silhouette List, Object List, Trajectories; driven by fixed rules or training)

  6. Main Components • Feature extraction: a method to extract relevant information from the image area occupied by a target. • Target representation: a model of the appearance and shape of a target, used by the tracker and descriptive enough to cope with clutter. • Tracking: links different instances of the same object over time, and has to compensate for occlusion, clutter, and illumination changes.

  7. Features • Low-Level Features: • Color: RGB, CIE, CIELUV, HSI • Gradient: local intensity changes within the object (different reflectance properties of object parts, skin, hair) and at the boundary (different reflectance of the object against the background). • Laplacian of Gaussian (LoG) / Difference of Gaussians (DoG) • Motion: to detect and localize objects over time (optical flow). • Low-level features cannot describe image contents completely. • Mid-Level Features: • Use a subset of pixels that represent structures (edges, interest points/regions). • Interest-point detectors: select highly distinctive features that can be localized across multiple frames under pose and illumination changes (corners). • High-Level Features: • Detect the object as a whole based on its appearance • Group mid-level features • Background modeling • Object modeling (e.g., color-based segmentation for face detection)
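As a sketch of the LoG/DoG idea above: a Difference of Gaussians responds strongly where intensity changes, e.g. at an object boundary. A minimal 1-D NumPy illustration (the sigmas and kernel radius here are arbitrary choices, not values from the slides):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def difference_of_gaussians(signal, sigma1=1.0, sigma2=2.0, radius=8):
    """Band-pass response: blur at two scales and subtract."""
    g1 = np.convolve(signal, gaussian_kernel(sigma1, radius), mode="same")
    g2 = np.convolve(signal, gaussian_kernel(sigma2, radius), mode="same")
    return g1 - g2

# A step edge at index 50: the DoG response peaks near the boundary.
signal = np.concatenate([np.zeros(50), np.ones(50)])
peak = int(np.argmax(np.abs(difference_of_gaussians(signal))))
print(peak)  # close to 50
```

The same construction in 2-D, applied at multiple scales, is how DoG approximates the LoG operator in practice.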

  8. Target Representation • How can we define a target in terms of its shape and appearance? • A target representation is a model of the object, used by a tracking algorithm, that encodes the shape and appearance of the target. • Shape/appearance information can be encoded at different levels of resolution. • Examples: • using a bounding box or a deformable contour to approximate the shape of the target, • using a pdf of some appearance features computed within the target area to encode appearance information. • Uncertainty factors, like illumination changes, clutter, target interaction, and occlusion, should be accounted for when deciding how to represent the target.
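The "pdf of appearance features" mentioned above is commonly a normalized color histogram of the target area, compared across frames with a similarity measure such as the Bhattacharyya coefficient. A minimal NumPy sketch (the bin count and 8-bit value range are assumptions):

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized joint histogram over the patch's three color channels."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3).astype(float),
                             bins=bins, range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def bhattacharyya(h1, h2):
    """Similarity of two normalized histograms; 1.0 means identical."""
    return float(np.sum(np.sqrt(h1 * h2)))

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(32, 16, 3))      # a candidate target area
same = bhattacharyya(color_histogram(patch), color_histogram(patch))
print(round(same, 3))  # 1.0 for identical appearance
```

Unlike a template, the histogram discards positional information, which is exactly the trade-off the slide describes between the two encodings.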

  9. Target Representation Shape Representation: Basic, Articulated, Deformable Representations A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, pp. 1–45, 2006.

  10. Target Representation Shape Representation: Basic Models: • Point approximation: • Limitation: target occlusion (points don't provide an estimate of the size of the target). • Area approximation: bound the target with a rectangle or ellipse. Motion, color, and gradient of the entire target area are used for tracking. • Tracking is performed by estimating parameters describing possible transformations of the center and axes of these approximated shapes. • Lack of depth. • Volume approximation: occlusion between multiple objects can be handled using the spatial volume occupied by the target. • Lack of generality: only for objects available in the training dataset. • Articulated models: approximate the shape of the target by combining a set of rigid models based on topological connections and motion constraints. • E.g., full-body tracking of a human using the topology of the human skeleton.

  11. Target Representation Shape Representation: • Deformable Models: • Shape rigidity or kinematic assumptions don't hold for all target classes. • Where does this happen? • Prior information on the object shape may not be available. • Objects have deformations that are not well modeled by canonical joints. • Fluid models: instead of tracking the entire target area, interest points can be identified on the object and then used to track its parts. • No explicit motion constraints between the different parts. • Example: tracking detectable corners of the object. • Stable object tracking in occluded states. • Problem: how to group the points (which of them belong to the same object?). • Contours: a more accurate description of the target. • Contour-based trackers use a set of control points positioned along the contour. • The concatenation of the coordinates of the control points is considered. • Uses prior information about the target shape.

  12. Part-Based Model • The model consists of a global “root” filter and several part models. • Both root and part filter responses are computed as a dot product between a set of weights and HOG features. • Two different scales: • Coarse features are captured by a rigid template of the entire detection window. • Finer-scale features are captured by part templates positioned with respect to the detection window. P. Felzenszwalb et al., “A discriminatively trained, multiscale, deformable part model,” in CVPR, 2008.
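The scoring above can be sketched as follows: each filter response is a dot product between a weight vector and HOG features, and the detection score combines the root response with part responses, each penalized by a deformation cost. This is a simplified illustration, not the full Felzenszwalb implementation; the random features and fixed deformation costs are placeholders:

```python
import numpy as np

def filter_response(hog_features, weights):
    """Linear filter score: dot product of weights and HOG features."""
    return float(np.dot(weights.ravel(), hog_features.ravel()))

def part_based_score(root_feat, root_w, part_feats, part_ws, deform_costs):
    """Root score plus part scores, each penalized by its deformation cost."""
    score = filter_response(root_feat, root_w)
    for feat, w, cost in zip(part_feats, part_ws, deform_costs):
        score += filter_response(feat, w) - cost
    return score

# Toy example: one coarse root filter and two finer part filters.
rng = np.random.default_rng(0)
root_feat, root_w = rng.random((4, 4, 9)), rng.random((4, 4, 9))
parts = [(rng.random((2, 2, 9)), rng.random((2, 2, 9))) for _ in range(2)]
score = part_based_score(root_feat, root_w,
                         [f for f, _ in parts], [w for _, w in parts],
                         deform_costs=[0.5, 0.5])
print(score > 0)
```

In the real model the deformation cost is a quadratic function of each part's displacement from its anchor, and the score is maximized over part placements.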

  13. Target Representation Appearance Representation • The appearance representation is a model of the expected projection of the object's appearance onto the image plane. • Unlike shape, the appearance representation may be specific to a single object and need not generalize across objects of the same class. • Usually paired with a function that, given the image, estimates the likelihood of an object being in a particular state. • Template: encodes positional information of the color values of all pixels within the target area. • Histogram: of color, of gradient. (Diagram labels: input image, feature extraction (color, gradient), appearance representation (template, histogram), learning, target modeling, target position.)

  14. Outline • Object Detection • People Detection • ROI Selection • Background Modeling • Classification • Tracking • Crowd • Suggested Approach

  15. Challenges Jitendra Malik

  16. People Detection • Main components of a people detection system: • hypothesis generation (ROI selection) • classification (model matching, verification) • tracking (temporal integration) • evaluation • ROI selection: initialized by general low-level features or prior scene knowledge. • Classification/tracking require models of the people class, in terms of geometry, appearance, or dynamics. (Diagram: new images → detected candidate targets → classification.)

  17. ROI Selection • Sliding-window technique: detector windows at various scales and locations are shifted over the image. • Computational costs are often too high. • Speedups come from coupling with a classifier cascade, or from restricting the search space based on prior information about the target object class: • geometry of people, e.g., object height or aspect ratio • features derived from the image data • object motion • background subtraction • interest-point detectors (ISM) • Confidence density of the detector: • Sliding window: the density is implicitly sampled on a discrete 3D grid (location and scale) by evaluating the different detection windows with a classifier. • Feature-based: the density is explicitly created in a bottom-up fashion through probabilistic votes cast by matching local features.
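The sliding-window search above can be sketched as a generator over locations and scales; the 64×128 base window is the common pedestrian-detector window size and the stride is an arbitrary choice, neither stated in the slide:

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128,
                    stride=8, scales=(1.0, 1.5, 2.0)):
    """Yield (x, y, w, h) candidate windows over several scales."""
    for s in scales:
        w, h = int(win_w * s), int(win_h * s)
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)

# Even a small image yields hundreds of candidates -- hence the need for
# cascades or prior scene knowledge to prune the search space.
n_windows = sum(1 for _ in sliding_windows(320, 240))
print(n_windows)  # 698 candidates for these settings
```

Each candidate window would then be scored by the classifier, which is exactly the implicit 3D sampling of the confidence density described above.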

  18. Classification • Receives a list of ROIs that are likely to contain a person. • Verification (classification) works on people appearance models, using various spatial and temporal cues. • A given image/subregion is assigned to either the people or non-people class, based on its class posterior probabilities. • Class posterior probability: the probability of a person being in that region given a model. • How to estimate the posterior probability: • generative and discriminative models • Challenges: • missed detections: not all people are detected in each frame • false positive detections: non-people are detected • With no depth or scene information (e.g., a ground plane), the detector does not know where in the image to expect objects of which size.
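The class posterior mentioned above follows from Bayes' rule over the two classes. The prior and likelihood values below are made-up numbers for illustration; they are not from the slides:

```python
def person_posterior(lik_person, lik_background, prior_person=0.01):
    """P(person | features) for a two-class problem via Bayes' rule."""
    joint_person = lik_person * prior_person
    joint_background = lik_background * (1.0 - prior_person)
    return joint_person / (joint_person + joint_background)

# Even a feature response 16x more likely under "person" yields a modest
# posterior when people are rare in the image (a low class prior).
p = person_posterior(lik_person=0.8, lik_background=0.05)
print(round(p, 3))  # 0.139
```

This is also why the missed-detection/false-positive trade-off is sensitive to scene priors: with no ground plane or depth prior, every location and scale carries the same low prior.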

  19. Classification D. Gerónimo et al., “Survey of pedestrian detection for advanced driver assistance systems,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1239–1258, 2010.

  20. Classification Generative Models: • Model the appearance of the people class in terms of its class-conditional density function. • Shape models: reduce variations in people appearance due to lighting or clothing. • Discrete approaches: represent the shape manifold by a large set of exemplar shapes. • Continuous shape models: a parametric representation of the class-conditional density, learned from a set of training shapes. • Combined shape and texture models: combine shape and texture information within a compound parametric appearance model. • The texture cue represents the variation of the intensity pattern across the image region of target objects. • Separate statistical models for shape and intensity variations.

  21. Classification Discriminative Models: • Approximate the Bayesian MAP decision by learning the parameters of a discriminant function (decision boundary) between the people and non-people classes from training examples. • Features: • Haar wavelet features • codebook feature patches • gradient orientation histograms: dense (HOG) and sparse (SIFT) • spatial configuration of salient edge-like structures: shapelets, edgelets • spatiotemporal features to capture human motion, especially gait • Classifier architectures: • determine an optimal decision boundary between pattern classes in a feature space. • feed-forward multilayer neural networks • linear/nonlinear support vector machines (SVMs) • AdaBoost
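As a minimal illustration of the boosting option listed above, here is AdaBoost over one-feature threshold stumps on a toy 2-D dataset. The data and round count are fabricated for the example; a real detector would boost over Haar or gradient features instead of raw coordinates:

```python
import numpy as np

def train_stumps(X, y, rounds=3):
    """Minimal AdaBoost over one-feature threshold stumps (labels in {-1, +1})."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                 # example weights
    stumps = []
    for _ in range(rounds):
        best = None
        for j in range(d):                  # exhaustively search stump hypotheses
            for t in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, j] - t) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign, pred)
        err, j, t, sign, pred = best
        err = max(err, 1e-10)               # avoid log(0) on perfect stumps
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)      # upweight misclassified examples
        w /= w.sum()
        stumps.append((alpha, j, t, sign))
    return stumps

def predict(stumps, X):
    score = sum(a * np.where(s * (X[:, j] - t) > 0, 1, -1)
                for a, j, t, s in stumps)
    return np.sign(score)

# Toy separable data: feature 0 alone separates the classes.
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 5.0], [0.8, 1.0]])
y = np.array([-1, -1, 1, 1])
model = train_stumps(X, y)
print(predict(model, X).tolist())  # should reproduce y
```

The weighted vote of weak stumps is the same mechanism used by cascade detectors, where early rounds reject easy background windows cheaply.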

  22. Classification Multipart representations: • break down the complex appearance of the people class into subparts. • build local pose-specific people clusters and train a specialized expert classifier for each subspace. • integrate the individual expert responses into a final decision (a model for geometric relations between parts). M. Enzweiler et al., “Multi-cue pedestrian classification with partial occlusion handling,” in IEEE Conf. Computer Vision and Pattern Recognition, 2010.

  23. Classification Multipart representations: codebook representations: represent people in a bottom-up fashion as assemblies of local codebook features, combined with top-down verification. Codebook Generation. B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” in CVPR, 2005.

  24. Different Learning Algorithms/Classifiers

  25. Results of Classification • Holistic algorithms are unable to deal with high variability: • nonstandard poses greatly affect their performance • the diversity of poses causes many people to be poorly represented during training (e.g., running people, children, etc.) • Parts-based algorithms that rely on dynamic part detection handle pose changes better than holistic approaches. • Support vector machines (SVMs) and boosting are the most popular choices of learning algorithm/classifier. • Nearly all modern detectors employ: • gradient histograms • grayscale features (e.g., Haar wavelets) • color, texture • self-similarity and motion features

  26. Overall Performance Only Detection, without Tracking (1) For people at least 80 pixels tall, 20–30% of all people are missed (at 1 false alarm per 10 images). (2) Performance degrades catastrophically for smaller people: for people 30–80 pixels tall, around 80% are missed by the best detectors (at 1 false alarm per 10 images). (3) Performance degrades similarly under partial occlusion (under 35% occluded). (4) Performance is very bad at far scales (under 30 pixels) and under heavy occlusion (over 35% occluded); under these conditions nearly all people are missed, even at high false positive rates. P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.

  27. Background Modeling • Statistical background modeling (GMM, KDE) • Background clustering • Neural-network background modeling • Background estimation • Steps and issues: • background modeling • background initialization • background maintenance • foreground detection • choice of the feature size (a pixel, a block, or a cluster) • choice of the feature type (color/edge/motion/texture features) • Critical situations in video sequences: • bootstrapping, camouflage, moved background objects, inserted background objects, waking foreground objects, sleeping foreground objects, shadows, dynamic backgrounds, illumination changes • Two constraints: low computation time, low memory requirements
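The modeling/maintenance/detection loop above can be sketched with the simplest possible model, a per-pixel running average; the learning rate and threshold are arbitrary choices, and statistical models such as GMMs replace the single average in practice:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Maintenance step: exponential running average of the scene."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, threshold=30.0):
    """Detection step: pixels far from the background model are foreground."""
    return np.abs(frame.astype(float) - bg) > threshold

bg = np.zeros((4, 4))                 # initialization: an empty scene
frame = np.zeros((4, 4))
frame[1:3, 1:3] = 200.0               # a bright 2x2 object enters
mask = foreground_mask(bg, frame)
print(int(mask.sum()))                # 4 pixels flagged as foreground
bg = update_background(bg, frame)     # the model slowly absorbs the change
```

Several of the critical situations listed above (sleeping foreground objects, sudden illumination changes) are exactly the cases where this simple maintenance rule fails, motivating the statistical models.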

  28. Statistical Background Modeling T. Bouwmans, F. El Baf, and B. Vachon, “Statistical Background Modeling for Foreground Detection: A Survey,” in Handbook of Pattern Recognition and Computer Vision, vol. 4, ch. 3, World Scientific Publishing, 2010.

  29. Disadvantages of MOG • the number of Gaussians must be predetermined • the need for good initialization • the dependence of the results on the true distribution, which can be non-Gaussian • slow recovery from failures • the need for a series of training frames free of moving objects • the amount of memory required for this step • Solutions: intrinsic and extrinsic improvements

  30. MOG Improvements: Intrinsic, Extrinsic

  31. Feature Improvements of the MOG The original method uses only the RGB values of a pixel, without using spatial knowledge.

  32. Subspace Learning using PCA • Subspace learning offers a good framework to deal with illumination changes, as it allows taking spatial information into account. • By assuming that the largest part of the image is background, we expect the M largest eigenvectors to capture only background. • Limitations: • the foreground object must be small • the foreground should not appear in the same location for a long period during training (stationary or slow-moving foreground) • time-consuming, especially for color images
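The subspace idea above can be sketched as an SVD-based "eigenbackground": project a new frame onto the top-k eigenvectors learned from training frames and flag pixels with a large reconstruction residual. The frame sizes, k, and the synthetic data below are illustration-only assumptions:

```python
import numpy as np

def eigenbackground(frames, k=2):
    """Learn a k-dimensional background subspace from training frames."""
    X = frames.reshape(len(frames), -1).astype(float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                        # mean frame + top-k basis

def residual(frame, mean, basis):
    """Reconstruction error: large where the frame deviates from background."""
    x = frame.ravel().astype(float) - mean
    recon = basis.T @ (basis @ x)              # project onto the subspace
    return np.abs(x - recon).reshape(frame.shape)

rng = np.random.default_rng(1)
train = 100.0 + rng.normal(0, 1, size=(20, 8, 8))   # static, noisy background
mean, basis = eigenbackground(train)
test_frame = np.full((8, 8), 100.0)
test_frame[2:4, 2:4] = 250.0                        # a small foreground object
r = residual(test_frame, mean, basis)
print(np.unravel_index(np.argmax(r), r.shape))      # inside the 2x2 object
```

The small-foreground limitation in the slide is visible here: a large or persistent object would leak into the learned basis and be reconstructed as background.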

  33. Tracking Provides correspondences between the regions of consecutive frames based on features and a dynamic model. • Challenges in video tracking: • robustness to clutter and occlusion • false positives/negatives • stability as the target changes its pose, and thus its appearance as seen by the camera • Clutter in video tracking: objects in the background share a similar color/shape with the target. • Occlusion by another moving object.

  34. Tracking Challenges (diagram): illumination, scene, weather, sensor noise, partial and total occlusions, object pose (rotation, translation, deformation)

  35. Tracking • Prevent false detections using all of the data within a spatial/temporal sliding window covering the most recent part of the video. • Use motion estimates to obtain correct data associations. • Predict future people positions, thus feeding the foreground segmentation algorithm with pre-candidates. • Video tracking in computer vision: • window tracking • feature tracking: detectable parts of images • tracking local features (corners, edges, lines) • optic flow methods • tracking extended features (ellipses, rectangles, contours, regions) • tracking deformable contours (snakes, parametric geometric models, deformable templates) • visual learning: instead of specifying shape/appearance a priori, learn it from examples.

  36. Tracking • Tracking systems address two problems: motion and matching. • Motion problem: identify a limited search region in which the element is expected to be found with high probability. • Simplest approach: define the search area in the next frame as a fixed-size region surrounding the target position in the previous frame. • Kalman filter (KF), particle filtering, spatio-temporal MRF, graph correspondence, event cones • Matching problem (also known as detection or localization): identify the image element in the next frame within the designated search region. • A similarity metric compares candidates in the previous and current frames. • Data association: needed in the presence of interfering targets/trajectories.
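The Kalman filter listed above addresses the motion problem by predicting where to search next. A minimal constant-velocity sketch in NumPy; the state layout and the noise covariances Q and R are assumptions for the example, not values from the slides:

```python
import numpy as np

# Constant-velocity model: state = [x, y, vx, vy], measurement = [x, y].
F = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
H = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])
Q = np.eye(4) * 1e-2      # process noise (assumed)
R = np.eye(2)             # measurement noise (assumed)

def kf_step(x, P, z):
    """One predict/correct cycle; z is the detected (x, y) position."""
    x = F @ x                         # predict the next state
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ (z - H @ x)           # correct with the detection
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4) * 10.0
for t in range(1, 11):                # a target moving 2 px/frame along x
    x, P = kf_step(x, P, np.array([2.0 * t, 0.0]))
print(round(float(x[2]), 1))          # estimated x-velocity, close to 2
```

The predicted position (before the correction step) defines the search region for the matching problem, and its covariance P gives the region's size.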

  37. Tracking Tracking-by-Detection Approach • involves the continuous application of a detection algorithm in individual frames and the association of detections across frames. • generally robust to changing backgrounds and moving cameras. • Why is association between detections and targets difficult? • Detection results degrade in occluded scenes. • Detector output is unreliable and sparse, i.e., detectors only deliver a discrete set of responses and usually yield false positives and missed detections. The output of a person detector (right: ISM, left: HOG) with false positives and missed detections. M. D. Breitenstein et al., “Online multi-person tracking-by-detection from a single, uncalibrated camera,” IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 33, no. 9, 2011.

  38. Outline • Object Detection • People Detection • Crowd • Crowd Challenges • Crowd Information Extraction • Crowd Dynamic/Analysis • Tracking in Crowd Scene • Suggested Approach

  39. Different Levels of Crowd Density (a) very low density; (b) low density; (c) moderate density; (d) high density; (e) very high density. A. Marana, L. da Costa, R. Lotufo, and S. Velastin, “On the efficacy of texture analysis for crowd monitoring,” in Proc. Int. Symp. Computer Graphics, Image Processing, and Vision (SIBGRAPI’98), Washington, DC, 1998, p. 354.

  40. Crowd Challenges • Straightforward extensions of techniques designed for non-crowded scenes don't work in crowded situations. • Because of severe occlusion, it is difficult to segment and track each individual in a crowd. • In high-density video sequences, the accuracy of traditional methods for object tracking decreases as the density of people increases. • The dynamics of a crowd are themselves complex: • goal-directed behavior • psychological characteristics • Occlusion reasoning: • if a partially occluded person is detected and associated to a trajectory, the classifier will be updated with noise and performance will degrade.

  41. Crowd Information Extraction • 1. Crowd density measurement: estimating crowd density or counting the number of people. • 2. Recognition: in extremely cluttered scenes, segmenting individual people is exceedingly difficult. • face and head recognition • people and crowd recognition: • occlusion handling • moving camera: on-board vision systems to assist a driver • spatial-temporal methods • 3. Tracking: • resolve occlusion both during and after its occurrence • using traceable image features • human body models: model human body parts; tracking is implemented by probabilistic data association, i.e., matching the object hypotheses with the detected responses • tracking inference strategies: particle filtering, MHT, JPDAF, the Hungarian algorithm, greedy search, …

  42. Crowd Analysis “Crowd analysis using computer vision techniques,” Jacques et al., IEEE Signal Processing Magazine, 2010.

  43. Crowd Dynamics/Analysis CROWD DYNAMICS Crowds can be characterized from three different perspectives: • the image-space domain • the sociological domain: studies the behavior of people over long periods • the computer graphics domain CROWD ANALYSIS Three issues in the analysis of crowded scenes: • people counting/density estimation models • tracking in crowded scenes • crowd behavior understanding models Counting vs. tracking: • Similarity: the goal of both is to identify the participants of a crowd. • Difference: counting only estimates the number of people; position and temporal evolution aren't considered. Tracking determines the position of each person in the scene as a function of time.

  44. Crowd Analysis PEOPLE COUNTING/DENSITY ESTIMATION MODELS • pixel-based analysis: • based on very local features (individual pixel analysis via background subtraction or edge detection) • mostly focused on density estimation rather than precise people counting • texture-based analysis: • requires the analysis of image patches • explores higher-level features compared to pixel-based approaches • object-level analysis: • tries to identify individual objects in a scene • produces more accurate results compared to the other two • identifying individuals is feasible in low-density crowds • very hard for denser crowds • highly dependent on the extraction of the foreground blobs that generate the image features

  45. Tracking in Crowded Scenes • Unstructured environments: • the motion of a crowd appears random, with different participants moving in different directions over time (e.g., a crossway). • The approach should allow each location of the scene to exhibit various crowd behaviors. • Use the head as the point of reference rather than the entire body, because heads are rarely obscured from overhead surveillance cameras and are generally not obscured by clothing. • Multiple-target tracking: • appearance-based methods: feed-forward systems that use only current and past observations to estimate the current state. • data-association-based methods: also use future information to estimate the current state, allowing ambiguities to be more easily resolved at the cost of increased latency.
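Data association, needed by both families of multiple-target trackers above, can be sketched in its simplest greedy form: match each track's predicted position to its nearest unclaimed detection. The coordinates and gating distance below are fabricated for illustration; real systems often use the Hungarian algorithm or JPDAF instead, as listed earlier:

```python
import numpy as np

def greedy_associate(tracks, detections, max_dist=50.0):
    """Greedily match each track to its nearest unclaimed detection.

    tracks/detections are (x, y) positions; returns (track, detection) pairs.
    """
    pairs = []
    used = set()
    for i, t in enumerate(tracks):
        dists = [np.hypot(t[0] - d[0], t[1] - d[1]) for d in detections]
        for j in np.argsort(dists):         # nearest detection first
            j = int(j)
            if j not in used and dists[j] <= max_dist:
                pairs.append((i, j))        # claim this detection
                used.add(j)
                break
    return pairs

tracks = [(10, 10), (100, 100)]             # predicted track positions
detections = [(102, 98), (12, 11)]          # detector output this frame
print(greedy_associate(tracks, detections))  # [(0, 1), (1, 0)]
```

Greedy matching is order-dependent and can make globally suboptimal choices in dense crowds, which is precisely why the optimal-assignment and multi-hypothesis strategies exist.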

  46. Outline Object Detection and Tracking People Detection Crowd Suggested Approach

  47. Goals • Make detection reliable: low false positives and false negatives (missed detections). • Segment multiple, possibly occluded humans in the image. • Detect the head/face of each person in video sequences. • Obtain consistent trajectories of multiple, possibly occluding humans in the video sequence. • Track humans robustly. • Occlusion reasoning.

  48. My Main Scheme (diagram: regions of interest → verified and refined ROIs → labeled ROIs)

  49. Detection • Object detection can be performed by modeling and then classifying background and foreground: • train a classifier on the appearance of the background pixels; a detection is then associated with each connected region/blob of foreground pixels. • train a set of classifiers to encode the people (foreground). • Statistical shape-and-texture appearance models define a representation of people appearance. • Using multiple detection modules: • shape-based detection + texture-based classification. • We should explain why certain modules were selected and how they were integrated into the overall system.

  50. Detection/Tracking • Head detection for tracking: • Detecting the head specifically (and using a part-based model in general) can be more reliable, since the head is rarely obscured from cameras or by clothing. • Locating faces directly may not be possible due to occlusion, pose variations, or the relatively small size of the face region compared to the whole image. • Make a robust multi-target tracker: • Instead of using detection and classification results only to guide the tracker, we can couple detection and tracking using tracklets: • update the tracking model with the detection confidence density. • Use spatio-temporal knowledge, such as the motion tracker and object attributes. • Example: detection fails in occluded situations, where the tracker may help; in the case of abrupt motion changes the tracker drifts, and detection can help.
