This research aims to infer 3D human motion from 2D image properties without manual intervention. The study focuses on dealing with challenges such as unknown, cluttered environments, self-occlusion, low contrast, and ambiguous matches. The goal is to represent uncertainty, model non-linear dynamics, exploit image cues, integrate information over time, and combine multiple image cues to accurately estimate human motion.
Learning the Appearance and Motion of People in Video (The Science of Silly Walks) Hedvig Sidenbladh, Defense Research Institute, Stockholm, Sweden, http://www.nada.kth.se/~hedvig; Michael J. Black, Department of Computer Science, Brown University, http://www.cs.brown.edu/~black
Collaborators: David Fleet, Xerox PARC; Nancy Pollard, Brown University; Dirk Ormoneit and Trevor Hastie, Dept. of Statistics, Stanford University; Allan Jepson, University of Toronto
The (Silly) Problem Unsolved without manual intervention.
Inferring 3D Human Motion * Infer 3D human motion from 2D image properties. * No special clothing * Monocular, grayscale, sequences (archival data) * Unknown, cluttered, environment * Incremental estimation
Why is it Hard? * Singularities in viewing direction * Unusual viewpoints * Self occlusion * Low contrast * Ambiguous matches
Large Motions Limbs move rapidly with respect to their width. Non-linear dynamics. Motion blur.
Ambiguities Where is the leg? Which leg is in front?
Ambiguities Accidental alignment
Ambiguities Occlusion Whose legs are whose?
Requirements 1. Represent uncertainty and multiple hypotheses. 2. Model non-linear dynamics of the body. 3. Exploit image cues in a robust fashion. 4. Integrate information over time. 5. Combine multiple image cues.
Simple Body Model * Limbs are truncated cones * Parameter vector of joint angles and angular velocities = φ
Inference/Issues. Bayesian formulation: p(model | cues) = p(cues | model) p(model) / p(cues). 1. Need a constraining likelihood model that is also invariant to variations in human appearance. 2. Need a prior model of how people move. 3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.
What Image Cues? Pixels? Temporal differences? Background differences? Edges? Color? Silhouettes? Optical flow?
Brightness Constancy I(x, t+1) = I(x+u, t) + η. Image motion of foreground as a function of the 3D motion of the body. Problem: no fixed model of appearance (drift).
State of the Art. Bregler and Malik ‘98 * Brightness constancy cue - insensitive to appearance * Full-body required multiple cameras. * Single hypothesis. - MAP estimate
State of the Art. Cham and Rehg ‘99 * Single camera, multiple hypotheses. * 2D templates (solves drift but is view dependent): I(x, t) = I(x+u, 0) + η
Edges as a Cue? * Probabilistic model? * Under/over-segmentation, thresholds, …
State of the Art. Deutscher, North, Bascle, & Blake ‘99 * Multiple cameras * Simplified clothing, lighting, and background.
What do people look like? Changing background Varying shadows Occlusion Deforming clothing Low contrast limb boundaries What do non-people look like?
Key Idea #1 (Rigorous Likelihood) 1. Use the 3D model to predict the location of limb boundaries (not necessarily features) in the scene. 2. Compute various filter responses steered to the predicted orientation of the limb. 3. Compute likelihood of filter responses using a statistical model learned from examples.
Natural Image Statistics * Statistics of image derivatives are non-Gaussian. * Consistent across scale. Ruderman. Lee, Mumford, Huang. Portilla and Simoncelli. Olshausen & Field. Xu, Wu, & Mumford. …
Statistics of Edges. Statistics of filter responses, F, on edges, pon(F), differ from background statistics, poff(F). Likelihood ratio, pon/poff, can be used for edge detection and road following (Geman & Jedynak; Konishi, Yuille, & Coughlan). What about the object-specific statistics of limbs? * edge may be present or not.
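The likelihood-ratio idea on this slide can be illustrated with a small sketch. It assumes filter responses scaled to [0, 1], histogram estimates of pon and poff with add-one smoothing, and made-up training responses; the function names and the toy data are hypothetical and not taken from the talk.

```python
import math

def fit_histogram(responses, n_bins=8, lo=0.0, hi=1.0):
    """Estimate a discrete distribution over filter responses.

    Add-one smoothing (an assumption of this sketch) keeps every bin
    probability non-zero so the ratio below is always defined.
    """
    counts = [1.0] * n_bins
    for r in responses:
        idx = int((r - lo) / (hi - lo) * n_bins)
        idx = max(0, min(idx, n_bins - 1))  # clamp to the valid bin range
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def log_likelihood_ratio(response, p_on, p_off, lo=0.0, hi=1.0):
    """Return log(p_on(F) / p_off(F)) for a single filter response F."""
    n_bins = len(p_on)
    idx = int((response - lo) / (hi - lo) * n_bins)
    idx = max(0, min(idx, n_bins - 1))
    return math.log(p_on[idx] / p_off[idx])

# Toy (invented) training responses: strong on edges, weak off edges.
p_on = fit_histogram([0.7, 0.8, 0.9, 0.85, 0.6])
p_off = fit_histogram([0.1, 0.05, 0.2, 0.15, 0.3])
```

A positive log ratio favors "on edge" and a negative one favors "background"; in practice the distributions would be learned from many labeled image locations.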
Edge Filters. Normalized derivatives of Gaussians (Lindeberg, Granlund and Knutsson, Perona, Freeman & Adelson, …). Edge filter response steered to limb orientation: Filter responses steered to arm orientation.
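Steering a first-derivative filter can be sketched as follows. The sketch relies on the standard steerability property of first derivatives of a Gaussian: the response in any direction is a cosine/sine combination of the horizontal and vertical responses. Taking the derivative perpendicular to the limb axis and using its absolute value are assumptions of this sketch, and the function names are hypothetical.

```python
import math

def steer_first_derivative(fx, fy, theta):
    """Directional derivative at angle theta (radians).

    fx, fy are the responses of the x- and y-derivative-of-Gaussian
    filters at one pixel.
    """
    return math.cos(theta) * fx + math.sin(theta) * fy

def edge_response_for_limb(fx, fy, limb_angle):
    """Edge strength across a limb boundary.

    Assumption: the limb edge runs along limb_angle, so the intensity
    change is measured along the perpendicular direction; the sign is
    dropped because contrast polarity is unknown.
    """
    normal = limb_angle + math.pi / 2.0
    return abs(steer_first_derivative(fx, fy, normal))
```

For example, a vertical limb (angle π/2) with a purely horizontal gradient (fx = 2, fy = 0) gives a response of 2, while the same gradient produces almost no response for a horizontal limb.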
Distribution of Edge Filter Responses pon(F) poff(F)
Contrast Normalization? Lee, Mumford & Huang
Contrast Normalization * Maximize difference between distributions - e.g. Bhattacharyya distance
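The formula from the slide did not survive extraction. For reference, the standard definition of the Bhattacharyya distance between the on-edge and off-edge response distributions is given below; whether the talk used this continuous form or a discrete (histogram) sum is not known.

```latex
B(p_{\mathrm{on}}, p_{\mathrm{off}})
  = -\log \int \sqrt{p_{\mathrm{on}}(F)\, p_{\mathrm{off}}(F)}\; dF
```

The distance is zero when the two distributions coincide and grows as they overlap less, so a contrast normalization that increases it makes limb edges easier to tell apart from background.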
Ridge Features Scale specific
Ridge Filters Relationship between limb diameter in image and scale of maximum ridge filter response.
Brightness Constancy [Images: I(x, t) and I(x+u, t+1)] What are the statistics of brightness variation I(x, t) - I(x+u, t+1)? Variation due to clothing, self shadowing, etc.
Brightness Constancy * Brightness variation is well fit by a t-distribution or Cauchy distribution (heavy tails) * related to robust statistics
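Why heavy tails matter can be seen by comparing the penalty (negative log-likelihood) that a Gaussian and a Cauchy model assign to a large brightness residual. This is a minimal sketch with unit scale parameters chosen only for illustration; the function names are hypothetical.

```python
import math

def gaussian_nll(r, sigma=1.0):
    """Negative log-likelihood of residual r under a zero-mean Gaussian."""
    return 0.5 * (r / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))

def cauchy_nll(r, scale=1.0):
    """Negative log-likelihood of residual r under a zero-mean Cauchy."""
    return math.log(math.pi * scale * (1.0 + (r / scale) ** 2))
```

The Gaussian penalty grows quadratically with the residual, whereas the Cauchy penalty grows only logarithmically, so a few badly matched pixels (for example at a fold in clothing) do not dominate the likelihood. This is the same behavior that robust error functions provide.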
Key Idea #2 (Explain the Image) p(image | foreground, background) * Background: generic, unknown * Foreground: person * Foreground should explain what the background can’t. See also MacCormick and Isard, ICCV’01.
Likelihood Steered edge filter responses crude assumption: filter responses independent across scale.
Inference/Issues. Bayesian formulation: p(model | cues) = p(cues | model) p(model) / p(cues). 1. Need a constraining likelihood model that is also invariant to variations in human appearance. 2. Need a prior model of how people move.
Learning Human Motion * constrain the posterior to likely & valid poses/motions * model the variability. 3D motion-capture data. * Database with multiple actors and a variety of motions (from M. Gleicher). [Plot: joint angles vs. time]
Key Idea #3 (Trade learning for search.) Problem: * insufficient data to learn a prior probabilistic model of human motion. Alternative: * the data represents all we know * replace representation and learning with search. (challenge: search has to be fast)
Texture Synthesis Synthetic Texture “Database” * De Bonet & Viola, Efros & Leung, Efros & Freeman, Pasztor & Freeman, Hertzmann et al, … * Image(s) as an implicit probabilistic model. Efros & Freeman’01
Implicit Probabilistic Model Key idea: probabilistic search (log time) of this tree approximates sampling from p(stored sequence | generated sequence).
Synthesis * Colors indicate different training sequences. * For graphics, we need - editability, constraints (ground contact, pose, interpenetration), key frames, style, …
Tracking * Efficiently generate samples (image data will sort out which are good). * Temperature parameter controls randomness of tree search.
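One common way a temperature parameter controls randomness is Boltzmann (softmax) sampling, sketched below. The slides do not say how temperature enters the tree search, so this form is an assumption, and `sample_with_temperature` and its arguments are hypothetical names.

```python
import math
import random

def sample_with_temperature(scores, temperature, rng):
    """Pick an index with probability proportional to exp(score / temperature).

    Low temperature: nearly always the best-scoring candidate.
    High temperature: close to uniform over candidates.
    """
    best = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((s - best) / temperature) for s in scores]
    u = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return len(weights) - 1  # guard against floating-point round-off

rng = random.Random(0)
cold = [sample_with_temperature([0.0, 1.0, 5.0], 1e-3, rng) for _ in range(50)]
hot = [sample_with_temperature([0.0, 1.0, 5.0], 100.0, rng) for _ in range(200)]
```

Here `cold` contains only the best-scoring index, while `hot` spreads over several candidates; in tracking, the extra spread gives the image likelihood more hypotheses to choose among.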
Bayesian Formulation * Posterior over model parameters given an image sequence. * Likelihood of observing the image given the model parameters. * Temporal model (prior). * Posterior from previous time instant.
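The equation these labels annotate was lost in extraction. A standard recursive Bayesian form that matches the four labeled terms is given below; the symbols are assumed here, with φ_t the model parameters at time t, I_t the image at time t, and κ a normalizing constant.

```latex
p(\phi_t \mid I_{1:t})
  = \kappa \; p(I_t \mid \phi_t)
    \int p(\phi_t \mid \phi_{t-1}) \; p(\phi_{t-1} \mid I_{1:t-1}) \; d\phi_{t-1}
```

Reading left to right: the posterior, the likelihood of the current image, the temporal model (prior), and the posterior from the previous time instant.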
What does the posterior look like? Shoulder: 3 dof. Elbow: 1 dof. [Figure: Elbow bends]
Inference/Issues. Bayesian formulation: p(model | cues) = p(cues | model) p(model) / p(cues). 1. Need a constraining likelihood model that is also invariant to variations in human appearance. 2. Need a prior model of how people move. 3. Need an effective way to explore the model space (very high dimensional) and represent ambiguities.
Key Idea #4 (Represent Ambiguity) * Represent a multi-modal posterior probability distribution over model parameters - sampled representation - each sample is a pose and its probability - predict over time using a particle filtering approach. Samples from a distribution over 3D poses.
Particle Filtering * large literature (Gordon et al ‘93, Isard & Blake ‘96,…) * non-Gaussian posterior approximated by N discrete samples * explicitly represent the ambiguities * exploit stochastic sampling for tracking
Particle Filter [Diagram: Posterior -> sample -> Temporal dynamics -> sample -> Likelihood -> normalize -> Posterior]
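The sample, predict, weight, and normalize loop can be sketched as a generic sampling-importance-resampling step. This is a minimal illustration on a one-dimensional toy state, not the tracker from the talk: the random-walk dynamics, the Gaussian likelihood, and all names are assumptions.

```python
import math
import random

def particle_filter_step(particles, weights, dynamics, likelihood, rng):
    """One update of a sampled posterior.

    particles: list of states; weights: their normalized probabilities.
    dynamics(state, rng) draws a successor state (temporal model).
    likelihood(state) scores a state against the current observation.
    """
    n = len(particles)
    # Sample from the previous posterior in proportion to the weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Sample from the temporal dynamics to predict the next state.
    predicted = [dynamics(p, rng) for p in resampled]
    # Weight each prediction by the likelihood, then normalize.
    new_weights = [likelihood(p) for p in predicted]
    total = sum(new_weights)
    if total <= 0.0:
        return predicted, [1.0 / n] * n  # fall back to uniform weights
    return predicted, [w / total for w in new_weights]

def run_demo(seed=0, n=500, steps=5, observation=2.0):
    """Toy run: 1D state, random-walk dynamics, Gaussian likelihood."""
    rng = random.Random(seed)
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    weights = [1.0 / n] * n
    dynamics = lambda x, r: x + r.gauss(0.0, 0.2)
    likelihood = lambda x: math.exp(-0.5 * ((x - observation) / 0.5) ** 2)
    for _ in range(steps):
        particles, weights = particle_filter_step(
            particles, weights, dynamics, likelihood, rng)
    return particles, weights
```

In the toy run the weighted particles concentrate near the observed value; for body tracking each particle would instead be a full pose vector, and several well-separated clusters of particles would represent the ambiguities discussed earlier.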
Particle Filter Isard & Blake ‘96