Real-Time Human Pose Recognition in Parts from Single Depth Images. Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchi, Richard Moore, Alex Kipman, Andrew Blake. Microsoft Research Cambridge & Xbox Incubation. CVPR 2011 Best Paper.
OUTLINE • Introduction • Data • Body Part Inference and Joint Proposals • Experiments • Discussion
Introduction • Robust interactive human body tracking • applications: gaming, human-computer interaction, security, telepresence, health-care • Real-time depth cameras • existing systems track from frame to frame but struggle to re-initialize quickly, and so are not robust • Our focus: per-frame initialization, to be combined with a tracking algorithm • pose recognition in parts • 3D position candidates for each skeletal joint
Introduction • appropriate tracking algorithm • Tracking people with twists and exponential maps (CVPR 1998) • Tracking loose limbed people (CVPR 2004) • Nonlinear body pose estimation from depth images (DAGM 2005) • Real-time hand-tracking with a color glove (ACM 2009) • Real time motion capture using a single time-of-flight camera (CVPR 2010)
Introduction • inspired by recent object recognition work that divides objects into parts • Object class recognition by unsupervised scale-invariant learning [CVPR 2003] • The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006] • Two key design goals • Computational efficiency • robustness
Introduction • Pipeline: depth image → segment into a dense probabilistic body part labeling (parts spatially localized near skeletal joints) → generate 3D joint proposals
Introduction • We treat the segmentation into body parts as a per-pixel classification task • Evaluating each pixel separately • Training data • generate realistic synthetic depth images • train a deep randomized decision forest classifier • a large, varied training set avoids overfitting
Introduction • Overfitting • Simple, discriminative depth comparison image features • maintaining high computational efficiency
Introduction • For further speed, the classifier can be run in parallel on each pixel on a GPU • Modes of the inferred per-pixel distributions are found with mean shift, giving the 3D joint proposals
What is Mean Shift? • A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N • PDF in feature space • Color space • Scale space • Actually any feature space you can conceive • Non-parametric density estimation: discrete PDF representation • Non-parametric density gradient estimation (mean shift): PDF analysis
Intuitive Description • Objective: find the densest region • Distribution of identical billiard balls • (figure labels: region of interest, center of mass, mean shift vector)
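The billiard-ball picture can be sketched in a few lines of Python. This is a minimal flat-kernel illustration, not the presenters' code: the function name `mean_shift_mode`, the window `radius`, and the synthetic point clouds are all assumptions made for the example. The region of interest is moved to the center of mass of the points it contains until the mean shift vector vanishes.

```python
import numpy as np

def mean_shift_mode(points, start, radius, tol=1e-6, max_iter=100):
    """Move a circular region of interest uphill to a local density mode."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Points that fall inside the current region of interest.
        inside = np.linalg.norm(points - center, axis=1) <= radius
        if not inside.any():
            break  # empty window: nothing to move toward
        center_of_mass = points[inside].mean(axis=0)
        shift = center_of_mass - center  # the mean shift vector
        center = center_of_mass
        if np.linalg.norm(shift) < tol:
            break
    return center

# Hypothetical data: a dense cluster near (0, 0) and a sparse one near (5, 5).
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.5, size=(300, 2))
sparse = rng.normal(loc=5.0, scale=0.5, size=(30, 2))
balls = np.vstack([dense, sparse])

mode = mean_shift_mode(balls, start=(1.5, 1.5), radius=2.0)
```

Starting beside the dense cluster, the window should settle close to its center; a start near (5, 5) would instead find the sparse cluster's local mode, which is why mean shift is described as a local mode finder.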
Main contribution • Treat pose estimation as object recognition • using a novel intermediate body parts representation • spatially localize joints • low computational cost and high accuracy
Experiments • (i) synthetic depth training data is an excellent proxy for real data • (ii) scaling up the learning problem with varied synthetic data is important for high accuracy • (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor
Data • Depth imaging and Motion capture data • Pose estimation research • often focused on techniques • lack of training data • Two problems on depth image • color • pose
Depth image • Use real mocap data • Retargeted to a variety of base character models • to synthesize a large, varied dataset • 640x480 image at 30 frames per second • Depth cameras > traditional intensity sensors • working in low light levels • giving a calibrated scale estimate • resolving silhouette ambiguities in pose
Motion capture data • capture a large database of motion capture (mocap) of human actions • approximately 500k frames • (driving, dancing, kicking, running, navigating menus) • Need not record mocap with variation in: • rotation about the vertical axis, mirroring left-right, scene position, body shape and size, camera pose • all of which can be added in (semi-)automatically
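Two of the variations listed above, rotation about the vertical axis and left-right mirroring, are simple geometric transforms of a pose. The sketch below is only an illustration under stated assumptions: a pose is a `(J, 3)` array of joint positions, the y axis is taken as vertical, and `swap_pairs` is a hypothetical list of (left, right) joint index pairs. None of these conventions come from the slides.

```python
import numpy as np

def rotate_about_vertical(pose, angle):
    """Rotate a (J, 3) pose about the y axis (assumed vertical) by angle radians."""
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, 0.0, s],
                         [0.0, 1.0, 0.0],
                         [-s, 0.0, c]])
    # Each row of the result is rotation @ joint.
    return np.asarray(pose, dtype=float) @ rotation.T

def mirror_left_right(pose, swap_pairs):
    """Reflect the x coordinate and swap the listed (left, right) joint rows."""
    mirrored = np.array(pose, dtype=float)  # copy, so the input is untouched
    mirrored[:, 0] *= -1.0
    for left, right in swap_pairs:
        mirrored[[left, right]] = mirrored[[right, left]]
    return mirrored
```

Swapping the joint rows after the reflection keeps the labels consistent: the reflected left wrist becomes the right wrist of the mirrored pose.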
Motion capture data • The classifier uses no temporal information • static poses, not motion • changes in pose from one frame to the next are so small as to be insignificant • discard redundant poses using a ‘furthest neighbor’ clustering algorithm • distance between poses $p_1$ and $p_2$: $\max_j \lVert p_1^j - p_2^j \rVert_2$ • $j$ indexes body joints, $p_i$ denotes pose $i$ • keep a subset in which no two poses are closer than 5 cm
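One way to picture the pose subsampling is a greedy farthest-point selection using the maximum per-joint distance. The sketch below is an assumption about how such a filter could be written, not the authors' implementation: `pose_distance`, `subsample_poses`, and the `(N, J, 3)` array layout in metres are illustrative choices.

```python
import numpy as np

def pose_distance(p, q):
    """Maximum Euclidean distance over corresponding joints; p, q are (J, 3)."""
    return np.linalg.norm(p - q, axis=1).max()

def subsample_poses(poses, min_dist=0.05):
    """Greedy farthest-point selection over poses of shape (N, J, 3), in metres.

    Returns indices of kept poses; no two kept poses are closer than min_dist.
    """
    selected = [0]  # arbitrary seed pose
    # Distance from every pose to its nearest kept pose so far.
    nearest = np.array([pose_distance(p, poses[0]) for p in poses])
    while True:
        candidate = int(nearest.argmax())
        if nearest[candidate] < min_dist:
            break  # every remaining pose is within min_dist of a kept pose
        selected.append(candidate)
        d_new = np.array([pose_distance(p, poses[candidate]) for p in poses])
        nearest = np.minimum(nearest, d_new)
    return selected
```

This version costs O(N) distance evaluations per kept pose, which is fine for a demonstration but would need an accelerated search for hundreds of thousands of frames.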
Motion capture data • necessary to iterate the process of motion capture • sampling from our model • training the classifier • testing joint prediction accuracy • CMU mocap database
Generating synthetic data • build a randomized rendering pipeline • sample fully labeled training images • Goals • realism and variety
Generating synthetic data • First: randomly sample a set of parameters • Then use standard computer graphics techniques • render depth and body part images • from texture-mapped 3D meshes • Use Autodesk MotionBuilder • slight random variation in height • and weight gives extra coverage of body shapes • Other parameters
Body Part Inference and Joint Proposals • Body part labeling • Depth image features • Randomized decision forests • Joint position proposals
Body part labeling • intermediate body part representation • as color-coded • Some directly localize particular skeletal joints • others fill the gaps • transforms the problem into one that can readily be solved by efficient classification algorithms
Body part labeling • The parts are specified in a texture map
Body part labeling • 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer)
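Depth image features • $f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$ • $d_I(x)$ is the depth at pixel $x$ in image $I$ • $\theta = (u, v)$ describes offsets $u$ and $v$ • normalizing the offsets by $1/d_I(x)$ ensures the features are depth invariant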
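The depth comparison feature is small enough to write out directly. The following is a sketch under assumptions: pixels are `(row, col)` tuples, offsets are rounded to the nearest pixel, and probes that leave the image return a large constant `BACKGROUND_DEPTH` whose value here is arbitrary. The function names are illustrative.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # assumed large constant for probes outside the image

def probe(depth, pos):
    """Depth at the pixel nearest to pos = (row, col); large constant if outside."""
    r, c = int(round(pos[0])), int(round(pos[1]))
    h, w = depth.shape
    if 0 <= r < h and 0 <= c < w:
        return float(depth[r, c])
    return BACKGROUND_DEPTH

def depth_feature(depth, x, u, v):
    """f_theta(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x))."""
    d = float(depth[x[0], x[1]])
    # Dividing the offsets by the depth at x makes them shrink for far-away bodies.
    xu = (x[0] + u[0] / d, x[1] + u[1] / d)
    xv = (x[0] + v[0] / d, x[1] + v[1] / d)
    return probe(depth, xu) - probe(depth, xv)
```

The function reads the depth image at three positions (at `x` and at the two probes) and needs only a handful of arithmetic operations, which is what makes the feature cheap to evaluate per pixel.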
Depth image features • Individually these features provide only a weak signal • combination in a decision forest • sufficient to accurately • disambiguate all trained parts
Depth image features • The design of these features was strongly motivated by their computational efficiency • no preprocessing is needed • read at most 3 image pixels • at most 5 arithmetic operations • straightforwardly implemented on the GPU
Randomized decision forests • fast and effective multi-class classifiers • implemented efficiently on the GPU
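Forest inference for a single pixel can be sketched as follows: each tree is descended by comparing a feature response with a threshold until a leaf is reached, and the leaf distributions are averaged over the trees. The `Node` layout and the `feature_fn` callback (which returns the feature response for the pixel being classified) are assumptions for illustration, as is the convention of branching left when the response is below the threshold.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence

import numpy as np

@dataclass
class Node:
    theta: Any = None                  # feature parameters (split nodes only)
    tau: float = 0.0                   # threshold (split nodes only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    dist: Optional[np.ndarray] = None  # class distribution (leaf nodes only)

def classify_pixel(forest: Sequence[Node],
                   feature_fn: Callable[[Any], float]) -> np.ndarray:
    """Average the leaf distributions reached in every tree of the forest."""
    total = None
    for root in forest:
        node = root
        while node.dist is None:  # descend until a leaf is reached
            node = node.left if feature_fn(node.theta) < node.tau else node.right
        total = node.dist if total is None else total + node.dist
    return total / len(forest)
```

Because each pixel is classified independently and each tree walk touches only a few pixels, this loop maps naturally onto one GPU thread per pixel.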
Joint position proposals • generate reliable proposals for the positions of 3D skeletal joints • the final output of our algorithm • used by a tracking algorithm to self initialize • and recover from failure
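Joint position proposals • A local mode-finding approach based on mean shift with a weighted Gaussian kernel • density estimator per body part $c$: $f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\lVert \frac{\hat{x} - \hat{x}_i}{b_c} \right\rVert^2\right)$ • $\hat{x}_i$ is the reprojection of image pixel $x_i$ into world space given depth $d_I(x_i)$ • $b_c$ is a learned per-part bandwidth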
Non-Parametric Density Estimation • Assumption: the data points are sampled from an underlying PDF • Data point density implies PDF value • (figure: assumed underlying PDF and real data samples)
Parametric Density Estimation • Assumption: the data points are sampled from an underlying PDF • Estimate the assumed underlying PDF from the real data samples
Joint position proposals • The pixel weighting $w_{ic}$ considers both the inferred body part probability at the pixel and the world surface area of the pixel • $w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$
Joint position proposals • The detected modes lie on the surface of the body • each mode is pushed back into the scene by a learned z offset to produce a final joint position proposal • Parameters optimized per part by grid search on a hold-out validation set of 5000 images • mean bandwidth $b_c$ = 0.065 m • mean probability threshold $\lambda_c$ = 0.14 • mean z offset = 0.039 m
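The proposal step for one body part can be sketched as weighted Gaussian mean shift followed by the z offset. This is a simplified illustration, not the authors' code: it takes already reprojected world-space points as input, starts from the single highest-weight pixel rather than from every pixel above the threshold, and assumes the camera looks along +z so that "into the scene" means increasing z. The function name and defaults are assumptions that reuse the values quoted on the slide.

```python
import numpy as np

def joint_proposal(points, prob, depth, bandwidth=0.065, threshold=0.14,
                   z_offset=0.039, tol=1e-5, max_iter=100):
    """Mode of the weighted density for one part, pushed back by z_offset.

    points: (N, 3) world-space reprojections of the pixels
    prob:   (N,) inferred probability of this part at each pixel
    depth:  (N,) depth of each pixel in metres
    """
    if prob.max() <= threshold:
        return None  # no pixel is confident enough to seed the search
    weights = prob * depth ** 2  # part probability times depth squared
    x = points[np.argmax(weights)].astype(float)
    for _ in range(max_iter):
        sq = np.sum(((points - x) / bandwidth) ** 2, axis=1)
        k = weights * np.exp(-sq)  # weighted Gaussian kernel responses
        new_x = (k[:, None] * points).sum(axis=0) / k.sum()
        moved = np.linalg.norm(new_x - x)
        x = new_x
        if moved < tol:
            break
    # The mode lies on the body surface; move it into the scene along z.
    return x + np.array([0.0, 0.0, z_offset])
```

The depth-squared factor compensates for the fact that a pixel covers more world surface area the farther away it is, so the density estimate does not depend on how close the person stands to the camera.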
Experiments • provide further results in the supplementary material • 3 trees, 20 deep, 300k training images per tree • 2000 training example pixels per image • 2000 candidate features θ • 50 candidate thresholds ζ per feature
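The candidate features and thresholds listed above are what a tree node searches over during training. A common way to choose among them is to maximize the information gain of the split; the sketch below assumes that criterion with Shannon entropy, precomputed feature responses in an `(N, F)` array, and thresholds drawn uniformly over each feature's observed range. The function names and the sampling scheme are illustrative assumptions.

```python
import numpy as np

def entropy(labels, n_classes):
    """Shannon entropy (bits) of the normalized label histogram."""
    counts = np.bincount(labels, minlength=n_classes)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(features, labels, n_classes, n_thresholds=50, rng=None):
    """Return (gain, feature_index, threshold) of the best candidate split.

    features: (N, F) responses of F candidate features on N example pixels
    labels:   (N,) non-negative integer body part labels
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(labels)
    base = entropy(labels, n_classes)
    best = (-np.inf, None, None)
    for f in range(features.shape[1]):
        col = features[:, f]
        for tau in rng.uniform(col.min(), col.max(), size=n_thresholds):
            left = col < tau
            n_left = int(left.sum())
            if n_left == 0 or n_left == n:
                continue  # degenerate split: one side is empty
            children = (n_left * entropy(labels[left], n_classes)
                        + (n - n_left) * entropy(labels[~left], n_classes)) / n
            gain = base - children
            if gain > best[0]:
                best = (gain, f, float(tau))
    return best
```

With 2000 features and 50 thresholds per feature this loop evaluates 100,000 candidates per node, which hints at why training on hundreds of thousands of images calls for a distributed implementation.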
Experiments • Test data • challenging synthetic and real depth images to evaluate our approach • synthesize 5000 depth images • Real test set • 8808 frames of real depth images • 15 different subjects • 7 upper body joint positions
Experiments • Error metrics: • quantify both classification and joint prediction accuracy • Classification accuracy • average of the diagonal of the confusion matrix • between the ground truth part label and the most likely inferred part label • Joint prediction accuracy • generate recall-precision curves as a function of confidence threshold • quantify accuracy as average precision per joint
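The classification metric can be made concrete with a short sketch. It assumes the confusion matrix is row-normalized per ground truth class before its diagonal is averaged, so that every body part counts equally regardless of how many pixels it covers; classes absent from the ground truth are skipped. The function name is an illustrative choice.

```python
import numpy as np

def average_per_class_accuracy(true_labels, predicted_labels, n_classes):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    confusion = np.zeros((n_classes, n_classes))
    # Rows index the ground truth label, columns the most likely inferred label.
    np.add.at(confusion, (true_labels, predicted_labels), 1)
    row_sums = confusion.sum(axis=1)
    present = row_sums > 0  # ignore classes with no ground truth pixels
    per_class = np.diag(confusion)[present] / row_sums[present]
    return float(per_class.mean())
```

For ground truth `[0, 0, 0, 0, 1, 1]` and predictions `[0, 0, 0, 1, 1, 0]`, class 0 scores 3/4 and class 1 scores 1/2, so the metric is 0.625 even though 4 of the 6 pixels are correct overall.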