Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake. Microsoft Research Cambridge & Xbox Incubation. CVPR 2011 Best Paper.

Presentation Transcript


  1. Real-Time Human Pose Recognition in Parts from Single Depth Images • Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake • Microsoft Research Cambridge & Xbox Incubation • CVPR 2011 Best Paper

  2. OUTLINE • Introduction • Data • Body Part Inference and Joint Proposals • Experiments • Discussion

  3. Introduction • Robust interactive human body tracking • applications in gaming, human-computer interaction, security, telepresence, and health care • Real-time depth cameras • existing systems track from frame to frame but struggle to re-initialize quickly, so they are not robust • Our focus: per-frame initialization, to complement a tracking algorithm • focus on pose recognition in parts • 3D position candidates for each skeletal joint

  4. Introduction • Examples of appropriate tracking algorithms • Tracking people with twists and exponential maps (CVPR 1998) • Tracking loose limbed people (CVPR 2004) • Nonlinear body pose estimation from depth images (DAGM 2005) • Real-time hand-tracking with a color glove (ACM 2009) • Real time motion capture using a single time-of-flight camera (CVPR 2010)

  5. Introduction • Inspired by recent object recognition work that divides objects into parts • Object class recognition by unsupervised scale-invariant learning (CVPR 2003) • The layout consistent random field for recognizing and segmenting partially occluded objects (CVPR 2006) • Two key design goals • computational efficiency • robustness

  6. Introduction • Pipeline: depth image → segment into a dense probabilistic body part labeling, with parts spatially localized near skeletal joints → generate 3D joint proposals

  7. Introduction • We treat the segmentation into body parts as a per-pixel classification task • each pixel is evaluated separately • Training data • generate realistic synthetic depth images • train a deep randomized decision forest classifier • a large and varied training set is used to avoid overfitting

  8. Introduction • Overfitting • Simple, discriminative depth comparison image features • maintaining high computational efficiency

  9. Introduction • For further speed, the classifier can be run in parallel on each pixel on a GPU • Modes of the inferred per-pixel distributions are found with mean shift, resulting in the 3D joint proposals

  10. What is Mean Shift? • A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N • PDF in feature space • color space • scale space • actually any feature space you can conceive • Data → non-parametric density estimation → discrete PDF representation • Data → non-parametric density gradient estimation (mean shift) → PDF analysis

  11. Mean Shift: Intuitive Description • Objective: find the densest region • Distribution of identical billiard balls • Region of interest • Center of mass • Mean shift vector
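
The billiard-ball picture above can be sketched in a few lines: repeatedly move a circular region of interest to the center of mass of the points it contains, until the mean shift vector vanishes. This is a minimal illustration with a flat (uniform) kernel; the function name `mean_shift_flat` and the parameters `radius`, `max_iter`, and `tol` are assumptions made for the example, not names from the slides.

```python
import math

def mean_shift_flat(points, start, radius, max_iter=100, tol=1e-6):
    """Slide a circular region of interest towards the densest nearby region."""
    center = tuple(start)
    for _ in range(max_iter):
        # Points that fall inside the current region of interest.
        inside = [p for p in points if math.dist(p, center) <= radius]
        if not inside:
            break  # empty window: there is nothing to move towards
        # Center of mass of the points inside the window.
        mass = tuple(sum(coord) / len(inside) for coord in zip(*inside))
        # The mean shift vector is (mass - center); stop once it is negligible.
        shift = math.dist(mass, center)
        center = mass
        if shift < tol:
            break
    return center

# Four balls clustered near the origin and one far away.
balls = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
mode = mean_shift_flat(balls, start=(1.0, 1.0), radius=1.5)
```

Starting from (1.0, 1.0), the window only ever sees the four nearby balls, so it settles on their center of mass and ignores the distant one.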

  18. Main contribution • Treat pose estimation as object recognition • using a novel intermediate body parts representation • spatially localize joints • low computational cost and high accuracy

  19. Experiments • (i) synthetic depth training data is an excellent proxy for real data • (ii) scaling up the learning problem with varied synthetic data is important for high accuracy • (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor

  20. Data • Depth imaging and motion capture data • Pose estimation research has often focused on techniques to overcome the lack of training data • Two problems on depth image • color • pose

  21. Depth image • Use real mocap data • retargeted to a variety of base character models • to synthesize a large, varied dataset • 640x480 image at 30 frames per second • Advantages of depth cameras over traditional intensity sensors • working in low light levels • giving a calibrated scale estimate • resolving silhouette ambiguities in pose

  22. Motion capture data • Capture a large database of motion capture (mocap) of human actions • approximately 500k frames • (driving, dancing, kicking, running, navigating menus) • Need not record mocap with variation in rotation about the vertical axis, mirroring left-right, scene position, body shape and size, or camera pose • all of which can be added in (semi-)automatically

  23. Motion capture data • The classifier uses no temporal information • it works on static poses, not motion • changes in pose from one frame to the next are often so small as to be insignificant • similar poses are therefore discarded using a ‘furthest neighbor’ clustering algorithm • the distance between poses p1 and p2 is max_j ||p1_j − p2_j||_2, the maximum Euclidean distance over body joints j • retained poses are at least 5 cm apart
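
As a rough illustration of the pose distance and the 5 cm rule on this slide, the sketch below keeps a pose only if it is at least `min_dist` away from every pose already kept. It is a simplified greedy filter rather than the authors' clustering procedure, and the names `pose_distance`, `subsample_poses`, and `min_dist` are hypothetical. Poses are assumed to be lists of 3D joint positions in meters, with joints in the same order.

```python
import math

def pose_distance(p1, p2):
    """Maximum Euclidean distance over corresponding body joints j."""
    return max(math.dist(a, b) for a, b in zip(p1, p2))

def subsample_poses(poses, min_dist=0.05):
    """Greedily keep poses that are at least min_dist from all kept poses."""
    kept = []
    for pose in poses:
        if all(pose_distance(pose, other) >= min_dist for other in kept):
            kept.append(pose)
    return kept

# Three two-joint poses: B is nearly identical to A, C moves one joint by 20 cm.
pose_a = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pose_b = [(0.01, 0.0, 0.0), (0.0, 1.0, 0.0)]
pose_c = [(0.0, 0.0, 0.0), (0.2, 1.0, 0.0)]
kept = subsample_poses([pose_a, pose_b, pose_c])
```

Because the distance takes the maximum over joints, a pose counts as different as soon as any single joint moves far enough, even if the rest of the body is unchanged.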

  24. Motion capture data • necessary to iterate the process of motion capture • sampling from our model • training the classifier • testing joint prediction accuracy • CMU mocap database

  25. Generating synthetic data • build a randomized rendering pipeline • sample fully labeled training images • Goals • realism and variety

  26. Generating synthetic data • First, randomly sample a set of parameters • Then use standard computer graphics techniques • render depth and body part images • from texture-mapped 3D meshes • Use Autodesk MotionBuilder • slight random variation in height and weight gives extra coverage of body shapes • Other parameters

  27. Generating synthetic data

  28. Body Part Inference and Joint Proposals • Body part labeling • Depth image features • Randomized decision forests • Joint position proposals

  29. Body part labeling • intermediate body part representation • as color-coded • Some directly localize particular skeletal joints • others fill the gaps • transforms the problem into one that can readily be solved by efficient classification algorithms

  30. Body part labeling • The parts are specified in a texture map

  31. Body part labeling • 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer)

  32. Depth image features • f_θ(I, x) = d_I(x + u/d_I(x)) − d_I(x + v/d_I(x)) • d_I(x) is the depth at pixel x in image I • θ = (u, v) describes offsets u and v • normalizing the offsets by 1/d_I(x) ensures the features are depth invariant
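
A small sketch may help make the depth comparison feature concrete. It evaluates f_θ(I, x) = d_I(x + u/d_I(x)) − d_I(x + v/d_I(x)) on a depth image stored as a list of rows. Several details are assumptions for the example: the names `depth_at` and `depth_feature`, the constant `LARGE_DEPTH` returned for probes outside the image, and the rounding of scaled offsets to whole pixels.

```python
LARGE_DEPTH = 1e6  # assumed large constant for probes that leave the image

def depth_at(depth, x, y):
    """Depth at pixel (x, y), or LARGE_DEPTH when the probe is out of bounds."""
    if 0 <= y < len(depth) and 0 <= x < len(depth[0]):
        return depth[y][x]
    return LARGE_DEPTH

def depth_feature(depth, x, y, u, v):
    """Difference of two depth probes whose offsets are scaled by 1/d_I(x)."""
    d = depth[y][x]  # first pixel read: depth at the reference pixel
    ux, uy = x + round(u[0] / d), y + round(u[1] / d)
    vx, vy = x + round(v[0] / d), y + round(v[1] / d)
    # Second and third pixel reads, then one subtraction.
    return depth_at(depth, ux, uy) - depth_at(depth, vx, vy)

# A 3x3 depth image (meters) with one farther pixel to the right of the center.
image = [
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 3.0],
    [2.0, 2.0, 2.0],
]
response = depth_feature(image, x=1, y=1, u=(2.0, 0.0), v=(0.0, 0.0))
```

With the center pixel at depth 2.0, the offset u = (2.0, 0.0) shrinks to one pixel, so the feature compares the pixel to the right with the center itself. The same u at depth 4.0 would shrink to half that distance, which is what keeps the probe pattern fixed relative to the body as it moves away from the camera.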

  33. Depth image features • Individually these features provide only a weak signal • in combination in a decision forest they are sufficient to accurately disambiguate all trained parts

  34. Depth image features • The design of these features was strongly motivated by their computational efficiency • no preprocessing is needed • read at most 3 image pixels • at most 5 arithmetic operations • straightforwardly implemented on the GPU

  35. Randomized decision forests • fast and effective multi-class classifiers • implemented efficiently on the GPU

  36. Randomized decision forests

  37. Randomized decision forests

  38. Joint position proposals • generate reliable proposals for the positions of 3D skeletal joints • the final output of our algorithm • used by a tracking algorithm to self initialize • and recover from failure

  39. Joint position proposals • A local mode-finding approach based on mean shift with a weighted Gaussian kernel • x̂_i is the reprojection of image pixel x_i into world space given depth d_I(x_i) • b_c is a learned per-part bandwidth
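
The mode-finding step can be sketched as mean shift on a weighted Gaussian density of the form f_c(x̂) ∝ Σ_i w_ic exp(−||(x̂ − x̂_i)/b_c||²), where each iteration moves the estimate to the kernel-weighted average of the reprojected pixels. The code below is a minimal sketch under that assumption; `weighted_mean_shift`, `max_iter`, and `tol` are hypothetical names, and a real system would run this per body part from many starting points.

```python
import math

def weighted_mean_shift(points, weights, start, bandwidth, max_iter=100, tol=1e-6):
    """Climb to a local mode of a weighted Gaussian kernel density estimate."""
    x = tuple(start)
    for _ in range(max_iter):
        # Per-point influence: pixel weight times Gaussian kernel response.
        k = [w * math.exp(-(math.dist(x, p) / bandwidth) ** 2)
             for p, w in zip(points, weights)]
        total = sum(k)
        if total == 0.0:
            break  # no point is close enough to influence the estimate
        # New estimate is the influence-weighted average of the points.
        new_x = tuple(sum(ki * p[d] for ki, p in zip(k, points)) / total
                      for d in range(len(x)))
        shift = math.dist(new_x, x)
        x = new_x
        if shift < tol:
            break
    return x

# Two nearby world-space points (meters) for one part, plus a distant outlier.
world_points = [(-0.01, 0.0, 2.0), (0.01, 0.0, 2.0), (1.0, 1.0, 3.0)]
pixel_weights = [1.0, 1.0, 0.1]
mode = weighted_mean_shift(world_points, pixel_weights,
                           start=(0.02, 0.0, 2.0), bandwidth=0.065)
```

The bandwidth controls how far a pixel's influence reaches: at 0.065 m the outlier more than a meter away contributes essentially nothing, so the estimate converges between the two nearby points.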

  40. Non-Parametric Density Estimation • Assumption: the data points are sampled from an underlying PDF • Data point density implies PDF value • [Figure: assumed underlying PDF and real data samples]

  41. Non-Parametric Density Estimation • [Figure: assumed underlying PDF and real data samples]

  42. Non-Parametric Density Estimation • [Figure: assumed underlying PDF and real data samples; what is the PDF value at a query point?]

  43. Parametric Density Estimation • Assumption: the data points are sampled from an underlying PDF • Estimate the parameters of the assumed PDF • [Figure: assumed underlying PDF and real data samples]

  44. Joint position proposals • The pixel weight w_ic considers both the inferred body part probability at the pixel and the world surface area of the pixel
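
One way to combine the two factors is w_ic = P(c | I, x_i) · d_I(x_i)², since the world surface area covered by a pixel grows with the square of its depth. The sketch below computes that weight and reprojects a pixel into world space with a pinhole camera model. The function names and the intrinsics `focal_px`, `cx`, and `cy` are hypothetical values chosen for illustration, not calibration data from the presentation.

```python
def pixel_weight(part_probability, depth):
    """Weight = inferred part probability times squared depth (area proxy)."""
    return part_probability * depth ** 2

def reproject(px, py, depth, focal_px, cx, cy):
    """Back-project pixel (px, py) at the given depth into 3D world space."""
    return ((px - cx) * depth / focal_px,
            (py - cy) * depth / focal_px,
            depth)

# Assumed intrinsics for a 640x480 depth image: 500 px focal length, centered.
FOCAL_PX, CX, CY = 500.0, 320.0, 240.0

near = pixel_weight(part_probability=0.5, depth=1.0)
far = pixel_weight(part_probability=0.5, depth=2.0)
point = reproject(420.0, 240.0, 2.0, FOCAL_PX, CX, CY)
```

The squared-depth factor means a distant body part covering few pixels is not outvoted by a close one covering many, which makes the density estimate roughly depth invariant.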

  45. Joint position proposals • The detected modes lie on the surface of the body • each mode is pushed back into the scene by a learned z offset to produce a final joint position proposal • Bandwidth b_c, probability threshold λ_c, and z offset are set by grid search on a set of 5000 images • bandwidth b_c = 0.065 m • threshold λ_c = 0.14 • z offset = 0.039 m

  46. Joint position proposals

  47. Experiments • Further results are provided in the supplementary material • 3 trees, 20 deep, 300k training images per tree • 2000 training example pixels per image • 2000 candidate features θ • 50 candidate thresholds τ per feature
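
To show how candidate features and thresholds are typically used when growing a decision tree, the sketch below scores every (feature, threshold) pair by information gain and keeps the best one. The slide gives only the candidate counts, so the gain criterion, the names `entropy` and `best_split`, and the data layout are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(feature_values, labels, thresholds):
    """Return (gain, feature index, threshold) with the highest information gain.

    feature_values[k][i] is the response of candidate feature k on example i.
    """
    base = entropy(labels)
    best = (0.0, None, None)
    for k, values in enumerate(feature_values):
        for t in thresholds:
            left = [l for v, l in zip(values, labels) if v < t]
            right = [l for v, l in zip(values, labels) if v >= t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, k, t)
    return best

# Four example pixels, two candidate features, one candidate threshold.
labels = ["hand", "hand", "head", "head"]
responses = [
    [0.1, 0.9, 0.2, 0.8],  # feature 0 mixes the two parts
    [0.1, 0.2, 0.8, 0.9],  # feature 1 separates them cleanly
]
gain, feature_index, threshold = best_split(responses, labels, [0.5])
```

With 2000 candidate features and 50 thresholds each, this inner loop would evaluate 100,000 candidate splits per node, which is why cheap features matter for training as well as for inference.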

  48. Experiments • Test data • challenging synthetic and real depth images to evaluate our approach • synthesize 5000 depth images • Real test set • 8808 frames of real depth images • 15 different subjects • 7 upper body joint positions

  49. Experiments • Error metrics quantify both classification and joint prediction accuracy • Classification accuracy • average of the diagonal of the confusion matrix between the ground truth part label and the most likely inferred part label • Joint prediction accuracy • generate recall-precision curves as a function of confidence threshold • quantify accuracy as average precision per joint
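
The classification metric can be illustrated with a short sketch: build a confusion matrix from ground truth and predicted part labels, normalize each row, and average the diagonal. The function name and the integer label encoding are assumptions for the example; classes with no ground truth pixels are skipped here, which is one possible convention.

```python
def average_per_class_accuracy(true_labels, predicted_labels, num_classes):
    """Mean of the diagonal of the row-normalized confusion matrix."""
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[t][p] += 1  # row = ground truth part, column = inferred part
    per_class = [row[c] / sum(row)
                 for c, row in enumerate(confusion) if sum(row) > 0]
    return sum(per_class) / len(per_class)

# Part 0 has three pixels (two labeled correctly), part 1 has one (correct).
truth = [0, 0, 0, 1]
predicted = [0, 0, 1, 1]
accuracy = average_per_class_accuracy(truth, predicted, num_classes=2)
```

Averaging per class rather than per pixel gives small parts such as hands the same influence on the score as large parts such as the torso.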
