
eNTERFACE’10 Vision Based Hand Puppet



Presentation Transcript


  1. eNTERFACE’10 Vision Based Hand Puppet Final Presentation

  2. Project Objective • To develop a multimodal interface to manipulate the low- and high-level aspects of 3D hierarchical digital models. • The hands and the face of possibly separate performers will be tracked. • Their gestures and facial expressions will be recognized. • Estimated features will be mapped to digital puppets in real time.

  3. Project Objective • The project involves: • Tracking of both hands (background segmentation, skin color filtering, particle filtering) • Hand pose estimation (dimensionality reduction) • Gesture and expression recognition (hidden semi-Markov models, keyframe classification) • Facial parameter tracking (active appearance models, filtering) • Visualization (skeleton model, inverse kinematics, physics) • Networking (XML)

  4. Work Packages • WP1: • Data collection and ground-truth creation for pose estimation module • WP2: • Hand posture modeling and training • WP3: • Stereovision based hand tracking • WP4: • Vision based facial expression tracking

  5. Work Packages • WP5: • Gesture and expression spotting and recognition • WP6: • Skeleton and 3D model generation • WP7: • Development of the graphics engine with skeletal animation support • WP8: • Network Protocol design and module development

  6. Flowchart of the System • 3D Motion Tracking Module, Network Protocol (Low Level), Visualization, Face Feature Tracker, Spotter & Classifier, Network Protocol (High Level), Pose Estimator

  7. WP1: Data collection and ground-truth creation for pose estimation • PROBLEM: Hand pose estimation requires annotated images for training. • Each hand pose must be known exactly, which is not possible without special devices such as data gloves. • As manual annotation would require a lot of work, we create the training images synthetically. • Poser: software that can manipulate skeleton-rigged models and render photorealistic images via Python scripts.

  8. WP1: Data collection and ground-truth creation for pose estimation • Poser • Imitates a stereo camera setup and produces photorealistic renders • Automatically generates the silhouettes from the rendered images • Allows Python scripts to manipulate any parameter of the scene • A single script can generate an entire multicamera dataset

  9. WP1: Data collection - Methodology • We iteratively increase the complexity of the visualized hand, starting with an open hand • Start with 3 degrees of freedom: • Side to side (waving) • Bend (up down) • Twist • We created 2x1000 images for training • Created for a certain stereo camera setup • Manipulated each d.o.f. in sequence • Extracted the silhouettes • Saved along with generating parameters
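
The dataset generation can be scripted end to end. Below is a minimal Python sketch of such a loop, assuming a helper render_stereo_silhouettes() that stands in for the actual Poser Python rendering calls (it is a placeholder here, not Poser's API); the d.o.f. ranges match those listed for WP2.

```python
import itertools
import numpy as np

def render_stereo_silhouettes(side, bend, twist):
    """Placeholder for the Poser Python calls that would set the three wrist
    d.o.f. and render one 80x80 binary silhouette per camera."""
    return np.zeros((80, 80), np.uint8), np.zeros((80, 80), np.uint8)

# Sweep the three degrees of freedom (side-to-side, bend, twist),
# 10 steps each -> 1000 poses, rendered by 2 cameras.
sides  = np.linspace(0, 90, 10)
bends  = np.linspace(-90, 90, 10)
twists = np.linspace(-60, 60, 10)

silhouettes, labels = [], []
for side, bend, twist in itertools.product(sides, bends, twists):
    left, right = render_stereo_silhouettes(side, bend, twist)
    silhouettes.append(np.concatenate([left.ravel(), right.ravel()]))
    labels.append((side, bend, twist))

# Save silhouette vectors together with the generating parameters (ground truth).
np.savez("hand_dataset.npz", X=np.asarray(silhouettes), y=np.asarray(labels))
```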

  10. WP1: Data collection - Conclusion • Using Poser to generate training images is a very efficient method • Can potentially create a very large database in a few hours • It’s very simple to create any multicamera setup • Each extrinsic and intrinsic camera parameter can be set via python script • Automatically extracts the silhouettes • Provides high level manipulation parameters for the body parts • e.g. grasping and spreading for the hand

  11. WP2: Hand Pose Estimation • AIM: • To estimate hand skeleton parameters using hand silhouettes from multiple cameras • IDEA: • Use dimensionality reduction to map silhouettes to a space with much lower dimensionality • When an unknown silhouette arrives • Search for the closest known point in the reduced space

  12. WP2: Manifold Learning with Sparse-GPLVM • Poser’s hand model is used for rendering. • Hand silhouette images (80x80) are rendered. • 80x80 = 6400-dimensional silhouette vector. • 1000 training samples per camera have been captured by iterating over x, y and z with a Python script, for x ∈ [0°, 90°], y ∈ [−90°, +90°], z ∈ [−60°, +60°]. • 2 cameras placed orthogonal to each other are simulated: • 2000 training samples are collected

  13. WP2: Preprocessing GPLVM and Learning Forward Mapping with NNs • GPLVM is a non-linear probabilistic PCA. • For additional speed gains, a conventional PCA has been applied as a preprocessing step, capturing 99% of the total variance. • GPLVM is applied afterwards; this made the optimization process ~4 times faster. • GPLVM finds a backward mapping from the latent space (2D) to the PCA feature space (~250D for 99% variance). • For fast generation of initial search points, a forward mapping from feature space to latent space is trained using a NN with 15 hidden neurons.
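
A minimal sketch of this preprocessing and forward mapping, assuming scikit-learn is available; the GPLVM optimization itself is not shown, and the latent coordinates Z are assumed to have been produced by it (random stand-ins below):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# X: 2000 x 6400 silhouette vectors; Z: 2000 x 2 latent coordinates produced
# by the (sparse) GPLVM optimization -- random stand-ins for illustration.
X = np.random.rand(2000, 6400)
Z = np.random.rand(2000, 2)

# Preprocessing: linear PCA keeping 99% of the variance (~250 components),
# which makes the subsequent GPLVM optimization considerably faster.
pca = PCA(n_components=0.99)
F = pca.fit_transform(X)

# Forward mapping (PCA feature space -> 2D latent space), used to generate
# initial search points quickly: a small network with 15 hidden neurons.
forward = MLPRegressor(hidden_layer_sizes=(15,), max_iter=2000)
forward.fit(F, Z)

def to_latent(silhouette_vector):
    """Map one 6400-D silhouette vector to its 2-D latent coordinate."""
    return forward.predict(pca.transform(silhouette_vector.reshape(1, -1)))[0]
```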

  14. WP2: Hand Pose Estimation - Flowchart • Capture frame → find foreground → filter skin colors → extract silhouette → nearest-neighbor classifier, using PCA, the NN from PCA space to latent space, and the GPLVM 2D latent space per camera

  15. WP2: Classification • A smooth 2-dimensional latent space has been found by the GPLVM optimization. • Therefore, a nearest-neighbor matcher has been used as the classifier in the latent space. • The ground-truth angles of the hand poses are known, and an exact pose match is looked for; any divergence from the exact angles is counted as a classification error. • For the synthetic environment prepared by Poser, a classification performance of 94% has been reached in the 2D latent space.
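
A sketch of the latent-space matcher, again assuming scikit-learn; the training latent coordinates and their generating joint angles are stand-ins for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Z_train: the 2x1000 training silhouettes mapped to the 2D latent space;
# y_train: their generating joint angles (side, bend, twist) -- stand-ins here.
Z_train = np.random.rand(2000, 2)
y_train = np.random.rand(2000, 3)

matcher = NearestNeighbors(n_neighbors=1).fit(Z_train)

def estimate_pose(z_query):
    """Return the joint angles of the closest training pose in latent space."""
    _, idx = matcher.kneighbors(np.atleast_2d(z_query))
    return y_train[idx[0, 0]]
```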

  16. WP3: Stereovision based hand tracking • Objective • Obtain the 3D position of the hands • Enable real-time low noise tracking of robust features for gesture recognition tasks • Track some intuitive spatial parameters to map them directly to the puppet • Approach • skin color as main cue for hand location • stereo camera to obtain 3D information • Particle-filtering for robust tracking

  17. WP3: Stereovision based hand tracking

  18. WP3: Stereovision based hand tracking • Skin-color segmentation • Bayesian color model • Chromatic color space (HS) • Train color model on image regions obtained from a face tracker
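
A sketch of how such a histogram-based Bayesian skin model over the chromatic (H, S) plane could look with OpenCV and NumPy; the function names and the equal-prior simplification are assumptions of this sketch, not the project's code:

```python
import cv2
import numpy as np

def train_skin_model(hsv_skin, hsv_background, bins=32):
    """Histogram-based Bayesian skin model over the chromatic (H, S) plane.
    hsv_skin / hsv_background: N x 3 arrays of HSV pixels taken from the
    face-tracker regions and from the rest of the image, respectively."""
    rng = [[0, 180], [0, 256]]                 # OpenCV hue is 0..179, sat 0..255
    h_skin, _, _ = np.histogram2d(hsv_skin[:, 0], hsv_skin[:, 1],
                                  bins=bins, range=rng)
    h_bg, _, _ = np.histogram2d(hsv_background[:, 0], hsv_background[:, 1],
                                bins=bins, range=rng)
    # P(skin | H, S) by Bayes' rule, assuming equal priors for simplicity.
    return h_skin / (h_skin + h_bg + 1e-6)

def skin_probability(frame_bgr, posterior, bins=32):
    """Per-pixel skin probability map for one camera frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h = (hsv[..., 0].astype(int) * bins) // 180
    s = (hsv[..., 1].astype(int) * bins) // 256
    return posterior[h, s]
```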

  19. WP3: Stereovision based hand tracking • Particle filtering (CONDENSATION) • Initialization (Midterm result) • Biggest skin colored blob is assumed to be the hand • Stereo-matching to obtain 3D hand location • Tracking (new) • Color cue: accumulated skin color probability weighted by percentage of skin colored pixels in particle ROI • Depth cue: deviation of ROI disparity from disparity value implied by particle location
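
A simplified NumPy sketch of the CONDENSATION-style predict/weight/resample cycle with the two cues described above; the particle state (x, y, disparity), the ROI size, and the noise parameters are illustrative assumptions:

```python
import numpy as np

# Each particle is (x, y, disparity) in the left camera image.

def predict(particles, noise=(5.0, 5.0, 1.0)):
    """Diffuse particles with Gaussian process noise (constant-position model)."""
    return particles + np.random.randn(*particles.shape) * noise

def weight(particles, skin_prob_map, disparity_map, roi=15):
    """Color cue: accumulated skin probability in the particle ROI, weighted by
    the fraction of skin pixels. Depth cue: deviation of the ROI disparity from
    the disparity implied by the particle state."""
    h, w_img = skin_prob_map.shape
    w = np.empty(len(particles))
    for i, (x, y, d) in enumerate(particles):
        xi = int(np.clip(x, 0, w_img - 1))
        yi = int(np.clip(y, 0, h - 1))
        patch = skin_prob_map[max(yi - roi, 0):yi + roi, max(xi - roi, 0):xi + roi]
        color_cue = patch.sum() * (patch > 0.5).mean() if patch.size else 0.0
        depth_cue = np.exp(-0.5 * ((disparity_map[yi, xi] - d) / 2.0) ** 2)
        w[i] = color_cue * depth_cue
    return w / (w.sum() + 1e-12)

def resample(particles, weights):
    """Multinomial resampling proportional to the particle weights."""
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```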

  20. WP3: Stereovision based hand tracking

  21. WP4: Vision-based Emotion Recognition • AIM: • To enable the digital puppet to imitate facial emotions. • To change the digital puppet’s state using facial expressions. • METHODOLOGY: • Active shape model based facial landmark tracker • Track all shape parameters (a set of points) • Extract useful features manually • Eyebrow leverage, lip width, etc. • Classify features • Using HSMM • Using a nearest neighbor classifier

  22. WP4: Vision-based Emotion Recognition • The ASM is trained using an annotated set of face images. • The search starts from the mean shape aligned to the face located by a global face detector. • The following steps are repeated until convergence: • Adjust the locations of the shape points by template matching of the image texture around each landmark and propose a new shape • Conform this new shape to a global shape model (based on PCA).
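
The two alternating steps can be summarized in a short sketch; propose_landmark() is a placeholder for the local template matching, and the shape model is assumed to be given by its PCA mean, eigenvectors, and eigenvalues:

```python
import numpy as np

def conform_to_shape_model(shape, mean_shape, eigvecs, eigvals, k=3.0):
    """Project a proposed shape onto the PCA shape model, clamping every mode
    to +/- k standard deviations so the result stays a plausible face shape."""
    b = eigvecs.T @ (shape - mean_shape)
    b = np.clip(b, -k * np.sqrt(eigvals), k * np.sqrt(eigvals))
    return mean_shape + eigvecs @ b

def asm_search(image, mean_shape, eigvecs, eigvals, propose_landmark, iters=20):
    """Alternate local template matching and shape-model regularization,
    starting from the mean shape aligned to the global face detection."""
    shape = mean_shape.copy()
    for _ in range(iters):
        points = shape.reshape(-1, 2)
        proposed = np.array([propose_landmark(image, p) for p in points]).ravel()
        new_shape = conform_to_shape_model(proposed, mean_shape, eigvecs, eigvals)
        if np.linalg.norm(new_shape - shape) < 1e-3:   # convergence check
            return new_shape
        shape = new_shape
    return shape
```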

  23. Real-time Vision-based Emotion Recognition • Video frame captured from an ordinary webcam → facial landmark tracking based on Active Shape Models (both generic and person-specific models) → feature extraction based on intensity changes in specific face regions and on distances between specific landmarks → emotion recognition over the six universal expressions (happiness, sadness, surprise, anger, fear, disgust)

  24. WP4: Vision-based Emotion Recognition

  25. WP4: Vision-based Emotion Recognition – Nearest Neighbor

  26. WP5: Gesture and expression spotting and recognition • Has to “spot” gestures and expressions in continuous streams • No start-end signals • Should recognize only when a command is over • Should run in real time • We use hidden semi-Markov models • Inhomogeneous explicit duration models

  27. WP5: Gesture and expression spotting and recognition • HMM vs. HSMM (model structure diagrams)

  28. WP5: Gesture and expression spotting and recognition • Why HSMMs? • HMMs model duration lengths only implicitly • Using self transition probabilities • Imposes geometric distribution on each duration • Variance-mean correlated • High mean – high variance • Geometric distribution and/or high variance do not conform to every application • Speech, hand gestures, expressions, ... • HSMMs explicitly model durations • HMMs are a special case of HSMMs
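
The duration argument can be made concrete: a state with self-transition probability a stays for d frames with probability (1 − a)·a^(d−1), i.e. a geometric distribution whose variance is tied to its mean. A small NumPy illustration (the numeric values are arbitrary examples):

```python
import numpy as np

a = 0.9                                    # self-transition probability of a state
d = np.arange(1, 61)                       # possible durations in frames
p_hmm = (1 - a) * a ** (d - 1)             # implicit HMM duration model: geometric
mean, var = 1 / (1 - a), a / (1 - a) ** 2  # mean 10 frames but variance 90
print(f"geometric duration: mean={mean:.1f} frames, variance={var:.1f}")

# An HSMM instead stores an explicit duration distribution per state, e.g. a
# discretized Gaussian centered on the typical gesture length (10 frames here).
p_hsmm = np.exp(-0.5 * ((d - 10) / 2.0) ** 2)
p_hsmm /= p_hsmm.sum()
```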

  29. WP5: Gesture and expression spotting and recognition • Training module • Developed in MATLAB – no real time requirement • Yet very fast, does not require too many samples • Previously experimented with 25 hand gestures and continuous streams • Achieved 99.7% recognition rate • For this project, also experimented with facial expressions • Six expressions • Long continuous streams for training – not annotated • Results look good, no numerical results due to lack of ground truth (future work)

  30. WP5: Gesture and expression spotting and recognition • Recognition module • Converted to an on-line algorithm • Uses the recent history to determine the current state, using Viterbi on a large HSMM • As expressions are independent, this does not introduce much error (about 1.5% of frames misclassified) • Runs in real time in MATLAB (not ported to C++ yet) • Performance analysis • Most of the error is attributable to • Noise • Global head motion • A rather weak vector quantization method

  31. WP5: Gesture and expression spotting and recognition • Preliminary results

  32. WP6: Skeleton and 3D model generation • We have utilized a skeletal animation technique. • The skeleton is predetermined and consists of 16 joints and accompanying bones.
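
A minimal sketch of how such a hierarchical skeleton could be represented; the joint names and offsets below are illustrative, not the project's actual 16-joint definition:

```python
from dataclasses import dataclass, field

@dataclass
class Joint:
    """One joint of the predetermined hierarchical skeleton."""
    name: str
    offset: tuple = (0.0, 0.0, 0.0)         # bone offset from the parent joint
    rotation: tuple = (0.0, 0.0, 0.0)       # local joint angles, updated per frame
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

# Illustrative subset of the 16-joint puppet skeleton.
root  = Joint("pelvis")
spine = root.add(Joint("spine", offset=(0.0, 0.2, 0.0)))
head  = spine.add(Joint("head", offset=(0.0, 0.3, 0.0)))
arm   = spine.add(Joint("left_upper_arm", offset=(-0.2, 0.25, 0.0)))
fore  = arm.add(Joint("left_forearm", offset=(-0.3, 0.0, 0.0)))
hand  = fore.add(Joint("left_hand", offset=(-0.25, 0.0, 0.0)))
```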

  33. WP7: Development of the graphics engine • Supports skeletal animation for the predetermined skeleton • Reads skeleton parameters at each frame from incoming command files • Applies the parameters to the model in real-time • Allows different models to be bound to the skeleton • Same skeleton can be bound to different 3D models • Supports inverse kinematics • Allows absolute coordinates as commands • Supports basic physics (gravity) • Allows forward kinematics via forces

  34. WP7: Development of the graphics engine • Forward Kinematics: • "Given the angles at all of the robot's joints, what is the position of the hand?" • Inverse Kinematics: • "Given the desired position of the robot's hand, what must be the angles at all of the robot's joints?" • Cyclic-Coordinate Descent (CCD) Algorithm for IK • Traverse the linkage from the distal joint inwards • Optimally set one joint at a time • Update the end effector with each joint change • At each joint, minimize the difference between the end effector and the goal
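
A compact planar (2D) sketch of CCD, which captures the traversal and per-joint update described above; the real engine works on the 3D skeleton:

```python
import numpy as np

def fk(angles, lengths):
    """Forward kinematics of a planar chain: joint positions from joint angles."""
    pts, theta, p = [np.zeros(2)], 0.0, np.zeros(2)
    for a, l in zip(angles, lengths):
        theta += a
        p = p + l * np.array([np.cos(theta), np.sin(theta)])
        pts.append(p)
    return pts                                # pts[-1] is the end effector

def ccd_ik(angles, lengths, goal, iters=50, tol=1e-3):
    """Cyclic-coordinate descent: traverse the chain from the distal joint
    inwards, rotating one joint at a time so the end effector approaches goal."""
    angles = np.array(angles, float)
    goal = np.asarray(goal, float)
    for _ in range(iters):
        for j in reversed(range(len(angles))):
            pts = fk(angles, lengths)
            to_effector = pts[-1] - pts[j]
            to_goal = goal - pts[j]
            # The optimal update for this single joint is the angle between
            # the joint-to-effector and joint-to-goal vectors.
            angles[j] += (np.arctan2(to_goal[1], to_goal[0])
                          - np.arctan2(to_effector[1], to_effector[0]))
        if np.linalg.norm(fk(angles, lengths)[-1] - goal) < tol:
            break
    return angles

# Example: a 3-bone arm reaching for a point.
print(ccd_ik([0.1, 0.1, 0.1], [0.3, 0.3, 0.2], goal=[0.4, 0.4]))
```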

  35. WP7: Development of the graphics engine

  36. WP7: Development of the graphics engine • Future Work • Implement and optimize CCD (90% complete) • Load geometry data from Autodesk FBX files • Advanced shading for puppets, e.g. fur • Rig to multiple models • Choose and implement a convenient method for visualizing face parameters and expressions

  37. WP8: Network Protocol design and module development • The “visualization computer” acts as a server and listens to the other computers • Accepts binary XML files • Works over TCP/IP • The XML is parsed and the parameters are extracted. • Each packet may contain several parameters and commands • Either low-level joint angles as a set • Or a high-level command, such as a new gesture or expression

  38. WP8: Network Protocol design and module development • Threaded TCP/IP Server • Binary XML, following this schema:

  <?xml version="1.0" encoding="UTF-8" ?>
  <handPuppet timeStamp="str" source="str">
    <paramset>
      <H rx="f" ry="f" rz="f" />
      <ER ry="f" rz="f" />
      <global tx="f" ty="f" tz="f" rx="f" ry="f" rz="f" />
    </paramset>
    <anim id="str" />
    <emo id="str" />
  </handPuppet>
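
A minimal sketch of the server side using Python's standard library; the slides specify binary XML, whereas this example parses plain UTF-8 XML for illustration, and the port number and one-packet-per-connection framing are assumptions:

```python
import socketserver
import xml.etree.ElementTree as ET

class PuppetHandler(socketserver.StreamRequestHandler):
    """One handPuppet packet per connection: read it, parse it, extract commands."""
    def handle(self):
        packet = self.rfile.read()                    # raw XML payload until EOF
        root = ET.fromstring(packet.decode("utf-8"))
        # Low-level parameters: head / eyebrow / global pose angle sets.
        for paramset in root.findall("paramset"):
            for elem in paramset:
                values = {k: float(v) for k, v in elem.attrib.items()}
                print("joint parameters", elem.tag, values)
        # High-level commands: trigger a canned animation or an emotion.
        for anim in root.findall("anim"):
            print("play animation", anim.get("id"))
        for emo in root.findall("emo"):
            print("set emotion", emo.get("id"))

if __name__ == "__main__":
    # Port number is an arbitrary choice for this sketch.
    with socketserver.ThreadingTCPServer(("0.0.0.0", 9000), PuppetHandler) as server:
        server.serve_forever()
```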

  39. Conclusion • Individual modules • Most of the modules nearly complete • Final application • Tracked features not bound to skeleton parameters • Model skin and animations missing • New ideas that emerged during the workshop • Estimate forward mapping for GPLVMs using NNs • Use HSMMs for facial expressions • Fit 3D ellipse to the 3D point cloud of the hand • Extract manual features such as edge activity on the forehead

  40. Future Work • Once hand tracking is complete, gestures will be trained using HSMMs • All MATLAB code will be ported to C++ (mostly OpenCV) • Hand pose complexity will be increased gradually, as long as real-time performance can be maintained • Inverse kinematics will be fully implemented • A face model capable of showing emotions will be incorporated into the 3D model for easy visualization
