
Skeleton-based Continuous Gesture Recognition








  1. Skeleton-based Continuous Gesture Recognition • Chi-Tsung Chang

  2. Outline • Introduction • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  3. Introduction • Hand gesture recognition – understanding the meaning of hand motions • Static hand gestures • Dynamic hand gestures • Dynamic hand gesture types: • Presegmented • Continuous (weakly segmented)

  4. Introduction • Dataset – Dynamic Hand Gesture 14/28 (DHG 14/28) • Dynamic hand gestures, captured with an Intel RealSense camera • Contains depth and skeleton data (x, y, z coordinates, obtained via the Intel RealSense SDK) • 14 gestures / 20 subjects / performed 5 times / 2 ways (one finger / whole hand) → 2800 sequences (http://www-rech.telecom-lille.fr/DHGdataset/)

  5. Introduction • Target: use the DHG 14/28 dataset to implement a continuous hand gesture recognition system based on the skeleton • Why skeleton: • the feature fully represents the hand motion • in interactive systems such as AR and VR the hand's pose (skeleton) is usually already known, so this information should be exploited for gesture recognition

  6. Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  7. Recurrent network • Self-loop over an input sequence • Parameters shared through time (“Deep Learning”, Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016, fig. 10.2)
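
For reference, the basic recurrent update has the standard form below (notation roughly following [3]; the exact variant in fig. 10.2 may differ):

  h_t = \tanh(W x_t + U h_{t-1} + b)
  o_t = V h_t + c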

  8. Recurrent Neural Network • Training: backpropagation through time (BPTT) (“Deep Learning”, fig. 10.3)

  9. Long-term dependency • Sometimes two related elements of a sequence are far apart, a long-term dependency (e.g. “the clouds are in the sky”) (“Understanding LSTM Networks”, Christopher Olah)

  10. Long-term dependency challenge • Information explodes or vanishes over time • Gradient problems: gradients vanish or explode, so the network easily converges to having only short-term memory (“Understanding LSTM Networks”, Christopher Olah)

  11. Long short-term memory (LSTM) • Uses gates to control the information flow • Can choose which parts to memorize and which to replace (“Unsupervised Learning of Video Representations using LSTMs”, Nitish Srivastava et al., 2015)
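
The gate structure can be written out as the standard LSTM equations (notation as in [4]; \sigma is the logistic sigmoid, \odot the element-wise product):

  f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
  i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
  \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     (candidate cell state)
  c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (choose what to keep, what to replace)
  o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
  h_t = o_t \odot \tanh(c_t)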

  12. Gated recurrent unit (GRU) • Introduced by Kyunghyun Cho et al., 2014 • Simplification of the LSTM → one update gate controls both input and forget; there is no output gate • A reset gate constructs the dependency between new and old information (“Understanding LSTM Networks”, Christopher Olah)
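
The corresponding GRU equations (again following [4]) show the single update gate z_t doing the work of both the input and forget gates:

  z_t = \sigma(W_z [h_{t-1}, x_t])                      (update gate)
  r_t = \sigma(W_r [h_{t-1}, x_t])                      (reset gate)
  \tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])       (candidate state)
  h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t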

  13. LSTM & GRU • LSTM: separate gates; GRU: combined update gate • GRU: always exposes its whole state as output (no output gate) • LSTM: the new state considers the whole old state; GRU: the reset gate picks only part of the old state

  14. Multilayer RNN • Stack RNN layers one on top of another • Time scale: by stacking RNNs, higher layers can memorize over a longer time scale [1]
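
A minimal NumPy sketch of the stacking idea: the hidden sequence of layer 1 becomes the input sequence of layer 2, so the higher layer operates on an already-summarized, slower-changing signal (all sizes here are arbitrary illustration values):

import numpy as np

def rnn_layer(xs, W, U, b, h0):
    # One vanilla RNN layer unrolled over time: h_t = tanh(W x_t + U h_{t-1} + b)
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d, T = 8, 100                                # feature size and sequence length (arbitrary)
xs = [rng.standard_normal(d) for _ in range(T)]
make = lambda: (0.1 * rng.standard_normal((d, d)),
                0.1 * rng.standard_normal((d, d)),
                np.zeros(d), np.zeros(d))
h1 = rnn_layer(xs, *make())                  # layer 1 sees the raw input
h2 = rnn_layer(h1, *make())                  # layer 2 sees layer 1's hidden states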

  15. Related work 1 • “Skeleton-based Dynamic Hand Gesture Recognition”, CVPR 2016 • Hand-crafted features + SVM • Compares skeleton-based and depth-based approaches

  16. Related work 2 • VIVA Hand Gesture Challenge, 2015 • Intensity + depth

  17. Dynamic hand gesture • “Hand Gesture Recognition with 3D Convolutional Neural Networks”,2015

  18. Related work 3 • Dynamic hand gestures – continuous/weakly segmented • NVIDIA Dynamic Hand Gesture Dataset, 2016

  19. Dynamic hand gestures – continuous/weakly segmented • CTC: sequence-to-sequence learning by maximizing the log-likelihood of the label sequence given the input sequence • Allows the softmax to output a “blank” class

  20. Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  21. Model • Multilayer RNN using LSTM or GRU cells • All layers share the same number of units • 1 fully connected layer (14 output units) at the output • Input: skeleton sequence; output: class probabilities
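
A hedged sketch of the described model in PyTorch (the slides do not name a framework; the hidden size of 150 comes from slide 35, and the input size assumes 22 joints × 3 coordinates as in DHG 14/28):

import torch
import torch.nn as nn

class GestureRNN(nn.Module):
    # Stacked GRU layers with a shared unit count, followed by one fully
    # connected layer with 14 output units (one per gesture class).
    def __init__(self, in_dim=66, hidden=150, layers=3, n_classes=14):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, num_layers=layers, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, in_dim)
        out, _ = self.rnn(x)                     # per-frame hidden states
        return torch.softmax(self.fc(out), -1)   # per-frame class probabilities

model = GestureRNN()
probs = model(torch.randn(2, 100, 66))           # (2, 100, 14)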

  22. Data augmentation • Rotate randomly about the x, y and z axes
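
A sketch of the described augmentation, assuming skeleton sequences shaped (frames, joints, 3); the angle range is an assumption, not stated in the slides:

import numpy as np

def random_rotation(seq, max_deg=15.0, rng=np.random.default_rng()):
    # Draw random rotation angles about the x, y and z axes.
    ax, ay, az = rng.uniform(-np.deg2rad(max_deg), np.deg2rad(max_deg), 3)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    Ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return seq @ (Rz @ Ry @ Rx).T                # rotate every joint in every frame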

  23. Normalization • Normalize relative to the wrist → hand pose information • Keep the wrist's coordinate → tracking information
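
A sketch of this normalization, assuming joint index 0 is the wrist: all joints become relative to the wrist (hand pose), while slot 0 keeps the absolute wrist position (tracking information):

import numpy as np

def wrist_normalize(seq):                        # seq: (frames, joints, 3)
    wrist = seq[:, 0:1, :]                       # wrist position per frame (assumed index 0)
    rel = seq - wrist                            # pose relative to the wrist
    rel[:, 0, :] = wrist[:, 0, :]                # keep the absolute wrist coordinate
    return rel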

  24. Hyperparameters vs. performance • 90% training / 10% testing • Hyperparameters have only a limited influence on the performance

  25. Continuous gesture recognition • Method: sliding window and threshold • Criteria: recall / precision / F1 score [Figure: a video timeline labelled over time with gestures such as Swipe X, Swipe left, Swipe up, Grab]
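
For reference, the F1 score used here is the harmonic mean of precision and recall:

  F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}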

  26. Continuous gesture recognition • 12 sequences, each containing 4–8 gestures • Total possible combinations = 14 × 14 = 196; currently only about 50 cases are covered • Skeleton input only

  27. Result • x axis: sliding window size • Low precision: caused by the sub-gesture problem

  28. Output vs. Ground Truth

  29. Sub-gesture problem • Output probabilities over the presegmented input [Figure: x axis: percentage of the sequence, y axis: probability]

  30. Range problem • Being too close or too far (tracking information) influences the output • Fix: use the relative displacement of the wrist track

  31. Single model: relative vs. absolute (numbers indicate augmentation) • Relative displacement performs better at the best recall and the best precision • But the F1 score is mainly influenced by precision

  32. Absolute vs. Relative

  33. Single model: 3 layers vs. 2 layers • 2 layers: best recall: MAX: 25, rate: 0.828; best precision: MAX: 35, rate: 0.269; best F1: MAX: 27, rate: 0.396 • 3 layers: best recall: MAX: 23, rate: 0.818; best precision: MAX: 37, rate: 0.297; best F1: MAX: 37, rate: 0.410

  34. RNN forest • Train 3 models and average their predictions • The training sets are identical, but the grouping into minibatches differs • Each model: 1 input → 2-layer RNN (GRU) • Slightly better performance on the same test set (93.5% ~ 94.5%) [Figure: three 2-layer RNNs trained on the whole training set; their outputs are averaged]
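
A minimal sketch of the averaging step, reusing the GestureRNN sketch from slide 21 (hypothetical; in practice each model would be trained separately on differently grouped minibatches):

import torch

models = [GestureRNN() for _ in range(3)]        # three independently trained models

def forest_predict(x):
    # Average the per-frame class probabilities of the three models.
    with torch.no_grad():
        return torch.stack([m(x) for m in models]).mean(dim=0)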

  35. Forest vs. single • Forest: 3 × 2-layer RNN; single: 1 × 3-layer RNN; all 150 units per layer • The forest performs more stably, but of course takes more time • Forest (relative): best recall: MAX: 19, rate: 0.766; best precision: MAX: 35, rate: 0.397; best F1: MAX: 29, rate: 0.480 • Single: best recall: MAX: 23, rate: 0.818; best precision: MAX: 37, rate: 0.297; best F1: MAX: 37, rate: 0.410

  36. Averaging the probabilities • Improvement: F1 0.4216 → 0.5 • False alarms decrease (precision 0.278 → 0.375) • Recall decreases 0.778 → 0.75

  37. Problems to solve • Precision is still too low → sub-gesture problem → caused by having to output a probability for every class at every time step • Should only output a definite result at specific times

  38. Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  39. Sequence-to-sequence labelling • Output label sequence length ≤ input sequence length • No need to output at a specific time; output when the answer is clear • CTC – Connectionist Temporal Classification

  40. A path-to-path mapping • Introduces a “blank” label • Mapping: F(--A-B-) = AB, F(-AB---) = AB, F(A---B-) = AB • The “blank” makes it possible to output no class during transitions, silence, etc. (Fig. from [2])
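
The mapping F can be written out directly: collapse repeated labels, then remove blanks (a sketch; '-' stands for the blank):

def ctc_collapse(path, blank='-'):
    # F from [2]: merge consecutive repeats, then drop blanks.
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return ''.join(out)

assert ctc_collapse('--A-B-') == 'AB'
assert ctc_collapse('-AB---') == 'AB'
assert ctc_collapse('A---B-') == 'AB'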

  41. Calculating the likelihood • From the input to the output distribution • From all possible output paths to the label sequence → the objective function
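
In the notation of [2], with y^t_k the softmax output for class k at time t and F the mapping above, the objective is the negative log of:

  p(\pi \mid x) = \prod_{t=1}^{T} y^t_{\pi_t}
  p(l \mid x) = \sum_{\pi \in F^{-1}(l)} p(\pi \mid x)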

  42. Forward–backward algorithm • Considering every possible output path directly → too many combinations for long input sequences (Fig. from [2])

  43. Forward–backward algorithm • Dynamic programming: compute the likelihood recursively • Forward and backward variables • Used for calculating the loss w.r.t. each time step's output
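
The forward recursion from [2], on the extended label sequence l' (blanks inserted between every pair of labels and at both ends), has the form:

  \alpha_t(s) = \big( \alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2) \big) \, y^t_{l'_s}

where the \alpha_{t-1}(s-2) term is dropped whenever l'_s is blank or l'_s = l'_{s-2}, and the total likelihood is p(l \mid x) = \alpha_T(|l'|) + \alpha_T(|l'| - 1). The backward variables \beta_t(s) satisfy the mirror-image recursion and, combined with the \alpha's, give the gradient of the loss w.r.t. each time step's output.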

  44. Training • Slow (140k iterations with 32 samples per batch) and converges at a poor performance point (negative log-likelihood ≈ 1.5, likelihood ≈ 0.2213)

  45. Batch normalization • Normalize the mean and variance of each layer's inputs [5]
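
The transform from [5], per feature over a minibatch B, with learned scale \gamma and shift \beta:

  \hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta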

  46. Improved result • Might need more training time, but the difference is already obvious

  47. Output problem • Ideally the output should be mostly blank, with a few spikes for gestures • Instead it starts with an arbitrary class at high probability • That class may be the predicted result for the whole sequence, so the CTC loss cannot go down

  48. Conclusion and open problems • The sliding window leaves a serious problem with the variable lengths of different gestures • The sub-gesture problem is critical for precision (false alarms) • CTC should be able to solve this by providing a “blank” class • Problem 1: the loss is still too high (0.58); more training time? A different structure? • Problem 2: blank dominates all time steps

  49. References • [1] “Training and Analyzing Deep Recurrent Neural Networks” • [2] “Supervised Sequence Labelling with Recurrent Neural Networks”, Alex Graves • [3] “Deep Learning”, Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016 • [4] “Understanding LSTM Networks”, Christopher Olah • [5] “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Sergey Ioffe and Christian Szegedy, 2015
