Implementing a system for continuous hand gesture recognition based on skeleton data from the Dynamic Hand Gesture 14/28 dataset. Utilizing LSTM/GRU networks and advanced methods like CTC to capture hand motion and improve interaction systems in AR and VR environments.
Skeleton-based Continuous Gesture Recognition • Chi-Tsung Chang
Outline • Introduction • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Introduction • Hand gesture recognition – interpreting the meaning of hand motions • Static hand gestures • Dynamic hand gestures • Dynamic hand gesture types • Presegmented • Continuous (weakly segmented)
Introduction • Dataset – Dynamic Hand Gesture 14/28 dataset • Dynamic hand gestures captured with an Intel RealSense camera • Contains depth and skeleton data (x, y, z coordinates from the Intel RealSense SDK) • 14 gestures / 20 subjects / performed 5 times / 2 ways (one finger / whole hand) = 2800 sequences (http://www-rech.telecom-lille.fr/DHGdataset/)
Introduction • Target: use the DHG 14/28 dataset to implement a skeleton-based continuous hand gesture recognition system • Why skeleton: • the skeleton features fully represent the hand motion • in interaction systems for environments like AR and VR the hand pose (skeleton) is usually already known, so this information should be used for gesture recognition
Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Recurrent network • 1. Self-loop over an input sequence • 2. Parameters are shared through time ("Deep learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016, fig 10.2)
Recurrent Neural Network • Training: backpropagation through time (BPTT) ("Deep learning", fig 10.3)
Long term dependency • Sometimes two elements of a sequence are related across a long span, i.e. a long-term dependency (e.g. "the clouds are in the sky") ("Understanding LSTM Networks", Christopher Olah)
Long term dependency challenge • Information explodes or vanishes over time • Gradient problems: gradients vanish or explode • The network therefore easily converges to having only short-term memory ("Understanding LSTM Networks", Christopher Olah)
Long short-term memory (LSTM) • Uses gates to control the information flow • Can choose which parts are memorized or replaced ("Unsupervised Learning of Video Representations using LSTMs", Nitish Srivastava, 2015)
Gated recurrent unit (GRU) • Introduced by Kyunghyun Cho, 2014 • Simpler: one update gate controls both input and forgetting / no output gate • Constructs a dependency between new and old information • Reset gate / update gate ("Understanding LSTM Networks", Christopher Olah)
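For reference, the standard GRU update equations (a sketch of the formulation in Cho et al., 2014; note that presentations differ in whether z_t or 1 - z_t weights the old state):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{update gate} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{reset gate} \\
\tilde{h}_t &= \tanh\big(W x_t + U (r_t \odot h_{t-1})\big) && \text{candidate state} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{new state}
\end{aligned}
```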
LSTM & GRU • LSTM: separate gates, GRU: combined • GRU: always exposes its state as output • LSTM: the new state considers the whole old state, GRU: only the part selected by the reset gate
Multilayer RNN • Stack RNN layers on top of each other • Time scale: by stacking RNNs, higher layers can memorize over a longer time scale [1]
Related work - 1 • "Skeleton-based Dynamic Hand Gesture Recognition", CVPR 2016 • Hand-crafted features + SVM • Compares skeleton-based and depth-based approaches
Related work - 2 • VIVA Hand Gesture Challenge, 2015 • Intensity + depth
Dynamic hand gesture • "Hand Gesture Recognition with 3D Convolutional Neural Networks", 2015
Related work - 3 • Dynamic hand gesture – continuous / weakly segmented • NVIDIA Dynamic Hand Gesture Dataset, 2016
Dynamic hand gesture – continuous / weakly segmented • CTC: sequence-to-sequence learning by maximizing the log likelihood of the label sequence given the input sequence • Adds a "blank" class to the softmax output
Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Model • Multilayer RNN using LSTM or GRU cells • All layers share the same number of units • 1 fully connected layer (14 output units) at the output • Input: skeleton, output: class probabilities
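A minimal sketch of this model, assuming PyTorch and a 22-joint skeleton flattened to 66 values per frame; the framework, joint count, and hyperparameters (150 units, 3 layers, as on later slides) are illustrative, not necessarily the author's exact implementation.

```python
import torch
import torch.nn as nn

class GestureRNN(nn.Module):
    """Stacked GRU over skeleton frames, followed by a fully connected layer
    producing per-frame probabilities over the 14 gesture classes."""
    def __init__(self, input_size=66, hidden_size=150, num_layers=3, num_classes=14):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                  # x: (batch, time, 66) skeleton coordinates
        h, _ = self.rnn(x)                 # h: (batch, time, hidden_size)
        return self.fc(h).softmax(dim=-1)  # per-frame class probabilities

probs = GestureRNN()(torch.randn(2, 100, 66))  # e.g. 2 sequences of 100 frames
```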
Data augmentation • Rotate randomly about the x, y and z axes
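A sketch of the random rotation augmentation, assuming one sequence's skeleton is stored as a (frames, joints, 3) array; the +/-15 degree range is an illustrative choice, not stated on the slide.

```python
import numpy as np

def random_rotate(skeleton, max_deg=15.0):
    """Rotate every frame of a (T, J, 3) skeleton by random angles about x, y and z."""
    ax, ay, az = np.deg2rad(np.random.uniform(-max_deg, max_deg, size=3))
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return skeleton @ (rz @ ry @ rx).T    # apply the combined rotation to every joint

augmented = random_rotate(np.random.rand(100, 22, 3))
```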
Normalization • Normalize the hand pose information relative to the wrist • Keep the wrist's coordinates as tracking information
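A sketch of the wrist-relative normalization, assuming the wrist is joint 0 of a (frames, joints, 3) skeleton; the joint index and the use of frame-to-frame displacement for the wrist track are assumptions for illustration.

```python
import numpy as np

def normalize_to_wrist(skeleton, wrist_idx=0):
    """Express joints relative to the wrist, while keeping the wrist's own
    trajectory (frame-to-frame displacement) as separate tracking information."""
    wrist = skeleton[:, wrist_idx:wrist_idx + 1, :]            # (T, 1, 3)
    pose = skeleton - wrist                                    # pose relative to the wrist
    wrist_disp = np.diff(skeleton[:, wrist_idx, :], axis=0,
                         prepend=skeleton[:1, wrist_idx, :])   # wrist displacement per frame
    return pose, wrist_disp

pose, track = normalize_to_wrist(np.random.rand(100, 22, 3))
```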
Hyperparameters vs performance • 90% training / 10% testing • The hyperparameters have only a limited influence on the performance
Continuous gesture recognition • Method: sliding window and threshold • Criterion: recall / precision / F1 score • [Figure: timeline of a video labelled with gestures such as Swipe X, Swipe left, Swipe up, Grab]
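The F1 score referenced on the slide is the standard harmonic mean of precision and recall:

```latex
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```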
Continuous gesture recognition • 12 sequences, each containing 4~8 gestures • Total possible combinations = 14 × 14 = 196 • Currently only about 50 cases are covered • Skeleton input only
Result • X axis: sliding window size • Low precision: caused by the sub-gesture problem
Sub-gesture problem • Network output for presegmented input • [Figure: x axis: percentage, y axis: probability]
Range problem • Being too close or too far (tracking information) influences the output • Use the relative displacement of the wrist track instead
Single model: relative vs absolute • Number: amount of augmentation • Relative displacement performs better in terms of best recall and precision • But the F1 score is mainly influenced by precision
Single model: 3 layers vs 2 layers • 2 layers: best recall MAX 25, rate 0.828; best precision MAX 35, rate 0.269; best F1 MAX 27, rate 0.396 • 3 layers: best recall MAX 23, rate 0.818; best precision MAX 37, rate 0.297; best F1 MAX 37, rate 0.410
RNN forest • Train 3 models to do the prediction • The training sets are the same, but the grouping into minibatches differs • Each model: 1 input -> 2-layer RNN (GRU) • Slightly better performance on the same test set (93.5% ~ 94.5%) • The outputs of all models are averaged
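A sketch of the forest-style prediction (averaging the per-frame probabilities of three independently trained models), reusing the hypothetical GestureRNN sketch from the model slide; PyTorch is assumed for illustration.

```python
import torch

def forest_predict(models, x):
    """Average per-frame class probabilities over an ensemble of RNN models."""
    with torch.no_grad():
        probs = torch.stack([m(x) for m in models])  # (n_models, batch, time, classes)
    return probs.mean(dim=0)

# models = [GestureRNN(num_layers=2) for _ in range(3)]  # same data, differently shuffled minibatches
# avg_probs = forest_predict(models, skeleton_batch)
```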
Forest vs single • Forest: 2-layer RNN × 3; single: 3-layer RNN × 1; all with 150 units per layer • The forest performs more stably, but of course takes more time • Forest (relative): best recall MAX 19, rate 0.766; best precision MAX 35, rate 0.397; best F1 MAX 29, rate 0.480 • Single: best recall MAX 23, rate 0.818; best precision MAX 37, rate 0.297; best F1 MAX 37, rate 0.410
Averaging the probabilities • Improvement: F1 0.4216 -> 0.5 • False alarms decrease (precision 0.278 -> 0.375) • Recall decreases 0.778 -> 0.75
Problems to solve • Precision is still too low: the sub-gesture problem • The network has to output a probability for every class at every time step • It should only output a definite result at the specific time a gesture is recognized
Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Sequence-to-sequence labelling • Output label sequence length ≤ input sequence length • No need to output at a specific time; output when the answer is clear • CTC – Connectionist Temporal Classification
A path-to-path mapping • Introduce a "blank" label • Mapping: F(- - A - B -) = AB, F(- A B - - -) = AB, F(A - - - B -) = AB • "Blank" enables outputting no class during transitions / silence, etc. (Fig from [2])
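A minimal sketch of the mapping F (merge repeated labels, then remove blanks); the "-" blank symbol matches the slide's notation and is only illustrative.

```python
from itertools import groupby

def ctc_collapse(path, blank="-"):
    """CTC mapping F: merge repeated labels, then drop blanks.
    F('--A-B-') = 'AB', F('-AB---') = 'AB', F('A---B-') = 'AB'."""
    return [label for label, _ in groupby(path) if label != blank]

assert ctc_collapse("--A-B-") == ["A", "B"]
assert ctc_collapse("-AABB--") == ["A", "B"]
```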
Calculating the likelihood • From the input to the per-frame output distribution • From all possible output paths to the label sequence: the objective function
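The objective from [2]: a path probability is the product of per-frame outputs, the label probability sums over all paths that collapse to the label sequence, and training minimizes the negative log likelihood:

```latex
p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t},
\qquad
p(l \mid x) = \sum_{\pi \in F^{-1}(l)} p(\pi \mid x),
\qquad
\mathcal{L}(x, l) = -\ln p(l \mid x)
```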
Forward-Backward algorithm • Considering all possible output paths directly gives too many combinations for long input sequences (Fig from [2])
Forward-Backward algorithm • Dynamic programming: calculate the likelihood recursively • Forward variables and backward variables • Used to calculate the loss gradient w.r.t. each time step's output
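The forward recursion from [2], written over the augmented label l' (the target with blanks inserted between labels and at both ends); this is the standard CTC formulation, reproduced here for reference:

```latex
\alpha_t(s) =
\begin{cases}
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, y^{t}_{l'_s}
  & \text{if } l'_s = \text{blank or } l'_s = l'_{s-2} \\
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, y^{t}_{l'_s}
  & \text{otherwise}
\end{cases}
\qquad
p(l \mid x) = \alpha_T(|l'|) + \alpha_T(|l'| - 1)
```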
Training • Slow (140k iterations with 32 samples per batch) and converges at a poor performance point (negative log likelihood ≈ 1.5, i.e. likelihood ≈ 0.2213)
Batch normalization • Normalize mean and variance
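The batch-normalization transform from [5]: each activation is normalized with the mini-batch mean and variance, then scaled and shifted by learned parameters:

```latex
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
\qquad
y_i = \gamma \hat{x}_i + \beta
```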
Improved result • Might need more training time, but the difference is already obvious
Output problem • The output is blank only; ideally it should be mostly blank with some spikes for the gestures • It starts with an arbitrary class at high probability • That class may become the predicted result for the whole sequence, so the CTC loss cannot go down
Conclusion and open problems • The sliding window approach has serious problems with the variable lengths of different gestures • The sub-gesture problem is critical for precision (false alarms) • CTC should be able to solve this by providing a "blank" class • Problem 1: the loss is still too high (0.58); more training time? a different structure? • Problem 2: blank dominates the output at all times
Ref • [1] "Training and Analyzing Deep Recurrent Neural Networks" • [2] "Supervised Sequence Labelling with Recurrent Neural Networks", Alex Graves • [3] "Deep Learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016 • [4] "Understanding LSTM Networks", Christopher Olah • [5] "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Sergey Ioffe and Christian Szegedy, 2015