Implementing a system for continuous hand gesture recognition based on skeleton data from the Dynamic Hand Gesture 14/28 dataset. Utilizing LSTM/GRU networks and advanced methods like CTC to capture hand motion and improve interaction systems in AR and VR environments.
Skeleton-based Continuous Gesture Recognition • Chi-Tsung Chang
Outline • Introduction • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Introduction • Hand gesture recognition – interpreting the meaning of hand motions • Static hand gestures • Dynamic hand gestures • Dynamic hand gesture types • Presegmented • Continuous (weakly segmented)
Introduction • Dataset – Dynamic Hand Gesture 14/28 dataset • Dynamic hand gestures captured with an Intel RealSense camera • Contains depth and skeleton data (x, y, z coordinates from the Intel RealSense SDK) • 14 gestures / 20 subjects / performed 5 times / 2 ways (one finger / whole hand) = 2800 sequences (http://www-rech.telecom-lille.fr/DHGdataset/)
Introduction • Target: use the DHG 14/28 dataset to implement a skeleton-based continuous hand gesture recognition system • Why skeleton: • the skeleton features fully represent the hand motion • in interaction systems for environments like AR and VR the hand pose (skeleton) is usually already known, so this information should be used for gesture recognition
Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Recurrent network • 1. Self-loop over an input sequence • 2. Parameters are shared through time ("Deep learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016, fig 10.2)
Recurrent Neural Network • Training: backpropagation through time (BPTT) ("Deep learning", fig 10.3)
Long term dependency • Sometimes two elements of a sequence are related across a long span, i.e. a long-term dependency (e.g. "the clouds are in the sky") ("Understanding LSTM Networks", Christopher Olah)
Long term dependency challenge • Information explodes or vanishes over time • Gradient problems: gradients vanish or explode • The network therefore easily converges to having only short-term memory ("Understanding LSTM Networks", Christopher Olah)
Long short-term memory (LSTM) • Uses gates to control the information flow • Can choose which parts are memorized or replaced ("Unsupervised Learning of Video Representations using LSTMs", Nitish Srivastava, 2015)
Gated recurrent unit (GRU) • Introduced by Kyunghyun Cho, 2014 • Simpler: one update gate controls both input and forgetting / no output gate • Constructs a dependency between new and old information • Reset gate / update gate ("Understanding LSTM Networks", Christopher Olah)
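For reference, the standard GRU update equations (a sketch of the formulation in Cho et al., 2014; note that presentations differ in whether z_t or 1 - z_t weights the old state):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{update gate} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{reset gate} \\
\tilde{h}_t &= \tanh\big(W x_t + U (r_t \odot h_{t-1})\big) && \text{candidate state} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{new state}
\end{aligned}
```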
LSTM & GRU • LSTM: separate gates, GRU: combined • GRU: always exposes its state as output • LSTM: the new state considers the whole old state, GRU: only the part selected by the reset gate
Multilayer RNN • Stack RNN layers on top of each other • Time scale: by stacking RNNs, higher layers can memorize over a longer time scale [1]
Related work - 1 • "Skeleton-based Dynamic Hand Gesture Recognition", CVPR 2016 • Hand-crafted features + SVM • Compares skeleton-based and depth-based approaches
Related work - 2 • VIVA Hand Gesture Challenge, 2015 • Intensity + depth
Dynamic hand gesture • "Hand Gesture Recognition with 3D Convolutional Neural Networks", 2015
Related work - 3 • Dynamic hand gesture – continuous / weakly segmented • NVIDIA Dynamic Hand Gesture Dataset, 2016
Dynamic hand gesture – continuous / weakly segmented • CTC: sequence-to-sequence learning by maximizing the log likelihood of the label sequence given the input sequence • Adds a "blank" class to the softmax output
Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Model • Multilayer RNN using LSTM or GRU cells • All layers share the same number of units • 1 fully connected layer (14 output units) at the output • Input: skeleton, output: class probabilities
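A minimal sketch of this model, assuming PyTorch and a 22-joint skeleton flattened to 66 values per frame; the framework, joint count, and hyperparameters (150 units, 3 layers, as on later slides) are illustrative, not necessarily the author's exact implementation.

```python
import torch
import torch.nn as nn

class GestureRNN(nn.Module):
    """Stacked GRU over skeleton frames, followed by a fully connected layer
    producing per-frame probabilities over the 14 gesture classes."""
    def __init__(self, input_size=66, hidden_size=150, num_layers=3, num_classes=14):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                  # x: (batch, time, 66) skeleton coordinates
        h, _ = self.rnn(x)                 # h: (batch, time, hidden_size)
        return self.fc(h).softmax(dim=-1)  # per-frame class probabilities

probs = GestureRNN()(torch.randn(2, 100, 66))  # e.g. 2 sequences of 100 frames
```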
Data augmentation • Rotate randomly about the x, y and z axes
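A sketch of the random rotation augmentation, assuming one sequence's skeleton is stored as a (frames, joints, 3) array; the +/-15 degree range is an illustrative choice, not stated on the slide.

```python
import numpy as np

def random_rotate(skeleton, max_deg=15.0):
    """Rotate every frame of a (T, J, 3) skeleton by random angles about x, y and z."""
    ax, ay, az = np.deg2rad(np.random.uniform(-max_deg, max_deg, size=3))
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    return skeleton @ (rz @ ry @ rx).T    # apply the combined rotation to every joint

augmented = random_rotate(np.random.rand(100, 22, 3))
```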
Normalization • Normalize the hand pose information relative to the wrist • Keep the wrist's coordinates as tracking information
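A sketch of the wrist-relative normalization, assuming the wrist is joint 0 of a (frames, joints, 3) skeleton; the joint index and the use of frame-to-frame displacement for the wrist track are assumptions for illustration.

```python
import numpy as np

def normalize_to_wrist(skeleton, wrist_idx=0):
    """Express joints relative to the wrist, while keeping the wrist's own
    trajectory (frame-to-frame displacement) as separate tracking information."""
    wrist = skeleton[:, wrist_idx:wrist_idx + 1, :]            # (T, 1, 3)
    pose = skeleton - wrist                                    # pose relative to the wrist
    wrist_disp = np.diff(skeleton[:, wrist_idx, :], axis=0,
                         prepend=skeleton[:1, wrist_idx, :])   # wrist displacement per frame
    return pose, wrist_disp

pose, track = normalize_to_wrist(np.random.rand(100, 22, 3))
```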
Hyperparameters vs performance • 90% training / 10% testing • The hyperparameters have only a limited influence on the performance
Continuous gesture recognition • Method: sliding window and threshold • Criterion: recall / precision / F1 score • [Figure: timeline of a video labelled with gestures such as Swipe X, Swipe left, Swipe up, Grab]
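The F1 score referenced on the slide is the standard harmonic mean of precision and recall:

```latex
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```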
Continuous gesture recognition • 12 sequences, each containing 4~8 gestures • Total possible combinations = 14 × 14 = 196 • Currently only about 50 cases are covered • Skeleton input only
Result • X axis: sliding window size • Low precision: caused by the sub-gesture problem
Sub-gesture problem • Network output for presegmented input • [Figure: x axis: percentage, y axis: probability]
Range problem • Being too close or too far (tracking information) influences the output • Use the relative displacement of the wrist track instead
Single model: relative vs absolute • Number: amount of augmentation • Relative displacement performs better in terms of best recall and precision • But the F1 score is mainly influenced by precision
Single model: 3 layers vs 2 layers • 2 layers: best recall MAX 25, rate 0.828; best precision MAX 35, rate 0.269; best F1 MAX 27, rate 0.396 • 3 layers: best recall MAX 23, rate 0.818; best precision MAX 37, rate 0.297; best F1 MAX 37, rate 0.410
RNN forest • Train 3 models to do the prediction • The training sets are the same, but the grouping into minibatches differs • Each model: 1 input -> 2-layer RNN (GRU) • Slightly better performance on the same test set (93.5% ~ 94.5%) • The outputs of all models are averaged
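A sketch of the forest-style prediction (averaging the per-frame probabilities of three independently trained models), reusing the hypothetical GestureRNN sketch from the model slide; PyTorch is assumed for illustration.

```python
import torch

def forest_predict(models, x):
    """Average per-frame class probabilities over an ensemble of RNN models."""
    with torch.no_grad():
        probs = torch.stack([m(x) for m in models])  # (n_models, batch, time, classes)
    return probs.mean(dim=0)

# models = [GestureRNN(num_layers=2) for _ in range(3)]  # same data, differently shuffled minibatches
# avg_probs = forest_predict(models, skeleton_batch)
```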
Forest vs single • Forest: 2-layer RNN × 3; single: 3-layer RNN × 1; all with 150 units per layer • The forest performs more stably, but of course takes more time • Forest (relative): best recall MAX 19, rate 0.766; best precision MAX 35, rate 0.397; best F1 MAX 29, rate 0.480 • Single: best recall MAX 23, rate 0.818; best precision MAX 37, rate 0.297; best F1 MAX 37, rate 0.410
Averaging the probabilities • Improvement: F1 0.4216 -> 0.5 • False alarms decrease (precision 0.278 -> 0.375) • Recall decreases 0.778 -> 0.75
Problems to solve • Precision is still too low: the sub-gesture problem • The network has to output a probability for every class at every time step • It should only output a definite result at the specific time a gesture is recognized
Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion
Sequence-to-sequence labelling • Output label sequence length ≤ input sequence length • No need to output at a specific time; output when the answer is clear • CTC – Connectionist Temporal Classification
A path-to-path mapping • Introduce a "blank" label • Mapping: F(- - A - B -) = AB, F(- A B - - -) = AB, F(A - - - B -) = AB • "Blank" enables outputting no class during transitions / silence, etc. (Fig from [2])
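A minimal sketch of the mapping F (merge repeated labels, then remove blanks); the "-" blank symbol matches the slide's notation and is only illustrative.

```python
from itertools import groupby

def ctc_collapse(path, blank="-"):
    """CTC mapping F: merge repeated labels, then drop blanks.
    F('--A-B-') = 'AB', F('-AB---') = 'AB', F('A---B-') = 'AB'."""
    return [label for label, _ in groupby(path) if label != blank]

assert ctc_collapse("--A-B-") == ["A", "B"]
assert ctc_collapse("-AABB--") == ["A", "B"]
```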
Calculating the likelihood • From the input to the per-frame output distribution • From all possible output paths to the label sequence: the objective function
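The objective from [2]: a path probability is the product of per-frame outputs, the label probability sums over all paths that collapse to the label sequence, and training minimizes the negative log likelihood:

```latex
p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t},
\qquad
p(l \mid x) = \sum_{\pi \in F^{-1}(l)} p(\pi \mid x),
\qquad
\mathcal{L}(x, l) = -\ln p(l \mid x)
```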
Forward-Backward algorithm • Considering all possible output paths directly gives too many combinations for long input sequences (Fig from [2])
Forward-Backward algorithm • Dynamic programming: calculate the likelihood recursively • Forward variables and backward variables • Used to calculate the loss gradient w.r.t. each time step's output
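The forward recursion from [2], written over the augmented label l' (the target with blanks inserted between labels and at both ends); this is the standard CTC formulation, reproduced here for reference:

```latex
\alpha_t(s) =
\begin{cases}
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, y^{t}_{l'_s}
  & \text{if } l'_s = \text{blank or } l'_s = l'_{s-2} \\
\big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, y^{t}_{l'_s}
  & \text{otherwise}
\end{cases}
\qquad
p(l \mid x) = \alpha_T(|l'|) + \alpha_T(|l'| - 1)
```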
Training • Slow (140k iterations with 32 samples per batch) and converges at a poor performance point (negative log likelihood ≈ 1.5, i.e. likelihood ≈ 0.2213)
Batch normalization • Normalize mean and variance
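The batch-normalization transform from [5]: each activation is normalized with the mini-batch mean and variance, then scaled and shifted by learned parameters:

```latex
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
\qquad
y_i = \gamma \hat{x}_i + \beta
```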
Improved result • Might need more training time, but the difference is already obvious
Output problem • The output is blank only; ideally it should be mostly blank with some spikes for the gestures • It starts with an arbitrary class at high probability • That class may become the predicted result for the whole sequence, so the CTC loss cannot go down
Conclusion and open problems • The sliding window approach has serious problems with the variable lengths of different gestures • The sub-gesture problem is critical for precision (false alarms) • CTC should be able to solve this by providing a "blank" class • Problem 1: the loss is still too high (0.58); more training time? a different structure? • Problem 2: blank dominates the output at all times
Ref • [1] "Training and Analyzing Deep Recurrent Neural Networks" • [2] "Supervised Sequence Labelling with Recurrent Neural Networks", Alex Graves • [3] "Deep Learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016 • [4] "Understanding LSTM Networks", Christopher Olah • [5] "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Sergey Ioffe and Christian Szegedy, 2015