
Continuous Hand Gesture Recognition from Skeleton Data Using LSTM/GRU Networks

This presentation implements a system for continuous hand gesture recognition based on skeleton data from the Dynamic Hand Gesture (DHG) 14/28 dataset. It uses LSTM/GRU networks and advanced methods such as CTC to capture hand motion and improve interaction systems in AR and VR environments.


Presentation Transcript


  1. Skeleton-based Continuous Gesture Recognition • Chi Tsung, Chang

  2. Outline • Introduction • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  3. Introduction • Hand gesture recognition: understanding the meaning of a hand motion • Static hand gesture • Dynamic hand gesture • Dynamic hand gesture types • Presegmented • Continuous (weakly segmented)

  4. Introduction • Dataset: Dynamic Hand Gesture (DHG) 14/28 dataset • Dynamic hand gestures • Captured with Intel RealSense • Contains depth and skeleton data (xyz coordinates from the Intel RSSDK) • 14 gestures × 20 subjects × 5 repetitions × 2 ways (one finger / whole hand) = 2800 sequences (http://www-rech.telecom-lille.fr/DHGdataset/)

  5. Introduction • Target: use the DHG 14/28 dataset to implement a skeleton-based continuous hand gesture recognition system • Why skeleton: • the skeleton features can fully represent hand motion • in interaction systems for environments such as AR and VR, the hand pose (skeleton) is usually already known, so this information should be used for gesture recognition

  6. Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  7. Recurrent network • 1. Self loop over the input sequence • 2. Parameters shared through time (“Deep Learning”, Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016, fig. 10.2)
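
The two properties on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's code: the function name, the tanh nonlinearity and all sizes are assumptions.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over xs of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:
        # Self loop: the previous state h feeds back into the new state.
        # Parameter sharing: W_xh, W_hh and b_h are reused at every step.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4  # illustrative sizes
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

hs = rnn_forward(xs, W_xh, W_hh, b_h)
print(hs.shape)  # (5, 4): one hidden state per time step
```

Because the same weights are applied at every step, the function works for any sequence length T.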

  8. Recurrent neural network • Training: backpropagation through time (BPTT) (“Deep Learning”, fig. 10.3)

  9. Long-term dependency • Sometimes two elements of a sequence are related across a long gap, i.e. they have a long-term dependency (e.g. “the clouds are in the sky”) (“Understanding LSTM Networks”, Christopher Olah)

  10. Long-term dependency challenge • Information explodes or vanishes • Gradient problems: vanishing / exploding gradients • The network easily converges to having only short-term memory (“Understanding LSTM Networks”, Christopher Olah)
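
A small numerical sketch of why gradients vanish or explode. It assumes a linear recurrence h_t = W h_{t-1} with W equal to a scaled orthogonal matrix, which is a simplification chosen for illustration and not the model used in the slides. In that case the Jacobian of h_T with respect to h_0 is W multiplied by itself T times, and its norm is exactly scale**T.

```python
import numpy as np

def jacobian_norm(scale, steps, dim=8, seed=0):
    """Spectral norm of d h_T / d h_0 for h_t = W h_{t-1}, W = scale * Q."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # Q is orthogonal
    W = scale * q
    J = np.eye(dim)
    for _ in range(steps):
        J = W @ J  # chain rule through one more time step
    return np.linalg.norm(J, 2)

vanish = jacobian_norm(0.9, 50)   # equals 0.9**50, roughly 0.005
explode = jacobian_norm(1.1, 50)  # equals 1.1**50, roughly 117
print(vanish, explode)
```

A factor slightly below 1 wipes out the signal from 50 steps back, and a factor slightly above 1 blows it up, which is the behaviour the slide describes.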

  11. Long short-term memory (LSTM) • Uses gates to control information flow • Can choose which parts to memorize or replace (“Unsupervised Learning of Video Representations using LSTMs”, Nitish Srivastava, 2015)
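
A minimal sketch of one LSTM step in NumPy, using the standard formulation without peephole connections. The weight layout, gate order and the demo values are assumptions made for this example; the slides do not give an implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); gate order i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to expose
    g = np.tanh(z[3 * H:])       # candidate content
    c = f * c_prev + i * g       # memorize or replace, per unit
    h = o * np.tanh(c)
    return h, c

# Demo: forget gate forced open, input gate forced closed -> memory is kept.
D, H = 3, 2
W = np.zeros((4 * H, D))
U = np.zeros((4 * H, H))
b = np.concatenate([np.full(H, -20.0), np.full(H, 20.0), np.zeros(2 * H)])
x = np.ones(D)
h_prev = np.zeros(H)
c_prev = np.array([0.5, -0.3])
h, c = lstm_step(x, h_prev, c_prev, W, U, b)
print(c)  # very close to c_prev
```

The additive update of c is what lets information survive many time steps when the forget gate stays close to 1.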

  12. Gated recurrent unit (GRU) • Introduced by Kyunghyun Cho, 2014 • Simplified -> one gate controls both input and forget / no output gate • Builds the dependency between new and old information with a reset gate and an update gate (“Understanding LSTM Networks”, Christopher Olah)
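
A minimal sketch of one GRU step, following the sign convention of the cited blog post (the update gate z weights the new candidate). The parameter names in the dict and the demo sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; p is a dict of weights (key names are illustrative)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])  # reset gate
    # The reset gate picks which parts of the old state enter the candidate.
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    # A single gate z plays the role of both input and forget gate.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
D, H = 3, 2
p = {}
for g in "zrh":
    p["W" + g] = rng.normal(size=(H, D))
    p["U" + g] = rng.normal(size=(H, H))
    p["b" + g] = np.zeros(H)
x = rng.normal(size=D)
h_prev = np.array([0.5, -0.3])
h_new = gru_step(x, h_prev, p)

# Closing the update gate (large negative bias) keeps the old state.
p_closed = dict(p, bz=np.full(H, -50.0))
h_kept = gru_step(x, h_prev, p_closed)
```

There is no output gate: the state itself is the output at every step.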

  13. LSTM & GRU • Input and forget gates: separate in LSTM, combined in GRU • GRU always exposes its state as output (no output gate) • LSTM: the new state considers all of the old state; GRU: the reset gate picks part of the old state

  14. Multilayer RNN • Stack RNN layers on top of each other • Time scale: by stacking RNNs, higher layers can memorize over longer time scales [1]

  15. Related work - 1 • “Skeleton-based Dynamic Hand Gesture Recognition”, CVPR 2016 • Hand-crafted features + SVM • Skeleton-based / depth-based

  16. Related work - 2 • VIVA Hand Gesture Challenge, 2015 • Intensity + depth

  17. Dynamic hand gesture • “Hand Gesture Recognition with 3D Convolutional Neural Networks”, 2015

  18. Related work - 3 • Dynamic hand gesture: continuous/weakly segmented • NVIDIA Dynamic Hand Gesture Dataset, 2016

  19. Dynamic hand gesture: continuous/weakly segmented • CTC: sequence-to-sequence learning by maximizing the log likelihood of the label sequence given the input sequence • The softmax output includes an extra “blank” class

  20. Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  21. Model • Multilayer RNN using LSTM or GRU • All layers have the same number of units • One fully connected layer (14 output units) at the output • Input: skeleton; output: class probabilities
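
A forward-pass sketch of this architecture in NumPy: stacked GRU layers with the same number of units, followed by one fully connected softmax layer with 14 outputs. The 150 units per layer are taken from slide 35; the input size of 66 (22 joints × 3 coordinates), the random initialisation and all function names are assumptions for illustration. No training code is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru(rng, in_dim, hid):
    """Weights for one GRU layer; rows are stacked as [update, reset, candidate]."""
    return {"W": rng.normal(scale=0.1, size=(3 * hid, in_dim)),
            "U": rng.normal(scale=0.1, size=(3 * hid, hid)),
            "b": np.zeros(3 * hid)}

def gru_layer(p, xs):
    """Run one GRU layer over xs of shape (T, in_dim); returns (T, hid)."""
    hid = p["U"].shape[1]
    Uz, Ur, Uh = p["U"][:hid], p["U"][hid:2 * hid], p["U"][2 * hid:]
    h = np.zeros(hid)
    out = []
    for x in xs:
        a = p["W"] @ x + p["b"]
        z = sigmoid(a[:hid] + Uz @ h)
        r = sigmoid(a[hid:2 * hid] + Ur @ h)
        h_cand = np.tanh(a[2 * hid:] + Uh @ (r * h))
        h = (1.0 - z) * h + z * h_cand
        out.append(h)
    return np.stack(out)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(layers, W_out, b_out, xs):
    for p in layers:  # the output sequence of one layer feeds the next
        xs = gru_layer(p, xs)
    return softmax(xs @ W_out.T + b_out)  # (T, n_classes)

rng = np.random.default_rng(0)
n_features, n_units, n_classes, n_layers = 66, 150, 14, 2  # 66 is an assumption
layers = [init_gru(rng, n_features if k == 0 else n_units, n_units)
          for k in range(n_layers)]
W_out = rng.normal(scale=0.1, size=(n_classes, n_units))
b_out = np.zeros(n_classes)

skeleton_seq = rng.normal(size=(20, n_features))  # stand-in for real frames
probs = forward(layers, W_out, b_out, skeleton_seq)
print(probs.shape)  # (20, 14)
```

The model emits a class distribution at every time step, which is what the sliding-window method later thresholds.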

  22. Data augmentation • Randomly rotate about the x, y and z axes
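
A sketch of this augmentation, assuming each sequence is stored as an array of shape (frames, joints, 3). The angle range `max_angle` and the choice of one rotation per sequence are assumptions; the slide does not specify them.

```python
import numpy as np

def rotation_matrix(ax, ay, az):
    """Rotation about x, then y, then z (angles in radians)."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def augment(seq, rng, max_angle=np.pi / 12):
    """Rotate a whole (T, J, 3) sequence by one random rotation."""
    angles = rng.uniform(-max_angle, max_angle, size=3)
    R = rotation_matrix(*angles)
    return seq @ R.T  # applies R to every joint in every frame

rng = np.random.default_rng(0)
seq = rng.normal(size=(10, 22, 3))  # 22 joints is an assumption
aug = augment(seq, rng)
R_check = rotation_matrix(0.1, 0.2, 0.3)
```

A rotation changes the viewpoint but preserves bone lengths, so the augmented sample is still a plausible hand.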

  23. Normalization • Normalize relative to the wrist -> hand pose information • Keep the wrist's coordinates -> tracking information
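
One way to realise this split, sketched under the assumption that the wrist is joint index 0 and sequences have shape (frames, joints, 3). The function name and the flattened feature layout are hypothetical.

```python
import numpy as np

def normalize(seq, wrist=0):
    """Split a (T, J, 3) sequence into wrist-relative pose and wrist track."""
    wrist_xyz = seq[:, wrist, :]         # (T, 3): tracking information
    pose = seq - wrist_xyz[:, None, :]   # (T, J, 3): hand pose relative to wrist
    features = np.concatenate([pose.reshape(len(seq), -1), wrist_xyz], axis=1)
    return pose, wrist_xyz, features

rng = np.random.default_rng(0)
seq = rng.normal(size=(10, 22, 3))
pose, wrist_xyz, features = normalize(seq)

# Moving the whole hand changes the wrist track but not the pose part.
shifted_pose, shifted_wrist, _ = normalize(seq + np.array([1.0, -2.0, 0.5]))
print(features.shape)  # (10, 69)
```

The pose part is invariant to where the hand is in space, while the appended wrist coordinates keep the motion of the hand as a whole.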

  24. Hyperparameters vs. performance • 90% training / 10% testing • Hyperparameters have only a limited influence on performance

  25. Continuous gesture recognition • Method: sliding window and threshold • Criteria: recall / precision / F1 score • F1 score: F1 = 2 · precision · recall / (precision + recall) • (Figure: a video timeline with per-segment labels such as Swipe X, Swipe left, Swipe up, Grab.)
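
A sketch of the sliding-window-and-threshold idea together with the three criteria. The detection rule (report a class once when its confidence crosses the threshold), the `classify` callable and the toy data are all assumptions for illustration; the counts passed to `precision_recall_f1` are made up and are not results from the slides.

```python
import numpy as np

def sliding_window_detect(frames, classify, window, threshold):
    """Return (frame_index, class_id) detections.

    classify is a hypothetical callable: window of frames -> per-class confidence.
    """
    detections, last = [], None
    for end in range(window, len(frames) + 1):
        probs = classify(frames[end - window:end])
        k = int(np.argmax(probs))
        if probs[k] >= threshold:
            if k != last:  # report a class once while it stays active
                detections.append((end - 1, k))
            last = k
        else:
            last = None
    return detections

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def toy_classify(window):
    """Toy stand-in for the RNN: confidence peaks when the window mean is 1 or 2."""
    m = window.mean()
    return np.array([max(0.0, 1 - abs(m - 1)), max(0.0, 1 - abs(m - 2))])

frames = np.array([0, 0, 1, 1, 1, 0, 0, 2, 2, 2, 0], dtype=float)
strict = sliding_window_detect(frames, toy_classify, window=3, threshold=0.9)
loose = sliding_window_detect(frames, toy_classify, window=3, threshold=0.6)
print(strict)  # [(4, 0), (9, 1)]
print(precision_recall_f1(tp=3, fp=5, fn=1))  # (0.375, 0.75, 0.5)
```

In this toy setup, lowering the threshold produces extra detections on windows that only partly overlap a gesture, so recall can rise while precision drops.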

  26. Continuous gesture recognition • 12 sequences, each containing 4~8 gestures • Total possible combinations = 14 * 14 = 196 • Currently only about 50 cases are covered • Skeleton input only

  27. Result • X axis: sliding window size • Low precision is caused by sub-gestures

  28. Output vs. ground truth

  29. Sub-gesture problem • X axis: percentage • Y axis: probability • Output for the presegmented input

  30. Range problem • A hand that is too close or too far (tracking information) influences the output • Use the relative displacement of the wrist track
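
A sketch of one possible reading of "relative displacement": replace the absolute wrist position by its frame-to-frame difference. The slides do not define the term precisely, so this interpretation and the function name are assumptions.

```python
import numpy as np

def relative_displacement(wrist_xyz):
    """(T, 3) absolute wrist positions -> (T, 3) frame-to-frame displacement."""
    wrist_xyz = np.asarray(wrist_xyz, dtype=float)
    disp = np.zeros_like(wrist_xyz)
    disp[1:] = np.diff(wrist_xyz, axis=0)  # first frame has no predecessor
    return disp

near = np.array([[0.0, 0.0, 0.3], [0.1, 0.0, 0.3], [0.2, 0.1, 0.3]])
far = near + np.array([0.0, 0.0, 0.5])  # same motion, further from the camera
d_near = relative_displacement(near)
d_far = relative_displacement(far)
```

Both tracks give the same displacement sequence, so a constant offset in distance no longer reaches the network.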

  31. Single: relative vs. absolute • Number: augmentation • Relative displacement gives better best-case recall and precision • But the F1 score is mainly influenced by precision

  32. Absolute vs. relative

  33. Single: 3 layers vs. 2 layers • Tree (2 layers): best recall: MAX 25, rate 0.828; best precision: MAX 35, rate 0.269; best F1: MAX 27, rate 0.396 • 3 layers: best recall: MAX 23, rate 0.818; best precision: MAX 37, rate 0.297; best F1: MAX 37, rate 0.410

  34. RNN forest • Train 3 models and combine their predictions • The training sets are the same, but the minibatches are grouped differently • Each model: 1 input -> 2-layer RNN (GRU) • Slightly better performance on the same testing set (93.5%~94.5%) • (Figure: the training set feeds three RNN stacks whose outputs are averaged.)
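
The averaging step can be sketched as follows. The three probability vectors are made-up values over three classes (the real model has 14), and `forest_predict` is a hypothetical helper name.

```python
import numpy as np

def forest_predict(prob_list):
    """Average per-class probabilities from several models, then take argmax."""
    avg = np.mean(np.stack(prob_list), axis=0)
    return avg, int(np.argmax(avg))

# Made-up outputs of three models for one input.
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.5, 0.3])
p3 = np.array([0.5, 0.4, 0.1])
avg, label = forest_predict([p1, p2, p3])
print(label)  # 0
```

Averaging keeps the result a valid distribution and damps the effect of a single model that is confidently wrong.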

  35. Forest vs. single • Forest: 3 × 2-layer RNN; single: 1 × 3-layer RNN; all with 150 units per layer • The forest performs more stably, but takes more time • Relative: best recall: MAX 19, rate 0.766; best precision: MAX 35, rate 0.397; best F1: MAX 29, rate 0.480 • Single: best recall: MAX 23, rate 0.818; best precision: MAX 37, rate 0.297; best F1: MAX 37, rate 0.410

  36. Average the probability • F1 improves: 0.4216 -> 0.5 • False alarms decrease (precision 0.278 -> 0.375) • Recall decreases: 0.778 -> 0.75

  37. Problem to solve • Precision is still too low -> sub-gestures -> the model has to output a probability for every class at every time step • It should only output a definite result at specific times

  38. Outline • Introduction • Hand gesture • Dataset • Background Knowledge • LSTM/GRU • Multilayer RNN • Related work • Model and Result • Advanced method – CTC • Conclusion

  39. Sequence-to-sequence labelling • Output label length <= input sequence length • No need to output at a specific time; output when the answer is clear • CTC: Connectionist Temporal Classification

  40. A path-to-path mapping • Introduces a “blank” label • Mapping: F(--A-B-) = AB, F(-AB---) = AB, F(A---B-) = AB • The “blank” allows outputting no class during transitions, silence, etc. (Fig. from [2])
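
The mapping F can be written directly: merge consecutive repeats, then drop blanks. This follows the standard CTC definition in [2]; the function name and the use of "-" as the blank symbol are choices made for this sketch.

```python
def ctc_collapse(path, blank="-"):
    """Map a frame-level path to a label sequence (the mapping F)."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:  # skip repeats and blanks
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse("--A-B-"))  # AB
print(ctc_collapse("-AB---"))  # AB
print(ctc_collapse("A---B-"))  # AB
print(ctc_collapse("AA-A"))    # AA: a blank separates two genuine repeats
```

Many different paths collapse to the same label sequence, which is why the likelihood has to sum over all of them.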

  41. Calculate the likelihood • From the input to the output distribution • From all possible outputs to the label sequence -> objective function
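
The equations on this slide did not survive extraction. For orientation, the standard CTC formulation from [2] is reproduced here, with F the mapping from slide 40; the symbols x (input sequence), y (softmax outputs), π (path) and l (label sequence) are notation chosen for this note and may differ from the original slide.

```latex
% Probability of one path \pi of length T, with y^t_k the softmax
% output for class k at time t:
p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t}

% Likelihood of a label sequence l: sum over all paths that F maps to l:
p(l \mid x) = \sum_{\pi \in F^{-1}(l)} p(\pi \mid x)

% Objective function: negative log likelihood of the target labelling:
\mathcal{L}(x, l) = -\ln p(l \mid x)
```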

  42. Forward-backward algorithm • Considering all possible output paths -> too many combinations for a long input sequence (Fig. from [2])

  43. Forward-backward algorithm • Dynamic programming: calculate the likelihood recursively • Forward pass • Backward pass (defined analogously) • Both are needed to calculate the loss gradient w.r.t. the output at each time step
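
A sketch of the forward recursion from [2], checked against brute-force enumeration of every path on a tiny example. It works with plain probabilities for readability; practical implementations use log space or rescaling to avoid underflow. Function names, the blank index and the demo sizes are assumptions.

```python
import itertools
import numpy as np

def ctc_likelihood(probs, label, blank=0):
    """p(label | x) from per-frame softmax outputs probs of shape (T, K)."""
    T = probs.shape[0]
    ext = [blank]
    for c in label:
        ext += [c, blank]  # blank, l1, blank, l2, ..., blank
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                 # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]        # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]        # skip the blank in between
            alpha[t, s] = a * probs[t, ext[s]]
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return total

def brute_force(probs, label, blank=0):
    """Sum path probabilities over all K**T paths that collapse to label."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        collapsed = [k for k, _ in itertools.groupby(path) if k != blank]
        if collapsed == list(label):
            total += np.prod([probs[t, k] for t, k in enumerate(path)])
    return total

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))  # T = 4 frames, K = 3 classes (0 is blank)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
fast = ctc_likelihood(probs, [1, 2])
slow = brute_force(probs, [1, 2])
print(np.isclose(fast, slow))  # True
```

The recursion costs on the order of T times the label length, while the brute-force sum grows as K to the power T.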

  44. Training • Slow (140k iterations with 32 samples per batch) and converges to a poor point (negative log likelihood ~= 1.5, likelihood ~= 0.2213)

  45. Batch normalization • Normalize mean and variance
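
A sketch of the training-mode forward pass of batch normalization as described in [5]. Where exactly it was inserted in the recurrent model is not stated on the slide, so this only shows the operation itself; `gamma`, `beta` and `eps` follow the usual naming, and the demo data is random.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 8))  # 32 samples per batch
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.var(axis=0).round(3))
```

At inference time, running averages of the mean and variance are normally used instead of the statistics of the current batch.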

  46. Improved result • May need more training time, but the improvement is clear

  47. Output problem • Blank only; ideally the output is mostly blank, with some spikes • The output starts with an arbitrary class at high probability • That class might be the predicted result for the sequence, so the CTC loss cannot go down

  48. Conclusion and open problems • The sliding window leaves a serious problem because different gestures have variable lengths • The sub-gesture problem is critical for precision (false alarms) • CTC should be able to solve this problem by providing a “blank” class • Problem 1: the loss is still too high (0.58); more training time? A different structure? • Problem 2: blank dominates at all time steps

  49. Ref • [1] “Training and Analyzing Deep Recurrent Neural Networks” • [2] “Supervised Sequence Labelling with Recurrent Neural Networks”, Alex Graves • [3] “Deep Learning”, Ian Goodfellow, Yoshua Bengio and Aaron Courville, 2016 • [4] “Understanding LSTM Networks”, Christopher Olah • [5] “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Sergey Ioffe, Christian Szegedy, 2015
