Real-Time Vision-Based Gesture Recognition Using Haar-like Features By: Qing Chen, Nicolas D. Georganas and Emil M. Petriu IMTC 2007, Warsaw, Poland, May 1-3, 2007
Outline • 1. Introduction • 2. Two-level Approach • 3. Posture Recognition • 4. Gesture Recognition • 5. Conclusions
1. Introduction • Human-Virtual Environment (VE) interaction requires utilizing different modalities (e.g. speech, body position, hand gestures, haptic response, etc.) and integrating them for a more immersive user experience. • Hand gestures are an intuitive yet powerful communication modality that has not been fully explored for H-VE interaction. • The latest computer vision and image processing techniques make real-time vision-based hand gesture recognition feasible for human-computer interaction. • A vision-based hand gesture recognition system needs to meet requirements for real-time performance, robustness, and recognition accuracy.
1. Introduction (cont’d) • Vision-based gesture recognition techniques can be divided into two categories: • Appearance-based approaches (√): - Pros: simple hand models; efficient implementation; real-time performance easier to achieve. - Cons: limited capability to model 3D hand gestures. - We choose this approach to achieve real-time performance. • 3D hand model-based approaches: - Pros: potential to model more natural hand gestures. - Cons: complex hand model; real-time performance is difficult to achieve; user-dependent.
2. Two-level Approach • Definition 1 (Posture/Pose) A posture or pose is defined solely by the (static) hand configuration and hand location. • Definition 2 (Gesture) A gesture is a series of postures over a time span, connected by motions (global hand motion and local finger motion).
2. Two-level Approach (cont’d) • With the hierarchical nature of these definitions, it is natural to decouple the gesture classification problem into two levels: • Lower level: recognition of primitives (postures); solution: the Viola and Jones algorithm. • Higher level: recognition of structure (gestures); solution: grammar-based analysis.
3. Posture Recognition • Viola and Jones Algorithm (2001): • A statistical approach originally developed for the task of human face detection and tracking. • 15 times faster than any previous face detection approach while achieving accuracy equivalent to the best published results. • It employs three key techniques: • Haar-like features • Integral image • AdaBoost learning algorithm • Issues for hand postures: • Applicability • Classification besides detection • Selection of posture sets • Calibration
3. Posture Recognition (cont’d) • Haar-like features: • The value of a Haar-like feature: f(x) = Σ_(black rectangle) (pixel gray level) − Σ_(white rectangle) (pixel gray level) • Compared with raw pixels, Haar-like features can reduce in-class variability and increase out-of-class variability, thus making classification easier. Figure 1: The set of basic Haar-like features. Figure 2: The set of extended Haar-like features.
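As a simple illustration of the feature value above, here is a minimal sketch assuming a grayscale sub-window stored as a NumPy array; the two-rectangle layout and the coordinates are illustrative only, not the feature set used in the paper.

```python
import numpy as np

def haar_two_rect_feature(gray, x, y, w, h):
    """Basic two-rectangle Haar-like feature at (x, y) with size (w, h):
    sum of pixels in the left (black) half minus sum in the right (white) half."""
    black = gray[y:y + h, x:x + w // 2].sum()
    white = gray[y:y + h, x + w // 2:x + w].sum()
    return float(black) - float(white)

# Example: evaluate the feature on a random 24x24 sub-window.
sub_window = np.random.randint(0, 256, (24, 24), dtype=np.uint8)
value = haar_two_rect_feature(sub_window, x=4, y=4, w=16, h=8)
```

Summing raw pixels like this for every feature and every sub-window is too slow for real time, which is exactly what the integral image on the next slide avoids.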
3. Posture Recognition (cont’d) • The rectangular Haar-like features can be computed rapidly using the “integral image”. • The integral image at location (x, y) contains the sum of the pixel values above and to the left of (x, y), inclusive: ii(x, y) = Σ_(x′ ≤ x, y′ ≤ y) i(x′, y′) • With the integral-image values P1, P2, P3, P4 at the four corners of a rectangle D, the sum of pixel values within D can be computed as: P1 + P4 − P2 − P3
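A minimal NumPy sketch of the integral image and the four-corner rectangle sum described above (an illustration of the technique, not the authors' code):

```python
import numpy as np

def integral_image(gray):
    """ii(x, y): sum of all pixels above and to the left of (x, y), inclusive.
    A leading row and column of zeros makes the corner lookups uniform."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels inside rectangle D with top-left corner (x, y), using only
    four lookups at the corners: P1 + P4 - P2 - P3."""
    return ii[y, x] + ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x]

def haar_two_rect(ii, x, y, w, h):
    """Two-rectangle Haar-like feature evaluated from the integral image."""
    black = rect_sum(ii, x, y, w // 2, h)
    white = rect_sum(ii, x + w // 2, y, w // 2, h)
    return black - white
```

Because each rectangle sum costs only four array lookups, the feature value is computed in constant time regardless of the rectangle size.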
3. Posture Recognition (cont’d) • To detect the hand, the image is scanned by a sub-window containing a Haar-like feature. • Based on each Haar-like feature f_j, a weak classifier h_j(x) is defined as: h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise, where x is a sub-window, θ_j is a threshold, and p_j is a parity indicating the direction of the inequality sign.
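Expressed directly as code, the weak classifier is just a parity-aware threshold on a single feature value (a sketch of the formula above, not library code):

```python
def weak_classifier(feature_value, theta, parity):
    """h_j(x): return 1 (object) when parity * f_j(x) < parity * theta, else 0."""
    return 1 if parity * feature_value < parity * theta else 0
```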
3. Posture Recognition (cont’d) • In machine vision: • It is HARD to find a single accurate classification rule; • It is EASY to find rules with classification accuracy slightly better than 50% (weak classifiers). • AdaBoost (Adaptive Boosting) is an iterative algorithm that improves the accuracy stage by stage based on a series of weak classifiers. • Adaptive: later classifiers are tuned in favor of the samples misclassified by previous classifiers.
3. Posture Recognition (cont’d) • AdaBoost starts with a uniform distribution of “weights” over the training examples. The weights tell the learning algorithm the importance of each example. • Obtain a weak classifier h_j(x) from the weak learning algorithm. • Increase the weights on the training examples that were misclassified. • (Repeat) • At the end, carefully form a linear combination of the weak classifiers obtained at all iterations.
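The following is only a minimal sketch of this boosting loop, assuming the Haar-like feature values of the training sub-windows have already been collected into a matrix F; it illustrates the round structure described above and is not the exact training procedure used by the authors.

```python
import numpy as np

def train_adaboost(F, y, n_rounds):
    """Minimal AdaBoost over threshold ("stump") weak classifiers.
    F: (n_samples, n_features) matrix of Haar-like feature values.
    y: labels in {0, 1} (1 = posture present).
    Returns a list of (feature index, threshold, parity, alpha)."""
    n, m = F.shape
    w = np.full(n, 1.0 / n)          # uniform initial weights
    classifiers = []
    for _ in range(n_rounds):
        w = w / w.sum()              # renormalize weights each round
        best = None
        # Pick the single feature/threshold/parity with lowest weighted error.
        for j in range(m):
            for theta in np.unique(F[:, j]):
                for parity in (1, -1):
                    pred = (parity * F[:, j] < parity * theta).astype(int)
                    err = float(np.sum(w * (pred != y)))
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, pred)
        err, j, theta, parity, pred = best
        beta = err / (1.0 - err + 1e-12)
        alpha = np.log(1.0 / (beta + 1e-12))
        # Down-weight correctly classified samples; misclassified samples keep
        # their weight, so the next round focuses on them.
        w = w * np.power(beta, (pred == y).astype(float))
        classifiers.append((j, theta, parity, alpha))
    return classifiers

def strong_classify(classifiers, f):
    """Linear combination of the weak classifiers, thresholded at half the alpha sum."""
    score = sum(alpha * (parity * f[j] < parity * theta)
                for j, theta, parity, alpha in classifiers)
    return int(score >= 0.5 * sum(alpha for _, _, _, alpha in classifiers))
```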
3. Posture Recognition (cont’d) • A series of classifiers is applied to every sub-window. • The first classifier: • Eliminates a large number of negative sub-windows; • Passes almost all positive sub-windows (at a high false-positive rate) with very little processing. • Subsequent layers eliminate additional negative sub-windows (passed by the first classifier) but require more computation. • After several stages of processing, the number of negative sub-windows has been reduced radically.
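To make the early-rejection idea concrete, here is a minimal sketch (not the authors' implementation): each stage is modeled as a callable that accepts or rejects a sub-window, and a sub-window survives only if it passes every stage.

```python
def cascade_classify(stages, sub_window):
    """stages: list of callables, each a boosted stage classifier returning
    True (pass) or False (reject) for a sub-window."""
    for stage in stages:
        if not stage(sub_window):
            return False   # rejected early: later, costlier stages never run
    return True            # passed every stage: candidate hand posture
```

Because most sub-windows in a frame contain no hand, nearly all of them exit after the cheap first stage, which is what makes scanning the whole image feasible in real time.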
3. Posture Recognition (cont’d) • Four hand postures have been tested with the Viola & Jones algorithm. • Input device: a low-cost Logitech QuickCam web camera with a resolution of 320 × 240 at up to 15 frames per second.
3. Posture Recognition (cont’d) • Training sample collection: • Negative samples: images that must not contain the object. We collected 500 random images as negative samples. • Positive samples: hand posture images collected from human hands or generated with a 3D hand model. For each posture, we collected around 450 positive samples. For the initial test, we used a white wall as the background.
3. Posture Recognition (cont’d) • After the training process based on the AdaBoost learning algorithm, we obtain a cascade classifier for each hand posture once the required accuracy is achieved: • “Two-finger” posture: 15-stage cascade classifier; • “Palm” posture: 10-stage cascade classifier; • “Fist” posture: 15-stage cascade classifier; • “Little finger” posture: 14-stage cascade classifier. • The performance of the trained classifiers for 100 testing images:
3. Posture Recognition (cont’d) • To recognize these different hand postures, a parallel structure that includes all of the cascade classifiers is implemented:
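As an illustration of this parallel structure, here is a hedged sketch using OpenCV's CascadeClassifier API: every posture cascade is run on each camera frame and any detections are labeled. The cascade file names (two_finger.xml, palm.xml, fist.xml, little_finger.xml) are placeholders, not the paper's actual trained classifiers.

```python
import cv2

# Hypothetical cascade files, one per trained posture (names are placeholders).
cascade_files = {
    "two_finger": "two_finger.xml",
    "palm": "palm.xml",
    "fist": "fist.xml",
    "little_finger": "little_finger.xml",
}
cascades = {name: cv2.CascadeClassifier(path) for name, path in cascade_files.items()}

cap = cv2.VideoCapture(0)            # e.g. a low-cost web camera, as in the paper
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Run every posture cascade on the same frame: the "parallel" structure.
    for name, cascade in cascades.items():
        hits = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in hits:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, name, (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 1)
    cv2.imshow("postures", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```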
3. Posture Recognition (cont’d) • The real-time performance of the posture recognition:
4. Gesture Recognition • As a gesture is a series of postures, a grammar-based syntactic analysis is suitable to describe composite gestures in terms of postures, and thus enables the system to recognize gestures based on their representations. • For pattern recognition, a grammar G = (N, T, P, S) consists of: • A finite set N of non-terminal symbols; • A finite set T of terminal symbols that is disjoint from N; • A finite set P of production rules; • A distinguished symbol S ∈ N that is the start symbol. • Issues in modeling the structure of hand gestures: • Choice of basic primitives • Choice of appropriate grammar type (context-free, stochastic context-free, regular, HMM)
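As a toy illustration of the grammar-based idea, the postures from the lower level can be treated as terminal symbols and composite gestures as productions over them. The gesture definitions below are invented for the example and are not taken from the paper.

```python
# Terminal symbols are the recognized postures; a gesture is a posture sequence.
# These productions are illustrative only, not the paper's grammar.
GESTURES = {
    "grab": ["palm", "fist"],                       # open hand closing into a fist
    "point_and_release": ["fist", "two_finger", "palm"],
}

def collapse(posture_stream):
    """Merge consecutive identical posture labels so each run counts once."""
    collapsed = []
    for p in posture_stream:
        if not collapsed or collapsed[-1] != p:
            collapsed.append(p)
    return collapsed

def recognize_gesture(posture_stream):
    """Return the first gesture whose production appears as a contiguous
    subsequence of the collapsed posture stream, or None."""
    seq = collapse(posture_stream)
    for name, production in GESTURES.items():
        for i in range(len(seq) - len(production) + 1):
            if seq[i:i + len(production)] == production:
                return name
    return None

# Example: frame-by-frame posture labels produced by the lower level.
stream = ["palm", "palm", "palm", "fist", "fist"]
print(recognize_gesture(stream))   # -> "grab"
```

This corresponds to a regular-grammar view of gestures; a stochastic grammar or HMM would additionally weight the productions to tolerate recognition noise in the posture stream.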
5. Conclusions • The parallel cascade structure based on Haar-like features and the AdaBoost learning algorithm can achieve satisfactory real-time hand posture classification results; • The experimental results show that the Viola and Jones algorithm is very robust to scale variation and has a certain degree of robustness against in-plane rotation (±15˚) and out-of-plane rotation; • The Viola and Jones algorithm also shows good performance under different illumination conditions, but poor performance with different backgrounds; • A two-level architecture that captures the hierarchical nature of gesture classification is proposed: the lower level focuses on posture recognition, while the higher level focuses on the description of composite gestures using grammar-based syntactic analysis.