Real Time Gesture Recognition of Human Hand Wu Hai Atid Shamaie Alistair Sutherland
Overview: • What are gestures? • What can gestures be used for? • How to find a hand in an image? • How to recognise its shape? • How to recognise its motion? • How to find its position in 3D space?
What is a Gesture? A movement of a limb or the body as an expression of thought or feeling. --Oxford Concise Dictionary 1995
Mood, emotion • Mood and emotion are expressed by body language • Facial expressions • Tone of voice • Recognising these cues allows computers to interact with human beings in a more natural way
Human Computer Interface using Gesture • Replace mouse and keyboard • Pointing gestures • Navigate in a virtual environment • Pick up and manipulate virtual objects • Interact with a 3D world • No physical contact with computer • Communicate at a distance
Public Display Screens • Information display screens • Supermarkets • Post Offices, Banks • Allows control without having to touch the device
Sign Language • a vocabulary of around 5,000 gestures • each gesture consists of a hand shape, a hand motion and a location in 3D space • facial expressions are important • full grammar and syntax • each country has its own sign language • Irish Sign Language is different from British Sign Language or American Sign Language
Example hand-shapes: F, C, A
Datagloves • Datagloves provide very accurate measurements of hand-shape • But are cumbersome to wear • Expensive • Connected by wires, which restricts freedom of movement
Datagloves - the future • Will get lighter and more flexible • Will get cheaper ~ $100 • Wireless?
Our vision-based system • Wireless and flexible • No specialised hardware • Single camera • Real-time
Coloured Gloves • User must wear coloured gloves • Very cheap • Easy to put on • BUT get dirty • Eventually we wish to use natural skin
Processing pipeline: Colour Segmentation, Noise Removal, Scale by Area, giving a 32 × 32 image
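The segmentation pipeline can be sketched in a few lines of NumPy. Everything below (the RGB thresholds, the neighbour-support noise filter, the function name) is illustrative, not the authors' actual implementation, which would more likely use HSV thresholds and morphological filtering:

```python
import numpy as np

def segment_and_normalise(img, lo, hi, out_size=32):
    """Segment a coloured glove by RGB thresholds, remove isolated noise
    pixels, crop to the hand's bounding box (scale by area), and resample
    to an out_size x out_size binary image."""
    # Colour segmentation: keep pixels whose RGB lies inside [lo, hi].
    mask = np.all((img >= lo) & (img <= hi), axis=-1)
    # Crude noise removal: drop pixels with no 4-connected neighbour.
    p = np.pad(mask, 1)
    support = p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    mask = mask & support
    ys, xs = np.nonzero(mask)
    if ys.size == 0:                       # no hand found
        return np.zeros((out_size, out_size), dtype=bool)
    # Scale by area: cropping the bounding box normalises the hand's size.
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Nearest-neighbour resampling down to out_size x out_size.
    ri = np.arange(out_size) * crop.shape[0] // out_size
    ci = np.arange(out_size) * crop.shape[1] // out_size
    return crop[np.ix_(ri, ci)]
```

Normalising to a fixed 32 × 32 image makes every hand image the same size regardless of how close the hand is to the camera, which is what lets later stages compare images directly.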
Demo • Gesture Video
Feature Space • Each point represents a different image • Clusters of points represent different hand-shapes • Distance between points depends on how similar the images are
• A continuous gesture creates a trajectory in feature space • We can project a new image onto the trajectory
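As a sketch of the subspace idea (function names and the use of SVD-based PCA are my own; the system's exact construction may differ): each gesture's frames span a low-dimensional subspace, and a new image is projected into it to find where it lies along the trajectory.

```python
import numpy as np

def fit_subspace(frames, k=3):
    """Fit a k-dimensional PCA subspace to one gesture's frames.
    frames: (n_frames, n_pixels) array, each row a flattened image."""
    mean = frames.mean(axis=0)
    # Rows of vt are orthonormal principal axes in pixel space.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, basis):
    """Coordinates of one image in the gesture's subspace; successive
    frames of a gesture trace a trajectory of such points."""
    return (image - mean) @ basis.T
```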
Multiple sub-spaces • Each gesture (Gesture 1, Gesture 2, …) has its own sub-space within the global space • A new unknown image is classified against these sub-spaces
3D spatial position of hand • Subspaces and trajectories are calculated with the hand at the origin • We know the image co-ordinates and the area of the hand in the original image • From these we can calculate depth and x-y position relative to the camera
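A minimal sketch of the depth-from-area idea, assuming a pinhole camera; the focal length and reference calibration constants below are illustrative, not taken from the system. Apparent area falls off as 1/Z², so depth follows from the square root of an area ratio:

```python
import numpy as np

def hand_position_3d(cx, cy, area, f=500.0, ref_area=2000.0, ref_depth=1.0):
    """Estimate the hand's 3-D position from its image centroid (cx, cy)
    and pixel area, assuming a pinhole camera calibrated so that a hand
    at ref_depth projects to ref_area pixels."""
    # Apparent area scales as 1/Z^2, so Z = ref_depth * sqrt(ref_area / area).
    z = ref_depth * np.sqrt(ref_area / area)
    # Back-project image coordinates (measured from the principal point).
    x = cx * z / f
    y = cy * z / f
    return x, y, z
```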
A sequence of Yes/No decisions narrows the search down through the candidate hand-shapes (Y, A, B, C)
Hierarchical Search • We need to search thousands of images • How to do this efficiently? • We need to use a “coarse-to-fine” search strategy
Blurring: the original image smoothed with Blurring Factor = 1, 2 and 3
Multi-scale Hierarchy: search at Factor = 3.0 first, then refine at Factor = 2.0 and Factor = 1.0
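The coarse-to-fine idea can be sketched as follows. This is an assumption-laden toy version: images are flattened to 1-D vectors and a box blur stands in for the Gaussian blurring factors above; the pruning width `keep` is arbitrary.

```python
import numpy as np

def blur(vec, factor):
    """Box blur of a flattened image, standing in for Gaussian smoothing."""
    if factor <= 1:
        return vec
    return np.convolve(vec, np.ones(factor) / factor, mode="same")

def coarse_to_fine(query, templates, factors=(3, 2, 1), keep=4):
    """Coarse-to-fine search: at each blur level keep only the `keep`
    nearest candidates, then refine them at the next sharper level."""
    candidates = list(range(len(templates)))
    for f in factors:
        q = blur(query, f)
        dists = sorted((np.linalg.norm(q - blur(templates[i], f)), i)
                       for i in candidates)
        candidates = [i for _, i in dists[:keep]]
    return candidates[0]
```

Because most templates are discarded at the cheap, heavily blurred level, far fewer full-resolution comparisons are needed than in a flat search over thousands of images.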
Motion Recognition • Hidden Markov Models (HMMs) model the time sequence of images • Each word has its own model: HMM1 (Hello), HMM2 (Good), HMM3 (Bad), HMM4 (House) • A feature sequence f is assigned to the model with the highest likelihood, comparing P(f | HMM1), P(f | HMM2), …
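A compact sketch of that classification rule, using the standard scaled forward algorithm for a discrete-observation HMM (the model parameters and word names below are invented for illustration):

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete HMM with
    initial probabilities pi (S,), transitions A (S, S), emissions B (S, V)."""
    alpha = pi * B[:, obs[0]]
    s = alpha.sum()
    log_p = np.log(s)
    alpha = alpha / s
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()          # rescale each step to avoid underflow
        log_p += np.log(s)
        alpha = alpha / s
    return log_p

def recognise(obs, models):
    """Assign the sequence to the word whose HMM gives it the highest
    likelihood, e.g. comparing P(f | HMM_Hello) with P(f | HMM_Good)."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```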
Prediction and Tracking • Given previous frames we can predict what will happen next • Speeds up the search • Helps the tracker to cope with occlusions
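The simplest form of such prediction is a constant-velocity extrapolation of the hand centroid; a real tracker would more likely use a Kalman filter, but this shows the idea:

```python
def predict_next(positions):
    """Constant-velocity prediction of the next hand centroid from the
    last two observed positions. Predicting narrows the search window in
    the next frame and lets the tracker coast through brief occlusions."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    return (2 * x1 - x0, 2 * y1 - y0)
```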
Co-articulation • In fluent dialogue signs are modified by the preceding and following signs • This produces intermediate forms between signs, e.g. between A and B
Future Work: • Occlusions (Atid) • Grammars in Irish Sign Language. --- Sentence Recognition • Body Language.
Facial Expressions: Anger, Fear, Disgust, Happy, Sad, Surprise
Face Recognition • Summary • Single pose • Multiple pose • Principal components analysis • Model-based recognition • Neural Networks
Single Pose • Standard head-and-shoulders view with uniform background • Easy to find face within image
Aligning Images • Faces in the training set must be aligned with each other to remove the effects of translation, scale, rotation etc. • It is easy to find the position of the eyes and mouth and then shift and resize images so that they are aligned with each other
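One common way to do this is to fit a similarity transform (scale, rotation, translation) that maps the detected eye positions onto fixed canonical positions. The sketch below is illustrative; the canonical coordinates for a 40 × 40 face chip are invented:

```python
import numpy as np

def eye_alignment(left_eye, right_eye,
                  canon_left=(12.0, 16.0), canon_right=(28.0, 16.0)):
    """Similarity transform (scale s, rotation R, translation t) mapping
    detected eye positions to canonical positions, so that an aligned
    point is s * R @ p + t."""
    v = np.subtract(right_eye, left_eye)     # detected inter-eye vector
    w = np.subtract(canon_right, canon_left) # canonical inter-eye vector
    s = np.linalg.norm(w) / np.linalg.norm(v)
    # Rotation taking the direction of v to the direction of w.
    a = np.arctan2(w[1], w[0]) - np.arctan2(v[1], v[0])
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    t = np.asarray(canon_left) - s * R @ np.asarray(left_eye)
    return s, R, t
```

Warping every training and test image with such a transform removes translation, scale and in-plane rotation before any comparison is made.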
Nearest Neighbour • Once the images have been aligned you can simply search for the member of the training set which is nearest to the test image • There are a number of distance measures, including Euclidean distance and cross-correlation
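With Euclidean distance, the whole classifier is a couple of lines (a minimal sketch, assuming faces are already aligned and flattened into the rows of `train`):

```python
import numpy as np

def nearest_neighbour(test, train, labels):
    """Label a test face with the identity of the nearest aligned training
    image (rows of `train`), using Euclidean distance."""
    d = np.linalg.norm(train - test, axis=1)
    return labels[int(np.argmin(d))]
```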
Principal Components • PCA reduces the number of dimensions and so the memory requirement is much reduced. • The search time is also reduced
Two ways to apply PCA (1) • We could apply PCA to the whole training set. • Then each face is represented by a point in the PC space • We could then apply nearest neighbour to these points
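The first approach can be sketched as follows (the SVD-based PCA and the default of k = 10 components are my own choices for illustration):

```python
import numpy as np

def pca_nn(test, train, labels, k=10):
    """Nearest neighbour computed in a k-dimensional PC space instead of
    pixel space: each face is reduced to k coefficients first."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:k]                      # top-k principal axes
    coords = (train - mean) @ basis.T   # every training face as a point
    q = (test - mean) @ basis.T         # the test face in the same space
    return labels[int(np.argmin(np.linalg.norm(coords - q, axis=1)))]
```

Each comparison now touches k numbers instead of every pixel, which is where the memory and search-time savings come from.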
Two ways to apply PCA (2) • Alternatively we could apply PCA to the set of faces belonging to each person in the training set • Each class (person) is then represented by a different ellipsoid and Mahalanobis distance can be used to classify a new unknown face • You need a lot of images of each person to do this
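A sketch of the second approach: each person's images define a mean and covariance (the ellipsoid), and a new face goes to the class with the smallest Mahalanobis distance. The per-class statistics are assumed to be estimated beforehand from many images:

```python
import numpy as np

def mahalanobis_classify(x, class_stats):
    """Assign x to the class (person) with the smallest Mahalanobis
    distance. class_stats maps name -> (mean, covariance), where each
    person's ellipsoid is estimated from many images of that person."""
    best, best_d = None, np.inf
    for name, (mu, cov) in class_stats.items():
        diff = x - mu
        # Squared Mahalanobis distance: diff^T C^-1 diff.
        d = float(diff @ np.linalg.inv(cov) @ diff)
        if d < best_d:
            best, best_d = name, d
    return best
```

Unlike plain Euclidean distance, this weights each direction by how much that person's appearance actually varies along it.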
Problems with PCA • The same person may sometimes appear differently due to • Beards, moustaches • Glasses, • Makeup • These have to be represented by different ellipsoids
Problems with PCA • Differing facial expressions • Opening and closing the mouth • Raised eyebrows • Widening the eyes • Smiling, frowning etc. • These mean that the class is no longer ellipsoidal and must be represented by a manifold
Facial Expressions • There are six basic types of facial expression: Anger, Fear, Disgust, Happy, Sad, Surprise • We could use PCA on the eyes and mouth, giving eigeneyes and eigenmouths
Multiple Poses • Heads must now be aligned in 3D world space • Classes now form trajectories in feature space • It becomes difficult to recognise faces because the variation due to pose is greater than the variation between people
Model-based Recognition • We can fit a model directly to the face image • Model consists of a mesh which is matched to facial features such as the eyes, nose, mouth and edges of the face. • We use PCA to describe the parameters of the model rather than the pixels.
Model-based Recognition • The model copes better with multiple poses and changes in facial expression.