
Hand Signals Recognition from Video Using 3D Motion Capture Archive

Tai-Peng Tian, Stan Sclaroff. Boston University, Computer Science Department.



1. Introduction

Problem Definition : Given a sequence of tracked 2D feature locations, find the best matching 3D motion capture sequence from an archive.

Motivation : Hand signals are commonly used for communication in noisy environments or when people are out of voice range. Examples include directing an airplane to the runway for takeoff, controlling traffic flow, and basketball referee signals.

Contribution : No direct 3D structure estimation is needed. The most relevant work is Parameswaran and Chellappa [1]; we propose a simpler alternative to the 2D-3D motion matching problem that also offers viewpoint invariance.

Figure 1: Basketball referee hand signal.

Assumptions : We focus on the recognition part of the problem, so we assume that the video sequence has been temporally segmented and that the desired 2D feature locations can be reliably tracked over the whole sequence. Within each sequence of 2D features, we further assume that there is only one hand signal.

Why a 3D motion capture archive? The representation is more complete than a 2D representation, since there is no need to sample the motion from multiple views.

2. Algorithm

Overview : 2D vs. 3D sequence alignment using Dynamic Time Warping (DTW). The inputs to the 2D-3D matching algorithm are the 2D feature locations in the image sequence and the 3D feature locations in a motion capture sequence from the archive.

Why DTW? The algorithm provides an optimal alignment between the two sequences, so we do not have to worry about variations in the speed of the motion.

Dissimilarity score : Given at least six pairs of 2D-3D correspondences in a frame, the projection matrix M can be estimated. Given M, the back-projection error of the 3D points is used as the dissimilarity score.

Equation 1: Dissimilarity measure, where P(.) applies the projection matrix.

2D vs. 3D alignment : Once we are able to compute the dissimilarity between a frame of 2D features and a frame of 3D features, the Dynamic Time Warping algorithm can proceed as usual. DTW finds the optimal alignment by minimizing the accumulated dissimilarity cost.

Equation 2: Recursive solution for the DTW alignment.

Figure 2: An example of a DTW matching.
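The poster gives only the caption for Equation 1, but the per-frame dissimilarity it describes can be sketched directly: estimate a 3x4 projection matrix M from at least six 2D-3D correspondences and score the frame pair by the mean back-projection error of the 3D points. The use of the standard unnormalized DLT to estimate M is an assumption (the poster does not say how M is computed), and all function names below are illustrative; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def estimate_projection_matrix(pts2d, pts3d):
    """Estimate a 3x4 projection matrix M from >= 6 2D-3D correspondences
    using the direct linear transform (DLT)."""
    assert len(pts2d) == len(pts3d) and len(pts2d) >= 6
    rows = []
    for (u, v), (x, y, z) in zip(np.asarray(pts2d, float), np.asarray(pts3d, float)):
        X = np.array([x, y, z, 1.0])
        rows.append(np.concatenate([X, np.zeros(4), -u * X]))
        rows.append(np.concatenate([np.zeros(4), X, -v * X]))
    A = np.vstack(rows)
    # M (up to scale) is the right singular vector of A with the smallest
    # singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

def frame_dissimilarity(pts2d, pts3d):
    """Back-projection error used as the per-frame dissimilarity (cf. Equation 1):
    project the 3D points with the estimated M and measure how far they land
    from the observed 2D points."""
    pts2d = np.asarray(pts2d, dtype=float)
    pts3d = np.asarray(pts3d, dtype=float)
    M = estimate_projection_matrix(pts2d, pts3d)
    Xh = np.hstack([pts3d, np.ones((len(pts3d), 1))])  # homogeneous 3D points
    proj = Xh @ M.T
    proj = proj[:, :2] / proj[:, 2:3]                   # perspective divide
    return float(np.linalg.norm(proj - pts2d, axis=1).mean())
```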
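Equation 2 is likewise referenced only by its caption; the recursion it names is assumed here to be the textbook DTW recurrence, D(i, j) = d(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)). A sketch under that assumption, reusing the per-frame cost above:

```python
import numpy as np

def dtw_alignment_score(seq2d, seq3d, frame_cost):
    """Optimal alignment cost between a sequence of 2D feature frames and a
    sequence of 3D motion capture frames (cf. Equation 2).

    seq2d: list of (N, 2) arrays, seq3d: list of (N, 3) arrays,
    frame_cost: per-frame dissimilarity, e.g. frame_dissimilarity above.
    """
    n, m = len(seq2d), len(seq3d)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_cost(seq2d[i - 1], seq3d[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```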
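Recognition then reduces to the nearest neighbor query described under Experiments and Results below: the query sequence is aligned against every prototype in the archive, and the label of the motion capture sequence with the lowest alignment score is returned. The archive layout (a list of labelled sequences) is assumed for illustration, and the helper names come from the sketches above.

```python
def classify_hand_signal(query_seq2d, archive):
    """Nearest neighbor classification over the motion capture archive.

    query_seq2d: list of per-frame (N, 2) feature arrays.
    archive: iterable of (label, list of per-frame (N, 3) feature arrays).
    """
    best_label, best_score = None, float("inf")
    for label, mocap_seq in archive:
        score = dtw_alignment_score(query_seq2d, mocap_seq, frame_dissimilarity)
        if score < best_score:
            best_label, best_score = label, score
    return best_label, best_score
```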
3. Experiments and Results

Data : 45 motion capture sequences of basketball referee gestures from http://mocap.cs.cmu.edu. 2D image features were synthesized from the 3D motion capture sequences using a frontal view and scaled to unit height. Approximately half of the data were used as prototypes in the archive and the other half for testing.

Classifier : Experiments are conducted using the nearest neighbor classifier: given a sequence of 2D features, the 3D motion capture sequence with the lowest alignment score is deemed the best match.

Description of Experiments : Three sets of experiments were conducted with different feature sets at different noise levels. The first experiment uses all 31 features shown in Fig. 1, with increasing noise. The second uses a more realistic set of feature points, indicated by the shaded points in Fig. 1. The last uses only the shaded points in the upper body of Fig. 1.

Significance of the noise parameter : In the synthesized images, the person is of unit height. Suppose we are tracking a person 300 pixels tall; an error margin of 0.06 in normalized coordinates then simulates a tracker that reports tracked points within a 36 pixel radius 95% of the time.

Table 1: Confusion matrix. Each row contains the outcomes of classifying queries drawn from the same category; diagonal entries represent correct classifications.

Figure 4: Classifier performance with respect to increasing noise.

Future Work : Currently there are no temporal constraints on computing the projection matrix from frame to frame. Temporal consistency could be enforced during the matching process to improve robustness.

4. References

[1] V. Parameswaran and R. Chellappa. View invariants for human action recognition. In CVPR, 2003.
