The chains model for detecting object parts by their context
Leonid Karlinsky, Michael Dinerstein, Daniel Harari, and Shimon Ullman, Weizmann Institute of Science
Some slides and images are taken from the authors' CVPR presentation and the internet.
Recap: Problem
• How can we detect parts of an object when:
  • the object is highly deformable, and
  • the part itself doesn't contain much information?
Recap: Goals & Intuition
• Detect a "difficult" part by extending from an "easy" reference part.
• Use the face to disambiguate the hand.
Recap: The chains model
• Pre-processing:
  • Detect the face.
  • Extract features (the set M).
  • T – an ordered subset of the M features (a chain).
• Modeling the posterior:
  • Model the probability of the hand location: P(Lh | F, Lf).
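Schematically, the posterior is obtained by summing, over feature chains T that lead from the reference (face) to the target (hand), the product of pairwise transition terms along each chain. The factorization below is a hedged paraphrase of that idea, not the paper's equation verbatim:

```latex
P(L_h \mid F, L_f) \;\propto\;
  \sum_{T = (t_1, \dots, t_n) \subseteq M}
    P(L_h \mid t_n)\,
    \Big[\prod_{i=1}^{n-1} P(t_{i+1} \mid t_i)\Big]\,
    P(t_1 \mid L_f)
```

Each chain starts at a feature near the detected face, hops between image features, and ends at a feature that votes for the hand location.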
Steps involved
• Given:
  • Training images: images with annotated hand (target) and face (reference) locations.
  • Testing images: images with annotated face locations only; hand locations are unknown.
Feature extraction from test and train images
• Pipeline: image → extract boundary → sample boundary → extract SIFT.
• In the test images, filter out SIFT features that may not be useful for explaining the hand, e.g. features that are far away from the hand.
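The distance-based filtering step can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the choice of reference point (here a single detected landmark such as the face) are assumptions.

```python
import numpy as np

def filter_keypoints(keypoints, reference_xy, max_dist):
    """Keep only keypoints within max_dist pixels of a reference location.

    keypoints    : (N, 2) array of (x, y) keypoint coordinates
    reference_xy : (2,) reference location (e.g. the detected face centre)
    max_dist     : radius in pixels; farther keypoints are dropped
    """
    d = np.linalg.norm(keypoints - np.asarray(reference_xy), axis=1)
    return keypoints[d <= max_dist]
```

In practice the threshold would be scaled by the face size so the filter adapts to the person's distance from the camera.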
Probability computation of test image features
• We need to compute the probabilities of the features in the test image.
• Kernel density estimation is used.
• For each feature from the test image:
  • Retrieve the K nearest features from the training images.
  • Estimate the probability of the test image feature by fitting Gaussian kernels over the retrieved training image features.
[Figure: illustration of kernel density estimation; image taken from sfb649.wiwi.hu-berlin.de]
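The KNN-restricted kernel density estimate described above can be sketched as follows. This is an illustrative version only: the kernel form, the isotropic bandwidth, and the function name are assumptions, not the paper's exact estimator.

```python
import numpy as np

def knn_gaussian_density(query, train_feats, k=5, bandwidth=1.0):
    """Density of a test descriptor from Gaussian kernels on its
    K nearest training descriptors.

    query       : (D,) test-image feature descriptor
    train_feats : (N, D) training-image feature descriptors
    """
    d2 = np.sum((train_feats - query) ** 2, axis=1)   # squared distances
    nearest = np.sort(d2)[:k]                         # K nearest neighbours
    # Average of isotropic Gaussian kernels centred on the K neighbours.
    D = query.shape[0]
    norm = (2.0 * np.pi * bandwidth ** 2) ** (D / 2.0)
    return np.mean(np.exp(-nearest / (2.0 * bandwidth ** 2))) / norm
```

A test feature close to the training cloud gets a high density; an outlier gets a density near zero.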
Inference
• MAP estimate.
• Exact inference: sum over all simple chains in the graph.
• Approximate inference: sum over all chains (limiting the maximum length).
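One way to picture the approximate step: summing over all chains (walks) of bounded length between two feature nodes factorizes into repeated multiplication of a pairwise-score matrix, which is what makes it tractable compared with enumerating simple chains. A sketch under assumed notation (the score matrix W and its entries are illustrative, not the paper's exact quantities):

```python
import numpy as np

def sum_over_chains(W, source, target, max_len):
    """Sum the scores of all walks from `source` to `target` of
    length 1..max_len, where W[i, j] is the pairwise score of step i -> j.

    Walk sums factorize: the (i, j) entry of W^k sums all length-k
    walks from i to j, so one matrix power per length suffices.
    """
    total = 0.0
    Wk = np.eye(W.shape[0])
    for _ in range(max_len):
        Wk = Wk @ W          # Wk now holds sums over length-k walks
        total += Wk[source, target]
    return total
```

Restricting to simple chains (no repeated nodes) breaks this factorization, which is why the exact sum is expensive and the walk-based approximation is used instead.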
Progress
• Used:
  • The VLFeat library for extracting SIFT features.
  • The ANN library for approximate nearest neighbour search.
  • The Viola-Jones face detector for identifying face locations.
• Implemented parts of the algorithm:
  • Probability computation using ANN.
  • Parts of the inference step.
• Received code from the authors by the end of week 4; compiled their code and generated some results.
• Tested with SURF features instead of SIFT.
Experimental details
• Data: the PCG data set from the paper: 12 short movies (about 600 frames each) of people sitting and performing various actions with their hands.
• Training: 11 movies for which the locations of the face and the hand in each frame are known.
• Testing: the 12th movie, for which only the face locations are known.
Experimental details
• Some issues:
  • Memory: Matlab throws an out-of-memory error while loading SIFT features for all the frames of a movie, so for now only the first 100 frames of each movie are used.
  • Time: training takes about 4 hours and testing another 4 hours on the above data set on a 32 GB machine.
Some results
• Hand detection using the face as reference and SIFT features.
• Hand detection using the face as reference and SURF features.
• Face detection using the hand as reference and SIFT features.
Plan
• Get the code working for larger data sets.
• Try other features instead of SIFT.
• Quantitative evaluation.
• Questions to answer:
  • How does the chains model perform as the source-to-target distance increases, e.g. detecting the feet using the face (longer chains)?
  • What happens when there is no easily detectable reference part?
  • What happens when there are multiple reference parts, such as the face and the body centre? Propagate two kinds of chains to the hand.
Progress after midterm
• Memory problem when using bigger data sets:
  • Bottleneck: approximate nearest neighbour (ANN) search.
  • The kd-tree construction needs all features in memory.
  • Possible solution: distributed kd-trees (recent research). Build parts of the kd-tree separately and then combine the parts (non-trivial).
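The basic idea behind the distributed workaround can be sketched as follows: split the training features into chunks, search each chunk independently (here a brute-force numpy search stands in for a per-chunk kd-tree), and merge the per-chunk candidates, so no single index over all features is ever held in memory at once. This is a conceptual sketch, not the distributed kd-tree construction from the literature.

```python
import numpy as np

def chunked_knn(query, chunks, k=3):
    """k nearest neighbours of `query` over several feature chunks.

    Each chunk is searched on its own (in a real system each chunk
    would live on a separate machine with its own kd-tree); only the
    k best candidates per chunk are kept and merged at the end.
    """
    cand_d, cand_idx = [], []
    offset = 0
    for chunk in chunks:
        d = np.linalg.norm(chunk - query, axis=1)
        take = np.argsort(d)[:k]            # k best within this chunk
        cand_d.append(d[take])
        cand_idx.append(take + offset)      # convert to global indices
        offset += len(chunk)
    cand_d = np.concatenate(cand_d)
    cand_idx = np.concatenate(cand_idx)
    best = np.argsort(cand_d)[:k]           # merge step across chunks
    return cand_idx[best], cand_d[best]
```

The merge is exact for k-NN because the global k nearest must each be among some chunk's local k nearest; the hard part in real distributed kd-trees is building one consistent tree, which this sketch avoids entirely.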
Hand detection: some quantitative results on the PCG data set
• Accuracy = Nh / N
  • Nh = number of images where the hand is detected.
  • N = total number of images containing hands.
  • All images have just one person.
  • The hand is considered detected if it is within one half face width of the ground-truth hand location.
• Dense SIFT along edges: 84%
• Dense SIFT over the upper body: 82%
• SIFT key points: 69%
• SURF key points and descriptors: 81%
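The evaluation criterion above can be written out directly; the function name and array layout here are illustrative assumptions.

```python
import numpy as np

def detection_accuracy(pred_xy, gt_xy, face_widths):
    """Fraction of images where the predicted hand lies within half a
    face width of the ground-truth hand location (Nh / N).

    pred_xy, gt_xy : (N, 2) predicted / ground-truth hand locations
    face_widths    : (N,) face width per image, setting the tolerance
    """
    err = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=1)
    hits = err <= 0.5 * np.asarray(face_widths)   # half-face-width criterion
    return hits.mean()
```

Scaling the tolerance by the face width makes the metric invariant to how close the person stands to the camera.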
Detection of the hand (no gloves)
• The algorithm doesn't work that well on the PCG data set; one reason could be that there are very few hand-only (no glove) images in PCG.
• On a data set I created, consisting of videos of people from the lab, the algorithm works very well for detecting hands. I will submit the new results.
Comparison of the chains-model hand detector to the Viola-Jones object detector
• Ground truth: a recorded video of a few people in my lab, annotated with hand and face locations; 800 positive examples and 1000 negative examples.
• Trained the Viola-Jones object detector on this data set.
• Chains model accuracy: 91.2%; Viola-Jones accuracy: 72.6%.
Chains model on fragments
• Divide the image into meaningful fragments; I use superpixels for now (SLIC algorithm).
• Extract features from each fragment; I use the mean of the dense SIFT descriptors belonging to the fragment.
• When region size = 25, accuracy = 91.7%; when region size = 40, accuracy = 61.64% (region size is indicative of the superpixel size).
• The superpixel approach gives good results.
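The per-fragment pooling step can be sketched as follows: given a superpixel label map (e.g. SLIC output) and a dense descriptor at every pixel, average the descriptors falling inside each label. The array shapes and the function name are illustrative assumptions.

```python
import numpy as np

def mean_descriptor_per_fragment(labels, descriptors):
    """Average the dense descriptors inside each superpixel.

    labels      : (H, W) int array of superpixel labels (e.g. from SLIC)
    descriptors : (H, W, D) dense descriptor per pixel (e.g. dense SIFT)
    Returns an (n_fragments, D) array of mean descriptors.
    """
    flat_lab = labels.ravel()
    flat_desc = descriptors.reshape(-1, descriptors.shape[-1])
    n = flat_lab.max() + 1
    sums = np.zeros((n, flat_desc.shape[1]))
    np.add.at(sums, flat_lab, flat_desc)            # per-label sums
    counts = np.bincount(flat_lab, minlength=n)
    return sums / counts[:, None]                   # per-label means
```

Each fragment then enters the chains model as a single node with one pooled descriptor instead of many individual keypoints.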
Conclusions
• Detection of small, deformable objects is tough. The chains model, by using context, performs better than the Viola-Jones object detector.
• Using the chains model on fragments is a promising direction. As the results show, there is an optimal superpixel size for detecting the hand: the fragments have to be meaningful.
• The chains model works on individual images, and even then it achieves good detection. Including motion information may improve it much further.
Future work
• After getting the hand location, segmenting out the hand could be a good application.
• When there is no easily detectable reference part in the image, would an EM-style approach work?
• When there are multiple reference parts, such as the face and the body centre, can we propagate two kinds of chains to the hand? This would reduce the uncertainty in the hand location.