A Scalable Tree-based Approach for Joint Object and Pose Recognition
Kevin Lai¹, Liefeng Bo¹, Xiaofeng Ren², Dieter Fox¹,²
1. University of Washington, Seattle WA, USA
2. Intel Labs, Seattle WA, USA
Motivation
• A unified framework for robust object recognition and pose estimation in real time
• Scale efficiently with the number of objects and poses
• Incremental learning of new objects
(Figure: example images for object recognition and pose estimation. Image sources: Deng et al., CVPR 2009; Muja 2009)
Object and Pose Recognition
(Figure: a query image is labeled at three levels: object category, object instance, and pose)
LEGO Augmented Reality
• L. Bo, J. Fogarty, D. Fox, S. Grampurohit, B. Harrison, K. Lai, N. Landes, J. Lei, P. Powledge, X. Ren, R. Ziola
Scalable Recognition
• Standard object recognition:
  • k-nearest neighbor classifier: scales linearly with the amount of data
  • One-versus-all classifiers, e.g. SVMs: scale linearly with the number of classes
• Scalable recognition:
  • Datasets are getting bigger and bigger, e.g. ImageNet [Deng et al., CVPR 2009]
  • A tree-based approach can scale sublinearly with the number of classes [Bengio et al., NIPS 2010]
Object-Pose Tree
(Figure: a tree with four levels. Category nodes (e.g. Apple, Stapler, Bowl, Cereal) branch into instance nodes (e.g. Chex, Bran Flakes, Striped Bowl, Blue Bowl), which branch into view nodes and then pose nodes. Example classification of a query: Category: Cereal, Instance: Bran Flakes, Pose: 18°)
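To make the scaling argument concrete, here is a minimal sketch of how classification with such a tree could work: at each level only the children of the previously selected node are scored, so the number of evaluated classifiers grows with tree depth and branching factor rather than with the total number of leaves. The node layout and per-child linear scoring are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class Node:
    """One node of an object-pose tree: a label, child nodes, and one linear
    classifier row per child (an assumed, simplified representation)."""
    def __init__(self, label, children=None, child_weights=None):
        self.label = label                  # e.g. "Cereal", "Bran Flakes", "18 deg"
        self.children = children or []      # list of child Node objects
        self.child_weights = child_weights  # array of shape (num_children, feature_dim)

def classify(root, x):
    """Descend the tree; at each level only the current node's children are scored,
    so the cost grows with depth and branching, not with the number of leaves."""
    path, node = [], root
    while node.children:
        scores = node.child_weights @ x     # one dot product per child
        node = node.children[int(np.argmax(scores))]
        path.append(node.label)
    return path                             # [category, instance, view, pose]
```

A flat one-versus-all classifier would instead evaluate one score per object-pose leaf, which is what the tree structure avoids.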
Object-Pose Tree Learning
• Learn the parameters W of all nodes in the tree, given a set of N training samples (features x_i, labels y_i)
• Extract features from both RGB and depth images:
  • Gradient and shape kernel descriptors over RGB and depth images [Bo, Ren, and Fox, NIPS 2010]
• The overall objective function combines the category, instance, and pose label losses (see the sketch below)
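The objective itself did not survive the slide export; a hedged reconstruction from the surrounding description (a regularized sum of the category, instance, and pose losses R_C, R_I, R_P defined on the next slide) is:

```latex
\min_{W} \;\; \frac{\lambda}{2}\,\|W\|^{2}
  \;+\; \sum_{i=1}^{N}\Big[\, R_C(W; x_i, y_i) + R_I(W; x_i, y_i) + R_P(W; x_i, y_i) \,\Big]
```

The relative weighting of the three terms and the form of the regularizer are assumptions, not the paper's exact equation.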
Object-Pose Tree Learning
• Category label loss R_C is the standard multi-class SVM loss with slack variables and hinge loss constraints
• Instance label loss R_I is the max over the loss at the category level and the loss at the instance level [Bengio et al., NIPS 2010]
• Pose label loss R_P is the max over the losses at the category, instance, view, and pose levels
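Spelled out, the nested structure described above would look roughly as follows, where ℓ_C, ℓ_I, ℓ_V, ℓ_P denote the hinge losses at the category, instance, view, and pose levels along a sample's path through the tree; the notation is a reconstruction, not the paper's:

```latex
R_I(W; x_i, y_i) \;=\; \max\!\big(\ell_C(W; x_i, y_i),\; \ell_I(W; x_i, y_i)\big),
\qquad
R_P(W; x_i, y_i) \;=\; \max\!\big(\ell_C,\; \ell_I,\; \ell_V,\; \ell_P\big)
```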
Object-Pose Tree Learning
• Category and instance constraints (in R_C, R_I) are the standard multi-class SVM hinge loss
• View and pose constraints (in R_P):
  • Need to normalize view and pose labels
  • Δ(y_i, y) maps the angle difference from [0, π] to [0, 1]
  • Modified constraints (see the sketch below)
• Minimize the overall objective using Stochastic Gradient Descent (SGD)
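The constraint equations were also lost in the export. A plausible reconstruction, assuming the usual margin-rescaling form in which the required margin at the view and pose levels is the normalized angle difference Δ(y_i, y) ∈ [0, 1], is:

```latex
% category / instance levels: standard multi-class hinge constraints
w_{y_i}^{\top} x_i - w_{y}^{\top} x_i \;\ge\; 1 - \xi_i \qquad \forall\, y \ne y_i
% view / pose levels: margin rescaled by the normalized angle difference (assumed form)
w_{y_i}^{\top} x_i - w_{y}^{\top} x_i \;\ge\; \Delta(y_i, y) - \xi_i \qquad \forall\, y \ne y_i
```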
Incremental Learning
• Stochastic gradient descent (SGD) makes it possible to efficiently update the tree when adding new objects
• Setup:
  • Train a tree with 290 objects offline
  • Add 10 objects and update the tree using SGD from scratch (SGD) or initialized with the previous weights (warm SGD)
• Result: 5-10x speedup while yielding the same accuracy (see the sketch below)
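A minimal sketch of the warm-start idea, assuming a generic subgradient step on a regularized multi-class hinge loss (in the spirit of Pegasos [Shalev-Shwartz, ICML 2007], which the results slides cite); the update rule, step-size schedule, and data layout are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def sgd_multiclass_hinge(X, y, num_classes, W_init=None, iters=5000,
                         lam=1e-4, eta0=0.1, seed=0):
    """SGD on a regularized multi-class hinge loss.
    Passing W_init warm-starts from previously learned weights."""
    n, d = X.shape
    W = np.zeros((num_classes, d)) if W_init is None else W_init.copy()
    rng = np.random.default_rng(seed)
    for t in range(iters):
        i = rng.integers(n)
        scores = W @ X[i]
        wrong = scores.copy()
        wrong[y[i]] = -np.inf
        j = int(np.argmax(wrong))               # most-violating incorrect class
        eta = eta0 / (1.0 + eta0 * lam * t)     # decaying step size
        W *= 1.0 - eta * lam                    # gradient of the L2 regularizer
        if scores[j] + 1.0 > scores[y[i]]:      # hinge constraint violated
            W[y[i]] += eta * X[i]               # pull true-class score up
            W[j]    -= eta * X[i]               # push violating-class score down
    return W

# Warm SGD when adding, e.g., 10 new objects: append zero rows for the new
# classes and continue training from the old weights instead of from scratch.
# W_warm = np.vstack([W_old, np.zeros((10, W_old.shape[1]))])
# W_new  = sgd_multiclass_hinge(X_all, y_all, W_old.shape[0] + 10, W_init=W_warm)
```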
RGB-D Object Dataset
• 250,000 RGB-D image pairs (640x480)
• 300 objects in 51 categories, annotated with ground truth poses
A Large-Scale Hierarchical Multi-View RGB-D Object Dataset (Lai et al., ICRA 2011)
Available at: http://www.cs.washington.edu/rgbd-dataset
Evaluation
• Object-Pose Tree
  • Joint learning of tree parameters
  • Alternative: learn the parameters of each node as independent binary SVMs (Indep Tree)
• K-Nearest Neighbor classifier
  • Exact and approximate
• One-versus-all Support Vector Machine
  • SVM for category and instance recognition
  • Infeasible to train a binary SVM for every pose of every object
  • k-NN within each instance for pose estimation
Results
OPTree: Object-Pose Tree (our approach)
NN: k-nearest neighbors
FLANN: approximate k-nearest neighbors [Muja and Lowe, VISSAPP 2009]
1vsA+NN: one-versus-all SVM for category and instance, k-nearest neighbor within instance for pose
IndepTree: Object-Pose Tree where each level of the tree is learned with a separate multi-class SVM optimization
Example Results
(Figure: query images and their top 5 matches)
Summary
• Tree-based learning and classification framework
  • Jointly performs object category, instance, and pose recognition
  • Scales sub-linearly with the number of objects and poses
  • Online updating of parameters using stochastic gradient descent when adding new objects
• Outperforms existing object recognition approaches in both accuracy and running time on the RGB-D Object Dataset, containing 300 everyday objects
  • Available at: http://www.cs.washington.edu/rgbd-dataset
Incremental Learning
• Additional Experiment 1:
  • Train OPTree with 250 objects offline
  • Add 10 objects at a time for 5 rounds, updating the tree using 5000 iterations of SGD each round
  • Result: warm SGD obtains within 1% of the test accuracy of SGD from scratch
• Additional Experiment 2:
  • Train OPTree from scratch, adding 10 objects at a time for 30 rounds
  • Result: warm SGD with 5000 iterations per round obtains within …
Evaluation
• Category and Instance Recognition
  • Proportion of correctly labeled test samples
• Pose Recognition
  • Difference between predicted and true poses, mapped from [0, π] to [0%, 100%] (see the sketch below)
  • 1. Over all test images (0% pose accuracy if the category or instance is incorrect)
  • 2. Only over test images that were assigned the correct instance label
• Running time per test sample
  • Feature extraction: same for all approaches (1 s)
  • Compare running time (in seconds) for each approach
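A minimal sketch of the pose accuracy metric as described above, assuming poses are given in degrees and that the angular difference is taken modulo 360° and folded into [0, π] before being mapped to a percentage; these conventions are assumptions, not taken from the paper.

```python
import math

def pose_accuracy(pred_deg, true_deg, instance_correct=True):
    """Map the angular error in [0, pi] to an accuracy in [0%, 100%]."""
    if not instance_correct:
        return 0.0                              # variant 1: wrong instance scores 0%
    diff = abs(pred_deg - true_deg) % 360.0
    diff = min(diff, 360.0 - diff)              # fold into [0, 180] degrees
    err = math.radians(diff)                    # angular error in [0, pi]
    return 100.0 * (1.0 - err / math.pi)        # 0 error -> 100%, pi error -> 0%

# Example: predicting 18 degrees against a ground truth of 30 degrees
print(pose_accuracy(18.0, 30.0))                # ~93.3%
```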
Recognition Results
• Data: 42,000 cropped RGB-D image pairs of 300 objects
• Feature extraction: 1 second per test image
NN: nearest neighbor
FLANN: approximate nearest neighbor [Muja and Lowe, VISSAPP 2009]
1vsA: one-versus-all SVM [Shalev-Shwartz, ICML 2007]
+NN: nearest neighbor within instance for pose estimation
+RR: ridge regression within instance for pose estimation
OPTree: our approach
Object-Pose Tree Learning
• As in standard SVM optimization, we replace the Dirac delta function with the hinge loss and introduce slack variables (see the sketch below)
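The corresponding optimization problem was dropped from the slide; the standard multi-class SVM with slack variables that the bullet refers to has the form:

```latex
\min_{W,\;\xi \ge 0} \;\; \frac{\lambda}{2}\,\|W\|^{2} + \sum_{i=1}^{N} \xi_i
\qquad \text{s.t.} \qquad
w_{y_i}^{\top} x_i - w_{y}^{\top} x_i \;\ge\; 1 - \xi_i
\quad \forall\, i,\; \forall\, y \ne y_i
```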
Object Recognition Pipeline
(Figure: image patch features computed over visual and depth data are aggregated with a bag-of-words model into an image-level feature, which is fed to a classifier for object recognition)
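As a rough illustration of the bag-of-words step in such a pipeline (a generic sketch, not the gradient and shape kernel descriptors the paper actually uses), patch-level features can be quantized against a learned codebook and pooled into a histogram that serves as the image-level feature passed to the classifier:

```python
import numpy as np

def build_codebook(patch_features, k=200, iters=10, seed=0):
    """Toy k-means codebook over patch descriptors (visual and/or depth)."""
    rng = np.random.default_rng(seed)
    centers = patch_features[rng.choice(len(patch_features), k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((patch_features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            members = patch_features[assign == j]
            if len(members):
                centers[j] = members.mean(0)    # move codeword to cluster mean
    return centers

def bow_feature(patch_features, centers):
    """Quantize patches to the nearest codeword and pool into a normalized histogram."""
    d = ((patch_features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)
```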
RGB-D Object Dataset
• 300 objects in 51 categories
• 250,000 RGB-D frames total (640x480)
• 8 video sequences containing these objects (home and office environments)
A Large-Scale Hierarchical Multi-View RGB-D Object Dataset (Lai et al., ICRA 2011)
http://www.cs.washington.edu/rgbd-dataset
RGB-D Scenes
A Large-Scale Hierarchical Multi-View RGB-D Object Dataset (Lai et al., ICRA 2011)
http://www.cs.washington.edu/rgbd-dataset
Recognition Results
• Data: 42,000 cropped RGB-D image pairs of 300 objects
• Feature extraction: 1 second per test image
OPTree: Object-Pose Tree (our approach)
NN: k-nearest neighbor
FLANN: approximate k-nearest neighbor [Muja and Lowe, VISSAPP 2009]
Recognition Results
OPTree: Object-Pose Tree (our approach)
1vsA+NN: one-versus-all SVM for category and instance, nearest neighbor within instance for pose
Recognition Results
OPTree: Object-Pose Tree (our approach)
OPTree+NN: OPTree for category and instance, nearest neighbor within instance for pose
IndepTree: Object-Pose Tree where each level of the tree is learned with a separate multi-class SVM optimization
Generalization to Novel Objects
• Leave-sequence-out evaluation
  • 94.3% - category recognition accuracy
  • 53.5% - pose accuracy, penalizing incorrect category and instance
  • 56.8% - pose accuracy with category correct, penalizing incorrect instance
  • 67.1% - pose accuracy with category correct, estimating pose even if the instance is incorrect
  • 68.3% - pose accuracy with instance correct
• Leave-object-out evaluation (train on 249 objects, test on 51)
  • 84.4% - category recognition accuracy
  • 52.8% - pose accuracy, penalizing incorrect category and instance
  • 62.5% - pose accuracy with category correct, estimating pose even if the instance is incorrect
Motivation
• A unified framework for robust object recognition and pose estimation in real time
  • Existing approaches have considered these tasks in isolation
• Scale efficiently with the number of objects and poses
• Incremental learning of new objects