Bilinear Classifiers for Visual Recognition
Hamed Pirsiavash, Deva Ramanan, Charless Fowlkes
Computational Vision Lab, University of California Irvine
To be presented at NIPS 2009
Introduction
• Linear model for visual recognition
• Extract features from a window on the input image
• Reshape: the feature vector is x = X(:), where X is the n_y n_x × n_f matrix of features (n_f features per cell of an n_y × n_x grid)
Introduction
• Linear classifier: label a window positive when w^T x > 0
• Learn a template w (the model)
• Apply it to all possible windows, each represented by a feature vector x
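The linear decision rule above can be sketched in a few lines of NumPy (the shapes below are illustrative placeholders, not the paper's actual template size):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature matrix for one window: n_s spatial cells x n_f features per cell.
n_s, n_f = 6, 4
X = rng.standard_normal((n_s, n_f))   # features extracted from a window
W = rng.standard_normal((n_s, n_f))   # learned template of the same shape

# Reshape (Matlab's (:), i.e. column-major order) turns both into vectors.
x = X.flatten(order="F")
w = W.flatten(order="F")

# Linear classifier: the window is labeled positive when w^T x > 0.
score = w @ x
is_positive = score > 0
```

Since both matrices are flattened the same way, the dot product equals np.sum(W * X), so no explicit reshape is needed at run time.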
Introduction
• Reshape the template w into a matrix W of size n_s × n_f, where n_s = n_x n_y
• Bilinear restriction: W = W_s W_f^T, with W_s of size n_s × d, W_f of size n_f × d, and d <= min(n_s, n_f)
Introduction
• Motivation for bilinear models
• Reduced rank: fewer parameters
• Better generalization: less over-fitting
• Run-time efficiency
• Transfer learning: share a subset of parameters between different but related tasks
Outline • Introduction • Sliding window classifiers • Bilinear model and its motivation • Extension • Related work • Experiments • Pedestrian detection • Human action classification • Conclusion
Sliding window classifiers
• Extract visual features from a spatio-temporal window
  • e.g., histograms of gradients (HOG), as in Dalal and Triggs' method
• Train a linear SVM on annotated positive and negative instances:
  min_w (1/2) w^T w + C sum_n max(0, 1 - y_n w^T x_n)
• Detection: evaluate the model on all possible windows in the space-scale domain, labeling a window positive when w^T x > 0
  • Use convolution, since the model is linear
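The SVM objective above can be evaluated directly on toy data as a sanity check (the vectors and labels below are made up for illustration):

```python
import numpy as np

def svm_objective(w, X, y, C):
    """L(w) = 1/2 ||w||^2 + C * sum_n max(0, 1 - y_n w^T x_n)."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * hinge.sum()

w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0],    # positive example, margin 2 -> no loss
              [0.5, 0.0]])   # negative example, violated -> hinge 1.5
y = np.array([1.0, -1.0])
obj = svm_objective(w, X, y, C=1.0)  # 0.5*||w||^2 + 1.5 = 2.0
```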
Sliding window classifiers
[Figure: a sample image, its extracted features, and a sample template W]
Bilinear model (Definition)
• Visual data are better modeled as matrices/tensors than as vectors
• Why not exploit the matrix structure directly?
Bilinear model (Definition)
• Reshape the features of a window into a matrix X of size n_s × n_f (and the model w into a matrix W of the same size); then the linear score is
  w^T x = tr(W^T X)
• Bilinear restriction: factor the template as W = W_s W_f^T, with W_s of size n_s × d and W_f of size n_f × d, where d <= min(n_s, n_f) bounds the rank of W
• The score becomes w^T x = tr(W^T X) = tr(W_f W_s^T X), which is linear in W_s when W_f is fixed and linear in W_f when W_s is fixed
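The equivalence between the vectorized score and the two trace forms is easy to verify numerically (sizes below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_f, d = 8, 5, 2           # d <= min(n_s, n_f): the rank restriction

W_s = rng.standard_normal((n_s, d))
W_f = rng.standard_normal((n_f, d))
W = W_s @ W_f.T                  # bilinear template, rank at most d
X = rng.standard_normal((n_s, n_f))

# Three equivalent ways to write the score of a window:
score_vec = W.flatten(order="F") @ X.flatten(order="F")  # w^T x
score_tr = np.trace(W.T @ X)                             # tr(W^T X)
score_bi = np.trace(W_f @ W_s.T @ X)                     # tr(W_f W_s^T X)
```

The last form follows from the cyclic property of the trace: tr((W_s W_f^T)^T X) = tr(W_f W_s^T X).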
Bilinear model (Learning)
• Linear SVM for a given set of training pairs {X_n, y_n}:
  min_W (1/2) tr(W^T W) + C sum_n max(0, 1 - y_n tr(W^T X_n))
  (objective function = regularizer + hinge-loss constraints, with trade-off parameter C)
• Substitute the bilinear form W = W_s W_f^T
Bilinear model (Learning)
• Bilinear formulation:
  min_{W_s, W_f} (1/2) tr(W_f W_s^T W_s W_f^T) + C sum_n max(0, 1 - y_n tr(W_f W_s^T X_n))
• Biconvex, so solve by coordinate descent
• Fixing one set of parameters leaves a typical SVM problem (with a change of basis)
• Use an off-the-shelf SVM solver in the loop
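The alternation can be sketched on synthetic data. This is a toy stand-in, not the paper's implementation: a plain gradient solver on a squared hinge (smooth, so small fixed steps decrease the objective) replaces the off-the-shelf SVM solver, and all sizes and data are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_f, d, N, C = 6, 4, 2, 30, 1.0

# Toy training set: X_n is an n_s x n_f feature matrix, y_n in {-1, +1}.
Xs = rng.standard_normal((N, n_s, n_f))
ys = np.sign(rng.standard_normal(N))

def objective(W_s, W_f):
    W = W_s @ W_f.T
    margins = ys * np.einsum("ij,nij->n", W, Xs)   # tr(W^T X_n) per example
    hinge2 = np.maximum(0.0, 1.0 - margins) ** 2   # squared hinge (smooth)
    return 0.5 * np.sum(W * W) + C * hinge2.sum()

def grad_step(W_s, W_f, lr=1e-3, fix="f"):
    # Gradient w.r.t. the full template, then chain rule into one factor.
    W = W_s @ W_f.T
    margins = ys * np.einsum("ij,nij->n", W, Xs)
    slack = np.maximum(0.0, 1.0 - margins)
    dW = W - 2.0 * C * np.einsum("n,nij->ij", ys * slack, Xs)
    if fix == "f":                     # update W_s with W_f frozen
        return W_s - lr * dW @ W_f, W_f
    return W_s, W_f - lr * dW.T @ W_s  # update W_f with W_s frozen

W_s = rng.standard_normal((n_s, d)) * 0.1
W_f = rng.standard_normal((n_f, d)) * 0.1
start = objective(W_s, W_f)
for it in range(100):                  # alternate the two convex subproblems
    for _ in range(5):
        W_s, W_f = grad_step(W_s, W_f, fix="f")
    for _ in range(5):
        W_s, W_f = grad_step(W_s, W_f, fix="s")
end = objective(W_s, W_f)
```

With one factor frozen the subproblem is convex in the other, which is what lets each half-step be handed to a standard SVM solver in the actual method.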
Motivation
• Regularization: W_f spans a d-dimensional feature subspace (W = W_s W_f^T, d <= min(n_s, n_f)), similar to PCA, but not orthogonal and learned discriminatively, jointly with the reduced-dimensional template
• Run-time efficiency: d convolutions instead of n_f (project the features onto the subspace once, then convolve with the reduced template)
Motivation
• Transfer learning: share the subspace W_f between different problems (W = W_s W_f^T)
  • e.g., a human detector and a cat detector
• Optimize the sum of all the objective functions
• Learn a good subspace using all the data
Extension
• Multi-linear: high-order tensors
• Optimize L(W_x, W_y, W_f) instead of just L(W)
• For a 1-D feature, a rank-1 restriction gives a separable spatial filter
• Spatio-temporal templates
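The separable-filter claim for the rank-1 case can be checked numerically: correlating with the outer product w_y w_x^T equals a 1-D correlation along rows followed by one along columns. A small sketch with a hand-rolled 2-D correlation (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.standard_normal((10, 12))     # single-channel feature map
w_y = rng.standard_normal(3)
w_x = rng.standard_normal(4)
K = np.outer(w_y, w_x)                # rank-1 (separable) template

def corr2d(img, ker):
    """Dense 'valid' 2-D cross-correlation."""
    kh, kw = ker.shape
    H = img.shape[0] - kh + 1
    W = img.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * ker)
    return out

full = corr2d(I, K)                   # one 2-D correlation: O(kh*kw) per output

# Separable version: 1-D correlation along rows, then along columns.
rows = np.array([np.correlate(r, w_x, mode="valid") for r in I])
sep = np.array([np.correlate(c, w_y, mode="valid") for c in rows.T]).T
```

The separable version costs O(kh + kw) per output instead of O(kh * kw), which is the run-time advantage a rank restriction buys.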
Related work (Rank restriction)
• Bilinear models
  • Often used to increase flexibility; we use them to reduce the number of parameters
  • Mostly used in generative models such as density estimation; we use them for classification
• Soft rank restriction
  • Uses the trace norm tr((W^T W)^{1/2}) rather than tr(W^T W) in the SVM to regularize the rank
  • Convex, but not easy to solve
  • Decreases the sum of the singular values rather than the number of non-zero singular values (the rank)
• Wolf et al. (CVPR'07)
  • Used a formulation similar to ours with a hard rank restriction
  • Showed results only for the soft rank restriction
  • Used it only for one task (did not consider multi-task learning)
Related work (Transfer learning)
• Dates back at least to Caruana's work (1997)
  • We were inspired by their work on multi-task learning
  • Applied to back-propagation nets and k-nearest neighbor
• Ando and Zhang's work (2005)
  • Linear model
  • All models share a component in a low-dimensional subspace (transfer)
  • Uses the same number of parameters
Experiments: Pedestrian detection
• Baseline: Dalal and Triggs' spatio-temporal classifier (ECCV'06)
• Linear SVM on 8 × 8 cells with 84 features per cell:
  • Histogram of gradients (HOG)
  • Histogram of optical flow
• Modified the learning method to make sure the spatio-temporal classifier beats the static one
Experiments: Pedestrian detection
• Datasets: INRIA motion and INRIA static
  • 3400 video frame pairs
  • 3500 static images
• Typical values: n_s = 14 × 6, n_f = 84, d = 5 or 10
• Evaluation: average precision
• Initialize with PCA in feature space
• Ours is 10 times faster
Experiments: Human action classification 1
• 1-vs-all action templates
• Voting: a second SVM on the confidence values
• Dataset: UCF Sports Action (CVPR 2008)
  • They obtained 69.2%; we got 64.8%, but with:
  • More classes: 12 rather than 9
  • A smaller dataset: 150 videos rather than 200
  • A harder evaluation protocol: 2-fold cross-validation vs. LOOCV (87 training examples rather than their 199)
Results: UCF action
• Linear: 0.518 (not always feasible)
• PCA on features: 0.444
• Bilinear: 0.648
Experiments: Human action classification 2
• Transfer setting
• Only two training examples for each of the 12 action classes
• First trained independently, then trained jointly with a shared subspace
• The C parameter was tuned for the best result
Transfer
• Model for task m: W^m = W_s^m W_f^T, with the subspace W_f shared across tasks
• Use PCA to initialize W_f, then alternate: train each task's W_s^m, then update the shared W_f by minimizing the summed objective
  min_{W_f} sum_m L(W_f, W_s^m)
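One way to see what the shared subspace buys is to count parameters. The sizes below are illustrative (they reuse the n_s = 84, n_f = 84 order of magnitude from the pedestrian experiments, with M = 12 tasks as in the action experiments, and are not the paper's exact configuration):

```python
n_s, n_f, d, M = 84, 84, 10, 12   # illustrative sizes, not the paper's exact ones

full_linear = M * n_s * n_f            # one full-rank template per task
independent = M * (n_s + n_f) * d      # bilinear, each task has its own W_s, W_f
shared = M * n_s * d + n_f * d         # bilinear with a single shared W_f
```

With these numbers the shared model stores roughly half the parameters of independent bilinear models and about an eighth of the full linear ones, which is why a common W_f helps when each task has very few training examples.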
Results: Transfer Average classification rate
Results: Transfer (for “walking”) Iteration 1 Refined at Iteration 2
Conclusion
• Introduced multi-linear classifiers
  • Exploit the natural matrix/tensor representation of spatio-temporal data
  • Trained with existing efficient linear solvers
• Shared subspace across different problems: a novel form of transfer learning
• Better performance and about a 10X run-time speed-up compared to the linear classifier
• Easy to apply to most high-dimensional features (in place of dimensionality-reduction methods like PCA)
• Simple: ~20 lines of Matlab code
Bilinear model (Learning details)
• Linear SVM for a given set of training pairs {x_n, y_n}:
  min_W L(W) = (1/2) tr(W^T W) + C sum_n max(0, 1 - y_n tr(W^T X_n))
• Bilinear formulation:
  min_{W_s, W_f} L(W_s, W_f) = (1/2) tr(W_f W_s^T W_s W_f^T) + C sum_n max(0, 1 - y_n tr(W_f W_s^T X_n))
• It is biconvex, so solve by coordinate descent
Bilinear model (Learning details)
• Each coordinate descent iteration:
  • Freeze W_f and solve a standard SVM in W~_s = W_s A^{1/2}, where A = W_f^T W_f and the data are transformed as X~_n = X_n W_f A^{-1/2}, since
    tr(W^T W) = ||W~_s||_F^2 and tr(W^T X_n) = tr(W~_s^T X~_n)
  • Then freeze W_s and solve the symmetric problem in W~_f = W_f B^{1/2}, where B = W_s^T W_s and X~_n = X_n^T W_s B^{-1/2}
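The change of basis can be verified numerically: both the regularizer and the per-example score are unchanged after the transformation (the matrix square root is taken via an eigendecomposition of the symmetric positive-definite A; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_f, d = 8, 5, 3
W_s = rng.standard_normal((n_s, d))
W_f = rng.standard_normal((n_f, d))
X = rng.standard_normal((n_s, n_f))
W = W_s @ W_f.T

# A = W_f^T W_f is symmetric positive definite here (W_f has full column
# rank with probability one), so A^{1/2} comes from its eigendecomposition.
A = W_f.T @ W_f
lam, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(lam)) @ V.T
A_half_inv = V @ np.diag(1.0 / np.sqrt(lam)) @ V.T

W_s_t = W_s @ A_half                  # the transformed template W~_s
X_t = X @ W_f @ A_half_inv            # the transformed data X~_n

reg = np.trace(W.T @ W)               # original regularizer tr(W^T W)
reg_t = np.sum(W_s_t * W_s_t)         # ||W~_s||_F^2
score = np.trace(W.T @ X)             # original score tr(W^T X)
score_t = np.trace(W_s_t.T @ X_t)     # tr(W~_s^T X~_n)
```

Because both quantities match, the subproblem in W~_s is exactly a standard linear SVM on the transformed data, which is what lets an off-the-shelf solver run inside the loop.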