Ken Chatfield James Philbin Andrew Zisserman

Efficient Retrieval of Deformable Shape Classes using Local Self-Similarities
Ken Chatfield, James Philbin, Andrew Zisserman
University of Oxford
NORDIA ’09, 27 September 2009

Objective
Goal:
• Fast and accurate retrieval based on abstract shape
• Example: efficiently extract the shape from the images below
Use the descriptor of Shechtman and Irani:
• Extend the descriptor to provide fast shape matching
• Incorporate it into a scalable shape-based retrieval framework
• Theme: efficiency
[Figure: abstract shapes. Shechtman and Irani [CVPR ’07]]

The Self-Similarity Descriptor – Review
1. Generate correlation surface (SSD)
2. Bin maximum similarities
[Figure: panels (a) to (d), showing regions of high and low self-similarity on abstract shapes. Shechtman and Irani [CVPR ’07]]
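The two review steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the patch and region sizes, the bin counts and the constant `var_noise` normaliser are all assumptions chosen to keep the example small.

```python
import math

def self_similarity_descriptor(img, cy, cx, patch=1, radius=4,
                               n_rad=2, n_ang=4, var_noise=1.0):
    """Sketch of a local self-similarity descriptor centred on pixel (cy, cx).

    img is a 2D list of grey values. All parameter names and defaults are
    illustrative assumptions, not values taken from the slides.
    """
    def ssd(oy, ox):
        # Sum of squared differences between the central patch and the
        # patch displaced by (oy, ox).
        total = 0.0
        for dy in range(-patch, patch + 1):
            for dx in range(-patch, patch + 1):
                d = img[cy + dy][cx + dx] - img[cy + oy + dy][cx + ox + dx]
                total += d * d
        return total

    bins = [0.0] * (n_rad * n_ang)
    for oy in range(-radius, radius + 1):
        for ox in range(-radius, radius + 1):
            r = math.hypot(oy, ox)
            if r == 0 or r > radius:
                continue
            # Step 1: correlation surface value (SSD mapped to a similarity).
            sim = math.exp(-ssd(oy, ox) / var_noise)
            # Step 2: log-polar bin index, keeping the maximum per bin.
            rad_bin = min(n_rad - 1,
                          int(n_rad * math.log(r + 1) / math.log(radius + 1)))
            ang = math.atan2(oy, ox) % (2 * math.pi)
            ang_bin = min(n_ang - 1, int(ang / (2 * math.pi) * n_ang))
            idx = rad_bin * n_ang + ang_bin
            bins[idx] = max(bins[idx], sim)
    return bins

flat = [[5.0] * 11 for _ in range(11)]                 # homogeneous region
ramp = [[float(x) for x in range(11)] for _ in range(11)]  # horizontal gradient
flat_desc = self_similarity_descriptor(flat, 5, 5)
ramp_desc = self_similarity_descriptor(ramp, 5, 5)
```

On the flat image every displaced patch matches the centre exactly, whereas on the ramp only vertical displacements match, so some bins stay close to zero.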

Implicit Shape Model – Review
Objective:
• Use descriptor data to find the location of the query shape in the target image
• Account for non-rigid deformation
Approach:
• Incorporate our set of descriptors into the ISM (descriptors manually selected for now)
• Apply the Generalised Hough Transform:
  • Store offsets to an arbitrary object centre for descriptors in the query
  • Find putative matches in the target
  • Apply the same offsets: (x, y) → (xc, yc)
• Identify modes in the Hough voting space:
  • Apply the Parzen window method with a Gaussian basis
[Figure: query and target images with corresponding points labelled a to j]
Leibe, Leonardis and Schiele [ECCV ’04]
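The voting scheme above can be illustrated with a small sketch. It assumes putative matches are already available as (query index, target position) pairs, and it evaluates the Gaussian Parzen density only at the vote locations instead of running a full mode search. The function names and the `sigma` value are hypothetical.

```python
import math

def hough_votes(query_pts, centre, matches):
    """Cast one centre vote per putative match.

    query_pts: list of (x, y) descriptor positions in the query.
    centre:    arbitrary object centre (xc, yc) chosen in the query.
    matches:   list of (query_index, (x, y) position in the target).
    """
    votes = []
    for qi, (tx, ty) in matches:
        qx, qy = query_pts[qi]
        # Apply the stored query offset to the matched target point.
        votes.append((tx + centre[0] - qx, ty + centre[1] - qy))
    return votes

def parzen_mode(votes, sigma=5.0):
    """Return the vote with the highest Gaussian Parzen density, and that density."""
    def density(p):
        return sum(
            math.exp(-((p[0] - v[0]) ** 2 + (p[1] - v[1]) ** 2) / (2 * sigma ** 2))
            for v in votes
        )
    best = max(votes, key=density)
    return best, density(best)

query_pts = [(0, 0), (10, 0), (0, 10)]
centre = (5, 5)
# Three consistent matches (shape shifted by (100, 50)) plus one outlier.
matches = [(0, (100, 50)), (1, (110, 50)), (2, (100, 60)), (0, (300, 300))]
mode, score = parzen_mode(hough_votes(query_pts, centre, matches))
```

The three consistent matches all vote for the same centre, so the mode lands there and the outlier contributes almost nothing to its density.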

Datasets
• ETHZ Deformable Shape Classes used as our primary test dataset
• Four main deformable object classes used
• 254 images in total
Illustrates some of the variation we want to account for:
• Abstract shape representation → accounted for by the descriptor
• Changes in scale → multiple Hough voting passes
• Non-rigid deformation → accounted for in the ISM (not explicitly accounted for in S&I)

Searching over Scale
How does this all fit together?
1. Select query points
2. Accumulate centre votes
3. Select match scale
4. Establish point correspondences
[Figure: query and target images; labels: support radius; scales 0.6, 0.8, 1.3]
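One way to read steps 2 and 3 is as one Hough voting pass per candidate scale, keeping the scale whose votes agree best. The sketch below assumes that interpretation; `best_scale`, the peak score and the `sigma` value are illustrative names and settings, not taken from the slides.

```python
import math

def peak_density(votes, sigma=3.0):
    """Highest Gaussian Parzen density found at any vote location."""
    def density(p):
        return sum(
            math.exp(-((p[0] - v[0]) ** 2 + (p[1] - v[1]) ** 2) / (2 * sigma ** 2))
            for v in votes
        )
    return max(density(v) for v in votes)

def best_scale(query_pts, centre, matches, scales):
    """Run one voting pass per scale and return (best scale, all scores)."""
    scores = {}
    for s in scales:
        votes = []
        for qi, (tx, ty) in matches:
            qx, qy = query_pts[qi]
            # Query offsets are stretched by the candidate scale before voting.
            votes.append((tx + s * (centre[0] - qx), ty + s * (centre[1] - qy)))
        scores[s] = peak_density(votes)
    return max(scores, key=scores.get), scores

query_pts = [(0, 0), (10, 0), (0, 10), (10, 10)]
centre = (5, 5)
# Synthetic target: the query shape enlarged by 1.3 and shifted by (20, 30).
matches = [(i, (20 + 1.3 * x, 30 + 1.3 * y)) for i, (x, y) in enumerate(query_pts)]
scale, scores = best_scale(query_pts, centre, matches, [0.6, 0.8, 1.0, 1.3])
```

At the correct scale all four votes coincide, so that pass produces the sharpest peak; at the other scales the votes spread out and the peak is lower.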

Non-rigid Deformation
• The ‘support radius’ can be used to implicitly account for non-rigid deformation
• Example: larger radius used
[Figure: query and target]

Improving Efficiency
OUTLINE:
1) Basic Shape Matching Review
2) Improving Efficiency
3) Scalable and Efficient Retrieval
4) Evaluation of Results

Improving Efficiency
Improve efficiency in two main ways:
• Cut down the number of descriptors used for matching
• Incorporate into an efficient retrieval system using visual words and ‘Video Google’ ideas (covered in the next part)
1. Shape Matching
Objective:
• Instead of manually defining points, the user selects an ROI in the query
• The system should then return regions containing the same shape in the target
• Naïve approach → dense sampling
• The ISM is well suited to this if the descriptors are all sufficiently informative, BUT it is computationally expensive
[Figure: query and target]

Efficiency-oriented Descriptor Selection
Instead, cut down the number of descriptors by:
• Eliminating homogeneous descriptors, as in S&I
• Applying 2NN thresholding:
  • 85% reduction in descriptor count
  • 97.75% reduction in runtime
Without negatively impacting matching performance
[Figure: query, target and result]
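The slide does not spell out either filter, so the following sketch rests on two assumptions: a descriptor is treated as homogeneous when all of its bins show high similarity, and ‘2NN thresholding’ is read as a ratio test between the nearest and second-nearest neighbour distances. The thresholds and function names are hypothetical.

```python
import math

def is_homogeneous(desc, thresh=0.9):
    """Assumed rule: every bin is highly self-similar, so the descriptor is uninformative."""
    return min(desc) >= thresh

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ratio_test_matches(query, target, ratio=0.8):
    """Keep a match only if the nearest neighbour is clearly better than the second."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((euclidean(q, t), ti) for ti, t in enumerate(target))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches

query = [
    [1.0, 1.0, 1.0, 1.0],   # homogeneous: dropped before matching
    [1.0, 0.0, 0.0, 0.0],   # distinctive: one clear nearest neighbour
    [0.5, 0.5, 0.0, 0.0],   # ambiguous: two equally close neighbours
]
target = [
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.0],
]
kept = [d for d in query if not is_homogeneous(d)]
matches = ratio_test_matches(kept, target)
```

Only the distinctive descriptor survives both filters, which is the kind of pruning that reduces the number of votes the ISM stage has to process.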

What about the descriptor?
• Technically, descriptors must be recomputed for each match scale
• However, they exhibit a degree of inbuilt scale invariance due to:
  (i) use of correlation patches as the basic unit
  (ii) log-polar binning
• Therefore, the same descriptors are used for all scales → further efficiency gain
• Log-polar binning also helps tolerance to non-rigid deformation

Shape-based Retrieval
OUTLINE:
1) Basic Shape Matching Review
2) Improving Efficiency
3) Scalable and Efficient Retrieval
4) Evaluation of Results
Sivic and Zisserman [ICCV ’03], Nister and Stewenius [CVPR ’06], Chum et al. [ICCV ’07], Philbin et al. [CVPR ’07], Jegou et al. [ECCV ’08]

Text Retrieval Approach – Review
1. Train Vocabulary: develop a visual analogue to the textual word
  1. descriptor quantization
  2. cluster centres (K-means clustering)
  3. visual word assignment
2. Use Vocabulary
  4. bag-of-words vector (generated offline)
Bag-of-words formulation allows application of standard IR techniques
Based on Sivic and Zisserman [ICCV ’03]
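A toy version of the vocabulary pipeline above is sketched here with a plain k-means loop on 2D points. Real descriptors are higher-dimensional and vocabularies far larger; the naive initialisation from the first k points and all function names are assumptions made for brevity.

```python
from collections import Counter

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centres, p):
    """Index of the cluster centre (visual word) closest to descriptor p."""
    return min(range(len(centres)), key=lambda i: sq_dist(centres[i], p))

def kmeans(points, k, iters=10):
    """Plain k-means; centres start at the first k points (naive, for illustration)."""
    centres = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centres, p)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centre if a cluster becomes empty
                centres[i] = [sum(c) / len(g) for c in zip(*g)]
    return centres

def bag_of_words(centres, descriptors):
    """Histogram of visual word occurrences for one image."""
    return Counter(nearest(centres, d) for d in descriptors)

training = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
vocab = kmeans(training, k=2)
bow = bag_of_words(vocab, [(0.2, 0.2), (10.5, 10.5), (0, 0.5)])
```

The resulting counter is the bag-of-words vector that can be computed offline for every database image.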

Visual Words – Examples
• Vocabulary size: 1,000 to 10,000 words
• Training set: 254 images of the ETHZ shape classes dataset
* Points highlighted in green in each image indicate occurrences of the given visual word

Using Visual Words
Ranking:
• Use standard tf-idf architecture to rank
• Given weighted vectors, only a single scalar product is needed for each of our N images to rank
• If images contain an average of W unique visual word ‘term frequencies’ → O(NW)
• Compare to the complexity of the matching stage: O(ND²L)
Matching:
• Retain spatial location of visual word occurrences
• Descriptors are then effectively pre-matched offline (descriptors assigned to the same visual word in both images count as matches)
• Complexity now reduces to O(W²) instead of O(D²L) per image
Notation:
N – number of images
W – average number of visual words per image
D – average number of descriptors per image (D > W)
L – number of dimensions of the descriptor itself
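The ranking step can be sketched as follows, assuming the usual tf-idf weighting (term frequency times log inverse document frequency) with L2-normalised sparse vectors. The helper names are hypothetical, and each image is represented simply as a list of visual word ids.

```python
import math
from collections import Counter

def _normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
    return {w: x / norm for w, x in vec.items()}

def tfidf_vectors(docs):
    """docs: one list of visual word ids per image. Returns (vectors, idf)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append(_normalise({w: (c / len(d)) * idf[w] for w, c in tf.items()}))
    return vecs, idf

def rank(query_words, vecs, idf):
    """One sparse scalar product per image, then sort by descending score."""
    tf = Counter(query_words)
    q = _normalise({w: (c / len(query_words)) * idf.get(w, 0.0)
                    for w, c in tf.items()})
    scores = [sum(q.get(w, 0.0) * x for w, x in v.items()) for v in vecs]
    order = sorted(range(len(vecs)), key=lambda i: -scores[i])
    return order, scores

docs = [[1, 1, 2], [3, 4, 4], [1, 3, 5]]
vecs, idf = tfidf_vectors(docs)
order, scores = rank([1, 2], vecs, idf)
```

Each score touches only the words stored for that image, which is where the O(NW) cost quoted above comes from.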

Evaluation
OUTLINE:
1) Basic Shape Matching Review
2) Improving Efficiency
3) Scalable and Efficient Retrieval
4) Evaluation of Results

Matching Results
Objective:
• Establish localisation performance within images where the query exists
Experiment:
• Four queries within each of the four shape classes of the ETHZ dataset
• Tested for the proportion of images where, for the actual and estimated bounding boxes in the target: intersection / union > 0.8
these results despite…
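The localisation criterion above is the intersection-over-union ratio of two boxes. A minimal sketch follows, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_localised(actual, estimated, thresh=0.8):
    """True when the estimate meets the overlap threshold used on the slide."""
    return iou(actual, estimated) > thresh

actual = (0, 0, 10, 10)
close = (1, 0, 11, 10)   # overlap 90, union 110
far = (5, 0, 15, 10)     # overlap 50, union 150
```

A one-pixel shift of a 10 by 10 box still passes the 0.8 threshold, while a half-box shift does not.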

Matching Results

Retrieval Results
• Retrieval algorithm tested over the whole of the ETHZ dataset
• Vocabulary of 10,000 words trained
• At low recall → good performance
• Compared to Philbin et al.’s ‘Visual Google’ with Hessian-Affine detections and SIFT descriptors
[Figure: results for (a) swans, (b) bottles, (c) apple-logos, (d) mugs]

Video Retrieval
• Episode of Lost used to test scalability
• Chosen due to lack of large-scale shape-based datasets
• Retrain a new vocabulary of 10,000 words
• Particularly challenging: query present in only 84/2,721 frames, subject to a variety of affine deformations
• Again, at low recall → algorithm performs well
• Completes in < 3 seconds (unoptimised MATLAB)
[Figure: query]

Conclusion
• Presented self-similarity based approach to matching deformable shape classes
• Demonstrated fast and efficient visual word-based retrieval scheme → outperforms SIFT across shape-based datasets
Future work
• Further develop retrieval stage
  • More advanced text retrieval techniques, spatial verification
• Incorporate into other representations of deformation
  • Deformable object recognition scheme of Ferrari et al.
  • Registration schemes of Gay-Bellile et al. or Pilet et al.
