
Efficient Visual Search for Objects in Videos



Presentation Transcript


  1. Efficient Visual Search for Objects in Videos Josef Sivic and Andrew Zisserman Presenters: Ilge Akkaya & Jeannette Chang March 1, 2011

  2. Introduction Image Query Text Query Results: Frames Results: Documents Generalize text retrieval methods to non-textual information

  3. State-of-the-Art before this paper… • Text-based search for images (Google Images) • Object recognition • Barnard, et al. (2003): “Matching words and pictures” • Sivic, et al. (2005): “Discovering objects and their location in images” • Sudderth, et al. (2005): “Learning hierarchical models of scenes, objects, and parts” • Scene classification • Fei-Fei and Perona (2005): “A Bayesian hierarchical model for learning natural scene categories” • Quelhas, et al. (2005): “Modeling scenes with local descriptors and latent aspects” • Lazebnik, et al. (2006): “Beyond bag of features: Spatial pyramid matching for recognizing natural scene categories”

  4. Introduction (cont.) • Retrieve specific objects vs. categories of objects/scenes (a “Camry” logo vs. cars in general) • Employ text retrieval techniques for visual search, with images as both queries and results • Why a text retrieval approach? • Matches are essentially precomputed, so there is no delay at run time • Any object in the video can be retrieved, without modifying the descriptors originally built for the video

  5. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  6. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  7. Pre-Processing (Offline) • For each frame, detect affine covariant regions • Track the regions through the video and reject unstable regions • Build the visual vocabulary • Remove stop-listed visual words • Compute tf-idf weighted document frequency vectors • Build the inverted file-indexing structure

  8. Detection of Affine Covariant Regions • Typically ~1,200 regions per frame (720x576) • Elliptical regions • Each region represented by a 128-dimensional SIFT descriptor • Affine-covariant detection plus SIFT description makes matching robust to affine transformations

  9. Two types of affine covariant regions: • Shape-Adapted (SA): • Proposed by Mikolajczyk et al. • Elliptical shape adaptation about a Harris interest point • Often centered on corner-like features • Maximally Stable (MS): • Proposed by Matas et al. • Intensity watershed image segmentation • High-contrast blobs

  10. Pre-Processing (Offline) • For each frame, detect affine covariant regions • Track the regions through the video and reject unstable regions • Build the visual vocabulary • Remove stop-listed visual words • Compute tf-idf weighted document frequency vectors • Build the inverted file-indexing structure

  11. Tracking regions through the video and rejecting unstable regions • Any region that does not survive for 3+ frames is rejected • Such short-lived regions are unlikely to be re-detectable or interesting • This reduces the number of regions per frame by about 50% (to ~600/frame)
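The stability filter above can be sketched in a few lines. This is a minimal illustration with a made-up track representation (a dict of per-track detections), not the paper's tracker:

```python
def filter_stable_regions(tracks, min_length=3):
    """Keep only regions whose tracks survive at least `min_length` frames.
    tracks: dict track_id -> list of (frame_index, region) detections."""
    return {tid: t for tid, t in tracks.items() if len(t) >= min_length}

tracks = {
    0: [(0, "rA"), (1, "rB"), (2, "rC")],  # survives 3 frames -> kept
    1: [(0, "rD")],                        # one-frame detection -> rejected
    2: [(4, "rE"), (5, "rF")],             # two frames -> rejected
}
stable = filter_stable_regions(tracks)
print(sorted(stable))  # [0]
```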

  12. Pre-Processing (Offline) • For each frame, detect affine covariant regions • Track the regions through the video and reject unstable regions • Build the visual vocabulary • Remove stop-listed visual words • Compute tf-idf weighted document frequency vectors • Build the inverted file-indexing structure

  13. Visual Indexing Using Text-Retrieval Methods

  14. Visual Vocabulary • Purpose: cluster regions from multiple frames into a smaller set of groups called “visual words” • Each descriptor: a 128-vector • K-means clustering • ~300K descriptors (≈600 regions/frame × ~500 keyframes) mapped into 16K visual words (6,000 SA and 10,000 MS clusters)

  15. K-Means Clustering Purpose: Cluster N data points (SIFT descriptors) into K clusters (visual words) K = desired number of cluster centers (mean points) Step 1: Randomly guess K mean points

  16. Step 2: Assign each data point to the cluster with the nearest mean point. In this paper, the Mahalanobis distance d(x1, x2) = sqrt( (x1 − x2)ᵀ Σ⁻¹ (x1 − x2) ) is used to determine the nearest cluster center, where Σ is the covariance matrix of all descriptors, x2 is the 128-dimensional mean (cluster-center) vector, and the x1 are the descriptor vectors (i.e., the data points).

  17. Step 3: Recompute each cluster center as the mean of its assigned points, then reassign points; repeat until the assignments no longer change.
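The three steps above can be sketched as follows. This is a toy illustration on 2-D data, not the paper's implementation; the function name, the deterministic initialization, and the small regularizer on the covariance are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_mahalanobis(X, k, init=None, n_iter=20):
    """Toy K-means over descriptor vectors using the Mahalanobis distance.
    X: (N, D) matrix of descriptors; k: number of visual words."""
    # Covariance of all descriptors; its inverse defines the metric.
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    cov_inv = np.linalg.inv(cov)

    # Step 1: guess k initial mean points (random unless given).
    if init is None:
        init = rng.choice(len(X), k, replace=False)
    centers = X[np.asarray(init)]

    for _ in range(n_iter):
        # Step 2: assign each point to the nearest center under
        # d^2(x, c) = (x - c)^T cov_inv (x - c).
        diff = X[:, None, :] - centers[None, :, :]            # (N, k, D)
        d2 = np.einsum('nkd,de,nke->nk', diff, cov_inv, diff)
        labels = d2.argmin(axis=1)

        # Step 3: recompute centers; stop when they no longer move.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated 2-D blobs standing in for 128-D SIFT descriptors.
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers, labels = kmeans_mahalanobis(X, k=2, init=[0, 50])
print(np.bincount(labels))  # each blob becomes one cluster of 50 points
```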

  18. Examples of Clusters of Regions Samples of normalized affine covariant regions

  19. Pre-Processing (Offline) • For each frame, detect affine covariant regions • Track the regions through the video and reject unstable regions • Build the visual vocabulary • Remove stop-listed visual words • Compute tf-idf weighted document frequency vectors • Build the inverted file-indexing structure

  20. Remove Stop-Listed Words • Analogy to text retrieval: ‘a’, ‘and’, ‘the’, … are not distinctive words • Very common visual words cause mismatches • The 5-10% most common visual words are stopped (800-1,600 of the 16,000 words) • (Upper row) Matches before stop-listing (Lower row) Matches after stop-listing
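Building the stop list amounts to dropping the top few percent of words by frequency. A minimal sketch (the function name and toy data are our own):

```python
from collections import Counter

def build_stop_list(word_occurrences, fraction=0.05):
    """Return the most frequent `fraction` of visual words: the stop list.
    word_occurrences: flat list of visual-word ids over all frames."""
    counts = Counter(word_occurrences)
    n_stop = max(1, int(len(counts) * fraction))
    return {w for w, _ in counts.most_common(n_stop)}

# Toy vocabulary of 20 words; word 7 is ubiquitous (like 'the').
occurrences = [7] * 100 + list(range(20)) * 3
stop = build_stop_list(occurrences, fraction=0.05)
filtered = [w for w in occurrences if w not in stop]
print(stop)  # {7}
```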

  21. Pre-Processing (Offline) • For each frame, detect affine covariant regions • Track the regions through the video and reject unstable regions • Build the visual vocabulary • Remove stop-listed visual words • Compute tf-idf weighted document frequency vectors • Build the inverted file-indexing structure

  22. tf-idf Weighting (term frequency-inverse document frequency weighting) t_i = (n_id / n_d) · log(N / N_i) where n_id = number of occurrences of (visual) word i in document (frame) d; n_d = total number of words in document d; N_i = number of documents containing word i; N = number of documents in the database; t_i = the weighted word frequency

  23. Each document (frame) d is represented by the vector v_d = (t_1, …, t_v)ᵀ, where v is the number of visual words in the vocabulary and v_d is the tf-idf vector of the particular frame d
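A minimal sketch of this vector construction, assuming the standard tf-idf form t_i = (n_id / n_d) · log(N / N_i); frames are just lists of visual-word ids, and the function name is our own:

```python
import math
from collections import Counter

def tfidf_vector(frame_words, all_frames, vocab_size):
    """Compute v_d = (t_1, ..., t_v) for one frame, with
    t_i = (n_id / n_d) * log(N / N_i)."""
    N = len(all_frames)                      # documents in the database
    n_d = len(frame_words)                   # total words in this document
    counts = Counter(frame_words)            # n_id for each word i present
    v = [0.0] * vocab_size
    for i, n_id in counts.items():
        N_i = sum(1 for f in all_frames if i in f)   # docs containing word i
        v[i] = (n_id / n_d) * math.log(N / N_i)
    return v

# Three toy frames over a 4-word vocabulary.
frames = [[0, 0, 1], [1, 2], [2, 2, 3]]
v0 = tfidf_vector(frames[0], frames, vocab_size=4)
# Word 0 occurs only in frame 0: t_0 = (2/3) * log(3/1).
print(round(v0[0], 4))  # 0.7324
```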

  24. Inverted File Indexing
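An inverted file maps each visual word to the frames containing it, so a query touches only frames that share at least one word with it. A minimal sketch (representation and names are our own, not the paper's data structure):

```python
from collections import defaultdict

def build_inverted_index(frames):
    """Map each visual word to the set of frame ids that contain it."""
    index = defaultdict(set)
    for frame_id, words in enumerate(frames):
        for w in words:
            index[w].add(frame_id)
    return index

frames = [[3, 5, 7], [5, 9], [7, 9, 11]]
index = build_inverted_index(frames)

# Candidate frames for a query containing words {5, 7}: the union of the
# two posting lists, without scanning frames that share no words.
candidates = index[5] | index[7]
print(sorted(candidates))  # [0, 1, 2]
```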

  25. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  26. Real-Time Query Determine the set of visual words found within the query region Retrieve keyframes based on visual word frequencies (Ns = 500) Re-rank retrieved keyframes using spatial consistency

  27. Retrieve keyframes based on visual word frequencies • vq, the tf-idf vector of visual word frequencies for the query region, is computed • Frames are then ranked by the normalized scalar product of vq with each frame vector vd: sim(vq, vd) = (vq · vd) / (‖vq‖ ‖vd‖)
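The ranking step can be sketched directly from that formula. A minimal illustration with plain Python lists as tf-idf vectors (function names and toy vectors are our own):

```python
import math

def normalized_scalar_product(vq, vd):
    """Cosine similarity between the query tf-idf vector and a frame's."""
    dot = sum(a * b for a, b in zip(vq, vd))
    nq = math.sqrt(sum(a * a for a in vq))
    nd = math.sqrt(sum(b * b for b in vd))
    return dot / (nq * nd) if nq and nd else 0.0

def rank_frames(vq, frame_vectors, n_s=500):
    """Return the top-n_s frame ids sorted by similarity to the query."""
    scores = [(normalized_scalar_product(vq, vd), fid)
              for fid, vd in enumerate(frame_vectors)]
    scores.sort(reverse=True)
    return [fid for _, fid in scores[:n_s]]

vq = [1.0, 0.5, 0.0]
frames = [[1.0, 0.5, 0.0],   # identical to the query
          [0.0, 0.0, 1.0],   # no shared words
          [0.5, 0.5, 0.5]]   # partial overlap
print(rank_frames(vq, frames))  # [0, 2, 1]
```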

  28. Spatial Consistency Voting • Analogy: Google text document retrieval, where documents with the searched words close together rank higher • Matched covariant regions in the retrieved frames should have a spatial arrangement similar to the query region • Search area: the 15 nearest spatial neighbors of each match • Each neighboring region that also matches in the retrieved frame casts a vote for the frame

  29. Spatial Consistency Voting • Matched pair of words (A, B) • Each region in the defined search area of both frames casts a vote for the match (A, B) • (Upper row) Matches after stop-listing (Lower row) Remaining matches after spatial consistency voting
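The voting scheme can be sketched as follows. This is a simplified toy version (the k-nearest-neighbour search area, region-id representation, and names are our own; the paper works on real detected regions):

```python
def spatial_consistency_votes(matches, query_pos, frame_pos, k=15):
    """Score each putative match by how many of the k nearest neighbours
    of the query region are themselves matched inside the corresponding
    search area of the retrieved frame.
    matches: (query_region, frame_region) pairs; *_pos: region -> (x, y)."""
    def knn(region, positions, k):
        others = [r for r in positions if r != region]
        others.sort(key=lambda r: (positions[r][0] - positions[region][0]) ** 2
                                + (positions[r][1] - positions[region][1]) ** 2)
        return set(others[:k])

    matched_to = dict(matches)
    votes = {}
    for q, f in matches:
        q_neigh = knn(q, query_pos, k)
        f_neigh = knn(f, frame_pos, k)
        votes[(q, f)] = sum(1 for qn in q_neigh
                            if qn in matched_to and matched_to[qn] in f_neigh)
    return votes

# Matches (0,10) and (1,11) keep their spatial arrangement; (2,12) lands
# far away in the retrieved frame, so its neighbours do not support it.
query_pos = {0: (0, 0), 1: (1, 0), 2: (2, 0)}
frame_pos = {10: (5, 5), 11: (6, 5), 12: (50, 50), 13: (51, 50), 14: (50, 51)}
matches = [(0, 10), (1, 11), (2, 12)]
votes = spatial_consistency_votes(matches, query_pos, frame_pos, k=2)
print(votes)  # the consistent matches collect votes; (2, 12) gets none
```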

  30. Query Frame / Sample Retrieved Frame • 1: Query region • 2: Close-up of 1 • 3-4: Initial matches • 5-6: Matches after stop-listing • 7-8: Matches after spatial consistency voting

  31. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  32. Implementation Details • Offline processing: 100-150K frames per typical feature-length film, refined to 4,000-6,000 keyframes • Descriptors are computed for the stable regions in each frame • Each region is assigned to a visual word • The visual words over all keyframes are assembled into an inverted file structure

  33. Algorithm Implementation Real-Time Process: User selects query region Visual words are identified within query region A short list of Ns = 500 keyframes retrieved based on tf-idf vector similarity Similarity is recomputed considering spatial consistency voting

  34. Example Visual Search

  35. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  36. Retrieval Examples Query Image A Few Retrieved Matches

  37. Retrieval Examples (cont.) Query Image A Few Retrieved Matches

  38. Performance of the Algorithm Six object queries were tried: (1) Red Clock (2) Black Clock (3) “Frame’s” Sign (4) Digital Clock (5) “Phil” Sign (6) Microphone

  39. Performance of the Algorithm (cont.) • Evaluated at the level of shots rather than keyframes • Measured using precision-recall plots • Precision: the fraction of retrieved shots that are relevant (a measure of fidelity or exactness) • Recall: the fraction of relevant shots that are retrieved (a measure of completeness)

  40. Performance of the Algorithm (cont.) Ideally, precision = 1 for all recall values. Performance is summarized by the Average Precision (AP), the area under the precision-recall curve; ideally AP = 1.
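Average precision can be computed directly from a ranked list of shots. A minimal sketch using the common rank-based formulation (the helper name and toy data are our own):

```python
def average_precision(ranked, relevant):
    """AP = mean of the precision values at the ranks where relevant
    shots appear (the area under the precision-recall curve).
    ranked: shot ids in retrieval order; relevant: set of correct ids."""
    hits, precisions = 0, []
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    return sum(precisions) / len(relevant) if relevant else 0.0

# Perfect retrieval: every relevant shot ranked first -> AP = 1.
print(average_precision([1, 2, 3, 4], relevant={1, 2}))  # 1.0
# Relevant shots at ranks 1 and 4 -> AP = (1/1 + 2/4) / 2 = 0.75.
print(average_precision([1, 9, 8, 2], relevant={1, 2}))  # 0.75
```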

  41. Examples of Missed Shots Extreme viewing angles Original query object Low-ranked shot

  42. Examples of Missed Shots (cont.) Significant changes in scale and motion blurring Original query object Low-ranked shot

  43. Qualitative Assessment of Performance • General trends • Higher precision at low recall levels • Bias towards lightly textured regions detectable by the SA/MS detectors • Could address these challenges by adding more types of covariant regions • Other difficulties • Textureless regions (e.g., a mug) • Thin or wiry objects (e.g., a bike) • Highly deformable objects (e.g., clothing)

  44. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  45. Quality of Individual Visual Words • Using a single visual word as the query • Tests the expressiveness of the visual vocabulary • Sample query • Given an object of interest, select one of the visual words from that object • Retrieve all frames that contain the visual word (no ranking) • A retrieval is considered correct if the frame contains the object of interest

  46. Examples of Individual Visual Words Top row: Scale-normalized close-ups of elliptical regions overlaid on query image Bottom row: Corresponding normalized regions

  47. Results of Individual Word Searches • Individual words are “noisy” • Intuitively, this is because a single word occurs on multiple objects and does not cover all occurrences of any one object

  48. Quality of Individual Visual Words • Unrealistic: require each word to occur on only one object (high precision); a growing number of objects would then require a growing number of words • Realistic: visual words are shared across objects, with each object represented by a combination of words

  49. Overview of the Talk • Visual Search Algorithm • Offline Pre-Processing • Real-Time Query • A Few Implementation Details • Performance • General Results • Testing Individual Words • Using External Images As Queries • A Few Challenges and Future Directions • Concluding Remarks • Demo of the Algorithm

  50. Searching for Objects From Outside of the Movie • Used external query images from the internet • Manually labeled all occurrences of the external queries in the movies • Results
