
Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags

Sung Ju Hwang and Kristen Grauman, University of Texas at Austin. CVPR 2010.


Presentation Transcript


  1. Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin. CVPR 2010.

  2. Detecting tagged objects. Images tagged with keywords clearly tell us which objects to search for. Example tag lists: “Dog, Black lab, Jasper” and “Sofa, Self, Living room, Fedora, Explore #24”.

  3. Detecting tagged objects. Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002; Berg et al. 2004; Fergus et al. 2005; Li et al. 2009 [also Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, …].

  4. Our Idea. The list of human-provided tags gives useful cues beyond just which objects are present. Compare the tag lists “Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster” and “Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it”. Based on tags alone, can you guess where and what size the mug will be in each image?

  5. Our Idea. The list of human-provided tags gives useful cues beyond just which objects are present. In the first list (“Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster”), the mug is named later and larger objects are present; in the second (“Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it”), the mug is named first and larger objects are absent.

  6. Our Idea. We propose to learn the implicit localization cues provided by tag lists to improve object detection.

  7. Approach overview. Training: learn an object-specific connection between localization parameters and implicit tag features, P(location, scale | tags). Each training image’s tag list (e.g., “Computer, Poster, Desk, Screen, Mug, Poster”; “Desk, Mug, Office”; “Mug, Eiffel”; “Mug, Coffee, Woman, Table”; “Mug, Ladder”) yields implicit tag features. Testing: given a novel image (e.g., tagged “Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it”), localize objects based on both tags (implicit tag features) and appearance (object detector).

  8. Approach overview. (Same diagram as slide 7.)

  9. Feature: Word presence/absence. The presence or absence of other objects affects the scene layout, so we record bag-of-words frequency: W = [w_1, …, w_N], where w_i = count of the i-th word. Example tag lists: “Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it” and “Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster”.

  10. Feature: Word presence/absence (build of slide 9). In the first list, small objects are mentioned; in the second, large objects are mentioned.
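To make the feature concrete, here is a minimal sketch of the bag-of-words tag feature; it is not the authors’ code, and `vocabulary` is a hypothetical tag vocabulary assumed to be built from the training data.

```python
import numpy as np

def bag_of_words(tags, vocabulary):
    """W = [w_1, ..., w_N], where w_i = count of the i-th vocabulary word."""
    index = {word: i for i, word in enumerate(vocabulary)}
    w = np.zeros(len(vocabulary))
    for tag in tags:
        if tag in index:          # tags outside the vocabulary are ignored
            w[index[tag]] += 1
    return w

vocabulary = ["mug", "key", "keyboard", "desk", "screen", "poster"]
print(bag_of_words(["mug", "key", "keyboard"], vocabulary))
# -> [1. 1. 1. 0. 0. 0.]
```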

  11. Feature: Rank of tags. People tag the “important” objects earlier, so we record the rank of each tag compared to its typical rank: R = [r_1, …, r_N], where r_i = percentile rank of the i-th word. Example tag lists: “Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it” and “Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster”.

  12. Feature: Rank of tags (build of slide 11). In the first list, “Mug” appears first, a relatively high rank compared to its typical rank.
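A minimal sketch of the rank feature follows, under the assumption that the percentile is computed against the ranks a word received in training tag lists; `typical_ranks` is a hypothetical structure, and the paper’s exact percentile definition may differ.

```python
import numpy as np

def rank_feature(tags, vocabulary, typical_ranks):
    """R = [r_1, ..., r_N]: percentile rank of each word vs. its typical rank."""
    r = np.zeros(len(vocabulary))
    for rank, tag in enumerate(tags, start=1):
        if tag in typical_ranks and tag in vocabulary:
            history = typical_ranks[tag]  # ranks this word had in training lists
            # fraction of training ranks at or below which this rank falls
            r[vocabulary.index(tag)] = np.mean([rank <= h for h in history])
    return r

vocabulary = ["mug", "desk", "keyboard"]
typical_ranks = {"mug": [5, 8, 7], "desk": [2, 3, 1], "keyboard": [4, 6, 3]}
print(rank_feature(["mug", "keyboard"], vocabulary, typical_ranks))
# "mug" named 1st although it is typically ranked 5th-8th -> high percentile
```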

  13. Feature: Proximity of tags. People tend to move their eyes to nearby objects after the first fixation, so we record the proximity of all tag pairs: P, where p_ij = rank difference between the i-th and j-th words. Example ranked tag lists: “1) Mug 2) Key 3) Keyboard 4) Toothbrush 5) Pen 6) Photo 7) Post-it” and “1) Computer 2) Poster 3) Desk 4) Bookshelf 5) Screen 6) Keyboard 7) Screen 8) Mug 9) Poster”.

  14. Feature: Proximity of tags (build of slide 13). Tags with a small rank difference (named close together in the list) may be close to each other in the image.
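Below is a minimal sketch of one plausible encoding of the proximity feature, recording rank differences for co-occurring vocabulary pairs; the paper’s exact form and normalization may differ.

```python
import numpy as np

def proximity_feature(tags, vocabulary):
    """Rank differences |rank_i - rank_j| for all pairs of vocabulary words."""
    ranks = {}
    for k, tag in enumerate(tags, start=1):
        ranks.setdefault(tag, k)            # rank of the first occurrence
    n = len(vocabulary)
    p = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            wi, wj = vocabulary[i], vocabulary[j]
            if wi in ranks and wj in ranks:  # pair must co-occur in this list
                p[i, j] = abs(ranks[wi] - ranks[wj])
    return p[np.triu_indices(n, k=1)]        # flatten the upper triangle

vocabulary = ["mug", "keyboard", "screen"]
print(proximity_feature(["mug", "key", "keyboard", "screen"], vocabulary))
# mug-keyboard: |1-3| = 2, mug-screen: |1-4| = 3, keyboard-screen: |3-4| = 1
```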

  15. Approach overview. Training: learn P(location, scale | W, R, P) from training tag lists and their implicit tag features. Testing: given a novel image, combine the implicit tag features with the object detector.

  16. Modeling P(X|T). We need a PDF over the location and scale of the target object, given the tag feature: P(X = (scale, x, y) | T = tag feature). We model it directly with a mixture density network (MDN) [Bishop, 1994]: a neural network takes the input tag feature (Words, Rank, or Proximity) and outputs the mixture-model parameters (α_k, µ_k, Σ_k).
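The following is a minimal numerical sketch of the MDN idea only, using a one-layer placeholder network and a diagonal-covariance Gaussian mixture; the authors’ architecture, training procedure, and dimensions are not reproduced here.

```python
import numpy as np

K, D_X = 3, 3                        # mixture components; dims of (scale, x, y)

def mdn_params(t, W1, b1):
    """Map a tag feature t to Gaussian-mixture parameters (alpha, mu, sigma)."""
    out = W1 @ t + b1                                    # raw network outputs
    alpha = np.exp(out[:K]) / np.exp(out[:K]).sum()      # softmax mixing weights
    mu = out[K:K + K * D_X].reshape(K, D_X)              # component means
    sigma = np.exp(out[K + K * D_X:]).reshape(K, D_X)    # positive std. devs
    return alpha, mu, sigma

def log_p_x_given_t(x, alpha, mu, sigma):
    """log P(X = x | T) under the diagonal Gaussian mixture."""
    log_comp = -0.5 * (((x - mu) / sigma) ** 2
                       + np.log(2 * np.pi * sigma ** 2)).sum(axis=1)
    return np.log(np.sum(alpha * np.exp(log_comp)))

rng = np.random.default_rng(0)
D_T = 6                              # toy tag-feature dimension
n_out = K + 2 * K * D_X              # alphas + means + std devs
W1, b1 = rng.normal(size=(n_out, D_T)), rng.normal(size=n_out)
alpha, mu, sigma = mdn_params(rng.random(D_T), W1, b1)
print(log_p_x_given_t(np.array([0.3, 0.5, 0.5]), alpha, mu, sigma))
```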

  17. Modeling P(X|T). Example: the top 30 most likely localization parameters sampled for the object “car”, given only the tags. The example images carry tag lists mixing “Car” with words such as Lamp, Wheel, Light, Window, House, Road, Lightpole, Boulder, Windows, Building, Man, Barrel, and Truck.

  18. Modeling P(X|T). (Same example as slide 17.)

  19. Approach overview. (Same recap as slide 15.)

  20. Integrating with object detector • How to exploit this learned distribution P(X|T)? • Use it to speed up the detection process (location priming).

  21. Integrating with object detector • How to exploit this learned distribution P(X|T)? • Use it to speed up the detection process (location priming): (a) sort all candidate windows according to P(X|T), from most likely to least likely; (b) run the detector only at the most probable locations and scales.
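A minimal sketch of location priming under assumed interfaces: `prior` stands in for the learned P(X|T) evaluated at a window, and `detector_score` for the raw sliding-window detector; both are hypothetical placeholders, not the paper’s code.

```python
import numpy as np

def primed_detection(windows, prior, detector_score, fraction=0.3):
    """Score only the `fraction` of candidate windows most probable under P(X|T).

    windows: list of (scale, x, y) hypotheses.
    Returns (window, detector score) pairs for the windows actually scanned.
    """
    order = np.argsort([-prior(w) for w in windows])  # most probable first
    n_scan = max(1, int(len(windows) * fraction))     # e.g. scan only 30%
    return [(windows[i], detector_score(windows[i])) for i in order[:n_scan]]
```

Because the expensive appearance model runs on only a fraction of the windows, the speedup is roughly 1/fraction, at the cost of missing objects the prior ranks low.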

  22. Integrating with object detector • How to exploit this learned distribution P(X|T)? • Use it to speed up the detection process (location priming). • Use it to increase detection accuracy (modulate the detector output scores). In the figure, predictions based on tag features (e.g., 0.9, 0.3, 0.2) are paired with predictions from the object detector (e.g., 0.7, 0.8, 0.9).

  23. Integrating with object detector • Use it to increase detection accuracy: modulating the detector output scores with the tag-based predictions yields combined scores of 0.63, 0.24, and 0.18 in the example, so hypotheses supported by both tags and appearance rise to the top.
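The slide’s numbers are consistent with a simple product combination, sketched below; this is an illustration, not necessarily the paper’s exact combination rule.

```python
# Tag-based predictions P(X|T) and raw detector confidences at the same windows.
tag_scores = [0.9, 0.3, 0.2]
det_scores = [0.7, 0.8, 0.9]

# Modulate: rescale each detector score by the tag-based prediction.
combined = [t * d for t, d in zip(tag_scores, det_scores)]
print([round(c, 2) for c in combined])   # -> [0.63, 0.24, 0.18]
```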

  24. Experiments: Datasets.
  LabelMe: • street and office scenes • ordered tag lists obtained from the order in which labels were added • 5 classes • 56 unique taggers • 23 tags / image • Dalal & Triggs’ HOG detector.
  PASCAL VOC 2007: • Flickr images • tag lists obtained on Mechanical Turk • 20 classes • 758 unique taggers • 5.5 tags / image • Felzenszwalb et al.’s LSVM detector.

  25. Experiments. We evaluate: • detection speed • detection accuracy. We compare: • raw detector (HOG, LSVM) • raw detector + our tag features. We also show results when using Gist [Torralba 2003] as context, for reference.

  26. PASCAL: Performance evaluation. Where a naïve sliding window scan must search 70% of the candidate windows, we search only 30%: we search fewer windows to achieve the same detection rate, and we know which detection hypotheses to trust most.

  27. PASCAL: Accuracy vs Gist per class.

  28. PASCAL: Accuracy vs Gist per class (continued).

  29. PASCAL: Example detections, LSVM alone vs. LSVM+Tags (ours). The example images are tagged with words such as Bottle, Person, Table, Chair, Mirror, Tablecloth, Bowl, Shelf, Painting, Food, Lamp, Dog, Sofa, Car, Door, Gear, Steering Wheel, Seat, Camera, License Plate, and Building.

  30. PASCAL: Example detections, LSVM alone vs. LSVM+Tags (ours). The example images are tagged with words such as Dog, Floor, Hairclip, Person, Ground, Bench, Scarf, Horse, Tree, House, Building, Hurdle, Fence, Microphone, and Light.

  31. PASCAL: Example failure cases, LSVM alone vs. LSVM+Tags (ours). The example images are tagged with words such as Bottle, Glass, Wine, Table, Aeroplane, Sky, Building, Shadow, Person, Pole, Sidewalk, Grass, Road, Dog, Clothes, Rope, Plant, Ground, String, and Wall.

  32. Results: Observations. • Often our implicit features predict scale well for indoor objects and position well for outdoor objects. • Gist is usually better for y position, while our tags are generally stronger for scale, i.e., visual and tag context are complementary. • We need to have learned about target objects in a variety of examples with different contexts.

  33. Summary. • We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image. • Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection. • The novel tag cues enable an effective localization prior. • We show significant gains with state-of-the-art detectors on two datasets.

  34. Future work. • Joint multi-object detection. • From tags to natural language sentences. • Image retrieval applications.
