Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags
Sung Ju Hwang and Kristen Grauman, University of Texas at Austin
Detecting tagged objects
Images tagged with keywords clearly tell us which objects to search for.
Example tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24
Detecting tagged objects
Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002, Berg et al. 2004, Fergus et al. 2005, Li et al. 2009 [Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, …]
Our Idea
The list of human-provided tags gives useful cues beyond just which objects are present.
Image 1 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster
Image 2 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it
Based on the tags alone, can you guess where and what size the mug will be in each image?
Our Idea
The list of human-provided tags gives useful cues beyond just which objects are present.
Image 1 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster (mug is named later; larger objects are present)
Image 2 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (mug is named first; larger objects are absent)
Our Idea We propose to learn the implicit localization cues provided by tag lists to improve object detection.
Approach overview
Training: learn the object-specific connection between localization parameters and implicit tag features, P(location, scale | tags), from training images with their tag lists (e.g., "Computer, Poster, Desk, Screen, Mug, ...").
Testing: given a novel image and its tags (e.g., "Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it"), localize objects based on both the object detector and the implicit tag features.
Feature: Word presence/absence
The presence or absence of other objects affects the scene layout, so we record bag-of-words frequency: W = [w_1, …, w_N], where w_i = count of the i-th word.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (small objects mentioned)
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster (large objects mentioned)
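To make the word-count feature concrete, here is a minimal sketch in Python; the vocabulary, tag lists, and function name are illustrative placeholders, not the authors' code.

```python
# Bag-of-words tag feature: W = [w_1, ..., w_N], with w_i = count of the i-th
# vocabulary word in the image's tag list. Vocabulary and tags are hypothetical.
def word_count_feature(tags, vocabulary):
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for tag in tags:
        if tag in index:                     # ignore out-of-vocabulary tags
            counts[index[tag]] += 1
    return counts

vocabulary = ["mug", "key", "keyboard", "computer", "desk", "poster", "screen"]
print(word_count_feature(["computer", "poster", "desk", "screen", "mug", "poster"], vocabulary))
# -> [1, 0, 0, 1, 1, 2, 1]
```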
Feature: Rank of tags
People tag the "important" objects earlier, so we record the rank of each tag compared to its typical rank: R = [r_1, …, r_N], where r_i = percentile rank of the i-th word.
Image 1 tags: Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (the mug has a relatively high rank)
Image 2 tags: Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster
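A sketch of how the rank feature might be computed: each tag's position is converted to a percentile within its list, and the comparison against the word's typical rank from training data is reduced here to a simple difference, which is an assumption made for illustration.

```python
# Rank feature: R = [r_1, ..., r_N], with r_i = percentile rank of the i-th
# vocabulary word in this tag list (earlier tags get higher percentiles).
# Comparing against each word's typical training-set rank is sketched as a
# simple difference; the exact normalization is an assumption.
def rank_feature(tags, vocabulary, typical_percentile=None):
    typical_percentile = typical_percentile or {}
    feature = [0.0] * len(vocabulary)
    index = {word: i for i, word in enumerate(vocabulary)}
    for position, tag in enumerate(tags):
        if tag in index:
            percentile = 1.0 - position / len(tags)        # first tag -> highest percentile
            feature[index[tag]] = percentile - typical_percentile.get(tag, 0.5)
    return feature

vocab = ["mug", "computer", "desk"]
print(rank_feature(["mug", "key", "keyboard", "pen"], vocab))   # mug named first -> high value
```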
Feature: Proximity of tags
People tend to move their eyes to nearby objects after the first fixation, so we record the proximity of all tag pairs: P = [p_{1,2}, …, p_{N-1,N}], where p_{i,j} = rank difference between the i-th and j-th words.
Image 1 tags (in order): 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it
Image 2 tags (in order): 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster
Tags named close together in the list may be close to each other in the image.
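A sketch of the pairwise proximity feature; the dictionary-of-pairs encoding and the handling of repeated tags (keeping the first mention) are assumptions made for illustration.

```python
# Proximity feature: for each pair of vocabulary words that both appear in the
# tag list, record the absolute rank difference between their mentions.
def proximity_feature(tags, vocabulary):
    positions = {}
    for rank, tag in enumerate(tags, start=1):
        positions.setdefault(tag, rank)                    # keep first mention only
    feature = {}
    for a in range(len(vocabulary)):
        for b in range(a + 1, len(vocabulary)):
            wa, wb = vocabulary[a], vocabulary[b]
            if wa in positions and wb in positions:
                feature[(wa, wb)] = abs(positions[wa] - positions[wb])
    return feature

print(proximity_feature(["mug", "key", "keyboard", "pen"], ["mug", "keyboard", "pen"]))
# -> {('mug', 'keyboard'): 2, ('mug', 'pen'): 3, ('keyboard', 'pen'): 1}
```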
Approach overview
Training: learn P(location, scale | W, R, P) from training images and their implicit tag features.
Testing: given a novel image and its tags, combine the object detector with the implicit tag features.
Modeling P(X|T)
We need a PDF for the location and scale of the target object given the tag feature: P(X = (scale, x, y) | T = tag feature).
We model it directly using a mixture density network (MDN) [Bishop, 1994]: a neural network takes the input tag feature (Words, Rank, or Proximity) and outputs the parameters (α, µ, Σ) of a Gaussian mixture over X.
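Below is a minimal mixture density network sketch, assuming PyTorch; the hidden size, number of mixture components, and diagonal-covariance parameterization are illustrative choices, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Maps a tag feature t to mixture parameters (alpha, mu, sigma) over X = (scale, x, y)."""
    def __init__(self, tag_dim, n_components=4, out_dim=3):
        super().__init__()
        self.n, self.d = n_components, out_dim
        self.hidden = nn.Sequential(nn.Linear(tag_dim, 64), nn.Tanh())
        self.alpha = nn.Linear(64, n_components)                  # mixture weights
        self.mu = nn.Linear(64, n_components * out_dim)           # component means
        self.log_sigma = nn.Linear(64, n_components * out_dim)    # diagonal std devs (log)

    def forward(self, t):
        h = self.hidden(t)
        alpha = torch.softmax(self.alpha(h), dim=-1)
        mu = self.mu(h).view(-1, self.n, self.d)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.n, self.d)
        return alpha, mu, sigma

def mdn_nll(alpha, mu, sigma, x):
    """Negative log-likelihood of the ground-truth localization parameters x."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(x.unsqueeze(1)).sum(-1)              # per-component log density
    return -torch.logsumexp(torch.log(alpha) + log_prob, dim=-1).mean()
```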
Modeling P(X|T)
Example: the top 30 most likely localization parameters sampled for the object "car", given only the tags (e.g., "Lamp, Car, Wheel, Wheel, Light, Window, House, House"; "Car, Car, Road, House, Lightpole, Boulder"; "Car, Car, Windows, Building, Man, Barrel"; "Car, Truck, Car").
Integrating with the object detector
How do we exploit the learned distribution P(X|T)?
• Use it to speed up the detection process (location priming):
(a) Sort all candidate windows according to P(X|T), from most likely to least likely.
(b) Run the detector only at the most probable locations and scales.
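A sketch of location priming under these assumptions: `prior_density` stands in for the learned P(X|T) and `run_detector` for the sliding-window detector; the 30% budget echoes the PASCAL speed-up reported later.

```python
# Location priming: rank candidate windows by the tag-based prior and run the
# (expensive) appearance detector only on the most probable fraction.
def primed_detection(windows, prior_density, run_detector, budget=0.3):
    ranked = sorted(windows, key=prior_density, reverse=True)     # most likely first
    keep = ranked[: max(1, int(budget * len(ranked)))]            # keep top `budget` fraction
    return [(window, run_detector(window)) for window in keep]
```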
Integrating with the object detector
How do we exploit the learned distribution P(X|T)?
• Use it to speed up the detection process (location priming).
• Use it to increase detection accuracy by modulating the detector's output scores with the predictions based on tag features (e.g., detector scores of 0.9, 0.3, 0.2 combined with tag-based predictions of 0.7, 0.8, 0.9 give modulated scores of 0.63, 0.24, 0.18).
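A sketch of the score-modulation step; a simple product reproduces the example numbers on the slide, though this is an assumption and the paper's exact combination rule may differ.

```python
# Modulate detector confidences with the tag-based predictions for the same windows.
def modulate(detector_scores, prior_scores):
    return [round(d * p, 2) for d, p in zip(detector_scores, prior_scores)]

print(modulate([0.9, 0.3, 0.2], [0.7, 0.8, 0.9]))   # -> [0.63, 0.24, 0.18]
```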
Experiments: Datasets
LabelMe: street and office scenes; ordered tag lists obtained from the order in which labels were added; 5 classes; 56 unique taggers; 23 tags per image; Dalal & Triggs' HOG detector.
PASCAL VOC 2007: Flickr images; tag lists obtained on Mechanical Turk; 20 classes; 758 unique taggers; 5.5 tags per image; Felzenszwalb et al.'s LSVM detector.
Experiments
We evaluate: • detection speed • detection accuracy.
We compare: • the raw detector (HOG, LSVM) • the raw detector + our tag features.
We also show results when using Gist [Torralba 2003] as context, for reference.
PASCAL: Performance evaluation
A naïve sliding window search scans 70% of the candidate windows; we search only 30%. We search fewer windows to achieve the same detection rate, and we know which detection hypotheses to trust most.
PASCAL: Example detections
[Figure: LSVM alone vs. LSVM + Tags (Ours) on images tagged e.g. "Bottle, Person, Table, Chair, Mirror, Tablecloth, Bowl, Bottle, Shelf, Painting, Food, Lamp", "Person, Bottle, Dog, Sofa, Painting, Table", "Car, Car, Door, Door, Gear, Steering Wheel, Seat, Seat, Person, Person, Camera", and "Car, License Plate, Building".]
PASCAL: Example detections
[Figure: LSVM alone vs. LSVM + Tags (Ours) on images tagged e.g. "Dog, Dog, Floor, Hairclip", "Dog, Dog, Dog, Person, Person, Ground, Bench, Scarf", "Person, Horse, Person, Tree, House, Building, Ground, Hurdle, Fence", and "Person, Microphone, Light".]
PASCAL: Example failure cases
[Figure: LSVM + Tags (Ours) vs. LSVM alone on images tagged e.g. "Bottle, Glass, Wine, Table", "Aeroplane, Sky, Building, Shadow", "Person, Person, Pole, Building, Sidewalk, Grass, Road", and "Dog, Clothes, Rope, Rope, Plant, Ground, Shadow, String, Wall".]
Results: Observations
• Often our implicit features predict scale well for indoor objects and position well for outdoor objects.
• Gist is usually better for y position, while our tags are generally stronger for scale; visual and tag context are complementary.
• We need to have learned about the target objects from a variety of examples with different contexts.
Summary • We want to learn what is implied (beyond objects present) by how a human provides tags for an image. • Approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection. • Novel tag cues enable effective localization prior. • Significant gains with state-of-the-art detectors and two datasets.
Future work • Joint multi-object detection • From tags to natural language sentences • Image retrieval applications