Capturing Human Insight for Visual Learning
Kristen Grauman, Department of Computer Science, University of Texas at Austin
Frontiers in Computer Vision Workshop, MIT, August 22, 2011
Work with Sudheendra Vijayanarasimhan, Adriana Kovashka, Devi Parikh, Prateek Jain, Sung Ju Hwang, and Jeff Donahue
Problem: how to capture human insight about the visual world?
• Point+label “mold” restrictive
• Human effort expensive
[Figure: an annotator facing the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]
Problem: how to capture human insight about the visual world?
• Our approach:
  – Ask: actively learn
  – Listen: explanations, comparisons, implied cues, …
[Figure: an annotator facing the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]
Deepening human communication to the system
Example questions exchanged between system and annotator: What is this? Is it ‘furry’? What’s worth mentioning? How do you know? What property is changing here? Do you find him attractive? Why? Which is more ‘open’?
[Donahue & Grauman ICCV 2011; Hwang & Grauman BMVC 2010; Parikh & Grauman ICCV 2011, CVPR 2011; Kovashka et al. ICCV 2011]
Soliciting rationales
• We propose to ask the annotator not just what, but also why.
Example tasks: Is her form perfect? Is the team winning? Is it a safe route? In each case: How can you tell?
Soliciting rationales
Annotation task: Is her form perfect? How can you tell?
[Figure: a spatial rationale marks the image region that supports the label; an attribute rationale names the deciding attributes (e.g., “balanced” and “pointed toes” for good form vs. “falling” and “knee angled” for bad form). Each rationale is turned into a synthetic contrast example that influences the classifier.]
[Zaidan et al. HLT 2007] [Donahue & Grauman, ICCV 2011]
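A rough sketch of how an attribute rationale could be folded into training, following the contrast-example idea of Zaidan et al. (HLT 2007): the attribute dimensions the annotator cites are ablated to form a contrast example, and the scaled difference vector is added as an extra, down-weighted training example, so the learner must score the original above its rationale-ablated contrast. The constants (MU, C_CONTRAST) and helper names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

MU = 0.1          # margin required between an example and its contrast example (assumed)
C_CONTRAST = 0.5  # weight on rationale-derived pseudo-examples (assumed)

def make_contrast_example(x, rationale_dims):
    """Zero out the attribute dimensions the annotator cited as evidence."""
    v = x.copy()
    v[rationale_dims] = 0.0
    return v

def augment_with_rationales(X, y, rationales):
    """For each (example, rationale) pair, add the scaled difference (x - v)/MU
    as an extra training example with the same label and a smaller weight."""
    X_aug, y_aug, w_aug = list(X), list(y), [1.0] * len(y)
    for i, dims in rationales.items():
        v = make_contrast_example(X[i], dims)
        X_aug.append((X[i] - v) / MU)
        y_aug.append(y[i])
        w_aug.append(C_CONTRAST)
    return np.array(X_aug), np.array(y_aug), np.array(w_aug)

# toy usage: 5-dim attribute vectors; the rationale for example 0 cites dims {1, 3}
rng = np.random.default_rng(0)
X = rng.random((20, 5))
y = np.where(X[:, 1] + X[:, 3] > 1.0, 1, -1)
X_aug, y_aug, w_aug = augment_with_rationales(X, y, {0: [1, 3]})
clf = LinearSVC().fit(X_aug, y_aug, sample_weight=w_aug)
```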
Rationale results
Rationales collected from hundreds of MTurk workers on three tasks:
• Scene Categories: How can you tell the scene category?
• Hot or Not: What makes them hot (or not)?
• Public Figures: What attributes make them (un)attractive?
[Donahue & Grauman, ICCV 2011]
Rationale results
[Figure: mean average precision (AP) comparison. Donahue & Grauman, ICCV 2011]
Learning what to mention
• Issue: presence of objects != significance
• Our idea: learn a cross-modal representation that accounts for “what to mention”
• Visual cues: texture, scene, color, …
• Textual cues: frequency, relative order, mutual proximity
• Training: human-given descriptions
[Figure: example training images with human-given tags such as cow, birds, architecture, water, sky, tiles]
Learning what to mention
[Figure: the visual view x and textual view y are projected into a shared importance-aware semantic space. Hwang & Grauman, BMVC 2010]
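A minimal stand-in for the importance-aware semantic space, assuming a plain linear CCA between visual features and order-weighted tag features; the actual BMVC 2010 system uses kernel CCA and richer textual cues (frequency, relative order, mutual proximity), so treat the feature encodings and names below as illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.neighbors import NearestNeighbors

def tag_importance_features(tag_lists, vocab):
    """Encode each image's tag list, weighting a tag by 1/rank so that
    earlier-mentioned (presumably more important) tags count more."""
    F = np.zeros((len(tag_lists), len(vocab)))
    index = {t: j for j, t in enumerate(vocab)}
    for i, tags in enumerate(tag_lists):
        for rank, t in enumerate(tags, start=1):
            if t in index:
                F[i, index[t]] = 1.0 / rank
    return F

# toy training data: random visual features and random tag orderings
rng = np.random.default_rng(0)
vocab = ["cow", "birds", "architecture", "water", "sky", "tiles"]
X_vis = rng.random((100, 64))                                  # visual view
tag_lists = [list(rng.permutation(vocab)[:4]) for _ in range(100)]
T = tag_importance_features(tag_lists, vocab)                  # textual view

cca = CCA(n_components=3).fit(X_vis, T)                        # learn the shared space
Z_train, _ = cca.transform(X_vis, T)

# at query time only the image is available: project it and retrieve neighbors
# whose *described* content (not merely depicted content) is similar
z_query = cca.transform(rng.random((1, 64)))
nn = NearestNeighbors(n_neighbors=5).fit(Z_train)
dist, idx = nn.kneighbors(z_query)
```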
Learning what to mention: results
[Figure: example retrievals for a query image, comparing visual-only, words+visual, and our method. Hwang & Grauman, BMVC 2010]
Problem: how to capture human insight about the visual world?
• Our approach:
  – Ask: actively learn
  – Listen: explanations, comparisons, implied cues
[Figure: an annotator facing the complex space of visual objects, activities, and scenes; tiny image montage by Torralba et al.]
Traditional active learning
• At each cycle, obtain the label for the most informative or uncertain example.
[Figure: loop between the current model, active selection over the unlabeled data, and the annotator adding to the labeled data]
[Mackay 1992, Freund et al. 1997, Tong & Koller 2001, Lindenbaum et al. 2004, Kapoor et al. 2007, …]
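For concreteness, a minimal sketch of this traditional loop using margin-based uncertainty sampling with a linear SVM; the `oracle` callback stands in for the human annotator, and all names here are illustrative rather than from any of the cited papers.

```python
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle, rounds=10):
    """Each round: retrain, pick the unlabeled example closest to the decision
    boundary (most uncertain), ask the oracle for its label, and add it."""
    X_l, y_l = X_labeled.copy(), list(y_labeled)
    pool = list(range(len(X_pool)))
    clf = LinearSVC()
    for _ in range(rounds):
        clf.fit(X_l, y_l)
        margins = np.abs(clf.decision_function(X_pool[pool]))  # uncertainty = |f(x)|
        pick = pool[int(np.argmin(margins))]
        X_l = np.vstack([X_l, X_pool[pick]])
        y_l.append(oracle(pick))                               # query the annotator
        pool.remove(pick)
    return clf.fit(X_l, y_l)

# toy usage: labels come from a hidden linear rule; the oracle just looks them up
rng = np.random.default_rng(0)
X_pool = rng.standard_normal((500, 16))
labels = np.where(X_pool @ rng.standard_normal(16) > 0, 1, -1)
seed = [int(np.argmax(labels == 1)), int(np.argmax(labels == -1))]  # one of each class
model = active_learning_loop(X_pool[seed], labels[seed], X_pool,
                             oracle=lambda i: labels[i], rounds=10)
```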
Challenges in active visual learning
• Annotation tasks vary in cost and informativeness
• Multiple annotators working in parallel
• Massive unlabeled pools of data
[Figure: the active learning loop, with varying annotation costs ($) attached to the unlabeled data]
[Vijayanarasimhan & Grauman NIPS 2008, CVPR 2009; Vijayanarasimhan et al. CVPR 2010, CVPR 2011; Kovashka et al. ICCV 2011]
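One way to make selection cost-sensitive, in the spirit of (but much simpler than) the value-of-information criterion of Vijayanarasimhan & Grauman (NIPS 2008): score each candidate (example, annotation type) pair by a rough expected benefit divided by its annotation cost. The cost and gain numbers below are assumed purely for illustration.

```python
import numpy as np

# assumed per-annotation costs (e.g. seconds of annotator time) and rough
# expected-information multipliers; illustrative numbers, not from the papers
COST = {"image_label": 1.0, "region_label": 3.0, "full_segmentation": 10.0}
GAIN = {"image_label": 1.0, "region_label": 1.8, "full_segmentation": 3.0}

def entropy(p):
    """Binary entropy, used as a proxy for how much the label would teach us."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select_annotation(prob_pos, candidates):
    """candidates: iterable of (example_index, annotation_type) pairs.
    Return the candidate with the best predicted benefit per unit cost."""
    def score(cand):
        i, ann_type = cand
        return GAIN[ann_type] * entropy(prob_pos[i]) / COST[ann_type]
    return max(candidates, key=score)

# toy usage: an uncertain example queried cheaply beats an expensive segmentation
prob_pos = np.array([0.9, 0.52, 0.1])   # current classifier's positive-class probabilities
candidates = [(1, "image_label"), (1, "full_segmentation"), (0, "region_label")]
best = select_annotation(prob_pos, candidates)   # -> (1, "image_label")
```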
Sub-linear time active selection
We propose a novel hashing approach to identify the most uncertain examples in sub-linear time: for 4.5 million unlabeled instances, 10 minutes of machine time per iteration, vs. 60 hours for a naïve linear scan.
[Figure: the current classifier's hyperplane is hashed into a table built over the unlabeled data, directly retrieving the actively selected examples]
[Jain, Vijayanarasimhan, Grauman, NIPS 2010]
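A toy sketch of the hashing idea, loosely after the two-bit hyperplane hash family in the NIPS 2010 paper: a database point x is hashed with [sign(u·x), sign(v·x)], while a hyperplane query with normal w is hashed with [sign(u·w), -sign(v·w)], so points lying near the decision boundary tend to collide with the query. A real deployment uses several tables and probes neighboring buckets; the class and parameter names below are illustrative.

```python
import numpy as np
from collections import defaultdict

class HyperplaneHash:
    def __init__(self, dim, n_bits=6, seed=0):
        rng = np.random.default_rng(seed)
        # each bit pair uses two independent Gaussian directions u, v
        self.U = rng.standard_normal((n_bits, dim))
        self.V = rng.standard_normal((n_bits, dim))
        self.table = defaultdict(list)

    def _point_key(self, x):
        bits = np.concatenate([self.U @ x >= 0, self.V @ x >= 0])
        return bits.tobytes()

    def _query_key(self, w):
        # flip the second half of the bits for the hyperplane normal
        bits = np.concatenate([self.U @ w >= 0, ~(self.V @ w >= 0)])
        return bits.tobytes()

    def index(self, X):
        """Hash the unlabeled pool once, up front."""
        for i, x in enumerate(X):
            self.table[self._point_key(x)].append(i)

    def query(self, w):
        """Return indices of pool points whose hash collides with the current
        hyperplane, i.e. likely near-boundary (uncertain) examples."""
        return self.table.get(self._query_key(w), [])

# usage: each active-learning round probes the table with the current SVM
# weight vector instead of scanning the whole pool
rng = np.random.default_rng(1)
X_pool = rng.standard_normal((10000, 32))
hasher = HyperplaneHash(dim=32)
hasher.index(X_pool)
candidates = hasher.query(rng.standard_normal(32))
```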
Live active learning: results on a Flickr test set
Outperforms the status quo data collection approach.
[Vijayanarasimhan & Grauman, CVPR 2011]
Summary
• Humans are not simply “label machines”
• Widen access to visual knowledge
  – New forms of input, often requiring associated new learning algorithms
• Manage large-scale annotation efficiently
  – Cost-sensitive active question asking
  – Live learning: moving beyond canned datasets