1 / 24

Learning from Descriptive Text Tamara L Berg Stony Brook University

Learning from Descriptive Text Tamara L Berg Stony Brook University. Tags: canon, eos , macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr , blurry, favorite, nice. Vision. Language. Humans.

gitano
Download Presentation

Learning from Descriptive Text Tamara L Berg Stony Brook University

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Learning from Descriptive TextTamara L BergStony Brook University

  2. Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice. Vision Language Humans It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long.

  3. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – Gone with the Wind Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How do people describe the world? How does the world work? What should we recognize?

  4. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How do people describe the world? How does the world work? What should we recognize?

  5. What’s in a description? What’s in this image? man baby sling shirt glasses ladder fridge table watermelon chair boxes cups water bottle wall pacifier beard … What do people describe? “A bearded man is holding a child in a sling.” “A bearded man stands while holding a small child in a green sheet.” “A bearded man with a baby in a sling poses.” “Man standing in kitchen with little girl in green sack.” “Man with beard and baby”

  6. What’s in a description? • ✔ women • ✔ bench 1) “two women sitting brunette blonde on bench reading magazine” • ✔ magazine ✖ grass ✖ Predict what people will describe skirt … Given an image e.g. Spain & Perona, 2010 • ✔ clouds ✖ “looking for castles in the clouds out my car window” car 2) ✖ window ? castle Given a caption Predict what’s in the image

  7. T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth Who’s in the picture? President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

  8. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How do people describe the world? How does the world work? What should we recognize?

  9. Vision is hard World knowledge (from descriptive text) can be used to smooth noisy vision predictions! Green sheep

  10. Learning World Knowledge BabyTalk: Understanding and Generating Simple Image Descriptions Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011 Attributes a very shiny car in the car museum in my hometown of upstate NY. green green grass by the lake Relationships very little person in a big rocking chair Our cat Tusik sleeping on the sofa near a hot radiator.

  11. System Flow brown 0.01 striped 0.16 furry .26 wooden .2 feathered .06 ... near(a,b) 1 near(b,a) 1 against(a,b) .11 against(b,a) .04 beside(a,b) .24 beside(b,a) .17 ... a) dog a) dog a) dog This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa. brown 0.32 striped 0.09 furry .04 wooden .2 Feathered .04 ... near(a,c) 1 near(c,a) 1 against(a,c) .3 against(c,a) .05 beside(a,c) .5 beside(c,a) .45 ... b) person b) person b) person near(b,c) 1 near(c,b) 1 against(b,c) .67 against(c,b) .33 beside(b,c) .0 beside(c,b) .19 ... brown 0.94 striped 0.10 furry .06 wooden .8 Feathered .08 ... <<null,person_b>,against,<brown,sofa_c>> <<null,dog_a>,near,<null,person_b>> <<null,dog_a>,beside,<brown,sofa_c>> Input Image Generate natural language description Predict labeling – vision potentials smoothed with text potentials c) sofa c) sofa c) sofa Extract Objects/stuff Predict prepositions Predict attributes

  12. BabyTalk results Objects, Attributes, Prepositions This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road. This is a picture of two dogs. The first dog is near the second furry dog. Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

  13. Visually Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell Visually descriptive language provides: • information about how people construct natural language for imagery. • information about the world, especially the visual world. • guidance for computational visual recognition. How do people describe the world? How does the world work? What should we recognize?

  14. What should we recognize? • Recognition is beginning to work • Open question – what should we recognize? • Maybe objects aren’t (always) the right base level entities

  15. Object Recognition Parts, Poselets, Attributes For example: [Fergus, Perona, Zisserman2003], [Bourdev, Malik2009], … Slide Credit: Ali Farhadi

  16. Automatically Discovering Attributes from Noisy Web Data T.L. Berg, A.C. Berg, J. Shih ECCV 2010 Fully beaded with megawatt crystals, this Christian Louboutin suede pump matches the gleam in your eye. Pump's linear heel plays up the alluring curves of its dipped sides. Round toe frames low-cut vamp. Tonally topstitched collar. 4" straight, covered heel shows off signature red sole. Creamy leather lining with padded insole. "Fifi" is made in Italy. Learn which terms in descriptions are depictable attributes

  17. Given Web Images + Noisy Text Descriptions: 1) Discover visual attribute terms in text descriptions - likely domain dependent 2) Learn appearance models for attributes without labeled data 3) Characterize attributes by: type, localizability

  18. Object Recognition Scenes For example: [Oliva, Torralba 2001], [SUN 2010], … Slide Credit: Ali Farhadi

  19. What are the right quanta of Recognition? Farhadi & Sadeghi Recognition using Visual Phrases , CVPR 2011

  20. Participating in Phrases Profoundly affects the appearance of objects Farhadi & Sadeghi Recognition using Visual Phrases , CVPR 2011

  21. What should we recognize? Maybe descriptive text can inform entity hypotheses! “a sleeping dog in NTHU” “the dog is sleeping” “A dog is sleeping in” “sleeping dog in delhi”

  22. What should we recognize? Maybe descriptive text can inform entity hypotheses! “the cat is in the bag” “cat in a bag” “cat in bag” “cat in the bag”

  23. Conclusion • Use large pools of descriptive text to: • Learn how people describe the visual world • Learn how the world works • Guide future efforts in recognition • Apply this knowledge to multi-modal collections & applications

  24. Acknowledgements • Collaborators: Alex Berg, David Forsyth, JaetyEdwards, Jonathan Shih, GirishKulkarni, VisruthPremraj, SagnikDhar, Vicente Ordonez, Siming Li, Yejin Choi, Kota Yamaguchi, Vicente Ordonez • Funded by NSF Faculty Early Career Development (CAREER) Program: Award #1054133

More Related