Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation. Authors: Luo Jie, Barbara Caputo and Vittorio Ferrari. Presenter: Maresh Naresh Singh.
Aim • Given: News items consisting of images with their associated text. • Goal: Figure out who is doing what?
Who is doing what? • Guess possible action of a person in the image. • Use pose as well as verb for this purpose. • Associate actions with the person in the image. • Predict the name of the person.
(b) US Democratic presidential candidate Senator Barack Obama waves to supporters together with his wife Michelle Obama standing beside him at his North Carolina and Indiana primary election night rally in Raleigh.
(a) Four sets ... Roger Federer prepares to hit a backhand in a quarter-final match with Andy Roddick at the US Open.
Correspondence ambiguity problem. • Multiple persons in the image and captions. • Person in the image but not mentioned in the caption. • Mention in the caption but not present in the image.
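Idea • The title says it: “Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation”. • Names are tied to faces and verbs to poses, and both associations are resolved together rather than separately.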
Generative Model • Observed variables: Names and verbs in the caption. Detected persons in the image. • Latent Variables: Image-caption correspondence. • Parameters: Visual appearances of face and pose classes corresponding to different names and verbs. • EM to compute hidden variables.
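A minimal sketch of the correspondence step that EM alternates with re-estimating the class centers. It is not the authors' formulation: the squared Euclidean distance, the fixed `null_cost` for leaving a person unnamed, and the names `best_assignment` and `centers` are all assumptions made for illustration. Each caption label is used at most once, and every detected person may stay unassigned.

```python
from itertools import permutations
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (an assumed appearance distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_assignment(
    persons: Sequence[Vector],
    labels: Sequence[str],
    centers: Dict[str, Vector],
    null_cost: float = 1.0,
) -> Tuple[List[Optional[str]], float]:
    """Pick the cheapest image-caption correspondence for one news item.

    Returns (assignment, cost), where assignment[i] is the caption label
    given to person i, or None if that person is left unnamed.
    """
    # Pad with None so that any person can be mapped to "not in caption".
    slots: List[Optional[str]] = list(labels) + [None] * len(persons)
    best: Tuple[Optional[str], ...] = ()
    best_cost = float("inf")
    # Exhaustive search is fine for the handful of persons per image.
    for perm in set(permutations(slots, len(persons))):
        cost = sum(
            null_cost if lab is None else sq_dist(p, centers[lab])
            for p, lab in zip(persons, perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost


centers = {"A": [0.0, 0.0], "B": [10.0, 10.0]}
assignment, cost = best_assignment(
    [[9.0, 9.0], [0.0, 1.0]], ["A", "B"], centers, null_cost=5.0
)
```

In this toy call the first person is closest to class "B" and the second to class "A", so the cheapest correspondence names both of them.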
Face and pose recognition • Uses face detector and upper body detector. • Face and upper-body are considered to belong to same person if the face lies in the center of upper-body bounding box.
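The pairing rule above can be sketched as follows. The slide does not say how "in the center" is measured, so the tolerance `tol` (a fraction of the upper-body box size), the box layout `(x_min, y_min, x_max, y_max)`, and the greedy matching order are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def face_in_body_center(face: Box, body: Box, tol: float = 0.25) -> bool:
    """True if the face center lies within tol * (box size) of the body center."""
    fx, fy = box_center(face)
    bx, by = box_center(body)
    width = body[2] - body[0]
    height = body[3] - body[1]
    return abs(fx - bx) <= tol * width and abs(fy - by) <= tol * height


def pair_detections(
    faces: List[Box], bodies: List[Box], tol: float = 0.25
) -> List[Optional[int]]:
    """For each face, return the index of its upper body, or None if unmatched."""
    used = set()
    pairs: List[Optional[int]] = []
    for face in faces:
        match = None
        for j, body in enumerate(bodies):
            # Greedy: take the first unused upper body that passes the test.
            if j not in used and face_in_body_center(face, body, tol):
                match = j
                used.add(j)
                break
        pairs.append(match)
    return pairs


faces = [(40.0, 40.0, 60.0, 60.0), (0.0, 0.0, 10.0, 10.0)]
bodies = [(0.0, 0.0, 100.0, 100.0)]
pairs = pair_detections(faces, bodies)
```

Here the first face sits at the center of the upper-body box and is paired with it; the second face lies in a corner and stays unpaired.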
Name-Verb pair. • Language parser extracts name-verb pair from each caption. • Uses OpenNLP.
Probability Function • Uses EM to maximize the above function.
… • Maximizing the previous equation somehow boils down to minimizing the equation:
EM algorithm (Initialization) • Compute distance matrix between faces/poses from images sharing some name/verb in the caption. • For each name/verb pair, select all captions containing only that name/verb. • If the corresponding images contain only one person, their faces/poses are used to initialize the center vectors. • If the corresponding images always contain multiple persons, assign a person by random selection.
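The initialization steps above can be sketched as below, for a single label type (names or verbs). This is a simplification under stated assumptions: it skips the distance matrix, takes the center vector to be the plain mean of the selected feature vectors, and uses a hypothetical input format of `(labels_in_caption, person_feature_vectors)` pairs.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Item = Tuple[Sequence[str], Sequence[Vector]]  # (caption labels, persons)


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def init_centers(
    items: Sequence[Item], rng: Optional[random.Random] = None
) -> Dict[str, Vector]:
    """Initialize one center vector per label from unambiguous captions."""
    rng = rng or random.Random(0)
    single: Dict[str, List[Vector]] = {}   # one label, one person
    guessed: Dict[str, List[Vector]] = {}  # one label, several persons
    for labels, persons in items:
        # Only captions that contain exactly this one label are used.
        if len(labels) != 1 or not persons:
            continue
        label = labels[0]
        if len(persons) == 1:
            single.setdefault(label, []).append(persons[0])
        else:
            # Several persons in the image: pick one at random.
            guessed.setdefault(label, []).append(rng.choice(list(persons)))
    centers: Dict[str, Vector] = {}
    for label in set(single) | set(guessed):
        # Prefer single-person images; fall back to the random picks.
        centers[label] = mean_vector(single.get(label) or guessed[label])
    return centers


items = [
    (["Obama"], [[1.0, 0.0]]),
    (["Obama"], [[3.0, 2.0]]),
    (["Federer"], [[0.0, 0.0], [4.0, 4.0]]),
    (["Obama", "Federer"], [[9.0, 9.0]]),
]
centers = init_centers(items)
```

In this toy input "Obama" has two single-person images, so its center is their mean; "Federer" only appears alone in a caption whose image shows two persons, so one of them is picked at random. The caption naming both people is ignored at this stage.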
Comments • Better results on the chosen dataset. • Somewhat successful in recognizing persons in images without captions.
Comments • Assumes independence between persons in an image. • Limited dataset of 1610 images used for experimentation. • Manual involvement in writing captions. • Images collected using search queries like “Barack Obama” + “Shake hands”. • Such queries result in images with strong correspondence between pose and face.