Mentor: Prof. Amitabha Mukherjee Object Detection in Videos – Attention Based Cues - Shubham Tulsiani (Y9574)
The Importance of Attention • Object detection algorithms are computationally expensive • Modeling Attention is a biologically motivated way of preselecting regions for further costly computations
Attention-Based Approaches (Images): Previous Works • Attention-based approaches have often been used for static images. Some examples are: • The Itti-Koch saliency model • Contextual cues combined with saliency for search tasks • Feature-, context- and saliency-based attention models
Itti-Koch Saliency This model provides a measure of the 'saliency' of each location in the image across various low-level features (contrast, color, orientation, texture, motion). It is a primitive model of attention that has been used for object detection
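As an illustration, a crude single-scale center-surround contrast map can stand in for the full Itti-Koch multi-scale pyramid: a pixel is salient when its small local neighborhood differs from a larger surround. This is only a sketch; the function name and window sizes below are our own, not from the slides.

```python
import numpy as np

def center_surround_saliency(img, center=3, surround=9):
    """Single-channel center-surround contrast map: |mean over a small
    window - mean over a larger window|, rescaled to [0, 1]."""
    def box_mean(x, k):
        # Mean filter for an odd window size k, via a summed-area table.
        pad = k // 2
        p = np.pad(x, pad, mode="edge")
        c = np.cumsum(np.cumsum(p, axis=0), axis=1)
        c = np.pad(c, ((1, 0), (1, 0)))  # integral image with zero border
        win = c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]
        return win / (k * k)

    sal = np.abs(box_mean(img, center) - box_mean(img, surround))
    return sal / (sal.max() + 1e-8)
```

On a frame containing a single bright dot on a dark background, the map peaks at the dot, matching the "black dot on a white board" intuition from the later slide.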
Context and Saliency Based Attention - Eye Movements and Attention - Torralba • The human visual system makes extensive use of contextual information to facilitate object search in natural scenes • This, combined with saliency, was used to model attention for object detection
Features, Context and Saliency A combination of Saliency, Context and Feature based cues has been used to obtain more evolved attention models
Performance of Various Attention Models on Images It has been observed that the combined model of visual attention performs better than isolated ones. Thus, human visual attention is driven by various factors, which combined models capture more effectively
Object Detection in Videos: Challenges • A lot of data! • Applying a static-image object detector to every frame is very costly
Object Detection in Videos: Advantages • We can exploit information across frames to build an effective detector; this helps eliminate false positives that often occur in single images • Various attention-based cues do not have to be recomputed every frame
Some Common Approaches • Feature-based object detection • Motion-based object detection • There is no notable visual-attention-based approach for object detection in videos
A Proposed Methodology • We compute maps for the various cues that drive our attention: saliency, motion in the video, context and feature resemblance • We combine these cues to obtain a model of visual attention, which gives us the regions of interest for object detection
Saliency Based Cues A black dot on a white board is salient and draws our attention • Saliency is a bottom-up cue, i.e. saliency maps are independent of the object being searched for and of the semantic content of the video • High saliency indicates that a region stands out from its surroundings
Motion Detection • We would like to focus our attention on regions where motion is detected, because there is a higher probability that the object of interest is present there • This cue is also bottom-up and corresponds to saliency in a temporal sense
Contextual Cues A person is more likely to be present on the road than in the sky The context map for pedestrian detection will show higher values for regions near the ground • Gives an indication of where the object is more likely to be present • Does not have to be computed very frequently in a video (especially for a static camera)
Feature Based Cues While searching for a snake, we are likely to focus on long, thin objects • Indicate resemblance to the object being searched for • Instead of features from a single static frame, we should take into account features from a set of frames • We can learn how the object looks across a sequence of frames
Feature Based Cues • This can be achieved by modifying our base static detection approach to represent dynamic information by extending the static representation into the time domain • More complex approaches can be used but since the aim is to get a computationally inexpensive feature map, the above will suffice
Combining Cues An object detection model for people should give more weight to the motion cue than a model for trees • We can combine the cues to obtain a model of visual attention in videos for the object to be detected • The combined map determines the regions of interest in the video • For a general model applicable across all objects, we should be able to learn the weights associated with each cue
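One simple reading of "combining cues" is a per-object weighted average of the normalized cue maps. The sketch below assumes all maps share the same shape and lie in [0, 1]; the example weight values are made-up placeholders (the slides propose learning them per object class).

```python
import numpy as np

def combine_cues(maps, weights):
    """Weighted average of per-cue attention maps.
    maps: dict name -> HxW array in [0, 1]; weights: dict name -> float.
    Weights are renormalized so the result stays in [0, 1]."""
    total = sum(weights[k] for k in maps)
    return sum((weights[k] / total) * maps[k] for k in maps)

# Hypothetical per-class weights: a pedestrian model leans on motion,
# whereas a model for trees would down-weight it.
person_weights = {"saliency": 1.0, "motion": 3.0, "context": 1.5, "feature": 2.0}
```

A learned version would fit these weights from annotated fixations or detections rather than hand-tuning them.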
An Implementation: Overview • We learn a detector for humans in videos based on the proposed methodology • We test our model on videos from an annotated video database, 'LabelMe Video' • We use the maps for saliency, motion, context and features to detect regions of interest
Saliency • We have used the Itti-Koch model to compute the saliency maps • Since computing saliency is computationally inexpensive, we compute it for every frame, though this could be made more efficient
Motion Detection • We highlight regions where pixel values differ from the corresponding pixels in the previous frame (this approach does not work for moving cameras) • We have taken into account the slight instability of hand-held cameras in the computation of these motion maps
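A minimal frame-differencing sketch of the above. The jitter handling here, taking the minimum difference over small shifts of the previous frame, is our guess at how the slides absorb hand-held-camera instability, not a description of the actual implementation.

```python
import numpy as np

def motion_map(prev, curr, max_shift=1, thresh=0.1):
    """Binary motion map by frame differencing. To tolerate slight camera
    jitter, each pixel keeps the MINIMUM absolute difference over small
    shifts (+-max_shift pixels) of the previous frame, so a global
    1-pixel shake does not register as motion.
    Note: np.roll wraps at the borders, which is acceptable for a sketch."""
    best = np.full(curr.shape, np.inf)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(prev, dy, axis=0), dx, axis=1)
            best = np.minimum(best, np.abs(curr - shifted))
    return (best > thresh).astype(np.uint8)
```

A pixel that genuinely changes fires in the map, while a whole-frame shift within the tolerance produces no response.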
Contextual Cues • Used over 600 images from the LabelMe database to train a context model • Since context does not change rapidly in a video, we recompute it only every 10 frames
Contextual Cues (figure: computed context map alongside the original frame)
Feature Based Cues • We have trained a Viola-Jones-based detector using AdaBoost over 100,000 base features • To take into account the temporal aspect of features, we normalise the map over a set of frames
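One plausible reading of "normalise the map over a set of frames" is averaging the per-frame feature maps over a short window and rescaling the result to [0, 1]. The window-averaging choice below is our interpretation, not the slides' stated method.

```python
import numpy as np

def temporal_average(maps):
    """Average a list of per-frame feature maps (all HxW), then rescale
    to [0, 1], smoothing out single-frame detector noise."""
    m = np.mean(np.stack(maps), axis=0)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
```

Regions the detector fires on in only one frame of the window are attenuated relative to regions it fires on consistently.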
The Dynamic Attention Model • We combine the various cues to obtain the model of visual attention in videos for pedestrian detection • Further, we can select the regions above a certain threshold (top 20%) for object detection via a costly algorithm
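Selecting "the region above a certain threshold (20%)" can be sketched as keeping the top 20% of attention scores by percentile; the 20% figure comes from the slides, while the percentile rule itself is our interpretation of that threshold.

```python
import numpy as np

def regions_of_interest(att, keep=0.20):
    """Boolean mask selecting pixels whose attention score falls in the
    top `keep` fraction; the expensive detector then runs only there."""
    cut = np.quantile(att, 1.0 - keep)
    return att >= cut
```

This caps the area handed to the costly per-window detector at roughly one fifth of each frame, which is the point of the attention front end.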
Future Scope of Work • We can interpolate the various maps over time for more effective detectors • Alternate models for the various cues can be used • The proposed model can be extended to be implemented in real time
References • A Trainable System for Object Detection in Images and Video Sequences - Constantine P. Papageorgiou • Modeling Search for People in 900 Scenes: A Combined Source Model of Eye Guidance - Torralba et al. • Object Detection and Tracking in Video - Zhong Guo • Various databases and code sources