Mentor: Prof. Amitabha Mukherjee Object Detection in Videos – Attention Based Cues - Shubham Tulsiani (Y9574)
The Importance of Attention • Object detection algorithms are computationally expensive • Modeling Attention is a biologically motivated way of preselecting regions for further costly computations
Attention-Based Approaches (Images): Previous Works • Attention-based approaches have often been used for static images. Some examples are: • The Itti-Koch saliency model • Contextual cues combined with saliency for search tasks • Feature-, context- and saliency-based attention models
Itti-Koch Saliency This model provides a measure of the 'saliency' of each location in the image across various low-level features (contrast, color, orientation, texture, motion). It is a primitive model of attention that has been used for object detection
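As an illustration, a crude single-scale center-surround contrast map can stand in for the full Itti-Koch multi-scale pyramid: a pixel is salient when its small local neighborhood differs from a larger surround. This is only a sketch; the function name and window sizes below are our own, not from the slides.

```python
import numpy as np

def center_surround_saliency(img, center=3, surround=9):
    """Single-channel center-surround contrast map: |mean over a small
    window - mean over a larger window|, rescaled to [0, 1]."""
    def box_mean(x, k):
        # Mean filter for an odd window size k, via a summed-area table.
        pad = k // 2
        p = np.pad(x, pad, mode="edge")
        c = np.cumsum(np.cumsum(p, axis=0), axis=1)
        c = np.pad(c, ((1, 0), (1, 0)))  # integral image with zero border
        win = c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]
        return win / (k * k)

    sal = np.abs(box_mean(img, center) - box_mean(img, surround))
    return sal / (sal.max() + 1e-8)
```

On a frame containing a single bright dot on a dark background, the map peaks at the dot, matching the "black dot on a white board" intuition from the later slide.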
Context and Saliency Based Attention - Eye Movements and Attention - Torralba • The human visual system makes extensive use of contextual information to facilitate object search in natural scenes • This, combined with saliency, was used to model attention for object detection
Features, Context and Saliency A combination of Saliency, Context and Feature based cues has been used to obtain more evolved attention models
Performance of Various Attention Models on Images It has been observed that the combined model of visual attention performs better than isolated ones. Thus, human visual attention is driven by various factors, which combined models capture more effectively
Object Detection in Videos: Challenges • A lot of data! • Applying a static-image object detector to every frame is very costly
Object Detection in Videos: Advantages • We can exploit information across frames to build an effective detector; this helps eliminate false positives that often occur in single images • Various attention-based cues do not have to be recomputed every frame
Some Common Approaches • Feature-based object detection • Motion-based object detection • There is no notable visual-attention-based approach for object detection in videos
A Proposed Methodology • We compute maps for the various cues that drive our attention: saliency, motion in the video, context and feature resemblance • We combine these cues to obtain a model of visual attention, which gives us the regions of interest for object detection
Saliency Based Cues A black dot on a white board is salient and draws our attention • Saliency is a bottom-up cue, i.e. saliency maps are independent of the object being searched for and of the semantic content of the video • High saliency indicates that a region stands out from its surroundings
Motion Detection • We would like to focus our attention on regions where motion is detected, because there is a higher probability that the object of interest is present there • This cue is also bottom-up and corresponds to saliency in a temporal sense
Contextual Cues A person is more likely to be present on the road than in the sky The context map for pedestrian detection will show higher values for regions near the ground • Gives an indication of where the object is more likely to be present • Does not have to be computed very frequently in a video (especially for a static camera)
Feature Based Cues While searching for a snake, we are likely to focus on long, thin objects • Indicate resemblance to the object being searched for • Instead of features from a single static frame, we should take into account features from a set of frames • We can learn how the object looks across a sequence of frames
Feature Based Cues • This can be achieved by modifying our base static detection approach to represent dynamic information by extending the static representation into the time domain • More complex approaches can be used but since the aim is to get a computationally inexpensive feature map, the above will suffice
Combining Cues An object detection model for people should give more weight to the motion cue than a model for trees • We can combine the cues to obtain a model of visual attention in videos for the object to be detected • The combined map determines the regions of interest in the video • For a general model applicable across all objects, we should be able to learn the weights associated with each cue
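One simple reading of "combining cues" is a per-object weighted average of the normalized cue maps. The sketch below assumes all maps share the same shape and lie in [0, 1]; the example weight values are made-up placeholders (the slides propose learning them per object class).

```python
import numpy as np

def combine_cues(maps, weights):
    """Weighted average of per-cue attention maps.
    maps: dict name -> HxW array in [0, 1]; weights: dict name -> float.
    Weights are renormalized so the result stays in [0, 1]."""
    total = sum(weights[k] for k in maps)
    return sum((weights[k] / total) * maps[k] for k in maps)

# Hypothetical per-class weights: a pedestrian model leans on motion,
# whereas a model for trees would down-weight it.
person_weights = {"saliency": 1.0, "motion": 3.0, "context": 1.5, "feature": 2.0}
```

A learned version would fit these weights from annotated fixations or detections rather than hand-tuning them.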
An Implementation: Overview • We learn a detector for humans in videos based on the proposed methodology • We test our model on videos from an annotated video database, 'LabelMe Video' • We use the maps for saliency, motion, context and features to detect regions of interest
Saliency • We have used the Itti-Koch model to compute the saliency maps • Since computing saliency is computationally inexpensive, we compute it for every frame, though this could be made more efficient
Motion Detection • We highlight regions where pixel values differ from the corresponding pixels in the previous frame (this approach does not work for moving cameras) • We have taken into account the slight instability of hand-held cameras in the computation of these motion maps
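A minimal frame-differencing sketch of the above. The jitter handling here, taking the minimum difference over small shifts of the previous frame, is our guess at how the slides absorb hand-held-camera instability, not a description of the actual implementation.

```python
import numpy as np

def motion_map(prev, curr, max_shift=1, thresh=0.1):
    """Binary motion map by frame differencing. To tolerate slight camera
    jitter, each pixel keeps the MINIMUM absolute difference over small
    shifts (+-max_shift pixels) of the previous frame, so a global
    1-pixel shake does not register as motion.
    Note: np.roll wraps at the borders, which is acceptable for a sketch."""
    best = np.full(curr.shape, np.inf)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(prev, dy, axis=0), dx, axis=1)
            best = np.minimum(best, np.abs(curr - shifted))
    return (best > thresh).astype(np.uint8)
```

A pixel that genuinely changes fires in the map, while a whole-frame shift within the tolerance produces no response.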
Contextual Cues • Used over 600 images from the LabelMe database to train a context model • Since context does not change rapidly in a video, we recompute it only every 10 frames
Contextual Cues (figure: computed context map alongside the original frame)
Feature Based Cues • We have trained a Viola-Jones-based detector using AdaBoost over 100,000 base features • To take into account the temporal aspect of features, we normalise the map over a set of frames
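One plausible reading of "normalise the map over a set of frames" is averaging the per-frame feature maps over a short window and rescaling the result to [0, 1]. The window-averaging choice below is our interpretation, not the slides' stated method.

```python
import numpy as np

def temporal_average(maps):
    """Average a list of per-frame feature maps (all HxW), then rescale
    to [0, 1], smoothing out single-frame detector noise."""
    m = np.mean(np.stack(maps), axis=0)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
```

Regions the detector fires on in only one frame of the window are attenuated relative to regions it fires on consistently.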
The Dynamic Attention Model • We combine the various cues to obtain the model of visual attention in videos for pedestrian detection • Further, we can select the regions above a certain threshold (top 20%) for object detection via a costly algorithm
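Selecting "the region above a certain threshold (20%)" can be sketched as keeping the top 20% of attention scores by percentile; the 20% figure comes from the slides, while the percentile rule itself is our interpretation of that threshold.

```python
import numpy as np

def regions_of_interest(att, keep=0.20):
    """Boolean mask selecting pixels whose attention score falls in the
    top `keep` fraction; the expensive detector then runs only there."""
    cut = np.quantile(att, 1.0 - keep)
    return att >= cut
```

This caps the area handed to the costly per-window detector at roughly one fifth of each frame, which is the point of the attention front end.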
Future Scope of Work • We can interpolate the various maps over time for more effective detectors • Alternate models for the various cues can be used • The proposed model can be extended to be implemented in real time
References • A Trainable System for Object Detection in Images and Video Sequences - Constantine P. Papageorgiou • Modeling Search for People in 900 Scenes: A Combined Source Model of Eye Guidance - Torralba et al. • Object Detection and Tracking in Video - Zhong Guo • Various databases and code sources