Interactive Event Detection in Video and Audio
Rahul Sukthankar
Intel Research Pittsburgh & Carnegie Mellon University
Contributors
• Diamond team: L. Huston, Satya, L. Mummert, C. Helfrich, L. Fix
• Forensic video retrieval: J. Campbell, P. Pillai, Diamond team
• Volumetric video analysis: Y. Ke, M. Hebert
• Sound object detection in soundtracks: D. Hoiem, Y. Ke
• Interactive search-assisted diagnosis for breast cancer: Y. Liu, R. Jin, B. Zheng, D. Jukic
Why Interactive Event Detection?
• Events of interest are often not known a priori
  • Data exploration: “find me more things like this”
• User’s requirements change based on partial results
  • Surveillance: “Alert me if you see X… hmm… actually I want Y”
• Challenges:
  • Limited training data: can we still learn good event detectors?
  • Efficiency: how best to organize/index/pre-process the data?
Outline
• Event detection in audio
  • sound object detection from a few examples
• Diamond
  • efficient search of non-indexed data
• Event detection in video
  • forensic video surveillance
  • volumetric analysis for action detection
Example: Sound Object Detection
• Applications of sound object detection
  • “Alert me if you hear a gunshot.” (monitoring)
  • “Fast forward to the next swordfight in LotR” (search and retrieval)
• Approach:
  • Learn boosted classifier from ~5-10 examples of the object
  • Scan windowed classifier over all possible locations
[Figure: the audio stream is split into clips (Clip 1 … Clip N); the clip classifier classifies each clip as object or non-object and returns the locations of the detected sound object.]
[D. Hoiem, Y. Ke, R. Sukthankar, ICASSP 2005]
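The "scan windowed classifier over all possible locations" step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window and hop sizes, and the loudness-threshold stand-in for the learned clip classifier are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def scan_stream(
    samples: Sequence[float],
    classify_clip: Callable[[Sequence[float]], bool],
    window: int,
    hop: int,
) -> List[Tuple[int, int]]:
    """Slide a fixed-length window over the stream and return the
    (start, end) sample ranges that the clip classifier accepts."""
    hits = []
    for start in range(0, len(samples) - window + 1, hop):
        clip = samples[start:start + window]
        if classify_clip(clip):
            hits.append((start, start + window))
    return hits


def merge_overlaps(hits: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching detections into single regions."""
    merged: List[Tuple[int, int]] = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def loud_clip(clip: Sequence[float]) -> bool:
    """Hypothetical stand-in for the learned classifier:
    accept a clip when its mean absolute amplitude exceeds 0.5."""
    return sum(abs(x) for x in clip) / len(clip) > 0.5


# Toy stream: silence, a burst at samples 10..15, silence.
stream = [0.0] * 10 + [1.0] * 6 + [0.0] * 10
detections = merge_overlaps(scan_stream(stream, loud_clip, window=4, hop=2))
print(detections)  # [(10, 16)]
```

In a real detector `classify_clip` would be the boosted classifier applied to features extracted from the clip; the scan and merge logic stays the same.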
Sound Object Detection: Clip Classifier
• Feature extraction
• Weak classifier – small decision trees on features
• Learn classifier cascade using AdaBoost …
[Figure labels: 138 features; decision nodes; leaf nodes]
[D. Hoiem, Y. Ke, R. Sukthankar, ICASSP 2005]
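To make the boosting step concrete, here is a minimal sketch of discrete AdaBoost over depth-1 decision trees (stumps). It is a simplification under stated assumptions: the slide describes small decision trees and a classifier cascade, whereas this example trains a single boosted stage on stumps, and the toy feature vectors are invented for illustration.

```python
import math
from typing import List, Tuple

Stump = Tuple[float, int, float, int]  # (alpha, feature index, threshold, polarity)


def stump_output(row: List[float], j: int, thr: float, pol: int) -> int:
    """A depth-1 tree: predict `pol` if feature j >= thr, else `-pol`."""
    return pol if row[j] >= thr else -pol


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_output(row, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost(X, y, rounds: int) -> List[Stump]:
    """Discrete AdaBoost; labels in y must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble: List[Stump] = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_output(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble: List[Stump], row: List[float]) -> int:
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_output(row, j, t, p) for a, j, t, p in ensemble)
    return 1 if score >= 0 else -1


# Invented two-feature toy data: +1 = object clip, -1 = non-object clip.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
train_predictions = [predict(model, row) for row in X]
```

A cascade would chain several such boosted stages, with each stage tuned to reject easy negatives early so that later, more expensive stages see fewer clips.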
Sound Object Detection: Results
[Figure: result panels labeled “Best Performance” and “Worst Performance”]
Framework for Interactive Event Detection
• Interactive event detection =?= non-indexed search
• Search and indexing:
  • If queries can be predicted in advance, indexing is possible (e.g., Google for text data)
  • Alternative is brute-force search through non-indexed data
• How to perform efficient non-indexed search?
  • May need to execute arbitrary code (learned event detector)
Brute-Force Search
• Event detection: vast majority of the data is useless
• BFS scales poorly with storage volume
[Figure: User, Search app, and Storage, with arrows labeled query, results, and discard]
Diamond: Early Discard
• Reject as close to storage as possible
• Reduce volume of data transferred
• Scales much better!
[Figure: User, Search app, and Storage, with arrows labeled query, query’, results, early discard, and late discard]
Diamond Architecture
[Figure: architecture diagram. Component labels: Search Application, Searchlet API, Host runtime, Linux; Storage Runtime, Filter API, Searchlet, and Assoc DMA (repeated). Legend: App code (proprietary or open); Diamond API (open); Diamond code (open); Storage access protocol (open).]
Diamond is a collaborative project between Intel Research & CMU
Anatomy of a Diamond Searchlet
• Sequence of partially-ordered “filters”
  • each filter can pass or drop an object
  • filters share state through attributes
• Diamond determines an optimal filter order
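The searchlet idea can be sketched in a few lines. Everything below is a hypothetical illustration and not the Diamond API: the `Filter` class, its `cost` and `pass_rate` fields, and the greedy ordering rule (run the cheapest filter per expected discard first, subject to dependencies) are assumptions chosen to show how partially-ordered filters, shared attributes, and early discard fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Filter:
    """Hypothetical filter record (not a Diamond type)."""
    name: str
    run: Callable[[Dict], bool]   # True = pass, False = drop; may write attributes
    cost: float                   # assumed average cost per object
    pass_rate: float              # assumed fraction of objects that pass
    requires: Set[str] = field(default_factory=set)  # filters that must run first


def order_filters(filters: List[Filter]) -> List[Filter]:
    """Greedy order that respects dependencies: among runnable filters,
    pick the one with the lowest cost per unit of discard probability."""
    done: Set[str] = set()
    ordered: List[Filter] = []
    remaining = list(filters)
    while remaining:
        ready = [f for f in remaining if f.requires <= done]
        if not ready:
            raise ValueError("cyclic or unsatisfiable filter dependencies")
        best = min(ready, key=lambda f: f.cost / max(1.0 - f.pass_rate, 1e-9))
        ordered.append(best)
        done.add(best.name)
        remaining.remove(best)
    return ordered


def run_searchlet(ordered: List[Filter], obj: Dict) -> Optional[Dict]:
    """Run filters in order; return the attributes, or None on the first drop."""
    attrs = dict(obj)
    for f in ordered:
        if not f.run(attrs):
            return None  # early discard: later filters never run
    return attrs


# Toy filters operating on a comma-separated "data" attribute.
def decode(a: Dict) -> bool:
    a["values"] = [int(c) for c in a["data"].split(",")]
    return True


def has_large(a: Dict) -> bool:
    return max(a["values"]) > 5


def even_sum(a: Dict) -> bool:
    a["sum"] = sum(a["values"])
    return a["sum"] % 2 == 0


filters = [
    Filter("even_sum", even_sum, cost=50.0, pass_rate=0.1, requires={"decode"}),
    Filter("has_large", has_large, cost=2.0, pass_rate=0.2, requires={"decode"}),
    Filter("decode", decode, cost=1.0, pass_rate=1.0),
]
plan = order_filters(filters)
plan_names = [f.name for f in plan]
kept = run_searchlet(plan, {"data": "1,9,2"})
dropped = run_searchlet(plan, {"data": "1,2,3"})
```

Here `decode` must run first because the other two filters read the `values` attribute it writes; the cheap `has_large` test then runs before the expensive `even_sum` test, so most objects are discarded before the costly filter is reached.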
Example Application: Forensic Video Surveillance
• Timely reconstruction of a crime scene
• large quantities of video surveillance data
• current practice: gather & manually scan video tapes
• obvious optimization: transfer data to central site
• Better solution: send your detector to the data
[Figure: several cameras (cam) connected to an App Host]
[J. Campbell et al., VSSN 2004]
Idea: Treat Video as a Volume
[Figure: video shown as a volume with axes X, Y, and T]
Related work: Recognition using SVMs on Space-Time Interest Points
[Figure: space-time interest points]
Figures courtesy: [Schuldt et al., ICPR 2004]
Problem with Space-Time Interest Points: Too Sparse
Two examples of smooth motions where no stable space-time interest points are detected.
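Our Features: 3D Extension of Viola-Jones
[Figure: volumetric features and an integral volume over (x, y, t), with axes X, Y, and T]
Volumetric features can be efficiently computed using integral volumes, with only 8 memory accesses per feature. The sum of the volume is e – a – f – g + b + c + h – d, where a through h are the integral-volume values at the eight corners labeled in the figure.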
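The 8-access box sum can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names and the zero-padded, half-open indexing convention are assumptions, and the corner order in `box_sum` follows standard 3D inclusion-exclusion rather than the a through h labels of the slide's figure.

```python
from typing import List

Volume = List[List[List[int]]]  # indexed as volume[t][y][x]


def integral_volume(video: Volume) -> Volume:
    """Return iv with one layer of zero padding, where iv[t][y][x] is the
    sum of all video values with indices < t, < y, and < x."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    iv = [[[0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                iv[t + 1][y + 1][x + 1] = (
                    video[t][y][x]
                    + iv[t][y + 1][x + 1] + iv[t + 1][y][x + 1] + iv[t + 1][y + 1][x]
                    - iv[t][y][x + 1] - iv[t][y + 1][x] - iv[t + 1][y][x]
                    + iv[t][y][x]
                )
    return iv


def box_sum(iv: Volume, x0: int, y0: int, t0: int, x1: int, y1: int, t1: int) -> int:
    """Sum of video over x0 <= x < x1, y0 <= y < y1, t0 <= t < t1,
    using exactly 8 lookups into the integral volume."""
    return (
        iv[t1][y1][x1]
        - iv[t0][y1][x1] - iv[t1][y0][x1] - iv[t1][y1][x0]
        + iv[t0][y0][x1] + iv[t0][y1][x0] + iv[t1][y0][x0]
        - iv[t0][y0][x0]
    )


# Toy video of ones: 2 frames, 3 rows, 4 columns.
ones = [[[1] * 4 for _ in range(3)] for _ in range(2)]
iv_ones = integral_volume(ones)
whole = box_sum(iv_ones, 0, 0, 0, 4, 3, 2)  # 24 voxels in total
part = box_sum(iv_ones, 1, 0, 1, 3, 2, 2)   # 2 x 2 x 1 sub-box = 4 voxels
```

The integral volume is built once per video in a single pass; after that, every box sum costs the same 8 lookups regardless of the box size, which is what makes scanning many volumetric features affordable.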
Classifier cascade learned using Direct Feature Selection, Wu et al., NIPS, 2002
Millions of potential features for selection, so AdaBoost is too slow.
[Figure: an example of the features learned by the classifier to recognize the hand-wave action in a detection volume, with axes X, Y, and T]
Detection
• Use a sliding volume over video sequence
• Model true event as a cluster of detections with Gaussian distribution
Generic Volumetric Features
• Processing non-indexed video is slow – lots of data
• Are there application-independent representations for video?
• Goal: pre-process video once, support multiple video event apps.
[Y. Ke, unpublished 2006]
Related work: Space-Time Behavior Based Correlation
Figures courtesy: [Shechtman & Irani, CVPR 2005]
Interactive Search-Assisted Diagnosis (ISAD)
[Figure: ISAD results. Query: suspicious mass. Retrieved cases, marked “CLOSE?”: Rank 1: benign biopsy; Rank 2: benign biopsy; Rank 3: malignant biopsy.]
Collaborators: B. Zheng, D. Jukic, L. Yang, R. Jin
Query-adaptive Local Distance Learning
• Previously:
  • Various Lp norms: Euclidean distance is typically not the best
  • Global metric learning:
    • Learn metric that best satisfies user-given pairwise data constraints
    • Fares poorly with multimodal data
  • Local metric learning:
    • Learn metric that does the above, but weights nearby constraints more heavily
    • Chicken & egg problem
• What’s new:
  • Learn a metric for the given query based on neighborhood
Summary
• Many real applications require interactive event detection
• Good for ML algorithms that:
  • operate with limited training data
  • train quickly/incrementally
  • exploit unlabeled data
• Diamond – infrastructure for efficient non-indexed search: http://diamond.cs.cmu.edu/
• Interactive event detection in video is still painful
  • Good general-purpose representation for event detection?