Exploiting Human Abilities in Video Retrieval Interfaces
Maximizing the Synergy between Man and Machine

Alex Hauptmann
School of Computer Science, Carnegie Mellon University
Background

Automatic video analysis for detection/recognition is still quite poor:
• Consider the baseline (random guessing)
• Improvement over it is limited
• Consider near-duplicates (trivial similarity)
• Does not generalize well across video sources
• Better than nothing

We need humans to make up for this shortcoming!
Differences from VIRAT

Most interface work has been done on broadcast TV, which differs from surveillance video.

Harder:
• Unconstrained subject matter
• Graphics, animations, photos
• Many different broadcasters
• Many different, short shots out of context

Easier:
• Better resolution
• Conventions in editing and structure
• An audio track
• Keyframes are the typical unit of analysis
“Classic” Interface Work

• Interactive video queries
  • Fielded text query matching capabilities
  • Fast image matching, with a simplified interface for launching image queries
• Interactive browsing, filtering, and summarizing
  • Browsing by visual concepts
  • Quick display of contents and context in synchronized views
“Classic” Video Interface Results

• Concept browsing and image search are used frequently
• Novices still have lower performance than experts
• Some topics reduce “interactivity” to a one-shot query with little browsing or exploration
• The classic Informedia interface, including concept browsing, is often good enough that the user never proceeds to any additional text or image query
• “Classic Informedia” scored highest among systems tested with novice users in the TRECVID evaluations
Augmented Video Retrieval

The computer observes the user and LEARNS, based on what is marked as relevant. The system can learn:
• Which image characteristics are relevant
• Which text characteristics (words) are relevant
• What combination weights should be used

We exploit the human’s ability to quickly mark relevant video and the computer’s ability to learn from the given examples (see the sketch below).
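A minimal sketch of the weight-learning step, assuming per-shot scores from three modalities and using scikit-learn’s logistic regression; the modalities, score values, and classifier choice are illustrative assumptions, not the actual Informedia implementation:

```python
# Sketch: learn per-modality combination weights from user relevance marks.
# The score vectors (text/image/face per shot) are made up for illustration;
# a real system would take them from its retrieval components.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [text_score, image_score, face_score] for one candidate shot.
shot_scores = np.array([
    [0.9, 0.7, 0.1],   # marked relevant
    [0.8, 0.6, 0.0],   # marked relevant
    [0.2, 0.1, 0.9],   # marked non-relevant
    [0.1, 0.3, 0.8],   # marked non-relevant
])
user_marks = np.array([1, 1, 0, 0])  # 1 = relevant, 0 = not

model = LogisticRegression().fit(shot_scores, user_marks)
print("learned weights (text, image, face):", model.coef_[0])

# Re-rank unseen shots by the learned combination.
candidates = np.array([[0.7, 0.8, 0.2], [0.1, 0.2, 0.7]])
print("relevance scores:", model.predict_proba(candidates)[:, 1])
```

The classifier’s coefficients play the role of the combination weights; each additional mark from the user becomes another training example, so the weights sharpen as the session proceeds.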
Combining Concept Detectors for Retrieval

[Architecture diagram: a multimodal query (e.g., “Find Pope John Paul”) with text, audio, motion, and image components is matched against the video library through diverse knowledge sources: closed captions, audio features, motion features, color features, face and building detectors, among others (on the order of 3k). Each knowledge source/API produces its own output list (Output 1 … Output n); the combination of these diverse knowledge sources yields the final ranked list for the interface.]
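A minimal sketch of the fusion step shown in the diagram, assuming simple min-max normalization and fixed per-source weights; the source names, scores, and weights are made up for illustration:

```python
# Sketch of late fusion over detector outputs: each knowledge source returns
# shot scores on its own scale, so normalize per source before the weighted
# sum that produces the final ranked list.

def normalize(scores):
    """Min-max normalize a {shot_id: score} dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {s: (v - lo) / span for s, v in scores.items()}

def fuse(outputs, weights):
    """Weighted sum of normalized per-source scores -> final ranked list."""
    fused = {}
    for source, scores in outputs.items():
        for shot, v in normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + weights[source] * v
    return sorted(fused, key=fused.get, reverse=True)

outputs = {
    "closed_caption": {"shot1": 2.1, "shot2": 0.3, "shot3": 1.5},
    "face":           {"shot1": 0.9, "shot2": 0.1, "shot3": 0.4},
    "color":          {"shot1": 0.2, "shot2": 0.8, "shot3": 0.5},
}
weights = {"closed_caption": 0.5, "face": 0.3, "color": 0.2}
print(fuse(outputs, weights))  # -> ['shot1', 'shot3', 'shot2']
```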
Why Relevance Feedback?

• Limited training data
• Untrained sources are useful for some specific searches

Example query: finding some boats/ships (Q-type: general object). Weights learned from the training set: Txt: 0.5, Img: 0.3, Face: -0.5. Weights unable to be learned: Outdoor: ?, Ocean: ?
Probabilistic Local Context Analysis (pLCA) [Yan07]

• Goal: refine the results of the current query
• Method: assume the combination parameters of the “un-learned” sources υ to be latent variables and compute P(y_j | a_j, D_j)
• Discover useful search knowledge based on the initial results A_i

[Diagram: a query yields initial search results A_1 … A_M with unknown relevance labels Y_1 … Y_M over Video 1 … Video M; the latent source weights υ_1, …, υ_N are all unknown (“?”).]
Undirected Model and Parameter Estimation

Compute the posterior probability of document relevance Y given the initial results A, based on an undirected graphical model.

Variational inference, i.e., iterate until convergence and approximate P(y_j | a_j) by q_{y_j}:
• Maximize with respect to the variational parameters of Y
• Maximize with respect to the variational parameters of υ

[Diagram: the shared latent weights υ connect each triple (A_j, Y_j, D_j) for j = 1 … m.]
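The following is a rough sketch of this alternating scheme, not a faithful reproduction of the Yan07 update equations: soft relevance estimates q are seeded from the initial ranking, then the latent weights υ and the relevance estimates are updated in turn. The score matrix, blending factor, and update rule are all illustrative assumptions:

```python
# Sketch of the alternating estimation behind pLCA (not Yan07's exact
# updates): unknown source weights v are latent, so seed soft relevance q
# from the initial ranking, then iterate: fit v to explain q, then refresh
# q from the combined scores. A is an (m shots x n sources) score matrix.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plca_sketch(A, initial_rank_scores, iters=20, lr=0.5):
    q = sigmoid(initial_rank_scores)          # soft relevance P(y_j = 1)
    v = np.zeros(A.shape[1])                  # latent combination weights
    for _ in range(iters):
        # M-like step: gradient update of v toward explaining q
        # (A.T @ (q - p) is the cross-entropy gradient with soft labels q).
        p = sigmoid(A @ v)
        v += lr * A.T @ (q - p) / len(q)
        # E-like step: blend the prior relevance with the combined score.
        q = 0.5 * sigmoid(initial_rank_scores) + 0.5 * sigmoid(A @ v)
    return q, v

rng = np.random.default_rng(0)
A = rng.random((100, 4))                      # made-up detector scores
init = A[:, 0] * 2 - 1                        # pretend text drove the query
q, v = plca_sketch(A, init)
print("inferred source weights:", np.round(v, 2))
```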
Extreme Video Retrieval

• Automatic retrieval baseline determines the ranking order
• Two methods of presentation:
  • System-controlled presentation: Rapid Serial Visual Presentation (RSVP)
  • User-controlled presentation: manual browsing with resizing of pages
System-Controlled Presentation

Rapid Serial Visual Presentation (RSVP):
• Minimizes eye movements: all images appear in the same location
• Maximizes information transfer from system to human: up to 10 key images per second, 1 or 2 images per page
• Presentation intervals are dynamically adjustable
• The user clicks when a relevant shot is seen; the previous page is also marked as relevant
• A final verification step is necessary
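A minimal sketch of the RSVP control loop; show_page and poll_click are hypothetical stubs standing in for a real UI toolkit, and the interval value is illustrative:

```python
# Sketch of the RSVP control loop: pages flash at one screen position at an
# adjustable interval; a click marks the current page and, to absorb the
# user's reaction lag, the previous one as well.
import time

def show_page(page):
    """Stub: draw the page's key images at a fixed screen position."""
    print("showing", page)

def poll_click():
    """Stub: nonblocking check for a user click."""
    return False

def rsvp(pages, interval=0.25):
    """Flash pages at `interval` seconds each; return pages to verify."""
    marked = set()
    for i, page in enumerate(pages):
        show_page(page)
        deadline = time.monotonic() + interval
        while time.monotonic() < deadline:
            if poll_click():
                marked.add(i)
                if i > 0:
                    marked.add(i - 1)  # reaction lag: keep previous page too
            time.sleep(0.01)
    return [pages[i] for i in sorted(marked)]  # goes to final verification

print(rsvp([f"page_{i}" for i in range(4)], interval=0.1))
```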
User-Controlled Presentation

Manual browsing with resizing of pages:
• Manually page through images; the user decides when to view the next page
• Vary the number of images on a page
• Allow chording on the keypad to mark shots
• A very brief final verification step
Extreme QA with RSVP

• 3×3 display
• 1 page/second
• Numpad chording to select shots (see the sketch below)
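A sketch of how numpad chording might map onto the 3×3 display; the key-to-cell layout is an assumption chosen to mirror the grid:

```python
# Sketch of numpad chording for a 3x3 page: each digit selects the image in
# the matching grid cell, so pressing several keys at once marks several
# shots in one gesture.
NUMPAD_TO_CELL = {           # numpad layout mirrors the 3x3 display
    "7": 0, "8": 1, "9": 2,  # top row
    "4": 3, "5": 4, "6": 5,  # middle row
    "1": 6, "2": 7, "3": 8,  # bottom row
}

def chord_to_shots(pressed_keys, page_shots):
    """Map simultaneously pressed numpad keys to shot ids on this page."""
    return [page_shots[NUMPAD_TO_CELL[k]]
            for k in pressed_keys if k in NUMPAD_TO_CELL]

page = [f"shot_{i}" for i in range(9)]
print(chord_to_shots({"7", "5", "3"}, page))  # marks three cells at once
```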
Mindreading: an EEG Interface

• Learn Relevant/Non-Relevant from brain signals
• 5 EEG probes, simple features
• Too slow: 250 ms/image, with significant recovery time after a hit
• Usable for relevance feedback
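A toy sketch of the EEG relevance detector: per-probe band-power features feeding a linear classifier. The signals here are synthetic and the feature choice is an assumption; a real setup would need calibrated recordings:

```python
# Sketch of the EEG relevance detector: simple per-probe features (here,
# mean squared amplitude over a window) feeding a linear classifier. The
# signals are synthetic stand-ins for real 250 ms recordings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def band_power(window):
    """Crude feature: mean squared amplitude, one value per probe."""
    return (window ** 2).mean(axis=1)

# Fake 250 ms windows from 5 probes; relevant images get a stronger response.
relevant = [band_power(rng.normal(0, 1.5, (5, 64))) for _ in range(40)]
nonrel   = [band_power(rng.normal(0, 1.0, (5, 64))) for _ in range(40)]
X = np.vstack(relevant + nonrel)
y = np.array([1] * 40 + [0] * 40)

clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))
```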
Summarizing Video: Beyond Keyframes

• BBC rushes: unedited video for TV series production
• Task: summarize as a video in 1/50th of the total duration (note the non-scalable target factor)
• Lots of smart analysis: clustering, salience, redundancy, importance
• The best performance for retrieval was to play every frame
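A sketch of the clustering idea under the 1/50 budget, assuming equal-length shots and made-up keyframe descriptors; this illustrates one of the “smart analysis” approaches, not the actual BBC rushes pipeline:

```python
# Sketch of cluster-based summary selection: cluster keyframe descriptors
# and keep the shot nearest each cluster center, with the cluster count set
# by the 1/50 duration budget (assuming equal-length shots).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n_shots, budget = 500, 1 / 50
features = rng.random((n_shots, 16))       # stand-in keyframe descriptors

k = max(1, int(n_shots * budget))          # shots the summary can afford
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# One representative per cluster: the shot closest to its center.
summary = sorted(int(np.argmin(((features - c) ** 2).sum(axis=1)))
                 for c in km.cluster_centers_)
print("shots selected for the summary:", summary)
```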
Surveillance Event Detection

• Interesting events are rare
• Detection accuracy is limited
• Many streams must be monitored
Surveillance Event Detection (cont.)

• We need the action, not a key frame
• Difficult for humans
• Combine speed-up with automatic analysis: slow down when interesting events happen (see the sketch below)
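A minimal sketch of detector-guided playback speed, with made-up event scores and an arbitrary threshold:

```python
# Sketch of detector-guided playback: run fast through quiet footage and
# slow down when the automatic event score rises. Scores here are made up;
# a real monitor would stream them from the event detector.
def playback_rate(score, fast=8.0, slow=1.0, threshold=0.6):
    """Map an event-detector score in [0, 1] to a playback speed."""
    return slow if score >= threshold else fast

frame_scores = [0.1, 0.2, 0.1, 0.7, 0.9, 0.8, 0.2, 0.1]
for t, s in enumerate(frame_scores):
    print(f"t={t}: score={s:.1f} -> play at {playback_rate(s)}x")
```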
Summary

Interfaces have much to contribute in retrieval, but we don’t know what is best:
• Task-specific
• User-specific
• System-dependent
• Collaborative search
• Combining the “best of current systems”
• Simpler is usually better (Occam’s razor)

General principles are difficult to find.