An Architecture for Mining Resources Complementary to Audio-Visual Streams J. Nemrava, P. Buitelaar, N. Simou, D. Sadlier, V. Svátek, T. Declerck, A. Cobet, T. Sikora, N. O'Connor, V. Tzouvaras, H. Zeiner, J. Petrák
Introduction • Video retrieval can strongly benefit from textual sources related to the A/V stream • Vast textual resources available on the web can be used for fine-grained event recognition • Sports video is a good example, as matches come with summaries • Tabular (lists of players, cards, substitutions) • Textual (minute-by-minute reports)
Available Resources • Audio-Video Streams • A/V analysis captures features from the video using suitable detectors • Primary Complementary Resources • Directly attached to the media • Overlay text, spoken commentaries • Secondary Complementary Resources • Independent of the media • Written commentaries, summaries, analyses
Audio-Video Analysis • Crowd image detector • Speech-Band Audio Activity • On-Screen Graphics Tracking • Motion activity measure • Field Line orientation • Close-up
Primary Complementary Resources • Video track • Overlay text OCR • Text region detection • Time synchronization • Merging 16 frames to distinguish moving from static objects in the video (see the sketch below) • Textual information such as overlay text and player numbers provides an additional primary resource • Audio track • Speech commentaries
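The slides do not spell out the merging step, so the following is a minimal sketch of one common approach: averaging a window of 16 consecutive frames smears moving scene content while static overlay graphics stay sharp, which yields a candidate mask for the OCR stage. The file name, thresholds, and helper function are invented for illustration.

import cv2
import numpy as np

def static_overlay_mask(frames):
    # frames: list of 16 grayscale frames as float32 arrays
    avg = np.mean(frames, axis=0)                      # moving objects smear out in the average
    diff = np.mean([np.abs(f - avg) for f in frames], axis=0)
    static = diff < 10.0                               # low temporal variance = static pixel
    bright = avg > 180.0                               # overlay text is typically bright
    return (static & bright).astype(np.uint8) * 255    # candidate text-region mask for OCR

cap = cv2.VideoCapture("match.mpg")                    # hypothetical input file
frames = []
while len(frames) < 16:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32))
if len(frames) == 16:
    mask = static_overlay_mask(frames)                 # passed on to the text OCR stage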
Secondary Complementary Resources • Tabular • Summaries, lists of players, goals, cards • “Meta” information • Location, referee, attendance, date
Secondary Complementary Resources • Unstructured • Several minute-by-minute sources • Text analysis and event extraction using SProUT, an ontology-based IE tool • Player actions • Player names • German and English sources • Example: ‘A beautiful pass by Ruud Gullit set up the first Rijkaard header.’
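SProUT's actual grammar formalism is not shown in the slides; the toy sketch below only illustrates the kind of (minute, event type, player) tuples such extraction yields on the example sentence. All patterns and helper names are invented, not SProUT's API.

import re

EVENT_PATTERNS = {
    "Pass":   re.compile(r"pass by (?P<player>[A-Z]\w+(?: [A-Z]\w+)*)"),
    "Header": re.compile(r"(?P<player>[A-Z]\w+) header"),
}

def extract_events(minute, text):
    # Return (minute, event type, player) tuples found in one report entry
    events = []
    for event_type, pattern in EVENT_PATTERNS.items():
        for m in pattern.finditer(text):
            events.append((minute, event_type, m.group("player")))
    return events

print(extract_events(1, "A beautiful pass by Ruud Gullit set up the first Rijkaard header."))
# [(1, 'Pass', 'Ruud Gullit'), (1, 'Header', 'Rijkaard')]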
Ontology • SProUT uses the SmartWeb football ontology for • Player actions • Referee actions • Trainer actions
Reasoning over complementary resources of football games • Textual sources (per coarse-grained minute) • Extraction of semantic concepts from unstructured texts using DFKI's ontology-based information extraction tool • Video analysis (for every second) – DCU • Crowd image detector – values in [0,1] • Speech-band audio activity – values in [0,1] • Motion activity measure – values in [0,1] • Close-up – values in [0,1] • Field line orientation – values in [0,90] • Alignment of the two time scales is sketched below
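The per-minute text events and per-second detector values live on different time scales. Consistent with the consistOf role in the ABox slide further below, here is a minimal sketch of bucketing the 1 Hz readings by minute; the sample values are taken from that ABox example.

from collections import defaultdict

def bucket_by_minute(readings):
    # readings: iterable of (second, detector_name, value) triples
    buckets = defaultdict(lambda: defaultdict(list))
    for second, detector, value in readings:
        buckets[second // 60][detector].append(value)
    return buckets

readings = [(80, "Crowd", 0.231), (80, "Motion", 0.060), (80, "Audio", 0.06)]
minute1 = bucket_by_minute(readings)[1]        # all detector readings falling in minute 1
print({d: max(v) for d, v in minute1.items()})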
Video Analysis Fuzzification • A period of 20 seconds is evaluated • A threshold value was set according to the detector's mean value during the game • The top value was mapped to [0,1] • A similar process is applied to the motion, close-up and crowd detectors (see the sketch below)
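The slides leave the exact mapping open; the following is one plausible reading, in which readings below the game-wide mean count as no evidence and the window's top value is rescaled into [0,1]. The game_mean and game_max figures are invented.

def fuzzify_window(window_values, game_mean, game_max):
    peak = max(window_values)
    if peak <= game_mean:           # below the per-game threshold: no evidence
        return 0.0
    return (peak - game_mean) / (game_max - game_mean)  # map top value into [0, 1]

crowd_window = [0.12, 0.30, 0.45, 0.41]   # 20 s of crowd-detector output (invented)
print(fuzzify_window(crowd_window, game_mean=0.25, game_max=0.90))  # about 0.31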
Video Analysis Fuzzification • Line angle • Values between 0 and 7 indicate Middle Field • Values between 17 and 27 indicate End of Field • Fuzzification according to their occurrences in the period of 20 seconds (see the sketch below) • Example • Middle Field: 13 occurrences, fuzzy value = 0.65 • End of Field: 4 occurrences, fuzzy value = 0.2 • Other: 3 occurrences, fuzzy value = 0.15
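A minimal sketch of this occurrence-based fuzzification, assuming 20 per-second angle readings per window; the 13/4/3 split reproduces the example above.

def angle_class(angle):
    if 0 <= angle <= 7:
        return "MiddleField"
    if 17 <= angle <= 27:
        return "EndOfField"
    return "Other"

def fuzzify_angles(angles):
    # Each reading votes for a field zone; vote shares become fuzzy degrees
    counts = {"MiddleField": 0, "EndOfField": 0, "Other": 0}
    for a in angles:
        counts[angle_class(a)] += 1
    return {zone: n / len(angles) for zone, n in counts.items()}

angles = [3] * 13 + [20] * 4 + [50] * 3   # the 13/4/3 split from the example
print(fuzzify_angles(angles))
# {'MiddleField': 0.65, 'EndOfField': 0.2, 'Other': 0.15}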
Declaring the Alphabet Concepts = {Scoringopportunity, Outofplay, Handball, Kick, Scoregoal, Cross, Foul, Clear, Cornerkick, Dribble, Freekick, Header, Trap, Shot, Throw, Pass, Ballpossession, Offside, Charge, Lob, Challenge, Booked, Goalkeeperdive, Block, Save, Substitution, Tackle, EndOfField, MiddleField, Other, Crowd, Motion, CloseUp, Audio} Roles = {consistOf} Individuals = {min0, sec20, sec40, sec60, min1, sec80, sec100, sec120, min2, sec140, sec160, sec180, min3, sec200, …}
Knowledge Representation – ABox ⟨min1 : Kick ≥ 1⟩ ⟨min1 : Scoregoal ≥ 1⟩ ⟨sec80 : Audio ≥ 0.06⟩ ⟨sec80 : Crowd ≥ 0.231⟩ ⟨sec80 : Motion ≥ 0.060⟩ ⟨sec80 : EndOfField ≥ 0.05⟩ ⟨(min1, sec60) : consistOf ≥ 1⟩ ⟨(min1, sec80) : consistOf ≥ 1⟩ ⟨(min1, sec100) : consistOf ≥ 1⟩ ⟨(min1, sec120) : consistOf ≥ 1⟩
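A minimal sketch of assembling these fuzzy assertions as plain data before handing them to a reasoner; the actual system uses the FiRE engine, so this representation is illustrative only.

# Fuzzy concept assertions: (individual, concept, lower bound on degree)
concept_assertions = [
    ("min1", "Kick", 1.0), ("min1", "Scoregoal", 1.0),
    ("sec80", "Audio", 0.06), ("sec80", "Crowd", 0.231),
    ("sec80", "Motion", 0.060), ("sec80", "EndOfField", 0.05),
]
# Fuzzy role assertions linking each minute to its 20 s sub-intervals
role_assertions = [("min1", sec, "consistOf", 1.0)
                   for sec in ("sec60", "sec80", "sec100", "sec120")]

for ind, concept, degree in concept_assertions:
    print(f"<{ind} : {concept} >= {degree}>")
for subj, obj, role, degree in role_assertions:
    print(f"<({subj}, {obj}) : {role} >= {degree}>")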
Cross-Media Features • Basic idea • Identify which video detectors are most prominent for which event class • For instance, for CORNERKICK the “end-zone” video detector should score significantly high • Strategy • Analyze the distribution of video detector values over event classes • Identify significant detectors for each class (see the sketch below) • Feed the results back into the video event detection algorithm
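A hedged sketch of the strategy above: average each detector over all segments labelled with a given event class, then flag detectors whose per-class mean is well above their global mean. The margin of 1.5 and the sample values are invented.

from collections import defaultdict

def prominent_detectors(samples, margin=1.5):
    # samples: list of (event_class, {detector: fuzzy_value}) pairs
    by_class = defaultdict(lambda: defaultdict(list))
    global_vals = defaultdict(list)
    for event, values in samples:
        for det, v in values.items():
            by_class[event][det].append(v)
            global_vals[det].append(v)
    result = {}
    for event, dets in by_class.items():
        result[event] = [
            d for d, vs in dets.items()
            if sum(vs) / len(vs) > margin * sum(global_vals[d]) / len(global_vals[d])
        ]
    return result

samples = [("Cornerkick", {"EndOfField": 0.9, "Motion": 0.3}),
           ("Pass",       {"EndOfField": 0.1, "Motion": 0.4})]
print(prominent_detectors(samples))   # EndOfField is flagged for Cornerkick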
Cross-Media Features • The purpose of the cross-media descriptors is to capture the features and relations in multimodal data so that complementary information can be retrieved when dealing with only one of the data sources • Build a model to classify events in video independently of the video itself • Cross-media features are used in event-type classification of video segments by fuzzy reasoning with the FiRE inference engine • FiRE is focused on event retrieval
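A sketch of what such a classification could look like with a min-based (Gödel) conjunction over detector memberships; the real system expresses event types as fuzzy DL concepts and runs FiRE, so the rule table here is an invented stand-in.

# Invented detector profiles per event class (not from the paper)
RULES = {
    "Cornerkick": ["EndOfField", "Crowd"],
    "Scoregoal":  ["Crowd", "Audio", "CloseUp"],
}

def classify(segment):
    # segment: {detector: fuzzy value}; returns (best event class, degree)
    scores = {event: min(segment.get(d, 0.0) for d in dets)   # min = Goedel t-norm
              for event, dets in RULES.items()}
    return max(scores.items(), key=lambda kv: kv[1])

print(classify({"EndOfField": 0.7, "Crowd": 0.8, "Audio": 0.2, "CloseUp": 0.1}))
# ('Cornerkick', 0.7)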