An Architecture for Mining Resources Complementary to Audio-Visual Streams J. Nemrava, P. Buitelaar, N. Simou, D. Sadlier, V. Svátek, T. Declerck, A. Cobet, T. Sikora, N. O'Connor, V. Tzouvaras, H. Zeiner, J. Petrák
Introduction • Video retrieval can strongly benefit from textual sources related to the A/V stream • Vast textual resources available on the web can be used for fine-grained event recognition • Sports video is a good example, as matches come with summaries • Tabular (lists of players, cards, substitutions) • Textual (minute-by-minute reports)
Available Resources • Audio-Video Streams • A/V analysis captures features from the video using suitable detectors • Primary Complementary Resources • Directly attached to the media • Overlay text, spoken commentaries • Secondary Complementary Resources • Independent of the media • Written commentaries, summaries, analyses
Audio-Video Analysis • Crowd image detector • Speech-Band Audio Activity • On-Screen Graphics Tracking • Motion activity measure • Field Line orientation • Close-up
Primary Complementary Resources • Video track • Overlay text OCR • Text region detection • Time synchronization • Merging 16 frames to distinguish moving from static objects in the video (see the sketch below) • Textual information such as overlay text and player numbers provides an additional primary resource • Audio track • Speech commentaries
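The slides do not spell out the merging step, so the following is a minimal sketch of one common approach: averaging a window of 16 consecutive frames smears moving scene content while static overlay graphics stay sharp, which yields a candidate mask for the OCR stage. The file name, thresholds, and helper function are invented for illustration.

import cv2
import numpy as np

def static_overlay_mask(frames):
    # frames: list of 16 grayscale frames as float32 arrays
    avg = np.mean(frames, axis=0)                      # moving objects smear out in the average
    diff = np.mean([np.abs(f - avg) for f in frames], axis=0)
    static = diff < 10.0                               # low temporal variance = static pixel
    bright = avg > 180.0                               # overlay text is typically bright
    return (static & bright).astype(np.uint8) * 255    # candidate text-region mask for OCR

cap = cv2.VideoCapture("match.mpg")                    # hypothetical input file
frames = []
while len(frames) < 16:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32))
if len(frames) == 16:
    mask = static_overlay_mask(frames)                 # passed on to the text OCR stage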
Secondary Complementary Resources • Tabular • Summaries, lists of players, goals, cards • “Meta” information • Location, referee, attendance, date
Secondary Complementary Resources • Unstructured • Several minute-by-minute sources • Text analysis and event extraction using SProUT, an ontology-based IE tool • Player actions • Player names • German and English sources • Example: ‘A beautiful pass by Ruud Gullit set up the first Rijkaard header.’
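SProUT's actual grammar formalism is not shown in the slides; the toy sketch below only illustrates the kind of (minute, event type, player) tuples such extraction yields on the example sentence. All patterns and helper names are invented, not SProUT's API.

import re

EVENT_PATTERNS = {
    "Pass":   re.compile(r"pass by (?P<player>[A-Z]\w+(?: [A-Z]\w+)*)"),
    "Header": re.compile(r"(?P<player>[A-Z]\w+) header"),
}

def extract_events(minute, text):
    # Return (minute, event type, player) tuples found in one report entry
    events = []
    for event_type, pattern in EVENT_PATTERNS.items():
        for m in pattern.finditer(text):
            events.append((minute, event_type, m.group("player")))
    return events

print(extract_events(1, "A beautiful pass by Ruud Gullit set up the first Rijkaard header."))
# [(1, 'Pass', 'Ruud Gullit'), (1, 'Header', 'Rijkaard')]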
Ontology • SProUT uses the SmartWeb football ontology for • Player actions • Referee actions • Trainer actions
Reasoning over complementary resources of football games • Textual sources (per coarse-grained minute) • Extraction of semantic concepts from unstructured texts using DFKI's ontology-based information extraction tool • Video analysis (for every second) – DCU • Crowd image detector – values in [0,1] • Speech-band audio activity – values in [0,1] • Motion activity measure – values in [0,1] • Close-up – values in [0,1] • Field line orientation – values in [0,90] • Alignment of the two time scales is sketched below
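The per-minute text events and per-second detector values live on different time scales. Consistent with the consistOf role in the ABox slide further below, here is a minimal sketch of bucketing the 1 Hz readings by minute; the sample values are taken from that ABox example.

from collections import defaultdict

def bucket_by_minute(readings):
    # readings: iterable of (second, detector_name, value) triples
    buckets = defaultdict(lambda: defaultdict(list))
    for second, detector, value in readings:
        buckets[second // 60][detector].append(value)
    return buckets

readings = [(80, "Crowd", 0.231), (80, "Motion", 0.060), (80, "Audio", 0.06)]
minute1 = bucket_by_minute(readings)[1]        # all detector readings falling in minute 1
print({d: max(v) for d, v in minute1.items()})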
Video Analysis Fuzzification • A period of 20 seconds is evaluated • A threshold value was set according to the detector's mean value during the game • The top value was mapped to [0,1] • A similar process is applied to the motion, close-up and crowd detectors (see the sketch below)
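The slides leave the exact mapping open; the following is one plausible reading, in which readings below the game-wide mean count as no evidence and the window's top value is rescaled into [0,1]. The game_mean and game_max figures are invented.

def fuzzify_window(window_values, game_mean, game_max):
    peak = max(window_values)
    if peak <= game_mean:           # below the per-game threshold: no evidence
        return 0.0
    return (peak - game_mean) / (game_max - game_mean)  # map top value into [0, 1]

crowd_window = [0.12, 0.30, 0.45, 0.41]   # 20 s of crowd-detector output (invented)
print(fuzzify_window(crowd_window, game_mean=0.25, game_max=0.90))  # about 0.31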
Video Analysis Fuzzification • Line angle • Values between 0 and 7 indicate Middle Field • Values between 17 and 27 indicate End of Field • Fuzzification according to their occurrences in the period of 20 seconds (see the sketch below) • Example • Middle Field: 13 occurrences, fuzzy value = 0.65 • End of Field: 4 occurrences, fuzzy value = 0.2 • Other: 3 occurrences, fuzzy value = 0.15
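A minimal sketch of this occurrence-based fuzzification, assuming 20 per-second angle readings per window; the 13/4/3 split reproduces the example above.

def angle_class(angle):
    if 0 <= angle <= 7:
        return "MiddleField"
    if 17 <= angle <= 27:
        return "EndOfField"
    return "Other"

def fuzzify_angles(angles):
    # Each reading votes for a field zone; vote shares become fuzzy degrees
    counts = {"MiddleField": 0, "EndOfField": 0, "Other": 0}
    for a in angles:
        counts[angle_class(a)] += 1
    return {zone: n / len(angles) for zone, n in counts.items()}

angles = [3] * 13 + [20] * 4 + [50] * 3   # the 13/4/3 split from the example
print(fuzzify_angles(angles))
# {'MiddleField': 0.65, 'EndOfField': 0.2, 'Other': 0.15}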
Declaring the Alphabet Concepts = {Scoringopportunity, Outofplay, Handball, Kick, Scoregoal, Cross, Foul, Clear, Cornerkick, Dribble, Freekick, Header, Trap, Shot, Throw, Pass, Ballpossession, Offside, Charge, Lob, Challenge, Booked, Goalkeeperdive, Block, Save, Substitution, Tackle, EndOfField, MiddleField, Other, Crowd, Motion, CloseUp, Audio} Roles = {consistOf} Individuals = {min0, sec20, sec40, sec60, min1, sec80, sec100, sec120, min2, sec140, sec160, sec180, min3, sec200, …}
Knowledge Representation – ABox ⟨min1 : Kick ≥ 1⟩ ⟨min1 : Scoregoal ≥ 1⟩ ⟨sec80 : Audio ≥ 0.06⟩ ⟨sec80 : Crowd ≥ 0.231⟩ ⟨sec80 : Motion ≥ 0.060⟩ ⟨sec80 : EndOfField ≥ 0.05⟩ ⟨(min1, sec60) : consistOf ≥ 1⟩ ⟨(min1, sec80) : consistOf ≥ 1⟩ ⟨(min1, sec100) : consistOf ≥ 1⟩ ⟨(min1, sec120) : consistOf ≥ 1⟩
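A minimal sketch of assembling these fuzzy assertions as plain data before handing them to a reasoner; the actual system uses the FiRE engine, so this representation is illustrative only.

# Fuzzy concept assertions: (individual, concept, lower bound on degree)
concept_assertions = [
    ("min1", "Kick", 1.0), ("min1", "Scoregoal", 1.0),
    ("sec80", "Audio", 0.06), ("sec80", "Crowd", 0.231),
    ("sec80", "Motion", 0.060), ("sec80", "EndOfField", 0.05),
]
# Fuzzy role assertions linking each minute to its 20 s sub-intervals
role_assertions = [("min1", sec, "consistOf", 1.0)
                   for sec in ("sec60", "sec80", "sec100", "sec120")]

for ind, concept, degree in concept_assertions:
    print(f"<{ind} : {concept} >= {degree}>")
for subj, obj, role, degree in role_assertions:
    print(f"<({subj}, {obj}) : {role} >= {degree}>")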
Cross-Media Features • Basic idea • Identify which video detectors are most prominent for which event class • For instance, for CORNERKICK the “end-zone” video detector should score significantly high • Strategy • Analyze the distribution of video detector values over event classes • Identify significant detectors for each class (see the sketch below) • Feed the results back into the video event detection algorithm
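A hedged sketch of the strategy above: average each detector over all segments labelled with a given event class, then flag detectors whose per-class mean is well above their global mean. The margin of 1.5 and the sample values are invented.

from collections import defaultdict

def prominent_detectors(samples, margin=1.5):
    # samples: list of (event_class, {detector: fuzzy_value}) pairs
    by_class = defaultdict(lambda: defaultdict(list))
    global_vals = defaultdict(list)
    for event, values in samples:
        for det, v in values.items():
            by_class[event][det].append(v)
            global_vals[det].append(v)
    result = {}
    for event, dets in by_class.items():
        result[event] = [
            d for d, vs in dets.items()
            if sum(vs) / len(vs) > margin * sum(global_vals[d]) / len(global_vals[d])
        ]
    return result

samples = [("Cornerkick", {"EndOfField": 0.9, "Motion": 0.3}),
           ("Pass",       {"EndOfField": 0.1, "Motion": 0.4})]
print(prominent_detectors(samples))   # EndOfField is flagged for Cornerkick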
Cross-Media Features • The purpose of the cross-media descriptors is to capture the features and relations in multimodal data so that complementary information can be retrieved when dealing with only one of the data sources • Build a model to classify events in video independently of the video itself • Cross-media features are used in event-type classification of video segments by fuzzy reasoning with the FiRE inference engine • FiRE is focused on event retrieval
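A sketch of what such a classification could look like with a min-based (Gödel) conjunction over detector memberships; the real system expresses event types as fuzzy DL concepts and runs FiRE, so the rule table here is an invented stand-in.

# Invented detector profiles per event class (not from the paper)
RULES = {
    "Cornerkick": ["EndOfField", "Crowd"],
    "Scoregoal":  ["Crowd", "Audio", "CloseUp"],
}

def classify(segment):
    # segment: {detector: fuzzy value}; returns (best event class, degree)
    scores = {event: min(segment.get(d, 0.0) for d in dets)   # min = Goedel t-norm
              for event, dets in RULES.items()}
    return max(scores.items(), key=lambda kv: kv[1])

print(classify({"EndOfField": 0.7, "Crowd": 0.8, "Audio": 0.2, "CloseUp": 0.1}))
# ('Cornerkick', 0.7)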