Star Challenge – multimedia search competition 2008

NUS.SIGIR group • Luong Minh Thang & Zhao Jin • WING group meeting – 12 Sep, 2008

Agenda • About StarChallenge • Approaches • Audio system • Video system • Results

Let’s start with a clip on Tai Chi!

The Star Challenge • International Competition organized by Singapore A*STAR • Focus on Multimedia Search by Voice and Video • Prize: • Free Trip to Singapore (blah!) • USD 100,000 (!!!)

The Tasks • Voice Search • AT1: Search by IPA (International Phonetic Alphabet) • AT2: Search by Example • AT3: Search for recurrent voice segments • Video Search • VT1: Search by (single) Query Image • VT2: Search by Video Shot • VT3: Scene/Event Categorization • Note: AT3 and VT3 were replaced by integrated search in the end

Timeline • Mar 31: Registration Deadline • Registered as adMIRer • 5 members from NUS-SIGIR • 56 teams registered in total • June 18: 1st Knockout Round • AT1+AT2 • 8 Teams qualified

Timeline • July 18: 2nd Knockout Round • VT1+VT2 • 7 Teams qualified • September 4: Qualifying Race • All four tasks with Integrated Search • Only 5 Teams would qualify • October 23: Grand Final • On-site evaluation

Audio system – general approach • Use MFCC features, which represent speech well • Use local alignment to align the two sequences (test audio and query) • Using the spectrogram, cut long audio into small segments for better matching • Short demo
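
The local alignment step can be sketched as a Smith-Waterman style dynamic program over frame similarities. This is only an illustration under stated assumptions, not the team's actual code: the cosine similarity, the `threshold` and `gap` values, and the function names are all hypothetical, and the frames are stand-ins for real MFCC vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (stand-ins for MFCC frames)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def local_align(query, test, threshold=0.5, gap=0.5):
    """Smith-Waterman style local alignment of two frame sequences.

    Frame pairs more similar than `threshold` raise the score, others lower it.
    `threshold` and `gap` are illustrative values, not tuned parameters.
    Returns (best_score, end_position_in_test).
    """
    rows, cols = len(query) + 1, len(test) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best, best_j = 0.0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            match = cosine(query[i - 1], test[j - 1]) - threshold
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + match,  # align the two frames
                          H[i - 1][j] - gap,        # skip a query frame
                          H[i][j - 1] - gap)        # skip a test frame
            if H[i][j] > best:
                best, best_j = H[i][j], j
    return best, best_j

# Toy usage: the query occurs inside the test sequence at frames 1..3.
query = [[1, 0], [0, 1], [1, 1]]
test = [[-1, 0], [1, 0], [0, 1], [1, 1], [-1, -1]]
score, end = local_align(query, test)
```

Because the score resets to zero whenever it would go negative, the query can match anywhere inside a longer test segment, which is the property that makes local (rather than global) alignment a natural fit here.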

Audio system – system overview • Audio path: query and test audio files → audio feature extractor → query and test MFCC vectors → query-test similarity matrix → alignment & matching • Text path: test audio files → speech recognizer → test text → Lucene indexing → index data; query text + index data → Lucene matching • Both paths → heuristic fusion → results

Audio system – Handle IPA • "i n t r ^ s t r ei t": IPA query • Translate to CMU phonemes: IH N T R AH S T R EY T • INTEREST: IH N T R AH S T • RATE: R EY T • Query text: fed directly to the text module, and synthesized to an audio file for the audio module
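
A minimal sketch of the IPA handling on this slide: map each IPA symbol to a CMU phoneme, then split the phoneme string into dictionary words. The mapping table and the two-entry dictionary below are hypothetical and cover only the slide's example; a real system would need a full table and a full pronunciation dictionary, and the segmentation strategy here is an assumption.

```python
# Hypothetical, partial IPA-to-CMU table: only the symbols in the slide's example.
IPA_TO_CMU = {"i": "IH", "n": "N", "t": "T", "r": "R",
              "^": "AH", "s": "S", "ei": "EY"}

# Toy pronunciation dictionary holding the two entries shown on the slide.
PRON_DICT = {
    ("IH", "N", "T", "R", "AH", "S", "T"): "INTEREST",
    ("R", "EY", "T"): "RATE",
}

def ipa_to_cmu(ipa_query):
    """Translate a space-separated IPA query into a list of CMU phonemes."""
    return [IPA_TO_CMU[sym] for sym in ipa_query.split()]

def segment(phonemes, pron_dict=PRON_DICT):
    """Split a phoneme list into dictionary words; return None if impossible."""
    n = len(phonemes)
    best = [None] * (n + 1)  # best[i] = word list covering phonemes[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = pron_dict.get(tuple(phonemes[start:end]))
            if best[start] is not None and word is not None:
                best[end] = best[start] + [word]
                break
    return best[n]

phonemes = ipa_to_cmu("i n t r ^ s t r ei t")
words = segment(phonemes)  # query text for the text module
```

The resulting word list serves as the query text, while the same words could be passed to a speech synthesizer to produce the audio query.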

Audio system – overall performance • No complete statistics yet, but AT2 (query by example) reaches roughly 30-40% MAP and AT1 roughly 10% • Let’s listen to a few queries …
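
For reference, MAP (mean average precision) can be computed as below. This is the standard textbook definition; the competition's exact scoring rules may have differed, and the function names and toy data are illustrative only.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy usage with two made-up queries.
runs = [(["a", "b", "c", "d"], {"a", "c"}),
        (["x", "y"], {"y"})]
map_score = mean_average_precision(runs)
```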

Video system – VT1 categories • 1. Crowd (>10 people) • 2. Building with sky as backdrop, clearly visible • 3. Mobile devices including handphone/PDA • 4. Flag • 5. Electronic chart, e.g. stock charts, airport departure chart • 6. TV chart overlay, including graphs, text, PowerPoint style • 7. Person using computer, both visible • 8. Track and field, sports • 9. Company trademark, including billboard, logo • 10. Badminton court • 11. Swimming pool, sports • 12. Closeup of hand, e.g. using mouse, writing, etc. • 13. Business meeting (> 2 people), mostly seated, table visible • 14. Natural scene, e.g. mountain, trees, sea, no people • 15. Food on dishes, plates • 16. Face closeup, occupying about 3/4 of screen, frontal or side • 17. Traffic scene, many cars, trucks, road visible • 18. Boat/ship, over sea, lake • 19. PC webpages, screen of PC visible • 20. Airplane

Video system - examples • 16. Face closeup • 9. Company trademark • 2. Building with sky backdrop • 3. Mobile devices

Video system – VT2 categories • 1. People entering/exiting door/car • 2. Talking face with introductory caption • 3. Fingers typing on a keyboard • 4. Inside a moving vehicle, looking outside • 5. Large camera movement, tracking an object, person, car, etc • 6. Static or minute camera movement, people(s) walking, legs visible • 7. Large camera movement, panning left/right, top/down of a scene • 8. Movie ending credit • 9. Woman monologue • 10. Sports celebratory hug

Video system – general approach • Test files → classifiers → classified category • Classified category + query category → category filtering → filtered test files • Filtered test files + query file → matching → matched test files
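
The filter-then-match idea can be sketched as follows. The slides do not say which similarity measure was used for matching, so histogram intersection is an assumption here, as are the function names and the toy data.

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def search(query_hist, query_category, test_items, top_k=3):
    """Keep only test items classified into the query's category, then rank them.

    test_items: list of (name, predicted_category, histogram) tuples.
    """
    candidates = [(name, hist) for name, cat, hist in test_items
                  if cat == query_category]
    ranked = sorted(candidates,
                    key=lambda c: histogram_intersection(query_hist, c[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy usage: item "c" looks like the query but is filtered out by category.
test_items = [("a", 4, [0.5, 0.5, 0.0]),
              ("b", 4, [0.1, 0.1, 0.8]),
              ("c", 9, [0.5, 0.5, 0.0])]
results = search([0.6, 0.4, 0.0], 4, test_items)
```

Filtering first shrinks the candidate set, so the (comparatively expensive) matching step only runs on items that plausibly belong to the query's category.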

Video system - Training data size Development data statistics • Dev = 10% labelled data, Train = 90% labelled data • Size varies significantly across different categories

Video system – classifier training • Input: train key frames + categories • Feature extraction: color extractor → color histogram (HSV, RGB); edge extractor → edge histogram; face detector → number of faces, size, positions; layout extractor → segmentation info • Multi-class SVM training → color, edge, face, and layout classifiers • Dev key frames → per-category recall of the color, edge, face, and layout classifiers, used as weights

Classifier recall/categories • Used as weights when fusing the different classifiers • No error analysis & n-fold testing yet
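
One plausible reading of "recall as weights" is a weighted vote: each classifier votes for its predicted category with a weight equal to its dev-set recall on that category. The slides do not give the exact fusion rule, so the sketch below, including its names and numbers, is an assumption.

```python
def fuse(predictions, recall):
    """Recall-weighted vote over per-classifier predictions.

    predictions: {classifier_name: predicted_category}
    recall:      {classifier_name: {category: recall measured on dev data}}
    """
    votes = {}
    for clf, cat in predictions.items():
        votes[cat] = votes.get(cat, 0.0) + recall[clf].get(cat, 0.0)
    return max(votes, key=votes.get)

# Made-up dev-set recalls for categories 4 (Flag) and 16 (Face closeup).
recall = {"color": {4: 0.3, 16: 0.2},
          "edge": {4: 0.5},
          "face": {16: 0.9},
          "layout": {4: 0.2}}
category = fuse({"color": 16, "edge": 4, "face": 16, "layout": 4}, recall)
```

With these numbers the face classifier's high recall on face closeups outweighs the two weaker votes for category 4, which is the intended effect of weighting by per-category reliability.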

Video system – Category filtering & Matching • Test video → motion extractor → motion histogram; camera & object motion • Test key frames → color extractor, edge extractor, face detector, layout extractor → color histogram (HSV, RGB), edge histogram, number of faces/size/positions, segmentation info → color, edge, face, and layout classifiers → classifier merger (weights from dev data) • Motion features and merged classifier output → heuristic category filtering • Query video/frames → query category → category filtering → filtered video, filtered key frames → matching → results

Video system – motion 1 • Camera: panning left • Camera: panning up • Object motion: static • Object motion: moving

Video system – motion 2 • Check if most vectors ≈ 0 → static motion • Otherwise, filter out all small motion vectors • Categorize the remaining motion vectors into circular (angular) bins → histogram + main motion vector • If the main motion vector dominates → camera motion → panning left, right, up, down • To detect zooming, find a focus block/point • Object motion is derived after removing camera motion
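
The first four steps above can be sketched as follows. All thresholds, the four-bin layout, and the function name are illustrative assumptions; the sketch reports the dominant direction of the motion vectors, and how that maps to a camera pan direction depends on the motion-vector convention in use. Zoom detection and object-motion extraction are left out.

```python
import math

def classify_motion(vectors, min_len=1.0, static_ratio=0.8, dominant_ratio=0.6):
    """Classify block motion vectors (dx, dy), with y pointing up.

    Thresholds are illustrative, not values from the original system.
    """
    # Step 1: if most vectors are near zero, call the shot static.
    moving = [(dx, dy) for dx, dy in vectors if math.hypot(dx, dy) >= min_len]
    if len(vectors) - len(moving) >= static_ratio * len(vectors):
        return "static"
    # Steps 2-3: small vectors are already dropped; bin the rest by angle.
    labels = ["right", "up", "left", "down"]  # four 90-degree bins
    hist = [0] * 4
    for dx, dy in moving:
        angle = math.degrees(math.atan2(dy, dx)) % 360
        hist[int(((angle + 45) % 360) // 90)] += 1
    # Step 4: a dominant bin suggests global (camera) motion.
    main = max(range(4), key=hist.__getitem__)
    if hist[main] >= dominant_ratio * len(moving):
        return "dominant motion: " + labels[main]
    return "mixed"

# Toy usage: eight of ten blocks move left.
label = classify_motion([(-3, 0.2)] * 8 + [(0, 0), (2, 2)])
```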

Conclusion • We have built a fully functional system within a short time and in an ad-hoc manner • There is plenty of room for performance improvement and detailed analysis.

Q & A? • Thank you !!!
