Star Challenge – multimedia search competition 2008 NUS.SIGIR group Luong Minh Thang & Zhao Jin WING group meeting – 12 Sep, 2008
Agenda • About StarChallenge • Approaches • Audio system • Video system • Results
The Star Challenge • International Competition organized by Singapore A*STAR • Focus on Multimedia Search by Voice and Video • Prize: • Free Trip to Singapore (blah!) • USD 100,000 (!!!)
The Tasks • Voice Search • AT1: Search by IPA (International Phonetic Alphabet) • AT2: Search by Example • AT3: Search for recurrent voice segments • Video Search • VT1: Search by (single) Query Image • VT2: Search by Video Shot • VT3: Scene/Event Categorization • Note: AT3 and VT3 were replaced by integrated search in the end
Timeline • Mar 31: Registration Deadline • Registered as adMIRer • 5 members from NUS-SIGIR • 56 teams registered in total • June 18: 1st Knockout Round • AT1+AT2 • 8 Teams qualified
Timeline • July 18: 2nd Knockout Round • VT1+VT2 • 7 Teams qualified • September 4: Qualifying Race • All four tasks with Integrated Search • Only 5 Teams would qualify • October 23: Grand Final • On-site evaluation
Audio system – general approach • Use MFCC features, which represent speech well • Use local alignment to align the two feature sequences (test audio and query) • Use the spectrogram to cut long audio into small segments for better matching • Short demo
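The local-alignment idea on this slide can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes a Smith-Waterman style recurrence over cosine similarities of feature vectors, and the names `cosine`, `local_alignment`, `match_threshold`, and `gap_penalty` are hypothetical. Real MFCC vectors would come from an audio feature extractor; short toy vectors stand in for them here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def local_alignment(query, test, match_threshold=0.8, gap_penalty=1.0):
    """Smith-Waterman style local alignment of two feature-vector sequences.

    A frame pair scores (cosine - match_threshold), so only pairs more similar
    than the threshold (an assumed value) add to the alignment score.
    Returns (best_score, (query_end, test_end)) with 1-based end positions.
    """
    rows, cols = len(query), len(test)
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            sim = cosine(query[i - 1], test[j - 1]) - match_threshold
            score[i][j] = max(
                0.0,                            # start a new local alignment
                score[i - 1][j - 1] + sim,      # align the two frames
                score[i - 1][j] - gap_penalty,  # skip a query frame
                score[i][j - 1] - gap_penalty,  # skip a test frame
            )
            if score[i][j] > best:
                best, best_end = score[i][j], (i, j)
    return best, best_end

# Toy example: the query occurs inside the test sequence at frames 2..4.
query = [[1, 0], [0, 1], [1, 1]]
test = [[0, 1], [1, 0], [0, 1], [1, 1], [-1, 0]]
best_score, best_end = local_alignment(query, test)
```

Because the score is clamped at zero, the alignment can start anywhere in the test audio, which is why a short query can be found inside a longer recording.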
Audio system – system overview • Audio path: query audio files and test audio files → audio feature extractor → query MFCC vectors and test MFCC vectors → query-test similarity matrix → alignment & matching • Text path: test audio files → speech recognizer → test text → Lucene indexing → index data; query text + index data → Lucene matching • Both paths → heuristic fusion → results
Audio system – Handle IPA • IPA query: "i n t r ^ s t r ei t" • Translate to CMU phonemes: IH N T R AH S T R EY T • INTEREST: IH N T R AH S T • RATE: R EY T • Query text: fed directly to the text module, and synthesized to an audio file for the audio module
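The IPA handling above can be illustrated with a small sketch. The mapping table and the two-word lexicon below contain only the symbols and pronunciations shown on the slide; a real system would need a full IPA-to-CMU table and a full pronouncing dictionary. The function names `ipa_to_cmu` and `segment_into_words`, and the fewest-words segmentation rule, are assumptions made for illustration.

```python
# Only the symbols that appear in the slide's example (a hypothetical subset).
IPA_TO_CMU = {"i": "IH", "n": "N", "t": "T", "r": "R", "^": "AH", "s": "S", "ei": "EY"}

# Pronunciations exactly as given on the slide.
LEXICON = {
    "INTEREST": ["IH", "N", "T", "R", "AH", "S", "T"],
    "RATE": ["R", "EY", "T"],
}

def ipa_to_cmu(ipa_query):
    """Translate a space-separated IPA query into a list of CMU phonemes."""
    return [IPA_TO_CMU[symbol] for symbol in ipa_query.split()]

def segment_into_words(phonemes, lexicon):
    """Split a phoneme sequence into dictionary words using dynamic programming.

    best[i] holds the shortest word list covering phonemes[:i], or None if
    that prefix cannot be covered. Returns None when no segmentation exists.
    """
    best = [None] * (len(phonemes) + 1)
    best[0] = []
    for i in range(len(phonemes)):
        if best[i] is None:
            continue
        for word, pron in lexicon.items():
            j = i + len(pron)
            if phonemes[i:j] == pron:
                if best[j] is None or len(best[i]) + 1 < len(best[j]):
                    best[j] = best[i] + [word]
    return best[-1]

phonemes = ipa_to_cmu("i n t r ^ s t r ei t")
words = segment_into_words(phonemes, LEXICON)
query_text = " ".join(words) if words else ""
```

The resulting `query_text` is what would be passed to the text module, while a synthesized audio version would go to the audio module.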
Audio system – overall performance • No complete statistics yet, but AT2 (query by example) reaches roughly 30-40% MAP and AT1 roughly 10% • Let's listen to a few queries …
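For readers unfamiliar with MAP (mean average precision), the metric can be computed as in the sketch below. This follows the standard textbook definition; the slides do not state the exact variant used by the competition, so treat it as an assumption. The function names are illustrative.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries with made-up segment identifiers.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```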
Video system – VT1 categories • 1. Crowd (>10 people) • 2. Building with sky as backdrop, clearly visible • 3. Mobile devices including handphone/PDA • 4. Flag • 5. Electronic chart, e.g. stock charts, airport departure chart • 6. TV chart overlay, including graphs, text, PowerPoint style • 7. Person using computer, both visible • 8. Track and field, sports • 9. Company trademark, including billboard, logo • 10. Badminton court • 11. Swimming pool, sports • 12. Closeup of hand, e.g. using mouse, writing, etc. • 13. Business meeting (> 2 people), mostly seated, table visible • 14. Natural scene, e.g. mountain, trees, sea, no people • 15. Food on dishes, plates • 16. Face closeup, occupying about 3/4 of screen, frontal or side • 17. Traffic scene, many cars, trucks, road visible • 18. Boat/ship, over sea, lake • 19. PC webpages, screen of PC visible • 20. Airplane
Video system - examples 16. Face closeup 9. Company trademark 2. Building with sky backdrop 3. Mobile devices
Video system – VT2 categories • 1. People entering/exiting door/car • 2. Talking face with introductory caption • 3. Fingers typing on a keyboard • 4. Inside a moving vehicle, looking outside • 5. Large camera movement, tracking an object, person, car, etc. • 6. Static or minute camera movement, people walking, legs visible • 7. Large camera movement, panning left/right, top/down of a scene • 8. Movie ending credits • 9. Woman monologue • 10. Sports celebratory hug
Video system – general approach • Test files → classifiers → classified category • Classified category + query category → category filtering → filtered test files • Filtered test files + query file → matching → matched test files
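The filter-then-match flow on this slide can be sketched as below. This is an illustrative outline only: the dictionary layout of a test item, the `search` function, and the use of histogram intersection as the similarity measure are all assumptions, since the slides do not specify the matching function.

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms (higher means more similar)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def search(query_features, query_category, test_items, similarity, top_k=10):
    """Keep only test items classified into the query's category, then rank them.

    Each test item is assumed to be a dict with 'id', 'category'
    (the classifier output), and 'features' (e.g. a color histogram).
    """
    candidates = [item for item in test_items if item["category"] == query_category]
    ranked = sorted(
        candidates,
        key=lambda item: similarity(query_features, item["features"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]

# Toy data: category numbers follow the VT1 list (4 = Flag, 2 = Building).
test_items = [
    {"id": "a", "category": 4, "features": [0.5, 0.5]},
    {"id": "b", "category": 4, "features": [0.9, 0.1]},
    {"id": "c", "category": 2, "features": [0.9, 0.1]},
]
results = search([0.9, 0.1], 4, test_items, histogram_intersection)
```

Filtering first keeps the matching step cheap, at the cost that a misclassified test file can never be retrieved.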
Video system - Training data size Development data statistics • Dev = 10% labelled data, Train = 90% labelled data • Size varies significantly across different categories
Video system – classifier training • Training key frames + categories are fed to four feature extractors • Color extractor → color histogram (HSV, RGB) → color classifier • Edge extractor → edge histogram → edge classifier • Face detector → number of faces, size, positions → face classifier • Layout extractor → segmentation info → layout classifier • Each classifier is trained as a multi-class SVM • Dev key frames → per-category recall of the color, edge, face, and layout classifiers • The recall values are used as weights
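As an illustration of the kind of feature the color extractor produces, the sketch below builds a normalized hue histogram from RGB pixels using Python's standard `colorsys` module. The bin count, the choice of hue only, and the function name `hue_histogram` are assumptions; the slides only say that HSV and RGB color histograms were used.

```python
import colorsys

def hue_histogram(pixels, n_bins=4):
    """Normalized hue histogram of a key frame.

    `pixels` is an iterable of (r, g, b) tuples with 0-255 channel values.
    Returns a list of n_bins fractions that sum to 1 (all zeros if no pixels).
    """
    counts = [0] * n_bins
    total = 0
    for r, g, b in pixels:
        hue, _sat, _val = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        counts[min(int(hue * n_bins), n_bins - 1)] += 1  # hue lies in [0, 1]
        total += 1
    return [c / total for c in counts] if total else [0.0] * n_bins

# Toy "key frame": two red pixels, one green, one blue.
frame = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 0, 0)]
features = hue_histogram(frame)
```

A vector like `features` would be one input row for the color classifier; the other extractors would produce their own vectors in the same spirit.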
Classifier recall per category • Used as weights when fusing the different classifiers • No error analysis or n-fold testing yet
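One way to read "recall used as weights" is a weighted vote, sketched below. This is an assumed interpretation, not a description of the actual merger: each classifier votes for its predicted category with a weight equal to its dev-set recall on that category. The function names and the toy numbers are hypothetical.

```python
def per_category_recall(true_labels, predicted_labels):
    """Recall of one classifier for each category seen in the dev labels."""
    total, correct = {}, {}
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] = total.get(truth, 0) + 1
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}

def fuse(predictions, recall):
    """Recall-weighted vote over classifier outputs.

    predictions: {classifier_name: predicted_category}
    recall:      {classifier_name: {category: dev-set recall}}
    """
    votes = {}
    for name, category in predictions.items():
        votes[category] = votes.get(category, 0.0) + recall[name].get(category, 0.0)
    return max(votes, key=votes.get)

# Toy dev-set recalls (made-up numbers for illustration only).
recall = {
    "color": {"Flag": 0.3},
    "edge": {"Airplane": 0.7},
    "face": {"Flag": 0.2},
    "layout": {"Airplane": 0.1},
}
predictions = {"color": "Flag", "edge": "Airplane", "face": "Flag", "layout": "Airplane"}
fused = fuse(predictions, recall)
dev_recall = per_category_recall(["Flag", "Flag", "Airplane"], ["Flag", "Crowd", "Airplane"])
```

With this scheme a classifier that is reliable for a category (here, the edge classifier on airplanes) can outvote several weaker ones.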
Video system – Category filtering & Matching • Test video → motion extractor → motion histogram; camera & object motion • Test key frames → color extractor, edge extractor, face detector, layout extractor → color histogram (HSV, RGB), edge histogram, number of faces/size/positions, segmentation info • Features → color, edge, face, and layout classifiers → classifier merger (weights from dev data) • Classifier merger output + heuristic category filtering + query category → category filtering → filtered video, filtered key frames • Query video/frames + filtered video/key frames → matching → results
Video system – motion 1 Camera: panning left Camera: panning up Object motion: static Object motion: moving
Video system – motion 2 • Check whether most motion vectors are ≈ 0 → static • Otherwise, filter out all small motion vectors • Categorize the remaining motion vectors into circular (direction) bins → histogram + main motion vector • If the main motion vector dominates → camera motion: panning left, right, up, or down • To detect zooming, find a focus block/point • Object motion is derived after removing camera motion
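The first four steps of this slide can be sketched as follows. All thresholds (`static_eps`, `static_frac`, `dominance`), the number of direction bins, and the function name are assumptions chosen for illustration; zoom detection and object-motion extraction are not covered. The sketch reports the dominant direction of the motion vectors in degrees, and mapping that direction to a pan label (left, right, up, down) depends on the sign convention of the motion estimator.

```python
import math
from collections import Counter

def classify_motion(vectors, static_eps=0.5, static_frac=0.8, n_bins=8, dominance=0.6):
    """Classify block motion vectors (dx, dy) of one frame pair.

    Returns ("static", None), ("camera", direction_in_degrees), or ("mixed", None).
    """
    if not vectors:
        return ("static", None)
    magnitudes = [math.hypot(dx, dy) for dx, dy in vectors]
    # Step 1: if most vectors are near zero, the shot is static.
    n_small = sum(1 for m in magnitudes if m < static_eps)
    if n_small / len(vectors) >= static_frac:
        return ("static", None)
    # Step 2: drop small vectors.
    kept = [v for v, m in zip(vectors, magnitudes) if m >= static_eps]
    # Step 3: histogram of directions over circular bins centered on 0, 45, 90, ... degrees.
    bin_width = 2 * math.pi / n_bins
    hist = Counter()
    for dx, dy in kept:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int((angle + bin_width / 2) // bin_width) % n_bins] += 1
    # Step 4: a dominant bin suggests global (camera) motion.
    main_bin, count = hist.most_common(1)[0]
    if count / len(kept) >= dominance:
        return ("camera", main_bin * 360 // n_bins)
    return ("mixed", None)

still = classify_motion([(0, 0)] * 9 + [(5, 0)])
pan = classify_motion([(-3, 0.1)] * 8 + [(0, 3)] * 2)
mixed = classify_motion([(3, 0), (0, 3), (-3, 0), (0, -3)])
```

In the `pan` example, eight of ten vectors point toward 180 degrees, so the frame pair is treated as camera motion; the two remaining vectors are candidates for object motion once the camera motion is removed.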
Conclusion • We built a fully functional system in a short time, in an ad hoc manner • There is plenty of room for performance improvement and detailed analysis.
Q & A? • Thank you !!!