Video Search Engines and Content-Based Retrieval

Video Search Engines and Content-Based Retrieval Steven C.H. Hoi CUHK, CSE 18-Sept, 2006

Outline • Video Search Engines • Content-Based Video Retrieval

Video Search Engines: A survey of the state of the art

Introduction • Who is building video search engines? • Top text search engines: 5.6 billion searches (07/2006)

Introduction • Google

Introduction • Yahoo

Introduction • MSN/Live Search

Introduction • YouTube

Business Models • Web Advertising • Site Volume, or keyword customized • Video Ads • Disable controls (MSN) • Subscription • MLB, Real • Download to own • iTunes, Movie • Rental • Limited time, number of plays • Other • Desktop Media Search • Media player (jukebox) • Media Monitoring • Media Asset Management

Types of video Sites • Content Originators • Major Broadcasters • Affiliates, Local News • Major League Baseball • Syndication, Aggregation, “Internet Broadcasters” • Rental, purchase, advertising, subscription • MSN, Google, iTunes • ROO Media, FeedRoom • Movie and Video Download • Share portals • Consumer content, blogs • YouTube, Putfile, Vsocial, Google, Akimbo • Traditional Search Engines (Crawl) / “RSS” • Yahoo, Blinkx • Other • Public (Internet Archive) • Media Monitoring, asset management systems

Video Search Challenges

Current Video Search Engines • Metadata • File type and context • Media file attributes • Size, length • Structured global metadata • RSS content description • Content • Content Indexing • Search within a video • Full text of dialog • Image or video content • Automated Content Indexing

Current Video Search Engines • Content Search Engines • Keyword search over transcripts produced by speech recognition

Content-Based Video Search Engine • Architecture

Content-Based Video Search Engine • Video Processing

Content-Based Video Search Engine • Research Challenges • Speech Recognition • Shot Boundary Detection • Video Story Segmentation • Concept Detection • Multi-modal Fusion for Ranking • Text/ASR, Audio/Speech, Visual, etc.

Content-Based Retrieval • Our Research Problem • Learning to rank video shots for automatic content-based search tasks • Challenges • Multi-Modal Information Fusion • Small Sample Learning (a few positive and no negative examples) • Learning on large-scale datasets

Multi-modal and Multi-scale Ranking Framework • Main Ideas • Representing video structures by graphs • Using semi-supervised learning to address the problem of learning from few labeled samples • Fusing multi-modal information by harmonic learning over graphs • Multi-scale ranking to stay efficient on large-scale datasets

Multi-modal and Multi-scale Ranking Framework • Graph-based Modeling • Node types: Shot, Story, Text • Example text: "chinese state president hu jintao arrived at the located sao paulo suburbs..." • "place of innai natural china in this country policies and regulations..." • "hu jintao during his speech said in argentina are latin america in the region ..."

Multi-modal and Multi-scale Ranking Framework • Semi-Supervised Learning on Graph • Find an optimal real-valued function g: V → R on the graph G • g minimizes a quadratic energy function • Using the Gaussian field and harmonic property from spectral graph theory (J. Zhu’s ICML’03), a harmonic function g can be found
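The equations on this slide did not survive extraction. As a hedged reference, the standard form from the cited ICML'03 paper on Gaussian fields and harmonic functions is sketched below; the exact kernel and notation used in the talk are assumptions here (w_ij is the edge weight between nodes i and j, x_i is a feature vector, and sigma is a bandwidth).

```latex
% Quadratic energy over the graph, with Gaussian edge weights (assumed form)
E(g) = \frac{1}{2} \sum_{i,j} w_{ij} \bigl( g(i) - g(j) \bigr)^2,
\qquad
w_{ij} = \exp\!\Bigl( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \Bigr)

% Harmonic property: each unlabeled node takes the weighted average of its neighbours
g(j) = \frac{1}{d_j} \sum_{i} w_{ij}\, g(i)
\quad \text{for every unlabeled node } j,
\qquad
d_j = \sum_{i} w_{ij}
```

The minimizer of E(g), with g fixed to the given labels on labeled nodes, is exactly the function that satisfies the averaging condition on all unlabeled nodes.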

Multi-modal and Multi-scale Ranking Framework • Semi-Supervised Learning on Graph • The solution of the harmonic function g can be expressed in matrix operations
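In the cited ICML'03 formulation, the matrix form splits the weight matrix W and degree matrix D into labeled (l) and unlabeled (u) blocks, and the scores on unlabeled nodes are g_u = (D_uu - W_uu)^-1 W_ul g_l. The following is a minimal pure-Python sketch of that closed form, not the talk's implementation; the function names, the dense-matrix representation, and the tiny Gaussian-elimination solver are all illustrative assumptions.

```python
import math


def gaussian_weights(points, sigma=1.0):
    """Dense affinity matrix w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def harmonic_scores(w, labeled):
    """Return harmonic scores for all nodes.

    w       -- symmetric weight matrix with zero diagonal
    labeled -- dict mapping node index to its known score g_l
    """
    n = len(w)
    unl = [i for i in range(n) if i not in labeled]
    # Left-hand side: D_uu - W_uu (degrees are taken over the whole graph).
    a = [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in unl] for i in unl]
    # Right-hand side: W_ul g_l.
    b = [sum(w[i][j] * labeled[j] for j in labeled) for i in unl]
    g_u = solve(a, b)
    scores = dict(labeled)
    scores.update(zip(unl, g_u))
    return [scores[i] for i in range(n)]
```

For example, on a three-node chain with the two end nodes labeled 1 and 0, `harmonic_scores` assigns the middle node the average of its neighbours. The dense solve is cubic in the number of unlabeled nodes, so it only suits small graphs.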

Multi-modal and Multi-scale Ranking Framework • Multi-Modal Fusion over Graph • To combine text information into SSL on the visual modality, we treat the text inputs as nodes attached to the visual graph (figure labels: Visual: g, Text: f)
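The fusion equation itself is not preserved on the slide. One plausible reading, and it is only an assumption, is the "attached node" construction for external classifiers in the cited ICML'03 paper: each unlabeled visual node gets an extra labeled neighbour carrying its text score f, reached with probability eta, and the harmonic condition becomes g(i) = (1 - eta) * (weighted neighbour average) + eta * f(i). The sketch below iterates that condition to its fixed point; `fuse_text_into_graph`, `eta`, and `iters` are hypothetical names and parameters.

```python
def fuse_text_into_graph(w, labeled, text_scores, eta=0.3, iters=200):
    """Harmonic scores on a visual graph with one attached text node per shot.

    w           -- symmetric visual weight matrix with zero diagonal
    labeled     -- dict mapping node index to a fixed score (query examples)
    text_scores -- per-node text relevance f, one value per node
    eta         -- weight given to the attached text node (assumed, 0 <= eta <= 1)
    """
    n = len(w)
    # Start unlabeled nodes at their text score; labeled nodes stay fixed.
    g = [labeled.get(i, text_scores[i]) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labeled:
                continue
            degree = sum(w[i])
            neighbour_avg = (
                sum(w[i][j] * g[j] for j in range(n)) / degree if degree else 0.0
            )
            # Blend the visual neighbourhood with the attached text node.
            g[i] = (1.0 - eta) * neighbour_avg + eta * text_scores[i]
    return g
```

With `eta = 0` this reduces to plain harmonic averaging over the visual graph, and with `eta = 1` every unlabeled shot simply keeps its text score, which makes `eta` a convenient knob for how much the text modality is trusted.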

Multi-modal and Multi-scale Ranking Framework • Challenges • Number of examples in database: N is large • For example: • TRECVID 2005: Rep. Key-Frames N = 45,765 • TRECVID 2006: Rep. Key-Frames N = 79,487 • How can semi-supervised learning be done at this scale?

Multi-modal and Multi-scale Ranking Framework • Multi-Scale Ranking • Learning ranking through multi-scale reranking • Each stage is associated with different computational costs • In our solution, four ranking stages include: • Ranking by Text Retrieval using Language Models • Re-ranking by NN fusing Text and Visual • Re-ranking by SVM fusing Text and Visual • Re-ranking by multi-modal Semi-supervised Learning
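The staged reranking idea can be sketched as a generic cascade: each stage scores only the candidates that survived the previous, cheaper stage and keeps a shrinking shortlist. This is a minimal illustration under assumptions, not the system described in the talk; `cascade_rank`, the `(scorer, keep)` stage format, and the cutoffs are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
# A scorer maps (query, candidate pool) to one relevance score per candidate.
Scorer = Callable[[str, Sequence[T]], List[float]]


def cascade_rank(
    query: str,
    candidates: Sequence[T],
    stages: Sequence[Tuple[Scorer, int]],
    top_k: int,
) -> List[T]:
    """Rerank candidates through stages ordered from cheapest to most expensive.

    Each stage is (scorer, keep): the scorer ranks the current pool and only
    the best `keep` items are passed on to the next stage.
    """
    pool = list(candidates)
    for scorer, keep in stages:
        scores = scorer(query, pool)
        order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
        pool = [pool[i] for i in order[:keep]]
    return pool[:top_k]
```

In the four-stage setup listed above, the stages would be text retrieval, nearest-neighbour fusion, SVM fusion, and semi-supervised ranking, with each `keep` smaller than the last so that the most expensive learner only ever sees a short list.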

[Figure: multi-scale ranking pipeline] • Video Processing: Raw Video Clips / Streams → Video Stories and Video Shots, via Text Processing and Image Processing • Multi-scale Ranking with Multi-modal Fusion: User’s Query → Text → Top M related Stories → Top N1 related Shots → Text + Visual NN → Top N2 related Shots → SVM/KLR Supervised Ranking → Top N3 related Shots → SSR Semi-Supervised Ranking → Top N4 related Shots → return top K shots

Benchmark Evaluations • Dataset • TRECVID 2005 • Test: 140 video clips, 45,765 rep. key frames • 24 queries • A query example: <videoTopic num="0152"> <textDescription text="Find shots of Hu Jintao, president of the People's Republic of China" /> </videoTopic>

Benchmark Evaluations • Text-only Retrieval • No Pseudo-Relevance Feedback (No-PRF) • With Pseudo-Relevance Feedback (PRF) • Language Models • TF-IDF • Okapi • KL-JM • KL-DIR • KL-ABS
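The slide does not expand the model names. Assuming KL-JM, KL-DIR, and KL-ABS denote KL-divergence retrieval with Jelinek-Mercer, Dirichlet-prior, and absolute-discounting smoothing (a common naming convention, but an assumption here), the three smoothed document models can be sketched as follows. Scoring by query log-likelihood, as done below, gives the same ranking as KL divergence when the query model is the plain maximum-likelihood estimate. All function names and parameter values are illustrative.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model p(w | C) over the whole collection."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def smoothed_prob(term, doc_counts, doc_len, p_coll, method, param):
    """Smoothed p(term | document); assumes a non-empty document."""
    c = doc_counts.get(term, 0)
    pc = p_coll.get(term, 0.0)
    if method == "jm":   # Jelinek-Mercer: param is the interpolation weight lambda
        return (1.0 - param) * c / doc_len + param * pc
    if method == "dir":  # Dirichlet prior: param is the pseudo-count mu
        return (c + param * pc) / (doc_len + param)
    if method == "abs":  # Absolute discounting: param is the discount delta
        unique = len(doc_counts)
        return max(c - param, 0.0) / doc_len + param * unique / doc_len * pc
    raise ValueError(f"unknown smoothing method: {method}")


def score(query, doc, p_coll, method="dir", param=2000.0):
    """Query log-likelihood of a tokenized document under the smoothed model."""
    counts = Counter(doc)
    total = 0.0
    for term in query:
        p = smoothed_prob(term, counts, len(doc), p_coll, method, param)
        if p <= 0.0:
            return float("-inf")  # term unseen in the whole collection
        total += math.log(p)
    return total
```

Documents here are plain lists of tokens, such as the words of a story's speech transcript; a higher score means a better match to the query.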

Benchmark Evaluations • Visual Features • Color • Grid Color Moment • 3×3 grid, 81 dimensions • Edge • Edge Direction Histogram • 36 bins + 1, 37 dimensions • Texture • Gabor Moments • 5 × 8 = 40, 3 moments, 120 dimensions • 238 dimensions in total • [Figure: COREL benchmark photos]
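As an illustration of how the colour feature reaches 81 dimensions, the sketch below assumes 9 grid cells × 3 colour channels × 3 moments (mean, standard deviation, and the cube root of the third central moment). The slide gives only the grid size and the total, so the choice of moments and colour space is an assumption, and `grid_color_moments` is a hypothetical helper operating on a nested list of pixel tuples.

```python
import math


def color_moments(pixels):
    """First three colour moments per channel for a list of 3-channel pixels."""
    n = len(pixels)
    feats = []
    for ch in range(3):
        vals = [p[ch] for p in pixels]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        third = sum((v - mean) ** 3 for v in vals) / n
        # Signed cube root keeps the skewness term in the same units as the mean.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        feats += [mean, math.sqrt(var), skew]
    return feats


def grid_color_moments(image, grid=3):
    """Concatenate colour moments over a grid x grid partition of the image.

    image -- list of rows, each a list of (c1, c2, c3) tuples; the image is
             assumed to be at least `grid` pixels in each direction.
    """
    h, w = len(image), len(image[0])
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * h // grid, (gy + 1) * h // grid)
            cols = range(gx * w // grid, (gx + 1) * w // grid)
            cell = [image[y][x] for y in rows for x in cols]
            feats.extend(color_moments(cell))
    return feats
```

Together with the 37-dimensional edge histogram and the 120-dimensional Gabor texture feature, this gives the 238-dimensional visual descriptor quoted above.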

Benchmark Evaluations • Multi-modal Retrieval (Text + Visual) • Text-only retrieval • Text + NN (Text + Visual) • Text + SVM (Text + Visual) • MMMS (Text + Visual)

Benchmark Evaluations • Evaluation Results • Average performance on the TRECVID 2005 dataset

Benchmark Evaluations • Comparison with other approaches • Average performance over 24 queries

Related Work • IBM Solution • SVM + NN + Multiple Instance Learning • Columbia solution • Information-Theoretical Clustering Approach • CMU Solution • Query-Class Dependent Weighting Ranking

Conclusion • A tutorial on video search engines • Research contributions • A unified framework of Multi-Modal and Multi-Scale Ranking for video retrieval • Graph-based Modeling of video structures • Semi-Supervised Learning for Multimodal Ranking • Making SSL practical for large-scale problems • Promising empirical results…

Future Work • Research is in progress, with tough challenges ahead… • Any suggestions or comments?
