The ViBE Video Database System: An Update and Further Studies C. M. Taskiran, C. A. Bouman, and E. J. Delp Purdue University School of Electrical and Computer Engineering Video and Image Processing Laboratory (VIPER) Electronic Imaging System Laboratory (EISL) West Lafayette, Indiana
Outline • The content-based video access problem • The ViBE system • Temporal Segmentation and the Generalized Trace • Results • Conclusions. The main result of this paper is an extended experimental study of the performance of our shot transition detection techniques.
The Problem • How does one manage, query, and browse a large database of digital video sequences? • Problem size: one hour of MPEG-1 video is 675 MB and 108,000 frames • Goal: browse by content (how do you find something?) • Applications include digital libraries • Need for compressed-domain processing
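As a quick sanity check of the problem-size figures above, a small back-of-the-envelope calculation; the 30 frames/s frame rate is our assumption, and the 1.5 Mb/s rate matches the data set described later in the talk:

    # Back-of-the-envelope check of the problem-size numbers, assuming
    # 30 frames/s and a 1.5 Mb/s MPEG-1 stream (assumed values).
    fps = 30
    bitrate_megabits_per_sec = 1.5
    seconds_per_hour = 3600

    frames_per_hour = fps * seconds_per_hour                              # 108,000 frames
    megabytes_per_hour = bitrate_megabits_per_sec * seconds_per_hour / 8  # 675.0 MB
    print(frames_per_hour, megabytes_per_hour)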
ViBE: A New Paradigm for Video Database Browsing and Search • ViBE has four components • shot transition detection and identification • hierarchical shot representation • pseudo-semantic shot labeling • active browsing based on relevance feedback • ViBE provides an extensible framework that will scale as the video data grows in size and applications increase in complexity
ViBE System Components [system diagram]: video sequences are processed by detection and classification of shot boundaries, the shot tree representation, and pseudo-semantic labeling of shots; the results drive the active browsing environment that the user interacts with.
Current Paradigm in Temporal Segmentation of Video [block diagram]: compressed video sequence → extraction of frame-by-frame dissimilarity features (dissimilarity measure plotted against frame number) → feature processing → list of scene changes
Tree Representation of Shots • A single keyframe is not adequate for shots with large variation • Agglomerative clustering is used to build a tree representation for each shot • Shot similarity is obtained using tree matching algorithms
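A minimal sketch of the shot-tree idea, assuming each frame of a shot is summarized by a small feature vector (for example, a color histogram) and using SciPy's agglomerative clustering; the feature choice and the Ward linkage are illustrative assumptions, not the authors' implementation:

    # Sketch: build a binary tree over the frames of one shot by
    # agglomerative (hierarchical) clustering of per-frame feature vectors.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    def build_shot_tree(frame_features):
        """frame_features: (num_frames, feature_dim) array, one row per frame."""
        merges = linkage(frame_features, method="ward")  # agglomerative merge steps
        return to_tree(merges)                           # root of the binary cluster tree

    # Example: 20 frames described by 16-bin histograms. The children of the
    # root give a two-node summary; deeper levels give progressively finer ones.
    feats = np.random.rand(20, 16)
    root = build_shot_tree(feats)
    print(root.get_count(), root.left.get_count(), root.right.get_count())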
Pseudo-Semantic Labeling • A bridge between the high-level description and the low-level representation of a shot • Uses semantic classes that can be derived from mid- and low-level features and improve the description of the image content • Examples: head and shoulders, indoor/outdoor, high action, man-made/natural • For each shot we extract a vector indicating the confidence of each pseudo-semantic label
Head and Shoulders Feature Label [processing chain: color (UV) segmentation → EM model estimation → color (YUV) segmentation, using general models for skin and background] • From a shot-based point of view, we want to indicate whether there is a talking head in a shot • The first goal is to extract skin-like regions from each frame • Using motion and texture information, each region along the shot is labeled as a face candidate or not
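A minimal sketch of the skin-region extraction step, assuming pixels are clustered in UV chrominance space with a two-component Gaussian mixture fit by EM; the scikit-learn API, the number of components, and the rough skin-tone prior are illustrative assumptions, not the authors' models:

    # Sketch: label skin-like pixels in one frame by fitting a two-component
    # Gaussian mixture (skin vs. background) to UV chrominance values with EM.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def skin_mask(uv_pixels, skin_prior_uv=(110.0, 150.0)):
        """uv_pixels: (num_pixels, 2) array of U,V values for one frame.
        skin_prior_uv is an illustrative rough skin-tone location in UV space."""
        gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
        labels = gmm.fit_predict(uv_pixels)
        # Call the component whose mean lies closest to the prior the skin model.
        dist = np.linalg.norm(gmm.means_ - np.asarray(skin_prior_uv), axis=1)
        skin_component = int(np.argmin(dist))
        return labels == skin_component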
Browsing with a Similarity Pyramid • Organize database in a pyramid structure • Top level of pyramid represents global variations • Bottom level of pyramid represents individual images • Spatial arrangement makes most similar images neighbors • Embedded hierarchical tree structure
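A minimal sketch of how pyramid levels can be derived from the bottom-level grid; keeping the top-left member of each 2x2 block is a simplification (a real system would choose a representative of the underlying cluster), and the grid layout itself is assumed to come from the hierarchical clustering:

    # Sketch: derive the levels of a similarity pyramid from the bottom-level
    # grid of image ids, keeping one representative per 2x2 block at each step.
    import numpy as np

    def build_pyramid(bottom):
        """bottom: (N, N) grid of image ids, N a power of two, arranged so that
        similar images are neighbors (layout assumed to come from clustering)."""
        levels = [bottom]
        while levels[-1].shape[0] > 1:
            # keep the top-left member of each 2x2 block (a real system would
            # instead choose a representative image of the underlying cluster)
            levels.append(levels[-1][::2, ::2].copy())
        return levels[::-1]   # coarsest level first

    levels = build_pyramid(np.arange(64).reshape(8, 8))
    print([lv.shape for lv in levels])   # [(1, 1), (2, 2), (4, 4), (8, 8)]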
Navigation via the Similarity Pyramid • The user navigates the database by zooming in and out between pyramid levels
Browser Interface [screenshot: control panel, similarity pyramid view, relevance set]
Common Approach in Temporal Segmentation of Video [block diagram]: compressed video sequence → extraction of DC frames → extraction of a frame-by-frame dissimilarity feature (dissimilarity measure plotted against frame number) → processing → shot boundary locations
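A minimal sketch of this common single-feature approach, assuming the per-frame feature is a luminance histogram of the DC frame and a shot boundary is declared wherever the frame-to-frame dissimilarity exceeds a fixed threshold; both the feature and the threshold value are illustrative assumptions:

    # Sketch: threshold a frame-to-frame histogram dissimilarity computed
    # from luminance DC frames.
    import numpy as np

    def histogram_dissimilarity(prev_frame, cur_frame, bins=64):
        h1, _ = np.histogram(prev_frame, bins=bins, range=(0, 255))
        h2, _ = np.histogram(cur_frame, bins=bins, range=(0, 255))
        h1 = h1 / h1.sum()
        h2 = h2 / h2.sum()
        return 1.0 - np.minimum(h1, h2).sum()   # 1 - histogram intersection

    def threshold_cuts(dc_frames, threshold=0.35):
        """dc_frames: list of 2-D luminance DC frames; returns indices of cuts."""
        return [n for n in range(1, len(dc_frames))
                if histogram_dissimilarity(dc_frames[n - 1], dc_frames[n]) > threshold]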
Problems With This Approach • What type of feature(s) should be used? • How do we choose threshold(s) robustly? • Classes (cut vs. dissolve, etc.) may not be separable using one simple feature Using a multidimensional feature vector may alleviate these problems.
The Generalized Trace (GT) • The GT is a feature vector extracted from each DC frame in the compressed video sequence • DC frames for P and B frames are estimated using motion vector information • It includes features which are readily available from an MPEG stream with minimal computation • The use of the GT to detect cuts was described in Taskiran and Delp, ICASSP '98 • A binary regression tree classifier is used (Gelfand, Ravishankar, and Delp, PAMI, Feb. 1991)
List of Features • The GT feature vector consists of: • g1 - g3 : Y, U, and V histogram intersections • g4 - g6 : Y, U, and V frame standard deviations • g7 : Number of intracoded MBs • g8 : Number of MBs with forward MVs • g9 : Number of MBs with backward MVs (g7 - g9 are not applicable to all frame types) • g10 - g12 : Frame type binary flags
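A minimal sketch of assembling a GT-style vector for one DC frame, assuming the DC planes, macroblock counts, and frame type are already available from a compressed-domain MPEG parser; the function and argument names are illustrative placeholders, not a real MPEG library API:

    # Sketch: assemble a 12-dimensional Generalized Trace style vector for one
    # DC frame. Inputs are assumed to come from an MPEG parser; the argument
    # names are placeholders, not a real decoder API.
    import numpy as np

    def histogram_intersection(h_prev, h_cur):
        h_prev = h_prev / h_prev.sum()
        h_cur = h_cur / h_cur.sum()
        return float(np.minimum(h_prev, h_cur).sum())

    def generalized_trace(prev_dc, cur_dc, mb_counts, frame_type, bins=64):
        """prev_dc, cur_dc: dicts mapping 'Y', 'U', 'V' to DC-image planes.
        mb_counts: (num_intra, num_forward_mv, num_backward_mv) for this frame.
        frame_type: one of 'I', 'P', 'B'."""
        g = []
        for c in ("Y", "U", "V"):          # g1-g3: Y, U, V histogram intersections
            hp, _ = np.histogram(prev_dc[c], bins=bins, range=(0, 255))
            hc, _ = np.histogram(cur_dc[c], bins=bins, range=(0, 255))
            g.append(histogram_intersection(hp, hc))
        for c in ("Y", "U", "V"):          # g4-g6: Y, U, V frame standard deviations
            g.append(float(np.std(cur_dc[c])))
        g.extend(mb_counts)                # g7-g9: macroblock counts
        g.extend(frame_type == t for t in "IPB")   # g10-g12: frame type flags
        return np.asarray(g, dtype=float)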
Detection of Cuts [block diagram] • Training: ground truth sequences → GT extraction and windowing → tree training → regression tree • Detection: sequence to be processed → GT extraction and windowing → regression tree → postprocessing → cut locations
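A minimal sketch of the training and detection flow, using scikit-learn's CART regression tree as a stand-in for the binary regression tree of Gelfand et al.; the windowing helper, tree parameters, and output threshold are illustrative assumptions, not the authors' exact procedure (the 3-frame window and 0.35 threshold mirror the experiment settings later in the talk):

    # Sketch: train a regression tree on windowed GT vectors and threshold its
    # output to obtain cut locations.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def window_features(gt, half_width=1):
        """Stack each GT vector with its neighbors (window of 2*half_width+1 frames)."""
        pad = np.pad(gt, ((half_width, half_width), (0, 0)), mode="edge")
        return np.hstack([pad[i:i + len(gt)] for i in range(2 * half_width + 1)])

    def train_cut_tree(gt_train, cut_labels):
        """gt_train: (num_frames, 12) GT vectors; cut_labels: 1.0 at cuts, else 0.0."""
        tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20)
        tree.fit(window_features(gt_train), cut_labels)
        return tree

    def detect_cuts(tree, gt_test, threshold=0.35):
        scores = tree.predict(window_features(gt_test))
        return np.flatnonzero(scores > threshold)   # postprocessing: threshold the output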
Detection of Gradual Transitions [block diagram] • Sequence to be processed → GT extraction and windowing → first tree → windowing → second tree → postprocessing → gradual transition locations
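A minimal sketch of the cascaded two-tree idea, reusing the hypothetical window_features helper from the cut-detection sketch above; how the two trees are trained and how their outputs are postprocessed are assumptions, not the authors' procedure:

    # Sketch: cascade of two trees for gradual transitions. The first tree scores
    # each windowed GT vector; its outputs are windowed again and fed to a second
    # tree, whose output is thresholded. Reuses window_features from the sketch above.
    import numpy as np

    def detect_gradual(first_tree, second_tree, gt, half_width=1, threshold=0.35):
        stage1 = first_tree.predict(window_features(gt, half_width))
        stage1 = stage1.reshape(-1, 1)                       # per-frame scores as a column
        stage2 = second_tree.predict(window_features(stage1, half_width))
        return np.flatnonzero(stage2 > threshold)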
The Data Set • Digitized at 1.5 Mb/s in CIF format (352x240) • Contains more than 4 hours of ground-truthed video, drawn from a larger collection of about 35 hours • 5 different program genres • 10-minute clips were recorded at random points during the program, and commercials were edited out • A single airing of a program is never used to obtain more than one clip (except for movies)
Data Set Statistics

                #frames   #cuts   #dissolves   #fades   #others
  soap opera      67582     337            2        0         0
  talk show      107150     331          108        1         6
  sports          78051     173           45        0        29
  news            58219     297            7        0         6
  movies          54160     262           15        6         1
  cspan           90269      95           19        0         0
  TOTAL          455431    1495          196        7        42
Experiments • Use a cross-validation procedure to determine performance for each genre:

  for each genre G in {soap, talk, sports, news, movies, cspan}
      for i = 1 to 4
          randomly choose sequences S1 and S2, both not in G
          train a regression tree using S1 and S2
          process all sequences in G using this tree
          average performance over G
      average the four sets of values to find the performance for G

• Window size = 3 frames; threshold = 0.35
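A minimal sketch of this leave-genre-out cross-validation loop, reusing the hypothetical train_cut_tree and detect_cuts helpers from the cut-detection sketch; the per-sequence data layout and the detection_rate scoring function are illustrative assumptions:

    # Sketch of the loop above: for each genre, train four trees on two randomly
    # chosen out-of-genre sequences, evaluate on every in-genre sequence, and
    # average. Reuses the hypothetical train_cut_tree / detect_cuts helpers.
    import random
    import numpy as np

    def detection_rate(detected, labels):
        """Fraction of ground-truth cut frames present in the detected set."""
        true_cuts = set(np.flatnonzero(labels))
        return len(true_cuts & set(detected)) / max(len(true_cuts), 1)

    def cross_validate(sequences, genres, num_folds=4):
        """sequences: name -> (gt_features, cut_labels); genres: name -> genre."""
        results = {}
        for genre in set(genres.values()):
            in_genre = [n for n in sequences if genres[n] == genre]
            out_genre = [n for n in sequences if genres[n] != genre]
            fold_scores = []
            for _ in range(num_folds):
                s1, s2 = random.sample(out_genre, 2)     # training sequences not in G
                X = np.vstack([sequences[s1][0], sequences[s2][0]])
                y = np.concatenate([sequences[s1][1], sequences[s2][1]])
                tree = train_cut_tree(X, y)
                per_seq = [detection_rate(detect_cuts(tree, sequences[n][0]),
                                          sequences[n][1]) for n in in_genre]
                fold_scores.append(sum(per_seq) / len(per_seq))
            results[genre] = sum(fold_scores) / num_folds
        return results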
Results [Table: shot boundary detection rate (Detect), false alarm count (FA), and MC count for the tree classifier, similarity thresholding, and sliding window methods, reported for each genre (soap, talk, sports, news, movies, cspan)]
Conclusion • The two-stage tree classifier provides a good framework for the detection of dissolves and fades • Pseudo-semantic labeling is a fast, feature-based approach to content description • Relevance feedback coupled with the similarity pyramid forms a powerful active browsing tool • The performance of the regression tree classifier is very robust to changes in program content and is not affected much by the choice of the training sequences