
The ViBE Video Database System: An Update and Further Studies

Explore how ViBE transforms video access with shot transition detection, tree representation, semantic labeling, and active browsing. Discover the framework for scalable video data management.



Presentation Transcript


  1. The ViBE Video Database System: An Update and Further Studies C. M. Taskiran, C. A. Bouman, and E. J. Delp Purdue University School of Electrical and Computer Engineering Video and Image Processing Laboratory (VIPER) Electronic Imaging System Laboratory (EISL) West Lafayette, Indiana

  2. Outline • The content-based video access problem • The ViBE system • Temporal segmentation and the Generalized Trace • Results • Conclusions The main result of this paper is an extended experimental study of the performance of our shot transition detection techniques.

  3. The Problem • How does one manage, query, and browse a large database of digital video sequences? • Problem size • One hour of MPEG-1 is 675 MB and 108,000 frames • Goal: browse by content (how do you find something?) • Applications include digital libraries • Need for compressed-domain processing
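The sizing figures on this slide follow directly from MPEG-1's nominal parameters; a quick arithmetic check, assuming the standard 1.5 Mb/s bit rate and 30 fps frame rate:

```python
# Back-of-the-envelope check of the slide's sizing figures, assuming
# MPEG-1's nominal 1.5 Mb/s bit rate and a 30 fps frame rate.
BIT_RATE_MBPS = 1.5      # megabits per second (MPEG-1 constrained parameters)
FPS = 30                 # frames per second
SECONDS_PER_HOUR = 3600

megabytes_per_hour = BIT_RATE_MBPS * SECONDS_PER_HOUR / 8   # bits -> bytes
frames_per_hour = FPS * SECONDS_PER_HOUR

print(megabytes_per_hour)  # 675.0 MB
print(frames_per_hour)     # 108000 frames
```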

  4. ViBE: A New Paradigm for Video Database Browsing and Search • ViBE has four components • shot transition detection and identification • hierarchical shot representation • pseudo-semantic shot labeling • active browsing based on relevance feedback • ViBE provides an extensible framework that will scale as the video data grows in size and applications increase in complexity

  5. ViBE System Components (block diagram): video sequences → detection and classification of shot boundaries → shot tree representation and pseudo-semantic labeling of shots → active browsing environment ↔ user

  6. Current Paradigm in Temporal Segmentation of Video (block diagram): compressed video sequence → frame-by-frame dissimilarity features → feature processing → list of scene changes; the dissimilarity measure is plotted against frame number

  7. Tree Representation of Shots • Single keyframe is not adequate for shots with large variation • Agglomerative clustering is used to build tree representation for shots • Shot similarity is obtained using tree matching algorithms
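The merge process behind the shot tree can be sketched as plain bottom-up agglomerative clustering. This is only an illustration of the clustering step: the real system uses richer frame features and tree-matching for shot similarity, and the Euclidean centroid distance here is an assumption.

```python
# Minimal agglomerative-clustering sketch for building a shot tree from
# per-frame feature vectors: repeatedly merge the two closest clusters
# until one tree remains. Distances are plain Euclidean (an assumption).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def build_shot_tree(frames):
    """frames: list of feature vectors; returns a nested-tuple dendrogram."""
    # Each cluster is (subtree, centroid, size); leaves are frame indices.
    clusters = [(i, list(f), 1) for i, f in enumerate(frames)]
    while len(clusters) > 1:
        # Find the closest pair of clusters by centroid distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ti, ci, ni), (tj, cj, nj) = clusters[i], clusters[j]
        centroid = [(x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj)]
        merged = ((ti, tj), centroid, ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

tree = build_shot_tree([[0.0], [0.1], [5.0], [5.2]])
print(tree)  # similar frames merge first: ((0, 1), (2, 3))
```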

  8. Pseudo-Semantic Labeling • Bridge between high-level description and low-level representation of a shot • Uses semantic classes which can be derived from mid- and low-level features and improve the description of the image content • Examples: • head and shoulders • indoor/outdoor • high action • man-made/natural For each shot we will extract a vector indicating the confidence of each pseudo-semantic label.
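One simple way to hold the per-shot confidence vector described above is a fixed label order with one confidence in [0, 1] per pseudo-semantic class. The label names follow the slide; the confidence values below are made up for illustration.

```python
# Sketch of a per-shot pseudo-semantic confidence vector. Label names
# follow the slide; the numeric confidences are illustrative only.

LABELS = ("head_and_shoulders", "indoor_outdoor", "high_action", "man_made_natural")

def label_vector(confidences):
    """Map one confidence in [0, 1] to each pseudo-semantic label."""
    assert len(confidences) == len(LABELS)
    assert all(0.0 <= c <= 1.0 for c in confidences)
    return dict(zip(LABELS, confidences))

shot = label_vector([0.92, 0.80, 0.05, 0.30])
print(shot["head_and_shoulders"])  # 0.92
```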

  9. Head and Shoulders Feature Label (processing pipeline: color (YUV) segmentation → EM model estimation → color (UV) segmentation, using general models for skin and background) • From a shot-based point of view, we want to indicate if there is a talking head in a shot • The first goal is to extract skin-like regions from each frame • Using motion and texture information, each region along the shot is labeled as a face candidate or not
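The skin-masking step can be made concrete with a toy chrominance test. ViBE fits general skin and background models with EM; the fixed UV box below is only a stand-in, and its threshold values are assumptions, not the paper's model.

```python
# Illustrative skin-pixel test in YUV space. The real system estimates
# Gaussian models for skin and background with EM; this fixed UV box is
# a stand-in so the masking step is concrete. Thresholds are assumptions.

def is_skin(y, u, v):
    """Crude chrominance box for skin tones (illustrative, not ViBE's model)."""
    return 77 <= u <= 127 and 133 <= v <= 173

def skin_mask(frame):
    """frame: 2-D list of (Y, U, V) pixels -> 2-D list of 0/1 skin flags."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in frame]

frame = [[(120, 100, 150), (120, 60, 200)]]
print(skin_mask(frame))  # [[1, 0]]
```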

  10. Face Extraction

  11. Browsing with a Similarity Pyramid • Organize database in a pyramid structure • Top level of pyramid represents global variations • Bottom level of pyramid represents individual images • Spatial arrangement makes most similar images neighbors • Embedded hierarchical tree structure
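The pyramid structure above can be sketched over a similarity-ordered grid: each cell one level up summarizes a 2x2 block of the level below with a representative image. Picking the block's first member as representative is a simplification; ViBE derives representatives from its hierarchical clustering.

```python
# Sketch of an embedded similarity pyramid: the bottom level holds
# individual images arranged so neighbors are similar; each higher level
# summarizes 2x2 blocks with a representative (here, the block's first
# member -- a simplification of how ViBE chooses representatives).

def build_pyramid(grid):
    """grid: 2-D list of image ids; returns levels from bottom to root."""
    levels = [grid]
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        below = levels[-1]
        rows, cols = len(below), len(below[0])
        above = [[below[r][c]                  # representative of the 2x2 block
                  for c in range(0, cols, 2)]
                 for r in range(0, rows, 2)]
        levels.append(above)
    return levels  # levels[0] = individual images, levels[-1] = root

levels = build_pyramid([[1, 2, 3, 4],
                        [5, 6, 7, 8],
                        [9, 10, 11, 12],
                        [13, 14, 15, 16]])
print(len(levels), levels[-1])  # 3 [[1]]
```

Zooming in (slide 12) corresponds to moving from a cell at one level to its block of children at the level below; zooming out is the reverse.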

  12. Navigation via the Similarity Pyramid (figure: zooming in and out between pyramid levels)

  13. Browser Interface (screenshot: control panel, similarity pyramid, and relevance set)

  14. Common Approach in Temporal Segmentation of Video (block diagram): compressed video sequence → extraction of DC frames → extraction of a frame-by-frame dissimilarity feature → processing → shot boundary locations; the dissimilarity measure is plotted against frame number
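The single-feature baseline can be sketched in a few lines: compute a frame-to-frame dissimilarity (here, a histogram intersection distance on luminance histograms, one plausible choice) and declare a cut wherever it exceeds a fixed threshold. The threshold value is an assumption.

```python
# Single-feature thresholding baseline: one dissimilarity value per frame
# pair, one global threshold. The feature (luminance histogram
# intersection) and the 0.5 threshold are illustrative choices.

def hist_intersection_distance(h1, h2):
    """1 - normalized histogram intersection; 0 = identical, 1 = disjoint."""
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    return 1.0 - overlap / sum(h1)

def detect_cuts(histograms, threshold=0.5):
    return [i for i in range(1, len(histograms))
            if hist_intersection_distance(histograms[i - 1], histograms[i]) > threshold]

# Frames 0-2 share a distribution; frame 3 is abruptly different.
hists = [[8, 2, 0], [7, 3, 0], [8, 2, 0], [0, 1, 9]]
print(detect_cuts(hists))  # [3]
```

The next slide lists why this simple scheme is fragile in practice.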

  15. Problems With This Approach • What type of feature(s) should be used? • How do we choose threshold(s) robustly? • Classes (cut vs. dissolve, etc.) may not be separable using one simple feature Using a multidimensional feature vector may alleviate these problems.

  16. The Generalized Trace (GT) • The GT is a feature vector which is extracted from each DC frame in the compressed video sequence • DC frames for P and B frames are estimated using motion vector information • Includes features which are readily available from an MPEG stream with minimal computation • The use of the GT to detect cuts was described in Taskiran and Delp, ICASSP '98 • Binary regression tree classifier: Gelfand, Ravishankar, and Delp, PAMI, Feb. 1991

  17. List of Features • The GT feature vector consists of • g1–g3: Y, U, and V histogram intersections • g4–g6: Y, U, and V frame standard deviations • g7: number of intracoded MBs • g8: number of MBs with forward MVs • g9: number of MBs with backward MVs (g7–g9 are not applicable to all frame types) • g10–g12: frame-type binary flags
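The feature list above can be assembled into a single vector per frame. The sketch below follows the g1-g12 layout from the slide; the function arguments stand in for quantities an MPEG parser would expose, and their exact shapes are assumptions.

```python
# Sketch of assembling the Generalized Trace for one frame, following the
# slide's layout: g1-g3 YUV histogram intersections with the previous DC
# frame, g4-g6 YUV standard deviations, g7-g9 macroblock-type counts,
# g10-g12 frame-type flags. Argument shapes are assumptions.

def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2)) / max(sum(h1), 1)

def stddev(values):
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def generalized_trace(prev_hists, cur_hists, cur_planes, mb_counts, frame_type):
    """prev_hists/cur_hists: YUV histograms; cur_planes: YUV pixel lists;
    mb_counts: (intra, forward_mv, backward_mv); frame_type: 'I', 'P', 'B'."""
    g = [histogram_intersection(p, c) for p, c in zip(prev_hists, cur_hists)]  # g1-g3
    g += [stddev(plane) for plane in cur_planes]                               # g4-g6
    g += list(mb_counts)                                                       # g7-g9
    g += [int(frame_type == t) for t in "IPB"]                                 # g10-g12
    return g

gt = generalized_trace([[4, 4], [8, 0], [0, 8]],   # previous frame's YUV histograms
                       [[4, 4], [8, 0], [0, 8]],   # current frame's YUV histograms
                       [[0, 10], [5, 5], [5, 5]],  # current DC-frame YUV samples
                       (0, 20, 0), "P")
print(len(gt), gt[9:])  # 12 [0, 1, 0]
```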

  18. Detection of Cuts (block diagram): ground-truth sequences → tree training → regression tree; sequence to be processed → GT extraction and windowing → regression tree → postprocessing → cut locations
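The processing path of this diagram can be sketched end to end: window the GT sequence, score each window with the trained tree, threshold, and post-process so each run of high scores yields one cut. The window size of 3 and threshold of 0.35 come from slide 22; the scoring function below is a toy stand-in for the trained regression tree.

```python
# Cut-detection pipeline sketch: GT windowing -> tree scoring ->
# thresholding -> post-processing. The regression tree is stood in for
# by a toy scoring function; window size 3 and threshold 0.35 follow
# the experiment settings on slide 22.

def window(gt_sequence, half_width=1):
    """Concatenate each frame's GT with its neighbors (window size 3)."""
    out = []
    for i in range(half_width, len(gt_sequence) - half_width):
        w = []
        for j in range(i - half_width, i + half_width + 1):
            w.extend(gt_sequence[j])
        out.append((i, w))
    return out

def detect_cuts_with_tree(gt_sequence, score, threshold=0.35):
    cuts, prev_high = [], False
    for i, w in window(gt_sequence):
        high = score(w) > threshold
        if high and not prev_high:      # post-processing: collapse runs
            cuts.append(i)
        prev_high = high
    return cuts

# Toy stand-in for the regression tree: score = center frame's feature.
toy_score = lambda w: w[len(w) // 2]
gts = [[0.0], [0.1], [0.9], [0.8], [0.0]]
print(detect_cuts_with_tree(gts, toy_score))  # [2]
```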

  19. Detection of Gradual Transitions (block diagram): sequence to be processed → GT extraction and windowing → first tree → windowing → second tree → postprocessing → gradual transition locations
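The two-stage cascade can be sketched as follows: the first tree scores each frame, the scores are re-windowed, and the second tree decides on each window. Both scoring functions below are placeholders, not the trained trees.

```python
# Two-stage cascade sketch for gradual transitions: per-frame scores from
# a "first tree", re-windowed and judged by a "second tree". Both scoring
# functions are placeholders; only the cascade structure is shown.

def cascade(gt_sequence, first, second, width=3):
    scores = [first(g) for g in gt_sequence]        # stage 1: candidate scores
    hits = []
    for i in range(len(scores) - width + 1):        # windowing over stage-1 output
        if second(scores[i:i + width]):             # stage 2: window decision
            hits.append(i)
    return hits

first = lambda g: g[0]                     # placeholder "first tree"
second = lambda w: sum(w) / len(w) > 0.5   # placeholder "second tree"
seq = [[0.1], [0.4], [0.7], [0.9], [0.2]]
print(cascade(seq, first, second))  # [1, 2]
```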

  20. The Data Set • Digitized at 1.5 Mb/s in CIF format (352x240) • Contains more than 4 hours of video (35 hours in general) • 5 different program genres • 10-minute clips were recorded at random points during the program and commercials were edited out • A single airing of a program is never used to obtain more than one clip (except movies)

  21. Data Set Statistics

                 #frames   #cuts   #dissolves   #fades   #others
    soap opera     67582     337            2        0         0
    talk show     107150     331          108        1         6
    sports         78051     173           45        0        29
    news           58219     297            7        0         6
    movies         54160     262           15        6         1
    cspan          90269      95           19        0         0
    TOTAL         455431    1495          196        7        42

  22. Experiments • Use a cross-validation procedure to determine performance: for each genre G ∈ {soap, talk, sports, news, movies, cspan}: for i = 1 to 4: randomly choose sequences S1 and S2, both not in G; train a regression tree using S1 and S2; process all sequences in G using this tree; average performance over G; then average the four sets of values to find the performance for G • Window size = 3 frames; threshold = 0.35
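The cross-validation loop above can be written out as a runnable sketch. Tree training and per-sequence evaluation are placeholders here (the genre pool and sequence names are made up); only the structure of the procedure is being shown.

```python
# Runnable sketch of the genre cross-validation loop: for each genre,
# train on two sequences drawn from the OTHER genres, evaluate on every
# sequence in the genre, repeat four times, and average. Training and
# evaluation are placeholders; sequence names are made up.

import random

GENRES = {"soap": ["s1", "s2"], "talk": ["t1", "t2"],
          "news": ["n1", "n2"], "cspan": ["c1", "c2"]}

def cross_validate(genres, runs=4, seed=0):
    rng = random.Random(seed)
    results = {}
    for genre, sequences in genres.items():
        others = [s for g, seqs in genres.items() if g != genre for s in seqs]
        run_scores = []
        for _ in range(runs):
            s1, s2 = rng.sample(others, 2)        # training data not from G
            tree = ("regression_tree", s1, s2)    # placeholder for tree training
            scores = [1.0 for _seq in sequences]  # placeholder per-sequence metric
            run_scores.append(sum(scores) / len(scores))   # average over G
        results[genre] = sum(run_scores) / runs   # average the four runs
    return results

results = cross_validate(GENRES)
print(sorted(results))  # ['cspan', 'news', 'soap', 'talk']
```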

  23. Results

               Tree Classifier       Sim. Thresholding       Sliding Window
             Detect    FA     MC    Detect    FA     MC    Detect    FA     MC
    soap      0.941   13.3     0     0.916    99      0     0.852    24      0
    talk      0.942   32.3   7.5     0.950    45      1     0.968   171     15
    sports    0.939   82.5  34.8     0.785    59      1     0.925   251     73
    news      0.958   38.0  0.75     0.886    61      0     0.926   212      1
    movies    0.821   43.3     2     0.856    25      0     0.816    25      3
    cspan     0.915   54.3   8.5     0.994    40      0     0.943     3     20

  24. Conclusion • The two-stage tree classifier provides a good framework for detection of dissolves and fades • Pseudo-semantic labeling is a fast, feature-based approach to content description • Relevance feedback coupled with the similarity pyramid forms a powerful active browsing tool • Performance of the regression tree classifier is very robust to changes in program content • The performance is not affected much by the choice of the training sequences
