Video Indexing and Modeling • Physical Feature-based Indexing • Semantic Content-based Modeling
Video Units • Shot: Frames recorded in one camera operation form a shot. • Scene: One or several related shots are combined in a scene. • Sequence: A series of related scenes forms a sequence. • Video: A video is composed of different story units, such as shots, scenes, and sequences, arranged according to some logical structure (defined by the screenplay). These concepts can be used to organize video data.
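The shot/scene/sequence hierarchy above maps naturally onto nested data types. A minimal sketch (the class and field names are illustrative, not from the original slides):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical containers for the video units described above:
# frames -> shot -> scene -> sequence -> video.
@dataclass
class Shot:
    start_frame: int  # first frame of the camera operation
    end_frame: int    # last frame of the camera operation

@dataclass
class Scene:
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Sequence:
    scenes: List[Scene] = field(default_factory=list)

@dataclass
class Video:
    sequences: List[Sequence] = field(default_factory=list)
```

Such a structure lets queries and browsing operate at any level of the hierarchy.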
Physical Feature-based Modeling • A video is represented and indexed based on audio-visual features • Features can be extracted automatically • Queries are formulated in terms of color, texture, audio, or motion information • Very limited in expressing queries close to human thinking • It would be difficult to ask for a video clip showing the sinking of the ship in the movie “Titanic” using only a color description. • Nevertheless, automatic feature-based indexing is the direction the field is heading.
A Video-Tracking Approach to Browsing and Searching Video Data
Applications of VDBMS • digital libraries, • distance learning, • electronic commerce, • movies on demand, • public information systems, etc.
Video Data is Complex • Cannot use ‘tuple’ or ‘record’ • Need video data segmentation techniques • Alphanumeric order does not work • Need content-based indexing techniques • Exploring or browsing mode to confirm similarity matching • Need browsing mechanism
Units for Video Data • Shot is a sequence of frames recorded in one camera operation. • Shot is the unit for handling video data. • Scene is a sequence of semantically related shots.
Shot Boundary Detection (Content-based Segmentation) • Pixel Matching • Color Histogram • Edge Change Ratio
Existing SBD Techniques • Pixel Matching • Compares corresponding and neighboring pixels. • Sensitive to noise and object movement. • Color Histograms • Count pixels of each color. • Cannot distinguish images with different structures but similar color distributions. • Edge Change Ratio • Counts entering and exiting edge pixels. • Very expensive; the user must provide many parameters, e.g., matching distance.
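Of the three techniques above, the color-histogram approach is the simplest to sketch. The version below is a toy illustration, assuming each frame is given as a flat list of quantized color-bin indices and that the threshold is supplied by the user (as the slides note, choosing it is the hard part):

```python
from collections import Counter

def histogram(frame, n_bins=8):
    # Count pixels of each color (quantized into n_bins bins).
    counts = Counter(frame)
    return [counts.get(b, 0) for b in range(n_bins)]

def hist_diff(f1, f2, n_bins=8):
    # L1 distance between the two color histograms.
    h1, h2 = histogram(f1, n_bins), histogram(f2, n_bins)
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_boundaries(frames, threshold):
    # A boundary is declared wherever consecutive frames differ enough.
    return [i for i in range(1, len(frames))
            if hist_diff(frames[i - 1], frames[i]) > threshold]
```

Note how this compares every pair of consecutive frames, i.e., it is a linear approach in the terminology used later in the deck.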
Drawbacks of Existing SBD Techniques • Difficult to determine many input parameters. • Accuracy varies from 20% to 80% [Lienhart99]. • Computation is expensive.
Our Technique: Focusing on Background • Observation: A shot is a sequence of frames recorded from a single camera operation. • Idea: Tracking camera motion is the most direct way to detect shot boundaries. • Technique: Track the background area of the video.
Another Reason for Focusing on Background • Frames consist of object(s) and background. • While objects change, the background changes little (illustrated by frames #234, #253, #271, #283, and #296).
Background Tracking (BGT) • The maximum number of continuous matches is the score. • It measures how much two frames share the common background (illustrated by frames #282 and #283 and their TBAs).
Signature & Sign • A ‘Gaussian pyramid’ was used to reduce an image to a smaller size. • We adapted it to reduce the TBA and FOA to their ‘signature’ and ‘sign’.
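A Gaussian pyramid repeatedly smooths and downsamples an image; the toy sketch below approximates this with plain 2×2 block averaging on a grayscale image given as a list of rows. The final `sign_bits` step is an assumed scheme (one bit per signature cell, above or below the mean), not the exact encoding from the slides:

```python
def reduce_once(img):
    # Halve the image by averaging each 2x2 block (pyramid-style reduction).
    h, w = len(img), len(img[0])
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4
             for x in range(w // 2)]
            for y in range(h // 2)]

def signature(img, size=2):
    # Reduce until the image fits the small 'signature' resolution.
    while len(img) > size:
        img = reduce_once(img)
    return img

def sign_bits(sig):
    # One bit per signature cell: 1 if at or above the mean, else 0.
    flat = [v for row in sig for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v >= mean else 0 for v in flat]
```

Matching only these compact sign bits is far cheaper than comparing full frames, which is what makes BGT inexpensive.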
A Shot Detected by BGT (frames #234, #253, #271, #283, #296) • Edge Change Ratio and Color Histogram incorrectly detect this sequence as two shots.
Test Set • BGT : Background Tracking • CHD : Color Histogram Difference • ECR : Edge Change Ratio
Performance Metrics • Recall: The ratio of the number of shots detected correctly over the actual number of shots. • Precision: The ratio of the number of shots detected correctly over the total number of shots detected.
BGT is less sensitive (Matching only the Sign bits) • [Plot: Ds versus threshold for CHD and ECR]
Advantages of Focusing on Background • Less sensitive to thresholds • More reliable • Less computation
SBD is very expensive • Consider a video clip with 30 frames/second, 320 x 240 pixels, 3 bytes/pixel. • We need to deal with more than 2 Gigabytes for only 5 minutes of video. • This is too much for large-scale applications.
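The data-volume claim above checks out with simple arithmetic:

```python
fps, width, height, bytes_per_pixel = 30, 320, 240, 3

# Raw data rate: 30 frames/s * 320 * 240 pixels * 3 bytes/pixel.
bytes_per_second = fps * width * height * bytes_per_pixel   # 6,912,000 B/s

# Five minutes of uncompressed video.
five_minutes = bytes_per_second * 5 * 60                    # 2,073,600,000 B
```

That is just over 2 GB for 5 minutes, confirming that comparing every raw frame is prohibitive at scale.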
What can we do? • Existing techniques can be regarded as a linear approach, since they compare every pair of consecutive frames. • Observations: • There is little difference between consecutive frames within a shot. • Frames do not have to be examined in the order they appear in the video clip.
Regular Skip • Compare every d-th frame • Reduces the number of comparisons substantially
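One plausible reading of regular skip is sketched below: compare frames d apart, and only when a skipped pair differs, fall back to a linear scan inside that window to locate the exact boundary. The `differ` predicate stands in for any inter-frame difference test (e.g., a thresholded histogram difference); for simplicity this sketch ignores a trailing window shorter than d:

```python
def regular_skip(frames, d, differ):
    # Compare every d-th frame; scan linearly only inside windows
    # whose end frames differ.
    boundaries = []
    i = 0
    while i + d < len(frames):
        if differ(frames[i], frames[i + d]):
            for j in range(i + 1, i + d + 1):
                if differ(frames[j - 1], frames[j]):
                    boundaries.append(j)
        i += d
    return boundaries
```

When shots are long relative to d, most windows are skipped with a single comparison, which is where the savings come from.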
Adaptive Skip • The optimal d varies from video to video • Determine d dynamically
Binary Skip • Compare the two end frames, • Divide the frame sequence in half, and • Compare the new end frames of each half until the terminal conditions are met • Significant savings compared to Linear
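The binary skip steps can be sketched recursively. This is an illustrative reconstruction, assuming the terminal condition is a two-frame segment; note that if a segment's end frames happen to match despite containing cuts, this simple version would miss them:

```python
def binary_skip(frames, lo, hi, differ, boundaries):
    # If the end frames of the segment match, assume no cut inside.
    if not differ(frames[lo], frames[hi]):
        return
    # Terminal condition: two adjacent frames that differ -> a boundary.
    if hi - lo == 1:
        boundaries.append(hi)
        return
    # Otherwise divide the segment in half and recurse on each half.
    mid = (lo + hi) // 2
    binary_skip(frames, lo, mid, differ, boundaries)
    binary_skip(frames, mid, hi, differ, boundaries)
```

For long shots, whole segments are dismissed with one comparison, giving the large savings over the linear approach reported in the experiments.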
Experiments • Compare - Linear, Regular, Adaptive, and Binary. • Two metrics : - Total Execution Time and - Total Number of Frame Comparisons. • Color Histogram Method is used to compute inter-frame difference.
Example (frames #1, #1500, #2952, #2953, #3184, #3185) • Avg. shot length is very long • Around 98% savings
Example (frames #1581, #1610, #1611, #1643, #1644, #1669) • Avg. shot length is very short • Around 60% savings
Concluding Remarks • We introduced the Non-linear approach to SBD • We proposed three Non-linear techniques • Our experiments show this novel idea can improve SBD performance - up to 71 times, and - 5 times on average.
Scene Hierarchy • Shot is a good unit for video abstraction. • Scene is often a better unit to convey the semantic meaning. • A scene hierarchy allows browsing and retrieving video information at various semantic levels.
Existing Schemes for Scenes • Ignore the video content, e.g., At each level, a segment is evenly divided into smaller segments for the next level [Zhang95]. • Limited to low-level scene construction, e.g., video-scene-group-shot hierarchy is used in [Rui99]. • Reliance on explicit models of scene boundary characteristics, e.g., CNN "Headline News" begins and ends with a computer graphics shot of flying letters reading "Headline News” [Zhang94].
Scene Tree • We group neighboring shots that are related (i.e., sharing a similar background) into scenes. • Scenes with related shots are assembled into higher-level scenes, to an arbitrary number of levels. • Note: The size and shape of a scene tree are determined by the semantic complexity of the video.
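The bottom-up grouping step can be sketched as follows, assuming a boolean `related(a, b)` predicate between neighboring units (shots at the lowest level, scenes above that); applying it repeatedly until one group remains yields the scene tree:

```python
def group_related(units, related):
    # Merge runs of neighboring related units into groups.
    groups, current = [], [units[0]]
    for u in units[1:]:
        if related(current[-1], u):
            current.append(u)
        else:
            groups.append(current)
            current = [u]
    groups.append(current)
    return groups
```

Because grouping stops where neighbors are unrelated, a semantically simple video collapses quickly into a shallow tree, while a complex one produces a deeper hierarchy.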
Related Shots • Two shots are ‘related’ if they share at least one TBA (determined by comparing the signs), e.g., shots in a telephone conversation scene are related.
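A minimal sketch of this test, assuming each shot carries a list of sign bit-vectors, one per TBA:

```python
def related(shot_signs_a, shot_signs_b):
    # Two shots are 'related' if any TBA sign in one shot exactly
    # matches a TBA sign in the other.
    return any(sa == sb for sa in shot_signs_a for sb in shot_signs_b)
```

For a telephone conversation, the two alternating camera setups each keep their own background, so shots from the same setup share a sign and the whole run is grouped into one scene.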
An Automatically Generated Scene Tree • A keyframe captures the semantics of a video segment. • More than one keyframe can be used for higher-level scenes.
Our Environment • We segment each video into shots using a camera tracking technique. • Efficient and reliable • For each video, we apply a fully automatic method to build a browsing hierarchy. • Scene tree reflects the semantic complexity of the video • We use the sign bits to build a content-based indexing mechanism to make browsing more efficient. • Fully automatic and effective • Not reliant on expensive image processing techniques.
Video Similarity Model • Compute a signature for the query shot q. • Compute a signature for each video shot i. • Shot i is similar to q if it satisfies a threshold condition. • Threshold parameters can be set to control the size of the result set. • Scenes with a certain percentage of matching shots are considered similar.
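The slide's exact formulas did not survive extraction, so the sketch below assumes a generic distance-based rule: a shot matches when its signature is within a user-set threshold of the query's, and a scene matches when enough of its shots match:

```python
def similar_shots(query_sig, shot_sigs, dist, threshold):
    # Shots whose signature distance to the query is within the threshold.
    return [i for i, s in enumerate(shot_sigs)
            if dist(query_sig, s) <= threshold]

def similar_scenes(scenes, matching_shots, min_fraction):
    # scenes: list of lists of shot ids; a scene is similar when at least
    # min_fraction of its shots are in the matching set.
    m = set(matching_shots)
    return [k for k, shots in enumerate(scenes)
            if len(m & set(shots)) / len(shots) >= min_fraction]
```

Raising the threshold or lowering the fraction grows the result set, which is the control the slide alludes to.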
A Matching Example (query shot; matching shots 1–3) • Interested in a shot? It can be used as a query to look for other shots with similar characteristics (e.g., a talking head in a dark area).
Another Matching Example (query shot; matching shots 1–3) • Matching characteristic: two people talking from some distance.
A Video Search Engine • A Web agent or spider traverses the Web to detect videos, then retrieves and forwards them to the search engine. • Keywords can be extracted in two ways: • Directory and file names in URLs typically contain meaningful terms, e.g., http://www.fpl.com/environment/endangered/contents/manatee.shtml • Most videos include a short text description. • The Scene Tree is used for browsing and searching within a video clip.