Using Cross-Media Correlation for Scene Detection in Travel Videos

Wei-Ta Chu, Che-Cheng Lin, Jen-Yu Yu

Outline • Introduction • Approach • Experiments • Conclusion

Introduction: Why use cross-media correlation for scene detection in travel videos? What is the correlation between photos and videos? More and more people are used to recording their daily lives and travel experiences with both digital cameras and camcorders, because cameras and camcorders now cost much less.

Why use cross-media correlation for scene detection in travel videos? What is the correlation between photos and videos? People often capture travel experiences with both still cameras and camcorders. Massive amounts of home video are captured in uncontrolled environments and suffer from problems such as overexposure/underexposure and hand shake. Photos and videos of the same trip contain similar information, such as landmarks and human faces.

Why use cross-media correlation for scene detection in travel videos? • Direct scene detection in videos is hard. • Photos and videos are highly correlated. • Photos contain high-quality data, so scene detection on them is easier.

Approach • Why do people use photos and videos for different purposes, even when capturing the same things? • Photos: to obtain high-quality data, e.g., of famous landmarks or human faces. • Videos: to capture the evolution of an event. By utilizing this correlation, we can accomplish tasks that are hard to conduct in videos but easier to do in photos.

Framework • To perform scene detection in photos: first, cluster the photos by checking their time information. • To perform scene detection in videos: first, extract several keyframes from each video shot, then find the optimal matching between the photo and keyframe sequences.

The idea of scene detection based on cross-media alignment

The proposed cross-media scene detection framework (figure). Photos: time-based clustering, then visual word representation. Videos: shot change detection, keyframe extraction, filtering (motion blur), then visual word representation. DP-based matching of the two visual word sequences yields the scene boundaries. Filtering not only reduces the time of cross-media matching but also eliminates the influence of bad-quality images.

Preprocessing • Scene detection for photos: photos are clustered by their shooting times. Let g_i = t_{i+1} − t_i denote the time difference between the i-th and (i+1)-th photos, K an empirical threshold, and d the size of the sliding window. When the threshold condition on g_n is met, a scene change is claimed to occur between the n-th and (n+1)-th photos. We set K to 17 and d to 10 in this work.
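The threshold condition itself did not survive extraction. The values K = 17 and d = 10 appear to match the time-based clustering of PhotoTOC (Platt et al.), so the sketch below assumes that criterion: a boundary after photo n when log g_n ≥ log 17 + (1/(2d+1)) Σ_{i=−d..d} log g_{n+i}. Treat this as an assumption rather than the authors' exact rule; the function names and the truncation of the window at the sequence ends are also illustrative choices.

```python
import math


def time_scene_boundaries(timestamps, k=math.log(17), d=10):
    """Return every n such that a scene change is declared between photo n and photo n+1.

    ASSUMPTION: PhotoTOC-style adaptive threshold on log time gaps,
        log(g_n) >= k + mean(log(g_{n+i}) for i in -d..d),
    with k = log(17) and d = 10. The slide's exact condition was lost.
    """
    ts = sorted(timestamps)  # shooting times in seconds
    gaps = [t1 - t0 for t0, t1 in zip(ts, ts[1:])]  # g_i = t_{i+1} - t_i
    # Clamp zero gaps (burst shots within the same second) so log() is defined.
    logs = [math.log(max(g, 1e-6)) for g in gaps]
    boundaries = []
    for n, log_gn in enumerate(logs):
        # Sliding window of up to 2d+1 gaps centered on g_n, truncated at the ends.
        window = logs[max(0, n - d):min(len(logs), n + d + 1)]
        if log_gn >= k + sum(window) / len(window):
            boundaries.append(n)
    return boundaries


def split_into_scenes(timestamps, **kwargs):
    """Group sorted photo timestamps into scenes using the boundaries above."""
    ts = sorted(timestamps)
    scenes, start = [], 0
    for n in time_scene_boundaries(ts, **kwargs):
        scenes.append(ts[start:n + 1])
        start = n + 1
    scenes.append(ts[start:])
    return scenes
```

For example, four photos taken ten seconds apart, a gap of almost three hours, then four more photos ten seconds apart split into two scenes. Comparing each gap with the average log gap of its neighborhood lets the threshold adapt to how densely the user was shooting.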

Preprocessing • Use the global k-means algorithm to extract keyframes. • Detect and filter out blurred keyframes. This not only reduces the time of cross-media matching but also eliminates the influence of bad-quality images.
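The slide names global k-means but gives no details. Below is a minimal pure-Python sketch of the incremental global k-means procedure: start from the 1-means solution, then add one center at a time, trying every frame as the new center's starting position and keeping the run with the lowest squared error. It assumes each frame of a shot is already summarized as a feature vector (the feature choice is an assumption), and it takes the frame nearest each final center as the keyframe, a hypothetical rule. The slide also does not say which blur detector is used; the sketch swaps in the common variance-of-Laplacian sharpness score with a hypothetical threshold.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, centers, iters=100):
    """Plain Lloyd iterations from the given initial centers; returns (centers, SSE)."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sqdist(p, centers[c]))
            groups[nearest].append(p)
        # Keep an empty cluster's center where it was.
        new_centers = [mean(g) if g else centers[c] for c, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sqdist(p, c) for c in centers) for p in points)
    return centers, sse


def global_kmeans(points, k):
    """Global k-means: add one center at a time, trying every point as its start."""
    centers = [mean(points)]  # the optimal 1-means solution
    for _ in range(1, k):
        candidates = (kmeans(points, centers + [list(p)]) for p in points)
        centers, _sse = min(candidates, key=lambda result: result[1])
    return centers


def extract_keyframes(frame_features, k):
    """Indices of the frames nearest to each global k-means center (hypothetical rule)."""
    centers = global_kmeans(frame_features, k)
    return sorted({min(range(len(frame_features)),
                       key=lambda i: sqdist(frame_features[i], c)) for c in centers})


def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian over interior pixels."""
    values = [gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
              for y in range(1, len(gray) - 1) for x in range(1, len(gray[0]) - 1)]
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def filter_blurry(gray_frames, threshold):
    """Keep indices of frames whose sharpness reaches a hypothetical threshold."""
    return [i for i, g in enumerate(gray_frames) if laplacian_variance(g) >= threshold]
```

Trying every frame as a starting position makes the result deterministic, unlike random k-means initialization, at the cost of one k-means run per frame per added center, which is affordable for the few hundred frames of a single shot.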

Visual Word Representation • Apply the difference-of-Gaussian (DoG) detector to detect feature points in keyframes and photos. • Use SIFT (Scale-Invariant Feature Transform) to describe each point as a 128-dimensional feature vector. • Cluster the SIFT feature vectors with the k-means algorithm; feature points in the same cluster are considered to belong to the same visual word.

Visual Word Representation (figure): keyframes and photos → SIFT feature points (feature vectors) → k-means → visual words
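Once k-means has produced the cluster centers (the codebook), each photo or keyframe can be summarized by how often each visual word occurs. The sketch below assumes the codebook is already learned and uses toy 2-D descriptors in place of 128-dimensional SIFT vectors; in practice the descriptors could come from a DoG/SIFT implementation such as OpenCV's `cv2.SIFT_create()`. The L1 normalization and the function names are illustrative assumptions.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(descriptors, codebook):
    """Map each descriptor to the index of its nearest codeword (its visual word)."""
    return [min(range(len(codebook)), key=lambda w: sqdist(d, codebook[w]))
            for d in descriptors]


def visual_word_histogram(descriptors, codebook):
    """L1-normalized histogram of visual-word occurrences for one photo or keyframe."""
    hist = [0.0] * len(codebook)
    for w in quantize(descriptors, codebook):
        hist[w] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Normalizing by the number of feature points keeps histograms comparable between a high-resolution photo, which typically yields many SIFT points, and a lower-resolution video keyframe.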

Visual Word Histogram Matching: Let X_i denote the i-th prefix of a sequence X, i.e., X_i = <x_1, x_2, …, x_i>. LCS(X_i, Y_j) denotes the length of the longest common subsequence between X_i and Y_j.
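The recurrence did not survive extraction; the sketch below uses the standard LCS dynamic program, LCS(X_i, Y_j) = LCS(X_{i−1}, Y_{j−1}) + 1 if x_i matches y_j, else max(LCS(X_{i−1}, Y_j), LCS(X_i, Y_{j−1})). Because histograms are never exactly equal, it assumes two elements "match" when their histogram intersection reaches a similarity threshold (the evaluation varies such a threshold, but intersection as the similarity is an assumption). The boundary-transfer step, in which each matched keyframe inherits its photo's scene label and unmatched keyframes inherit the previous label, is likewise a hypothetical illustration; all names are illustrative.

```python
def hist_intersection(h1, h2):
    """Similarity of two L1-normalized histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def lcs_align(X, Y, sim_threshold):
    """X: photo histograms, Y: keyframe histograms. Returns (LCS length, matched (i, j) pairs)."""
    m, n = len(X), len(Y)
    match = lambda i, j: hist_intersection(X[i], Y[j]) >= sim_threshold
    # L[i][j] holds LCS(X_i, Y_j), the LCS length of the prefixes X_i and Y_j.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if match(i - 1, j - 1):
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack from (m, n) to recover one optimal set of matched pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if match(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return L[m][n], pairs


def video_scene_boundaries(pairs, photo_scene, num_keyframes):
    """Keyframe indices j > 0 where a new scene starts, given photo scene labels."""
    labels = [None] * num_keyframes
    for i, j in pairs:
        labels[j] = photo_scene[i]
    # Hypothetical rule: an unmatched keyframe inherits the previous keyframe's label.
    for j in range(1, num_keyframes):
        if labels[j] is None:
            labels[j] = labels[j - 1]
    return [j for j in range(1, num_keyframes)
            if labels[j - 1] is not None and labels[j] != labels[j - 1]]
```

Because LCS preserves order on both sides, the alignment respects the chronology shared by the photos and the video, so a matched pair can never cross another; this is what allows photo scene boundaries to be carried over to the keyframe sequence.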

Evaluation Data

Evaluation Metric: purity. τ(s_i, s_j*) is the length of overlap between the detected scene s_i and the ground-truth scene s_j*, τ(s_i) is the length of the scene s_i, and T is the total length of all scenes. The first term indicates the fraction of the currently evaluated scene, and the second term indicates how much a given scene is split into smaller scenes. The purity value ranges from 0 to 1; a larger value means that the result is closer to the ground truth.
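The purity formula itself was lost in extraction. Assuming it takes the two-factor form suggested by the description, purity = Σ_i (τ(s_i)/T) · Σ_j (τ(s_i, s_j*)/τ(s_i))², a minimal sketch follows. The exact formula, the representation of scenes as (start, end) intervals in seconds, and the function names are all assumptions.

```python
def overlap(a, b):
    """Length of overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def purity(detected, truth):
    """ASSUMED form: sum_i (tau(s_i)/T) * sum_j (tau(s_i, s_j*)/tau(s_i))**2."""
    T = sum(end - start for start, end in detected)
    total = 0.0
    for s in detected:
        length = s[1] - s[0]
        if length <= 0:
            continue
        # Second factor: equals 1 when s_i lies inside one ground-truth scene,
        # and shrinks as s_i is spread across several of them.
        split = sum((overlap(s, g) / length) ** 2 for g in truth)
        total += (length / T) * split  # weighted by the scene's share of T
    return total
```

A detection identical to the ground truth scores 1.0, while a single detected scene covering two equal-length ground-truth scenes scores 0.5.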

Performance in terms of purity based on different numbers of visual words, with different similarity thresholds

Performance based on four different scene detection approaches (figure; legend includes Hue, Saturation, Value)

Conclusion • For videos, keyframes are extracted by the global k-means algorithm. • Scene spots can be easily determined from the time information of photos. • Keyframes and photo sets are represented by sequences of visual words. • Scene detection is transformed into a sequence matching problem.

Conclusion • Using a dynamic programming approach, we find the optimal matching between the two sequences and determine video scene boundaries with the help of photo scene boundaries. • Experiments on different travel videos with different parameter settings show that using the correlation between different modalities is effective.
