This research paper proposes a music-driven video summarization system that uses content-aware mechanisms to synchronize video with audio. The system creates professional-looking and entertaining musical videos by making the rhythm of the video match that of the audio.
A Music-driven Video Summarization System Using Content-aware Mechanisms (應用內容知覺機制於音樂導向的視訊摘要系統) Communications and Multimedia Lab (CMLab), Dept. of CSIE, National Taiwan University Student: 黃振修 Advisor: Dr. 吳家麟
Outline • Introduction • The problem / proposed solution • Related work • System framework • Media analysis • Audio/video analysis • Importance functions • Synchronization (combining video with audio) • Profiles: Rhythmic & Medium • Parameters: Sequential & Non-sequential • Demonstration • Experimental results • Conclusions & future work
Introduction: The Problem • Digital video capture devices such as DVs have become more affordable for end users. • There is still a tremendous barrier between amateurs (home users) and powerful video editing software (Adobe Premiere, CyberLink PowerDirector). • Shooting videos is fun, but editing them is frustrating. • As a result, people leave their precious shots in piles of DV tapes, unedited and unmanaged.
Introduction: Users’ Impatience • According to a survey on DVworld*, relating video length to how often users review their videos days later: • Video clips of no more than 5 minutes suit human concentration best. • People lose patience with videos that have no scenario or voice-over, especially those with no music. *http://www.DVworld.com.tw/
Introduction: Proposed Solution • The music-driven video as summarization • A study at MIT showed that improved soundtrack quality improves perceived video image quality. • Synchronizing video and audio segments enhances the perception of both. • Proposed solution • Create a musical video from home videos. • Synchronization is done by making the rhythm of the video fit that of the audio. Because of users’ direct sympathetic response to music, the resulting musical video looks professional and is more entertaining.
Introduction: Related Works [Workflow figure: pick up your video → choose your favorite music → select profiles to apply → produce a quality musical video.] • Literature: Jonathan Foote, Matthew D. Cooper, Andreas Girgensohn, "Creating music videos using automatic media analysis," ACM Multimedia 2002: 553-560 • A consumer product called “muvee autoProducer” has been announced to ease the burden of professional video editing. • Content-analysis technologies have been developed for years; can we use them to help auto-create musical videos? • The content-aware mechanisms
System Framework [Framework diagram: the input video goes through shot-change detection and video analysis (human face, flashlight, motion strength, color variance, camera operation, ...), followed by scene selection and key-shot selection; the input music goes through audio analysis (volume, ZCR, brightness, bandwidth, ...) and audio segment cutting into audio clips; the selected video shots are then aligned with the audio clips via audio-rhythm and video-motion/color synchronization, producing the output video or an editing script.]
Media Analysis: Audio Features • Frame-level features • Time-domain features • Volume: defined as the RMS (root mean square) of the audio samples in a frame • ZCR: the number of times the audio waveform crosses the zero axis in each frame • Frequency-domain features • Brightness: the centroid of the frequency spectrum • Bandwidth: the standard deviation of the frequency spectrum [Figure: volume and bandwidth curves over 0s–90s, with clip boundaries and attacks marked as sub-clip separators.]
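A minimal sketch of these frame-level features, assuming 16 kHz mono PCM samples in a NumPy array; the frame size and hop length are illustrative choices, not taken from the paper.

```python
import numpy as np

def frame_features(samples, frame_len=512, hop=256, sr=16000):
    feats = []
    for start in range(0, len(samples) - frame_len, hop):
        frame = samples[start:start + frame_len]
        # Volume: root-mean-square of the samples in the frame.
        volume = np.sqrt(np.mean(frame ** 2))
        # ZCR: number of sign changes of the waveform within the frame.
        zcr = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
        # Magnitude spectrum for the frequency-domain features.
        power = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        total = power.sum() + 1e-12
        # Brightness: centroid of the frequency spectrum.
        brightness = np.sum(freqs * power) / total
        # Bandwidth: standard deviation of the spectrum around the centroid.
        bandwidth = np.sqrt(np.sum(((freqs - brightness) ** 2) * power) / total)
        feats.append((volume, zcr, brightness, bandwidth))
    return np.array(feats)
```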
Media Analysis: Audio Analysis • Generally the brightness distribution curve is almost the same as that of the ZCR curve, so here we use the ZCR feature only. • Bandwidth is an important audio feature, but we cannot easily tell what it physically means in music when it reaches high or low values. • Furthermore, the relations between musical perception and bandwidth values are neither clear nor regular. [Figure: brightness vs. ZCR curves over 12s–34s.]
Media Analysis: Audio Segmentation • First we cut the input audio into clips where the volume changes dramatically. • For each clip, we define a burst of ZCR as an “attack”, which may be the beat of a bass drum or the voice of a singer.
Media Analysis: Audio Segmentation [Figure: volume and bandwidth curves with clip boundaries and attacks as sub-clip separators.] • A dramatic volume change defines an audio clip boundary, while a burst of ZCR (an attack) within each clip defines the granular sub-segments inside it. • In addition, we define the dynamics of an audio clip as our clip-level feature. • Faster-tempo music usually has clips with higher audio dynamics.
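A sketch of this segmentation rule, assuming the per-frame volume and ZCR arrays from the feature extractor above; the thresholds and the variance-based dynamics measure are illustrative assumptions, not the paper's values.

```python
import numpy as np

def segment_audio(volume, zcr, vol_jump=2.0, zcr_burst=2.0):
    # Clip boundaries: frames whose volume changes dramatically
    # relative to a short running average.
    avg = np.convolve(volume, np.ones(20) / 20, mode="same") + 1e-9
    boundaries = [0] + [i for i in range(1, len(volume))
                        if volume[i] / avg[i] > vol_jump] + [len(volume)]
    clips = []
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        if b - a < 2:
            continue
        z = zcr[a:b]
        # Attacks: bursts of ZCR inside the clip (sub-clip separators).
        attacks = [a + i for i in range(len(z)) if z[i] > zcr_burst * z.mean()]
        # Clip dynamics, modeled here as the volume variance inside the
        # clip; faster-tempo music tends to score higher.
        dynamic = float(np.var(volume[a:b]))
        clips.append({"range": (a, b), "attacks": attacks, "dynamic": dynamic})
    return clips
```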
Media Analysis: Video Analysis • First we apply shot-change detection to segment the video into shots. • We combine the pixel MAD (Mean Absolute Difference) and pixel-histogram-difference methods: if DPixel > ThMAD, we further test DHist > ThHist; passing both tests signals a shot change, while passing only the first indicates hand trembling, and failing the first means no shot change. • The hybrid method performs well for home videos!
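A minimal sketch of this two-stage test, assuming consecutive grayscale frames as NumPy arrays; both thresholds are illustrative placeholders, not the paper's values.

```python
import numpy as np

def is_shot_change(prev, curr, th_mad=30.0, th_hist=0.4):
    # Stage 1: mean absolute pixel difference. Below threshold: no change.
    d_pixel = np.mean(np.abs(curr.astype(float) - prev.astype(float)))
    if d_pixel <= th_mad:
        return False
    # Stage 2: histogram difference separates real cuts from hand
    # trembling, which moves pixels but keeps the intensity distribution.
    h1, _ = np.histogram(prev, bins=64, range=(0, 255), density=True)
    h2, _ = np.histogram(curr, bins=64, range=(0, 255), density=True)
    d_hist = 0.5 * np.sum(np.abs(h1 - h2))
    return d_hist > th_hist
```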
Media Analysis: Video Analysis • Shot heterogeneity • We use the MPEG-7 ColorLayout descriptor to measure each frame’s similarity. • Heterogeneity measures a video shot’s visual variety. [Figure: example shots with high and low heterogeneity.]
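A rough sketch of shot heterogeneity, assuming frames as BGR NumPy arrays. The MPEG-7 ColorLayout descriptor is approximated here by an 8×8 downsampled color thumbnail; this is an illustration, not the standard-compliant descriptor.

```python
import cv2
import numpy as np

def color_layout(frame):
    # Crude stand-in for ColorLayout: an 8x8 color thumbnail.
    return cv2.resize(frame, (8, 8), interpolation=cv2.INTER_AREA).astype(float).ravel()

def shot_heterogeneity(frames):
    descs = [color_layout(f) for f in frames]
    # Heterogeneity as the average distance of each frame to the shot mean:
    # visually varied shots score high, static shots score low.
    mean = np.mean(descs, axis=0)
    return float(np.mean([np.linalg.norm(d - mean) for d in descs]))
```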
Media Analysis: Camera Operation • Camera operations such as pan and zoom are widely used in amateur home videos. Detecting those camera operations helps capture the videographer’s intention. • Our camera-operation detection is performed on the basis of block-based motion vectors: coherent vectors indicate a pan, divergent vectors indicate a zoom; otherwise, no camera operation. • This method is simple and efficient.
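A sketch of the pan/zoom decision from block-based motion vectors, assuming a (rows, cols, 2) array of per-block (dx, dy) vectors; the thresholds and the exact decision rule are illustrative assumptions.

```python
import numpy as np

def classify_camera_op(mv, mag_th=1.0, coherence_th=0.7):
    rows, cols, _ = mv.shape
    mags = np.linalg.norm(mv, axis=2)
    if mags.mean() < mag_th:
        return "none"                       # mostly static blocks
    mean_v = mv.reshape(-1, 2).mean(axis=0)
    # Pan: most vectors point the same way as the global mean vector.
    coherence = np.mean(mv.reshape(-1, 2) @ mean_v > 0)
    if np.linalg.norm(mean_v) > mag_th and coherence > coherence_th:
        return "pan"
    # Zoom: vectors point away from (zoom out) or toward (zoom in)
    # the frame center.
    cy, cx = (rows - 1) / 2, (cols - 1) / 2
    ys, xs = np.mgrid[0:rows, 0:cols]
    radial = np.stack([xs - cx, ys - cy], axis=2)
    outward = np.sum(mv * radial, axis=2)
    if np.mean(outward > 0) > coherence_th or np.mean(outward < 0) > coherence_th:
        return "zoom"
    return "none"
```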
Media Analysis: Video Features • High-level features • Human face feature • We use the face detector in the OpenCV library • Face feature ratio • Flashlight feature
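A sketch of the face feature using OpenCV's stock Haar cascade (the slide names OpenCV's detector, but the specific cascade is our choice); modeling the "face feature ratio" as the fraction of the frame area covered by detected faces is our assumption.

```python
import cv2

# Stock frontal-face Haar cascade shipped with opencv-python.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_ratio(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Assumed definition: total detected face area over frame area.
    face_area = sum(w * h for (x, y, w, h) in faces)
    return face_area / float(frame.shape[0] * frame.shape[1])
```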
Media Analysis: Video Features • Medium-level features • Medium-level features represent frames that are dynamic (higher motion activity) in nature. • Motion strength • Static frames tend to make people lose patience when watching videos. • Camera motion types • None, Pan, Zoom • Importance: Zoom > Pan > None
Media Analysis: Video Features • Low-level features • These model frames that are visually pleasing, i.e., they are used for selecting high-quality frames in the final production. • Frame brightness (luminance)
Media Analysis: Video Features • Color variance • We use histogram distributions to model the color variances
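A sketch of these two low-level features, assuming BGR frames; deriving color variance from the per-channel histogram spread is one plausible reading of "histogram distributions", not necessarily the paper's exact formulation.

```python
import cv2
import numpy as np

def frame_brightness(frame):
    # Luminance as the mean of the Y channel.
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    return float(ycrcb[:, :, 0].mean())

def color_variance(frame):
    variances = []
    for ch in range(3):
        hist = cv2.calcHist([frame], [ch], None, [256], [0, 256]).ravel()
        p = hist / hist.sum()
        # Variance of the channel's intensity distribution.
        mean = np.sum(np.arange(256) * p)
        variances.append(np.sum(((np.arange(256) - mean) ** 2) * p))
    return float(np.mean(variances))
```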
Media Analysis: Importance Functions • Video frame-level importance: a “scaling factor” S_a, defined from the accompanying audio clip’s dynamics (A_dynamic)
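The slide's formula did not survive extraction; one plausible reconstruction, assuming the scaling factor simply modulates each frame's raw feature score by the clip's dynamics (our reading, with the normalizer A_max introduced here for illustration):

$$ I'(f) = S_a \cdot I(f), \qquad S_a = \min\!\left(1, \frac{A_{dynamic}}{A_{max}}\right) $$

where \(I(f)\) is the raw frame importance and \(A_{max}\) normalizes the clip dynamics to \([0, 1]\).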
Media Analysis: Importance Functions • Video segments with higher scores may contain human faces, have higher motion strength, or contain zooms and pans, depending on which features drive their scores high.
Media Analysis: Importance Functions • Shot-level importance. The shot-level importance is motivated by the following observations: • Shots with larger motion intensity deserve longer duration. • The presence of a face attracts viewers. • Shots of higher heterogeneity can take longer playing time. • Shots with more camera operations are more important. • Of course, longer shots are more important. • Static shots should play shorter, while dynamic shots can play longer; this gives better results after editing. A weighted combination of these cues is sketched below.
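A sketch of a shot-level importance score built from the observations above, assuming per-shot feature values normalized to [0, 1]; the weights are illustrative, not the paper's.

```python
def shot_importance(motion, face_ratio, heterogeneity, camera_op, length,
                    w=(0.3, 0.25, 0.2, 0.15, 0.1)):
    # Camera operations ranked as on the slide: Zoom > Pan > None.
    op_score = {"zoom": 1.0, "pan": 0.5, "none": 0.0}[camera_op]
    feats = (motion, face_ratio, heterogeneity, op_score, length)
    # Weighted sum of the normalized cues.
    return sum(wi * fi for wi, fi in zip(w, feats))
```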
Synchronization: Profiles [Table: general properties of home videos.] [Table: the four proposed profiles.]
Synchronization: Mechanisms • Before describing the synchronization process, we first introduce the video reduction rate R_va. • Basic synchronization mechanisms: [Diagram: the original video’s timeline (n shots) is reduced to the summarized video’s timeline (n’ shots), which is aligned with the input audio’s timeline (m clips) to form the proposed summarized musical video.]
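The definition of R_va did not survive extraction; one plausible reading, assuming the summary must exactly cover the music's length:

$$ R_{va} = \frac{T_{video}}{T_{audio}} $$

so that, on average, every second of output must be drawn from \(R_{va}\) seconds of raw footage.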
Synchronization: Rhythmic Profile [Diagram: audio timeline with clip boundaries and audio attacks delimiting consecutive BSUs.] • A basic synchronization unit (BSU) • consists of a starting time and a stopping time in the audio • e.g., an audio segment from the 25th to the 31st second. • In the medium profile, we use the LBSU (Larger BSU), which may be 2 or 3 BSUs in length.
Synchronization: Rhythmic Profile • For each BSU, its starting and stopping points are projected back onto the video timeline (e.g., a 25s–31s BSU on the audio timeline maps to a 121s–181s range of video). • We search the projected range for candidate shots with the same length as the BSU. • We apply an audio scaling coefficient in the synchronization stage: the weight of a video shot’s motion intensity is decreased when it is aligned with a slow audio clip, and nearly preserved when it is synchronized with a fast audio clip.
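A sketch of this alignment step, assuming each BSU is an (audio_start, audio_stop) pair in seconds and each shot carries precomputed scores; the projection and the scoring rule are simplified readings of the slide, not the paper's exact algorithm.

```python
def align_bsu(bsu, shots, r_va, audio_dynamic, motion_weight=1.0):
    a_start, a_stop = bsu
    need = a_stop - a_start                      # required shot length (s)
    # Project the BSU back onto the (longer) video timeline.
    v_start, v_stop = a_start * r_va, a_stop * r_va
    # Audio scaling coefficient: slow music damps the motion weight,
    # fast music nearly preserves it.
    w_motion = motion_weight * min(1.0, audio_dynamic)
    best, best_score = None, float("-inf")
    for shot in shots:
        if shot["start"] < v_start or shot["end"] > v_stop:
            continue                             # outside projected range
        if shot["end"] - shot["start"] < need:
            continue                             # too short for the BSU
        score = shot["base_score"] + w_motion * shot["motion"]
        if score > best_score:
            best, best_score = shot, score
    return best
```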
Synchronization: Medium Profile • Each shot is reassigned a new length according to its shot importance; shots may become longer or shorter in proportion to the total length. • After projecting the LBSU onto the video timeline (e.g., a 6-second LBSU at 25s–31s maps to 121s–181s of video, allocated as 1s + 2s + 2s + 1s across the inner shots), the length budget is calculated according to the reduction rate; the budget is then allocated to the inner shots in proportion to their lengths. • If a shot’s allocated length is too short (< 30 frames), its budget is transferred to neighboring shots.
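A sketch of this budget allocation, assuming 30 fps and shot lengths in frames; the redistribution rule for too-short shots is a simplified reading of the slide.

```python
def allocate_budget(shot_lengths, r_va, min_frames=30):
    total = sum(shot_lengths)
    budget = total / r_va                 # frames available inside the LBSU
    # Allocate the budget in proportion to each inner shot's length.
    alloc = [budget * (l / total) for l in shot_lengths]
    # Shots allotted fewer than min_frames hand their share to a neighbor.
    for i in range(len(alloc)):
        if 0 < alloc[i] < min_frames and len(alloc) > 1:
            j = i + 1 if i + 1 < len(alloc) else i - 1
            alloc[j] += alloc[i]
            alloc[i] = 0.0
    return alloc
```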
Demonstration: Sample Videos
Experimental Results • We invited 20 people to join this subjective test: 10 with a computer-science background and 10 without. [Table: the users’ patience test results.] [Table: performance results of the music-driven summarization.]
Experimental Results [Table: answers comparing the rhythmic and medium profiles.] [Table: answers on how well the video matches the audio tempo.]
Conclusions • We have proposed and implemented a music-driven video summarization system that helps home users post-process their footage fully automatically. • Many content-aware mechanisms are also proposed to analyze the input media; we combine the input video and audio according to their content features to form our musical videos. • In our subjective tests, all of the testers were impressed by the system, and most would be glad to have such a tool to help edit their footage. • Moreover, our proposed system and content-aware mechanisms have been adopted by CyberLink Corp., and commercialization is planned.
Future Work • It would be better to have users’ feedback telling us which shots are “must have”, which are “nice to have”, and which should be dropped. • Our work includes proper transition effects between video shots, but we think the transition effects should consider the characteristics of both the accompanying audio clip and the video content. • By exploiting more audio and video features and gaining a deeper understanding of digital content semantics, we can obtain even better results, and the automatic video editing system can move closer to professional editors.