Evaluating Audio Skimming and Frame Rate Acceleration for Summarizing BBC Rushes Mike Christel, Wei-Hao Lin, and Bryan Maher {christel, whlin, bsm}@cs.cmu.edu School of Computer Science Carnegie Mellon University CIVR July 8, 2008
Talk Outline • TRECVID 2007 BBC Rushes Summarization Task • Look at a few Video Summarizations • Assessment Procedure: Are they any good? • First Study: 25x, cluster, pz • Second Study (focus on acceleration): 25x, 50x, 100x, pz • Discussion
TRECVID 2007 BBC Rushes Summarization • Video summary is “a condensed version of some information, such that various judgments about the full information can be made using only the summary and taking less time and effort than would be required using the full information source” • Maximum 4% duration • Benefits of this TRECVID task: provides a reasonably large video collection to be summarized, a uniform method of creating ground truth, and a uniform scoring mechanism
BBC Rushes • 42 test videos (+ development ones) from BBC Archive • Test videos: • minimum duration 3.3 minutes • maximum duration 36.4 minutes • mean duration 25 minutes • Raw (unedited) rush video with a great deal of redundancy (repeated takes), mixed quality audio, “junk” frames
Assessment (Text Inclusions of Prior Slide) • pan left to right around table with five people eating dinner • pan right to left around table with five people sitting talking • curly haired man stands up from the table • closeup of grey haired lady, dinner table not visible • grey haired lady across dinner table, green wine bottle visible in foreground • grey haired lady across dinner table, camera pans right • grey haired lady across dinner table, green wine bottle not visible in foreground • partial view of person to the right talking to grey haired lady across dinner table • closeup of short haired man sitting, without his hands clasped together • closeup of blonde lady as she stands up, there is a fire in the background • closeup of curly haired man without a hand on his face • closeup of curly haired man as he stands up
Assessment Metrics • Duration (DU, <= 4% of the target video's duration) • Assessor time-on-task (TT) judging which ground truth segments were included in the summary • The fraction of listed text segments from the full video included in the summary, as judged by the assessor (IN) • Ease of use in finding desired content (EA) • How redundant the summary was (RE) • …the ideal summary would have the smallest DU and TT necessary to achieve sufficient IN performance with high user satisfaction based on subjective EA and RE
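To make the metric definitions above concrete, here is a minimal sketch of how the per-summary numbers could be tallied; it is illustrative only (not the NIST scoring code), and every field and function name is hypothetical.

```python
# Illustrative sketch of tallying the per-summary metrics described above;
# not the NIST scoring code, and all names are hypothetical.
from dataclasses import dataclass

@dataclass
class Judgment:
    summary_duration_s: float   # DU: duration of the summary in seconds
    target_duration_s: float    # duration of the full rush video in seconds
    time_on_task_s: float       # TT: assessor judging time in seconds
    ground_truth_count: int     # number of listed text segments for this video
    judged_included: int        # how many of them the assessor found in the summary
    ease_of_use: int            # EA: subjective rating
    redundancy: int             # RE: subjective rating

def score(j: Judgment) -> dict:
    """Return DU (as a fraction of the target), TT, IN, EA, and RE for one summary."""
    return {
        "DU_fraction": j.summary_duration_s / j.target_duration_s,  # should be <= 0.04
        "TT_seconds": j.time_on_task_s,
        "IN": j.judged_included / j.ground_truth_count,              # fraction recalled
        "EA": j.ease_of_use,
        "RE": j.redundancy,
    }

# Example: a 60 s summary of a 25-minute (1500 s) rush video
print(score(Judgment(60, 1500, 110, 12, 9, 3, 2)))
```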
First Study: cluster, 25x, pz • cluster: based on iterative color clustering with junk frame removal, backfilling of unused space, and audio coherence • pz: cluster-based, but uses the domain knowledge that pans/zooms are important, keeping pan or zoom sequences in 1-3 second runs to represent clusters • 25x: select every 25th frame of the target video to produce a 4% (1/25) video summary with apparent 25x playback (using the same coherent audio as cluster) – note that no junk frame filtering is used
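A minimal sketch of the 25x baseline, assuming OpenCV for decoding/encoding and ignoring the audio track entirely; this illustrates frame-rate acceleration by subsampling, not the authors' actual pipeline.

```python
# Minimal sketch of the 25x frame-subsampling baseline (not the authors' code):
# keep every 25th frame and write at the original frame rate, giving a 1/25
# (4%) duration summary with apparent 25x playback. Audio is not handled here.
import cv2  # assumes OpenCV is installed

def accelerate(src_path: str, dst_path: str, factor: int = 25) -> None:
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % factor == 0:   # every 25th (or 50th/100th) frame survives
            out.write(frame)
        index += 1

    cap.release()
    out.release()

# accelerate("rush_video.mpg", "summary_25x.mp4", factor=25)
```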
Participants and Results, Study 1 • 4 CMU students and staff following the NIST procedure
Study 1 Discussion • 25x excellent method to produce summary for high IN • TT metric for 25x also high • RE metric poor for 25x (but inversely related to IN…) • EA for 25x better than cluster (perhaps helped by audio) • Subjective metrics TT, RE, and EA best for pz
Question Leading to Study 2 How fast is too fast? (see [Wildemuth 2003] cited in paper) 25x? 50x? 100x? Will “pz” differentiate from these?
Second Study: 25x, 50x, 100x, pzA • 25x: as before (every 25th frame, coherent audio) • 50x: select every 50th frame of target video to produce 2% (1/50) video summary with apparent 50x playback • 100x: select every 100th frame, 1% summary • pzA: as before but with audio same as 25x audio, filled to be a 4% summary
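For a sense of scale, applying these factors to the mean 25-minute rush video gives the following summary lengths; this is simple arithmetic, not a result from either study.

```python
# Summary length implied by each acceleration factor for a 25-minute
# (1500 s) rush video; simple arithmetic, not figures from the study.
target_s = 25 * 60
for factor in (25, 50, 100):
    summary_s = target_s / factor
    print(f"{factor}x -> {summary_s:.0f} s summary ({100 / factor:.0f}% of the source)")
# 25x -> 60 s (4%), 50x -> 30 s (2%), 100x -> 15 s (1%)
```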
Participants and Results, Study 2 • 15 subjects (8 female, 7 male; age range 21-35, mean age 25.7) following the NIST procedure
Study 2 Discussion • 25x excellent method to produce summary for high IN (0.73) • 50x also excellent for high IN (0.68), significantly better than pzA and 100x • TT metric for 25x also high: 25x and pzA (the two 4% summaries) both significantly slower to assess than 50x and 100x • RE metric shows 25x worse than pzA • EA for 100x worse than the others (though 100x has the fastest TT) • 50x produces excellent IN performance at 2/3 the time cost (TT) of 25x • 100x is too fast: IN significantly worse than 50x, EA poor
Discussion • We believe inclusion of audio narrative along with sped-up video made 25x and 50x more playable; at 100x the audio becomes too short/choppy to contribute well • 15 subjects for Study 2 not as careful as NIST or Study 1 assessors, e.g., TT of 77.5 vs. 110 or 102 seconds • If these 15 better reflect true users, time savings important (and hence TT is important metric) • How will 50x hold up as a baseline? (To be discussed in the context of TRECVID 2008 BBC rushes summarization task – it does well on IN, poor on TT, RE)
Conclusions • For BBC rushes, 50x works quite well • Domain knowledge (here, attempting to preserve pans/zooms) did not distinguish itself • Improve detector for "significant" pans/zooms • Sacrifice coverage for pan/zoom inclusion • Interactive summary control is an area of promise, e.g., 50x until a neighborhood of interest is found, then pz to see pans/zooms and more detail Thanks to NIST, BBC, and TRECVID organizers for making this investigation possible. This work was supported by the National Science Foundation under Grant Nos. IIS-0205219 and IIS-0705491.