Lessons Learned From Building a Terabyte Digital Video Library Presented by Jia Yao Multimedia Communications and Visualization Laboratory Department of Computer Engineering & Computer Science University of Missouri-Columbia Columbia, MO 65211
Lessons Learned From Building A Terabyte Digital Video Library • Howard D. Wactlar, Michael G. Christel, Yihong Gong and Alexander G. Hauptmann, "Lessons Learned From Building A Terabyte Digital Video Library," IEEE Computer, vol. 32, no. 2, pp. 66-73, Feb. 1999. • The Informedia Project at Carnegie Mellon University, begun in 1994, was one of six projects funded by the US National Science Foundation, the US Defense Advanced Research Projects Agency, and the National Aeronautics and Space Administration under the US Digital Library Initiative.
Lessons Learned From Building a Terabyte Digital Video Library • Challenges of building such a video library: • how to embed information • how to handle the voluminous file sizes • how to deal with the temporal characteristics of video • This paper discusses: • automatically extracting information from digitized video • creating interfaces that allow users to search for and retrieve videos based on the extracted information • validating the system through user testbeds
Video Processing • Two types of video data: news video and documentary video • Video retrieval is done by integrating speech processing, image processing and information retrieval techniques • Speech processing: • use the CMU Sphinx speech recognition system to generate a complete transcript of the speech in the video • Sphinx's word error rate is inversely proportional to the amount of processing time; by running the algorithm on parallel machines, excellent results can be obtained in two to three times real time • the error rate can be further lowered by using general language models
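To make the word-error-rate figures concrete, below is a minimal sketch of the standard WER computation (word-level edit distance between a reference transcript and the recognizer output). This is generic illustrative code, not part of the Informedia system.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: two errors against five reference words -> WER = 0.4
print(word_error_rate("the library holds one terabyte",
                      "the library holds one terror bite"))
```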
Video Processing • Information retrieval: • although the word error rate is as high as 30%, information retrieval precision and recall were degraded by less than 10% • redundancy in language helps the retrieval of video based on speech recognition • use of phonetic transcription will also help to reduce the error rate • it provides more match candidates, so possible matching objects are not lost • problem: training based on a small amount of speech data might not be sufficient • problem: errors in the automatic partitioning of video streams into segments may affect information retrieval effectiveness
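The robustness to transcript errors comes largely from redundancy: a query term usually occurs several times in a relevant segment, so missing one occurrence rarely loses the match. A minimal sketch of ranked retrieval over noisy transcripts (plain TF-IDF scoring, illustrative only, not the Informedia engine):

```python
import math
from collections import Counter

def build_index(transcripts):
    """transcripts: {segment_id: transcript text}. Returns per-segment
    term counts plus document frequencies for IDF weighting."""
    tf = {sid: Counter(text.lower().split()) for sid, text in transcripts.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df, len(transcripts)

def search(query, tf, df, n_docs, top_k=5):
    """Rank segments by TF-IDF overlap with the query. Because query words
    typically recur within a relevant segment, a single recognition error
    does not remove the segment from the results."""
    terms = query.lower().split()
    scores = {}
    for sid, counts in tf.items():
        score = sum(counts[t] * math.log((n_docs + 1) / (df[t] + 1))
                    for t in terms if t in counts)
        if score > 0:
            scores[sid] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```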
Video Processing • Image processing • the task is to fully characterize the scene and all objects within it and to provide efficient, effective access to this information through indices and similarity measures • currently image processing techniques are used to: • partition each video segment into shots and choose a representative frame (key frame) for each shot -- usually the middle frame of the shot, but possibly the last frame if the shot contains camera motion • shot: a video clip recorded with one continuous camera operation • segment: several shots describing one topic • identify and index features to support image similarity matching -- detect face regions in news video; retrieve by region of interest and color • create metadata (data describing the structure of the raw data) derived from imagery for use in text matching -- Video OCR: first find the captions in the video, then use OCR software to extract the text from them
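As an illustration of the shot-partitioning and key-frame steps, here is a minimal sketch using OpenCV color-histogram differences between consecutive frames. The similarity threshold is an assumed value, and the paper's actual detector (including the camera-motion rule for key-frame choice) is more involved.

```python
import cv2  # pip install opencv-python

def detect_shots(video_path, threshold=0.6):
    """Partition a video into shots by thresholding color-histogram
    similarity between consecutive frames (threshold is an assumed value).
    Returns a list of (start_frame, end_frame) pairs."""
    cap = cv2.VideoCapture(video_path)
    shots, prev_hist, start, idx = [], None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:            # low similarity => shot boundary
                shots.append((start, idx - 1))
                start = idx
        prev_hist = hist
        idx += 1
    cap.release()
    if idx > 0:
        shots.append((start, idx - 1))
    return shots

def key_frame(shot):
    """Middle frame of a shot, as described in the slide."""
    start, end = shot
    return (start + end) // 2
```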
Video Processing • future challenges: • content-based retrieval • effective segmentation of video into story segments, and then breaking each story into shots -- need to use transcript information (such as closed captions) and language models to help segmentation
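As a rough illustration of how closed captions can drive story segmentation, the sketch below splits timed caption lines at ">>>" markers (assuming the common US closed-caption convention that ">>>" marks a topic change) and then groups shots into the resulting segments by start time. This is an illustrative assumption, not the paper's method.

```python
def caption_segments(captions):
    """captions: list of (start_seconds, text) in time order.
    A line beginning with '>>>' is treated as the start of a new story."""
    boundaries = [t for t, text in captions if text.lstrip().startswith(">>>")]
    if not boundaries or boundaries[0] != captions[0][0]:
        boundaries.insert(0, captions[0][0])   # implicit first segment
    return boundaries  # segment start times

def assign_shots(shots, boundaries):
    """shots: list of (start_seconds, end_seconds). Each shot joins the story
    segment whose start time most recently precedes the shot's start."""
    segments = {b: [] for b in boundaries}
    for start, end in shots:
        seg = max((b for b in boundaries if b <= start), default=boundaries[0])
        segments[seg].append((start, end))
    return segments
```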
Informedia Interface • Because the underlying speech, image and language processing (referred to earlier in the paper as information retrieval) is imperfect and produces ambiguous, incomplete metadata, powerful browsing capabilities are essential in a multimedia information retrieval system • The use of headlines: • a search result contains several video segments displayed as thumbnail images; the headline of a segment pops up when the mouse moves over its thumbnail • phrases are evaluated using a statistical approach, then used as components of the headlines • significant information is given first in the headline, followed by explanation • segment size and recording date are also provided
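A minimal sketch of the statistical phrase-scoring idea behind the headlines: score candidate word n-grams from a segment's transcript by how much more frequent they are in the segment than across the collection, and build the headline from the top-scoring phrases. This is a generic TF-IDF-style illustration, not the paper's exact method.

```python
import math
from collections import Counter

def candidate_phrases(words, max_len=3):
    """All word n-grams (n = 1..max_len) from a tokenized transcript."""
    return [tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def headline_phrases(segment_words, corpus_counts, n_segments, top_k=3):
    """Rank phrases by frequency in this segment weighted by rarity across
    the collection (corpus_counts[phrase] = segments containing the phrase)."""
    local = Counter(candidate_phrases(segment_words))
    def score(phrase):
        idf = math.log((n_segments + 1) / (corpus_counts.get(phrase, 0) + 1))
        return local[phrase] * idf * len(phrase)   # favor longer phrases
    return sorted(local, key=score, reverse=True)[:top_k]
```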
Informedia Interface • The use of thumbnails: • the thumbnail for each segment has to be chosen carefully to get good performance • using the key frame of the first shot in the segment -- results not good • using the key frame of the shot most related to the query -- good • The use of filmstrips: • key frames from a segment's shots can be presented in sequential order as a filmstrip • quickly shows the content of a segment • a match bar (showing matched query words) lets the user quickly find a location of interest, jump directly to that point and start playback • transcripts can help hearing-impaired users
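The "most related shot" rule for thumbnails can be sketched very simply: count query-word hits in the text spoken during each shot and use the key frame of the best-scoring shot; the same per-shot counts can drive the filmstrip match bar. The shot structure and field names below are assumptions for illustration.

```python
def best_shot_thumbnail(shots, query):
    """shots: list of dicts with 'key_frame' and 'text' (words spoken during
    the shot). Returns the key frame of the shot that best matches the query,
    falling back to the first shot's key frame when nothing matches."""
    terms = set(query.lower().split())
    def hits(shot):
        return sum(1 for w in shot["text"].lower().split() if w in terms)
    best = max(shots, key=hits)
    return best["key_frame"] if hits(best) > 0 else shots[0]["key_frame"]

def match_bar(shots, query):
    """Per-shot query-hit counts, e.g. to draw the filmstrip match bar."""
    terms = set(query.lower().split())
    return [sum(1 for w in s["text"].lower().split() if w in terms) for s in shots]
```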
Informedia Interface • The use of skims: • a skim incorporates both video and audio information from a longer source, so that a two-minute skim can represent a 20-minute original video (see the sketch after this slide) • originally, subsampled video was used as the skim -- performance not very good • improvement: select frames based on phrases rather than individual words in the transcript • Conclusion: • interaction is an important part of a digital video library • integrated audio, image and language search can help to reduce the limitations of the individual methods • phrases play an important role in speech understanding
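Following up on the skims described above, here is a rough sketch of phrase-based skim assembly: take the video and audio around the highest-scoring phrase occurrences (scored by any method, e.g. the headline scorer above) until a target duration such as 2 minutes out of 20 is reached. The timing fields, padding, and 10:1 target are illustrative assumptions, not the paper's algorithm.

```python
def build_skim(phrases, target_seconds=120, context=3.0):
    """phrases: list of (score, start_seconds, end_seconds) for phrase
    occurrences in the source video. Returns (start, end) clips, taken in
    descending score order and padded by `context` seconds, until the
    target skim length is filled."""
    clips, total = [], 0.0
    for score, start, end in sorted(phrases, reverse=True):
        clip = (max(0.0, start - context), end + context)
        length = clip[1] - clip[0]
        if total + length > target_seconds:
            break
        clips.append(clip)
        total += length
    return sorted(clips)   # play back in original temporal order
```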