Associating Video Frames with Text
Pinar Duygulu and Howard D. Wactlar
Informedia Project, Carnegie Mellon University
ACM SIGIR 2003
Abstract • Integrate visual and textual data to annotate video frames with more reliable labels and descriptions • Solve the correspondence problem between video frames and the associated text using their joint statistics • Better annotations can improve the performance of text-based queries
Introduction (1/2) • Video retrieval • visual vs. textual features • A system that combines both kinds of features is more powerful • Images/videos with descriptive text: the Corel data set, some museum collections, and news photographs with captions on the web • Correspondence problem • Several methods have been proposed that model the joint statistics of words and image regions
Introduction (2/2) • Correspondence problems in video data • The transcript and the frames it describes may not co-occur at the same time • e.g., the query "president" issued to the Informedia system • Goal • Determine the correspondence between video frames and the associated text, so that the frames can be annotated with more reliable descriptions
Multimedia Translation (1/3) • Analogy • learning a lexicon for machine translation vs. learning a correspondence model for associating words with image regions • Missing-data problem • Assuming an unknown one-to-one correspondence between words, the joint probability distribution linking words in the two languages must be estimated from incomplete data • handled with the EM algorithm
Multimedia Translation (2/3) • Method • A set of images and a set of associated words • Each image is segmented into regions, and from each region a set of features (color, texture, shape, position, and size) is extracted • Vector-quantize the region features using k-means (see the sketch below) • Each region then gets a single label (blob token) • Then construct a probability table that links blob tokens with word tokens
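A minimal sketch of the vector-quantization step, assuming scikit-learn's KMeans is available; the function name, array shapes, and default k are illustrative, not from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_blob_tokens(region_features, k=500):
    """Cluster region feature vectors; each cluster index is a blob token.

    region_features: (n_regions, n_dims) array, one row per image region.
    Returns the fitted quantizer and one blob token per region.
    """
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    blob_tokens = kmeans.fit_predict(region_features)
    return kmeans, blob_tokens

# Regions of a new image are labeled with the nearest cluster center:
# kmeans.predict(new_region_features) -> blob token ids in [0, k)
```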
Multimedia Translation (3/3) • Method (cont.) • The table is initialized with the co-occurrence counts of blobs and words • The final translation probability table is constructed with the EM algorithm, which iterates between two steps: • use the current estimate of the probability table to predict correspondences • then use the correspondences to refine the estimate of the probability table • Once learned, the table is used to predict the words corresponding to a particular image • A sketch of this loop follows below
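A minimal sketch of this EM loop in the style of IBM Model 1, which this translation approach builds on; the function names, data layout, and smoothing constant are my own, and the paper's exact update may differ:

```python
import numpy as np

def em_translation_table(pairs, n_blobs, n_words, n_iters=20):
    """pairs: list of (blob_ids, word_ids), one pair per image.
    Returns t, where t[b, w] ~ p(word w | blob b)."""
    # Initialize from blob-word co-occurrence counts, as the slide describes.
    t = np.ones((n_blobs, n_words))
    for blobs, words in pairs:
        for b in blobs:
            for w in words:
                t[b, w] += 1.0
    t /= t.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        counts = np.zeros_like(t)
        for blobs, words in pairs:
            for w in words:
                # E-step: posterior over which blob in this image produced w
                p = t[list(blobs), w]
                p = p / p.sum()
                for b, pb in zip(blobs, p):
                    counts[b, w] += pb
        # M-step: renormalize the expected counts into probabilities
        t = counts + 1e-9
        t /= t.sum(axis=1, keepdims=True)
    return t
```

Annotation then amounts to reading off, for each blob token b in a keyframe, the words w with the largest t[b, w].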
Correspondences on Video (1/3) • Broadcast news is a very challenging data set • By its nature it is mostly about people and requires person detection/recognition • Data set • Subsets of the Chinese culture and TREC 2001 data sets, which are relatively simpler • Consists of video frames and the associated transcript extracted from the audio (Sphinx-III speech recognizer) • Frames and transcripts are associated on a shot basis
Correspondences on Video (2/3) • Keyframe • Segmented into regions by a fixed-size grid • A feature vector of size 46 represents each region (a sketch follows below) • Position (2): (x, y) of the region center • Color (12): mean and variance of the HSV and RGB channels • Texture (32): mean and variance of 16 filter responses • Four difference-of-Gaussian filters with different sigmas and twelve oriented filters, aligned in 30-degree increments
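A hedged sketch of how such a 46-dimensional vector could be assembled with OpenCV; the layout matches the counts above (2 + 12 + 32 = 46), but the filter bank and normalization details are assumptions:

```python
import numpy as np
import cv2  # OpenCV, assumed available

def region_features(region_bgr, cx, cy, filter_bank):
    """Build a 46-dim vector with the layout described on the slide:
    2 position + 12 color (mean/var of RGB and HSV channels)
    + 32 texture (mean/var of 16 filter responses)."""
    feats = [cx, cy]
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    for img in (region_bgr, hsv):
        for c in range(3):
            ch = img[:, :, c].astype(np.float64)
            feats += [ch.mean(), ch.var()]
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    for kern in filter_bank:  # 16 kernels assumed: 4 DoG + 12 oriented
        resp = cv2.filter2D(gray, -1, kern)
        feats += [resp.mean(), resp.var()]
    return np.array(feats)  # shape (46,)
```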
Correspondences on Video (3/3) • Vocabulary • Consists only of nouns, extracted by applying Brill's tagger to the transcript • The vocabulary still contains noisy words
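For illustration, a minimal noun-extraction sketch; NLTK's default perceptron tagger stands in for Brill's tagger (the paper's tool), so outputs may differ:

```python
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

def extract_nouns(transcript):
    """Keep only noun tokens (NN* tags), mirroring the vocabulary step."""
    tagged = nltk.pos_tag(nltk.word_tokenize(transcript))
    return [w.lower() for w, tag in tagged if tag.startswith("NN")]

# extract_nouns("The president visited the Great Wall today")
# -> ['president', 'great', 'wall', 'today']
```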
TREC 2001 Data (1/3) • TREC 2001 data set • 2232 keyframes and 1938 nouns • Difference from still images: for video frames, the text of the surrounding frames is also considered by setting the window size to five (a windowing sketch follows below) • Process • Each image is divided into 7 × 7 blocks (49 regions) • The feature space is vector-quantized using k-means (k = 500) • EM is applied to obtain the final translation probabilities between 500 blob tokens and 1938 word tokens
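A minimal sketch of the windowed text association, under the assumption that the window is symmetric around the shot; the paper's exact windowing convention may differ:

```python
def words_for_shot(shot_idx, shot_words, window=5):
    """Attach to a shot the words of the surrounding shots, since the
    transcript and the frames it describes may be offset in time.

    shot_words: list where shot_words[i] holds words aligned to shot i.
    """
    lo = max(0, shot_idx - window)
    hi = min(len(shot_words), shot_idx + window + 1)
    merged = []
    for i in range(lo, hi):
        merged.extend(shot_words[i])
    return sorted(set(merged))
```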
TREC 2001 Data (2/3) • Example annotation results for TREC 2001 data
TREC 2001 Data (3/3) • Experimental results (Statue of Liberty): before vs. after
Chinese Culture Data • Example: "great wall" • 3785 shots and 2597 words • After pruning, 2785 shots and 626 words
Chinese Culture Data • Experimental results (panda, wall, emperor)
Chinese Culture Data • Evaluating the results on a larger scale: 189 images for the word "panda"
Chinese Culture Data • The rank of the word "panda" as the predicted word for the corresponding frames • Red: test set • Green: training set • Problem: frames showing a woman co-occur strongly with the word "panda"
Chinese Culture Data • The effect of window size • a single shot, versus a window size of 1, 2, or 3 • Recall: number of correct predictions over the number of times the word occurs in the data • Precision: number of correct predictions over the total number of predictions • A sketch of both measures follows below
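A minimal sketch of these per-word measures; the data layout (a predicted word set and a ground-truth word set per frame) is an assumption:

```python
def word_recall_precision(word, predicted, actual):
    """Per-word recall and precision, per the definitions above.

    predicted: list of sets, the predicted words for each frame.
    actual: list of sets, the words actually associated with each frame.
    """
    correct = sum(1 for p, a in zip(predicted, actual)
                  if word in p and word in a)
    n_occurrences = sum(1 for a in actual if word in a)
    n_predictions = sum(1 for p in predicted if word in p)
    recall = correct / n_occurrences if n_occurrences else 0.0
    precision = correct / n_predictions if n_predictions else 0.0
    return recall, precision
```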
Chinese Culture Data • Experimental results for the effect of window size: a single shot vs. window size = 3
Chinese Culture Data • Experimental results of the effect of window size (cont.) • For some selected words
Discussion and Future Work • Discussion • Solved the correspondence problem between video frames and the associated text • Relatively simple and small data sets were used • Broadcast news is a harder data set, since there are terabytes of video and it requires focusing on people
Discussion and Future Work • Better visual features • detectors (e.g., a face detector) • motion information • a segmenter that uses temporal information to segment moving objects • Text • noun phrases or compound words • more lexical analysis