Improved Video Categorization from Text Metadata and User Comments
ACM SIGIR 2011: Research and Development in Information Retrieval
Katja Filippova, Keith B. Hall
Presenter: Viraja Sameera Bandhakavi
Contributions
• Analyze text sources such as the title, description, and comments, and show that they provide valuable indications of a video's topic
• Show that a text-based classifier trained on the imperfect predictions of a weakly supervised, video content-based classifier is not redundant
• Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently
Research Questions Not Answered by Related Work
• Can a classifier learn from the imperfect predictions of a weakly supervised classifier? Is its accuracy comparable to the original one? Can a combination of the two classifiers outperform either one?
• Do the video-based and text-based classifiers capture different semantics?
• How useful is user-provided text metadata? Which source is the most helpful?
• Can reliable predictions be made from user comments? Can they improve the performance of the classifier?
Methodology
• Builds on top of the predictions of Video2Text
• Uses Video2Text because it:
  • Requires no labeled data other than video metadata
  • Clusters similar videos and generates a text label for each cluster
  • Produces a label set that is larger and better suited for categorizing video content on YouTube
Video2Text
• Starts from a set of weak labels based on the video metadata (sketched below)
• Creates a vocabulary of concepts (unigrams or bigrams from the video metadata)
• Every concept is associated with a binary classifier trained from a large set of audio and video signals
• Positive instances: videos that mention the concept in the metadata
• Negative instances: videos that do not mention the concept in the metadata
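The weak labeling step can be pictured with a short sketch. This is only an illustration; the metadata field names and the simple substring test are assumptions, not the paper's implementation:

```python
# Illustrative only: split videos into weak positives/negatives for one
# concept by checking whether the concept appears in the metadata text.
# The field names ("title", "description", "keywords") are assumptions.

def weak_labels(videos, concept):
    positives, negatives = [], []
    for video in videos:
        metadata = " ".join(
            [video["title"], video["description"]] + video["keywords"]
        ).lower()
        if concept in metadata:
            positives.append(video)   # concept mentioned in the metadata
        else:
            negatives.append(video)   # concept absent from the metadata
    return positives, negatives
```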
Procedure
• A binary classifier is trained for every concept in the vocabulary
• Accuracy is assessed on a portion of a validation dataset; each iteration uses a subset of unseen videos from the validation set
• A classifier and its concept are retained only if precision and recall are above a threshold (0.7 in this paper); see the sketch below
• The retained classifiers are used to update the feature vectors of all videos
• This is repeated until the vocabulary size stops changing much or the maximum number of iterations is reached
• Finer-grained concepts are learned from the concepts added in the previous iteration
• Labels related to news, sports, film, etc. are grouped together, resulting in a final set of 75 two-level categories
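A minimal sketch of one pruning iteration, assuming the trainer and evaluator are supplied as callables (they stand in for the paper's audio/video classifier machinery):

```python
# Minimal sketch of one iteration: train a binary classifier per concept
# and keep it only if precision AND recall on unseen validation videos
# clear the threshold (0.7 in the paper). `train` and `evaluate` are
# stand-ins for the paper's audio/video-signal classifiers.

def prune_vocabulary(vocabulary, train, evaluate, threshold=0.7):
    kept = {}
    for concept in vocabulary:
        classifier = train(concept)
        precision, recall = evaluate(classifier)  # on an unseen validation subset
        if precision > threshold and recall > threshold:
            kept[concept] = classifier  # its output becomes a feature next round
    return kept  # retained classifiers update the videos' feature vectors
```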
Categorization with Video2Text
• Use Video2Text to assign two-level categories to videos
• The total number of binary classifiers (and hence labels) is limited to 75
• The output of Video2Text is represented as a list of triples (v_i, c_j, s_ij), where s_ij is the score of category c_j for video v_i
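As a concrete, hypothetical representation, each output triple can be stored as a small record (the field values below are illustrative):

```python
from dataclasses import dataclass

# Hypothetical representation of one Video2Text output triple.

@dataclass
class Prediction:
    video_id: str   # v_i
    category: str   # c_j, one of the 75 two-level categories
    score: float    # s_ij, the score of category c_j for video v_i

example = Prediction(video_id="abc123", category="Sports/Soccer", score=0.91)
```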
Distributed MaxEnt
• The approach automatically generates training examples for the category classifier
• Uses the conditional maximum entropy optimization criterion to train the classifiers
• Results in a conditional probability model over the classes given the YouTube videos
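Conditional MaxEnt with token features is equivalent to multinomial logistic regression, so a single-machine stand-in can be sketched with scikit-learn; the paper's trainer is distributed, and the data below is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Conditional MaxEnt over classes given a video's text, approximated here
# by multinomial logistic regression on a toy dataset.

texts = ["T:xbox D:xbox gameplay", "T:goal C:amazing match"]
labels = ["Gaming", "Sports"]

vectorizer = CountVectorizer(token_pattern=r"\S+", binary=True)
X = vectorizer.fit_transform(texts)

maxent = LogisticRegression(max_iter=1000)
maxent.fit(X, labels)
probs = maxent.predict_proba(X)  # P(class | video text)
```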
Data and Models
• Text models differ in the text sources from which the features are extracted: title, description, comments, etc.
• All features are token based
• Infrequent tokens are filtered out to reduce the feature space:
  • Token frequencies are calculated over 150K videos
  • Every unique token is counted once per video
  • A token frequency threshold of 10 is used
• Tokens are prefixed with the first letter of the source where they were found, e.g., T:xbox, D:xbox, U:xbox, C:xbox (see the sketch below)
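A sketch of the prefixed-token featurization; the expansion of the source letters (in particular "U") and the corpus-frequency lookup are assumptions made for illustration:

```python
# Sketch only: tag each token with the first letter of its source and keep
# tokens whose corpus frequency (counted once per video over 150K videos)
# is at least 10. Source names and `corpus_freq` are assumptions.

def extract_features(video, corpus_freq, threshold=10):
    sources = {
        "T": video["title"],
        "D": video["description"],
        "U": video["user_keywords"],   # expansion of "U" is a guess
        "C": " ".join(video["comments"]),
    }
    features = set()  # a set, so each unique token counts once per video
    for prefix, text in sources.items():
        for token in text.lower().split():
            tagged = f"{prefix}:{token}"              # e.g. T:xbox, C:xbox
            if corpus_freq.get(tagged, 0) >= threshold:
                features.add(tagged)                  # drop infrequent tokens
    return features
```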
Combined Classifier
• Used to test whether combining the two views, video based and text based, is beneficial
• A simple meta-classifier ranks the video categories based on the predictions of the two classifiers:
  • The video-based predictions are converted to a probability distribution
  • This distribution and the one from MaxEnt (the maximum entropy classifier) are multiplied
  • The final prediction for each video is the category with the highest product score
• Idea: each classifier has veto power, since a near-zero probability from either one suppresses a category
• This approach proved to be effective
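The product combination itself is a one-liner; a minimal sketch, assuming both models emit distributions over the same category indices:

```python
import numpy as np

# Minimal sketch of the meta-classifier: multiply the two per-category
# probability distributions and take the argmax. A near-zero probability
# from either model vetoes a category, matching the stated intuition.

def combine(video_probs: np.ndarray, text_probs: np.ndarray) -> int:
    product = video_probs * text_probs   # elementwise, veto-style combination
    return int(np.argmax(product))       # index of the winning category

video_probs = np.array([0.6, 0.3, 0.1])  # Video2Text, as a distribution
text_probs  = np.array([0.1, 0.7, 0.2])  # MaxEnt text model
best = combine(video_probs, text_probs)  # -> 1 (the second category wins)
```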
Experiments: Evaluation of Text Models
• Training set of 100K videos that received a high-scoring prediction; a prediction counts as correct with a score of at least 0.85 from Video2Text
• A text-based prediction must be in the set of video-assigned categories to count as correct
• Evaluation was done on two sets of videos:
  • Videos with at least one comment
  • Videos with at least 10 comments
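Under these definitions, the automatic correctness check could look like this (reusing the hypothetical Prediction record sketched earlier):

```python
# Sketch of the automatic evaluation criterion: a text-model prediction is
# counted as correct if it is among the categories Video2Text assigned to
# the video with a score of at least 0.85.

def is_correct(text_category, video_predictions, cutoff=0.85):
    gold = {p.category for p in video_predictions if p.score >= cutoff}
    return text_category in gold
```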
Experiments: Evaluation of Text Models (contd.)
• The best model is TDU+YT+C for both sets
• This model is used for the comparison against the Video2Text model with human raters
• It is also used in the combination model
Experiments with Human Raters
• A total of 750 videos, drawn equally from the 15 YouTube categories
• A human rater rates each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1), or off topic (0)
• Every pair is rated by three human raters
• The three ratings are summed, normalized by dividing by 9, and rounded to get the resulting score
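The aggregation amounts to dividing the sum of the three ratings by the maximum possible sum; a small worked example:

```python
# Three ratings in {0, 1, 2, 3} are summed and divided by 9 (the maximum
# possible sum), yielding a score in [0, 1].

def normalized_score(ratings):
    assert len(ratings) == 3 and all(0 <= r <= 3 for r in ratings)
    return sum(ratings) / 9.0

normalized_score([3, 2, 3])  # -> 0.888..., well above the 0.5 cutoff
```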
Experiments with Human Raters (contd.)
• A score of at least 0.5 indicates a correct category
• The text-based model performs significantly better than the video model
• The combination model further improves accuracy
• The accuracy of all models increases with the number of comments
Conclusion
• A text-based approach for assigning categories to videos
• A competitive classifier can be trained on the high-scoring predictions of a weakly supervised classifier built on video features
• The text and video models provide complementary views of the data
• A simple combination model outperforms each model on its own
• Accurate predictions can be made from user comments
• Reasons for the impact of comments:
  • They substitute for a proper title
  • They disambiguate the category
  • They help correct wrong predictions
• Future work: investigate the usefulness of user comments for other tasks