Understanding and Describing Tennis Videos Mohak Kumar Sukhwani 201307583 Advisor: Prof. C. V. Jawahar Center for Visual Information Technology, IIIT-Hyderabad, India
Sports Video Analysis. Cricket: Temporal segmentation and annotation of actions with semantic descriptions. Snooker and volleyball: (Left) Analysis of shot trajectories and strokes. (Right) Player identification and action recognition.
Ice hockey: Player recognition and tracking on field. Soccer: Real-time football analysis, including automatic game summarization, player tracking, and highlight extraction.
Handball: Trajectory-based handball video understanding. Basketball: Tracking players under global appearance constraints.
Computer Vision and Language Processing < video slide – motivation > How will you describe it?
Visual-Semantic Alignments (Varied Approaches)
Our Approach Descriptions IN, Winner: Serena!!! High kick serve, Williams returns a backhand return, short rally, Sharapova cross-court backhand lands out-side the court.
Tennis Data New Video
Does confining the domain help? Frequency comparison of unrestricted tennis text (tennis news, blogs, etc., denoted by '*') with tennis commentaries.
Phrase Recognition Description Retrieval Action Recognition Action Localization
Dataset (a) Annotated-action. (b) Video commentary.
Text Corpus Source: Tennis Earth - http://www.tennisearth.com/.
Action Localization. Player Detection. Phrase recognition accuracy averaged over top 5 retrievals. Player detection on test videos.
Player Recognition • color-based descriptors (MPEG-7 SCD, CLD) • edge-based descriptor (MPEG-7 EHD) • color and texture information (MPEG-7-like CEDD)
Weak Learners for Action Recognition. Pipeline: feature extraction (dense trajectories), encoding and pooling (bag of words), discriminative classifier (multiclass SVM). Levels of semantics: Activity (waits for ball, serves a good one, crafts a forehand return); Action (forehand, backhand, volley).
Improved Dense Trajectories as a feature vector! Dense sampling in each spatial scale; trajectory-aligned descriptors; feature tracking. - Capture the intrinsic dynamic structures in video - MBH is robust to camera motion - Detect human body to remove spurious trajectories
What about camera motion? Separate models for upper and lower action!
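The encoding-and-pooling stage of the pipeline above can be sketched as follows. This is a minimal illustration, assuming hard assignment of each local descriptor to its nearest codeword and L1 normalisation; the function name `bow_histogram` and the toy codebook are hypothetical and not taken from the slides. The resulting histogram is what a multiclass SVM would be trained on.

```python
from math import dist


def bow_histogram(descriptors, codebook):
    """Encode local descriptors as an L1-normalised bag-of-words histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Hard assignment: index of the nearest codeword (Euclidean distance).
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    # An empty clip yields an all-zero histogram instead of dividing by zero.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy usage: three 2-D "trajectory descriptors" and a two-word codebook.
histogram = bow_histogram([(1, 1), (9, 9), (11, 10)], [(0, 0), (10, 10)])
```

In practice the codebook would come from clustering (for example k-means) over training descriptors; that step is omitted here.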
We test on tennis point videos. Diagram labels: SVM score; pairwise phrase cohesion; MRF-based temporal smoothing; retrieval module.
How about a joint model for phrase classification? - Semi-automatic process for phrase alignment. - No manual shot sampling. - No tiring action annotations.
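One common way to realise temporal smoothing on a chain-structured MRF is Viterbi decoding, with per-window classifier scores as unary terms and phrase cohesion as pairwise terms. The sketch below assumes that formulation; the slides do not specify the inference method, and the function name `smooth_labels` and the toy scores are hypothetical.

```python
def smooth_labels(unary, pairwise):
    """Return the label sequence maximising unary + pairwise scores (Viterbi).

    unary[t][j]    : score of label j at time step t (e.g. an SVM score)
    pairwise[i][j] : cohesion score for label i followed by label j
    """
    n_labels = len(pairwise)
    best = list(unary[0])
    back = []
    for scores in unary[1:]:
        prev_best, best, ptr = best, [], []
        for j in range(n_labels):
            # Best predecessor for label j at this step.
            i = max(range(n_labels), key=lambda i: prev_best[i] + pairwise[i][j])
            best.append(prev_best[i] + pairwise[i][j] + scores[j])
            ptr.append(i)
        back.append(ptr)
    # Backtrack from the best final label.
    label = max(range(n_labels), key=lambda j: best[j])
    path = [label]
    for ptr in reversed(back):
        label = ptr[label]
        path.append(label)
    return path[::-1]


# Toy usage: the middle window weakly prefers label 1, but cohesion
# between neighbouring windows pulls it back to label 0.
unary = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
cohesion = [[0.5, -0.5], [-0.5, 0.5]]
smoothed = smooth_labels(unary, cohesion)
```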
Commentary text, 9 phrase encodings: (subject), (object), (subject;verb), (object;verb), (subject;prep;object), (object;prep;object), (attribute;subject), (attribute;object) and (verb;prep;object).
Commentary: "IN, Winner: Serena!!! Huge serve. Ace !!!" Phrases: <winner Serena>, <huge serve>, <ace>. Commentary: "IN, Winner: Zvonareva !!! Good serve in the middle, Williams returns a quick forehand return, short rally, Serena cross-court fails to clear the net in the middle." Phrases: <winner Zvonareva>, <quick return>, <short rally>, <Williams return return>, <cross-court fail>.
Probabilistic Label Consistent KSVD. Input: tennis point video, sliding window. Matrices: action trajectory matrix (Y), phrase label matrix (H). Learned: optimal dictionary, sparse code. PC: phrase cluster.
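For orientation, the standard (non-probabilistic) label consistent K-SVD objective is written below. Treat it as background under stated assumptions: the slides name only Y, H, the sparse code and the dictionary, so the remaining symbols (D for the dictionary, X for the sparse codes, Q for the discriminative sparse-code targets, A and W for linear maps, the weights α and β, the sparsity level T) follow the usual LC-KSVD notation, and the probabilistic variant used here may differ.

```latex
\min_{D,\,A,\,W,\,X} \;
  \|Y - DX\|_F^2
  + \alpha \, \|Q - AX\|_F^2
  + \beta \, \|H - WX\|_F^2
\quad \text{s.t.} \quad \|x_i\|_0 \le T \;\; \forall i
```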
Commentary generation. Online: phrases + players, query representation [tf-idf/LSI]. Offline: commentary collection, document representation [tf-idf/LSI], index. Comparison function [tf-idf/LSI] matches the query against the index. Voila!
LSI for better text retrieval. SVD (n > k): from term-based indexing to latent-concept-based indexing. • Map documents (and terms) to a low-dimensional representation. • Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). • Compute document similarity based on the inner product in this latent semantic space.
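The tf-idf branch of this retrieval step can be sketched as follows. It is a minimal illustration, assuming whitespace tokenisation, a smoothed log idf and cosine similarity as the comparison function; the helper names (`tfidf_vectors`, `weigh`, `cosine`, `retrieve`) and the toy commentary lines are hypothetical, not from the slides.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """tf-idf weights for one token list; terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def tfidf_vectors(docs):
    """Offline step: return (idf, sparse tf-idf vectors) for the commentary corpus."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenised for t in set(toks))
    idf = {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}
    return idf, [weigh(toks, idf) for toks in tokenised]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_phrases, docs):
    """Online step: index of the commentary line closest to the predicted phrases."""
    idf, vectors = tfidf_vectors(docs)
    query = weigh(" ".join(query_phrases).lower().split(), idf)
    return max(range(len(docs)), key=lambda i: cosine(query, vectors[i]))


corpus = [
    "huge serve ace",
    "good serve in the middle short rally",
    "cross-court backhand lands outside the court",
]
best = retrieve(["huge serve", "ace"], corpus)
```

For a real corpus the index would be built once offline rather than inside `retrieve`; it is recomputed here only to keep the sketch short.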
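The mapping described above is usually written with a truncated SVD of the term-document matrix. The notation below is the textbook LSI formulation and is an assumption here: X is the term-document matrix, d a document column, q a query vector in term space, and k the number of latent concepts kept.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k < n
\hat{d} = \Sigma_k^{-1} U_k^{\top} d, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
\mathrm{sim}(q, d) = \hat{q}^{\top} \hat{d}
```

The inner product is often normalised to a cosine so that long commentary lines are not favoured.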
Illustration of the approach. The input sequence of videos is first translated into a set of phrases, which are then used to produce the final description.
Quantitative Comparisons Template based CNN + RNN CCA + Semantic Correlation matching CCA + SSVM
The premise is indeed true. Confining the domain does help!
Qualitative Comparisons. Youtube2Text: S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013. RNN: A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
Other Applications. 1. Smart theatrics: Narration generation for dance dramas (Ballet, Kathak, Kabuki). 2. Sports: Other sporting events (Baseball, Volleyball, Cricket).
Possible Extensions! Longer text: more realistic and exhaustive game descriptions (requires better topic modelling and retrieval methods). Data collection is a challenge: too much variation.
Ball Tracking. Tried simple Kalman filtering. How about RNNs? Will it actually help and add to content understanding?
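As a point of reference for the Kalman filtering mentioned above, here is a minimal constant-velocity filter for one image coordinate. It is a sketch under assumptions, not the tracker used in this work: the motion model, the noise parameters `q` and `r`, and the function name `kalman_track` are all illustrative. A 2-D ball track could be handled by running it on x and y independently.

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=1.0):
    """Filter noisy 1-D positions with a constant-velocity Kalman filter."""
    p, v = measurements[0], 0.0          # state: position, velocity
    a, b, c, d = 1.0, 0.0, 0.0, 1.0      # covariance P = [[a, b], [c, d]]
    estimates = [p]
    for z in measurements[1:]:
        # Predict with F = [[1, dt], [0, 1]]: x <- F x, P <- F P F^T + q I.
        p = p + v * dt
        a, b, c, d = (
            a + dt * (b + c) + dt * dt * d + q,
            b + dt * d,
            c + dt * d,
            d + q,
        )
        # Update with a position-only measurement (H = [1, 0], noise variance r).
        k0, k1 = a / (a + r), c / (a + r)
        residual = z - p
        p, v = p + k0 * residual, v + k1 * residual
        a, b, c, d = (1 - k0) * a, (1 - k0) * b, c - k1 * a, d - k1 * b
        estimates.append(p)
    return estimates


# Toy usage: a ball moving one pixel per frame along one axis.
track = kalman_track([float(t) for t in range(20)])
```

A constant-velocity model cannot follow bounces or racket hits, which is one reason a learned sequence model might be worth trying.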
Related Publications • Mohak Sukhwani and C.V. Jawahar, Tennis Vid2Text : Fine-Grained Descriptions for Domain Specific Videos, Proceedings of the 26th British Machine Vision Conference (BMVC), 07-10 Sep 2015, Swansea, UK. • Mohak Sukhwani and C.V. Jawahar, Frame level Annotations for Tennis Videos, 23rd International Conference on Pattern Recognition, ICPR 2016 (Under Review)