Mohak Kumar Sukhwani 201307583 Advisor: Prof. C. V. Jawahar

Understanding and Describing Tennis Videos Mohak Kumar Sukhwani 201307583 Advisor: Prof. C. V. Jawahar Center for Visual Information Technology, IIIT-Hyderabad, India

Sports Video Analysis Cricket: Temporal segmentation and annotation of actions with semantic descriptions. Snooker and volleyball: (Left) Analysis of shot trajectories and strokes. (Right) Player identification and action recognition.

Ice-Hockey: Player recognition and tracking on field. Soccer: Real-time football analysis includes automatic game summarization, player tracking, and highlight extraction.

Handball: Trajectory-based handball video understanding. Basketball: Tracking players under global appearance constraints.

Computer Vision and Language Processing < video slide – motivation > How will you describe it?

Visual-Semantic Alignments (Varied Approaches)

Our Approach Descriptions: "IN, Winner: Serena!!! High kick serve, Williams returns a backhand return, short rally, Sharapova cross-court backhand lands outside the court."

Tennis Data New Video

Does confining the domain help? Frequency comparison of unrestricted tennis text (tennis news, blogs, etc., denoted by `*`) with tennis commentaries.

Phrase Recognition Description Retrieval Action Recognition Action Localization

Dataset (a) Annotated-action. (b) Video commentary.

Text Corpus Source: Tennis Earth - http://www.tennisearth.com/.

Action Localization: Player Detection. Phrase recognition accuracy averaged over the top 5 retrievals. Player detection on test videos.

Player Recognition • color based descriptors (MPEG-7 SCD, CLD) • edge based descriptor (MPEG-7 EHD) • color and texture information (MPEG-7-like CEDD)

Weak Learners for Action Recognition. Pipeline: feature extraction (dense trajectories) → encoding and pooling (bag of words) → discriminative classifier (multiclass SVM). Levels of semantics: Activity (waits for ball, serves a good one, crafts a forehand return); Action (forehand, backhand, volley).
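The encoding and pooling stage above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a codebook has already been learned (for example by k-means over training descriptors), and the names `bow_encode`, `codebook`, and `descriptors` are hypothetical. The resulting histogram is what a multiclass SVM would take as input.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codeword and
    return an L1-normalised histogram of codeword counts."""
    # Squared Euclidean distance between every descriptor and every codeword.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy data (assumption): 3 codewords and 4 trajectory descriptors in 2-D.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
descriptors = np.array([[0.5, -0.2], [9.0, 11.0], [10.5, 9.5], [-9.0, 10.0]])
video_hist = bow_encode(descriptors, codebook)
print(video_hist)
```

Hard assignment with L1 normalisation is the simplest choice; soft assignment or Fisher vectors would slot into the same place in the pipeline.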

Improved Dense Trajectories as a feature vector! Dense sampling in each spatial scale; trajectory-aligned descriptors; feature tracking. - Capture the intrinsic dynamic structures in video - MBH is robust to camera motion - Detect human body to remove spurious trajectories

What's with camera motion? Separate models for upper and lower action!

We are already done with training!

We test on tennis point videos. SVM scores → MRF-based temporal smoothing with pairwise phrase cohesion → retrieval module.

How about a joint model for phrase classification? - Semi-automatic process for phrase alignment. - No manual shot sampling. - No tiring action annotations.
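A minimal sketch of temporal smoothing on a chain-structured MRF, decoded with the Viterbi algorithm. The slide does not give the exact energy, so the additive form below (per-window SVM score plus a pairwise cohesion score between consecutive phrases) is an assumption, as are the names `smooth_labels`, `unary`, and `pairwise`.

```python
import numpy as np

def smooth_labels(unary, pairwise):
    """Viterbi decoding on a chain.

    unary[t, k]    : classifier score for phrase k in window t
    pairwise[j, k] : cohesion score for phrase k following phrase j
    Returns the label sequence with the highest total score."""
    T, K = unary.shape
    best = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    best[0] = unary[0]
    for t in range(1, T):
        # cand[j, k]: best score ending in j at t-1, then moving to k.
        cand = best[t - 1][:, None] + pairwise
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + unary[t]
    # Trace back from the best final label.
    labels = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]

# Toy scores (assumption): the middle window weakly prefers phrase 1.
unary = np.array([[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]])
pairwise = np.array([[0.5, -0.5], [-0.5, 0.5]])
smoothed = smooth_labels(unary, pairwise)
print(unary.argmax(axis=1).tolist(), smoothed)
```

In the toy example the per-window argmax flips label in the middle window, while the smoothed sequence stays on one phrase because the cohesion term outweighs the small score difference.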

Commentary Text → 9 phrase encodings: (subject), (object), (subject;verb), (object;verb), (subject;prep;object), (object;prep;object), (attribute;subject), (attribute;object) and (verb;prep;object).

Commentary: "IN, Winner: Serena!!! Huge serve. Ace !!!" Phrases: <winner Serena>, <huge serve>, <ace>. Commentary: "IN, Winner: Zvonareva !!! Good serve in the middle, Williams returns a quick forehand return, short rally, Serena cross-court fails to clear the net in the middle." Phrases: <winner Zvonareva>, <quick return>, <short rally>, <Williams return return>, <cross-court fail>.

Probabilistic Label Consistent KSVD [Figure: a sliding window over the tennis point video produces the action trajectory matrix Y; together with the phrase label matrix H, sparse codes and an optimal dictionary are learned. PC = phrase cluster.]
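For orientation, the standard (non-probabilistic) label-consistent K-SVD objective is given below. The slide does not state the objective, so this is background rather than the authors' formulation. It assumes Y is the action trajectory matrix and H the phrase label matrix, as the slide's labels suggest. The remaining symbols come from the standard LC-KSVD formulation, not from the slides: dictionary D, sparse codes X, discriminative code targets Q, linear transform A, classifier W, weights α and β, and sparsity level T. How the probabilistic variant modifies these terms is not shown here.

```latex
\min_{D,\,W,\,A,\,X}\;
\|Y - DX\|_F^2 \;+\; \alpha\,\|Q - AX\|_F^2 \;+\; \beta\,\|H - WX\|_F^2
\quad \text{s.t.}\quad \|x_i\|_0 \le T \;\; \forall i
```

The first term is the reconstruction error, the second ties sparse codes to class-specific dictionary atoms, and the third is the classification error on the labels.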

For test videos,

Commentary generation. Offline: commentary collection → representation [tf-idf/LSI] → document representation → index. Online: phrases + players → representation [tf-idf/LSI] → query representation. Comparison function [tf-idf/LSI] matches the query against the index. Voila!

LSI for better text retrieval. SVD takes term-based indexing (n dimensions) to latent-concept-based indexing (k dimensions, n > k). • Map documents (and terms) to a low-dimensional representation. • Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). • Compute document similarity based on the inner product in this latent semantic space.
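The three steps above can be sketched with NumPy alone. This is an illustration under stated assumptions: the toy commentary lines, the smoothed idf formula, the choice k = 2, and the helper names `tfidf` and `cosine` are all hypothetical, and cosine similarity (a normalised inner product) is used for ranking.

```python
import numpy as np

def tfidf(docs, vocab):
    """Term-by-document tf-idf matrix (rows: terms, columns: documents)."""
    counts = np.array([[d.split().count(t) for d in docs] for t in vocab],
                      dtype=float)
    df = (counts > 0).sum(axis=1)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0  # smoothed idf (assumption)
    return counts * idf[:, None], idf

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy commentary collection (assumption).
docs = ["huge serve ace",
        "quick forehand return short rally",
        "backhand return lands outside court",
        "good serve in the middle"]
vocab = sorted({w for d in docs for w in d.split()})
A, idf = tfidf(docs, vocab)

# Truncated SVD: keep the k strongest latent concepts (k < number of terms).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
doc_vecs = (U_k.T @ A).T  # one k-dimensional row per document

# Project the query with the same mapping, then rank documents.
query = "ace serve"
q = np.array([query.split().count(t) for t in vocab], dtype=float) * idf
q_vec = U_k.T @ q
scores = [cosine(q_vec, v) for v in doc_vecs]
ranking = np.argsort(scores)[::-1]
print([docs[i] for i in ranking])
```

Because the query and the documents are compared in the latent space, a document can receive a non-zero score even when it shares no exact term with the query, which plain term-based matching cannot do.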

Illustration of the approach Input sequence of videos is first translated into a set of phrases, which are then used to produce the final description

Quantitative Comparisons Template based CNN + RNN CCA + Semantic Correlation matching CCA + SSVM

The premise is indeed true. Confining the domain does help!

Qualitative Results

Qualitative Comparisons Youtube2Text: S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell & K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013. RNN: A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Human Evaluation

Contribution

< video slide >

Other Applications 1. Smart theatrics: Narration generation for dance dramas (Ballet, Kathak, Kabuki). 2. Sports: Other sporting events (Baseball, Volleyball, Cricket).

Possible Extensions! Longer text: more realistic and exhaustive game descriptions. (Requires better topic modelling and retrieval methods.) Data collection is a challenge: too many variations.

Ball Tracking: Tried simple Kalman filtering. How about RNNs? Will it actually help and add to content understanding?
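A minimal sketch of what simple Kalman filtering for ball tracking could look like, assuming a constant-velocity motion model over 2-D image coordinates. The slides do not describe the model actually tried, so the state layout, the noise variances, and the name `kalman_track` are assumptions. A detection of `None` stands for a frame where the ball was not found; the first detection must be present.

```python
import numpy as np

def kalman_track(measurements, dt=1.0, process_var=1.0, meas_var=4.0):
    """Constant-velocity Kalman filter over 2-D ball detections.
    State is [x, y, vx, vy]; only position is observed."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # observation model
    Q = process_var * np.eye(4)                  # process noise
    R = meas_var * np.eye(2)                     # measurement noise
    x = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = 100.0 * np.eye(4)                        # large initial uncertainty
    track = [x[:2].copy()]
    for z in measurements[1:]:
        # Predict step.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step, skipped when the detector missed the ball.
        if z is not None:
            y = np.asarray(z, dtype=float) - H @ x
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ y
            P = (np.eye(4) - K @ H) @ P
        track.append(x[:2].copy())
    return np.array(track)

# Toy detections (assumption): straight-line motion with one missed frame.
detections = [(5.0 * t, 2.0 * t) for t in range(8)]
detections[4] = None
track = kalman_track(detections)
print(track[-1])
```

A constant-velocity model ignores bounces and spin, which is one reason a learned sequence model such as an RNN might be worth comparing against.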

Related Publications • Mohak Sukhwani and C.V. Jawahar, Tennis Vid2Text: Fine-Grained Descriptions for Domain Specific Videos, Proceedings of the 26th British Machine Vision Conference (BMVC), 07-10 Sep 2015, Swansea, UK. • Mohak Sukhwani and C.V. Jawahar, Frame Level Annotations for Tennis Videos, 23rd International Conference on Pattern Recognition, ICPR 2016 (Under Review).

< video slide – human evaluation >
