Exploiting video information for Meeting Structuring
Agenda
• Introduction
• Feature set extension
• Video features processing
• Video features integration
• Preliminary results
• Conclusions
Meeting Structuring (1)
• Goal: recognise events which involve one or more communicative modalities:
  • Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
• Working environment: the “IDIAP framework”:
  • 69 five-minute meetings with 4 participants
  • 30 transcribed meetings
  • Scripted meeting structure
Meeting Structuring (2)
• 3 audio-derived feature families:
  • Speaker Turns: from microphone-array beam-forming
  • Prosody: from lapel microphones (pitch baseline, energy, rate of speech)
  • Lexical features: from the ASR transcription (monologue/dialogue discrimination)
Meeting Structuring (3)
• Dynamic Bayesian Network based models (using GMTK, Bilmes et al.)
• Multi-stream processing (parallel stream processing)
• “Counter structure” (state duration modelling)
• 3 feature families:
  • Prosodic features (S1)
  • Speaker Turns (S2)
  • Lexical features (S3)
• Leave-one-out cross-validation over the 30 annotated meetings
[Figure: DBN unrolled over time, with counter nodes C, nodes E and A, per-stream state nodes S¹..S³ and observation nodes Y¹..Y³]
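For intuition, a minimal sketch of the multi-stream idea only, not the actual GMTK model: each feature family is treated as an independent stream, so per-stream log-likelihoods add up in the emission score. Stream names, dimensions, GMM sizes and training data below are illustrative assumptions.

```python
# Minimal sketch (NOT the GMTK model): combining independent per-stream
# GMM log-likelihoods into one multi-stream emission score.
import numpy as np
from sklearn.mixture import GaussianMixture

streams = {"prosody": 4, "speaker_turns": 2, "lexical": 1}  # dims are assumed

# One GMM per stream for a single toy meeting action; real models would
# have one set of GMMs per action state.
rng = np.random.default_rng(0)
models = {name: GaussianMixture(n_components=2, random_state=0)
                .fit(rng.normal(size=(200, dim)))
          for name, dim in streams.items()}

def emission_loglik(obs):
    """obs: dict mapping stream name -> (1, dim) feature vector for one frame.
    Independence across streams means the log-likelihoods simply add."""
    return sum(models[s].score_samples(obs[s])[0] for s in streams)

frame = {s: rng.normal(size=(1, d)) for s, d in streams.items()}
print(emission_loglik(frame))
```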
Feature set extension (1)
• Multi-party meetings are multi-modal communicative processes
• Our features cover only two modalities: audio (prosodic features & speaker turns) and lexical content (lexical monologue/dialogue discriminator)
• Exploiting video content is the next step!
Feature set extension (2)
• Goal: improve the recognition of “Note taking”, “Presentation” and “Whiteboard”:
  • The three most frequently confused symbols
  • Three meeting actions which heavily involve body/hand movements
• Approach: extract low-level video features and leave their interpretation to specialised high-level models
Feature set extension (3)
• We need motion features for the hands and head/torso regions
• Constraints:
  • The system must be simple
  • Reliable against “environmental” changes (lighting, backgrounds, …)
  • Open to further extensions/modifications
• Initial assumptions:
  • Meeting video content is quite “static”
  • Participants occupy only a few spatial regions and tend to stay there
  • The meeting room configuration (camera positions, seats, furniture, …) is fixed
Video feature extraction (1)
• Motion analysis is performed by:
  • Kanade-Lucas-Tomasi (KLT) feature tracking…
  • …and partitioning the resulting trajectories according to their position in the scene
• Four spatial regions for each scene: Head 1 / 2 and Hands 1 / 2
KLT (1)
• Assumption: the brightness of every point of a (slowly) moving or static object does not change between images taken at nearby time instants (Taylor series approximated to the 1st derivative)
• Optical flow constraint equation:
  $I_x u + I_y v + I_t = 0$, i.e. $\nabla I \cdot \mathbf{v} = -I_t$
  • $I_t$ represents how fast the intensity is changing with time
  • $\nabla I = (I_x, I_y)$ is the brightness gradient
  • $\mathbf{v} = (u, v)$ is the moving object's speed
• We have one equation in two unknowns; hence more than one solution
KLT (2)
• Let $\mathbf{x}_i$ be neighbour points of $\mathbf{x}$, with the same constant velocity $\mathbf{v} = (u, v)$
• Minimising the weighted least-square error:
  $\epsilon = \sum_i w_i \left[ \nabla I(\mathbf{x}_i) \cdot \mathbf{v} + I_t(\mathbf{x}_i) \right]^2$
• In two dimensions the system has the form $G\,\mathbf{v} = \mathbf{b}$, with
  $G = \sum_i w_i \,\nabla I(\mathbf{x}_i)\,\nabla I(\mathbf{x}_i)^{\top}$, $\quad \mathbf{b} = -\sum_i w_i\, I_t(\mathbf{x}_i)\,\nabla I(\mathbf{x}_i)$
• If $G$ is invertible, the solution is $\mathbf{v} = G^{-1}\mathbf{b}$
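A minimal numpy sketch of this least-squares solve over a single window (the 7×7 window size appears on a later slide); the gradient inputs, uniform weights and conditioning threshold are illustrative assumptions.

```python
# Minimal sketch of the Lucas-Kanade least-squares solve for one window.
import numpy as np

def lk_velocity(Ix, Iy, It, w=None):
    """Ix, Iy, It: spatial/temporal image derivatives over a small window
    (e.g. 7x7). Returns the velocity v = (u, v) solving G v = b, or None."""
    Ix, Iy, It = (a.ravel() for a in (Ix, Iy, It))
    if w is None:
        w = np.ones_like(Ix)  # uniform weights (an assumption)
    G = np.array([[np.sum(w * Ix * Ix), np.sum(w * Ix * Iy)],
                  [np.sum(w * Ix * Iy), np.sum(w * Iy * Iy)]])
    b = -np.array([np.sum(w * Ix * It), np.sum(w * Iy * It)])
    # Conditioning check: both eigenvalues of G should be large and in the
    # same range (see the next slide), otherwise the solve is unreliable.
    lam = np.linalg.eigvalsh(G)
    if lam.min() < 1e-6:  # threshold is an illustrative choice
        return None
    return np.linalg.solve(G, b)
```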
KLT (3)
• A good feature is:
  • one that can be tracked well… (Tomasi et al.): if $\lambda_1$ and $\lambda_2$ are the eigenvalues of $G$, the system is well-conditioned if they are large eigenvalues, but in the same range (high texture content)
  • …and even better if it is part of a human body: pixels with a higher probability of being skin are preferred
KLT (4)
• KLT feature tracking consists of 3 steps:
  1. Select n good features
  2. Track the selected n features
  3. Replace lost features
• We decided to track n = 100 features in a square (7×7) window
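A minimal OpenCV sketch of these three steps; n = 100 and the 7×7 window come from the slide, while the input file name, quality level and minimum distance are illustrative assumptions. This is not the authors' implementation.

```python
# Minimal sketch of the 3 KLT steps (select / track / replace) in OpenCV.
import cv2
import numpy as np

N_FEATURES, WIN = 100, (7, 7)

cap = cv2.VideoCapture("meeting.avi")  # hypothetical input file
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Step 1: select n good features (Shi-Tomasi eigenvalue criterion).
pts = cv2.goodFeaturesToTrack(prev, maxCorners=N_FEATURES,
                              qualityLevel=0.01, minDistance=5)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Step 2: track the selected features with pyramidal Lucas-Kanade.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None,
                                              winSize=WIN)
    pts = nxt[status.ravel() == 1].reshape(-1, 1, 2)
    # Step 3: replace lost features to keep ~100 trajectories alive.
    lost = N_FEATURES - len(pts)
    if lost > 0:
        extra = cv2.goodFeaturesToTrack(gray, maxCorners=lost,
                                        qualityLevel=0.01, minDistance=5)
        if extra is not None:
            pts = np.vstack([pts, extra])
    prev = gray
```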
Skin modelling
• Colour-based approach in the (Cr, Cb) chromatic subspace
• Skin samples taken from unused meetings
• Initial experiments used a single Gaussian
• Now: a 3-component Gaussian Mixture Model
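A minimal sketch of such a (Cr, Cb) skin model with a 3-component GMM; the training-image file names and the decision threshold are illustrative assumptions.

```python
# Minimal sketch of a (Cr, Cb) skin-colour model using a 3-component GMM.
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def crcb_pixels(bgr_image):
    """Return an (N, 2) array of (Cr, Cb) values, one row per pixel."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    return ycrcb[:, :, 1:3].reshape(-1, 2).astype(np.float64)

# Fit on hand-labelled skin patches from meetings not used elsewhere.
skin_patches = [cv2.imread(p) for p in ["skin1.png", "skin2.png"]]  # hypothetical
train = np.vstack([crcb_pixels(p) for p in skin_patches])
gmm = GaussianMixture(n_components=3, covariance_type="full").fit(train)

def skin_mask(bgr_image, thresh=-8.0):
    """Per-pixel skin mask from the GMM log-likelihood (threshold assumed)."""
    ll = gmm.score_samples(crcb_pixels(bgr_image))
    return (ll > thresh).reshape(bgr_image.shape[:2])
```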
Video feature extraction (2)
• Structure of the implemented system:
  Video → Skin Detection (using the skin model) → KLT (100 features / frame) → Trajectory Structure (100 trajectories / frame)
Video feature extraction (3)
• Trajectory Structure:
  • Remove long, quasi-static trajectories
  • Define 4 partitions (regions): 2 × heads (H1, H2), 2 × hands (Ha1, Ha2)
  • Classify trajectories into the 4 regions
  • Define 2 additional fixed regions (L and R)
  • Evaluate: average motion per region
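A minimal sketch of the partitioning and averaging steps; the region bounding boxes, frame size and trajectory representation are illustrative assumptions.

```python
# Minimal sketch: classify trajectories into spatial regions and compute
# one averaged motion vector per region.
import numpy as np

# Hypothetical region boxes (x0, y0, x1, y1) for an assumed 720x576 scene.
REGIONS = {
    "H1":  (100,  50, 300, 200),   # head, participant 1
    "H2":  (420,  50, 620, 200),   # head, participant 2
    "Ha1": (100, 220, 300, 420),   # hands, participant 1
    "Ha2": (420, 220, 620, 420),   # hands, participant 2
}

def region_of(point):
    """Return the region containing a point, or None (e.g. fixed L/R zones)."""
    x, y = point
    for name, (x0, y0, x1, y1) in REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def average_motion(trajectories):
    """trajectories: list of (T, 2) arrays of point positions over time.
    Averaging displacements over many trajectories reduces noise."""
    per_region = {name: [] for name in REGIONS}
    for traj in trajectories:
        name = region_of(traj[-1])              # classify by current position
        if name is not None and len(traj) > 1:
            per_region[name].append(traj[-1] - traj[-2])  # last displacement
    return {name: (np.mean(v, axis=0) if v else np.zeros(2))
            for name, v in per_region.items()}
```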
Video feature extraction (4)
[Figure: four example panels, numbered 1-4]
Video feature extraction (5)
• Taking motion vectors averaged over many trajectories helps to reduce noise
• For each scene, 4 motion vectors, one for each region, are estimated (to be soon enhanced with 2 more regions/vectors, L and R, in order to detect whether someone is entering or leaving the scene)
• Open issues:
  • Loss of tracking for fast-moving objects must be accounted for during tracking
  • Assumption of a fixed scene structure
  • Delayed/offline processing
Integration
• Goal: extend the multi-stream model with a new video stream
[Figure: extended DBN with four streams, Speaker turns (S¹), Prosodic features (S²), Lexical features (S³) and Video features (S⁴), each with its observation nodes Y¹..Y⁴]
• It is possible that the extended model will be intractable due to the increased state space
• In this case:
  • State-space reduction through a multi-time-scale approach will be attempted
  • Early integration of Speaker Turns + Lexical features will be investigated
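For the last point, a minimal sketch of what early integration amounts to, concatenating two feature families into one observation stream so the model has fewer parallel streams; the array shapes are illustrative assumptions.

```python
# Minimal sketch of early integration: two feature families become one
# combined observation stream before modelling. Shapes are assumed.
import numpy as np

speaker_turns = np.random.rand(1000, 2)   # T frames x 2 dims (assumed)
lexical = np.random.rand(1000, 1)         # T frames x 1 dim (assumed)

# A single emission model is then trained on the combined stream,
# shrinking the number of parallel streams and hence the state space.
combined = np.hstack([speaker_turns, lexical])
print(combined.shape)  # (1000, 3)
```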
Preliminary results
• Before proceeding with the proposed integration we need to:
  • compare the performance of the video features against the other feature families
  • validate the extracted video features
• Video features alone perform quite poorly, but they seem to be helpful when evaluated together with Speaker Turns:
  • (Speaker Turns) + (Prosody + Lexical Features)
  • (Speaker Turns) + (Video Features)
Summary
• Extraction of video features through:
  • A skin-detector-enhanced KLT feature tracker
  • Segmentation of trajectories into 4/6 spatial regions
  (a simple and fast approach, but with some open problems)
• Validation of motion vectors as a video feature
• Integration into the existing framework (work in progress)