VACE Multimodal Meeting Corpus

Lei Chen, Travis Rose, Fey Parrill, Xu Han, Jilin Tu, Zhongqiang Huang, Mary Harper, Francis Quek, David McNeill, Ronald Tuttle, and Thomas Huang

We acknowledge support from:
• NSF-STIMULATE program, Grant No. IRI-9618887, "Gesture, Speech, and Gaze in Discourse Segmentation"
• NSF-KDI program, Grant No. BCS-9980054, "Cross-Modal Analysis of Signal and Sense: Multimedia Corpora and Tools for Gesture, Speech, and Gaze Research"
• NSF-ITR program, Grant No. IIS-0219875, "Beyond the Talking Head and Animated Icon: Behaviorally Situated Avatars for Tutoring"
• ARDA-VACE II program, "From Video to Information: Cross-Modal Analysis of Planning Meetings"

Presented by Francis Quek, Professor of Computer Science and Director, Center for Human Computer Interaction, Virginia Tech
Corpus Rationale
• A quest for meaning: embodied cognition and language production drive our research
• Analysis of 'natural' human-human meetings
• A resource in support of research in:
  • Multimodal language analysis
  • Speech recognition and analysis
  • Vision-based communicative behavior analysis
Why Multimodal Language Analysis?
S1: you know like those ¿fireworks?
S2: well if we're trying to drive'em / out her<r>e # we need to put'em up her<r>e
S1: yeah well what I'm saying is we should*
S2: in front
S1: we should do it* we should make it a lin<n>e through the room<m>s / so that they explode like here then here then here then here
Embodied Communicative Behavior
• Constructed dynamically at the moment of speaking ("thinking for speaking")
• Dependent on cultural, personal, social, and cognitive differences
• Speakers are often unaware of their own gestures
• Reveals the contrastive foci of the language stream (Hajičová, Halliday, et al.)
• Is co-expressive (co-temporal) with speech
• Is multiply determined
• Temporal synchrony is critical for analysis
In a Nutshell
Gesture/Speech Framework (McNeill 1992, 2000, 2001; Quek et al. 1999-2003)
ARDA/VACE Program
• ARDA is to the intelligence community what DARPA is to the military
• Its interest is in the exploitation of video data (Video Analysis and Content Exploitation)
• A key VACE challenge: meeting analysis
• Our key theme: multimodal communication analysis
From Video to Information: Cross-Modal Analysis for Planning Meetings
Team
Multimodal Meeting Analysis: A Cross-Disciplinary Enterprise
Overarching Approach
• Coordinated multidisciplinary research
• Corpus assembly
  • Data are transcribed and coded for relevant speech/language structure
  • War-gaming (planning) scenarios are captured to provide real planning behavior in a controlled experimental context, reducing many 'unknowns'
  • The meeting room is multiply instrumented with cross-calibrated video, synchronized audio/video, and motion tracking
  • All data components are time-aligned across the dataset
• Multimodal video processing research
  • Research on posture, head position/orientation, gesture tracking, hand-shape recognition, and multimodal integration
• Research on tools for analysis, coding, and interpretation
• Speech analysis research in support of multimodality
Scenarios
• Each scenario has five participants
• Roles are tailored to available participant expertise
• Five initial scenarios:
  • Delta II Rocket Launch
  • Foreign Material Exploitation
  • Intervention to Support Democratic Movement
  • Humanitarian Assistance
  • Scholarship Selection
Scenarios (cont'd)
• Planned scenarios (to be developed):
  • Lost Aircraft Crisis Response
  • Hostage Rescue
  • Downed Pilot Search & Rescue
  • Bomb Shelter Design
Scenario Development
• Humanitarian Assistance walkthrough
• Purpose: develop a plan for immediate military support to the December 2004 Asian tsunami victims
• Considerable open-source information available on the Internet for scenario development
• Roles (mission goals and priorities provided for each role):
  • Medical Officer
  • Task Force Commander
  • Intel Officer
  • Operations Officer
  • Weather Officer
Meulaboh, Indonesia

"As intelligence officer, your role is to provide intelligence support to OPERATION UNIFIED ASSISTANCE. While the extent of damage is still unknown, early reporting indicates that coastal areas throughout South Asia have been affected. Communications have been lost with entire towns. Currently, the only means of determining the magnitude of destruction is from overhead assets. Data from the South Asia and Sri Lanka region has already been received from civilian remote sensing satellites. Although the US military will be operating in the region on a strictly humanitarian mission, the threat still exists of hostile action against US personnel by terrorist factions opposed to the US. As intel officer, you are responsible for briefing the nature of the terrorist threat in the region."

[Satellite images: Meulaboh before and after the tsunami]
Data Acquisition & Processing
• Multimodal elicitation experiment
• Data capture: 10-camera video, digital audio, and 3D Vicon motion-capture extraction
• Video processing: 10-camera calibration, vector extraction, hand tracking, gaze tracking, head modeling, head tracking, body tracking
• Speech & audio processing: automatic transcript word/syllable alignment to audio, audio feature extraction
• Speech & psycholinguistic coding: speech transcription, psycholinguistic coding
• Output: time-aligned multimedia transcription, used for interpretation
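These streams run at different sampling rates, so every component must be mapped onto a common timeline before analysis. The following is a minimal sketch of that bookkeeping, assuming each stream has a known rate and a known offset from a shared sync point; the rates and offsets shown are illustrative, not the corpus's actual capture parameters.

```python
from dataclasses import dataclass

@dataclass
class Stream:
    """One recorded modality, mapped onto the shared corpus timeline."""
    name: str
    rate_hz: float    # sampling rate of this stream
    offset_s: float   # offset of sample 0 from the shared sync point

    def sample_to_time(self, idx: int) -> float:
        """Time (seconds, common timeline) of sample/frame `idx`."""
        return self.offset_s + idx / self.rate_hz

    def time_to_sample(self, t: float) -> int:
        """Nearest sample/frame index at common-timeline time `t`."""
        return round((t - self.offset_s) * self.rate_hz)

# Illustrative rates and offsets; the actual capture parameters are
# not given in the slides.
video = Stream("camera_1", rate_hz=29.97, offset_s=0.0)
mocap = Stream("vicon", rate_hz=120.0, offset_s=0.042)

t = video.sample_to_time(450)     # common-timeline time of video frame 450
print(mocap.time_to_sample(t))    # nearest mocap sample at that time
```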
Meeting Room and Camera Configuration
[Diagram: floor plan of the meeting room with cameras A-H positioned around the conference table]
Global & Pairwise Camera Calibration
• 48 calibration dots for camera calibration
• 18 Vicon markers for the coordinate-system transformation Y = RX + T
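The transformation Y = RX + T maps Vicon coordinates X into the global camera frame via a rotation R and translation T. Below is a minimal sketch of how R and T could be recovered from the 18 corresponding marker positions using the standard Kabsch / orthogonal Procrustes solution; the slides do not name the method actually used, so treat this as one reasonable approach.

```python
import numpy as np

def estimate_rigid_transform(X, Y):
    """Estimate rotation R and translation T such that Y ~ R @ x + T
    for corresponding rows of X and Y.

    X, Y: (N, 3) arrays of corresponding 3D marker positions in the
    Vicon frame and the global camera frame, respectively.
    """
    cx, cy = X.mean(axis=0), Y.mean(axis=0)        # centroids
    H = (X - cx).T @ (Y - cy)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection correction so that det(R) = +1 (a proper rotation).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    T = cy - R @ cx
    return R, T

# Hypothetical usage with the 18 markers (arrays are placeholders):
# R, T = estimate_rigid_transform(vicon_points, camera_points)
# camera_coords = (R @ vicon_points.T).T + T
```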
Error Distributions in the Meeting Room Area (camera pairs 5-12)

Direction | Min (mm) | Max (mm) | Mean (mm)
X         | 0.4000   | 0.5886   | 0.4755
Y         | 0.3077   | 0.6925   | 0.4529
Z         | 0.3804   | 0.5064   | 0.4317
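These figures are per-axis summaries of the calibration residuals. A small sketch of how such a summary could be computed from triangulated vs. reference marker positions (the arrays below are illustrative placeholders, not corpus data):

```python
import numpy as np

def per_axis_error_stats(est: np.ndarray, ref: np.ndarray) -> dict:
    """Per-axis absolute error statistics, in the units of the inputs,
    between estimated and reference 3D positions. est, ref: (N, 3)."""
    err = np.abs(est - ref)
    return {axis: {"min": float(err[:, i].min()),
                   "max": float(err[:, i].max()),
                   "mean": float(err[:, i].mean())}
            for i, axis in enumerate("XYZ")}

# Tiny illustrative example (values in mm):
est = np.array([[1.2, 0.4, 2.0], [3.1, 1.5, 0.9]])
ref = np.array([[1.0, 0.5, 2.1], [3.0, 1.2, 1.0]])
print(per_axis_error_stats(est, ref))
```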
VICON Motion Capture
• Motion-capture technology:
  • Near-IR cameras
  • Retro-reflective markers
  • Datastation + PC workstation
• Vicon modes of operation:
  • Individual points (as seen in calibration)
  • Kinematic models
  • Individual objects
VICON Motion Capture (cont'd)
• Learning about motion capture:
  • 11/03: initial installation
  • 6/04: pilot scenario, using kinematic models
  • 10/04: follow-up training using object models
  • 11/04: rehearsal using Vicon with object models
  • 1/05: data captured for the FME scenario
• Export of position and orientation for each participant's head, hands, and body
• Post-processing of motion-capture data: ~1 hour per minute of data for a 5-participant meeting
• Incorporating motion capture into the workflow:
  • Labeling of point clusters is labor intensive
  • 3 work-study students @ 20 hours/week yields ~60 minutes of data (1 dataset) processed per week
Speech Processing Tasks
• Formulate an audio workflow to support the efficient and effective construction of a large, high-quality multimodal corpus
• Implement support tools to achieve that goal
• Package time-aligned word transcriptions into appropriate data formats that can be efficiently shared and used (see the sketch below)
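The slides do not specify the interchange format, so the following is a minimal sketch of packaging forced-alignment output as a simple shareable CSV; the column layout, class name, and example values are illustrative assumptions.

```python
import csv
from dataclasses import dataclass

@dataclass
class AlignedWord:
    speaker: str
    word: str
    start_s: float   # word onset on the corpus timeline, in seconds
    end_s: float     # word offset, in seconds

def write_alignment(words: list[AlignedWord], path: str) -> None:
    """Write time-aligned word transcriptions as a simple CSV."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["speaker", "word", "start_s", "end_s"])
        for word in words:
            w.writerow([word.speaker, word.word,
                        f"{word.start_s:.3f}", f"{word.end_s:.3f}"])

# Illustrative usage (timestamps and filename are made up):
write_alignment(
    [AlignedWord("S1", "yeah", 12.310, 12.540),
     AlignedWord("S1", "well", 12.540, 12.800)],
    "meeting_words.csv",
)
```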
Audio Processing
Workflow: audio recording (with meeting metadata annotation) → audio segmentation → manual transcription → OOV word resolution → forced alignment of the transcription to the audio → corpus integration
Data Collection Status
• Pilot: June 04
  • Low audio volume; sound mixer purchased
  • Video frame drop-out; high-grade DV tapes purchased
• AFIT 02-07-05 Democratic Movement Assistance (two sessions)
  • Audio clipping in close-in mics; may be able to salvage the data using the desktop mics
• AFIT 02-24-05 Humanitarian Assistance (Tsunami)
• AFIT 03-04-05 Humanitarian Assistance (Tsunami)
• AFIT 03-18-05 Scholarship Selection
• AFIT 04-08-05 Humanitarian Assistance (Tsunami)
• AFIT 04-08-05 Card Game
• AFIT 04-25-05 Problem Solving Task (cause of deterioration of the Lincoln Memorial)
• AFIT 06-??-05 Problem Solving Task
Meeting Dynamics: F1 vs. F2
[Plot: interaction dynamics of participants F1 and F2 during the Lance Armstrong episode, NIST Microcorpus, July 29, 2003]
Gaze: NIST July 29, 2003 Data
Gaze direction tracks both social patterns (interactive gaze) and engagement with objects (instrumental gaze), which may be a form of pointing as well as perception.
[Plots: interactive gaze occurrences and instrumental gaze over a 5-minute sample, with gaze source and gaze target marked]
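One simple way to separate the two gaze types from tracked head/gaze vectors is to test which person or object a gaze ray points toward. This sketch is an illustrative heuristic, not the project's actual classifier; the angular threshold and all names are assumptions.

```python
import numpy as np

def classify_gaze(gaze_origin, gaze_dir, people, objects, max_angle_deg=10.0):
    """Label one gaze sample as interactive (at a person), instrumental
    (at an object), or unresolved, by smallest angular deviation.

    people/objects: dicts mapping names to 3D positions.
    """
    d = np.asarray(gaze_dir, float)
    d = d / np.linalg.norm(d)
    best = ("unresolved", None, max_angle_deg)
    for kind, targets in (("interactive", people), ("instrumental", objects)):
        for name, pos in targets.items():
            v = np.asarray(pos, float) - gaze_origin
            angle = np.degrees(np.arccos(
                np.clip(v @ d / np.linalg.norm(v), -1.0, 1.0)))
            if angle < best[2]:
                best = (kind, name, angle)
    return best[0], best[1]

# Illustrative usage (positions in meters, made up):
kind, target = classify_gaze(
    gaze_origin=np.array([0.0, 0.0, 1.2]),
    gaze_dir=np.array([1.0, 0.2, 0.0]),
    people={"S2": np.array([2.0, 0.5, 1.2])},
    objects={"map": np.array([1.0, -1.0, 0.8])},
)
print(kind, target)   # -> interactive S2
```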
Gaze: AFIT Data
[Chart: gazer vs. gazee matrix for the meeting roles, including the General's Representative, the CO (moderator), and the Engineering Lead]
F-formation Analysis
• "An F-formation arises when two or more people cooperate together to maintain a space between them to which they all have direct and exclusive [equal] access." (A. Kendon, 1977)
• An F-formation is discovered by tracking gaze direction in a social group
• It is not only about shared space:
  • It reveals common ground and has an associated meaning
  • The cooperative property is crucial
• It is useful for detecting units of thematic content being jointly developed in a conversation (see the sketch below)
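A minimal sketch of how converging gaze could propose F-formation candidate pairs from tracked positions and gaze vectors. The closest-point-of-approach heuristic and the 0.5 m threshold are illustrative assumptions, not the project's actual detector.

```python
import numpy as np
from itertools import combinations

def gaze_convergence_point(p1, d1, p2, d2):
    """Midpoint of the shortest segment between two gaze rays p + t*d
    (t >= 0), or None if the rays are parallel, diverging, or too far
    apart. Standard closest-point-of-approach between two lines."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    n = np.cross(d1, d2)
    denom = float(n @ n)
    if denom < 1e-9:                       # (near-)parallel rays
        return None
    r = p2 - p1
    t1 = float(np.cross(r, d2) @ n) / denom
    t2 = float(np.cross(r, d1) @ n) / denom
    if t1 < 0 or t2 < 0:                   # rays must point forward
        return None
    c1, c2 = p1 + t1 * d1, p2 + t2 * d2
    if np.linalg.norm(c1 - c2) > 0.5:      # rays must nearly meet (0.5 m)
        return None
    return 0.5 * (c1 + c2)

def shared_space_pairs(positions, gazes):
    """Pairs of participants whose gaze rays converge on a common point:
    a crude proxy for jointly maintaining an o-space.
    positions/gazes: dicts of participant name -> 3D vector."""
    pairs = []
    for a, b in combinations(positions, 2):
        point = gaze_convergence_point(positions[a], gazes[a],
                                       positions[b], gazes[b])
        if point is not None:
            pairs.append((a, b, point))
    return pairs

# Illustrative usage: two participants facing a shared point.
pos = {"S1": np.array([0.0, 0.0, 1.2]), "S2": np.array([2.0, 0.0, 1.2])}
gaze = {"S1": np.array([1.0, 0.5, 0.0]), "S2": np.array([-1.0, 0.5, 0.0])}
print(shared_space_pairs(pos, gaze))
```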
Summary
• Corpus collection based on sound scientific foundations
• Data include audio, video, motion capture, speech transcription, and manual codings
• A suite of tools for visualizing and coding the co-temporal data has been developed
• Research results demonstrate multimodal discourse segmentation and meeting-dynamics analysis