Towards Computer Understanding of Human Interactions: Recognising Sequences of Meeting Actions. Iain McCowan, Daniel Gatica-Perez, Samy Bengio, Dong Zhang, Darren Moore, Guillaume Lathoud, Mark Barnard, Herve Bourlard. IDIAP Research Institute, Martigny, Switzerland
Meetings: Sequences of Actions • Meetings are commonly understood as sequences of events or actions: • meeting agenda: prior sequence of discussion points, presentations, decisions to be made, etc. • meeting minutes: posterior sequence of key phases of meeting, summarised discussions, decisions made, etc. • We aim to investigate the automatic structuring of meetings as sequences of meeting actions. • In general, these action sequences are due to the group as a whole, rather than a particular individual.
The Goal • [Figure: a meeting timeline annotated with several parallel views: Meeting Phase (Presentation by Fred, Group Discussion), Topic (Beach, Budget), Group Interest Level (High, Neutral, High), Group Task (Information Sharing, Decision Making)]
Research Questions • Such a framework for structuring meetings poses three main research questions: • What meeting actions can we define? • What evidence can we observe to allow us to recognise these actions? • How can we model the underlying process generating these observations?
1. Defining Meeting Actions • The meeting actions we define depend on the view we require. • e.g. group discussion state, evolution of group interest level, progression of topic, high-level task phases,... • Multiple parallel (concurrent) views are possible, each defined by a set of meeting actions. • Actions within a given set should be: • consistent: describing one view, answering one question, • mutually exclusive: non-overlapping, and • exhaustive: covering the entire meeting.
Defining Meeting Action Sets • Meeting action sets could be defined in many ways, including: • Technology driven: what actions do we expect we can recognise given state-of-the-art technology? • Application driven: to respond to user requirements or typical queries. • Theoretically motivated: based on coding schemes from social psychology group research, e.g. [1][2]. • Automatically: clustering of observations. [1] Bales, “Interaction Process Analysis: A Method for the Study of Small Groups”, 1951. [2] McGrath, “Groups: Interaction and Performance”, 1984.
2. Observing Meeting Actions • Using microphones, cameras, and other devices, we can observe individual participants in the meeting, as well as the entire group.
A Set of ‘Multi’modal Observations • [Figure: meeting room covered by camera1, camera2 and camera3] • Person-specific audio-visual features • Audio • Seat region audio activity • Speech pitch • Speech energy • Speech rate • Visual • Head vertical centroid • Head eccentricity • Right hand centroid • Right hand angle • Right hand eccentricity • Head and hand motion • Group-level audio-visual features • Audio • Audio activity from white-board region • Audio activity from screen region • Visual • Mean difference from white-board • Mean difference from projector screen • (A sketch of how such features might be assembled per frame follows below.)
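As a concrete illustration, here is a minimal sketch (Python with NumPy) of assembling the per-frame person-specific and group-level features into one observation vector. The extractor functions and random values are placeholders of our own, not the actual IDIAP feature-extraction code.

```python
import numpy as np

# Hypothetical per-frame features, stubbed with random numbers; in the
# real system these come from upstream audio and video processing.
def person_features(T):
    return np.column_stack([
        np.random.rand(T),   # seat region audio activity
        np.random.rand(T),   # speech pitch
        np.random.rand(T),   # speech energy
        np.random.rand(T),   # speech rate
        np.random.rand(T),   # head vertical centroid
        np.random.rand(T),   # head eccentricity
        np.random.rand(T),   # right hand centroid
        np.random.rand(T),   # right hand angle
        np.random.rand(T),   # right hand eccentricity
        np.random.rand(T),   # head and hand motion
    ])

def group_features(T):
    return np.column_stack([
        np.random.rand(T),   # audio activity, white-board region
        np.random.rand(T),   # audio activity, screen region
        np.random.rand(T),   # mean difference from white-board
        np.random.rand(T),   # mean difference from projector screen
    ])

T = 1000  # number of frames
X = np.hstack([person_features(T) for _ in range(4)] + [group_features(T)])
print(X.shape)  # (1000, 44): 4 participants x 10 features + 4 group features
```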
3. Modeling Meeting Actions • [Figure: a meeting action decomposes into individual participants’ actions, each observed through audio (A) and visual (V) streams] • Two important characteristics of meeting actions: • Multi-modal nature: people participate using speech, gestures, expressions, gaze, written text, devices, and non-speech audio cues such as laughter. • Group nature: while some meeting actions may be attributed to one individual, in general they are the result of interactions within the group.
Multi-stream HMM Variants • A well-known sequence model is the HMM: • a process is a sequence of states, where the current observation depends on the current state, which in turn depends on the previous state. • Some variants for modeling multiple streams include: • Early-integration HMM • Features concatenated into a single vector, modeled with a single HMM (see the sketch below). • Feature-level correlation, frame-synchronous streams. • Multi-stream HMM (MS-HMM) • Each stream modeled independently, with stream likelihoods combined at certain anchor points. • Feature-level independence, allows some inter-stream asynchrony.
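A minimal early-integration sketch, assuming frame-synchronous audio and visual streams. It uses the open-source hmmlearn package purely for illustration (not the toolkit or training setup of the paper), with random placeholder data.

```python
import numpy as np
from hmmlearn import hmm

T, Da, Dv = 1000, 12, 10
audio = np.random.randn(T, Da)    # placeholder audio stream (T x Da)
visual = np.random.randn(T, Dv)   # placeholder visual stream (T x Dv)

# Early integration: concatenate the streams into a single observation
# vector per frame and model the result with one HMM.
X = np.hstack([audio, visual])

model = hmm.GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
model.fit(X)                  # unsupervised EM fit on one long sequence
states = model.predict(X)     # Viterbi decoding of the state sequence
```

In the actual experiments each meeting action would have its own labelled training data; this toy fit is unsupervised and only illustrates the feature concatenation.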
Multi-stream HMM Variants • Further variants capable of modeling multiple streams include: • Asynchronous HMM (A-HMM) • Models multiple streams using a single state sequence, but where a state may emit on one or more streams at a given time, according to a synchronisation variable. • Feature-level correlation, allows inter-stream asynchrony. • Coupled HMM (C-HMM) • Similar to the multi-stream HMM, but the current state in a given stream depends on the previous states of all streams (allowing causal effects between streams). • State-level correlation, allows some inter-stream asynchrony.
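The difference between these variants can be summarised by how the joint state transition factorises. The notation below is ours (state $q_t^{(c)}$ for stream $c$ at time $t$, two streams for brevity), not taken from the slides:

```latex
% Multi-stream HMM: streams evolve independently between anchor points
P\big(q_t^{(1)}, q_t^{(2)} \mid q_{t-1}^{(1)}, q_{t-1}^{(2)}\big)
  = \prod_{c=1}^{2} P\big(q_t^{(c)} \mid q_{t-1}^{(c)}\big)

% Coupled HMM: each stream's next state depends on the previous states
% of all streams
P\big(q_t^{(1)}, q_t^{(2)} \mid q_{t-1}^{(1)}, q_{t-1}^{(2)}\big)
  = \prod_{c=1}^{2} P\big(q_t^{(c)} \mid q_{t-1}^{(1)}, q_{t-1}^{(2)}\big)
```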
A First Set of Experiments • In [3], we investigated recognising meeting actions directly from the multimodal observations. • Multi-stream models were investigated, with the streams being either modalities or participants. • Set of 8 meeting actions: • { monologue (x4), presentation, white-board, discussion, note-taking }. • Corpus: • 59 (30 train, 29 test) 5-minute, 4-person meetings. • Meetings were scripted as sequences of actions, but otherwise natural. • Evaluation metric: • Action Error Rate (AER): analogous to word error rate in ASR (a simplified sketch of the metric follows below). [3] McCowan et al., “Modeling Human Interactions in Meetings”, ICASSP 2003.
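A minimal sketch of the AER metric: align the recognised action sequence against the reference with a standard edit distance and normalise by the reference length. The actual scoring of [3] may report substitutions, insertions and deletions separately; this simplified version returns only the combined rate.

```python
def action_error_rate(reference, hypothesis):
    """Edit distance between recognised and reference action sequences,
    normalised by the reference length (analogous to WER in ASR)."""
    n, m = len(reference), len(hypothesis)
    # Standard Levenshtein dynamic program over action labels.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

ref = ["monologue1", "discussion", "presentation", "white-board"]
hyp = ["monologue1", "presentation", "white-board"]
print(action_error_rate(ref, hyp))  # one deletion over 4 reference actions -> 0.25
```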
Summary of Results • Best systems (Early int. HMM, AV A-HMM) gave 9% AER.
Summary of Results • There is benefit to a multi-modal approach for these actions. • It is important to model correlation between individual participants: • the participant MS-HMM performs poorly relative to the others. • There is no significant asynchrony between audio and visual modalities at the group action level in this case: • no significant difference between frame-level and action-level synchrony in multi-stream AV models. • There is evidence of asynchrony between participants acting in these group meeting actions: • action-level synchrony better than frame-level synchrony for participant MS-HMM and C-HMM.
A Two-Layered Approach • By defining a set of individual actions, we can decompose the meeting action recognition problem into two layers: individual (I) and group (G). • Posteriors from the I-HMM are used as input features to the G-HMM (see the sketch below): • Advantages over a single-layer HMM: • Each HMM has a smaller observation space to model. • The I-HMM is person-independent, so it can be trained with more data from all different participants. • The G-HMM is less sensitive to variations in the low-level audio-visual features. • Extensible to recognise a new meeting action lexicon. • For more on layered HMMs, see Nuria Oliver’s upcoming keynote...
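A minimal sketch of the two-layer idea using hmmlearn on random placeholder features (our illustration, not the paper's implementation): each I-HMM state loosely stands in for one individual action, and the per-frame, per-participant posteriors are concatenated to form the G-HMM observation. In the paper the individual layer is an audio-visual A-HMM trained on labelled data, whereas this toy version is unsupervised.

```python
import numpy as np
from hmmlearn import hmm

# Individual layer: one person-independent I-HMM over low-level
# audio-visual features (arrays here are random placeholders).
T, D, n_indiv_actions, n_people = 1000, 22, 3, 4
person_feats = [np.random.randn(T, D) for _ in range(n_people)]

i_hmm = hmm.GaussianHMM(n_components=n_indiv_actions, n_iter=10)
i_hmm.fit(np.vstack(person_feats), lengths=[T] * n_people)

# Group layer: per-frame posteriors over individual actions, one block
# per participant, become the observation vector of the G-HMM.
posteriors = [i_hmm.predict_proba(f) for f in person_feats]   # each T x 3
G = np.hstack(posteriors)            # T x (4 people * 3 actions) = T x 12

g_hmm = hmm.GaussianHMM(n_components=8, n_iter=10)
g_hmm.fit(G)
group_actions = g_hmm.predict(G)     # decoded group-level state sequence
```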
Example • [Figure: parallel timelines of individual actions for Persons 1-4 (S = speaking, W = writing), plus ‘presentation used’ and ‘white-board used’ indicators, aligned with the resulting group actions: Monologue1 + Note-taking, Discussion, Presentation + Note-taking, White-board + Note-taking]
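To make the figure concrete, here is an illustrative, hand-written mapping from individual actions (S = speaking, W = writing) and device activity to a plausible group action. The real system learns this mapping through the G-HMM; these rules are our own simplification of the example, not part of the system.

```python
# Illustrative only: map who is speaking/writing plus device activity
# to a group action label from the extended lexicon.
def group_action(states, presentation_on, whiteboard_on):
    speakers = [p for p, s in enumerate(states) if s == "S"]
    writers = [p for p, s in enumerate(states) if s == "W"]
    if presentation_on:
        return "presentation + note-taking" if writers else "presentation"
    if whiteboard_on:
        return "white-board + note-taking" if writers else "white-board"
    if len(speakers) == 1:
        base = f"monologue{speakers[0] + 1}"
        return base + " + note-taking" if writers else base
    return "discussion"

# Person 1 speaks while the others take notes, no devices active:
print(group_action(["S", "W", "W", "W"], False, False))
# -> "monologue1 + note-taking"
```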
Experiments with Layered HMMs • Extended set of 14 meeting actions: { ‘discussion’, ‘monologue’ (x4), ‘monologue + note-taking’ (x4), ‘note-taking’, ‘presentation’, ‘presentation + note-taking’, ‘white-board’, ‘white-board + note-taking’ }. • Summary of results [4]: • Best single-layer HMM gave 24% AER • the same system that gave 9% AER on the smaller action set. • Best 2-layer HMM: 15% AER • a 9% absolute improvement, significant at the 96% level. • Individual layer modeled with an AV A-HMM, showing that asynchrony between modalities is important at the individual level. • All 2-layer models performed better than single-layer models. [4] Zhang et al., “Modeling Individual and Group Actions in Meetings with Layered HMMs”, CVPR-EVENT 2004.
Other Ongoing Research • A richer data corpus is being specified, recorded and annotated: • to allow research on a wider variety of meeting actions, • and to research the integration of higher-level features, e.g. words, dialog acts, emotions, etc. • Recognition of group level of interest: • we currently achieve 73% frame accuracy recognising {‘high’, ‘neutral’} on the same corpus using the same feature set. • Unsupervised clustering of meeting actions: • promising initial results achieved using unsupervised training of the group-level HMM in the 2-layer system. • Investigating tractable approximations for asynchronous HMMs with many streams: • the current implementation is only tractable for 2 streams.
Summary • Structuring of meetings as sequences of group meeting actions is an interesting research task: • how to deal with data of many different modalities? • how to incorporate high-level features, like words, emotion, dialog acts? • how to efficiently model large numbers of interacting streams, allowing for potential asynchrony? • how to allow variable number of active streams? • how far can we go towards recognising more interesting high-level actions in meetings?