Towards Computer Understanding of Human Interactions: Recognising Sequences of Meeting Actions. Iain McCowan, Daniel Gatica-Perez, Samy Bengio, Dong Zhang, Darren Moore, Guillaume Lathoud, Mark Barnard, Herve Bourlard. IDIAP Research Institute, Martigny, Switzerland
Meetings: Sequences of Actions • Meetings are commonly understood as sequences of events or actions: • meeting agenda: prior sequence of discussion points, presentations, decisions to be made, etc. • meeting minutes: posterior sequence of key phases of meeting, summarised discussions, decisions made, etc. • We aim to investigate the automatic structuring of meetings as sequences of meeting actions. • In general, these action sequences are due to the group as a whole, rather than a particular individual.
The Goal • [Figure: a meeting timeline annotated with several parallel views: Meeting Phase (Presentation by Fred, Group Discussion), Topic (Beach, Budget), Group Interest Level (High, Neutral, High), Group Task (Information Sharing, Decision Making)]
Research Questions • Such a framework for structuring meetings poses three main research questions: • What meeting actions can we define? • What evidence can we observe to allow us to recognise these actions? • How can we model the underlying process generating these observations?
1. Defining Meeting Actions • The meeting actions we define depend on the view we require. • e.g. group discussion state, evolution of group interest level, progression of topic, high-level task phases,... • Multiple parallel (concurrent) views are possible, each defined by a set of meeting actions. • Actions within a given set should be: • consistent: describing one view, answering one question, • mutually exclusive: non-overlapping, and • exhaustive: covering the entire meeting.
Defining Meeting Action Sets • Meeting action sets could be defined in many ways, including: • Technology driven: what actions do we expect we can recognise given state-of-the-art technology? • Application driven: to respond to user requirements or typical queries. • Theoretically motivated: based on coding schemes from social psychology group research, e.g. [1][2]. • Automatically: clustering of observations. [1] Bales, “Interaction Process Analysis: A Method for the Study of Small Groups”, 1951. [2] McGrath, “Groups: Interaction and Performance”, 1984.
2. Observing Meeting Actions • Using microphones, cameras, and other devices, we can observe individual participants in the meeting, as well as the entire group.
A Set of ‘Multi’modal Observations • [Figure: meeting room covered by camera1, camera2 and camera3] • Person-specific audio-visual features • Audio • Seat region audio activity • Speech pitch • Speech energy • Speech rate • Visual • Head vertical centroid • Head eccentricity • Right hand centroid • Right hand angle • Right hand eccentricity • Head and hand motion • Group-level audio-visual features • Audio • Audio activity from white-board region • Audio activity from screen region • Visual • Mean difference from white-board • Mean difference from projector screen • (A sketch of how such features might be assembled per frame follows below.)
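As a concrete illustration, here is a minimal sketch (Python with NumPy) of assembling the per-frame person-specific and group-level features into one observation vector. The extractor functions and random values are placeholders of our own, not the actual IDIAP feature-extraction code.

```python
import numpy as np

# Hypothetical per-frame features, stubbed with random numbers; in the
# real system these come from upstream audio and video processing.
def person_features(T):
    return np.column_stack([
        np.random.rand(T),   # seat region audio activity
        np.random.rand(T),   # speech pitch
        np.random.rand(T),   # speech energy
        np.random.rand(T),   # speech rate
        np.random.rand(T),   # head vertical centroid
        np.random.rand(T),   # head eccentricity
        np.random.rand(T),   # right hand centroid
        np.random.rand(T),   # right hand angle
        np.random.rand(T),   # right hand eccentricity
        np.random.rand(T),   # head and hand motion
    ])

def group_features(T):
    return np.column_stack([
        np.random.rand(T),   # audio activity, white-board region
        np.random.rand(T),   # audio activity, screen region
        np.random.rand(T),   # mean difference from white-board
        np.random.rand(T),   # mean difference from projector screen
    ])

T = 1000  # number of frames
X = np.hstack([person_features(T) for _ in range(4)] + [group_features(T)])
print(X.shape)  # (1000, 44): 4 participants x 10 features + 4 group features
```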
3. Modeling Meeting Actions • [Figure: a meeting action decomposes into individual participants’ actions, each observed through audio (A) and visual (V) streams] • Two important characteristics of meeting actions: • Multi-modal nature: people participate using speech, gestures, expressions, gaze, written text, devices, and non-speech audio cues such as laughter. • Group nature: while some meeting actions may be attributed to one individual, in general they are the result of interactions within the group.
Multi-stream HMM Variants • A well-known sequence model is the HMM: • a process is a sequence of states, where the current observation depends on the current state, which in turn depends on the previous state. • Some variants for modeling multiple streams include: • Early-integration HMM • Features concatenated into a single vector, modeled with a single HMM (see the sketch below). • Feature-level correlation, frame-synchronous streams. • Multi-stream HMM (MS-HMM) • Each stream modeled independently, with stream likelihoods combined at certain anchor points. • Feature-level independence, allows some inter-stream asynchrony.
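A minimal early-integration sketch, assuming frame-synchronous audio and visual streams. It uses the open-source hmmlearn package purely for illustration (not the toolkit or training setup of the paper), with random placeholder data.

```python
import numpy as np
from hmmlearn import hmm

T, Da, Dv = 1000, 12, 10
audio = np.random.randn(T, Da)    # placeholder audio stream (T x Da)
visual = np.random.randn(T, Dv)   # placeholder visual stream (T x Dv)

# Early integration: concatenate the streams into a single observation
# vector per frame and model the result with one HMM.
X = np.hstack([audio, visual])

model = hmm.GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
model.fit(X)                  # unsupervised EM fit on one long sequence
states = model.predict(X)     # Viterbi decoding of the state sequence
```

In the actual experiments each meeting action would have its own labelled training data; this toy fit is unsupervised and only illustrates the feature concatenation.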
Multi-stream HMM Variants • Further variants capable of modeling multiple streams include: • Asynchronous HMM (A-HMM) • Models multiple streams using a single state sequence, but where a state may emit on one or more streams at a given time, according to a synchronisation variable. • Feature-level correlation, allows inter-stream asynchrony. • Coupled HMM (C-HMM) • Similar to the multi-stream HMM, but the current state in a given stream depends on the previous states of all streams (allowing causal effects between streams). • State-level correlation, allows some inter-stream asynchrony.
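The difference between these variants can be summarised by how the joint state transition factorises. The notation below is ours (state $q_t^{(c)}$ for stream $c$ at time $t$, two streams for brevity), not taken from the slides:

```latex
% Multi-stream HMM: streams evolve independently between anchor points
P\big(q_t^{(1)}, q_t^{(2)} \mid q_{t-1}^{(1)}, q_{t-1}^{(2)}\big)
  = \prod_{c=1}^{2} P\big(q_t^{(c)} \mid q_{t-1}^{(c)}\big)

% Coupled HMM: each stream's next state depends on the previous states
% of all streams
P\big(q_t^{(1)}, q_t^{(2)} \mid q_{t-1}^{(1)}, q_{t-1}^{(2)}\big)
  = \prod_{c=1}^{2} P\big(q_t^{(c)} \mid q_{t-1}^{(1)}, q_{t-1}^{(2)}\big)
```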
A First Set of Experiments • In [3], we investigated recognising meeting actions directly from the multimodal observations. • Multi-stream models were investigated, with the streams being either modalities or participants. • Set of 8 meeting actions: • { monologue (x4), presentation, white-board, discussion, note-taking }. • Corpus: • 59 (30 train, 29 test) 5-minute, 4-person meetings. • Meetings were scripted as sequences of actions, but otherwise natural. • Evaluation metric: • Action Error Rate (AER): analogous to word error rate in ASR (a simplified sketch of the metric follows below). [3] McCowan et al., “Modeling Human Interactions in Meetings”, ICASSP 2003.
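A minimal sketch of the AER metric: align the recognised action sequence against the reference with a standard edit distance and normalise by the reference length. The actual scoring of [3] may report substitutions, insertions and deletions separately; this simplified version returns only the combined rate.

```python
def action_error_rate(reference, hypothesis):
    """Edit distance between recognised and reference action sequences,
    normalised by the reference length (analogous to WER in ASR)."""
    n, m = len(reference), len(hypothesis)
    # Standard Levenshtein dynamic program over action labels.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

ref = ["monologue1", "discussion", "presentation", "white-board"]
hyp = ["monologue1", "presentation", "white-board"]
print(action_error_rate(ref, hyp))  # one deletion over 4 reference actions -> 0.25
```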
Summary of Results • Best systems (Early int. HMM, AV A-HMM) gave 9% AER.
Summary of Results • There is benefit to a multi-modal approach for these actions. • It is important to model correlation between individual participants: • the participant MS-HMM performs poorly relative to the others. • There is no significant asynchrony between audio and visual modalities at the group action level in this case: • no significant difference between frame-level and action-level synchrony in multi-stream AV models. • There is evidence of asynchrony between participants acting in these group meeting actions: • action-level synchrony better than frame-level synchrony for participant MS-HMM and C-HMM.
A Two-Layered Approach • By defining a set of individual actions, we can decompose the meeting action recognition problem into two layers: individual (I) and group (G). • Posteriors from the I-HMM are used as input features to the G-HMM (see the sketch below): • Advantages over a single-layer HMM: • Each HMM has a smaller observation space to model. • The I-HMM is person-independent, so it can be trained with more data from all different participants. • The G-HMM is less sensitive to variations in the low-level audio-visual features. • Extensible to recognise a new meeting action lexicon. • For more on layered HMMs, see Nuria Oliver’s upcoming keynote...
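A minimal sketch of the two-layer idea using hmmlearn on random placeholder features (our illustration, not the paper's implementation): each I-HMM state loosely stands in for one individual action, and the per-frame, per-participant posteriors are concatenated to form the G-HMM observation. In the paper the individual layer is an audio-visual A-HMM trained on labelled data, whereas this toy version is unsupervised.

```python
import numpy as np
from hmmlearn import hmm

# Individual layer: one person-independent I-HMM over low-level
# audio-visual features (arrays here are random placeholders).
T, D, n_indiv_actions, n_people = 1000, 22, 3, 4
person_feats = [np.random.randn(T, D) for _ in range(n_people)]

i_hmm = hmm.GaussianHMM(n_components=n_indiv_actions, n_iter=10)
i_hmm.fit(np.vstack(person_feats), lengths=[T] * n_people)

# Group layer: per-frame posteriors over individual actions, one block
# per participant, become the observation vector of the G-HMM.
posteriors = [i_hmm.predict_proba(f) for f in person_feats]   # each T x 3
G = np.hstack(posteriors)            # T x (4 people * 3 actions) = T x 12

g_hmm = hmm.GaussianHMM(n_components=8, n_iter=10)
g_hmm.fit(G)
group_actions = g_hmm.predict(G)     # decoded group-level state sequence
```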
Example • [Figure: parallel timelines of individual actions for Persons 1-4 (S = speaking, W = writing), plus ‘presentation used’ and ‘white-board used’ indicators, aligned with the resulting group actions: Monologue1 + Note-taking, Discussion, Presentation + Note-taking, White-board + Note-taking]
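To make the figure concrete, here is an illustrative, hand-written mapping from individual actions (S = speaking, W = writing) and device activity to a plausible group action. The real system learns this mapping through the G-HMM; these rules are our own simplification of the example, not part of the system.

```python
# Illustrative only: map who is speaking/writing plus device activity
# to a group action label from the extended lexicon.
def group_action(states, presentation_on, whiteboard_on):
    speakers = [p for p, s in enumerate(states) if s == "S"]
    writers = [p for p, s in enumerate(states) if s == "W"]
    if presentation_on:
        return "presentation + note-taking" if writers else "presentation"
    if whiteboard_on:
        return "white-board + note-taking" if writers else "white-board"
    if len(speakers) == 1:
        base = f"monologue{speakers[0] + 1}"
        return base + " + note-taking" if writers else base
    return "discussion"

# Person 1 speaks while the others take notes, no devices active:
print(group_action(["S", "W", "W", "W"], False, False))
# -> "monologue1 + note-taking"
```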
Experiments with Layered HMMs • Extended set of 14 meeting actions: { ‘discussion’, ‘monologue’ (x4), ‘monologue + note-taking’ (x4), ‘note-taking’, ‘presentation’, ‘presentation + note-taking’, ‘white-board’, ‘white-board + note-taking’ }. • Summary of results [4]: • Best single-layer HMM gave 24% AER • the same system that gave 9% AER on the smaller action set. • Best 2-layer HMM: 15% AER • a 9% absolute improvement, significant at the 96% level. • Individual layer modeled with an AV A-HMM, showing that asynchrony between modalities is important at the individual level. • All 2-layer models performed better than single-layer models. [4] Zhang et al., “Modeling Individual and Group Actions in Meetings with Layered HMMs”, CVPR-EVENT 2004.
Other Ongoing Research • A richer data corpus is being specified, recorded and annotated: • to allow research on a wider variety of meeting actions, • and to research the integration of higher-level features, e.g. words, dialog acts, emotions, etc. • Recognition of group level of interest: • we currently achieve 73% frame accuracy recognising {‘high’, ‘neutral’} on the same corpus using the same feature set. • Unsupervised clustering of meeting actions: • promising initial results achieved using unsupervised training of the group-level HMM in the 2-layer system. • Investigating tractable approximations for asynchronous HMMs with many streams: • the current implementation is only tractable for 2 streams.
Summary • Structuring of meetings as sequences of group meeting actions is an interesting research task: • how to deal with data of many different modalities? • how to incorporate high-level features, like words, emotion, dialog acts? • how to efficiently model large numbers of interacting streams, allowing for potential asynchrony? • how to allow variable number of active streams? • how far can we go towards recognising more interesting high-level actions in meetings?