Recognition of Dialogue Acts in Meetings Alfred Dielmann – a.dielmann@ed.ac.uk Steve Renals – s.renals@ed.ac.uk Centre for Speech Technology Research University of Edinburgh
Agenda • Introduction • Meeting data corpus • System overview • Feature set • Factored Language Models • DBN infrastructure • Experimental results • Conclusions and future directions
Goal
Automatic meeting structuring through Dialogue Act recognition, supporting applications such as summarisation, meeting phase detection, language models for automatic speech recognition, and topic detection and tracking
Dialogue Acts
“Dialog Acts reflect the functions that utterances serve in a discourse” [Ji et al. 05] – the building blocks of a conversation
• Several DA coding schemes can be defined:
  • Targeted at different conversational aspects
  • Characterised by multiple hierarchical levels
  • Different numbers of DA labels
ICSI meeting corpus (1)
• Naturally occurring meetings
  • Unconstrained human-to-human interactions
  • Unconstrained recording conditions
• 75 meetings of 4-10 participants (average 6 participants)
• 72 hours of multi-channel audio data
  • Head-mounted microphones
  • 4 tabletop microphones
• Fully transcribed
• Annotated in terms of Dialogue Acts
  • MRDA scheme: 11 generic tags + 39 specific sub-tags
  • More than 2000 unique DA labels
  • Mappings from MRDA tags to reduced tag sets
ICSI meeting corpus (2)
• Five broad DA categories (obtained by manually grouping MRDA tags):
  • Statements, Questions, Back-channels, Fillers, Disruptions
• Imbalanced distribution across DA categories
• Same data-set subdivision as in Ang et al. [2005]:
  • Training (51 meetings) / Development (11) / Test (11)
Methodology
• Task definition:
  • Joint approach: DA segmentation and classification executed concurrently, as a single step, instead of sequentially
  • Generative approach: observable sequences of words Wt (sentences) and features Yt are generated by hidden Dialogue Acts
• System building blocks:
  • Trainable statistical model (dynamic Bayesian network)
  • Factored language model (maps word sequences onto DA units)
  • Feature extraction component (DA segmentation)
  • Discourse model: trigram language model over DA label sequences
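The discourse model above is a trigram LM over DA label sequences. A toy count-based sketch of the idea (the class name and add-one smoothing are illustrative assumptions; the actual system uses a properly smoothed n-gram LM):

```python
from collections import Counter, defaultdict

class DiscourseTrigram:
    """Toy trigram discourse model over DA labels:
    p(d_t | d_{t-2}, d_{t-1}) from counts, with add-one smoothing
    over a fixed label set (illustrative, not the paper's estimator)."""
    def __init__(self, labels):
        self.labels = labels
        self.counts = defaultdict(Counter)   # (d_{t-2}, d_{t-1}) -> Counter of d_t

    def train(self, da_sequence):
        seq = ["<s>", "<s>"] + list(da_sequence)
        for i in range(2, len(seq)):
            self.counts[(seq[i - 2], seq[i - 1])][seq[i]] += 1

    def prob(self, d, hist):
        c = self.counts[hist]
        return (c[d] + 1) / (sum(c.values()) + len(self.labels))
```

During decoding, such a model scores each hypothesised DA label given the two previously recognised labels.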
System overview (block diagram)
Inputs: transcriptions → Factored Language Model; multi-channel audio recordings → Automatic Speech Recognition and feature extraction; DA labels (training) → Discourse Model
Core: dynamic Bayesian network infrastructure for joint DA segmentation and DA tagging
Output: DA segment boundaries + DA labels
ASR transcription
• Goal: estimate the degradation in DA recognition performance caused by an imperfect transcription
• Word-level transcription kindly provided by the AMI ASR team
  • Baseline system developed in early 2005
  • Based on a PLP front-end, decision-tree clustered cross-word triphone models and an interpolated bigram LM
  • Trained on ICSI meetings
  • Covers the whole ICSI corpus through a 4-fold cross-validation
  • Word Error Rate of about 29% (ideal for our task)
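The Word Error Rate quoted above is the standard edit-distance metric from speech recognition. A minimal single-row dynamic-programming sketch (illustrative, not the AMI scoring tool):

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + insertions + deletions) / len(ref),
    computed by Levenshtein edit distance over word lists."""
    d = list(range(len(hyp) + 1))        # row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i             # prev holds d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion of reference word
                      d[j - 1] + 1,      # insertion of hypothesis word
                      prev + (r != h))   # substitution or match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference yields a WER of 1/3.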
Features
• 6 continuous word-related features:
  • F0 mean and variance (normalised against baseline pitch)
  • RMS energy (normalised by channel and by typical word energy)
  • Word length (normalised by typical duration)
  • Word relevance (local term frequency / absolute term frequency)
  • Pause duration (re-scaled inter-word pauses)
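Two of these features can be sketched directly from transcriptions. A toy example, assuming a hypothetical corpus-wide term-frequency table `corpus_freq` (the real system also derives F0, RMS energy and word-length features from the audio):

```python
from collections import Counter

def word_features(words, pauses, corpus_freq):
    """Toy sketch of two word-level features described above:
    'word relevance' (local term frequency / absolute term frequency)
    and re-scaled inter-word pauses (here simply normalised by the
    longest pause in the segment)."""
    local_tf = Counter(w.lower() for w in words)
    max_pause = max(pauses) or 1.0       # avoid division by zero
    return [{"word": w,
             "relevance": local_tf[w.lower()] / corpus_freq.get(w.lower(), 1),
             "pause": p / max_pause}
            for w, p in zip(words, pauses)]
```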
Factored Language Models
Generalised language models in which words and word-related features (factors) are bundled together
What can be a factor? Nearly everything: words themselves, stems, part-of-speech tags, relative position in the sentence, morphological classes, DAs, …
Goal: factorise the joint probability of a sentence in terms of factor-related conditional probabilities, e.g. p(wt | wt-1, nt, dt) for the three-factor model used here
Factored Language Models
DA tagging accuracies for the FLM comparison task have been estimated by integrating a simple decoder into the SRILM toolkit
Score: % correct against the reference DA labels
Factored Language Models
The relationship between words and DAs has been modelled using a three-factor FLM and Kneser-Ney discounting
Factors: wt : word; nt : relative word position; dt : dialogue act label (hidden)
Other candidate factors: pt : part-of-speech label; mt : meeting type; st : word stem; …
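The core idea of conditioning a word on several factors, with backoff across them, can be sketched as follows (a toy count-based model with naive backoff; the paper uses SRILM with Kneser-Ney discounting, and the class and backoff path here are illustrative assumptions):

```python
from collections import Counter, defaultdict

class TinyFLM:
    """Toy factored LM sketch: p(word | prev_word, da), backing off
    to p(word | da) and then to the unigram p(word) when a context
    is unseen.  Not the paper's estimator -- just the factoring idea."""
    def __init__(self):
        self.tri = defaultdict(Counter)   # (prev_word, da) -> word counts
        self.bi = defaultdict(Counter)    # da -> word counts
        self.uni = Counter()              # word counts

    def train(self, sentence, da):
        prev = "<s>"
        for w in sentence:
            self.tri[(prev, da)][w] += 1
            self.bi[da][w] += 1
            self.uni[w] += 1
            prev = w

    def prob(self, w, prev, da):
        # fall through the backoff chain until a context has a count
        for counts in (self.tri[(prev, da)], self.bi[da], self.uni):
            total = sum(counts.values())
            if total and counts[w]:
                return counts[w] / total
        return 1e-9   # floor for fully unseen words
```

Scoring a candidate sentence under each possible DA label and taking the argmax yields a simple FLM-based tagger.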
DBNs
• Bayesian Networks are directed probabilistic graphical models:
  • Nodes represent random variables
  • Directed arcs represent conditional (in)dependencies among variables
• DBNs = extension of BNs to process data sequences or time-series:
  • Instantiating a static BN for each temporal slice t
  • Making temporal dependencies between variables explicit
• Switching DBNs (Bayesian multi-nets) are adaptive DBNs able to change their internal topology according to the state of one or more variables (switching nodes)
HMMs, Kalman filter models, and many other state-space models can be represented under the same formalism
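As a concrete instance of the last point, an HMM unrolled over time is a two-slice DBN, and inference reduces to the forward recursion. A minimal sketch with illustrative numbers (not taken from the paper):

```python
# An HMM viewed as a DBN: hidden state q_t depends on q_{t-1},
# observation y_t depends on q_t.  Forward recursion:
#   alpha_t(j) = p(y_t | q_t=j) * sum_i alpha_{t-1}(i) * A[i][j]
A = [[0.9, 0.1], [0.2, 0.8]]    # transition p(q_t = j | q_{t-1} = i)
B = [[0.7, 0.3], [0.4, 0.6]]    # emission   p(y_t | q_t = i)
pi = [0.5, 0.5]                 # initial state distribution

def forward(obs):
    """Return p(y_1..y_T), marginalising over hidden state sequences."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(2)]
    for y in obs[1:]:
        alpha = [B[j][y] * sum(alpha[i] * A[i][j] for i in range(2))
                 for j in range(2)]
    return sum(alpha)
```

The switching DBN used here generalises this: the conditional dependencies applied at slice t themselves depend on the boundary node Et.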
Generative DBN model (1)
• Implemented through a switching DBN model:
  • Switching variable: DA boundary detector node Et
  • 2 operative conditions → 2 switching model topologies
• The “intra DA” topology (Et=0) operates by modelling a sequence of words Wt:T which is assumed to be part of a single DA unit
  • Updates the FLM-based DA estimations (joint sentence probabilities: p(Wt | Wt-1, Nt, DA0t))
  • Updates a set of deterministic counter variables (word counter Ct and word block counter Nt)
Note: for clarity the next slides show only the BN slices that are actually duplicated for t > 1; the network topologies adopted for t = 0, 1 also take care of variable initialisation
Generative DBN model (2)
• The “inter DA” state is active only when a transition between different DA units is likely (Et=1)
  • Models the DA unit transition process (DA hypotheses generation)
  • Integrates the discourse model probability
  • Initialises the deterministic counter variables (Ct, Nt) and forces the FLM to start a new set of estimations (backoff to unigrams)
  • Updates the DA recognition history (DA1t and DA2t)
• The probability of encountering a new DA boundary (Et=1) is estimated, in both operative states, from:
  • The observable continuous feature vector Yt (GMMs)
  • The word block counter Nt (DA duration model)
  • The previous DA recognition history DAkt
Performance evaluation
• Segmentation error metric:
  • NIST Sentence-like Unit (NIST-SU*): sum of missed DA boundaries and false alarms, divided by the number of reference DA units
• Recognition error metrics:
  • NIST Sentence-like Unit (NIST-SU*): sum of missed DAs, false alarms and substitutions, divided by the number of reference DA units
  • SClite: sum of % DA substitution, insertion and deletion errors after a time-mediated DA alignment (same as the Word Error Rate metric used in speech recognition)
  • Lenient*: 100% minus the percentage of correctly classified words (ignoring DA boundaries)
* The NIST Sentence-like Unit and Lenient metrics are defined in [Ang et al. 2005]
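The NIST-SU segmentation error above can be sketched in a few lines, representing DA boundaries as sets of word-index positions (an illustrative simplification of the NIST scoring procedure):

```python
def nist_su_seg_error(ref_bounds, hyp_bounds, n_ref_units):
    """NIST-SU segmentation error as described above:
    (missed boundaries + false alarms) / number of reference DA units.
    ref_bounds / hyp_bounds: sets of boundary positions (word indices)."""
    missed = len(ref_bounds - hyp_bounds)        # reference boundaries not hypothesised
    false_alarms = len(hyp_bounds - ref_bounds)  # hypothesised boundaries with no reference
    return (missed + false_alarms) / n_ref_units
```

Note that the error can exceed 100%, since false alarms are unbounded relative to the number of reference units.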
Experimental results
DA classification based on (FLM) lexical features plus a 3-gram discourse model (the discourse model improves tagging by ~5%)
Note: DA tagging accuracy has been estimated by providing the ground-truth segmentation (forcing the state of the Et nodes)
Recognition results
All evaluation metrics show consistent trends; encouraging results can be achieved on ASR transcriptions
• “Pause duration” features play a key role in the segmentation task, but optimal recognition performance is achieved only with the fully comprehensive feature setup
• A system fully trained on ASR transcriptions performs slightly better than one trained on clean transcriptions and tested on ASR output
  • Due to the mismatch between reference and ASR word lists, and systematic substitutions
Conclusions (1)
• Task:
  • Automatic recognition of five broad DA categories: statements, questions, back-channels, fillers and disruptions
• Approach:
  • A switching-DBN-based infrastructure (Bayesian multi-net) oversees the DA recognition process (joint segmentation and classification) and integrates a heterogeneous set of technologies:
    • Feature-based DA segmentation
    • Factored Language Model for DA classification
    • N-gram DA discourse model (3-gram)
  • The graphical infrastructure encourages the reuse of common resources (such as the discourse model and the word counters) and learns the optimal recognition strategy from data without external supervision
  • The joint approach operates on a wide search space
Conclusions (2)
• Results:
  • Small gap between FLM-based DA tagging and maximum entropy DA classification [Ang et al. 2005]
  • The concurrent evaluation of multiple DA segmentation + tagging hypotheses (joint approach) provides low recognition error rates…
  • … and seems to cope well with imperfect word transcriptions: 29% WER on the ASR output causes less than 10% degradation in the DA recognition output
• Further directions:
  • Work in progress on the AMI meeting corpus (17 DA classes)
  • Tuning of the FLM and investigation of new factors
  • Experiments with multimodal features
  • Integration of automatic DA recognition into the “meeting action detection” framework