Identifying Agreement and Disagreement in Conversational Speech: Use of Bayesian Networks to Model Pragmatic Dependencies Michel Galley, Kathleen McKeown, Julia Hirschberg Columbia University Elizabeth Shriberg SRI International
Motivation
• Problem: identification of agreements and disagreements between participants in meetings.
• Ultimate goal: automatic summarization. This enables us to generate "minutes" of meetings highlighting the debate that affected each decision.
Example
4-way classification: AGREE, DISAGREE, BACKCHANNEL, OTHER
Previous work
• Decision-tree classifiers [Hillard et al. 03]
  • CART-style tree learner.
  • Features local to the utterance: lexical, durational, and acoustic.
• Reasonably good accuracy in a 3-way classification (AGREE, DISAGREE, OTHER):
  • 71% with ASR output;
  • 82% with accurate transcription.
Extend [Hillard et al. 03] by investigating the effect of context
• Empirical questions:
  • Are preceding agreements/disagreements good predictors for the classification task?
  • Does the current label (agreement/disagreement) depend on the identity of the addressee?
  • Should we distinguish preceding labels by the identity of their corresponding addressees?
• The studies we report on show that the preceding context supplies good predictors.
• Addressee identification is instrumental to analyzing the preceding context.
Agreement/disagreement classification in two steps
• Step 1: Addressee identification
  • Large corpus of labeled adjacency pairs (APs): paired utterances A and B, e.g. question-answer, offer-acceptance, apology-downplay.
  • Train a system to determine who is the addressee (A-part) of any given utterance (B-part) in a meeting.
• Step 2: Agreement/disagreement classification
  • Features local to the utterance and pertaining to immediately preceding speech and silences.
  • Label-dependency features: dependencies between the current label (agree, disagree, …) and previous labels in a Bayesian network.
  • Addressee identification defines the topology of the Bayesian network.
(A high-level sketch of the two-step pipeline follows below.)
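For concreteness, here is a minimal Python sketch of the two-step pipeline described above. All names (Utterance, rank_addressee, classify_label, tag_meeting) are illustrative placeholders, not from the original system.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    speaker: str                       # who produced the B-part
    text: str
    start: float                       # start time in seconds
    end: float                         # end time in seconds
    addressee: Optional[str] = None    # filled in by step 1
    label: Optional[str] = None        # AGREE / DISAGREE / BACKCHANNEL / OTHER

def rank_addressee(utt: Utterance, context: List[Utterance]) -> str:
    """Step 1 (placeholder): pick the most probable A-part speaker for utt."""
    ...  # maxent ranker over candidate speakers (see the AP identification slides)

def classify_label(utt: Utterance, history: List[Utterance]) -> str:
    """Step 2 (placeholder): agree/disagree classification with local features
    and label-dependency features; the addressee links determine which
    previous labels the current one depends on."""
    ...

def tag_meeting(utterances: List[Utterance]) -> None:
    for i, utt in enumerate(utterances):
        utt.addressee = rank_addressee(utt, utterances[:i])   # step 1
        utt.label = classify_label(utt, utterances[:i])       # step 2
```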
Corpus annotation
• ICSI meeting corpus: 75 informal meetings recorded at UC Berkeley, averaging one hour each and ranging from 3 to 9 participants.
• Adjacency pair annotation [Dhillon et al. 04]:
  • All 75 meetings labeled with dialog acts and adjacency pairs.
• Agreement/disagreement annotation [Hillard et al. 03]:
  • Annotation of 4 meeting segments, plus tags for 4 additional meetings obtained with a clustering method [Hillard et al. 03].
  • 8135 labeled utterances: 11.9% agreements, 6.8% disagreements, 23.2% backchannels, 58.1% other.
  • Inter-labeler reliability: kappa coefficient of .63.
Step 1: Addressee (AP) identification
• Baseline algorithm:
  • always assume that the addressee in an adjacency pair (A, B) is the party who spoke last before B.
  • Works reasonably well: 79.8% accuracy.
• Our method: speaker ranking
  • rank all speakers S = (s1, …, sN) with probabilities reflecting how likely they are to be speaker A (i.e. the addressee).
  • Log-linear (maximum entropy) probability model for ranking (see below).
  • Each di in D = (d1, …, dN) is the set of observations pertaining to speaker si and to the last utterance of speaker si.
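The ranking formula itself did not survive the slide export. A standard conditional log-linear (maximum entropy) ranker consistent with the description above, with generic feature functions f_k and weights λ_k, would be:

```latex
P(s_i \mid D) \;=\;
  \frac{\exp\big(\sum_k \lambda_k\, f_k(d_i)\big)}
       {\sum_{j=1}^{N} \exp\big(\sum_k \lambda_k\, f_k(d_j)\big)},
\qquad
\hat{A} \;=\; \arg\max_{s_i \in S} P(s_i \mid D)
```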
Features for AP identification
• Structural:
  • number of speakers taking the floor between A and B (this single feature matches the baseline: 79.8%).
• Durational:
  • duration of A (short utterances generally do not elicit responses/reactions).
  • seconds of overlap with any other speaker (competitive speech is incompatible with AP construction).
• Lexical:
  • number of n-grams shared by A and B (uni- to trigrams): A and B parts often have some words in common.
  • first word of A: exploits cue words and detects wh-questions.
  • is the B speaker (addressee) named explicitly in A?
(A feature-extraction sketch follows below.)
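A minimal sketch of how these features could be computed, assuming tokenized utterances; all function and argument names are illustrative rather than taken from the original system.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ap_features(a_tokens: List[str], b_tokens: List[str],
                a_duration: float, a_overlap: float,
                speakers_between: int, b_speaker_name: str) -> dict:
    """Feature vector for one candidate A-part, given the B-part."""
    feats = {
        # structural: how many other speakers took the floor between A and B
        "speakers_between": speakers_between,
        # durational: short A-parts rarely elicit a response
        "a_duration": a_duration,
        # durational: overlapped (competitive) speech rarely forms an AP
        "a_overlap_seconds": a_overlap,
        # lexical: cue word / wh-question detection
        "a_first_word": a_tokens[0].lower() if a_tokens else "<empty>",
        # lexical: is the B speaker named explicitly in A?
        "b_named_in_a": b_speaker_name.lower() in (t.lower() for t in a_tokens),
    }
    # lexical: n-gram overlap between A and B (unigrams to trigrams)
    for n in (1, 2, 3):
        feats[f"shared_{n}grams"] = len(ngrams(a_tokens, n) & ngrams(b_tokens, n))
    return feats
```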
Adjacency pair identification: results
• Experimental setting:
  • 40 meetings used for training (9104 APs);
  • 10 meetings used for testing (1723 APs);
  • 5 meetings of a held-out set used for forward feature selection and regularization (Gaussian smoothing).
Step 2: Agreement/disagreement classification: local features of the utterance
• Local features of the utterance include those used in [Hillard et al. 03] (but no acoustics). Best predictors:
• Lexical features:
  • agreement and disagreement markers [Cohen, 02], adjectives with positive/negative polarity [Hatzivassiloglou and McKeown, 97], general cue phrases [Hirschberg and Litman, 94];
  • first word of the utterance;
  • score of the utterance under four class-specific language models (one per class).
• Structural and durational features:
  • duration of the utterance;
  • speech rate.
(A sketch of the class-LM score feature follows below.)
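The class-LM score feature can be illustrated with a toy unigram model; the slide does not specify the LM order, so the unigram choice and all names here are only an assumption.

```python
import math
from collections import Counter
from typing import Dict, List

CLASSES = ("AGREE", "DISAGREE", "BACKCHANNEL", "OTHER")

def train_unigram_lm(utterances: List[List[str]]) -> Dict[str, float]:
    """Add-one-smoothed unigram log-probabilities for one class."""
    counts = Counter(tok for utt in utterances for tok in utt)
    total = sum(counts.values())
    vocab = len(counts) + 1                     # +1 slot for unseen tokens
    lm = {tok: math.log((c + 1) / (total + vocab)) for tok, c in counts.items()}
    lm["<unk>"] = math.log(1 / (total + vocab))
    return lm

def lm_score_features(tokens: List[str],
                      lms: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """One length-normalized log-probability feature per class LM."""
    n = max(len(tokens), 1)
    return {f"lm_{c}": sum(lm.get(t, lm["<unk>"]) for t in tokens) / n
            for c, lm in lms.items()}
```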
Label dependencies in sequence classification
• A previous-tag feature p(ci | ci-1) is helpful in many NLP applications as a model of context: POS tagging, supertagging, dialog act classification.
• Various families of Markov models can be trained (e.g. HMMs, CMMs, CRFs).
• Limitations of fixed-order Markov models for representing multi-party conversations:
  • overlapping speech: no strict label ordering;
  • multiple speakers with different opinions: the previous tag (speaker A) might affect the current tag (speaker B addressing A), but this is unlikely if B addresses C.
Label dependency: previous-tag
• Intuition: the previous tag affects the current tag.
[Figure: tag sequence for A speaking to B, plotted by tag index; BACKCHANNEL tags ignored for better interpretability.]
Label dependency: same-interactants previous tags
• Intuition: if A disagreed with B (when A last spoke to B), then A is likely to disagree with B again.
Label dependency: symmetry
• Intuition: if B disagreed with A (when B last spoke to A), then A is likely to disagree with B.
Label dependency: transitivity
• Intuition: if A disagrees with C after C agreed with B, then we might expect A to disagree with B as well.
(A sketch of how these dependency links could be located is shown below.)
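A rough sketch of how the three dependency links above could be located in a meeting transcript. The exact definitions in the paper are tied to the adjacency-pair structure from step 1, so the lookup below, especially for transitivity, is only an approximation with illustrative names.

```python
from typing import Dict, List, NamedTuple, Optional

class Turn(NamedTuple):
    speaker: str
    addressee: str
    label: Optional[str]   # AGREE / DISAGREE / OTHER, filled during decoding

def dependency_parents(turns: List[Turn], i: int) -> Dict[str, Optional[int]]:
    """Indices of earlier turns that the current label may depend on."""
    cur = turns[i]
    same_interactants = symmetry = transitivity = None
    for j in range(i - 1, -1, -1):
        prev = turns[j]
        # same-interactants: last time the same speaker addressed the same person
        if same_interactants is None and \
           (prev.speaker, prev.addressee) == (cur.speaker, cur.addressee):
            same_interactants = j
        # symmetry: last time the current addressee addressed the current speaker
        if symmetry is None and \
           (prev.speaker, prev.addressee) == (cur.addressee, cur.speaker):
            symmetry = j
        # transitivity: approximated here as the last turn in which a third
        # party C addressed the current addressee B, before A (now) reacts
        if transitivity is None and prev.addressee == cur.addressee and \
           prev.speaker not in (cur.speaker, cur.addressee):
            transitivity = j
    return {"same_interactants": same_interactants,
            "symmetry": symmetry,
            "transitivity": transitivity}
```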
Parameter estimation
• We use (dynamic) Bayes nets to factor the conditional probability distribution P(C | D), where:
  • C = (c1, …, cL): sequence of labels;
  • D = (d1, …, dL): sequence of observations;
  • pa(ci): parents of ci, i.e. the label dependencies described above.
• A (maximum entropy) log-linear model is used to estimate the probability of the dynamic variable ci (see the reconstructed formulas below).
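The two formulas referenced on this slide were lost in the slide export; the standard forms implied by the definitions above, with generic feature functions f_k and weights λ_k, are:

```latex
P(C \mid D) \;=\; \prod_{i=1}^{L} P\big(c_i \,\big|\, \mathrm{pa}(c_i),\, d_i\big),
\qquad
P\big(c_i \,\big|\, \mathrm{pa}(c_i),\, d_i\big) \;=\;
  \frac{\exp\big(\sum_k \lambda_k\, f_k(c_i, \mathrm{pa}(c_i), d_i)\big)}
       {\sum_{c'} \exp\big(\sum_k \lambda_k\, f_k(c', \mathrm{pa}(c_i), d_i)\big)}
```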
Decoding the maximizing sequence
• Beam search:
  • maintain a beam of the B most likely left-to-right partial sequences (as in [Ratnaparkhi 96] for POS tagging);
  • in theory, search errors are possible;
  • in practice, our search is seldom affected by the beam size if it isn't too small: B = 100 is a reasonable value for any sequence.
(A beam-search sketch follows below.)
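A minimal beam-search sketch over label sequences; log_prob stands in for the maxent model P(ci | pa(ci), di) of the previous slide, and all names are illustrative.

```python
from typing import Callable, List, Tuple

LABELS = ("AGREE", "DISAGREE", "BACKCHANNEL", "OTHER")

def beam_search(num_utts: int,
                log_prob: Callable[[int, Tuple[str, ...], str], float],
                beam_size: int = 100) -> List[str]:
    """Return an (approximately) most likely label sequence, left to right."""
    beam: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for i in range(num_utts):
        # extend every partial sequence in the beam with every possible label
        candidates = [(score + log_prob(i, prefix, label), prefix + (label,))
                      for score, prefix in beam
                      for label in LABELS]
        # keep only the beam_size best partial sequences
        beam = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_size]
    return list(beam[0][1])
```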
Results: comparison to previous work
• 3-way classification (AGREE, DISAGREE, OTHER) as in [Hillard et al. 03]; priors are normalized.
• The best-performing feature set represents a 27.3% error reduction over [Hillard et al. 03].
Results: comparison to previous work
• 3-way classification (AGREE, DISAGREE, OTHER) as in [Hillard et al. 03]; priors are normalized.
• Label-dependency features reduce error by 9%.
Results: 4-way classification
• 6-fold cross-validation, each fold held out on one meeting, for a total of 8135 utterances to classify.
• Contribution of label dependencies for different feature sets:
Results: 4-way classification
• Accuracies by label dependency type (assuming all other features, i.e. structural, durational, and lexical, are used):
Conclusion and future work
• Conclusion:
  • Performed addressee identification as a byproduct of agreement/disagreement classification.
  • AP identification significantly outperforms a competitive baseline.
  • Compelling evidence that models incorporating label-dependency features are superior.
• Future work:
  • Summarization: identification of the propositional content that was agreed or disagreed upon.
  • Addressee identification may also be beneficial for dialog act labeling of multi-party speech.