
Exploiting lexical information for Meeting Structuring

This study aims to recognize events in meetings involving different communicative modalities and integrate lexical information into meeting structuring using Dynamic Bayesian Network models.


Presentation Transcript


  1. Exploiting lexical information for Meeting Structuring
  Alfred Dielmann, Steve Renals (University of Edinburgh)
  {a.dielmann@ed.ac.uk, s.renals@ed.ac.uk}

  2. Meeting Structuring (1)
  • Goal: recognise events that involve one or more communicative modalities:
    • Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
  • Working environment: the "IDIAP framework":
    • 30 + 23 five-minute meetings, each with 4 participants
    • 4 audio-derived features:
      • Speaker turns (derived from microphone array localisation)
      • Prosodic features: RMS energy, F0, rate of speech (a feature-extraction sketch follows below)
  • But we'd like to integrate other features…
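As a concrete illustration of the prosodic stream, here is a minimal sketch of frame-level RMS energy extraction; the frame and hop sizes are assumptions (25 ms / 10 ms at 16 kHz), and F0 and rate of speech would come from a separate pitch tracker and the ASR output rather than from this snippet.

```python
# Minimal sketch (not the authors' extraction pipeline): frame-level RMS energy.
import numpy as np

def rms_energy(samples: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return the RMS energy of each analysis frame of a mono signal."""
    samples = samples.astype(np.float64)          # avoid integer overflow when squaring
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(energies)

# e.g. rms = rms_energy(signal)   # 25 ms frames, 10 ms hop for 16 kHz audio
```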

  3. Meeting Structuring (2)
  • We're working with Dynamic Bayesian Network based models, and at the previous meeting we proposed two models.
  • The first one is characterised by:
    • Decomposition of actions {A_t} into "sub-actions" {S_t}
    • Early feature integration
    • A counter structure (which reduces the number of insertions)
  [Slide figure: DBN unrolled over time, with counter-related nodes C and E, action nodes A_t, sub-action nodes S_t^1 and observation nodes Y_t^1]
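Purely as a rough illustration (this is not the authors' implementation, and the conditional dependencies are simplified, e.g. the counter is reduced to a change count), one generative time step of such a model could be sketched as:

```python
# Sketch of one time step of the first model: action A_t, hidden sub-action S_t,
# a counter C_t related to action changes, and an early-integrated observation Y_t.
import numpy as np

rng = np.random.default_rng(0)

def dbn_step(a_prev, s_prev, c_prev, P_A, P_S, emit):
    """Sample (A_t, S_t, C_t, Y_t) given the previous hidden state.

    P_A[a_prev]      -- action transition distribution (vector summing to 1)
    P_S[a_t][s_prev] -- sub-action transition distribution, conditioned on the action
    emit[a_t][s_t]   -- (mean, cov) of a Gaussian over the early-integrated features
    """
    a_t = rng.choice(len(P_A), p=P_A[a_prev])
    c_t = c_prev + 1 if a_t != a_prev else c_prev   # counter ticks on action changes
    s_t = rng.choice(len(P_S[a_t]), p=P_S[a_t][s_prev])
    mean, cov = emit[a_t][s_t]
    y_t = rng.multivariate_normal(mean, cov)        # early-integrated observation vector
    return a_t, s_t, c_t, y_t
```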

  4. Meeting Structuring (3)
  • The second model extends the previous one through multi-stream processing (avoiding early integration):
    • Different feature groups are processed independently
    • Parallel independent HMM chains are each responsible for only one part of the feature set
    • Cardinalities of {S_t^n} are part of the model
    • Hidden sub-states {S_t^n} are a result of the training process
  [Slide figure: multi-stream DBN with shared nodes C, E and A_t, parallel sub-action chains S_t^1, S_t^2 and per-stream observations Y_t^1, Y_t^2]
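Again only as a sketch of the assumed structure (the counter nodes are omitted here for brevity), the multi-stream variant replaces the single sub-action chain with one chain per feature group, all conditioned on the same action A_t:

```python
# Sketch of one time step of the multi-stream model: the action A_t is shared,
# while each feature group n has its own sub-action S_t^n and observation Y_t^n.
import numpy as np

rng = np.random.default_rng(0)

def multistream_step(a_prev, s_prev, P_A, streams):
    """streams: list of (P_S, emit) pairs, one per feature group (e.g. turns, prosody)."""
    a_t = rng.choice(len(P_A), p=P_A[a_prev])
    s_t, y_t = [], []
    for n, (P_S, emit) in enumerate(streams):
        s_n = rng.choice(len(P_S[a_t]), p=P_S[a_t][s_prev[n]])   # per-stream sub-action
        mean, cov = emit[a_t][s_n]
        s_t.append(s_n)
        y_t.append(rng.multivariate_normal(mean, cov))           # per-stream observation
    return a_t, s_t, y_t
```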

  5. Further developments
  • The adopted feature set has to be extended!
    • Integration of gestures / body movements (VIDEO)
    • Integration of lexical information (ASR)
  • The problem presents some analogies with "Topic Detection and Tracking":
    • Correlate the sequence of transcribed words with the sequence of "meeting actions"
    • Work in progress!
    • Discover homogeneous partitions in a transcription, according to the communicative phase of the meeting (Dialogue, Monologue, …)

  6. TDT approaches
  • Lexical cohesion based, like "TextTiling":
    • A lexical cohesion function is evaluated at the transition between two adjacent windows, in order to find topically coherent passages and highlight candidate topic boundaries (see the sketch below)
  • Feature based, reducing topic segmentation to a statistical classification problem over:
    • Lexical features: perplexity, mutual information, other information content measures
    • Cue phrases (words that are frequent at topic changes)
    • Short/long range language models: n-gram, binomial distribution
    • Prosodic features: pauses, F0, cross-talk, speaker changes, …
  • Mixed approach (lexical cohesion + feature classification)
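A rough TextTiling-style sketch (simplified, with an assumed window of 50 words): lexical cohesion between the two adjacent windows is measured as cosine similarity over word counts, and local dips in the score mark candidate boundaries.

```python
# Sketch of a lexical cohesion score at position i of a word sequence.
from collections import Counter
import math

def cohesion(words, i, w=50):
    """Cosine similarity between the w words before and after position i."""
    left, right = Counter(words[max(0, i - w):i]), Counter(words[i:i + w])
    dot = sum(left[t] * right[t] for t in left.keys() & right.keys())
    norm = math.sqrt(sum(v * v for v in left.values())) * \
           math.sqrt(sum(v * v for v in right.values()))
    return dot / norm if norm else 0.0

# scores = [cohesion(words, i) for i in range(50, len(words) - 50)]
# local minima of `scores` are candidate topic boundaries
```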

  7. Feature based approach
  • A lexical feature based approach could be:
    • Easily transferred from the TDT problem to the meeting segmentation one
    • Quickly integrated with the proposed DBN models
  • Mutual information: "the amount of information that one random variable X contains about another random variable Y"
  • Interesting starting points for further experiments:
    • Investigate a lexical function that discriminates between different communicative phases
    • Look for a list of cue-phrases that highlights "Meeting Action" boundaries
    • ……
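For reference, a small sketch of that quantity estimated from counts, I(W; C) = Σ p(w,c) log [ p(w,c) / (p(w) p(c)) ], where W is a word and C a meeting-action class; pairing each word token with the action label of its segment is an assumption about how the statistic would be collected.

```python
# Sketch: mutual information between words and meeting-action labels, from counts.
import math
from collections import Counter

def mutual_information(word_label_pairs):
    """word_label_pairs: iterable of (word, action_label) tokens from training meetings."""
    joint = Counter(word_label_pairs)
    n = sum(joint.values())
    p_w, p_c = Counter(), Counter()
    for (w, c), k in joint.items():
        p_w[w] += k
        p_c[c] += k
    mi = 0.0
    for (w, c), k in joint.items():
        p_wc = k / n
        mi += p_wc * math.log(p_wc / ((p_w[w] / n) * (p_c[c] / n)))
    return mi

# mutual_information([("okay", "Dialogue"), ("slide", "Monologue"), ...])
```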

  8. Problems (1)
  • Challenging conditions: spoken, multiparty dialogues, in an unrestricted domain
    • We must cope with speech recognition errors!
  • Insufficient training/testing data
    • Only 30 meetings are fully transcribed (~25k words)
    • Especially for Agreement/Disagreement (4% of the corpus)
  • These first experiments are based on hand-labelled transcriptions, and attempt to discriminate only between Monologue & Dialogue

  9. Lexical classification (1)
  • Each word of the testing corpus is compared with every "Meeting Action" lexical model (Monologue model, Dialogue model, …) and classified by maximizing the mutual information
  • The classifier output is then passed through an output filter
  [Slide figure: ASR transcript fed to the Monologue / Dialogue / … lexical models, followed by a MAX decision and an output filter]
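Sketched below (an assumed scoring scheme, not necessarily the exact one used): each word is assigned the class whose lexical model it is most associated with, here via pointwise mutual information estimated on the training transcriptions.

```python
# Sketch: per-word classification by maximizing a word/class association score.
import math
from collections import Counter

def build_models(training_pairs):
    """training_pairs: (word, action_label) tokens from the transcribed meetings."""
    joint = Counter(training_pairs)
    words = Counter(w for w, _ in training_pairs)
    classes = Counter(c for _, c in training_pairs)
    return joint, words, classes, sum(joint.values())

def classify_word(word, joint, words, classes, n):
    """Return the class (e.g. "Monologue" or "Dialogue") with the highest PMI for this word."""
    def pmi(c):
        p_wc = joint[(word, c)] / n
        if not p_wc:
            return float("-inf")
        return math.log(p_wc / ((words[word] / n) * (classes[c] / n)))
    return max(classes, key=pmi)
```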

  10. Lexical classification (2)
  • Each recognised word is classified as a "Monologue word" or a "Dialogue word"
  • This stream of symbols is then filtered (de-noised):
    • Considering a moving window
    • Estimating the temporal density of each class (Monologue, Dialogue)
    • The class with the higher symbol density (frequency) is the winning one
  • We expect that during a "Dialogue act" the temporal density of "Monologue words" is lower than that of "Dialogue words"
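A minimal sketch of that de-noising step, assuming a simple majority vote inside a sliding window (the window size below is an assumption; the slides do not give one):

```python
# Sketch: smooth the per-word class labels with a moving-window majority vote.
from collections import Counter

def smooth_labels(labels, window=21):
    """Return the majority class inside a sliding window centred on each word."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        segment = labels[max(0, i - half):i + half + 1]
        smoothed.append(Counter(segment).most_common(1)[0][0])
    return smoothed

# smooth_labels(["Dialogue", "Dialogue", "Monologue", "Dialogue"]) -> all "Dialogue"
```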

  11. Initial results
  • We evaluated(*) the proposed system using two different classification criteria, including the correct classification percentage
  • The results achieved are very close, but mutual information seems to be more efficient than the simple 3-gram language model
  (*) Using 13 meetings to construct the Monologue & Dialogue lexical models and the remaining 17 to evaluate performance

  12. Integration (1)
  • The next step is to integrate these results into the previously adopted framework; therefore we assume that:
    • The lexical classifier output can be seen as a new independent feature, and combined with speaker turns and prosodic features (analysing a further communicative modality)
    • The developed models could be easily adapted (and, if necessary, re-engineered) to support the newly introduced features, thanks to the flexibility of Dynamic Bayesian Networks

  13. Problems (2)
  • The new lexical feature lives on a new, different time-scale:
    • Different from both the speaker turn and prosodic feature time-scales
    • And different from the time-scale of the events that we'd like to recognise
  • Meeting events usually appear across different modalities (i.e. turn-taking, prosody and the lexical environment of words) without precise synchrony
    • Speech and gestures, for example
    • Some features are asynchronous because they are calculated for each participant
    • Others derive from the interaction of different participants (speaker turns)
  • A true multi-time-scale model is probably more compact and more efficient!?
  • The model must provide at least a minimum degree of asynchrony (one possible frame-level alignment is sketched below)
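One simple way to put the word-level lexical stream on the same footing as the frame-level features (an illustrative assumption, not the authors' solution) is to project each word's label onto the frame grid of the other features:

```python
# Sketch: resample word-level lexical labels onto a fixed frame rate.
def words_to_frames(word_spans, n_frames, frame_step=0.1, default="None"):
    """word_spans: (start_sec, end_sec, label) triples; returns one lexical label per frame."""
    frame_labels = [default] * n_frames
    for start, end, label in word_spans:
        first = int(start / frame_step)
        last = min(n_frames, int(end / frame_step) + 1)
        for i in range(first, last):
            frame_labels[i] = label
    return frame_labels

# words_to_frames([(0.0, 0.4, "Monologue")], n_frames=10) -> first 5 frames labelled "Monologue"
```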

  14. Integration (2)
  • Soon (hopefully!) the new feature will be integrated into the multi-stream model
  • The model will be adapted in order to support multiple time-scales, at least at the feature level
  • The lack of synchrony between features will be investigated, to verify to what extent the proposed models are able to manage it
  [Slide figure: multi-stream DBN unrolled over time, with action nodes A_t, sub-action chains S_t^1, S_t^2 and observation streams Y_t^1, Y_t^2, Y_t^3]

  15. Summary
  • Open problems:
    • Choose the best way to process lexical data
    • Integration into the existing framework
    • More data (transcriptions) are needed
    • Multi-modal = multi-time-scale
    • Synchrony
  Suggestions?
