
Presentation Transcript


  1. DQA meeting: 17.07.2007
  Learning more effective dialogue strategies using limited dialogue move features
  Matthew Frampton & Oliver Lemon, Coling/ACL-2006
  Presented by: Mark Hepple

  2. Data-driven methodology for SDS Development
  • Broader context = realising a “data-driven” methodology for creating SDSs, with the following steps:
  • 1. Collect data (using prototype or WOZ)
  • 2. Build probabilistic user simulation from data (covering user behaviour, ASR errors)
  • 3. [Feature selection - using USimulation]
  • 4. Learn dialog strategy, using reinforcement learning over system interactions with simulation
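To make the four-step methodology concrete, here is a minimal, hypothetical Python skeleton of the pipeline; the function names (collect_dialogues, build_user_simulation, select_features, learn_policy) and their return values are illustrative placeholders, not components named in the paper.

```python
# Hypothetical skeleton of the data-driven SDS pipeline sketched on slide 2.
# All names and return values are placeholders, not the paper's code.

def collect_dialogues():
    """Step 1: collect dialogue data with a prototype system or a WOZ setup."""
    return []  # placeholder: a corpus of annotated dialogues

def build_user_simulation(corpus):
    """Step 2: fit a probabilistic user simulation (user behaviour + ASR errors)."""
    return lambda history: ("provide_info", "dest_city")  # placeholder simulator

def select_features(simulator):
    """Step 3 (optional): choose state features by running the simulator."""
    return ["slot_status", "last_user_da"]

def learn_policy(simulator, features):
    """Step 4: reinforcement learning over interactions with the simulation."""
    return {}  # placeholder: a learned state -> action mapping

if __name__ == "__main__":
    corpus = collect_dialogues()
    simulator = build_user_simulation(corpus)
    features = select_features(simulator)
    policy = learn_policy(simulator, features)
```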

  3. Task
  • Information seeking dialog systems
  • Specifically task-oriented, slot-filling dialogs, leading to database query
  • E.g. getting user requirements for a flight booking (c.f. COMMUNICATOR task)
  • Aim is to achieve effective system strategy for such dialog interactions

  4. Reinforcement Learning
  • System modelled as a Markov Decision Process (MDP)
  • models decision making in situations where outcomes are partly random, partly under system control
  • Reinforcement learning used to learn an effective policy
  • determines the best action to take in each situation
  • Aim is to maximize overall reward
  • need a reward function, assigning a reward value to different dialogs
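The slide does not say which RL algorithm was used, so the following is a generic tabular Q-learning sketch for a dialogue MDP, purely as an illustration; the epsilon-greedy exploration and the learning parameters are assumptions, not details from the paper.

```python
import random
from collections import defaultdict

# Generic tabular Q-learning for a dialogue MDP (illustration only; the paper's
# actual RL algorithm and parameters are not given on this slide).
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # assumed learning rate, discount, exploration rate

q_table = defaultdict(float)             # Q(s, a) estimates, keyed by (state, action)

def choose_action(state, actions):
    """Epsilon-greedy selection: mostly exploit the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_update(state, action, reward, next_state, actions):
    """One-step backup: move Q(s,a) towards reward + gamma * max_a' Q(s',a')."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])
```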

  5. Action set of dialog system
  • 1. Ask open question (“how may I help you?”)
  • 2. Ask value for slot 1..n
  • 3. Explicitly confirm a slot 1..n
  • 4. Ask for slot k, whilst implicitly confirming slot k-1 or k+1
  • 5. Give help
  • 6. Pass to human operator
  • 7. Database query
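One way to write this action set down is as an enumeration; the class and member names below are illustrative, and the per-slot actions (ask/confirm slot 1..n) would in practice be instantiated once per slot.

```python
from enum import Enum, auto

# Slide 5's action set as an enumeration (names are illustrative).
# Per-slot actions (ask / confirm slot 1..n) are shown generically here.
class SystemAction(Enum):
    ASK_OPEN_QUESTION = auto()           # "How may I help you?"
    ASK_SLOT_VALUE = auto()              # ask the value of slot i
    EXPLICIT_CONFIRM_SLOT = auto()       # explicitly confirm slot i
    ASK_WITH_IMPLICIT_CONFIRM = auto()   # ask slot k, implicitly confirming slot k-1 or k+1
    GIVE_HELP = auto()
    PASS_TO_OPERATOR = auto()
    DATABASE_QUERY = auto()
```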

  6. Reward function
  • Reward function is “all-or-nothing”:
  • 1. DB query, all slots confirmed = +100
  • 2. Any other DB query = -75
  • 3. USimulation hangs up = -100
  • 4. System passes to human operator = -50
  • 5. Each system turn = -5
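These numbers transcribe directly into a small function; the event labels (db_query, user_hangs_up, etc.) are made-up identifiers for illustration, but the reward values are the ones on the slide.

```python
# Slide 6's all-or-nothing reward function; event names are illustrative,
# the reward values are taken from the slide.
def reward(event, all_slots_confirmed=False):
    """Return the reward for a single dialogue event."""
    if event == "db_query":
        return 100 if all_slots_confirmed else -75   # fully confirmed query vs. any other query
    if event == "user_hangs_up":
        return -100
    if event == "pass_to_operator":
        return -50
    if event == "system_turn":
        return -5                                    # per-turn penalty favours shorter dialogues
    return 0
```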

  7. N-Gram User Simulation
  • Employs n-gram user simulation of Georgila, Lemon & Henderson:
  • Derived from annotated version of COMMUNICATOR data
  • Treat dialog as sequence of pairs of DAs/tasks
  • Output next user “utterance” as DA/task pair, based on last n-1 pairs
  • Incorporate effects of ASR errors (built from user utterances as recognised by ASR components of original COMMUNICATOR systems)
  • Have separate 4- and 5-gram simulations: used for training/testing
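A minimal sketch of how such an n-gram user simulation could work: the next user DA/task pair is sampled from counts conditioned on the previous n-1 pairs. This is only an approximation in the spirit of the slide, not the Georgila, Lemon & Henderson implementation; the class, its count-based estimation, and the fallback behaviour are assumptions.

```python
import random
from collections import defaultdict

# Sketch of an n-gram user simulation: sample the next user DA/task pair given
# the previous n-1 pairs. An approximation for illustration only.
class NgramUserSimulation:
    def __init__(self, n):
        self.n = n
        # context (tuple of n-1 DA/task pairs) -> {next pair: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, dialogue):
        """Accumulate n-gram counts from one dialogue (a list of DA/task pairs)."""
        for i in range(len(dialogue) - self.n + 1):
            context = tuple(dialogue[i:i + self.n - 1])
            self.counts[context][dialogue[i + self.n - 1]] += 1

    def next_user_turn(self, history):
        """Sample the next user DA/task pair conditioned on the last n-1 pairs."""
        context = tuple(history[-(self.n - 1):])
        options = self.counts.get(context)
        if not options:
            return ("no_answer", "none")   # assumed fallback for unseen contexts
        pairs, weights = zip(*options.items())
        return random.choices(pairs, weights=weights, k=1)[0]
```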

  8. Key Question: what context features to use?
  • Past work has used only limited state information
  • Based on number/fill-status of slots
  • Proposal: include richer context information, specifically:
  • Dialog act (DA) of last system turn
  • DA of last user turn
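The difference between the baseline and the richer state representations can be made explicit with a small state class; the field names below are illustrative, and the paper's actual feature encoding may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative state representations for the comparison on slides 8-9.
# frozen=True makes states hashable, so they can key a Q-table.
@dataclass(frozen=True)
class DialogueState:
    slot_status: Tuple[str, ...]            # baseline: fill/confirm status of each slot
    last_user_da: Optional[str] = None      # added for Strategy 2 and Strategy 3
    last_system_da: Optional[str] = None    # added for Strategy 3 only
```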

  9. Experiments
  • Compare 3 systems:
  • Baseline: slot features only
  • Strategy 2: slot features + last user DA
  • Strategy 3: slot features + last user + system DAs
  • Train with 4-gram, test with 5-gram USim, and vice versa
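In outline, the cross-simulation train/test protocol looks like the following; train_policy, run_dialogue and the episode/dialogue counts are placeholders, not figures from the paper.

```python
# Hypothetical outline of the protocol on slide 9: learn against one n-gram
# simulation, evaluate against the other, and compare average reward.
# All function bodies and counts below are placeholders.
def train_policy(user_sim, features, episodes=100_000):
    """Learn a dialogue policy by RL against the given user simulation."""
    return lambda state: "ask_slot_1"        # placeholder policy

def run_dialogue(policy, user_sim):
    """Simulate one dialogue and return its total reward."""
    return 0.0                               # placeholder

def average_reward(policy, user_sim, n_dialogues=1000):
    """Average per-dialogue reward of a policy against a test simulation."""
    return sum(run_dialogue(policy, user_sim) for _ in range(n_dialogues)) / n_dialogues
```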

  10. Results
  • Main reported result is improvement in average reward level of dialogs for the augmented strategies, compared to baseline
  • Strategy 2 improves over baseline by 4.9%
  • Strategy 3 improves over baseline by 7.8%
  • All 3 strategies achieve 100% slot filling and confirmation
  • Augmented strategies also improve over baseline w.r.t. average dialogue length

  11. Qualitative Analysis
  • Learns to:
  • Only query DB when all slots filled
  • Not pass to operator
  • Use implicit confirmation where possible
  • Emergent behaviour:
  • When baseline system fails to fill/confirm slot from user input, state remains the same, and system will repeat the same action
  • For augmented systems, ‘state’ changes, so it can learn to do a different action, e.g. ask about a different slot, or use the “give help” action

  12. Questions/Comments
  • Value of performance improvement figures based on reward?
  • Does improvement w.r.t. the reward function translate into improvement for human-machine dialogs?
  • Validity of comparisons to COMMUNICATOR systems?
  • Why does system performance improve?
  • Is avoidance of repetition the key?
