
Shaping in Speech Graffiti: results from the initial user study

Shaping in Speech Graffiti: results from the initial user study. Stefanie Tomko, Dialogs on Dialogs meeting, 10 February 2006. Big picture (i.e. thesis statement): A system of shaping and adaptivity can be used to induce more efficient user interactions with spoken dialog systems.


Presentation Transcript


  1. Shaping in Speech Graffiti: results from the initial user study Stefanie Tomko Dialogs on Dialogs meeting 10 February 2006

  2. Big picture (i.e. thesis statement) • A system of shaping and adaptivity can be used to induce more efficient user interactions with spoken dialog systems. • This strategy can increase efficiency by increasing the amount of user input that is actually understood by the system, leading to increased task completion rates and higher user satisfaction. • This strategy can also reduce upfront training time, thus accelerating the process of reaching optimally efficient interaction.

  3. This study [Flowchart; layout not recoverable from the transcript. Nodes: User input; Speech Graffiti? (target); shapeable? (expanded); result; shaping prompt; {confsig}. Branches are labeled yes/no.]

  4. My approach, graphically [Flowchart; layout not recoverable from the transcript. Nodes: User input; Speech Graffiti? (target); shapeable? (expanded); result; shaping prompt; intelligent shaping help. Branches are labeled yes/no.]

  5. Speech Graffiti • Standardized framework of syntax, keywords, and principles • Domain-specific vocabulary • Sample interaction (user inputs alternating with system responses): Theater is Showcase North Theater Showcase Cinemas Pittsburgh North Genre is drama Drama What movies are playing? {confsig} [an error beep, since previous utterance is not in grammar] WHERE WAS I? Theater is Showcase Cinemas Pittsburgh North, genre is drama OPTIONS You can specify or ask about title, show time, rating, {ellsig} [a 3-beep list continuation signal] What is title? 2 matches: Dark Water, War of the Worlds START OVER Starting over Theater is Northway Mall Cinemas Eight Northway Mall Cinemas 8 What is address? 1 match: 8000 McKnight Road in Pittsburgh

  6. Expanded grammar • Exploit the fact that users who know they are speaking to a limited-language system restrict their input • Create a grammar that accepts more natural-language input than SG • This grammar is opaque to users • Why have two grammars? • Lower-perplexity LMs → lower error rates • Some applications may be SG-only • Restriction: linear mapping from EXP (expanded) input to its TGT (target) equivalent

  7. Shaping strategy • Handle user input accepted by expanded grammar but not target • Balance current task success with future interaction efficiency • Baseline strategy – this study: • Confirm expanded grammar input with full, explicit slot+value confirmation • Give result if appropriate for query
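A minimal sketch of the baseline flow described in slides 6 and 7, under stated assumptions: the slot names, the regular expressions standing in for the target grammar, and the `EXPANDED_TO_TARGET` table are all hypothetical illustrations, not the grammars used in the study. Input in the target grammar yields a result, input accepted only by the expanded grammar yields a shaping prompt built from its target equivalent, and anything else yields the {confsig} error signal.

```python
import re

# Hypothetical slots; the real Speech Graffiti vocabulary is domain-specific.
SLOTS = ("theater", "genre", "title", "address")
SLOT_ALTERNATION = "|".join(SLOTS)

# Stand-ins for the target (Speech Graffiti) grammar:
# "<slot> is <value>" specifications and "what is <slot>" queries.
TARGET_SPEC = re.compile(r"^(%s) is (.+)$" % SLOT_ALTERNATION)
TARGET_QUERY = re.compile(r"^what is (%s)$" % SLOT_ALTERNATION)

# Hypothetical expanded-grammar phrasings, each mapped to a target equivalent.
EXPANDED_TO_TARGET = {
    "what movies are playing": "what is title",
    "drama movies": "genre is drama",
}


def normalize(utterance):
    """Lowercase and drop surrounding spaces and sentence punctuation."""
    return utterance.lower().strip(" ?.!")


def respond(utterance):
    """Return (action, target_form) for one user utterance."""
    text = normalize(utterance)
    if TARGET_SPEC.match(text) or TARGET_QUERY.match(text):
        return ("result", text)             # already in the target grammar
    target = EXPANDED_TO_TARGET.get(text)
    if target is not None:
        return ("shaping_prompt", target)   # confirm using the target form
    return ("confsig", None)                # in neither grammar: error signal
```

For example, `respond("What movies are playing?")` returns `("shaping_prompt", "what is title")`, which a dialog manager could speak back as the explicit confirmation before giving the result.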

  8. Study participants • “Normal” adults, i.e. not CMU students • 15 males, 14 females, aged 23-54 • Native speakers of American English • Little/no computer programming experience • New to Speech Graffiti

  9. Study design • Between-subjects • 3 conditions • non-shaping+tutorial (BT) • shaping+tutorial (ST) • shaping+no_tutorial (SN) • Tutorial • 9-slide .ppt presentation • 5 minutes

  10. Study tasks • 15 tasks • 4 difficulty levels • # of slots to be specified/queried • 40 minutes or when all tasks completed • Only one user did not get to attempt all 15 tasks in 40 minutes • Afterwards: SASSI questionnaire

  11. Results • In short, the baseline shaping strategy didn’t have an effect • Efficiency • Mean results from shaping subjects are only slightly better (not significant)

  12. User satisfaction • Again, no significant differences • No differences on individual SASSI factors • No efficiency/satisfaction differences between tutorial/non-tutorial, either

  13. Grammaticality • How often did users speak within the target (TGT) SG grammar? • From Q1 to Q4, both groups showed significant increases in TGT grammaticality
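One way such a per-period grammaticality figure could be computed is sketched below. It assumes that Q1 to Q4 are quartiles of a user's utterances in session order, which the slide does not state, and the function name and input format are hypothetical.

```python
def grammaticality_by_quartile(in_target_flags):
    """Fraction of in-grammar utterances in each quarter of a session.

    in_target_flags: booleans in utterance order, True when the utterance
    was within the target grammar. Assumes Q1..Q4 are session quartiles.
    """
    n = len(in_target_flags)
    if n < 4:
        raise ValueError("need at least four utterances to form quartiles")
    counts = [0, 0, 0, 0]
    hits = [0, 0, 0, 0]
    for i, flag in enumerate(in_target_flags):
        q = i * 4 // n          # quartile index 0..3 for utterance i
        counts[q] += 1
        hits[q] += bool(flag)
    return [h / c for h, c in zip(hits, counts)]
```

A rising sequence of values across the four positions would correspond to the increase reported on the slide.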

  14. Error rates – WER • For non-shaping: 39.9% • 30.3% for grammatical utterances • 38.3% utterance-level concept error • For shaping: a bit harder to figure, because of 2-pass ASR • Each shaping input generated a TGT hypothesis and an EXP hypothesis • Selection based on AM/LM score and a few simple heuristics

  15. Error rates – WER • Shaping: • For selected hypothesis: 37.3% • All TGT: 40.9% • All EXP: 64.2% • 25.6% utt-level concept error
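For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition; it is not the scoring tool used in the study, and the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this ratio can exceed 100% when the hypothesis is much longer than the reference.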

  16. So – what happened? • Shaping users had success with NL-ish input, and shaping prompts were not strong enough to change behavior.

  17. Biggest problem • Using NL or slot-only query formats • My theory: <slot> is <value> specification format is very structured. • what is <slot> sounds structured to me, but to users it sounds like <just ask a question!> • In new versions, query format will be list <slot> • Users don’t seem to have too much trouble adapting to a structure – but the structure needs to be clear. • Will also shape more explicitly by confirming with “I think you meant, ‘list movies’” • Also for more explicit shaping of specifications

  18. Other problems • Not using start over to clear context • Confusion about semantics of location • Long utterances • Using next instead of more • Pacing • These will be addressed via targeted help messages

  19. Current hang-up • Can we improve WER? • LM improvements? • COTS recognizer? • Dragon: • Using • Results • Issues

  20. A little bit about trying DNS • Dragon NaturallySpeaking 8 • Distribution from Jahanzeb • Set up for dictation – i.e. mic input • So, no telephone models • To compare with Sphinx • Test set of utterances from this study • Rerecorded with head mic (so, read) at 16 kHz • Downsampled to 8 kHz for Sphinx

  21. More Dragon stuff • Two groups • TGT • Sphinx mean 56.4% • Worse than 8k telephone model (?) • Dragon mean 35.9% • Mean diff: Dragon 18.8 pts less (not significant) • EXP • Sphinx mean 68.5% • Dragon mean 45.4% • Mean diff: Dragon 22.3 pts less (significant)

  22. More Dragon stuff • But – Dragon rates are not that different from the original Sphinx WERs • Sphinx WER in this test might be fishy • Setup seems tricky – can I still do 2-pass decoding? • Would need to change to mic setup • Black-box LM stuff • Mysterious adaptation? – not good for user studies! • So, sticking with Sphinx.
