
Sunao Hara, Norihide Kitaoka, Kazuya Takeda {naoh, kitaoka, kazuya.takeda}@nagoya-u.jp

LREC2010: O3 - Dialogue and Evaluation. Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System. Sunao Hara, Norihide Kitaoka, Kazuya Takeda {naoh, kitaoka, kazuya.takeda}@nagoya-u.jp.


Presentation Transcript


  1. LREC2010: O3 - Dialogue and Evaluation. Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System. Sunao Hara, Norihide Kitaoka, Kazuya Takeda {naoh, kitaoka, kazuya.takeda}@nagoya-u.jp. Graduate School of Information Science, Nagoya University, Japan

  2. Outline: Introduction • MusicNavi2 database • N-gram modeling • Estimation experiment • Conclusion. Introduction • The aim of this study: construct an estimation model of user satisfaction for spoken dialog systems (SDSs) based on data from real PC environments • Experiment: a field experiment using an SDS for a music retrieval application; construct and evaluate an estimation model for user satisfaction using an N-gram history model LREC2010: Sunao HARA et al., Nagoya Univ., Japan.

  3. Background (1/2) • Use of speech input applications (e.g. Skype) by PC users is spreading • More users may use spoken dialog systems (SDSs) via the Internet • The acoustic properties of PC environments differ among users (e.g. microphones, noise conditions) • From a practical application standpoint, evaluation and prediction of system performance (user satisfaction) are also important issues • Approach: collect speech under realistic PC environments and build an estimation model for user satisfaction

  4. Background (2/2) • Evaluation using automatically measured metrics: tune the system parameters at the design stage; select the best dialog strategy for SDS applications; PARADISE framework [Walker, et al. 1997] • Detection of problematic dialogs for call center Interactive Voice Response (IVR) systems: detect as early as possible that the conversation will break down; problematic dialog predictor using the SLU-success feature (SLU = Spoken Language Understanding) [Walker, et al. 2002]; N-gram-based call quality monitoring system [Kim 2007] • Research question: can we estimate the user satisfaction of an SDS by modeling the dialog context?

  5. MusicNavi2 database • Field experiment using a music retrieval system with a spoken dialog interface: 1. download the system through the Internet, 2. use it for a certain period, 3. fill in questionnaires on the web page • Music retrieval system MusicNavi2: "music retrieval application" + "spoken dialog interface" • The spoken dialog interface retrieves and plays songs stored on the user's PC • Speech data can be collected in cooperation with a server program via the Internet

  6. Example of a dialog (figure; U = User, S = System)

  7. Data collection by the field test • Large-scale field test through the Internet • Subjects used MusicNavi2 on their own PCs • Participants: 1369 subjects • Total usage: 488 hours • User's task: to listen to at least five songs, and to perform at least twenty Q&A dialogs or to use the system for over forty minutes • Questionnaire (answered only by "task complete" users): satisfaction level for the SDS from 1 to 5

  8. Distributions of the experimental subjects and the equipment they used • Subjects who answered questionnaires: 449 subjects (278 males and 171 females) • 34296 utterances in total • Figures: microphone types; loudspeaker / headphone types

  9. Overview of the MusicNavi2 database (figures: word error rate; utterances per song played; number of utterances)

  10. Pre-analysis of the MusicNavi2 database • Classification of users by their satisfaction level: "task complete" users, c = 1, 2, 3, 4, 5; "task incomplete" users, c = ϕ • Summary of data: 518 subjects in total

  11. Modeling method for the dialog context • The dialog management of an SDS is designed by a dialog developer • The management is not always satisfactory for users • Assume that satisfaction appears in the dialog context • Statistically learn the naturalness of the dialog • Use N-grams to model the dialog context: construct a model for each class of users, then estimate an unknown user's satisfaction based on the likelihood under each N-gram model

  12. Spoken dialog logs to dialog act symbols • The vocabulary size of the recognition dictionary (that is, the number of songs) differs between users • Word-level information is informative, but it is too sparse to handle statistically • Use dialog act symbols for the users' and the system's acts • Defined 21 system dialog acts and 19 user dialog acts
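
The encoding step on this slide can be sketched as follows. This is a minimal illustration, not the authors' encoder: the slides do not list the 21 system and 19 user dialog acts, so the act names, the `encode_dialog` helper, and the log format are all hypothetical. Only the three input sequence types (USR, SYS, SYSUSR) come from the slides.

```python
from typing import List, Tuple

def encode_dialog(log: List[Tuple[str, str]], mode: str = "SYSUSR") -> List[str]:
    """Turn (speaker, act) log entries into a time-ordered symbol sequence.

    speaker is "U" (user) or "S" (system); act names are hypothetical.
    mode selects which acts are kept: "USR", "SYS", or "SYSUSR" (both).
    """
    keep = {"USR": {"U"}, "SYS": {"S"}, "SYSUSR": {"U", "S"}}[mode]
    # The log is assumed to be in time order already, so filtering preserves order.
    return [f"{speaker}_{act}" for speaker, act in log if speaker in keep]

# Hypothetical log of one short dialog.
log = [
    ("S", "greeting"),
    ("U", "request_song"),
    ("S", "confirm"),
    ("U", "yes"),
    ("S", "play"),
]

user_only = encode_dialog(log, "USR")
both = encode_dialog(log, "SYSUSR")
```

Replacing words with a small, fixed symbol inventory is what makes the sequences comparable across users whose song vocabularies differ.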

  13. Example of an encoded dialog (figure; U = User, S = System)

  14. Modeling the dialog act sequence by N-gram • A dialog act sequence: the dialog act symbols arranged in time order t • N-gram probability (= likelihood) of the sequence given a model for a user class c
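
A minimal sketch of the per-class likelihood computation, under stated assumptions: the slides use Witten-Bell smoothing via the SRILM toolkit, whereas this example substitutes add-one smoothing to stay short and self-contained. The `NGramModel` class, the padding symbols, and the toy dialog act names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence

class NGramModel:
    """N-gram model over dialog act symbols with add-one smoothing
    (a stand-in for the Witten-Bell smoothing named in the slides)."""

    def __init__(self, n: int, vocab: Iterable[str]):
        self.n = n
        # +1 because the end-of-sequence marker is also a predicted symbol.
        self.vocab_size = len(set(vocab)) + 1
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()

    def _pad(self, seq: Sequence[str]) -> List[str]:
        return ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]

    def fit(self, sequences: Iterable[Sequence[str]]) -> "NGramModel":
        for seq in sequences:
            padded = self._pad(seq)
            for t in range(self.n - 1, len(padded)):
                context = tuple(padded[t - self.n + 1:t])
                self.ngrams[context + (padded[t],)] += 1
                self.contexts[context] += 1
        return self

    def log_likelihood(self, seq: Sequence[str]) -> float:
        """Sum of log P(x_t | previous n-1 symbols) over the sequence."""
        padded = self._pad(seq)
        total = 0.0
        for t in range(self.n - 1, len(padded)):
            context = tuple(padded[t - self.n + 1:t])
            num = self.ngrams[context + (padded[t],)] + 1
            den = self.contexts[context] + self.vocab_size
            total += math.log(num / den)
        return total

vocab = ["S_ask", "U_req", "S_play", "S_reject"]
satisfied = NGramModel(2, vocab).fit([["S_ask", "U_req", "S_play"]] * 3)
unsatisfied = NGramModel(2, vocab).fit(
    [["S_ask", "U_req", "S_reject", "U_req", "S_reject"]] * 3
)

x = ["S_ask", "U_req", "S_play"]
ratio = satisfied.log_likelihood(x) - unsatisfied.log_likelihood(x)
```

One model is trained per user class, and an unknown user's sequence is scored under each; the difference of log-likelihoods is the log-likelihood ratio used for detection.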

  15. Estimation experiment • Detection of the user's class using N-gram models • Experimental conditions: N-gram: 1-gram, 2-gram, …, 8-gram; Witten-Bell smoothing (using the SRILM toolkit); input sequence: USR (user dialog acts only), SYS (system dialog acts only), SYSUSR (both); leave-one-out cross-validation • Exp.1: detection of "task incomplete" users • Exp.2: detection of "unsatisfied" users
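
The leave-one-out protocol can be sketched generically as below. The `leave_one_out` function and the toy `toy_train` / `toy_score` stand-ins are hypothetical; in the experiment described on the slides, training would build the N-gram models and scoring would compute a likelihood-based score for the held-out user.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
User = Tuple[Sequence[str], str]  # (dialog act sequence, class label)

def leave_one_out(
    users: List[User],
    train: Callable[[List[User]], Model],
    score: Callable[[Model, Sequence[str]], float],
) -> List[Tuple[float, str]]:
    """Hold out each user in turn, train on the rest, score the held-out user."""
    results = []
    for i, (sequence, label) in enumerate(users):
        training_set = users[:i] + users[i + 1:]  # everyone except user i
        model = train(training_set)
        results.append((score(model, sequence), label))
    return results

# Toy stand-ins: the "model" is the set of symbols seen in incomplete-task
# users, and the score is the fraction of the held-out sequence it covers.
def toy_train(training_set: List[User]) -> set:
    return {sym for seq, label in training_set if label == "incomplete" for sym in seq}

def toy_score(model: set, sequence: Sequence[str]) -> float:
    return sum(sym in model for sym in sequence) / len(sequence)

users = [
    (["S_ask", "U_req", "S_play"], "complete"),
    (["S_ask", "U_req", "S_reject"], "incomplete"),
    (["S_ask", "S_reject", "S_reject"], "incomplete"),
]
results = leave_one_out(users, toy_train, toy_score)
```

Because the held-out user never contributes to the model that scores them, every subject can serve as test data without leaking into training.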

  16. Estimation experiment • Detection method: model selection by thresholding the likelihood ratio • Evaluation metrics: ROC curve; area under the ROC curve (AUC) (figure: ROC curve, true detection rate against false detection rate, both axes from 0 to 1)
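
As a rough illustration of the evaluation metrics, the sketch below sweeps a threshold over per-user scores (for example log-likelihood ratios) to obtain ROC points and integrates them with the trapezoidal rule. The function names and the toy scores are hypothetical, and the slides do not state how the authors computed AUC.

```python
from typing import List, Sequence, Tuple

def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false detection rate, true detection rate) points.

    labels: 1 for the class to detect, 0 otherwise. A user is detected when
    their score exceeds the threshold. Requires at least one user of each label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Users with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

perfect = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.4, 0.8, 0.3], [1, 0, 0, 1]))
```

An AUC of 1.0 means some threshold separates the two groups perfectly, while 0.5 corresponds to chance-level ranking.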

  17. AUC (area under the ROC curve) for "task incomplete" users and "unsatisfied" users • The results suggest that using both system and user dialog acts is effective • Detection of "task incomplete" users performs well when the system dialog acts are used

  18. Detection result of "task incomplete" users • SYSUSR 4-gram achieved a 100% true detection rate with a 6% false detection rate

  19. Detection result of "unsatisfied" users • SYSUSR: the larger the N of the N-gram, the lower the false detection rate

  20. Conclusion • Estimation method of user satisfaction using an N-gram-based dialog history model for SDSs • Constructed a database from real PC environments • Achieved high performance in the detection of "task incomplete" users: 100% true detection rate at a 6% false detection rate • Performance in the detection of "unsatisfied" users was not sufficient • Higher-order N-gram models were effective compared with the 1-gram model • Using both system and user dialog acts was effective • Future work: N-gram model-based estimation of dialog failure (online detection); analysis of how the dialog context affects user satisfaction; an integrated method using acoustic features, prosodic features, dialog features, etc.

  21. Thanks for your kind attention!


  23. Modeling the dialog act sequence by N-gram • Dialog logs are encoded to dialog act symbols automatically • A dialog act sequence x: the dialog act symbols arranged in time order t • N-gram probability (= likelihood) of the sequence given a model with a satisfaction level s
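
The equations of this slide are not preserved in the transcript. Under the standard N-gram assumption, and writing the sequence as x = (x_1, …, x_T) (the index notation and T are assumptions, not taken from the slide), the likelihood given the model for satisfaction level s would take the following form:

```latex
\[
P(x \mid s) \;=\; \prod_{t=1}^{T} P\bigl(x_t \mid x_{t-N+1}, \ldots, x_{t-1},\, s\bigr)
\]
```

Each factor conditions a dialog act only on the N-1 acts that precede it, which is what lets the model be estimated from counts of short symbol runs.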

  24. Detection by thresholding • Model selection by an a posteriori odds classifier • Introduce the a priori odds 1/α and the Bayes factor B • Finally, the decision reduces to thresholding B with α (α = 1 gives the ML classifier)
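
The derivation on this slide is not preserved in the transcript. A plausible reconstruction, assuming two classes c_1 (the class to detect) and c_0 (all other users), both of which are labels introduced here for illustration, follows from Bayes' rule:

```latex
\[
\frac{P(c_1 \mid x)}{P(c_0 \mid x)}
  \;=\; \underbrace{\frac{P(x \mid c_1)}{P(x \mid c_0)}}_{B}
  \cdot
  \underbrace{\frac{P(c_1)}{P(c_0)}}_{1/\alpha}
\]
\[
\text{decide } c_1 \iff \frac{B}{\alpha} > 1 \iff B > \alpha
\]
```

With α = 1 the prior odds are even, so the rule picks whichever model gives the higher likelihood, which matches the slide's remark that α = 1 means the ML classifier. Sweeping α traces out the ROC curve.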

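  25. 6-class satisfaction estimation experiment • Estimation of the user satisfaction class using N-gram models • Experimental conditions: one subject is held out for evaluation and a model for each satisfaction level is trained on the remaining 517 subjects (leave-one-out); satisfaction s = ϕ (task incomplete), 1 (unsatisfied), 2, 3, 4, 5 (satisfied); N-gram: 1-gram, 2-gram, …, 8-gram • Input sequences: user dialog acts only (USR); system dialog acts only (SYS); both user and system dialog acts (USRSYS) • Evaluation metric: classification accuracy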

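  26. Estimation method for satisfaction (6 classes) • Select the maximum-likelihood model by ML estimation • For a user's input x, compute the likelihood under each satisfaction model • The model with the highest likelihood is the estimation result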
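
Written as a formula, the selection rule described on this slide is the usual maximum-likelihood decision. The symbol \hat{s} for the estimate is introduced here for illustration:

```latex
\[
\hat{s} \;=\; \operatorname*{arg\,max}_{s \,\in\, \{\phi,\,1,\,2,\,3,\,4,\,5\}} P(x \mid s)
\]
```

This treats all six classes as equally likely a priori, so classes with few training subjects are neither favored nor penalized by a prior.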

  27. Detection result for 6 classes of satisfaction • 34.4% with the 3-gram model using only the system sequence

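  28. Confusion matrix • 3-gram of SYS sequence (figure: confusion matrix; axis label: Actual) • Task-incomplete users (ϕ) are identified with relatively high accuracy and few false detections • For satisfied users as well, there are few cases where the estimate differs greatly from the actual level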

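  29. User satisfaction considering the dialog history • The degree of satisfaction a user feels changes as interactions with the system are repeated • "Satisfaction" is surveyed at the end of this sequential change (figure: satisfaction, from unsatisfied to satisfied, against the number of dialog turns, with example trajectories: satisfied with the performance; dissatisfied with the performance; stopped using the system)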


  31. Modeling the N-gram • Dialog logs are encoded to dialog act symbols automatically • User's dialog acts: derived from the speech recognition results; defined in the recognition dictionary • System's dialog acts: derived from the system's responses or acts; the same as the system's internal acts • A dialog act sequence x: the dialog act symbols arranged in time order t • An N-gram model is built for each of the 6 satisfaction classes • Witten-Bell smoothing (using the SRILM toolkit)

  32. Example of a dialog (figure; U = User, S = System)

  33. Outline: Introduction • MusicNavi2 database • N-gram modeling • Estimation experiment • Conclusion
