LREC2010: O3 - Dialogue and Evaluation. Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System. Sunao Hara, Norihide Kitaoka, Kazuya Takeda {naoh, kitaoka, kazuya.takeda}@nagoya-u.jp.
LREC2010: O3 - Dialogue and Evaluation
Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System
Sunao Hara, Norihide Kitaoka, Kazuya Takeda
{naoh, kitaoka, kazuya.takeda}@nagoya-u.jp
Graduate School of Information Science, Nagoya University, Japan
Introduction • MusicNavi2 database • N-gram modeling • Estimation experiment • Conclusion

Introduction
• The aim of this study
  • Construct an estimation model of user satisfaction for spoken dialog systems (SDSs) based on data from real PC environments
• Experiment
  • Field experiment using an SDS for a music retrieval application
  • Construct and evaluate an estimation model for user satisfaction using an N-gram history model
LREC2010: Sunao HARA et al., Nagoya Univ., Japan.
Background (1/2)
• Use of speech input applications (e.g. Skype) by PC users is spreading
  • More users may use spoken dialog systems (SDSs) via the Internet
• The acoustic properties of PC environments differ among users
  • e.g. microphones, noise conditions, etc.
• From a practical application standpoint
  • Evaluation and prediction of system performance (user satisfaction) are also important issues
Collect speech under realistic PC environments
Build an estimation model for user satisfaction
Background (2/2)
• Evaluation using automatically measured metrics
  • Tune the system parameters at the design stage
  • Select the best dialog strategy for SDS applications
  • PARADISE framework [Walker, et al. 1997]
• Detection of problematic dialogs for call center Interactive Voice Response (IVR) systems
  • Detect that “the conversation will break down” as soon as possible
  • Problematic dialog predictor using the SLU-success feature [Walker, et al. 2002] (SLU: Spoken Language Understanding)
  • N-gram-based call quality monitoring system [Kim 2007]
Can we estimate the user satisfaction of an SDS by modeling the dialog context?
MusicNavi2 database
• Field experiment using a music retrieval system with a spoken dialog interface
  1. Download the system through the Internet
  2. Use it for a certain period
  3. Fill in questionnaires on the web page
• Music retrieval system: MusicNavi2
  • “Music retrieval application” + “Spoken dialog interface”
  • A spoken dialog interface for retrieving and playing songs stored on the user’s PC
  • Can collect speech data in cooperation with a server program via the Internet
Example of a dialog (U = User, S = System)
Data collection by the field test
• Large-scale field test through the Internet
  • Subjects used MusicNavi2 on their own PCs
  • Participants: 1369 subjects
  • Total usage time: 488 hours
• User’s task
  • To listen to at least five songs
  • To perform at least twenty Q&A dialogs, or to use the system for over forty minutes
• Questionnaire (answered only by “task complete” users)
  • Satisfaction level for the SDS, from 1 to 5
Distributions of the experimental subjects and the equipment they used
• Subjects who answered questionnaires
  • 449 subjects (278 males and 171 females)
  • 34296 utterances in total
[Charts: microphone type; loudspeaker / headphone type]
Overview of the MusicNavi2 database
[Charts: word error rate; utterances per song played; number of utterances]
Pre-analysis of the MusicNavi2 database
• Classification of users by their satisfaction level
  • “task complete” users: c = 1, 2, 3, 4, 5
  • “task incomplete” users: c = ϕ
• Summary of data
  • 518 subjects in total
Modeling method for the dialog context
• The dialog management of an SDS is designed by a dialog developer
  • The management is not always satisfactory for users
• Assume that satisfaction appears in the dialog context
  • Statistically learn the naturalness of the dialog
• Use N-grams to model the dialog context
  • Construct a model for each class of users
  • Estimate an unknown user’s satisfaction based on the likelihood under each N-gram model
Spoken dialog logs to dialog act symbols
• Vocabulary size of the recognition dictionary
  • That is, the number of songs
  • It differs between users
• Word-level information is informative, but it is too sparse to handle statistically
• Use dialog act symbols for the users’ and the system’s acts
  • Defined 21 system dialog acts and 19 user dialog acts
Example of an encoded dialog (U = User, S = System)
Modeling the dialog act sequence by N-gram
• A dialog act sequence: the dialog act symbols arranged in time order t
• N-gram probability (= likelihood) of the sequence, given the model for a user class c
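The per-class N-gram modeling on this slide can be sketched roughly as below. This is only an illustration under stated assumptions: it uses simple add-k smoothing rather than the Witten-Bell smoothing (SRILM) used in the experiments, and the dialog act symbols (`S_ASK`, `U_REQUEST`, `S_PLAY`, `U_REJECT`) and the toy sequences are hypothetical, not taken from the MusicNavi2 database.

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Yield (history, symbol) pairs, padding the start of the sequence with <s>."""
    padded = ["<s>"] * (n - 1) + list(seq)
    for i in range(n - 1, len(padded)):
        yield tuple(padded[i - n + 1:i]), padded[i]

class NgramModel:
    """N-gram model over dialog act symbols.

    Assumption: add-k smoothing stands in for Witten-Bell smoothing.
    """
    def __init__(self, n, vocab, k=0.5):
        self.n, self.vocab, self.k = n, set(vocab), k
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def fit(self, sequences):
        for seq in sequences:
            for hist, sym in ngrams(seq, self.n):
                self.context_counts[hist] += 1
                self.ngram_counts[hist + (sym,)] += 1
        return self

    def log_likelihood(self, seq):
        """Log-probability of a dialog act sequence under this class model."""
        total = 0.0
        for hist, sym in ngrams(seq, self.n):
            num = self.ngram_counts[hist + (sym,)] + self.k
            den = self.context_counts[hist] + self.k * len(self.vocab)
            total += math.log(num / den)
        return total

# Hypothetical dialog act symbols and toy training data for two user classes.
VOCAB = ["S_ASK", "U_REQUEST", "S_PLAY", "U_REJECT"]
complete = [["S_ASK", "U_REQUEST", "S_PLAY"],
            ["S_ASK", "U_REQUEST", "S_PLAY", "U_REQUEST", "S_PLAY"]]
incomplete = [["S_ASK", "U_REQUEST", "S_ASK", "U_REJECT"],
              ["S_ASK", "U_REJECT", "S_ASK", "U_REJECT"]]

m_complete = NgramModel(2, VOCAB).fit(complete)
m_incomplete = NgramModel(2, VOCAB).fit(incomplete)

unknown = ["S_ASK", "U_REJECT", "S_ASK", "U_REJECT"]
# A positive log-likelihood ratio favours the "task incomplete" model.
llr = m_incomplete.log_likelihood(unknown) - m_complete.log_likelihood(unknown)
```

One model is trained per user class, and an unknown user's sequence is scored under each model; the difference of log-likelihoods is the quantity that a detector would threshold.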
Estimation experiment
• Detection of the user’s class using N-gram models
• Experimental conditions
  • N-gram: 1-gram, 2-gram, …, 8-gram
  • Witten-Bell smoothing (using the SRILM toolkit)
  • Input sequence: USR (user dialog acts only), SYS (system dialog acts only), SYSUSR (both)
  • Leave-one-out cross validation
• Exp.1: detection of “task incomplete” users
• Exp.2: detection of “unsatisfied” users
Estimation experiment
• Detection method
  • Model selection by thresholding the likelihood ratio
• Evaluation metrics
  • ROC curve
  • Area under the ROC curve (AUC)
[Figure: ROC curve; horizontal axis: false detection rate (0 to 1), vertical axis: true detection rate (0 to 1)]
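As a rough illustration of these evaluation metrics, the sketch below sweeps a threshold over per-user scores (for example, log-likelihood ratios) to obtain ROC points and integrates them with the trapezoidal rule. The function names, scores and labels are hypothetical and only serve the example.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores.

    labels: 1 for the target class (e.g. "task incomplete"), 0 otherwise.
    Returns a list of (false detection rate, true detection rate) points.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for five users; label 1 marks the class to detect.
scores = [2.1, 1.3, 0.4, -0.2, -1.5]
labels = [1, 1, 0, 1, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

An AUC of 1 corresponds to perfect separation of the two classes, and 0.5 to chance-level detection.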
AUC (Area under the ROC curve)
• “task incomplete” users
• “unsatisfied” users
The results suggest that using both system and user dialog acts is effective
Detection performance for “task incomplete” users is high when the system dialog acts are used
Detection result of “task incomplete” users
• SYSUSR 4-gram achieved a 100% true detection rate with a 6% false detection rate
Detection result of “unsatisfied” users
• SYSUSR
The larger the N of the N-gram, the lower the false detection rate
Conclusion
• Estimation method of user satisfaction using an N-gram-based dialog history model for SDSs
  • Constructed a database collected in real PC environments
• Achieved high performance in the detection of “task incomplete” users
  • 100% true detection rate at a 6% false detection rate
• Performance in the detection of “unsatisfied” users was not sufficient
  • Higher-order N-gram models were more effective than the 1-gram model
  • Using both system and user dialog acts was effective
• Future work
  • N-gram model-based estimation of dialog failure (online detection)
  • Analysis of the dialog contexts that affect user satisfaction
  • Integrated method using acoustic features, prosodic features, dialog features, etc.
Thanks for your kind attention!
Modeling the dialog act sequence by N-gram
• Dialog logs are encoded into dialog act symbols automatically
• A dialog act sequence: x
  • The dialog act symbols arranged in time order t
• N-gram probability (= likelihood) of x, given the model for a satisfaction level s
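In the standard N-gram formulation, which this slide appears to follow, the likelihood of a dialog act sequence can be written as below. The sequence length T and the per-symbol notation x_t are assumptions introduced here for illustration; x, t, N and s are the slide's own symbols.

```latex
x = (x_1, x_2, \ldots, x_T), \qquad
P(x \mid s) = \prod_{t=1}^{T} P\left(x_t \mid x_{t-N+1}, \ldots, x_{t-1},\, s\right)
```

Each factor is the smoothed N-gram probability of the dialog act at time t given the preceding N-1 dialog acts, estimated from the users whose satisfaction level is s.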
Detection by thresholding
• Model selection by an a posteriori odds classifier
• Introduce the a priori odds 1/α and the Bayes factor B
• Finally, the decision reduces to thresholding B with α
* α = 1 corresponds to the maximum-likelihood (ML) classifier
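One way to write out this decision rule, consistent with the quantities named on the slide, is sketched below. The class labels c_1 (the class to detect) and c_0 (the other class) are assumptions introduced for illustration.

```latex
\frac{P(c_1 \mid x)}{P(c_0 \mid x)}
  = \underbrace{\frac{P(c_1)}{P(c_0)}}_{\text{a priori odds } 1/\alpha}
    \cdot
    \underbrace{\frac{P(x \mid c_1)}{P(x \mid c_0)}}_{\text{Bayes factor } B}
  = \frac{B}{\alpha},
\qquad
\text{decide } c_1 \text{ if } \frac{B}{\alpha} > 1 \iff B > \alpha
```

With α = 1 the a priori odds are even, so the rule picks whichever model gives the higher likelihood, which is the ML classifier; sweeping α traces out the ROC curve.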
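6-class satisfaction estimation experiment
• Estimation of the user satisfaction class using N-gram models
• Experimental conditions
  • Train a model for each satisfaction level on the remaining 517 subjects, leaving out the one test subject (leave-one-out)
  • Satisfaction level s = ϕ (task incomplete), 1 (unsatisfied), 2, 3, 4, 5 (satisfied)
  • N-gram: 1-gram, 2-gram, …, 8-gram
• Input sequence
  • User dialog acts only (USR)
  • System dialog acts only (SYS)
  • Both user and system dialog acts (USRSYS)
• Evaluation criterion
  • Classification accuracy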
Estimation method for satisfaction (6 classes)
• Select the maximum-likelihood model
  • For a user’s input x, compute the likelihood under each satisfaction model
  • The model with the highest likelihood gives the estimation result
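The selection step can be sketched as an argmax over the per-class log-likelihoods, followed by an accuracy computation. The log-likelihood values below are hypothetical placeholders, not results from the experiment, and `"phi"` is an assumed label for the task-incomplete class ϕ.

```python
# "phi" stands for the task-incomplete class; 1 = unsatisfied, 5 = satisfied.
CLASSES = ["phi", "1", "2", "3", "4", "5"]

def ml_class(log_likelihoods):
    """Return the class whose model gives the highest log-likelihood."""
    return max(CLASSES, key=lambda c: log_likelihoods[c])

def accuracy(predicted, actual):
    """Fraction of users whose estimated class equals the actual class."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical values of log P(x | s) for two held-out users.
user_a = {"phi": -41.2, "1": -55.0, "2": -52.3, "3": -50.1, "4": -49.8, "5": -53.4}
user_b = {"phi": -80.5, "1": -71.9, "2": -70.2, "3": -66.7, "4": -67.0, "5": -69.3}

predicted = [ml_class(user_a), ml_class(user_b)]
```

In a leave-one-out setting, the six models would be retrained for every held-out user before computing that user's log-likelihoods.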
Detection result for 6 classes of satisfaction
• Accuracy of 34.4% with the 3-gram model using only the system sequence
Confusion matrix
• 3-gram of the SYS sequence
• Task-incomplete users (ϕ) are identified with relatively high accuracy and few false detections
• For satisfied users as well, there are few cases in which the estimate differs greatly from the actual level
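User satisfaction considering the dialog history
• The degree of satisfaction a user feels changes as dialogs with the system are repeated
• “Satisfaction” is surveyed at the end of this sequence of changes
[Figure: satisfaction (unsatisfied to satisfied) against the number of dialog turns; trajectories labeled “satisfied with performance”, “unsatisfied with performance” and “stopped using the system”]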
Modeling the N-gram
• Dialog logs are encoded into dialog act symbols automatically
  • User’s dialog acts
    • Use the speech recognition results
    • They are defined in the recognition dictionary
  • System’s dialog acts
    • Use the system’s responses or acts
    • They are the same as the system’s internal acts
• A dialog act sequence: x
  • The dialog act symbols arranged in time order t
• Build an N-gram model for each of the 6 satisfaction classes
  • Witten-Bell smoothing, using the SRILM toolkit