NESPOLE! Project Status, Carnegie Mellon University • Grenoble Meeting, November 15, 2001
Main Accomplishments: Nov-01 • Improved DACPar Analyzer • Improved SR engines • Port to Linux HLT servers • Significant coverage improvements • Formal evaluation • (SPECTRUM Proposal)
The DACPar Analyzer • Parse an utterance for arguments (SOUP) • Segment the utterance into sentences • Extract features from the utterance and the single best parse output • Use a learned classifier to identify the speech act (TiMBL) • Use a learned classifier to identify the concept sequence (TiMBL) • Combine into a full parse
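The data flow of the six steps above can be sketched as follows. This is only an illustration of the ordering: every function name and the `Analysis` record are hypothetical, and the callables stand in for SOUP and the TiMBL classifiers, whose real interfaces are not shown in these slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = Dict[str, object]
Segment = Tuple[List[str], List[str]]  # (words, arguments) for one sentence


@dataclass
class Analysis:
    """Hypothetical record holding the combined result for one sentence."""
    speech_act: str
    concepts: List[str]
    arguments: List[str]


def analyze(
    utterance: str,
    parse_arguments: Callable[[str], List[str]],           # stand-in for the argument parser
    segment: Callable[[str, List[str]], List[Segment]],    # stand-in for the segmenter
    extract_features: Callable[[List[str], List[str]], Features],
    classify_speech_act: Callable[[Features], str],        # stand-in for a learned classifier
    classify_concepts: Callable[[Features], List[str]],    # stand-in for a learned classifier
) -> List[Analysis]:
    # Step 1: parse the whole utterance for arguments.
    parse = parse_arguments(utterance)
    analyses = []
    # Step 2: split the utterance into sentences, each with its own arguments.
    for words, args in segment(utterance, parse):
        # Step 3: features come from the words and the best parse output.
        features = extract_features(words, args)
        # Steps 4-6: classify speech act and concepts, then combine.
        analyses.append(Analysis(
            speech_act=classify_speech_act(features),
            concepts=classify_concepts(features),
            arguments=args,
        ))
    return analyses
```

Passing the components in as callables keeps the sketch independent of any particular parser or classifier.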
Improved DACPar Analyzer • Improved segmentation of utterances into SDUs • Using IF well-formedness constraints to improve overall DA classification • Coverage and training-set improvements
DACPar - Improved Segmentation • Segmenting single turns into DA units (SDUs) - two problems: • under-segmentation: detecting SDU boundaries between parsed arguments • over-segmentation: due to CrossDomain grammar - single SDUs that are incorrectly split • New segment boundary detector implemented based on argument statistical model • CrossDomain grammar tuned to prevent over-segmentation
DACPar - Using IF Constraints • Two goals: • Ensure that resulting DA analysis is a legal IF • Improve classification outcome using the well-formedness constraints from IF spec • Classifier produces ranked list of DAs • Select highest ranking DA that licenses the greatest number of arguments (ideally all)
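The selection rule in the last bullet can be sketched as below. This is a minimal illustration, not the project's implementation: the `IF_SPEC` table, the DA names, and the argument names are made up for the example, and the real IF specification is much richer than a flat mapping from DA to licensed arguments.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for the IF specification: for each domain action (DA),
# the set of argument names it licenses.
IF_SPEC: Dict[str, Set[str]] = {
    "give-information+price+room": {"price", "room-type", "time"},
    "give-information+room": {"room-type"},
    "request-information+price": {"price"},
}


def select_da(ranked_das: List[str], parsed_args: List[str]) -> Tuple[str, int]:
    """Return the DA that licenses the most parsed arguments, with that count.

    `ranked_das` is ordered best-first, as produced by the classifier.
    Ties go to the higher-ranked DA, and the search stops early as soon as
    a DA licenses every argument (the ideal case).
    """
    if not ranked_das:
        raise ValueError("classifier returned no candidates")
    best_da, best_count = ranked_das[0], -1
    for da in ranked_das:
        licensed = IF_SPEC.get(da, set())
        count = sum(1 for arg in parsed_args if arg in licensed)
        if count > best_count:  # strict comparison keeps the higher-ranked DA on ties
            best_da, best_count = da, count
        if count == len(parsed_args):
            break
    return best_da, best_count
```

For example, with the ranking `["request-information+price", "give-information+room", "give-information+price+room"]` and the arguments `["price", "room-type"]`, the third candidate is chosen because it is the only one that licenses both arguments.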
DACPar - Initial Results • SA classification accuracy ~65% • DA classification accuracy ~45% • Eng-to-Eng translation (from transcriptions) 58% (43%) • Eng-to-Eng translation (from SR hypotheses) 45% (32%)
Showcase-1 Formal Evaluation • Data used for evaluation • Evaluation scheme: end-to-end, mono- and cross-lingual, SDU-based, human-grading • Compiling of results • Initial available results • Lessons learned...
Evaluation - Data Used • Goals: • unseen data not used for system development • both scenario-a and scenario-c, some MM data • original mono-lingual data and cross-lingual data collected when using the system • Mixture intended primarily for comprehensiveness, not for comparison of different conditions (stat significance)
Evaluation Methodology • Evaluation scheme: end-to-end, mono- and cross-lingual, SDU-based, human-grading • Evaluate translation from transcriptions and from SR output; also SR WERs (and SR output graded as a paraphrase) • Multiple human graders - should NOT be system developers • One grader segments each turn into SDUs, graders then assign grades for each identified SDU • Cross-lingual eval: • client SDUs from E/G/F --> Italian • agent SDUs from Italian --> E/G/F • Donna’s grading program
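Word error rate is conventionally the word-level edit distance between the reference transcription and the recognizer output, divided by the reference length. A minimal sketch of that standard definition, assuming whitespace tokenization and no text normalization (the slides do not describe the scoring tool actually used):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A common convention reports word accuracy as 1 - WER; whether the accuracy figures in these slides follow that convention is not stated.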
Compiling of Results • Each site should compile its own results! • Calculate separate results for: • each dialogue, each grader, client/agent SDUs • Average/combine results for: • all graders, client+agent, all dialogues combined
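The per-group and combined figures described above can be compiled with a single grouping routine. The sketch below assumes each graded SDU carries one label from a three-way scale (`"perfect"`, `"ok"`, `"bad"`) and that combined figures pool SDUs rather than average percentages; both are assumptions made for illustration, as are all names in the code.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Sequence, Tuple


class Grade(NamedTuple):
    dialogue: str  # e.g. "a1"
    grader: str    # e.g. "G1"
    side: str      # "client" or "agent"
    label: str     # assumed scale: "perfect", "ok" or "bad"


def compile_results(
    grades: Iterable[Grade], keys: Sequence[str]
) -> Dict[Tuple[str, ...], Tuple[int, int]]:
    """Group graded SDUs by `keys`; return (% acceptable, % perfect) per group.

    An empty `keys` pools every SDU into a single group keyed by ().
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [total, acceptable, perfect]
    for g in grades:
        group = tuple(getattr(g, k) for k in keys)
        c = counts[group]
        c[0] += 1
        c[1] += g.label in ("perfect", "ok")
        c[2] += g.label == "perfect"
    return {
        group: (round(100 * acc / total), round(100 * perf / total))
        for group, (total, acc, perf) in counts.items()
    }
```

Calling it with `("dialogue", "grader", "side")` gives the separate results, while `("dialogue",)` or `()` gives the combined ones from the same list of grades.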
Results: SR Performance
English SR Accuracy
Speaker    % Accuracy
------------------------
e025ap     68.6
e039ap     39.5
e011yp     83.1
e827cy     71.0
Average    61.9

German SR Accuracy
Speaker    % Accuracy
------------------------
g006       42.69
g034       66.32
g047       78.67
g051       69.43
Average    63.52
English Evaluation
English Eval Data
a1  = e025ap  ( 46 SDUs)  ( 27 utts)
a2  = e039ap  (123 SDUs)  ( 37 utts)
amm = e011yp  ( 54 SDUs)  ( 39 utts)
cmm = e827cy  (109 SDUs)  ( 48 utts)
ALL = total   (332 SDUs)  (151 utts)
English-to-English HYPO
       G1       G2       G3       ALL     | WA
----------------------------------------------
a1     76(65)   74(61)   65(52)   72(59)  | 68
a2     55(39)   43(32)   50(35)   50(35)  | 39
amm    91(89)   93(85)   91(78)   91(84)  | 84
cmm    71(63)   65(59)   69(56)   68(59)  | 70
----------------------------------------------
ALL    69(59)   63(54)   65(51)   66(56)  | 61
English-to-English
SLT-TCT
       G1       G2       G3       ALL
----------------------------------------
a1     74(70)   76(54)   67(41)   72(55)
a2     62(46)   45(40)   46(32)   51(39)
amm    74(57)   67(54)   61(48)   67(53)
cmm    65(49)   40(31)   51(31)   52(37)
----------------------------------------
ALL    67(52)   51(41)   53(35)   58(43)

SLT-REC
       G1       G2       G3       ALL
----------------------------------------
a1     58(50)   52(33)   43(24)   51(36)
a2     41(27)   29(23)   33(21)   34(23)
amm    69(57)   70(63)   70(41)   70(54)
cmm    50(39)   32(26)   41(21)   41(29)
----------------------------------------
ALL    51(39)   40(32)   43(25)   45(32)
Results: English-to-English
        a1       a2       amm      cmm      ALL
----------------------------------------------
TCT     72(55)   51(39)   67(53)   52(37)   58(43)
REC     51(36)   34(23)   70(54)   41(29)   45(32)
HYPO    72(59)   50(35)   91(84)   68(59)   66(56)
Results: English-to-Italian
English-to-Italian
        a1       a2       amm      cmm      ALL
----------------------------------------------
TCT     77(52)   48(36)   67(45)   59(31)   55(38)
REC     57(39)   29(19)   69(44)   39(24)   43(27)

English-to-English
        a1       a2       amm      cmm      ALL
----------------------------------------------
TCT     72(55)   51(39)   67(53)   52(37)   58(43)
REC     51(36)   34(23)   70(54)   41(29)   45(32)
HYPO    72(59)   50(35)   91(84)   68(59)   66(56)
German Evaluation
Graders:
G1: Benjamin
G2: Tanja
G3: Stephan

Dialogs:
a1:  g047ak  ( 46 SDUs / 23 utts.)
a2:  g051ak  (174 SDUs / 59 utts.)
amm: g006yk  (108 SDUs / 70 utts.)
c1:  g034ck  (314 SDUs / 98 utts.)
All:         644 SDUs / 350 utts.
German-to-German
       HYPO       SLT-TCT    SLT-REC
------------------------------------
G1     57 (50)    28 (23)    26 (23)
G2     59 (50)    24 (6)     21 (5)
G3     64 (48)    39 (7)     32 (5)
All    58 (48)    31 (15)    25 (12)

a1     81 (74)    55 (21)    52 (20)
a2     71 (59)    35 (14)    34 (14)
amm    38 (25)    34 (18)    22 (11)
c1     58 (49)    23 (8)     19 (8)
All    58 (48)    31 (15)    25 (12)
German-to-Italian
       SLT-TCT    SLT-REC
-------------------------
G1     31 (7)     26 (4)
G2     38 (9)     32 (6)
G3     30 (24)    26 (22)
All    32 (13)    27 (11)

a1     55 (21)    56 (22)
a2     39 (13)    34 (12)
amm    36 (18)    31 (15)
c1     25 (10)    19 (8)
All    32 (13)    27 (11)
Lessons Learned/Issues • Variance between graders - what to do? • Segmentation variations - what to do? • Grading with two binary decisions • New data for next evaluation + save copy of current system version • Release current eval data for system dev? • Component Evaluation
Showcase-2 Open Issues • Definitions of the domains and scenarios for showcase-2a and showcase-2b • Data Collection • New functionalities: • for the users (client/agent) • for the system developers & for demonstration • Architecture modifications
Demo at IST Issues • Details about the demo • Demo “wrapper” around the system: • client initiates call from a web page • dealing with the push-to-talk issue • other functionalities? • Schedule for tests before demo