HLT Specifications: Translation Components
Nespole! Trento Meeting, May 26, 2000
Nespole! HLT Objectives
• State of the art (C-STAR II):
  • Broad but limited domain (Travel Planning)
  • Spontaneous spoken language (disfluencies, incomplete and ungrammatical sentences)
  • Task-oriented dialogue (non-descriptive)
  • Incomplete coverage (semi-scripted demo)
Main Analysis and Generation Approaches
• Robust parsing using domain-specific semantic grammars and simple mappers (CMU/UKA)
• Phrase analysis grammars and IF classification trees/mappers (IRST)
• Syntactic and semantic analysis with mappers to and from IF (CLIPS)
• Direct translation using a multi-engine architecture (EBMT, glossaries, dictionaries); CMU/UKA would like to investigate this
Nespole! HLT Objectives
• Scalability (expansion of the existing domain):
  • expanding coverage of IF to the broader Travel Domain, as required for the APT showcase
  • development of analysis and generation approaches that support easy expansion
  • new broad and general IF representation, with appropriate analysis and generation approaches
Nespole! HLT Objectives
• Portability (easy expansion into new domains):
  • extending the existing IF with Domain Actions for other domains (Help Desk for the 2nd showcase)
  • new broad IF representation
  • new analysis and generation approaches that are appropriate for the new broad IF
Nespole! HLT Objectives
• Robustness (ability to handle more corrupt input, with graceful degradation of performance):
  • multiple alternative analysis/translation approaches
  • better identification of out-of-domain utterances, and confidence measures
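The robustness goals above (several alternative approaches, confidence measures, out-of-domain detection) can be pictured with a small selection routine. This is only a sketch under assumptions: the `Candidate` type, the `select` function, the engine labels, and the 0.5 threshold are all hypothetical and do not come from the slides.

```python
from typing import Iterable, NamedTuple, Optional

class Candidate(NamedTuple):
    engine: str        # which analysis/translation approach produced this output
    output: str
    confidence: float  # hypothetical score in [0, 1]

def select(candidates: Iterable[Candidate], threshold: float = 0.5) -> Optional[Candidate]:
    """Return the most confident candidate, or None when nothing is trusted.

    A None result stands in for "likely out of domain": the caller can then
    degrade gracefully, for example by asking the user to rephrase.
    """
    best = max(candidates, key=lambda c: c.confidence, default=None)
    if best is None or best.confidence < threshold:
        return None
    return best
```

How real confidence scores would be estimated and calibrated across different engines is left open here; the slides list it as a goal, not as a solved problem.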
CMU/UKA Planned Approaches
• New analysis approach for domain-specific, task-oriented language that combines rule-based and statistical/trainable methods
• New analysis engine for the new-style IF, using a chunk parser followed by a new combiner and mapper
• Possible addition of the MEMT direct translation approach for coverage and robustness
• Effective combination and disambiguation of all of the above approaches
• New generation from IF using GenKit
New Approach: SALT
SALT: Statistical Analyzer for Language Translation
• Combines ML-trainable and rule-based analysis methods for robustness and portability
• Rule-based parsing is restricted to a well-defined set of argument-level phrases and fragments
• Trainable classifiers (NN, decision trees, etc.) are used to derive the DA (speech act and concepts) from the sequence of argument concepts
• Phrase-level grammars are more robust and more portable to new domains
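The two-stage idea behind SALT can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the regular expressions stand in for phrase-level grammars, the frequency table stands in for the trainable classifier (the slide mentions NNs and decision trees), and the DA labels only imitate the general "speech-act+concept" shape without being taken from the actual IF specification.

```python
import re
from collections import Counter, defaultdict

# Stage 1 stand-in: hypothetical argument-level phrase patterns.
ARG_PATTERNS = {
    "room-type": re.compile(r"\b(single|double) room\b"),
    "price": re.compile(r"\bhow much\b|\bprice\b|\bcost\b"),
    "time": re.compile(r"\b(monday|tuesday|tomorrow|tonight)\b"),
}

def extract_args(utterance):
    """Rule-based stage: argument concepts found, ordered by first appearance."""
    text = utterance.lower()
    hits = []
    for concept, pattern in ARG_PATTERNS.items():
        match = pattern.search(text)
        if match:
            hits.append((match.start(), concept))
    return tuple(concept for _, concept in sorted(hits))

class DAClassifier:
    """Trainable stage: most frequent DA observed for each argument sequence."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, examples):
        for utterance, da in examples:
            self.counts[extract_args(utterance)][da] += 1

    def predict(self, utterance):
        args = extract_args(utterance)
        if args not in self.counts:
            return None  # unseen argument sequence: no DA proposed
        return self.counts[args].most_common(1)[0][0]

# Hypothetical annotated utterances (illustrative DA labels).
TRAIN = [
    ("I would like a double room", "request-action+reservation+room"),
    ("how much is a single room", "request-information+price+room"),
    ("what is the price of a double room tomorrow", "request-information+price+room"),
]
```

After `clf = DAClassifier(); clf.train(TRAIN)`, the call `clf.predict("How much does a double room cost")` returns the price-request label, because the extracted argument sequence `("price", "room-type")` was seen in training. The point of the split is that only the small phrase patterns are hand-written, while the mapping to DAs is learned from annotated data.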
Alternative Approach: MEMT
MEMT: Multi-Engine Machine Translation
• Translates directly into the target language (no IF)
• Based on the Pangloss/Diplomat translation system developed at CMU
• Uses a combination of EBMT, phrase glossaries and a bilingual dictionary
• English/German system operational
• Good fall-back for uncovered utterances
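A toy illustration of combining several translation resources is given below. It is a deliberately simplified sketch, not the Pangloss/Diplomat combination algorithm: it uses greedy longest-match over the input and consults three hypothetical lookup tables in a fixed priority order (example base, phrase glossary, word dictionary). The table contents and the `translate` function are invented for the example.

```python
# Hypothetical English-to-German resources, in priority order.
ENGINES = [
    ("ebmt", {"good morning": "guten Morgen"}),        # stand-in for example-based matches
    ("glossary", {"double room": "Doppelzimmer"}),     # phrase glossary
    ("dictionary", {"a": "ein", "please": "bitte", "room": "Zimmer"}),
]

def translate(utterance, max_len=4):
    """Cover the input left to right, preferring longer spans and earlier engines.

    Returns a list of (engine_name, target_text) pairs; words that no engine
    covers are passed through and tagged "untranslated".
    """
    tokens = utterance.lower().split()
    output, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest span first
            span = " ".join(tokens[i:i + n])
            hit = next(((name, table[span]) for name, table in ENGINES if span in table), None)
            if hit:
                output.append(hit)
                i += n
                break
        else:  # no engine covered even the single word at position i
            output.append(("untranslated", tokens[i]))
            i += 1
    return output
```

For "good morning a double room please" the pieces come from three different resources, which is the behaviour that makes a multi-engine setup useful as a fall-back: partial coverage still yields partial output instead of a failed analysis.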
HLT Server Components
• Each HLT Server consists of an Analysis Chain and a Generation Chain
• Analysis Chain:
  • Speech recognition + analysis into IF
• Generation Chain:
  • Generation from IF + speech synthesis
• Each site is free to develop its own analysis and generation technology
• Communication between modules is primarily via IF, using the CommSwitch server and protocol
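The server structure described above can be sketched as two composed pipelines per language. This is a minimal sketch under assumptions: the `HLTServer` class, its field names, and the stub components are hypothetical, and the IF is represented as a plain string purely for illustration. The real communication layer (the switch server and its protocol) is not modelled.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HLTServer:
    """One site's server: four pluggable components forming two chains."""
    recognize: Callable[[bytes], str]    # speech -> source-language text
    analyze: Callable[[str], str]        # text -> IF
    generate: Callable[[str], str]       # IF -> target-language text
    synthesize: Callable[[str], bytes]   # text -> speech

    def analysis_chain(self, audio: bytes) -> str:
        """Speech recognition followed by analysis into IF."""
        return self.analyze(self.recognize(audio))

    def generation_chain(self, interchange: str) -> bytes:
        """Generation from IF followed by speech synthesis."""
        return self.synthesize(self.generate(interchange))

# Stub components: "audio" is just UTF-8 text, and the IF label is illustrative.
english = HLTServer(
    recognize=lambda audio: audio.decode("utf-8"),
    analyze=lambda text: "greeting" if text == "hello" else "unknown",
    generate=lambda interchange: "hello" if interchange == "greeting" else "?",
    synthesize=lambda text: text.encode("utf-8"),
)
italian = HLTServer(
    recognize=lambda audio: audio.decode("utf-8"),
    analyze=lambda text: "greeting" if text == "ciao" else "unknown",
    generate=lambda interchange: "ciao" if interchange == "greeting" else "?",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Translating from English to Italian is then `italian.generation_chain(english.analysis_chain(b"hello"))`. Because the only thing passed between servers is the IF, each site can replace its own four components without the other site noticing, which is the technology freedom the slide asks for.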
Main Constraints and Requirements
• Maintain site technology freedom and distributed HLT development as much as possible
• Leverage existing C-STAR technology:
  • start with existing analysis and generation engines
  • use (extend) the C-STAR CommSwitch protocol
• New server architecture allows:
  • constant availability for testing and development
  • plug-and-play of new modules
  • separation of external API issues from required HLT communication
Data Collection for Translation Component Development
• Analysis of the extended domain for the first showcase:
  • CLIPS and APT data (also translated into English)
• Preparations for data collection with APT:
  • real dialogues between users and APT agents
  • monolingual dialogues
  • schedule? amount of data to be collected?
• Annotation of collected data?
Points for Discussion
• Definition of the Scenario for SC-1
• Data Collection with APT
• Overview of Approaches
• HLT Servers
Definition of Scenario
• Analysis of APT email data (Paolo):
  • 9 main categories
  • developed ~20 specific scenarios
• Within 10 days, APT will look at the scenarios and prioritize them, and will prioritize web pages (for translation to French)
• We will use existing web pages for APT (in Italian, German, English), and some translated into French
• Goal is to focus on up to 10 scenarios
Data Collection with APT
• Logistics:
  • dedicated line (to be determined)
  • recording done centrally at the APT side by IRST, with data provided via the web site
• Timeline:
  • start time to be determined (end of June?)
  • 50 dialogues per language, 4 dialogues per hour
  • data collection by end of August
  • transcription by end of September
  • annotation with IF by end of October
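The recording effort implied by these figures is easy to check. The sketch below assumes that "4 dialogues per hour" is a recording throughput; the function name and the `num_languages` parameter are introduced here for illustration, and the number of languages to be recorded is not fixed by the slide.

```python
DIALOGUES_PER_LANGUAGE = 50   # target from the slide
DIALOGUES_PER_HOUR = 4        # throughput from the slide (interpretation assumed)

def recording_hours(num_languages: int = 1) -> float:
    """Hours of recording needed to reach the per-language dialogue target."""
    return num_languages * DIALOGUES_PER_LANGUAGE / DIALOGUES_PER_HOUR

print(recording_hours())  # 12.5 hours for one language
```

At 12.5 recording hours per language, the collection itself is small compared with the months reserved for it, which suggests that scheduling, transcription, and IF annotation are where most of the time goes.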
HLT Servers
• Modify existing C-STAR II components into a server module:
  • initial server version ready by ~end of June
• The Comm Server between HLT modules will be updated by CMU and sent to the Nespole! web site
Overview of Approaches
• IRST: emphasis on statistical approaches to analysis and classification into IF; generation using a rule-based system
• CLIPS: new IF-to-French generator; analysis approach will initially stay similar