Thomas Fang Zheng, Center of Speech Technology (CST)

30 Oct 01 at Communications Research Lab, Kyoto • Chinese Spoken Dialogue Systems -- Speech Activities in CST • Thomas Fang Zheng • Center of Speech Technology (CST), State Key Lab of Intelligent Technology and Systems, Department of Computer Science & Technology, Tsinghua University • fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/

Outline • Brief Introduction to CST • Speech R&D Activities (w/ paper references) • A Flight Spoken Dialogue System - EasyFlight • System Overview • Keyword Based Robust Parser • Powerful Dialogue Manager • Demonstrations • EasyFlight - Flight inquiry & reservation dialog system • EasyNav - THU Campus navigation dialog system • Thanks Center of Speech Technology, Tsinghua University

Center of Speech Technology • Founded in 1979 as the Speech Laboratory • Joined the State Key Laboratory of Intelligent Technology and Systems in 1999 and was renamed the Center of Speech Technology • http://sp.cs.tsinghua.edu.cn/

Members of CST in 2001

Funding Resources • State fundamental research plans: • NSF • 863 • 973 • 985 (Tsinghua University) • Collaboration with industry: • Analog Devices, Inc. • IBM • Intel • Keysun Information Technology Limited • Lucent Technologies • Microsoft • Nokia • SoundTek Technology Limited • Weniwen Technologies Limited • ...

Speech R&D Activities • Acoustic Modeling: Feature Extraction and Selection; Acoustic Modeling; Accurate & fast AM Search • Robustness: Speech Enhancement; Fractals; Speaker Adaptation; Speaker Normalization • Chinese Pronunciation Modeling • Language Modeling: Characteristics of Chinese; Language Modeling and Search; LM Adaptation & New Word Induction • Natural/Spoken Language Understanding (NLU/SLU): NLU - GLR Based Parsing; SLU - KW based robust parsing; Dialogue Manager • Applications: Command and control; Keyword spotting; Language Learning; Input method editor; Chinese dictation machine; Spoken dialogues; Speaker identification and verification • Resources

Feature Extraction and Selection • Fan Wang, Fang Zheng, and Wenhu Wu. “An MCE based Classification Tree Using Hierarchical Feature-Weighting in Speech Recognition,” EuroSpeech’2001, 3:1947-1950, Sept. 3-7, 2001, Aalborg, Denmark • Xinyan Zhang. “Subband analysis based robust speech recognition,” Graduate Project: Tsinghua University, Beijing. June 2001.

Acoustic Modeling • Jiyong Zhang, Fang Zheng, Jing Li, Chunhua Luo, and Guoliang Zhang, “Improved Context-Dependent Acoustic Modeling for Continuous Chinese Speech Recognition,” EuroSpeech, 3:1617-1620, Sept. 3-7, 2001, Aalborg, Denmark • Zheng Fang, Wu Wenhu, and Fang Ditang, “Center-Distance Continuous Probability Models And the Distance Measure,” J. of Computer Science and Technology, 13(5): 426-437, Sept., 1998 • ZHENG Fang, MOU Xiaolong, WU Wenhu, and FANG Ditang, “On the Embedded Multiple-Model Scoring Scheme for Speech Recognition,” International Symposium on Chinese Spoken Language Processing (ISCSLP'98), ASR-A3, pp.49-53, Dec.7-9, 1998, Singapore • Guo Qing, Zheng Fang, Wu Jian and Wu Wenhu, “A new method used in HMM for modeling frame correlation,” ICASSP, pp. I-169~172, March 15~19, 1999, Phoenix

Accurate & fast AM Search • Guoliang Zhang, Fang Zheng, and Wenhu Wu, “A Two-Layer Lexical Tree Based Beam Search in Continuous Chinese Speech Recognition,” EuroSpeech, 3:1801-1804, Sept. 3-7, 2001, Aalborg, Denmark • Jian Wu, and Fang Zheng. “Reducing time-synchronous beam search effort using stage based look-ahead and language model rank based pruning,” ICSLP’00, pp. IV-262~265 • Zhanjiang Song, Fang Zheng, and Wenhu Wu. “Statistical knowledge based frame synchronous search strategies in continuous speech recognition,” ICASSP’00, pp. III-1583~1586 • Jiyong Zhang, Fang Zheng, Shu Du, Zhanjiang Song and Mingxing Xu. “Merging based syllable detection automaton in continuous Chinese speech recognition,” J. of Software, 10(11): 1212~1215, Nov. 1999 (in Chinese) • Fang Zheng, Zhanjiang Song, Mingxing Xu, et al. “EasyTalk: A Large-Vocabulary Speaker-Independent Chinese Dictation Machine,” EuroSpeech'99, Vol. 2, pp.819-822, Budapest, Hungary, Sept. 1999 • Fang Zheng, Mingxing Xu, and Wenhu Wu. “Search strategies in continuous speech recognition,” 5th National Conference on Man-Machine Speech Communications (NCMMSC’98), 138-143, Jul. 26-31, 1998, Harbin (in Chinese)

Speech Enhancement • YANG Dali, XU Mingxing, WU Wenhu, ZHENG Fang, “A Noise Cancellation Method Based on Wavelet Transform,” International Symposium on Chinese Spoken Language Processing, pp. 211-214, Oct. 13-15, 2000, Beijing

Fractals • Fan Wang, Fang Zheng, and Wenhu Wu, “A C/V segmentation method for Mandarin speech based on multiscale fractal dimension,” International Conference on Spoken Language Processing, pp. IV-648~651, Oct. 16-20, Beijing • WANG Fan, ZHENG Fang, and WU Wenhu, “A self-adapting endpoint detection algorithm for speech recognition in noisy environments based on 1/f process,” International Symposium on Chinese Spoken Language Processing, pp. 327-330, Oct. 13-15, 2000, Beijing

Speaker Adaptation • Lei He, Jian Wu, Ditang Fang, Wenhu Wu, “Speaker adaptation based on combination of map estimation and weighted neighbor regression,” IEEE ICASSP, pp.II-981~984, June 5-9, 2000, Istanbul, Turkey

Speaker Normalization • Lei HE, Ditang FANG, and Wenhu WU, “Speaker normalization training and adaptation for speech recognition,” International Conference on Spoken Language Processing, pp. IV-342~345, Oct. 16-20, Beijing • Tranzai LEE, Fang ZHENG, and Wenhu WU, “Reference point alignment frequency warp method for speaker adaptation,” International Conference on Signal Processing, pp. II-756~759, Aug. 21-25, 2000, Beijing

Chinese Pronunciation Modeling • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, “Mandarin Pronunciation Modeling Based on CASS Corpus,” Sino-French Symposium on Speech and Language Processing, pp. 47-53, Oct. 16, 2000, Beijing • Pascale Fung, William Byrne, ZHENG Fang Thomas, Terri Kamm, LIU Yi, SONG Zhanjiang, Veera Venkataramani, and Umar Ruhi, “Pronunciation modeling of Mandarin casual speech,” Workshop 2000 on Speech and Language Processing: Final Report for MPM Group, http://www.clsp.jhu.edu/index.shtml • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, “Modeling Pronunciation Variation Using Context-Dependent Weighting and B/S Refined Acoustic Modeling,” EuroSpeech, 1:57-60, Sept. 3-7, 2001, Aalborg, Denmark

Language Modeling • Jian Wu and Fang Zheng, “On enhancing Katz-smoothing based back-off language model,” International Conference on Spoken Language Processing, pp. I-198~201, Oct. 16-20, Beijing • Xiaolong Mou, Jinming Zhan, Fang Zheng and Wenhu Wu. “The N-Gram Language Model Based on the Back-off Estimation Algorithm,” The 5th National Conference on Man-Machine Speech Communication (NCMMSC’98), 206-209, July 26-31, 1998, Harbin (in Chinese)

LM Search • Fang Zheng, Jian Wu and Zhanjiang Song, “Improving the Syllable-Synchronous Network Search Algorithm for Word Decoding in Continuous Chinese Speech Recognition,” J. Computer Science & Technology, 15(5): 461-471, Sept. 2000 • Fang Zheng, “A Syllable-Synchronous Network Search Algorithm for Word Decoding in Chinese Speech Recognition,” ICASSP, pp. II-601~604, March 15~19, 1999, Phoenix • Fang Zheng, Jian Wu and Wenhu Wu, “Input Chinese sentences using digits,” International Conference on Spoken Language Processing, pp. III-127~130, Oct. 16-20, Beijing

LM Adaptation & New Word Induction • Genqing Wu, Fang Zheng, Ling Jin, and Wenhu Wu, “An online incremental language model adaptation,” EuroSpeech, 3:2139-2142, Sept. 3-7, 2001, Aalborg, Denmark

NLU - GLR Based Parsing • Yinfei Huang, Fang Zheng, Yi Su, Fang Li, Wenhu Wu, “A Theme Structure Method for the Ellipsis Resolution,” EuroSpeech, 3:2153-2156, Sept. 3-7, 2001, Aalborg, Denmark • Yi Su, Fang Zheng, and Yinfei Huang, “Design of a Semantic Parser with Support to Ellipsis Resolution in a Chinese Spoken Language Dialogue System,” EuroSpeech, 3:2161-2164, Sept. 3-7, 2001, Aalborg, Denmark • Yinfei HUANG, Fang ZHENG, Mingxing XU, Pengju Yan, and Wenhu WU, “Language understanding component for Chinese dialogue system,” International Conference on Spoken Language Processing, pp. III-1053~1056, Oct. 16-20, Beijing • Yan Pengju, Zheng Fang, Xu Mingxing, Huang Yinfei, “Word-class stochastic model in a spoken language dialogue system,” International Symposium on Chinese Spoken Language Processing, pp. 141-144, Oct. 13-15, 2000, Beijing

SLU - KW Based Robust Parsing • Pengju Yan, Fang Zheng, Hui Sun, and Mingxing Xu, “Parsing spontaneous speech in the dialogue systems,” EuroSpeech, 3:2149-2152, Sept. 3-7, 2001, Aalborg, Denmark

Dialogue Manager (DM) • Xiaojun Wu, Fang Zheng and Mingxing Xu. “TOPIC Forest: A plan-based dialogue management structure,” International Conference on Acoustics, Speech and Signal Processing, Vol. I., May 7-11, Salt Lake City, USA • Li Fang, Zheng Fang, Wu Wenhu, Huang Yinfei, “Dynamic Query Organization and Response Generation in Spoken Dialogue System,” 19th International Conference on Computer Processing of Oriental Languages, May 14-16, Seoul, Korea

Applications & References • Command and control • Fang Zheng, Qixiu Hu, Xiang Deng, et al. “An introduction to a kind of voice dialers for dummies,” 4th National Conference on Man-Machine Speech Communications (NCMMSC’96), pp.165-168, Oct. 1996, Beijing (in Chinese) • Yinfei Huang, Fang Zheng, and Wenhu Wu. “EasyCmd: Navigation by Voice Commands,” International Symposium on Chinese Spoken Language Processing (ISCSLP’00), pp. 145-148, Oct. 13-15, 2000, Beijing

Keyword spotting • Zheng Fang, Xu Mingxing, Mou Xiaolong, et al. “HarkMan - A Vocabulary-Independent Keyword Spotter for Spontaneous Chinese Speech,” J. of Computer Science and Technology (JCST), 14(1): 18-26, Jan., 1999

Language Learning (Pronunciation Scoring) • Zhanjiang Song, Fang Zheng, Mingxing Xu, and Wenhu Wu. “An Effective Scoring Method for Speaking Skill Evaluation System,” EuroSpeech'99, Vol. 1, pp.187-190, Budapest, Hungary, Sept. 1999

Input method editor (IME) • Fang Zheng, Jian Wu, and Wenhu Wu. “Input Chinese sentences using digits,” International Conference on Spoken Language Processing (ICSLP’00), pp. III-127~130, Oct. 16-20, Beijing • Ling JIN, Genqing Wu, Fang Zheng, and Wenhu Wu. “Improved strategies for intelligent sentence input method engine system,” International Symposium on Chinese Spoken Language Processing (ISCSLP’00), pp. 247-250, Oct. 13-15, 2000, Beijing

Chinese dictation machine (CDM) • Fang Zheng, Zhanjiang Song, Mingxing Xu, et al. “EasyTalk: A Large-Vocabulary Speaker-Independent Chinese Dictation Machine,” EuroSpeech'99, Vol. 2, pp.819-822, Budapest, Hungary, Sept. 1999 • Jian Wu, and Fang Zheng. “Reducing time-synchronous beam search effort using stage based look-ahead and language model rank based pruning,” ICSLP’00, pp. IV-262~265

Spoken dialogues • Yinfei Huang, Fang Zheng, Mingxing Xu, et al. “Language understanding component for Chinese dialogue system,” ICSLP’00, pp. III-1053~1056, Oct. 16-20, Beijing • Yan Pengju, Zheng Fang, Xu Mingxing, et al. “Word-class stochastic model in a spoken language dialogue system,” ISCSLP’00, pp. 141-144, Oct. 13-15, 2000, Beijing • Pengju Yan, Fang Zheng, Hui Sun, et al. “Parsing Spontaneous speech in the dialogue systems,” to be submitted • Xiaojun Wu, Fang Zheng and Mingxing Xu. “TOPIC Forest: A plan-based dialogue management structure,” to appear in ICASSP’2001

Speaker identification and verification • Language Identification • …

Resources • Chinese Speech Database • Standard Chinese (25 CD-ROMs) • Chinese w/ Yue accent (41 CD-ROMs) • Real-world spontaneous telephone dialogue (200 hours) • Chinese annotated spontaneous speech (CASS) corpus (6 hours) • 863 Speech Recognition Database (40 CD-ROMs) • 863 Speech Synthesis Database (8 CD-ROMs) • Chinese Text Database • People’s Daily

EasyFlight A Spoken Dialogue System for Flight Information Inquiry and Flight Reservation

System Overview • EasyFlight is a spoken dialogue system providing • Flight information inquiry; and • Flight reservation. • EasyFlight features: • Context-dependent understanding (with a remembering and forgetting scheme to support ellipsis (省略)) • Robust parsing (to tolerate spoken language phenomena) • Changeable topics (the user can shift among topics freely) • Mixed-initiative (混合主导) (both the user and the machine can guide the conversation at any time)

System Block Diagram (figure) • Components: Keyword Spotter, Syntactic Analyzer, Semantic Analyzer, Dialogue Manager, Response Generator, Text-to-Speech, Domain Database • Data and flow labels: User Utterance, Keyword Lattice, Dynamic Vocabulary, Dynamic Rule Set, Syntax Tree, Semantic frame, Contexts, Dialog History & Status, Inquiry & Update, Results, Response Focus, Texts/Tags, Text response, Speech Response, Maintenance

Keyword Based Robust Parser • We use a keyword-based parser and a context-free grammar (CFG) for spoken language understanding • The symbols of the grammar are semantically relevant items • Why keywords? • Why grammar? • How do we do it?

Why keywords? • In spoken dialogues, there are often • Speech recognition errors: deletion, substitution, and insertion • Spontaneous speech phenomena: garbage, hesitation, repetition, correction, fragments, ellipsis, word disordering, ill-formed sentences, and so on • So it is difficult to obtain a fully correct recognized sentence for full-sentence parsing • An alternative: keyword spotting, a semantics-based grammar, and partial parsing (each partial result is maintained)

Why grammar? • The sentence structure can be viewed as a deterministic tree.

Why grammar? (cont’d) • The structure of the underlying semantics (语义) and/or the domain knowledge can also be viewed as a deterministic tree.

Problems when using grammar • Chinese is an ideographic (表意的) language • Chinese sentences are more casual than English ones • They are difficult to model with syntactic grammars • In dialogue systems, ungrammatical phenomena are commonly seen • ellipses or missing words/phrases • repetitions • garbage • fragments • disordering • ill forms

Solutions • Define special types of CFG rules to deal with spoken language phenomena. • Instead of the parts of speech (POS) that serve as terminal symbols in a traditional grammar, use keyword categories as terminal symbols and semantic units as non-terminal symbols to form a semantics-based grammar • Enhance and modify the traditional chart parser into the Marionette parser

A keyword-based robust parser includes: • Keyword List, used as the lexicon for the recognizer and as terminal symbols in the semantic grammar • Grammar Definition, in which four types of rules are defined • Grammar Transcription, a semantic grammar based on the analysis of a real-world domain corpus • Marionette Parser, a chart parser that uses the aforesaid grammar and eliminates ambiguities by pruning/optimizing

Keyword list • ~700 lexical words • ~70 semantic categories • 3 larger classes • Material class (实体类) - each word contains some real domain-specific information • Tag class (标记类) - each category plays a different role in identifying the user’s intention • Atom class (原子类) - no word has substantial semantic meaning of its own, but words can be combined into larger constituents (成分).
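
As a rough illustration of how such a keyword list might be organized, the sketch below maps words to semantic categories and derives the larger class from the category prefix. The `mat_`, `tag_`, and `ato_` prefixes follow the category names that appear in the grammar examples of this talk; the specific word-to-category assignments, the prefix-to-class table, and the function names are assumptions made for illustration, not the actual EasyFlight lexicon.

```python
# Hypothetical miniature keyword list: word -> semantic category.
# Category names reuse prefixes seen in the grammar examples;
# the word assignments themselves are illustrative assumptions.
KEYWORDS = {
    "星期": "ato_week",
    "是": "tag_is",
    "吗": "tag_question_mark",
}

# Assumed mapping from category prefix to the three larger classes.
CLASS_BY_PREFIX = {"mat": "Material", "tag": "Tag", "ato": "Atom"}


def keyword_class(category: str) -> str:
    """Return the larger class a category belongs to, based on its prefix."""
    prefix = category.split("_", 1)[0]
    return CLASS_BY_PREFIX[prefix]


def tag_words(words):
    """Keep only known keywords, paired with their category and class."""
    return [
        (w, KEYWORDS[w], keyword_class(KEYWORDS[w]))
        for w in words
        if w in KEYWORDS
    ]
```

Words that are not in the list are simply dropped here, which mirrors the keyword-spotting idea that non-keywords do not reach the grammar.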

Grammar Definition • 4 types of grammar rules to cope with spontaneous speech phenomena • Up-tying type (苛刻型) - the sub-constituents are strictly tied together, as in conventional grammar • By-passing type (跳跃型) - the sub-constituents are combined whether or not there are gap words in between • Up-messing type (无序型) - the sub-constituents can appear in any order • Over-crossing type (交叉型) - the occupations of the sub-constituents can overlap with each other • Overall features • Keywords are taken as terminal symbols • All constituents belong to semantic categories rather than syntactic categories • Thus the grammar is a semantic one • The grammar size is over 250

Semantic Grammar Transcription (examples) • up-tying (苛刻型) rules • Some crucial information must not be interrupted by other terms, e.g. a personal ID no. • E.g. (in China, an ID no. can be 15 or 18 digits long) • sub_id_card_head * ato_0to9_yao + ato_0to9_yao + … + ato_0to9_yao (15 identical terms) • id_card_no → sub_id_card_head • id_card_no * sub_id_card_head + ato_0to9_yao + ato_0to9_yao + ato_0to9_yao • This is the traditional rule type.
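
A minimal sketch of what strict (up-tying) matching could look like for the ID number example, assuming the input has already been reduced to a list of keyword category names. The function names and the choice to try the 18-digit form before the 15-digit form are assumptions of this sketch, not details taken from the system.

```python
DIGIT = "ato_0to9_yao"  # category name taken from the rule examples


def match_up_tying(categories, start, pattern):
    """Match `pattern` at `start` with no gaps and in the given order.

    Returns the end index (exclusive) on success, or None on failure.
    """
    end = start + len(pattern)
    if categories[start:end] == list(pattern):
        return end
    return None


def match_id_card_no(categories, start=0):
    """Try the 18-digit form first, then fall back to the 15-digit form."""
    for length in (18, 15):
        end = match_up_tying(categories, start, [DIGIT] * length)
        if end is not None:
            return (start, end)
    return None
```

Because every element must sit directly next to the previous one, a single filler inside the digit string makes the match fail, which is the behaviour the slide asks for with crucial information.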

Semantic Grammar Transcription (examples) • by-passing (跳跃型) rules • In contrast, some utterances may contain recognition garbage, fillers, or meaningless parts, e.g., “星期啊三嗯星期四” (“Wednesday, um, Thursday”, with the filler “啊” inside the word for Wednesday and “嗯” between the two days) • E.g. • sub_week_day → ato_week + ato_1to6 • sub_week_day_list → sub_week_day • sub_week_day_list → sub_week_day + sub_week_day_list • sub_date → sub_week_day_list • A great many rules are of this type.
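
The by-passing idea can be sketched as in-order matching that skips over gap tokens. The example below assumes the utterance “星期啊三嗯星期四” has been turned into a category sequence in which the two fillers carry a placeholder label `filler`; that label, the function names, and the unlimited gap length are assumptions of this sketch.

```python
def match_by_passing(categories, start, pattern):
    """Match `pattern` in order from `start`, skipping any gap tokens.

    Returns the list of matched positions, or None if the pattern
    cannot be completed.
    """
    positions = []
    i = start
    for wanted in pattern:
        while i < len(categories) and categories[i] != wanted:
            i += 1  # by-pass a token that is not the one we need
        if i == len(categories):
            return None
        positions.append(i)
        i += 1
    return positions


def find_week_days(categories):
    """Collect every sub_week_day (ato_week + ato_1to6), left to right."""
    found, i = [], 0
    while True:
        hit = match_by_passing(categories, i, ["ato_week", "ato_1to6"])
        if hit is None:
            return found
        found.append(hit)
        i = hit[-1] + 1
```

For the sequence `ato_week, filler, ato_1to6, filler, ato_week, ato_1to6`, this finds two week days, at positions 0 and 2 and at positions 4 and 5, which could then feed a `sub_week_day_list`.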

Semantic Grammar Transcription (examples) • up-messing (无序型) rules • Some information, such as time, city names, plane types, etc., can appear without following any predefined order • E.g. • timeloc_info_cond @ info_date_time_cond + info_fromto • plane_info @ mat_airline_code + mat_aircraft_type • flight_info_cond @ timeloc_info_cond + plane_info
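
Order-free (up-messing) matching can be illustrated by trying every ordering of the right-hand side. This is only a sketch: the function name is hypothetical, and gaps between constituents are deliberately not handled here so that the example isolates the ordering aspect.

```python
from itertools import permutations


def match_up_messing(categories, pattern):
    """Return the first ordering of `pattern` that equals `categories`, else None.

    Only the order is relaxed here; gaps are not handled in this sketch.
    """
    for ordering in permutations(pattern):
        if list(ordering) == list(categories):
            return ordering
    return None
```

With the `plane_info` rule, both `mat_airline_code, mat_aircraft_type` and `mat_aircraft_type, mat_airline_code` are accepted, while a sequence with a missing element is rejected.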

Semantic Grammar Transcription (examples) • over-crossing (交叉型) rules • Some phrases/constituents, such as “是…吗” (“is it ...?”), can have other constituents appear in between, e.g. “是到北京吗” (“Is it to Beijing?”), “是两张吗” (“Is it two tickets?”) • E.g. • mark_q_is → tag_is_or_not • mark_q_is → tag_is + tag_question_mark • mark_q_is → tag_is_q • confirm_request # mark_q_is + confirm_c
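
One way to picture over-crossing rules is to record, for each constituent, the set of leaf positions it occupies. In “是到北京吗”, and assuming the segmentation 是 / 到 / 北京 / 吗, `mark_q_is` occupies positions 0 and 3 while `confirm_c` occupies 1 and 2, so their extents overlap. The class, the function names, and the rule that no leaf may be claimed twice are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constituent:
    name: str
    leaves: frozenset  # positions of the keywords this constituent occupies

    @property
    def extent(self):
        """(first, last) leaf position covered by this constituent."""
        return (min(self.leaves), max(self.leaves))


def extents_overlap(a, b):
    """True when the (first, last) ranges of two constituents intersect."""
    (a0, a1), (b0, b1) = a.extent, b.extent
    return a0 <= b1 and b0 <= a1


def combine_over_crossing(name, a, b):
    """Group two constituents even if their extents overlap.

    They must still not claim the same leaf twice (an assumption of this sketch).
    """
    if a.leaves & b.leaves:
        return None
    return Constituent(name, a.leaves | b.leaves)
```

A conventional chart parser would refuse to join the two parts because neither span ends where the other begins; here the combination succeeds and yields a `confirm_request` covering all four positions.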

When ambiguities (歧义) arise, the candidates are evaluated against several criteria, and the constituent that ranks highest survives • position (in sentence) • occupation (number of leaf nodes) • depth, etc.
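
The ranking step might be sketched as a sort key over the listed criteria. The slides do not say how the criteria are weighted, so the priority used below (larger occupation first, then later position, then shallower tree) is an assumption, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    start: int       # position of the first leaf in the sentence
    occupation: int  # number of leaf nodes covered
    depth: int       # height of the subtree


def rank_key(c):
    # Assumed priority: larger occupation, then later position, then shallower tree.
    return (c.occupation, c.start, -c.depth)


def pick_survivor(candidates):
    """Return the highest-ranking candidate among competing constituents."""
    return max(candidates, key=rank_key)
```

Swapping the order of the tuple in `rank_key` is enough to experiment with a different priority among the criteria.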

Marionette Parser - an enhanced chart parser • Maintaining all partial results • Combining non-adjacent constituents (By-passing rules) • Considering all possible orders of the constituents (Up-messing rules) • Grouping the constituents whether or not their occupations overlap with each other (Over-crossing rules) • Giving precedence to later sub-constituents over earlier ones • Giving precedence to larger sub-constituents over smaller ones.

A part of the parsing algorithm

In a semantic tree (as in a syntactic tree) • Statically, each rule node corresponds to a semantic function. • Dynamically, each constituent node has a pointer to a semantic function • Semantic analysis is a procedure of calling semantic functions: • At the very beginning, the topmost node’s semantic function is called; • Each child node’s semantic function is called recursively by its parent node; • Until the semantics is finally obtained.
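
The recursive calling scheme described above can be sketched with nodes that each hold a reference to a semantic function. The `Node` class, the two example functions, and the `from_city`/`to_city` labels are hypothetical; only the top-down, parent-calls-child pattern comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    label: str
    semantic_fn: Callable[["Node"], dict]  # pointer to this node's semantic function
    children: List["Node"] = field(default_factory=list)
    text: str = ""

    def evaluate(self) -> dict:
        """Call this node's semantic function."""
        return self.semantic_fn(self)


def leaf_value(node):
    """Hypothetical leaf function: expose the keyword text under the node label."""
    return {node.label: node.text}


def merge_children(node):
    """Hypothetical rule function: call each child's function and merge results."""
    frame = {}
    for child in node.children:
        frame.update(child.evaluate())
    return frame
```

Calling `evaluate()` on the topmost node triggers the whole recursion, and the returned dictionary plays the role of a semantic frame in this sketch.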

A parsing example.

Powerful Dialogue Manager (DM) • Role: • Maintain dialogue contexts and states • Direct the dialogues • Accept parsing results and generate responses • Desired features: • Be able to deal with multiple topics • Topics can be changed freely • Be able to make full use of information shared by different topics and to support ellipsis (when the topic changes from one to another) • User and machine mixed-initiative (混合主导) • Be adaptive to users’ interests & parlances • Be domain-transparent to the user (easy to port to new systems)
