The CU Speech Group at the Cambridge University Engineering Department is actively involved in research and development in speech recognition and synthesis. The group collaborates with a range of organisations, runs funded projects in recognition and synthesis, fundamental speech technology development, and spoken conversation for interactive language learning, and offers an MPhil in Computer Speech, Text and Internet Technology alongside PhD projects.
Cambridge University Engineering Department
Machine Intelligence Laboratory
The CU Speech Group
1. Organisation

CUED: 130 Academic Staff, 1100 Undergrads, 450 Postgrads

CUED: 6 Divisions
A. ThermoFluids
B. Electrical Eng
C. Mechanics
D. Structures
E. Management
F. Information Engineering Division

Information Engineering Division: Signal Processing Lab, Communication Systems Lab, Machine Intelligence Lab, Control Lab

Machine Intelligence Lab: Medical Imaging Group, Vision Group, Speech Group

Speech Group: 4 Staff (Bill Byrne, Mark Gales, Phil Woodland, Steve Young), 8 RAs, 12 PhDs
2. Speech Group Overview

• Funded Projects in Recognition & Synthesis (5-10 RAs)
• PhD Projects in Fundamental Speech Technology Development (10-15 students)
• MPhil in Computer Speech, Text and Internet Technology (with the Computer Laboratory NLIP Group)
• HTK Software Tools Development
• Computer Speech and Language
• International Community
3. Current Funded Projects

• HTK Rich Audio Transcription, funded under the DARPA “EARS” program
• TALK: Tools for Ambient Linguistic Knowledge, EU (collaboration with Edinburgh, Saarbrücken, Seville, BMW, …)
• SCILL: Spoken Conversation for Interactive Language Learning, CMI (collaboration with the MIT Spoken Language Systems group)
• Speech Recognition and Synthesis PhD projects funded by Toshiba Europe Ltd

Also active collaborations with IBM and Microsoft.
4. MPhil in Computer Speech, Text and Internet Technology (CSTIT)

Modular course, full-time and part-time, supported by an EPSRC Masters Training Package and run jointly by the Computer Laboratory and the Engineering Dept.

• Web Technology: programming in C++, Java, HTML, XML, client- and server-side scripting
• Speech Processing: analysis, recognition, synthesis
• Language Processing: syntax, parsing, semantics, pragmatics
• Dialogue Systems: understanding, generation, discourse processing and dialog acts
• Internet Applications: document retrieval, topic tracking, information extraction, voice-driven web access
5. HTK Development

HTK is a free software toolkit for developing HMM-based speech recognition systems. See http://htk.eng.cam.ac.uk

1989 – 1992:  V1.0 – 1.4   Initial development at CUED
1993 – 1999:  V1.5 – 2.3   Commercial development by Entropic
2000 – date:  V3.0 – V3.3  Academic development at CUED

• Recent development funded by Microsoft and the DARPA EARS Project.
• Primary dissemination route for CU research output.
• New in 2004: ATK, a real-time HTK-based recognition system.
6. Research Interests

Machine Learning
• fundamental theory of statistical modelling and pattern processing

Recognition
• large vocabulary systems [Eng, Chinese, (Fr & Ger)]
• acoustic model training and adaptation
• language model training and adaptation
• rich text transcription & spoken document retrieval
• noise robustness and confidence measures

Dialogue
• data-driven semantic processing
• statistical modelling
• optimisation

Synthesis
• data-driven techniques
• voice transformation
7. Speech Recognition

All based on hidden Markov models. A left-to-right HMM (S → 1 → 2 → 3 → E) models each basic sound (e.g. “ih”): each state defines a probability distribution describing the range of sounds which can be observed when in that state, and the model as a whole generates an acoustic vector sequence Y = y1 y2 y3 y4 y5 y6 y7.

Phones are glued together to make words, e.g. /b/ /ih/ /t/. But since phones are highly context sensitive, we actually use context-dependent phones, e.g. /?-b+ih/ /b-ih+t/ /ih-t+?/.
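As a concrete illustration, here is a minimal sketch of the forward algorithm computing log p(Y | model) for a toy left-to-right HMM with single-Gaussian state output distributions. All names, shapes and parameter values are assumptions for illustration, not HTK's implementation.

```python
import numpy as np

# Minimal forward-algorithm sketch for a left-to-right HMM whose states
# emit diagonal-Gaussian acoustic vectors. Illustrative only: the
# parameters below are toy values, not a trained model.

def log_gauss(y, mean, var):
    # log N(y; mean, diag(var)) for one observation vector
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def forward_loglik(Y, A, means, variances):
    """Return log p(Y | model) for an observation sequence Y of shape (T, D)."""
    T, N = Y.shape[0], len(means)
    logA = np.log(np.maximum(A, 1e-300))                 # transition log-probs
    alpha = np.full(N, -np.inf)
    alpha[0] = log_gauss(Y[0], means[0], variances[0])   # must start in state 1
    for t in range(1, T):
        alpha = np.array([np.logaddexp.reduce(alpha + logA[:, j]) +
                          log_gauss(Y[t], means[j], variances[j])
                          for j in range(N)])
    return np.logaddexp.reduce(alpha)                    # sum over final occupancies

# Toy usage: a 3-state model scoring the sequence Y = y1 ... y7
rng = np.random.default_rng(0)
Y = rng.normal(size=(7, 2))
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means, variances = np.zeros((3, 2)), np.ones((3, 2))
print(forward_loglik(Y, A, means, variances))
```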
Some Key Problems

• basic HMM assumptions are very weak
• output distributions are extremely complex, hence accurate modelling is difficult, especially of covariance
• significant speaker and environmental variability
• a large number of models is needed to cover all required contexts, so there is never enough training data
• static pronunciations are unrealistic in spontaneous speech, and language is not N-gram like

Hence:

• Explore new model structures:
  - dynamic Bayesian networks
  - support vector machines
  - distributed representations and loosely coupled models
• Explore acoustic modelling refinements, e.g.:
  - parameter-tying and interpolation
  - feature- and model-space transformations
  - normalisation techniques
  - adaptation/compensation
  - discriminative training
  - lightly supervised training
• Explore pronunciation and language modelling refinements:
  - implicit pronunciation models
  - long-span language models
  - adaptive and interpolated models
New Model Structures for Acoustic Modelling

Goal: Improved Acoustic Models/Classifiers

(Figure: Hidden Markov Model vs. Switching Linear Dynamical System)

• Applying/extending machine learning techniques to ASR
• Interesting issues due to the dynamic nature of speech
• Techniques under investigation:
  - dynamic Bayesian networks, e.g. switching linear dynamical systems
  - support vector machines using generative model kernels
  - distributed representations and loosely coupled models
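To make the “generative model kernel” idea tangible, the sketch below maps each variable-length sequence to a fixed-length Fisher-style score vector under a single background Gaussian and trains a standard SVM on the result. This is a deliberate simplification: the group's work uses kernels derived from HMMs, and the data, model and names here are stand-ins.

```python
import numpy as np
from sklearn.svm import SVC

# Toy "generative model kernel": represent a variable-length sequence by
# the gradient of its log-likelihood under a background Gaussian w.r.t.
# the mean (a Fisher-style score), giving fixed-length SVM inputs.

def score_vector(seq, mean, var):
    # d/d(mean) sum_t log N(y_t; mean, var) = sum_t (y_t - mean) / var
    return np.sum((seq - mean) / var, axis=0) / len(seq)

def fisher_features(sequences, mean, var):
    return np.stack([score_vector(s, mean, var) for s in sequences])

# Hypothetical data: variable-length sequences from two classes.
rng = np.random.default_rng(1)
seqs = [rng.normal(loc=c, size=(rng.integers(5, 20), 3))
        for c in (0.0, 1.0) for _ in range(50)]
labels = np.array([0] * 50 + [1] * 50)

mean, var = np.zeros(3), np.ones(3)          # background generative model
X = fisher_features(seqs, mean, var)
clf = SVC(kernel="linear").fit(X, labels)
print("train accuracy:", clf.score(X, labels))
```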
8. Rich Text Transcription

ASR Output:
okay carl uh do you exercise yeah actually um i belong to a gym down here gold’s gym and uh i try to exercise five days a week um and now and then i’ll i’ll get it interrupted by work or just full of crazy hours you know

Meta-Data Extraction (MDE) Markup:
Speaker1: / okay carl {F uh} do you exercise /
Speaker2: / {DM yeah actually} {F um} i belong to a gym down here / / gold’s gym / / and {F uh} i try to exercise five days a week {F um} / / and now and then [REP i’ll + i’ll] get it interrupted by work or just full of crazy hours {DM you know} /

Final Text:
Speaker1: Okay Carl do you exercise?
Speaker2: I belong to a gym down here, Gold’s Gym, and I try to exercise five days a week and now and then I’ll get it interrupted by work or just full of crazy hours.
9. Current Research in Dialog Systems

Statistical Dialogue Modelling: a statistical model of spoken dialogue.

Approach:
• Treat dialog control as a state machine (i.e. a Markov Decision Process)
• Assign a “reward” to every dialog state/move
• Use reinforcement learning to optimise the policy

Problems:
• State space is huge
• True state cannot be observed
• Training data is difficult to obtain
Speech Understanding

Pipeline: Word Recognition → Semantic Decoding → Dialog Act Detection, with word recognition driven by an acoustic model and a finite-state word graph or N-gram language model.

Key steps:
• Design a statistical model
• Estimate model parameters from data
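For concreteness, the statistical model underlying the recognition step is conventionally the noisy-channel decomposition (a standard formulation, stated here rather than taken from the slide):

$$
\hat{W} \;=\; \arg\max_{W} P(W \mid Y) \;=\; \arg\max_{W} \; p(Y \mid W)\, P(W)
$$

where $p(Y \mid W)$ is the acoustic model and $P(W)$ the language model. The later stages similarly pick the most probable concepts $\hat{C} = \arg\max_C P(C \mid \hat{W})$ and dialog acts $\hat{A} = \arg\max_A P(A \mid \hat{C})$, with parameters estimated from data.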
Parsing using the Hidden Vector State Model

• similar approach to acoustic modelling
• no interest in syntax: directly map words to concepts

Scanning the words left to right (e.g. “.... to Paris on Sunday.”), at each word the semantic decoder:
• pushes/pops the stack n positions based on the previous stack state
• selects a pre-terminal symbol based on the current stack contents (e.g. a stack holding travelreq, toplace, place)
• selects the next word based on the current pre-terminal

A toy sketch of these moves follows.
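The concept labels, pop counts and words below are hypothetical, and a trained HVS model would assign probabilities to every possible move rather than execute a fixed script; this only illustrates the stack bookkeeping.

```python
# Toy illustration of Hidden Vector State (HVS) parsing moves: the parser
# state is a stack of semantic concept labels; at each word it pops n
# labels, pushes at most one new pre-terminal, then emits the word.

def hvs_step(stack, pop_n, push, word):
    if pop_n:
        del stack[-pop_n:]               # pop n positions  (P(n | prev stack))
    if push:
        stack.append(push)               # select pre-terminal (P(c | stack))
    print(f"{word:8s} vector state = {tuple(stack)}")   # emit word (P(w | c))

stack = ["TRAVELREQ"]                    # hypothetical root concept
hvs_step(stack, 0, "TOPLACE", "to")      # "... to Paris on Sunday."
hvs_step(stack, 0, "CITY",    "Paris")
hvs_step(stack, 2, "DATE",    "on")
hvs_step(stack, 0, None,      "Sunday")
```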
Dialogue Management

• model dialogue as a Markov process
• define “goodness” in terms of rewards
• learn parameters from real dialogues
• optimise dialogues using reinforcement learning to maximise total expected rewards

Dialogue Manager Actions:
• Define rewards
• Learn the transition function
• Compute the “Q function”
• Optimise (see the sketch below)

Problems:
• training data is expensive to produce
• state space is very large for realistic dialogues
• state is not observable in practice
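As a minimal illustration of the “compute the Q function and optimise” step, here is a tabular Q-learning sketch on a hypothetical four-state dialogue MDP with a simulated user. States, actions and rewards are invented for illustration, and the toy deliberately sidesteps the partial-observability problem noted above.

```python
import numpy as np

# Tabular Q-learning on a toy dialogue MDP: "ask" advances the dialogue
# at a small per-turn cost; "confirm" in the final state succeeds.

n_states, n_actions = 4, 2            # e.g. slots filled so far; ask/confirm
Q = np.zeros((n_states, n_actions))
gamma, alpha, epsilon = 0.95, 0.1, 0.1
rng = np.random.default_rng(2)

def simulate(state, action):
    """Hypothetical user simulator returning (next_state, reward, done)."""
    if state == n_states - 1 and action == 1:
        return state, 20.0, True                     # successful completion
    next_state = min(state + (action == 0), n_states - 1)
    return next_state, -1.0, False                   # per-turn cost

for episode in range(2000):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r, done = simulate(s, a)
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])        # Q-learning update
        s = s2

print("greedy policy per state:", Q.argmax(axis=1))  # expect ask,ask,ask,confirm
```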
10. Synthesis / Voice Conversion

Training: a sinusoidal speech model extracts prosodic & spectral features from source- and target-speaker speech, and XForm training estimates source→target transforms from them.
Morphing: the learned transforms are applied to source-speaker features to synthesise target-speaker speech.

Challenges:
• maintaining high quality
• scaling gracefully with available training data
• transforming unknown speakers
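A minimal sketch of the simplest possible spectral mapping: one global linear transform from source to target features, estimated by least squares from time-aligned parallel frames. Real voice-conversion systems use richer (e.g. GMM-weighted, piecewise) transforms on sinusoidal-model parameters; the data here is synthetic and purely illustrative.

```python
import numpy as np

# Estimate y = W x + b mapping source-speaker frames to target-speaker
# frames from parallel (time-aligned) data, then apply it.

rng = np.random.default_rng(3)
D, T = 20, 500                                    # feature dim, frame count
X = rng.normal(size=(T, D))                       # source speaker frames
W_true = np.eye(D) + 0.1 * rng.normal(size=(D, D))
Y = X @ W_true.T + 0.5 + 0.05 * rng.normal(size=(T, D))   # target frames

Xa = np.hstack([X, np.ones((T, 1))])              # append bias column
Wb, *_ = np.linalg.lstsq(Xa, Y, rcond=None)       # solve for [W; b]
converted = Xa @ Wb                               # map source -> target
print("mean abs error:", np.abs(converted - Y).mean())
```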
11. HTK Rich Audio Transcription Project

Funded by the DARPA Effective, Affordable, Reusable Speech-to-text (EARS) programme. See http://mi.eng.cam.ac.uk/research/projects/EARS/

The aim of the project is to very significantly advance the state of the art while tackling the hardest speech recognition challenges, including the transcription of broadcast news and telephone conversations.

• 5-year programme (started April 2002)
• EARS-funded sites include SRI, BBN, LIMSI, UW, …
• IBM and Microsoft are also involved
Tasks and Languages

Application domains used for evaluation:
• Conversational telephone speech (CTS)
• Broadcast news transcription (BN)

Tasks include:
• Improving word error rate performance
• Developing rich text transcription techniques

CUED is involved in American English CTS, American English BN and Mandarin CTS.

• Aggressive performance targets specified
• Progress on tasks/languages assessed through annual evaluations
Research at CUED

Three main areas of research:

Core Algorithm Development: improve and develop new, generally applicable techniques for speech recognition. The main focus of work in the group.

Metadata Generation: examine the automatic generation of acoustic and linguistic metadata, including speaker identity and slash-unit labelling.

Public HTK Development: develop and enhance the core HTK software toolkit available via the HTK website.
12. Student Talks

• Hank Liao: Uncertainty Decoding for Noise Robust ASR
• Martin Layton: SVMs for Classifying Variable Length Data
• Kai Yu: Discriminative Adaptive Training
• Hui Ye: Voice Morphing
• Jason Williams: Statistical Dialogue Modelling
End
5. Current Research in ASR (cont): Rich Text Transcription (cont)

Technology required:
• Automatic Speech Recognition
  - fast and accurate
  - able to adapt to acoustic condition / speaker changes
• Diarisation
  - absence/presence of speech
  - gender / identity of speaker
  - channel characteristics / background noise
• Structural meta-data
  - extract disfluencies / “slash units”
  - allows elimination of non-relevant speech
  - facilitates addition of punctuation and capitalisation
5. Current Research in ASR (cont): Cheap System Building / Task Porting

(Figure: a Task 1 System trained on Data 1 is adapted into a Task 2 System using Data 2)

• Reduced cost for deploying ASR systems
• Task porting:
  - adapt a well-trained Task 1 system to Task 2 using limited data
  - requires schemes for adapting acoustic and language models (see the sketch below)
• Lightly supervised training:
  - reduce the cost of transcriptions, e.g. use closed captions
  - need to train on imperfect transcriptions
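One simple scheme for the “adapt with limited data” step is MAP adaptation of Gaussian means, sketched below. The prior weight tau, names and data are illustrative assumptions, not the project's actual recipe.

```python
import numpy as np

# MAP adaptation of a Gaussian mean: interpolate the well-trained Task 1
# (prior) mean with the sample mean of limited Task 2 data.

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Shrink the Task 2 sample mean toward the Task 1 prior mean."""
    n = len(frames)
    return (tau * prior_mean + frames.sum(axis=0)) / (tau + n)

rng = np.random.default_rng(4)
prior_mean = np.zeros(3)                     # from the well-trained system
task2 = rng.normal(loc=1.0, size=(25, 3))    # limited new-task data
print(map_adapt_mean(prior_mean, task2))     # lies between prior and sample mean
```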
7. The Statistical Framework

Waveforms → [Speech Understanding System] → Words/Concepts → [Dialogue Manager] → Dialogue Acts → [Speech Generation] → Waveforms