VISIONS, TECHNOLOGY, AND BUSINESS OF TALKING MACHINES

Roberto Pieraccini, CTO, Tell-Eureka Corporation
535 West 34th Street, New York, NY 10001
+1 646 792 2744
roberto@telleureka.com
http://www.telleureka.com

The vision

Recreating the Speech Chain
[Figure: the speech chain. Labels: INNER EAR, ACOUSTIC NERVE, SPEECH RECOGNITION, PHONETICS, MORPHOLOGY, LEXICON, SYNTAX, SEMANTICS, SPOKEN LANGUAGE UNDERSTANDING, DIALOG, DIALOG MANAGEMENT, SPEECH SYNTHESIS, VOCAL-TRACT ARTICULATORS]

The technology

Talking Machines: First Steps into Spoken Language Technology
• Von Kempelen (1791)
• Joseph Faber (1835)
• Homer Dudley, Bell Labs (1939)

Speech Recognition: the Early Years
• 1952 – Automatic Digit Recognition (AUDREY)
• Davis, Biddulph, Balashek (Bell Laboratories)

1960s – Speech Processing and Digital Computers
• AD/DA converters and digital computers start appearing in the labs
• James Flanagan, Bell Laboratories

The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
[Figure: one spoken utterance (a telephone number) shown at several levels: phonetic segments, words, syntactic structure (NP NP VP), and the semantic representation (user:Roberto (attribute:telephone-num value:7360474)). The labels "rules" and "errors" recur along the figure.]
Sources of difficulty annotated on the figure:
• Ellipses and anaphors
• Limited vocabulary
• Multiple interpretations
• Speaker dependency
• Word variations
• Word confusability
• Context-dependency
• Coarticulation
• Noise/reverberation
• Intra-speaker variability

1969 – Whither Speech Recognition?
J. R. Pierce, Executive Director, Bell Laboratories

[…] General purpose speech recognition seems far away. Special-purpose speech recognition is severely limited. It would seem appropriate for people to ask themselves why they are working in the field and what they can expect to accomplish.

[…] It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. That is a necessary but not sufficient condition. We are safe in asserting that speech recognition is attractive to money. The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn’t attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamour.

[…] Most recognizers behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem.” The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).

The Journal of the Acoustical Society of America, June 1969

1971-1976: The ARPA SUR project
• In spite of the anti-speech-recognition campaign headed by the Pierce Commission, ARPA launches a 5-year program on Speech Understanding Research
• REQUIREMENTS: 1000 word vocabulary, 90% understanding rate, near real time on a 100 MIPS machine
• 4 systems built by the end of the program
  • SDC (24%)
  • BBN’s HWIM (44%)
  • CMU’s Hearsay II (74%)
  • CMU’s HARPY (95% -- 80 times real time!)
• HARPY was based on an engineering approach
  • search on a network representing all the possible utterances
• Lack of a scientific evaluation approach
• Speech Understanding: too early for its time
• The project was not extended.
LESSON LEARNED:
• Hand-built knowledge does not scale up
• Need of a global “optimization” criterion
Raj Reddy -- CMU

Vintage Speech Recognition

1970s – Dynamic Time Warping: The Brute Force of the Engineering Approach
• T.K. Vyntsyuk (1969)
• H. Sakoe, S. Chiba (1970)
• Isolated Words, Connected Words, Sub-Word Units
• Speaker Dependent, Speaker Independent
[Figure: time alignment of a stored TEMPLATE (WORD 7) against an UNKNOWN WORD]
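The template matching named on this slide can be sketched as follows. This is a minimal illustration, not the formulation of the cited authors: it assumes each frame is a single number (real recognizers compare spectral feature vectors), uses an absolute-difference local distance, and allows the three basic step directions. The names `dtw_distance` and `recognize` are hypothetical.

```python
from math import inf


def dtw_distance(template, unknown):
    """Cumulative cost of the best time alignment between two sequences.

    Assumption: frames are plain numbers and the local distance is the
    absolute difference; a real system would use feature vectors.
    """
    n, m = len(template), len(unknown)
    # cost[i][j] = best cost of aligning template[:i] with unknown[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - unknown[j - 1])
            # A frame may repeat in either sequence or both may advance.
            cost[i][j] = local + min(
                cost[i - 1][j],      # unknown frame stretched
                cost[i][j - 1],      # template frame stretched
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(templates, unknown):
    """Return the word whose stored template warps onto `unknown` most cheaply."""
    return min(templates, key=lambda word: dtw_distance(templates[word], unknown))
```

For example, `recognize({"seven": [1, 2, 3], "two": [5, 5, 1]}, [1, 1, 2, 3, 3])` picks `"seven"`, because the unknown word is a time-stretched copy of that template. Every unknown word is compared against every template, which is why the slide calls the approach brute force.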

1980s -- The Statistical Approach
• Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
• Purely statistical approach pursued by Fred Jelinek and Jim Baker at IBM T.J. Watson Research
• Foundations of modern speech recognition engines: Acoustic HMMs, Word Tri-grams
• No Data Like More Data
• Whenever I fire a linguist, our system performance improves (1988)
• Some of my best friends are linguists (2004)
[Figure: three-state left-to-right HMM with states S1, S2, S3, self-loops a11, a22, a33, and forward transitions a12, a23]
[Photos: Fred Jelinek, Jim Baker]
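The three-state model in the figure can be scored against an observation sequence with the forward algorithm. The sketch below is illustrative only: it assumes discrete observation symbols, that the model starts in S1 and must end in S3, and the probability values in the usage note are made up. The function name `forward_likelihood` is hypothetical.

```python
def forward_likelihood(trans, emit, observations):
    """Probability that a left-to-right HMM generates `observations`.

    trans[i][j] is the transition probability from state i to state j
    (a11, a12, ... in the figure); emit[j][o] is the probability that
    state j emits symbol o. Assumptions: the model starts in the first
    state and must finish in the last one.
    """
    n = len(trans)
    # alpha[j] = probability of the observations so far, ending in state j
    alpha = [0.0] * n
    alpha[0] = emit[0][observations[0]]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs]
            for j in range(n)
        ]
    return alpha[-1]


# Illustrative numbers: a11 = a12 = 0.5, a22 = a23 = 0.5, a33 = 1.0,
# and each state emits exactly one of three symbols (0, 1, 2).
TRANS = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
EMIT = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
```

With these toy values, `forward_likelihood(TRANS, EMIT, [0, 1, 2])` is 0.25 (one a12 step and one a23 step), and a sequence that lingers in S1, such as `[0, 0, 1, 2]`, scores 0.125 because of the extra a11 self-loop. In a recognizer, one such model per word or sub-word unit would be scored and combined with the word tri-gram probabilities.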

1980-1990 – The statistical approach becomes ubiquitous
• Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

1980s-1990s – The Power of Evaluation
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of “bio-diversity”
[Figure: timeline, 1995 to 2005, of the SPOKEN DIALOG INDUSTRY. Labels: MIT, SRI, SPEECHWORKS, NUANCE. Layers: TECHNOLOGY VENDORS, PLATFORM INTEGRATORS, APPLICATION DEVELOPERS, TOOLS, HOSTING, STANDARDS]

The business of speech

Voice User Interface (VUI) Design: the Quantum Leap in Dialog Systems
• 1995 -- The WildFire Effect
• Change of perspective: from technology driven to user centered
  • Research: natural language, free form
  • Commercial: task completion and usability
• Persona: the personality of the application (TTS vs. recording)
• Speech recognition accuracy is important, but success is determined by the VUI.
• The importance of a repeatable, streamlined, teachable development process

The Speech Application Lifecycle
[Figure: lifecycle diagram with ten numbered steps. Stages: requirements, high level system design, VUI design, VUI development, system engineering, integration, usability, speech science, partial deployment, full deployment. Roles: Project Manager, Analyst, VUI Designer, Speech Scientist, Architect, App Developer, Engineer]

Voice User Interface Design
[Figure: call-flow chart for a funds transfer. Nodes: Enter Transfer, Get Origin Account, Get Destination Account, Get Amount, Play Wrong Amount Message, Play Confirmation, What is wrong?, Go to Main Menu. Decisions (YES/NO): "amount > origin account?", "confirmed?". Data passed between nodes: origin account, destination account, amount]
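The call flow in this figure can be sketched as a small scripted state machine. The node names come from the slide; the transitions between them are assumptions, since the arrows did not survive: a rejected amount leads back to Get Amount, a failed confirmation leads to "What is wrong?" and then to re-collecting the named item, and "amount > origin account?" is read as a balance check. The function `transfer_dialog` and its arguments are hypothetical.

```python
def transfer_dialog(answers, balances):
    """Walk the transfer call flow with scripted caller answers.

    answers: iterable of caller replies, consumed in order.
    balances: account name -> available balance (assumed meaning of the
    "amount > origin account?" decision).
    Returns the collected transfer and the list of nodes visited.
    """
    replies = iter(answers)
    visited = []

    def ask(node):
        # Visit a node that collects something from the caller.
        visited.append(node)
        return next(replies)

    origin = ask("Get Origin Account")
    destination = ask("Get Destination Account")
    amount = float(ask("Get Amount"))

    while True:
        if amount > balances[origin]:
            visited.append("Play Wrong Amount Message")
            amount = float(ask("Get Amount"))  # assumed: re-prompt
            continue
        if ask("Play Confirmation") == "yes":
            visited.append("Go to Main Menu")
            return {"origin": origin, "destination": destination,
                    "amount": amount}, visited
        # Assumed: the caller names the item to correct.
        wrong = ask("What is wrong?")
        if wrong == "origin":
            origin = ask("Get Origin Account")
        elif wrong == "destination":
            destination = ask("Get Destination Account")
        else:
            amount = float(ask("Get Amount"))
```

Calling `transfer_dialog(["checking", "savings", "500", "100", "yes"], {"checking": 200.0, "savings": 0.0})` visits Play Wrong Amount Message once, re-prompts for the amount, and ends at Go to Main Menu with an amount of 100.0.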

Speech Science: Tuning for performance
[Figure: tree of recognition outcomes]
• In Vocabulary, correctly recognized
  • Accept: correct acceptance
  • Confirm: correct confirmation
• In Vocabulary, misrecognized
  • Accept: false acceptance - in
  • Confirm: false confirmation
• In Vocabulary, falsely rejected: false rejection
• Out of Vocabulary
  • Reject (correctly): correct rejection
  • Accept (falsely): false acceptance - out

Speech Science: Tuning for performance
[Table: per-grammar tuning report by DM ACTION; column legend below]
• Utt# = number of utterances
• Sub-err% = percent of in-voc utterances wrongly recognized
• Fa-err% = percent of utterances wrongly accepted
• Fr-err% = percent of utterances wrongly rejected
• Rej% = total percent of all utterances rejected
• OOV% = percent of out-voc utterances
• Fa-oov% = percent of out-voc utterances wrongly accepted
How the report is used:
• Prioritize grammars that need improvement
• Use transcriptions to improve grammars
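The columns of the tuning report can be computed from transcribed utterances along these lines. This is a sketch under stated assumptions: the slide does not give the denominators, so Sub-err% and Fa-oov% are taken relative to the in-voc and out-voc counts respectively and the other percentages relative to all utterances; a "wrongly rejected" utterance is taken to be one that was in vocabulary and correctly recognized. The `Utterance` record and `tuning_report` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    in_vocab: bool   # transcription is covered by the grammar
    correct: bool    # recognizer output matches the transcription
    accepted: bool   # dialog manager accepted the result


def pct(part, whole):
    """Percentage, defined as 0.0 when the denominator is empty."""
    return 100.0 * part / whole if whole else 0.0


def tuning_report(utts):
    """Compute the report columns (denominators are assumptions)."""
    total = len(utts)
    in_voc = [u for u in utts if u.in_vocab]
    out_voc = [u for u in utts if not u.in_vocab]
    # An acceptance is wrong if the utterance was out of vocabulary
    # or was in vocabulary but misrecognized.
    false_accepts = [u for u in utts
                     if u.accepted and not (u.in_vocab and u.correct)]
    false_rejects = [u for u in utts
                     if not u.accepted and u.in_vocab and u.correct]
    return {
        "Utt#": total,
        "Sub-err%": pct(sum(not u.correct for u in in_voc), len(in_voc)),
        "Fa-err%": pct(len(false_accepts), total),
        "Fr-err%": pct(len(false_rejects), total),
        "Rej%": pct(sum(not u.accepted for u in utts), total),
        "OOV%": pct(len(out_voc), total),
        "Fa-oov%": pct(sum(u.accepted for u in out_voc), len(out_voc)),
    }
```

Running one such report per grammar and sorting by the error columns gives a simple way to prioritize which grammars to tune first.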

The Architectural Evolution of Spoken Dialog
[Figure: timeline with marks at 1994, 1998, 2000, 2005. Labels: Native Code, Proprietary IVR Systems, Standard Clients (VoiceXML), Standard Application servers]

The Voice Web
[Figure: architecture diagram. Components: Telephone, Telephony Platform, Voice Browser, ASR, TTS, Internet, Web Server. Standards shown: VoiceXML/SALT, CCXML, MRCP, SSML, SRGF, SCXML?, EMMA?]

The Evolution of the Interface and the Research-Industry Chasm
[Figure: timeline, 1994 to 2006, contrasting two views. Spoken dialog as an anthropomorphic system: Natural Language Research Systems à la DARPA Communicator. Spoken dialog as a tool: Small Vocabulary, Menu Based; Directed Dialog; Large Vocabulary, Dialog Modules; SLU: Statistical Language Understanding]

The evolution of the market and the industry
• $600M to $1,000M revenue
• > 8000 apps worldwide
• New evolving standards guarantee interoperability of engines and platforms.
Industry layers:
• HOSTING
• APPLICATION DEVELOPERS: PROFESSIONAL SERVICES
• TOOLS: AUTHORING, TUNING, PREPACKAGED APPLICATIONS
• PLATFORM INTEGRATORS: IVR, VoiceXML, CTI,…
• TECHNOLOGY VENDORS: SPEECH RECOGNITION, TTS

Third generation dialog systems
• 1st Generation, INFORMATIONAL (LOW complexity): package tracking, flight status
• 2nd Generation, TRANSACTIONAL (MEDIUM complexity): banking, stock trading, flight/train reservation
• 3rd Generation, PROBLEM SOLVING (HIGH complexity): customer care, technical support

2005 -- Spoken Dialog goes to Saturday Night Live
