Module u1: Speech in the Interface, 3: Speech input and output technology
Jacques Terken
SAI User-System Interaction u1, Speech in the Interface: 3. Speech input and output technology
contents
• Speech input technology
• Speech recognition
• Language understanding
• Consequences for design
• Speech output technology
• Language generation
• Speech synthesis
• Consequences for design
• Project
Components of conversational interfaces
Noise suppression → Speech recognition → Natural Language Analysis → Dialogue Manager → Language Generation → Speech Synthesis, all connected to the Application
Speech recognition
• Advances come both from progress in speech and language engineering and from increases in computer power (CPU speed)
Developments
[timeline figure: 1980, 1990, 2000]
State of the art
Why is generic speech recognition so difficult?
• Variability of the input due to many different sources
• Understanding requires vast amounts of world knowledge and common-sense reasoning for the generation and pruning of hypotheses
• Dealing with variability and with storage of/access to world knowledge exceeds the possibilities of current technology
Sources of variation
[diagram: factors affecting the speech recognition system]
• Noise: other speakers, background noise
• Distortion: echoes, reverberations, dropouts
• Channel: microphone distortion, electrical noise, directional characteristics
• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Phonetic/prosodic context
• Task/context: man-machine dialogue, dictation, free conversation, interview
No generic speech recognizer
• Idea of a generic speech recognizer has been given up (for the time being)
• automatic speech recognition possible by virtue of self-imposed limitations
• vocabulary size
• multiple vs single speaker
• real-time vs offline
• recognition vs understanding
Speech recognition systems
• Relevant dimensions
• Speaker-dependent vs speaker-independent
• Vocabulary size
• Grammar: fixed grammar vs probabilistic language model
• Trade-off between the different dimensions in terms of performance: choice of technology is determined by the application requirements
Command and control
• Examples: controlling functionality of PC or PDA; controlling consumer appliances (stereo, tv etc.)
• Individual words and multi-word expressions: “File”, “Edit”, “Save as webpage”, “Columns to the left”
• Speaker-independent: no training needed before use
• Limited vocabulary gives high recognition performance
• Fixed-format expressions (defined by grammar)
• Real-time
• User needs to know which items are in the vocabulary and what expressions can be used
• (Usually) not customizable
Information services
• Examples: train travel information, integrated trip planning
• Continuous speech
• Speaker-independent: multiple users
• Mid-size vocabulary, typically less than 5000 words
• Flexibility of input: extensive grammar that can handle expected user inputs
• Requires interpretation
• Real-time
Dictation systems
• Continuous speech
• Speaker-dependent: requires training by the user
• (Almost) unrestricted input: large vocabulary, > 200,000 words
• Probabilistic language model instead of fixed grammar
• No understanding, just recognition
• Off-line (but near-online performance possible depending on system properties)
State of the art ASR: Statistical approach
• Two phases
• Training: creating an inventory of acoustic models and computing transition probabilities
• Testing (classification): mapping input onto the inventory
Speech
• Writing vs speech
Writing: see, eat, break, lake
Speaking: {si:}, {i:t}, {brek}, {lek}
• Alphabetic languages: approximately 25 signs
• Average language: approximately 40 sounds
• Phonetic alphabet (1:1 mapping character-sound)
Speech and sounds
[figure: waveform and spectrogram of “How are you”]
Speech is made up of non-discrete events
Representation of the speech signal
• Sounds coded as successions of states (one state every 10-30 ms)
• States represented by acoustic vectors
[figure: frequency-time representations]
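The slicing of the signal into short states can be sketched as follows; this is an illustrative sketch, not the slides' own code, and the frame and hop sizes are hypothetical example values.

```python
# Cut a signal into fixed-length, overlapping analysis frames, as in the
# 10-30 ms "states" described above. Pure-Python sketch with example values.

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

# At 16 kHz, a 20 ms frame is 320 samples and a 10 ms hop is 160 samples.
signal = [0.0] * 1600                     # 100 ms of silence as dummy input
frames = frame_signal(signal, frame_len=320, hop=160)
print(len(frames))                        # 9 frames fit in 100 ms
```

In a real front end each frame would then be converted into an acoustic vector (e.g. spectral or cepstral coefficients); here the frames themselves stand in for those vectors.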
Acoustic models
• Inventory of elementary probabilistic models of basic linguistic units, e.g. phonemes
• Words stored as networks of elementary models (each state with its probability density function)
Training of acoustic models
• Compute acoustic vectors and transition probabilities from large corpora
• each state holds statistics concerning parameter values and parameter variation
• The larger the amount of training data, the better the estimates of parameter values and variation
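A minimal sketch of the "statistics per state" idea: each state stores the mean and variance of its training vectors. Real systems fit Gaussian mixtures over cepstral features; the function name and the two-dimensional toy vectors are assumptions for illustration.

```python
# Per-state parameter statistics from training vectors: each state keeps
# a per-dimension mean and variance, as the slide describes.
import statistics

def train_state(vectors):
    """Estimate per-dimension mean and (population) variance for one state."""
    dims = list(zip(*vectors))            # transpose: one tuple per dimension
    means = [statistics.mean(d) for d in dims]
    variances = [statistics.pvariance(d) for d in dims]
    return means, variances

# Three toy 2-dimensional acoustic vectors assigned to one state
means, variances = train_state([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]])
print(means)                              # [2.0, 2.0]
```

With more training vectors per state, these estimates stabilise, which is exactly the "more data, better estimates" point above.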
Language model
• Defined by grammar
• Grammar: rules for combining words into sentences (defining the admissible strings of the language)
• Basic unit of analysis is the utterance/sentence
• Sentence composed of words representing word classes, e.g. determiner: the; noun: boy; verb: eat
determiner: the; noun: boy; verb: eat
rule 1: noun_phrase → det noun
rule 2: sentence → noun_phrase verb
Admissible: the boy eats
Not admissible: *the eats, *boy eats, *eats the boy
Morphology: base forms vs derived forms
• eat: stem, 1st person singular
• stem + s: 3rd person singular
• stem + en: past participle
• stem + er: substantive (noun)
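The toy grammar above can be sketched as a tiny recognizer; the word list and the flattened rule check are illustrative, not part of the slides.

```python
# Minimal sketch of the two grammar rules above: a string is an admissible
# sentence only if it tags as determiner-noun-verb.
LEXICON = {"the": "det", "boy": "noun", "eats": "verb"}

def is_sentence(words):
    tags = [LEXICON.get(w) for w in words]
    # rule 1: noun_phrase -> det noun; rule 2: sentence -> noun_phrase verb
    return tags == ["det", "noun", "verb"]

print(is_sentence(["the", "boy", "eats"]))   # True
print(is_sentence(["boy", "eats"]))          # False, as in *"boy eats"
print(is_sentence(["eats", "the", "boy"]))   # False, as in *"eats the boy"
```

A real grammar would expand the rules recursively instead of matching one fixed tag pattern, but the admissible/inadmissible split is the same.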
Statistical language model
• Probabilities for words and transition probabilities for word sequences in a corpus:
unigram: probability of individual words
bigram: probability of a word given the preceding word
trigram: probability of a word given the two preceding words
• Training materials: language corpora (journal articles; application-specific)
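The bigram case can be sketched as counts over a toy corpus; this assumes plain maximum-likelihood estimates without smoothing, and the two example sentences are invented.

```python
# Bigram language model from a tiny corpus: P(w2 | w1) = count(w1 w2) / count(w1)
from collections import Counter

def bigram_model(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent           # sentence-start marker
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1]

p = bigram_model([["i", "want", "a", "flight"], ["i", "want", "tea"]])
print(p("i", "want"))   # 1.0: "want" always follows "i" in this corpus
print(p("want", "a"))   # 0.5: "a" follows "want" in one of two cases
```

A trigram model would condition on the two preceding words in the same way; real systems also smooth the counts so that unseen sequences do not get probability zero.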
Recognition / classification
Speech input → Acoustic analysis → acoustic vectors x1 ... xT
Global search: maximize P(x1 ... xT | w1 ... wk) · P(w1 ... wk) over word sequences w1 ... wk
• the phoneme inventory and pronunciation lexicon supply P(x1 ... xT | w1 ... wk)
• the language model supplies P(w1 ... wk)
→ Recognized word sequence
• Compute the probability of a sequence of states given the probabilities for the states, the probabilities for transitions between states, and the language model
• Gives the best path
• Usually not the best path but an n-best list for further processing
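The search described above can be sketched on a toy model; for clarity this scores every state sequence by brute force rather than using the dynamic-programming (Viterbi) recursion a real recognizer would use, and all probabilities below are invented for illustration.

```python
# Find the state sequence that maximizes the product of start, transition
# and emission probabilities for a given observation sequence.
import itertools

def best_path(obs, states, start_p, trans_p, emit_p):
    """Brute-force search over all state sequences of length len(obs)."""
    best, best_score = None, 0.0
    for path in itertools.product(states, repeat=len(obs)):
        score = start_p[path[0]] * emit_p[path[0]][obs[0]]
        for i in range(1, len(obs)):
            score *= trans_p[path[i - 1]][path[i]] * emit_p[path[i]][obs[i]]
        if score > best_score:
            best, best_score = list(path), score
    return best, best_score

states = ["s", "t"]                                   # two toy phone states
start = {"s": 0.6, "t": 0.4}
trans = {"s": {"s": 0.7, "t": 0.3}, "t": {"s": 0.4, "t": 0.6}}
emit = {"s": {"a": 0.9, "b": 0.1}, "t": {"a": 0.2, "b": 0.8}}
path, score = best_path(["a", "b"], states, start, trans, emit)
print(path)   # ['s', 't']
```

Keeping the n highest-scoring paths instead of only the maximum yields the n-best list mentioned above.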
Caveats
• Properties of acoustic models strongly determined by recording conditions: recognition performance depends on the match between recording conditions and run-time conditions
• Use of a language model induces a word bias: for words outside the vocabulary, the best-matching in-vocabulary word is selected
• Solution: use a garbage model
Advances
• Confidence measures for recognition results
• based on acoustic similarity
• or based on actual confusions for a database
• or taking into consideration the acoustic properties of the input signal
• Dynamic (state-dependent) loading of the language model
• Parallel recognizers
• e.g. In-Vehicle Information Systems (IVIS): separate recognizers for navigation system, entertainment systems, mobile phone, general purpose
• choice on the basis of confidence scores
• Further developments
• Parallel recognizer for hyper-articulate speech
State of the art performance
• 98 - 99.8 % correct for small-vocabulary speaker-independent recognition
• 92 - 98 % correct for speaker-dependent large-vocabulary recognition
• 50 - 70 % correct for speaker-independent mid-size vocabulary
Recognition of prosody
• Observable manifestations: pitch, temporal properties, silence
• Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion etc.
• Relevant to understanding/interpretation, e.g.:
Mary knows many languages you know
Mary knows many languages, you know
• Influence on the realisation of phonemes: used to be considered noise, but contains relevant information
contents
• Speech input technology
• Speech recognition
• Language understanding
• Consequences for design
• Speech output technology
• Consequences for design
• Project
Natural language processing
• Full parse or keyword spotting (concept spotting)
• Keyword spotting: <any> keyword <any>
e.g. <any> $DEPARTURE <any> $DESTINATION <any>
can handle:
Boston New York
I want to go from Boston to New York
I want a flight leaving at Boston and arriving at New York
• Semantics (mapping onto functionality) can be specified in the grammar
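The <any> keyword <any> pattern can be sketched with a regular expression; the city list stands in for the $DEPARTURE/$DESTINATION word classes, and treating the first hit as departure and the second as destination is a simplifying assumption of this sketch.

```python
# Concept spotting: ignore filler words, extract only the keywords.
import re

CITIES = ["Boston", "New York"]

def spot(utterance):
    """Return departure/destination slots if two city keywords are found."""
    pattern = "|".join(re.escape(c) for c in CITIES)
    hits = re.findall(pattern, utterance)
    if len(hits) >= 2:
        return {"departure": hits[0], "destination": hits[1]}
    return None

print(spot("I want to go from Boston to New York"))
print(spot("Boston New York"))   # same slots: filler words were never needed
```

Both example utterances from the slide map to the same slot filling, which is exactly why keyword spotting can handle such varied phrasings without a full parse.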
contents
• Speech input technology
• Speech recognition
• Language understanding
• Consequences for design
• Speech output technology
• Consequences for design
• Project
Coping with technological shortcomings of ASR
• Shortcomings
• Reliability/robustness
• Architectural complexity of an “always open” system
• Lack of transparency in case of input limitations
• Task for the design of speech interfaces: induce the user to modify their behaviour to fit the requirements (restrictions) of the technology
Solutions
• “Always open” ideal
• push-to-talk button opens a recognition window, but introduces the “spoke-too-soon” problem
• Barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
• Make training conditions (properties of the training corpus) similar to test conditions, e.g. special corpora for the car environment
• Good prompt design to give clues about the required input
contents
• Speech input technology
• Consequences for design
• Speech output technology
• Technology
• Human factors in speech understanding
• Consequences for design
• Project
Components of conversational interfaces
Speech recognition → Natural Language Analysis → Dialogue Manager → Language Generation → Speech Synthesis, all connected to the Application
demos
• http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html
• http://www.research.att.com/~ttsweb/tts/demo.php
• http://www.acapela-group.com/text-to-speech-interactive-demo.html
• http://cslu.cse.ogi.edu/tts/
• Audiovisual speech synthesis:
http://www.speech.kth.se/multimodal/
http://mambo.ucsc.edu/demos.html
• Emotional synthesis (Janet Cahn):
http://xenia.media.mit.edu/%7Ecahn/emot-speech.html
Applications
• Information access by phone
• news / weather, timetables (OVR), reverse directory, name dialling, spoken e-mail etc.
• Customer ordering by phone (call centers)
• IVR: ASR replaces tedious touch-tone actions
• Car driver information by voice
• navigation, car traffic info (RDS/TMC), Command & Control (VODIS)
Interfaces for the disabled
• MIT/DECTalk (Stephen Hawking)
In the office and at home (near future?)
• Command & Control, navigation for home entertainment
Output technology
Application (e.g. e-mail; information service) → Dialogue Manager → Language Generation → Speech Synthesis
Language generation
Eindhoven → Amsterdam CS
Departure time: 08:32 | 08:47 | 09:02 | 09:17 | 09:32
Arrival time:   09:52 | 10:10 | 10:22 | 10:40 | 10:52
Transfers:      0     | 1     | 0     | 1     | 0
If $nr_of_records > 1:
I have found $n connections:
The first connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
The second connection leaves at $time_dep from $departure and arrives at $time_arr at $destination
If the user also wants information about whether there are transfers, either other templates have to be used, or templates might be composed from template elements
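The template filling above can be sketched in a few lines; the field names mirror the $-variables on the slide, while the function name and the ordinal-word list are illustrative assumptions.

```python
# Fill the slide's connection template once per database record.
ORDINALS = ["first", "second", "third"]

def describe(connections):
    """Generate one sentence per connection, plus a count header."""
    lines = [f"I have found {len(connections)} connections:"]
    for i, c in enumerate(connections):
        lines.append(
            f"The {ORDINALS[i]} connection leaves at {c['time_dep']} from "
            f"{c['departure']} and arrives at {c['time_arr']} at {c['destination']}"
        )
    return lines

recs = [
    {"time_dep": "08:32", "departure": "Eindhoven",
     "time_arr": "09:52", "destination": "Amsterdam CS"},
    {"time_dep": "08:47", "departure": "Eindhoven",
     "time_arr": "10:10", "destination": "Amsterdam CS"},
]
for line in describe(recs):
    print(line)
```

Adding transfer information would mean either a second template variant or, as the slide suggests, assembling the sentence from smaller template elements.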
Speech output technologies
• canned (pre-recorded) speech
• suited for call centers, IVR
• fixed messages/announcements
Concatenation of pre-recorded phrases
• Suited for bank account information, database enquiry systems with structured data and the like
• Template-based, e.g.
“your account is <$account>”
“the flight from <$departure> to <$destination> leaves on <$date> at <$time> from <$gate>”
“the number of <$customer> is <$telephone_number>”
• Requirements: database of phrases to be concatenated
• Some knowledge of speech science required: words are pronounced differently depending on
• emphasis
• position in the utterance
• type of utterance
• differences concern both pitch and temporal properties (prosody)
Compare the different realisations of Amsterdam in:
• do you want to go to Amsterdam? (emphasis, question, utterance-final)
• I want to go to Amsterdam (emphasis, statement, utterance-final)
• are there two stations in Amsterdam? (no emphasis, question, utterance-final)
• there are two stations in Amsterdam (no emphasis, statement, utterance-final)
• do you want to go to Amsterdam Central Station? (no emphasis, question, utterance-medial)
• Solution:
• have the words pronounced in context to obtain different tokens
• apply clever splicing techniques for smooth concatenation
Text-to-speech conversion (TTS)
• Suited for unrestricted text input: all kinds of text
• reading e-mail, fax (in combination with optical character recognition)
• information retrieval for unstructured data (preferably in combination with automatic summarisation)
• Utterances made up by concatenation of small units plus post-processing for prosody, or by concatenation of variable units
TtS technology
• Distinction between linguistic pre-processing and synthesis
• Linguistic pre-processing
• Grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
• Prosodic structure (emphasis, boundaries including pauses)
TtS: Linguistic pre-processing: grapheme-phoneme conversion
• To determine how a word is pronounced:
• consult a lexicon, containing
• a phoneme transcription
• syllable boundaries
• word accent(s)
• and/or develop pronunciation rules
• Output:
Enschede → 'En-sx@-de
Kerkrade → 'kErk-ra-d@
's-Hertogenbosch → sEr-to-x@n-'bOs
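The lexicon-plus-rules strategy can be sketched as a dictionary lookup with a fallback; the transcriptions are the SAMPA-style forms from the slide, and the letter-by-letter fallback is only a placeholder for real letter-to-sound rules.

```python
# Grapheme-phoneme conversion: consult a lexicon first, fall back to
# (here: trivial) pronunciation rules for out-of-vocabulary words.
LEXICON = {
    "Enschede": "'En-sx@-de",
    "Kerkrade": "'kErk-ra-d@",
    "'s-Hertogenbosch": "sEr-to-x@n-'bOs",
}

def transcribe(word):
    if word in LEXICON:
        return LEXICON[word]            # accurate, hand-checked transcription
    # out-of-vocabulary: placeholder letter-to-sound rule (one "sound" per letter)
    return "-".join(word.lower())

print(transcribe("Enschede"))           # 'En-sx@-de
print(transcribe("Breda"))              # b-r-e-d-a  (rule fallback)
```

The split mirrors the trade-off discussed next: lexicon entries are accurate but never complete, so the rule component catches whatever the lexicon misses.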
Pros and cons of a lexicon
• phoneme transcriptions are accurate
• (high) risk of out-of-vocabulary words, because the lexicon:
• often contains only stems, no inflections, nor compounds
• is never up to date / complete
• but usually the application includes a user lexicon