Speech Output

Speech Output • Reading: Reiter and Dale, chap 7

Note: Simplenlg and Protege • Simplenlg Lexicaliser creates an SPhraseSpec from a Protégé instance • Based on template mapping rules encoded in Protégé

Example • SPIKE: • Subject = “there” • Verb = “is” • Complement = “a spike” • Modifier = [“in [channel]”, “to [peak_value]”] • Channel, peak_value are features of spikes • Results in texts such as • There is a spike in HR to 160
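The SPIKE template above can be sketched as simple slot filling in plain Java. This is only an illustration: the class and method names (SpikeTemplate, realise) are hypothetical, and the sketch does not use the real SimpleNLG or Protégé APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the SPIKE template; not the SimpleNLG API.
final class SpikeTemplate {
    // Fixed parts of the template, as listed on the slide.
    static final String SUBJECT = "there";
    static final String VERB = "is";
    static final String COMPLEMENT = "a spike";
    static final String[] MODIFIERS = {"in [channel]", "to [peak_value]"};

    // Fill the bracketed slots from the instance's features and join the parts.
    static String realise(Map<String, String> features) {
        StringBuilder sb = new StringBuilder();
        sb.append(SUBJECT).append(' ').append(VERB).append(' ').append(COMPLEMENT);
        for (String modifier : MODIFIERS) {
            String filled = modifier;
            for (Map.Entry<String, String> f : features.entrySet()) {
                filled = filled.replace("[" + f.getKey() + "]", f.getValue());
            }
            sb.append(' ').append(filled);
        }
        // Capitalise the first letter of the sentence.
        return Character.toUpperCase(sb.charAt(0)) + sb.substring(1);
    }

    public static void main(String[] args) {
        Map<String, String> spike = new LinkedHashMap<>();
        spike.put("channel", "HR");
        spike.put("peak_value", "160");
        System.out.println(realise(spike)); // There is a spike in HR to 160
    }
}
```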

Usage • Document Planner decides which instances to include in the text • Lexicaliser produces initial SPhraseSpec from these • Microplanner modifies SPhraseSpec • Add extra modifiers if necessary • Eg, “at 10.40” (if diff from last time mentioned) • Aggregation • Syntactic choice (passive, tense) • Referring exp (HR, Heart Rate) • Realiser produces text

Simplenlg and Protege • Complex, very much under development • Happy to discuss more with interested students • Prof Mellish is very interested in NLG and Semantic Web

Different Modalities • Many ways to communicate data • Visualisation • Written text • Spoken text (speech) • Combinations of above

Speech output • Computers can talk as well as write • Prerecorded files (eg, WAV) • Text-to-speech (TTS) • Speaks arbitrary texts • Example app: spoken weather forecasts • Output of our weather-forecast generator spoken for premium-rate telephone weather information services

Simple approach • Problem: speak aloud a written text • Simple approach • Record people speaking words • Given a text, combine recordings for all the words in the text • Telephone directory enquiries
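The simple approach can be sketched as a lookup from words to prerecorded clips. The class name, the clip file names and the idea of returning a playlist are all assumptions made for illustration; actual audio playback is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: map each word of a text to a prerecorded clip.
final class WordConcatenation {
    // Returns the clip file names to play, in order, for the given text.
    // Throws if a word has no recording.
    static List<String> clipsFor(String text, Map<String, String> recordings) {
        List<String> clips = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // Normalise: lower-case and drop punctuation.
            String word = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
            if (word.isEmpty()) {
                continue;
            }
            String clip = recordings.get(word);
            if (clip == null) {
                throw new IllegalArgumentException("No recording for: " + word);
            }
            clips.add(clip);
        }
        return clips;
    }

    public static void main(String[] args) {
        // Made-up file names standing in for recorded words.
        Map<String, String> recordings = Map.of(
                "the", "the.wav", "number", "number.wav", "is", "is.wav");
        System.out.println(clipsFor("The number is", recordings));
        // [the.wav, number.wav, is.wav]
    }
}
```

The sketch plays every clip exactly as recorded, so it has no control over intonation, and it fails outright on any word without a recording.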

Problems • Intonation/prosody • Difficult to understand monotone intonation • Cannot determine which word is meant • He lives on Don St. • St. Louis is a great city. • Conventions • £20 is twenty pounds, not pound twenty • New words (names, technical terms)

Problems • Pronouncing symbols • £ is pound or pounds ?? • I have £1 vs I have £5 vs I ate a £5 lunch • Pronouncing numbers • Individual digits or as a whole • 01224 273443 vs 1,224,273,443 people
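A minimal sketch of the two decisions above: whether to read a number digit by digit, and whether £ is spoken as "pound" or "pounds". The class and method names are hypothetical, and the caller is assumed to already know the context (phone number, or amount used before a noun), which is the hard part in practice.

```java
// Hypothetical sketch of expanding numbers and the pound sign into words.
final class SymbolAndNumberReader {
    private static final String[] DIGITS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    // Read a number digit by digit, as for a phone number.
    // Anything that is not an ASCII digit (eg, a space) is skipped.
    static String asDigits(String number) {
        StringBuilder sb = new StringBuilder();
        for (char c : number.toCharArray()) {
            if (c < '0' || c > '9') {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(DIGITS[c - '0']);
        }
        return sb.toString();
    }

    // Speak a pound amount from 0 to 9 with the number first ("five pounds").
    // beforeNoun is true for uses like "a £5 lunch", where the unit stays singular.
    static String pounds(int amount, boolean beforeNoun) {
        if (amount < 0 || amount > 9) {
            throw new IllegalArgumentException("This sketch handles 0 to 9 only");
        }
        String unit = (amount == 1 || beforeNoun) ? "pound" : "pounds";
        return DIGITS[amount] + " " + unit;
    }

    public static void main(String[] args) {
        System.out.println(asDigits("01224 273443"));
        System.out.println(pounds(1, false)); // one pound
        System.out.println(pounds(5, false)); // five pounds
        System.out.println(pounds(5, true));  // five pound
    }
}
```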

Lexical Disambiguation • Which word is meant • a cat has nine lives (noun) • She lives here (verb) • I have a bow and arrow • I will not bow to her
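As a toy illustration of lexical disambiguation, the sketch below guesses whether a word is a noun or a verb from the word just before it. The cue list and the class name are assumptions chosen to cover the four examples above; real systems use parsing or statistical taggers, as the next slide says.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical toy heuristic: a determiner or number before the word suggests a noun.
final class HomographHeuristic {
    private static final Set<String> NOUN_CUES =
            Set.of("a", "an", "the", "my", "his", "two", "three", "nine");

    // Returns "noun" or "verb" for the first occurrence of target (lower-case)
    // in the sentence. Punctuation is not handled in this sketch.
    static String wordClass(String sentence, String target) {
        String[] words = sentence.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(target)) {
                boolean nounCue = i > 0 && NOUN_CUES.contains(words[i - 1]);
                return nounCue ? "noun" : "verb";
            }
        }
        throw new IllegalArgumentException("Target not in sentence: " + target);
    }

    public static void main(String[] args) {
        System.out.println(wordClass("a cat has nine lives", "lives")); // noun
        System.out.println(wordClass("She lives here", "lives"));       // verb
        System.out.println(wordClass("I have a bow and arrow", "bow")); // noun
        System.out.println(wordClass("I will not bow to her", "bow"));  // verb
    }
}
```

Once the word class is known, the synthesiser can pick the matching pronunciation from its lexicon.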

Sophisticated text-to-speech • Determine grammatical structure • parsing • statistical techniques • Use this to determine • How to pronounce symbols, numbers • Lexical disambiguation • Rhetoric structure (for intonation)

Example: AT&T Natural Voices • One of several commercial TTS systems • Nice demo at • http://www.research.att.com/~ttsweb/tts/demo.php

Prosodic Structure • Pitch change shows sentence type [?, !, .] • Hello. • Hello! • Hello? • Stress reflects importance, new information • *Mary gave John a book • Mary *gave John a book • etc

Pronunciation of new words • Eg, “Inverurie” • Rule-based • Use rules describing how phonemes are said in different contexts • Maybe models of human vocal cords, mouth • Concatenative • library of acoustic units, human-spoken • merged together for new words • Problems with both approaches

Markups • Speech markups (low-level) • pause • speed • volume • pitch • type (money, phone number) • Competing standards: • SAPI (Microsoft) • SSML (W3C)

Example • I want to go <break/> <prosody volume="loud"> home </prosody>.
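A Java program would normally build such markup as a string before handing it to a synthesiser. The sketch below wraps the example in an SSML speak element; the helper names are hypothetical, the language tag en-GB is an assumption, and how the string is passed to a particular TTS engine is not shown.

```java
// Hypothetical helpers for building a small SSML document as a string.
final class SsmlExample {
    // Escape characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap text in a prosody element that asks for loud volume.
    static String loud(String text) {
        return "<prosody volume=\"loud\">" + escape(text) + "</prosody>";
    }

    // Wrap a body in the SSML root element.
    static String speak(String body) {
        return "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\""
                + " xml:lang=\"en-GB\">" + body + "</speak>";
    }

    public static void main(String[] args) {
        String ssml = speak(escape("I want to go") + " <break/> " + loud("home") + ".");
        System.out.println(ssml);
    }
}
```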

Speech Markups • Higher level markups • emphasis, deemphasis • character (eg, whisper) ?? • emotion ??? • Voice (accent, gender, age, …) ??

When is speech useful? • Ideas from class?

When (not) useful • Useful • Get attention (eg, urgent warning) • No screen or hands busy (eg, diver in water) • For visually impaired users • Not useful • Distracting (“you have spam”) • Long messages (text can be reread!) • Noisy environments • Deaf users

Systems • FreeTTS – free Java-based text-to-speech • Low voice quality, limited func, easy to use • Microsoft – Speech SDK • Higher quality, more func than FreeTTS • Tied to Windows, stresses VB, .net, etc • Commercial – highest quality • Natural Voices, RealSpeak, … • rVoice (Scottish software, mostly defunct)

Digression: rVoice • From Rhetorical Systems • Edinburgh Uni spinout • From Festival, also source of FreeTTS (practical) • High-profile “success story” of high-tech Scotland • rVoice • Very high quality voices (best in world?) • Could imitate a real person

Digression: rVoice • Not very successful as a business • Too expensive? • Some users (eg, blind people) wanted cheap soln • When high-quality voices needed (weather info), cheaper to hire people to speak messages • Recently bought by a competitor • Essentially being closed down, customers encouraged to move to competitor's product • Sad…

Speech output from Java • Set up system • Set up a voice • Call “speak” method • (some systems) wait until speech finished • Speech takes time, system can do something else while speech is happening
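The last two points can be sketched with standard Java concurrency: start the speech on another thread, carry on with other work, then wait for it to finish. The Speaker interface is a hypothetical stand-in for whatever blocking speak call a TTS library provides; the fake speaker in main only sleeps and prints.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of speaking in the background while the program does other work.
final class BackgroundSpeech {
    // Hypothetical stand-in for a TTS engine whose speak call blocks until done.
    interface Speaker {
        void speak(String text);
    }

    // Start speaking on another thread; the future completes when speech ends.
    static CompletableFuture<Void> speakAsync(Speaker speaker, String text) {
        return CompletableFuture.runAsync(() -> speaker.speak(text));
    }

    public static void main(String[] args) {
        // Fake speaker: pretends that speech takes 200 ms.
        Speaker fake = text -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("(spoken) " + text);
        };
        CompletableFuture<Void> done = speakAsync(fake, "Mary had a little lamb.");
        System.out.println("Doing other work while speech is in progress");
        done.join(); // wait until speech has finished
    }
}
```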

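FreeTTS example
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

VoiceManager voiceManager = VoiceManager.getInstance();
Voice helloVoice = voiceManager.getVoice("kevin16");
helloVoice.allocate();   // load the voice's resources
helloVoice.speak("Mary had a little lamb.");
helloVoice.deallocate(); // release the resources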

Advanced topic: concept-to-text • Currently NLG systems produce text, which is fed into speech synthesiser • But speech quality should improve if the NLG system gave more information • Syntactic structure (for pauses) • Desired meaning of word (for pronunciation) • Importance (for emphasis) • How to integrate NLG and speech?
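One possible way to pass importance from the NLG system to the synthesiser is to emit speech markup instead of plain text. The sketch below is an assumption about how that could look, not a description of an existing system: the Word type is hypothetical, the words are assumed to contain no markup characters, and only an SSML fragment (without the enclosing speak element) is produced.

```java
import java.util.List;

// Hypothetical sketch: turn words with importance flags into an SSML fragment.
final class ConceptToSpeech {
    // A word plus the importance information an NLG system would already have.
    static final class Word {
        final String text;
        final boolean important;

        Word(String text, boolean important) {
            this.text = text;
            this.important = important;
        }
    }

    // Wrap important words in an emphasis element; leave the rest unchanged.
    static String toSsmlFragment(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            Word w = words.get(i);
            sb.append(w.important ? "<emphasis>" + w.text + "</emphasis>" : w.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Word> sentence = List.of(
                new Word("Mary", true), new Word("gave", false),
                new Word("John", false), new Word("a", false), new Word("book", false));
        System.out.println(toSsmlFragment(sentence));
        // <emphasis>Mary</emphasis> gave John a book
    }
}
```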

Speech Input • Talk to the computer instead of type • Commands (select from limited list) • Like cinema information line • Eg say name of movie you want to watch • Dictation • Dictate arbitrary texts • In recent versions of Office • Many errors

Speech dialogue • Dialogue with the computer, just like in science fiction movies • C: your first ascent was dangerous • H: why? • C: because you came up too quickly • H: what should I have done? • C: you should have taken 5 minutes to come up instead of 3 minutes

Speech dialogue • Key problems are • (a) dealing with speech input errors • Need to unobtrusively check that input was understood correctly • (b) dealing with strange things users say • Speech allows them to say anything, and they do! • (c) interpreting ambiguous data • Does “Aberdeen” mean “Aberdeen, UK”, “Aberdeen, Maryland”, etc?
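Problems (a) and (c) are often handled together by implicit confirmation: the system resolves the ambiguous word to its best guess and repeats that guess inside its next question, so the user can object if it is wrong. A minimal sketch, assuming a hypothetical gazetteer that lists candidate places with the most likely first:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of implicit confirmation for an ambiguous place name.
final class PlaceConfirmation {
    // Made-up gazetteer: spoken name -> candidate places, most likely first.
    static final Map<String, List<String>> GAZETTEER = Map.of(
            "Aberdeen", List.of("Aberdeen, Scotland", "Aberdeen, Maryland"),
            "London", List.of("London, UK"));

    // Pick the most likely candidate; unknown names are returned unchanged.
    static String resolve(String heard) {
        List<String> candidates = GAZETTEER.get(heard);
        return (candidates == null || candidates.isEmpty()) ? heard : candidates.get(0);
    }

    // Echo the resolved place in the next question instead of asking
    // a separate "did you mean ...?" question.
    static String nextQuestion(String heardOrigin, String date) {
        return "What time on " + date + ", do you wish to depart from "
                + resolve(heardOrigin) + "?";
    }

    public static void main(String[] args) {
        System.out.println(nextQuestion("Aberdeen", "Thursday, 16 March"));
    }
}
```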

Example • User: Hello, I want to fly to London next Thursday • System: What airport will you be flying from when you go to London, UK? • User: Aberdeen • System: What time on Thursday, 16 March, do you wish to depart from Aberdeen, Scotland? • User: mid-morning • System: BA 1305 leaves Aberdeen at 940 and arrives into London Heathrow at 1115. Should I book one seat for you on Thursday, 16 March?

Conclusion • Texts can be spoken instead of (or as well as) written • Harder than it seems, but technology exists and is getting better • Useful in some situations • In longer term, speech input and dialogue
