Speech Output Reading: Reiter and Dale, chap 7
Note: Simplenlg and Protege • Simplenlg Lexicaliser creates an SPhraseSpec from a Protégé instance • Based on template mapping rules encoded in Protégé
Example • SPIKE: • Subject = “there” • Verb = “is” • Complement = “a spike” • Modifier = [“in [channel]”, “to [peak_value]”] • Channel, peak_value are features of spikes • Results in texts such as • There is a spike in HR to 160
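The SPIKE rule above can be illustrated with a minimal sketch in plain Java. It does not use the simplenlg or Protégé APIs: the class `SpikeTemplate` and its `render` method are hypothetical stand-ins that simply fill the bracketed slots from an instance's features and join the parts in order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpikeTemplate {
    // Hypothetical stand-in for a template mapping rule; NOT the simplenlg API.
    // Each "[feature]" placeholder in a modifier is replaced by the feature's value.
    static String render(String subject, String verb, String complement,
                         String[] modifiers, Map<String, String> features) {
        StringBuilder sb = new StringBuilder();
        sb.append(subject).append(' ').append(verb).append(' ').append(complement);
        for (String modifier : modifiers) {
            String filled = modifier;
            for (Map.Entry<String, String> feature : features.entrySet()) {
                filled = filled.replace("[" + feature.getKey() + "]", feature.getValue());
            }
            sb.append(' ').append(filled);
        }
        // Capitalise the first letter of the sentence.
        return Character.toUpperCase(sb.charAt(0)) + sb.substring(1);
    }

    public static void main(String[] args) {
        Map<String, String> spike = new LinkedHashMap<>();
        spike.put("channel", "HR");
        spike.put("peak_value", "160");
        // Prints: There is a spike in HR to 160
        System.out.println(render("there", "is", "a spike",
                new String[] {"in [channel]", "to [peak_value]"}, spike));
    }
}
```

A real lexicaliser would build a phrase specification rather than a string, so that the microplanner can still change tense, voice or modifiers before realisation.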
Usage • Document Planner decides which instances to include in the text • Lexicaliser produces initial SPhraseSpec from these • Microplanner modifies SPhraseSpec • Add extra modifiers if necessary • Eg, “at 10.40” (if different from the last time mentioned) • Aggregation • Syntactic choice (passive, tense) • Referring expressions (HR, Heart Rate) • Realiser produces text
Simplenlg and Protege • Complex, very much under development • Happy to discuss more with interested students • Prof Mellish is very interested in NLG and Semantic Web
Different Modalities • Many ways to communicate data • Visualisation • Written text • Spoken text (speech) • Combinations of above
Speech output • Computers can talk as well as write • Prerecorded files (eg, WAV) • Text-to-speech (TTS) • Speaks arbitrary texts • Example app: spoken weather forecasts • Output of our weather-forecast generator spoken for premium-rate telephone weather information services
Simple approach • Problem: speak aloud a written text • Simple approach • Record people speaking words • Given a text, combine recordings for all the words in the text • Telephone directory enquiries
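The word-by-word approach can be sketched as a lookup from words to prerecorded files. This is only an illustration: the class `WordConcatenator`, the `plan` method and the file names are assumptions, and actual audio playback is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class WordConcatenator {
    // Returns the recordings to play, in order, for the given text.
    // The recordings map (word -> file name) is a hypothetical word library.
    static List<String> plan(String text, Map<String, String> recordings) {
        List<String> files = new ArrayList<>();
        // Lower-case the text and split on anything that is not a letter, digit or apostrophe.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (token.isEmpty()) {
                continue; // produced when the text starts with punctuation
            }
            String file = recordings.get(token);
            if (file == null) {
                // New words (names, technical terms) have no recording.
                throw new IllegalArgumentException("No recording for word: " + token);
            }
            files.add(file);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, String> recordings = Map.of(
                "mary", "mary.wav", "had", "had.wav", "a", "a.wav",
                "little", "little.wav", "lamb", "lamb.wav");
        System.out.println(plan("Mary had a little lamb.", recordings));
    }
}
```

The exception for a missing word shows why this approach only suits closed vocabularies, and playing the files back to back gives the monotone delivery discussed next.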
Problems • Intonation/prosody • Difficult to understand monotone intonation • Cannot determine which word is meant • He lives on Don St. • St. Louis is a great city. • Conventions • £20 is twenty pounds, not pound twenty • New words (names, technical terms)
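As a rough illustration of the “St.” problem, the sketch below expands the abbreviation using a single crude heuristic: “St.” followed by a capitalised word is read as “Saint”, otherwise as “Street”. The class and method names are hypothetical, and the rule is deliberately naive; it fails, for example, when “St.” ends one sentence and the next sentence starts straight away.

```java
public class AbbreviationExpander {
    // Expands "St." to "Saint" or "Street" using the following token only.
    static String expandSt(String sentence) {
        String[] tokens = sentence.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals("St.")) {
                continue;
            }
            boolean nextIsCapitalised = i + 1 < tokens.length
                    && !tokens[i + 1].isEmpty()
                    && Character.isUpperCase(tokens[i + 1].charAt(0));
            if (nextIsCapitalised) {
                tokens[i] = "Saint";
            } else if (i == tokens.length - 1) {
                tokens[i] = "Street."; // keep the sentence-final full stop
            } else {
                tokens[i] = "Street";
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(expandSt("He lives on Don St."));
        System.out.println(expandSt("St. Louis is a great city."));
    }
}
```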
Problems • Pronouncing symbols • £ is pound or pounds ?? • I have £1 vs I have £5 vs I ate a £5 lunch • Pronouncing numbers • Individual digits or as a whole • 01224 273443 vs 1,224,273,443 people
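The symbol and number cases above can be sketched as two small normalisation helpers. The names `SymbolReader`, `readDigits` and `readPounds` are assumptions made for illustration, and the caller is assumed to already know whether a number is a phone number or whether the amount modifies a noun, which is the hard part in practice.

```java
public class SymbolReader {
    private static final String[] WORDS = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
    };

    // Phone numbers are read digit by digit; spaces and other characters are skipped.
    static String readDigits(String number) {
        StringBuilder sb = new StringBuilder();
        for (char c : number.toCharArray()) {
            if (c >= '0' && c <= '9') {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(WORDS[c - '0']);
            }
        }
        return sb.toString();
    }

    // Reads a pound amount from 0 to 9. "attributive" means the amount modifies
    // a noun, as in "a £5 lunch", where the unit stays singular.
    static String readPounds(int amount, boolean attributive) {
        if (amount < 0 || amount > 9) {
            throw new IllegalArgumentException("This sketch only handles amounts from 0 to 9");
        }
        String unit = (amount == 1 || attributive) ? "pound" : "pounds";
        return WORDS[amount] + " " + unit;
    }

    public static void main(String[] args) {
        System.out.println(readDigits("01224 273443"));
        System.out.println("I have " + readPounds(1, false));
        System.out.println("I have " + readPounds(5, false));
        System.out.println("I ate a " + readPounds(5, true) + " lunch");
    }
}
```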
Lexical Disambiguation • Which word is meant • a cat has nine lives (noun) • She lives here (verb) • I have a bow and arrow • I will not bow to her
Sophisticated text-to-speech • Determine grammatical structure • parsing • statistical techniques • Use this to determine • How to pronounce symbols, numbers • Lexical disambiguation • Rhetorical structure (for intonation)
Example: AT&T Natural Voices • One of several commercial TTS systems • Nice demo at • http://www.research.att.com/~ttsweb/tts/demo.php
Prosodic Structure • Pitch change shows sentence type [?, ! ,.] • Hello. • Hello! • Hello? • Stress reflects importance, new information • *Mary gave John a book • Mary *gave John a book • etc
Pronunciation of new words • Eg, “Inverurie” • Rule-based • Use rules describing how phonemes are said in different contexts • Maybe models of human vocal cords, mouth • Concatenative • library of acoustic units, human-spoken • merged together for new words • Problems with both approaches
Markups • Speech markups (low-level) • pause • speed • volume • pitch • type (money, phone number) • Competing standards: • SAPI (Microsoft) • SSML (W3C)
Example I want to go <break/> <prosody volume="loud"> home </prosody>.
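Markup like this is usually generated by the program rather than typed by hand. The sketch below builds the same SSML fragment from plain strings; the class `SsmlExample` and its methods are hypothetical helpers, and only the `speak`, `break` and `prosody` elements come from SSML itself. A complete SSML document also needs version and namespace attributes on the root `speak` element, which are omitted here for brevity.

```java
public class SsmlExample {
    // Escapes characters that would otherwise be read as markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps text in a prosody element that asks for a loud volume.
    static String loud(String text) {
        return "<prosody volume=\"loud\">" + escape(text) + "</prosody>";
    }

    // Builds: <speak>BEFORE <break/> <prosody volume="loud">LOUD</prosody>.</speak>
    static String sentenceWithPause(String before, String loudPart) {
        return "<speak>" + escape(before) + " <break/> " + loud(loudPart) + ".</speak>";
    }

    public static void main(String[] args) {
        System.out.println(sentenceWithPause("I want to go", "home"));
    }
}
```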
Speech Markups • Higher level markups • emphasis, deemphasis • character (eg, whisper) ?? • emotion ??? • Voice (accent, gender, age, …) ??
When is speech useful? • Ideas from class?
When (not) useful • Useful • Get attention (eg, urgent warning) • No screen or hands busy (eg, diver in water) • For visually impaired users • Not useful • Distracting (“you have spam”) • Long messages (text can be reread!) • Noisy environments • Deaf users
Systems • FreeTTS – free Java-based text-to-speech • Low voice quality, limited functionality, easy to use • Microsoft – Speech SDK • Higher quality, more functionality than FreeTTS • Tied to Windows, stresses VB, .NET, etc • Commercial – highest quality • Natural Voices, RealSpeak, … • rVoice (Scottish software, mostly defunct)
Digression: rVoice • From Rhetorical Systems • Edinburgh Uni spinout • From Festival, also source of FreeTTS (practical) • High-profile “success story” of high-tech Scotland • rVoice • Very high quality voices (best in world?) • Could imitate a real person
Digression: rVoice • Not very successful as a business • Too expensive? • Some users (eg, blind people) wanted a cheap solution • When high-quality voices needed (weather info), cheaper to hire people to speak messages • Recently bought by a competitor • Essentially being closed down, customers encouraged to move to competitor's product • Sad…
Speech output from Java • Set up system • Set up a voice • Call “speak” method • (some systems) wait until speech finished • Speech takes time, system can do something else while speech is happening
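FreeTTS example
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;
// Look up the voice by name, load it, speak, then release its resources
VoiceManager voiceManager = VoiceManager.getInstance();
Voice helloVoice = voiceManager.getVoice("kevin16");
helloVoice.allocate();
helloVoice.speak("Mary had a little lamb.");
helloVoice.deallocate();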
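Because speaking takes time, a blocking “speak” call can be moved to a background thread so the system can keep working. The sketch below uses only the standard `java.util.concurrent` classes; the `Speaker` interface is a hypothetical wrapper around whatever TTS engine is in use (for FreeTTS it could simply forward to the voice's speak method), and the fake speaker in `main` just prints the text.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundSpeech {
    // Hypothetical wrapper around a TTS engine; speak() is assumed to block
    // until the audio has finished playing.
    interface Speaker {
        void speak(String text);
    }

    // Starts speaking on the executor's thread and returns at once.
    // The returned Future completes when the speech has finished.
    static Future<?> speakAsync(ExecutorService executor, Speaker speaker, String text) {
        return executor.submit(() -> speaker.speak(text));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Speaker fakeSpeaker = text -> System.out.println("Speaking: " + text);

        Future<?> finished = speakAsync(executor, fakeSpeaker, "Mary had a little lamb.");
        System.out.println("Doing other work while the speech plays");

        finished.get();      // wait until the speech has finished
        executor.shutdown(); // release the background thread
    }
}
```

A single-thread executor also queues utterances, so two messages never overlap.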
Advanced topic: concept-to-text • Currently NLG systems produce text, which is fed into a speech synthesiser • But speech quality should improve if the NLG system gave more information • Syntactic structure (for pauses) • Desired meaning of word (for pronunciation) • Importance (for emphasis) • How to integrate NLG and speech?
Speech Input • Talk to the computer instead of type • Commands (select from limited list) • Like cinema information line • Eg say name of movie you want to watch • Dictation • Dictate arbitrary texts • In recent versions of Office • Many errors
Speech dialogue • Dialogue with the computer, just like in science fiction movies • C: your first ascent was dangerous • H: why? • C: because you came up too quickly • H: what should I have done? • C: you should have taken 5 minutes to come up instead of 3 minutes
Speech dialogue • Key problems are • (a) dealing with speech input errors • Need to unobtrusively check that the input was understood correctly • (b) dealing with strange things users say • Speech allows them to say anything, and they do! • (c) interpreting ambiguous data • Does “Aberdeen” mean “Aberdeen, UK”, “Aberdeen, Maryland”, etc
Example
User: Hello, I want to fly to London next Thursday
System: What airport will you be flying from when you go to London, UK?
User: Aberdeen
System: What time on Thursday, 16 March, do you wish to depart from Aberdeen, Scotland?
User: mid-morning
System: BA 1305 leaves Aberdeen at 940 and arrives into London Heathrow at 1115. Should I book one seat for you on Thursday, 16 March?
Conclusion • Texts can be spoken instead of (or as well as) written • Harder than it seems, but technology exists and is getting better • Useful in some situations • In longer term, speech input and dialogue