
Spoken Language Processing: Summing Up

Spoken Language Processing: Summing Up. Julia Hirschberg, CS 4706. What We’ve Studied: Speech phenomena. What can people convey by varying the way they say something? How do we identify this kind of variation? What tools do we have for analysis? Speech generation (TTS)


Presentation Transcript


  1. Spoken Language Processing: Summing Up • Julia Hirschberg • CS 4706

  2. What We’ve Studied • Speech phenomena • What can people convey by varying the way they say something? • How do we identify this kind of variation? • What tools do we have for analysis? • Speech generation (TTS) • Speech recognition (ASR) and understanding (ASRU) • Applications for speech technologies

  3. What phenomena vary in speech? • Intonational contours (ToBI) • Phrasing: scope • Accent: focus, given/new • Overall contour: speech acts • Pitch range, timing • Topic structure • Voice quality, intensity, … • Emotion • Deception? • Charisma?

  4. Analyzing Speech: At the Acoustic Level • How do we capture speech data for analysis? • Digitizing: sampling, quantization, filtering • How can we distinguish one speech sound from another? • Periodic vs. aperiodic waveforms • Characterizing periodic waveforms: cycle, period, phase • Displaying and analyzing spectra, pitch tracks • Comparing intensity (dB) • Tools to do all this and more: Praat
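
The digitizing and intensity bullets on this slide can be illustrated with a minimal Python sketch. It is not taken from the course materials: the function names, the 100 Hz tone, the 8000 Hz sampling rate, and the 16-bit depth are all illustrative assumptions, and it uses only the standard library rather than Praat.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s, amplitude=1.0):
    """Sampling: evaluate a periodic (sine) waveform at evenly spaced times."""
    n = int(round(sample_rate_hz * duration_s))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n)]

def quantize(samples, bits):
    """Quantization: map samples in [-1, 1] to signed integers of `bits` bits."""
    max_level = 2 ** (bits - 1) - 1
    return [int(round(max(-1.0, min(1.0, s)) * max_level)) for s in samples]

def period_in_samples(freq_hz, sample_rate_hz):
    """Length of one cycle of the waveform, measured in samples."""
    return sample_rate_hz / freq_hz

def rms(samples):
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_difference_db(samples_a, samples_b):
    """Intensity difference in dB, computed as 20 * log10 of the RMS ratio."""
    return 20 * math.log10(rms(samples_a) / rms(samples_b))

# Hypothetical signals: the same 100 Hz tone at full and at half amplitude.
loud = sample_sine(100, 8000, 0.1)
quiet = sample_sine(100, 8000, 0.1, amplitude=0.5)
pcm = quantize(loud, 16)
```

Halving the amplitude lowers the level by 20 * log10(2), roughly 6 dB, which is why intensity is compared as a ratio on a logarithmic scale rather than as a raw difference.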

  5. Analyzing Speech: At the Phonetic Level • Can we distinguish different languages in terms of their phoneme sets? Are there universal constraints on possible speech sounds? • Articulatory constraints • How do we characterize the sounds of a given language? • Acoustic differences associated with place and manner of articulation distinguish consonants • Vowels differ in their formant frequencies • Do we use such information in speech technologies?

  6. Articulators in action (Sample from the Queen’s University / ATR Labs X-ray Film Database) “Why did Ken set the soggy net on top of his deck?”

  7. Articulatory parameters for English consonants (in ARPAbet) [table not preserved in transcript; surviving headers: manner of articulation, voicing]

  8. American English vowel space [figure: ARPAbet vowels arranged on high-low and front-back axes: iy, uw, ix, ux, ih, uh, oy, ey, ow, ax, ao, aw, eh, ah, ay, ae, aa]

  9. Analyzing Speech: At the Phonological Level • How do people develop models of intonation? • ToBI • Tones: Pitch accents, phrase accents, boundary tones • Break indices • Hand labeling vs. automatic analysis • Which provides more useful information?

  10. L-L% L-H% H-L% H-H% H* L* L*+H

  11. L-L% L-H% H-L% H-H% L+H* H+!H* H* !H*
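
The tone labels on slides 10 and 11 follow the ToBI diacritic conventions: `*` marks a pitch accent, `-` a phrase accent, `%` a boundary tone, and `!` a downstepped tone. As a minimal sketch (the function names are hypothetical, and only the label shapes shown on these slides are handled), the labels can be sorted into categories like this:

```python
def classify_tobi_label(label):
    """Return the tone category of a ToBI label, judged by its diacritics only."""
    if "*" in label:
        # A starred tone marks a pitch accent, e.g. H*, L*+H, H+!H*.
        return "pitch accent"
    if label.endswith("%"):
        # Labels such as L-H% combine a phrase accent with a final boundary tone.
        if "-" in label:
            return "phrase accent + boundary tone"
        return "boundary tone"
    if label.endswith("-"):
        return "phrase accent"
    raise ValueError(f"unrecognized label shape: {label!r}")

def is_downstepped(label):
    """A '!' before a tone marks downstep, as in !H*."""
    return "!" in label

# The labels exactly as they appear on slides 10 and 11.
slide_labels = ["L-L%", "L-H%", "H-L%", "H-H%",
                "H*", "L*", "L*+H", "L+H*", "H+!H*", "!H*"]
categories = {label: classify_tobi_label(label) for label in slide_labels}
```

This only inspects the label strings; it says nothing about how the tones are realized acoustically or how a labeler would assign them.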

  12. Speech Generation • Synthesis then and now • Open problems in TTS: • Pronunciation modeling: OOV words, homographs, abbreviations • Predicting pitch accents and phrase boundaries: corpus-based approaches • Information status: focus, given/new • Modeling discourse structure • Producing emotional speech • Evaluation

  13. Speech Recognition/Understanding • ASR then and now: From speaker-dependent digit recognition using analog circuits to HMM-based speaker-independent recognition of spontaneous speech by computer • Open problems • Segmentation: sentence, speaker, topic • OOV recognition • Handling disfluencies • Evaluation: transcription, semantic, task-based? • Recognizing emotion and other types of speaker state
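
The slide names transcription-based evaluation without giving a metric. One commonly used measure is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the reference length. The sketch below is an illustration under that assumption, not the course's own evaluation code; the function name is hypothetical, and the example sentence is borrowed from slide 6.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + substitution_cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

reference = "why did ken set the soggy net"
one_substitution = word_error_rate(reference, "why did ken set a soggy net")
one_deletion = word_error_rate(reference, "why did ken set the net")
```

Both hypotheses contain a single error against a seven-word reference, so each scores 1/7. WER treats every word as equally important, which is one reason the slide also lists semantic and task-based evaluation as alternatives.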

  14. Spoken Dialogue Systems • Integrating TTS and ASR with dialogue management and task-based components • Open questions: • Improving ASR accuracy • Recognizing dialogue acts • Turn-taking behavior • Confirmation strategies and initiative • Entrainment and ‘personality’ • Evaluation

  15. Recognizing Speaker State and Diagnosis • Emotional speech • Voice quality • Deceptive speech • Charismatic speech • Customer care rep evaluation • Medical diagnosis • Paranoia and other psychiatric disorders • Cancer patient prognosis

  16. Take-Home Final • Due: May 14 by 4:10 pm • Submission instructions: • This examination is designed to test your ability to synthesize information and to perform critical analysis of published research. Choose 3 of the following 4 questions to answer. Each question should be answered with specific reference to the readings specified, all of which are linked to the syllabus for the class on the date given. (I.e., cite articles with page numbers to support claims about authors’ findings or claims, as “McLeod et al. (1998) claims that existing Spoken Dialogue Systems’ major drawback is their lack of delightful personalities (p. 4).”) Do not attempt to answer the questions until you have read and understood the specified articles. Essays that do not show evidence of this understanding will not receive high marks. • Each essay will be worth 33 1/3 points. Each essay should be no more than 1200 words in length; only the first 1200 words of each essay will be graded, so please do not exceed this limit. If you can answer the question in a shorter essay, feel free to do so. Please use plain ASCII or Word and report word counts for each essay.

  17. Sample Question • Agree or disagree: “It is more difficult to recognize deception automatically from acoustic/prosodic and lexical cues than from visual cues obtained from face or body gesture.” Use the readings assigned for April 28 to support your answer. • Show that you understand the question and are answering it • E.g. “I believe that it is more difficult to recognize deception automatically from visual cues than from acoustic/prosodic and lexical cues.” • For agree/disagree questions, decide whether you basically agree or disagree • E.g. “While there are difficulties recognizing deception from both types of cues, I believe it is more difficult to recognize deception from visual cues than from language-based cues.”

  18. Provide evidence on both sides of the question • “While both audio and visual cues require high quality recordings, audio recordings must be obtained in a quiet environment whereas video recordings can be obtained in a wider variety of situations, provided that equipment is available.” • “While Mehrabian (1971) found significant effects for both visual and language-based cues, the particular language cues he identified in this study would seem to be easier to recognize automatically than the visual cues: For example, it should be easier to identify amount of speech and speaking rate than features such as ‘rocking gestures’ and ‘leg and foot movements’.” • Support your statements with specific reference to your sources • E.g. “DePaulo et al. (1983) find that…” • Or, “Motivation greatly influences subjects’ ability to control their verbal cues (DePaulo et al., 1983).”

  19. When in doubt, cite
