
Spoken dialogs with computers

Spoken dialogs with computers. Krzysztof Marasek. Man-Machine communication: How can it work?. Man-machine interaction by graphic Man-machine communication by speech multi-modal man-machine communication is that all? - No: haptic, structure changes (Logitech mouse). Lecture’s topic.


Presentation Transcript


  1. Spoken dialogs with computers Krzysztof Marasek

  2. Man-Machine communication: How can it work? • Man-machine interaction by graphics • Man-machine communication by speech • Multi-modal man-machine communication • Is that all? No: also haptics and structure changes (Logitech mouse) Lecture’s topic

  3. Lecture overview • encapsulated phonetics • sound propagation • acoustic properties of speech • basic signal forms and distinctive features of speech sounds • phonemes and allophones, grapheme-to-phoneme conversion • speech parameterization • speech data collection • speech modeling • language modeling and basics of natural language processing • speech recognition techniques: Viterbi decoding • model and pronunciation adaptation • speech synthesis methods and Text-To-Speech synthesis • speech understanding • dialog design • applications and challenges of speech technology

  4. Architecture of spoken dialog Our topics Speech recognition Speech interpretation Dialog manager Text generation Speech synthesis De Mori, 99

  5. Human Language Technology worldwide • USA • Oregon • Carnegie Mellon • MIT • Japan • Kyoto • Tokyo • Europe • Germany • France • Great Britain • Scandinavia • Italy • wild East

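  6. Wszystko jasne (Everything is clear) • Bdanaia na pweynm anelgiksim uneruwstytecie wyzakały, że nie ma znczeania, w jaikej kloejności napsziemy lietry wenątrz wryazu, blye tlkyo pirwesza i otstaina lreita błyy na soiwch mijsecach. Rtszea mżoe być dolnwoie poszamienina, a mmio to bedzięmy w stniae pczyrzetać tkest bez wikszęego prleobmu. Diezje się tak dlteago, że nie cztaymy kżdeaj z lteir odelndziie, ale wrayz jkao cłoaść. (English translation of the deliberately scrambled Polish text: Research at a certain English university has shown that it does not matter in what order we write the letters inside a word, as long as the first and last letters are in their places. The rest can be freely shuffled, and we will still be able to read the text without much trouble. This happens because we do not read each letter separately, but the word as a whole.) Eric Campbell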

  7. Speech signal characterization time signal energy spectrogram pitch duration

  8. Man-Machine communication by speech coding phrases sentences words letters Czesc! decoding • Man-machine communication: exchange of information coded in a way suitable for transmission through a physical medium • Coding: the process of producing a representation of what has to be communicated • Knowledge sources (KSs): constraints for building a symbolic version of the message and transmitting it through a physical channel • Decoding: relies on the models of the KSs used by the computer, which are deterministic but often imprecise

  9. Knowledge sources: acoustic models coding phrases words syllables phones Czesc! • Coding of acoustic speech events: • phones: representation of basic speech units • coding alphabet: e.g. IPA codes, but others also exist (e.g. SAMPA)

  10. Speech Recognition: A simple decoder model Information source Information channel W X • Modern systems are based on probabilistic scores for candidate hypotheses • Model of hypothesis scoring: let the sequence of acoustic observations X = x1…xN be the output of the information channel. If the speaker intended the word sequence W = w1…wK, then X is a coded version of W • The objective of recognition is to reconstruct W from the observation of X • Processing chain: Utterance -> Speech signal -> PCM -> window -> coefficients -> X
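The processing chain on this slide (PCM samples, windowing, coefficients, X) can be illustrated with a minimal sketch. It assumes a 25 ms Hamming window, a 10 ms frame shift and a single log-energy coefficient per frame; the slide specifies none of these, and real front ends compute richer feature vectors. The function names are hypothetical.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of PCM samples into overlapping, equally spaced frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def log_energy(frame, window):
    """One coefficient per frame: log of the windowed signal energy."""
    energy = sum((s * w) ** 2 for s, w in zip(frame, window))
    return math.log(energy + 1e-12)  # small floor avoids log(0) on silence

# Hypothetical input: 50 ms of a 440 Hz tone sampled at 8 kHz.
rate = 8000
samples = [math.sin(2 * math.pi * 440 * t / rate) for t in range(400)]

frame_len, hop = 200, 80          # assumed: 25 ms window, 10 ms shift
window = hamming(frame_len)
frames = frame_signal(samples, frame_len, hop)
X = [log_energy(f, window) for f in frames]   # observation sequence x1..xN
```

Each element of `X` corresponds to one analysis frame, so the length of `X` is the number of observations N passed to the decoder.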

  11. Knowledge sources: acoustic models • How can phones be modeled? • Let’s assume that the speech signal is parameterized as a sequence of feature vectors computed for equally spaced speech frames • The parameters are assumed to be statistically independent • Hidden Markov Models: • easy training • robust modeling • Other models are also used (ANN, kernel methods, hybrid systems), but HMMs are currently state-of-the-art

  12. Knowledge sources: HMMs Typical phone model topology • Nodes of the graph correspond to states of a Markov chain, while directed arcs correspond to allowed transitions aij • A sequence of observations is regarded as an emission of the system, which at each time instant makes a transition from one node to another, chosen randomly according to a node-specific probability distribution, and generates a random vector according to an arc-specific probability density. The number of states and the set of arcs are usually called the model topology. • In ASR it is common to have left-to-right topologies, in which aij = 0 for j < i • Usually the first and last states are non-emitting, i.e. the source and final states serve only to set the initial and final probabilities
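A left-to-right topology and Viterbi decoding (listed in the lecture overview) can be sketched as follows. This is a simplified illustration, not the model from the slide: it assumes three emitting states, discrete observation symbols, and emissions attached to states rather than arcs. All probabilities are made-up values.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(obs, trans, emit, start):
    """Return the most likely state path and its log probability."""
    n = len(trans)
    score = [log(start[s]) + log(emit[s][obs[0]]) for s in range(n)]
    back = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor i for state j at this time step.
            best_i = max(range(n), key=lambda i: score[i] + log(trans[i][j]))
            new_score.append(score[best_i] + log(trans[best_i][j]) + log(emit[j][o]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best

# Left-to-right topology: trans[i][j] == 0 whenever j < i.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [{"a": 0.8, "b": 0.1, "c": 0.1},
        {"a": 0.1, "b": 0.8, "c": 0.1},
        {"a": 0.1, "b": 0.1, "c": 0.8}]
start = [1.0, 0.0, 0.0]   # always enter through the first state

path, logp = viterbi(["a", "a", "b", "c", "c"], trans, emit, start)
print(path)
```

Because all backward transitions have probability zero, the decoded path can only stay in a state or move forward, which is what makes this topology suitable for modeling a phone as a sequence of acoustic stages.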

  13. Knowledge sources: HMM linking

  14. Knowledge sources: language model • Language model: a set of constraints on the sequences of words acceptable in a given language • Rules of a generative grammar G produce the sentences of a language LG(G) • G is a 4-tuple (VT, VN, s, P), where VT is the set of all words of LG(G), VN is a set of non-terminal symbols representing abstractions of language components (e.g. syntax), s is the category of all sentences in LG(G), and P is a set of rules a -> b, where a is a sequence of symbols of which at least one belongs to VN and b is a sequence of symbols from (VT ∪ VN) • If a is always a single symbol in VN, the grammar G is context-free • For natural languages it is impossible to conceive a grammar G that generates all and only the sentences of a language: there are no formal models of NL • Heuristic solution: stochastic finite-state automata, i.e. an over-generating grammar for word pairs plus probabilities of the generated sentences (bigrams), as in HMMs • Integrated network: an automaton for each word, combining lexical and acoustic models that describe pronunciation variants (phonemes) and the distributions of the acoustic parameters of phonemes
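The bigram idea, an over-generating word-pair grammar with probabilities, can be sketched with maximum-likelihood counts. The three training sentences are invented for illustration, and no smoothing is applied, so unseen word pairs get probability zero.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count word-pair and history frequencies, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])                 # histories only
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """P(sentence) as a product of P(word | previous word)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        if unigrams[prev] == 0:
            return 0.0                              # unseen history
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

# Hypothetical training corpus.
corpus = ["ticket to Kutno", "ticket to Sopot", "train to Kutno"]
uni, bi = train_bigrams(corpus)

p_seen = sentence_probability("ticket to Kutno", uni, bi)
p_new = sentence_probability("train to Sopot", uni, bi)
p_bad = sentence_probability("Kutno to ticket", uni, bi)
```

The sentence "train to Sopot" never occurs in the corpus but still receives a non-zero probability, because every one of its word pairs was seen. This is the over-generation the slide refers to.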

  15. Statistical modeling approach for ASR • Bayesian approach: W* = argmax_W P(W | A) = argmax_W P(A | W) P(W) / P(A) • P(W | A): probability of the word sequence W given the acoustic input A; the most probable W is sought • P(A | W): probability of the acoustic sequence A given a word sequence W, computed as a distance to trained models • P(W): a priori probability of the word string W • P(A): a priori probability of the acoustic sequence A • Traditional HMM Output: (W - word model)
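The Bayesian decision named on this slide can be sketched in a few lines. Since the probability of the acoustic sequence A is the same for every candidate, it is enough to maximize the product of the acoustic score and the language-model prior. The hypotheses and all scores below are made-up values, and `best_hypothesis` is a hypothetical helper.

```python
import math

def best_hypothesis(acoustic_loglik, lm_prob):
    """Pick the word sequence W maximizing log P(A|W) + log P(W).

    P(A) is constant over all hypotheses, so it is dropped.
    """
    return max(acoustic_loglik,
               key=lambda w: acoustic_loglik[w] + math.log(lm_prob[w]))

# Hypothetical scores for two competing hypotheses.
acoustic_loglik = {"Kutno": -12.0, "Kujaw": -11.5}   # log P(A | W)
lm_prob = {"Kutno": 0.05, "Kujaw": 0.01}             # P(W)

winner = best_hypothesis(acoustic_loglik, lm_prob)
print(winner)
```

In this example the acoustically better hypothesis loses, because the language-model prior outweighs the small acoustic advantage.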

  16. What can be recognized? • Vocabulary: (!ENTER{_SIL_}( Kutno | Sopot | Poznań | Lubin | Łuków | aleja Solidarności | Beskidy | Rzeszów )(!ENTER{_SIL_} • Lattice of models: I=172 W=Jana I=173 W=Jura I=174 W=Kazimierz J=0 S=1 E=0 J=1 S=1 E=1 J=2 S=2 E=0 J=3 S=2 E=1 J=4 S=3 E=0 J=5 S=3 E=1 J=6 S=4 E=0 J=7 S=4 E=1 J=8 S=5 E=0 J=9 S=5 E=1 J=10 S=6 E=0 • Dictionary (word, phone sequence): Kalisz k a l i S; Kamienna k a m j e n n a; Kaszuby k a S u b I; Katowice k a t o v i ts e; Kazimierz k a zi i m j e Z; Kielce k j e l ts e; Klakson k l a k s o n; Kolor k o l o r; Konopnickiej k o n o p ni i ts k j e j; Konstytucji k o n s t I t u ts j i; Koszalin k o S a l i n; Kościuszki k o si tsi u S k i; Krakowska k r a k o f s k a; Krakowsko k r a k o f s k o; Kraków k r a k u f; Krzyki k S I k i; Kujaw k u j a f; Kutno k u t n o
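How a pronunciation dictionary links words to phone models can be sketched as a simple lookup. The three entries are taken from the dictionary on this slide; the function name and the choice to surround the utterance with the silence model `_SIL_` are assumptions made for illustration.

```python
# Entries copied from the slide's dictionary.
LEXICON = {
    "Kutno": ["k", "u", "t", "n", "o"],
    "Kielce": ["k", "j", "e", "l", "ts", "e"],
    "Kraków": ["k", "r", "a", "k", "u", "f"],
}

def words_to_phones(words, lexicon, silence="_SIL_"):
    """Expand a word sequence into the phone sequence whose models get linked."""
    phones = [silence]
    for word in words:
        if word not in lexicon:
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(lexicon[word])
    phones.append(silence)
    return phones

print(words_to_phones(["Kutno"], LEXICON))
```

A word that is missing from the dictionary cannot be expanded, which is why the recognizer can only output words from its vocabulary.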

  17. Speaker-independent, continuous-speech ASR now possible • Digit recognition over the telephone with a word error rate of 0.3% • Error rate cut in half every two years for moderate-vocabulary tasks • Error rates for spontaneous speech are more than twice those of read speech • Conversational speech, involving multiple speakers and poor acoustic environments, remains a challenge • Tens of hours of training data are needed to port to a different domain • Statistical modelling using automatic training achieves significant advances [Chart: word error rate, axis ticks 0.1, 1, 10, 100, for the tasks digits, 1k read, 2k spontaneous, 20k read, 64k broadcast, 10k conversational. MIT, 2005]
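The word error rate quoted on this slide is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognizer output, divided by the number of reference words. A minimal sketch, with invented example sentences:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))      # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("two" -> "too") and one deletion ("four").
wer = word_error_rate("one two three four", "one too three")
print(wer)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.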

  18. Text-To-Speech Text preprocessing Prosody generation Acoustic output Word descriptions • Festival Speech Synthesis - steps to synthesize a sentence • Text • Token_POS • Token • POS • Word • Phrasify • Pauses • Intonation • PostLex • Duration • Int_Targets • Wave_Synth

  19. Speech synthesis • Acoustic output: • pre-recorded speech • articulatory synthesis (formant synthesis) - tries to mimic human voice generation • Frankfurt • concatenative synthesis - builds utterances from stored units • phonemes • diphones: transitions between two phonemes • Festival • unit selection: units of different lengths, context-dependent selection (maximum length of natural speech sequence) • RealSpeak • ATR Japan
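For diphone-based concatenative synthesis, a phone sequence first has to be turned into the list of diphone units to fetch from the inventory. The sketch below shows only this bookkeeping step. The unit naming scheme `a-b` and the silence symbol `_` are assumptions, not the convention of any particular synthesizer.

```python
def to_diphones(phones, silence="_"):
    """List the diphone units (phone-to-phone transitions) for an utterance."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

units = to_diphones("k u t n o".split())
print(units)
```

An utterance of n phones needs n + 1 diphones, since the transitions from and into silence are units as well.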

  20. Example: Tokens, syllables and phones

  21. Phrasing and Intonation

  22. Dialog system MIT 2005

  23. Dialog systems • Application dependent (dialog structure and content) • finite-state dialog system (usually domain-dependent) • chatter-bot systems (domain-independent?) • initiative possession (machine, human, mixed) • concept detection and spotting (find the important content in the utterance and draw conclusions) • concept and text generation (generate a context-dependent answer) • Examples: IVR, reservation systems, but sometimes still not perfect...
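A finite-state, machine-initiative dialog with simple concept spotting can be sketched as follows. The travel-booking states, prompts and city list are hypothetical, and the "spotting" is plain keyword matching on text instead of on recognizer output.

```python
# Hypothetical dialog graph: each state fills one slot, then moves on.
STATES = {
    "ask_origin": {"prompt": "Where do you travel from?",
                   "slot": "origin", "next": "ask_destination"},
    "ask_destination": {"prompt": "Where do you travel to?",
                        "slot": "destination", "next": "done"},
}
CITIES = {"Kutno", "Sopot", "Lubin"}

def spot_city(utterance):
    """Concept spotting: return the first known city name, ignore the rest."""
    for token in utterance.split():
        word = token.strip(".,!?")
        if word in CITIES:
            return word
    return None

def run_dialog(user_turns):
    """Walk the state graph; stay in a state until its slot is filled."""
    state, slots = "ask_origin", {}
    for utterance in user_turns:
        if state == "done":
            break
        node = STATES[state]
        city = spot_city(utterance)
        if city is not None:
            slots[node["slot"]] = city
            state = node["next"]
    return state, slots

final_state, slots = run_dialog(["I am leaving from Kutno.", "hmm", "to Sopot please"])
print(final_state, slots)
```

The system keeps the initiative throughout: an utterance without a recognized concept (here "hmm") leaves the state unchanged, which in a deployed system would trigger a re-prompt.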

  24. Going beyond… • Add new dimensions to MMI (para- and extralinguistic features) • avatars • personality of the dialog partners • speaker's profile • reaction to the speaker's emotion and emotional synthesis (rad)

  25. Generation of word hypotheses: Speech recognition De Mori, 99

  26. Part II HUMAN TO HUMAN COMMUNICATION

  27. Dialog architecture De Mori, 99

  28. SPEECH SPEAKER LISTENER Domains of verbal communication PSYCHOLINGUISTICS Utterance forming Understanding PHYSIOLOGY Articulation Hearing ACOUSTICS Speech acoustics Psychoacoustics Generation of speech Perception of speech

  29. Eyes – visual information • Ears – sound information • Nose –smell information • Tongue –taste information • Skin, muscles, touch receptors – touch and proprio-kinesthetic information • Proprio-kinesthetic feedback mechanisms include awareness of the movement and location of the fingers in space, • internal monitoring of rhythm and rate, and a grip What and how

  30. Articulatory organs – sounds (speech) • Movement and action organs – gestures, writing, mimics, mechanical actions etc. Organs involved in the production of information by humans

  31. Organs which may be involved in human-to-human communication Articulatory organs – hearing: speech Articulatory organs – seeing: lip reading Movement and action organs – seeing: writing, gestures, sign language, Braille writing

  32. Levels of communication Linguistic information Articulatory information (phonetics) Emotional information Personal information Information on organic speech disorders Information on neurogenic speech disorders Culture, habitats, social information

  33. Speech – spoken language Writings – written language Signs – sign language (Polish, German, English etc.) Language – a system of characters and of phonological, semantic and syntactic rules which allow these characters to be combined Language is a basis for all human-to-human communication

  34. Sentence generation scheme Syntax component Phrasing rules Lexicon Deep structure Semantic component Semantic sentence interpretation Phonological sentence interpretation Surface structure Phonological component Transformation rules

  35. Beep Keep Leep Example of meaning changes by changed phonological structure of the word Phonology, part of natural language processing, describes phonemes and relation between phonemes

  36. Articulation

  37. nasal cavity Main articulatory elements of the vocal tract lips tongue glottis

  38. Types of sounds • Sound classification is based on manner and place of articulation – where the constriction in the vocal tract is and where the sound is generated • Manner of articulation: • Vowels • Plosives /p/, /g/ • Nasals - /m/, /n/ • Taps or trills /r/ • Fricatives - /s/, /f/, /v/ • Approximants - /j/, /w/ • Place of articulation: • Bilabial • Labiodental • Dental, Alveolar, Postalveolar • Retroflex • Palatal • Velar • Uvular • Pharyngeal • Glottal -- resonants – obstruents -- affricates -- diphthongs

  39. Typical vocal tract configurations vowel articulation front -back high - low plosives articulation front (of the tongue) back (of the tongue)

  40. Typical vocal tract configurations Front fricative Front lateral approximant

  41. Comparison of airflow over nose and mouth

  42. Tongue profiles

  43. Tongue profiles

  44. Tongue profiles

  45. Tongue profiles

  46. Phonetic transcription

  47. IPA code - vowels

  48. IPA consonants

  49. SAMPA American English (symbol, example word, transcription) • Consonants (24): p pin pIn; b bin bIn; t tin tIn; d din dIn; k kin kIn; g give gIv; tS chin tSIn; dZ gin dZIn; f fin fIn; v vim vIm; T thin TIn; D this DIs; s sin sIn; z zing zIN; S shin SIn; Z measure "mEZ@`; h hit hIt; m mock mAk; n knock nAk; N thing TIN; r wrong rON; l long lON; w wasp wAsp; j yacht jAt • Vowels (17): I pit pIt; E pet pEt; { pat p{t; A pot pAt; V cut kVt; U put pUt; i ease iz; e raise rez; u lose luz; o nose noz; O cause kOz; aI rise raIz; OI noise nOIz; aU rouse raUz; 3` furs f3`z; @ allow @"laU; @` corner "kOrn@`
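Because some SAMPA symbols are two characters long (for example tS, aI or @`), a transcription has to be split into symbols before it can be mapped to phone models. The sketch below uses the symbol inventory from this slide plus the stress mark `"` that appears in its transcriptions. Greedy longest-match splitting is an assumption made here; it could mis-segment a true t + S sequence.

```python
# Symbol inventory from the slide, plus the primary stress mark.
SYMBOLS = [
    "p", "b", "t", "d", "k", "g", "tS", "dZ", "f", "v", "T", "D",
    "s", "z", "S", "Z", "h", "m", "n", "N", "r", "l", "w", "j",
    "I", "E", "{", "A", "V", "U", "i", "e", "u", "o", "O",
    "aI", "OI", "aU", "3`", "@", "@`", '"',
]

def tokenize(transcription):
    """Split a SAMPA string into symbols, preferring the longest match."""
    by_length = sorted(SYMBOLS, key=len, reverse=True)
    tokens, pos = [], 0
    while pos < len(transcription):
        for sym in by_length:
            if transcription.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"unknown symbol at position {pos}: {transcription[pos:]!r}")
    return tokens

print(tokenize("tSIn"))      # chin
print(tokenize('"kOrn@`'))   # corner
```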

  50. Transcription by hand
