Phonetic features in ASR
Intensive course, Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, 22 – 26 March 1999
Jacques Koreman, Institute of Phonetics, University of the Saarland, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany
E-mail: jkoreman@coli.uni-sb.de
Organisation of the course
• Tuesday – Friday:
  - First half of each session: theory
  - Second half of each session: practice
• Interruptions invited!!!
Overview of the course
1. Variability in the signal
2. Phonetic features in ASR
3. Deriving phonetic features from the acoustic signal by a Kohonen network
4. ICSLP’98: “Exploiting transitions and focussing on linguistic properties for ASR”
5. ICSLP’98: “Do phonetic features help to improve consonant identification in ASR?”
The goal of ASR systems
• Input: spectral description of microphone signal, typically
  - energy in band-pass filters
  - LPC coefficients
  - cepstral coefficients
• Output: linguistic units, usually phones or phonemes (on the basis of which words can be recognised)
Variability in the signal (1)
Main problem in ASR: variability in the input signal
Example: /k/ has very different realisations in different contexts. Its place of articulation varies from velar before back vowels to pre-velar before front vowels.
(own articulation of “keep”, “cool”)
Variability in the signal (2)
Main problem in ASR: variability in the input signal
Example: /g/ in canonical form is sometimes realised as a fricative or approximant, e.g. intervocalically (OE. regen > E. rain). In Danish, this happens to all intervocalic voiced plosives; also, voiceless plosives become voiced.
Variability in the signal (3)
Main problem in ASR: variability in the input signal
Example: /h/ has very different realisations in different contexts. It can be considered as a voiceless realisation of the surrounding vowels.
(spectrograms “ihi”, “aha”, “uhu”)
Variability in the signal (3a)
(spectrogram segment labels: [ i: h i: ] [ a: h a: ] [ u: h u: ])
Variability in the signal (4)
Main problem in ASR: variability in the input signal
Example: deletion of segments due to articulatory overlap. Friction is superimposed on the vowel signal.
(spectrogram G. “System”)
Variability in the signal (4a)
(spectrogram segment labels: [ b0 d e p0 s i m a l z Y p0 t e m ]; the position of an additional bracketed “s” label is not recoverable)
Variability in the signal (5)
Main problem in ASR: variability in the input signal
Example: the same vowel /a:/ is realised differently dependent on its context.
(spectrogram “aba”, “ada”, “aga”)
Variability in the signal (5a)
(spectrogram segment labels: [ a: b0 b a: ] [ a: b0 d a: ] [ a: b0 g a: ])
Modelling variability
• Hidden Markov models can represent the variable signal characteristics of phones
(diagram: left-to-right HMM from start state S to end state E; transitions labelled 1, p1, p2, p3 and 1-p1, 1-p2, 1-p3)
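
The model on this slide can be sketched as a transition matrix. This is a minimal sketch, assuming a three-state left-to-right HMM in which p1, p2, p3 are the self-loop probabilities and 1-p1, 1-p2, 1-p3 the probabilities of moving on; the slide does not say which label belongs to which arc, so that reading is an assumption, and the numeric values are invented for illustration.

```python
# Left-to-right phone HMM: S -> state 1 -> state 2 -> state 3 -> E.
# ASSUMPTION: p[i] is the self-loop probability of emitting state i and
# 1 - p[i] the probability of advancing. Values are illustrative only.

def transition_matrix(p):
    """Return the transition matrix with rows/columns [S, states..., E]."""
    n = len(p) + 2                      # start + emitting states + end
    a = [[0.0] * n for _ in range(n)]
    a[0][1] = 1.0                       # S always enters the first state
    for i, p_stay in enumerate(p, start=1):
        a[i][i] = p_stay                # staying models variable duration
        a[i][i + 1] = 1.0 - p_stay      # advance to the next state (or E)
    a[n - 1][n - 1] = 1.0               # E is absorbing
    return a

def expected_frames(p):
    """Expected number of frames spent in each emitting state."""
    return [1.0 / (1.0 - p_stay) for p_stay in p]

A = transition_matrix([0.6, 0.8, 0.5])
```

Under this reading, a high self-loop probability lets one state absorb a long, slowly changing stretch of signal, which is how the model accommodates variable phone durations.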
Lexicon and language model (1)
• Linguistic knowledge about phone sequences (lexicon, language model) improves word recognition
• Without linguistic knowledge, low phone accuracy
Lexicon and language model (2)
Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge is needed.
Example: [rsp]
  Recognise speech
  Wreck a nice beach
Lexicon and language model (3)
Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge is needed.
Example: []
  Get up at eight o’clock
  Get a potato clock
Practical: Segment and label a signal
CONCLUSIONS
• The acoustic parameters (e.g. MFCC) are very variable.
• We must try to improve phone accuracy by extracting linguistic information.
• Rationale: word recognition rates will increase if phone accuracy improves
• BUT: not all our problems can be solved
Phonetic features in ASR
• Assumption: phone accuracy can be improved by deriving phonetic features from the spectral representation of the speech signal
• What are phonetic features?
A phonetic description of sounds
• The articulatory organs
A phonetic description of sounds
• The articulation of consonants
(figure labels: velum (= soft palate), tongue)
A phonetic description of sounds
• The articulation of vowels
Phonetic features: IPA
• IPA (International Phonetic Alphabet) chart
  - consonants and vowels
  - only phonemic distinctions
  (http://www.arts.gla.ac.uk/IPA/ipa.html)
IPA features (obstruents)
        lab den alv pal vel uvu glo plo fri nas lat apr tri voi
p0        0   0   0   0   0  -1  -1   1   0   0   0   0   0  -1
b0        0  -1   0   0   0  -1  -1   1   0   0   0   0   0   1
p         1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1
t        -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1
k        -1  -1  -1  -1   1  -1  -1   1  -1  -1  -1  -1  -1  -1
b         1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1   1
d        -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1   1
g        -1  -1  -1  -1   1  -1  -1   1  -1  -1  -1  -1  -1   1
f         1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
T        -1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
s        -1  -1   1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
S        -1  -1   1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
C        -1  -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
x        -1  -1  -1  -1   1  -1  -1  -1   1  -1  -1  -1  -1  -1
vfri      1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
vapr      1   1  -1  -1  -1  -1  -1  -1  -1  -1  -1   1  -1   1
Dfri     -1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
z        -1  -1   1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
Z        -1  -1   1   1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
IPA features (sonorants)
        lab den alv pal vel uvu glo plo fri nas lat apr tri voi
m         1  -1  -1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1   1
n        -1  -1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1   1
J        -1  -1  -1   1  -1  -1  -1  -1  -1   1  -1  -1  -1   1
N        -1  -1  -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1   1
l        -1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1   1
L        -1  -1  -1   1  -1  -1  -1  -1  -1  -1   1   1  -1   1
rret     -1  -1   1  -1  -1  -1  -1  -1  -1  -1  -1   1  -1   1
ralv     -1  -1   1  -1  -1  -1  -1  -1  -1  -1  -1  -1   1   1
Ruvu     -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1   1   1
j        -1  -1  -1   1  -1  -1  -1  -1  -1  -1  -1   1  -1   1
w         1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1   1  -1   1
h        -1  -1  -1  -1  -1  -1   1  -1   1  -1  -1  -1  -1  -1
...       0   0   0   0   0   0   0   0   0   0   0   0   0   0
A zero value is assigned to all vowel features (not listed here)
IPA features (vowels)
        mid ope fro cen rou
i        -1  -1   1  -1  -1
I        -1  -1   1   1  -1
y        -1  -1   1  -1   1
Y        -1  -1   1   1   1
u        -1  -1  -1  -1   1
U        -1  -1  -1   1   1
e         1  -1   1  -1  -1
2         1  -1   1  -1   1
o         1  -1  -1  -1   1
O         1   1  -1  -1   1
V         1   1  -1  -1  -1
Q        -1   1  -1  -1   1
Uschwa    1  -1  -1   1   1
{        -1   1   1  -1  -1
a        -1   1   1   1  -1
A        -1   1  -1  -1  -1
E         1   1   1  -1  -1
9         1   1   1  -1   1
3         1   1   1   1  -1
@         1   1  -1   1  -1
6        -1   1  -1   1  -1
A zero value is assigned to all consonant features (not listed here)
Phonetic features
• Phonetic features
  - different systems (JFH, SPE, art. feat.)
  - distinction between “natural classes” which undergo the same phonological processes
SPE features (obstruents)
        cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
p0        1  -1  -1  -1  -1   0   0   0  -1   0   0  -1  -1  -1  -1   1
b0        1  -1  -1  -1  -1   0   0   0  -1   0   0  -1   1  -1  -1  -1
p         1  -1  -1  -1  -1  -1   0  -1  -1   1  -1  -1  -1  -1  -1   1
b         1  -1  -1  -1  -1  -1   0  -1  -1   1  -1  -1   1  -1  -1  -1
tden      1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1  -1  -1  -1   1
t         1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1  -1  -1  -1   1
d         1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1   1  -1  -1  -1
k         1  -1  -1  -1  -1   1   0   1  -1  -1  -1  -1  -1  -1  -1   1
g         1  -1  -1  -1  -1   1   0   1  -1  -1  -1  -1   1  -1  -1  -1
f         1  -1  -1  -1  -1  -1   0  -1  -1   1  -1   1  -1  -1   1   1
vfri      1  -1  -1  -1  -1  -1   0  -1  -1   1  -1   1   1  -1   1  -1
T         1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1  -1  -1  -1   1
Dfri      1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1   1  -1  -1  -1
s         1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1  -1  -1   1   1
z         1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1   1  -1   1  -1
S         1  -1  -1  -1  -1   1   0  -1  -1  -1   1   1  -1  -1   1   1
Z         1  -1  -1  -1  -1   1   0  -1  -1  -1   1   1   1  -1   1  -1
C         1  -1  -1  -1  -1   1   0  -1  -1  -1  -1   1  -1  -1   1   1
x         1  -1  -1  -1  -1   1   0   1  -1  -1  -1   1  -1  -1   1   1
SPE features (sonorants)
        cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
m         1  -1   1   1  -1  -1   0  -1  -1   1  -1  -1   1  -1  -1   0
n         1  -1   1   1  -1  -1   0  -1  -1   1   1  -1   1  -1  -1   0
J         1  -1   1   1  -1   1   0  -1  -1  -1  -1  -1   1  -1  -1   0
N         1  -1   1   1  -1   1   0   1  -1  -1  -1  -1   1  -1  -1   0
l         1  -1  -1   1  -1  -1   0  -1  -1   1   1   1   1   1  -1   0
L         1  -1  -1   1  -1   1   0  -1  -1  -1  -1   1   1   1  -1   0
ralv      1  -1  -1   1  -1  -1   0  -1  -1   1   1   1   1  -1  -1   0
Ruvu      1  -1  -1   1  -1  -1   0   1  -1  -1  -1   1   1  -1  -1   0
rret      1  -1  -1   1  -1  -1   0  -1  -1  -1   1   1   1  -1  -1   0
j        -1  -1  -1   1  -1   1   0  -1  -1  -1  -1   1   1  -1  -1   0
vapr     -1  -1  -1   1  -1  -1   0  -1  -1   1  -1   1   1  -1  -1   0
w        -1  -1  -1   1  -1   1   0   1   1   1  -1   1   1  -1  -1   0
h        -1  -1  -1   1   1  -1   0  -1  -1  -1  -1   1  -1  -1  -1   0
XXX       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
SPE features (vowels)
        cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
i        -1   1  -1   1  -1   1  -1  -1  -1  -1  -1   1   1  -1  -1   1
I        -1   1  -1   1  -1   1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
e        -1   1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1  -1   1
E        -1   1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
{        -1   1  -1   1   1  -1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
a        -1   1  -1   1   1  -1  -1  -1  -1  -1  -1   1   1  -1  -1   1
y        -1   1  -1   1  -1   1  -1  -1   1  -1  -1   1   1  -1  -1   1
Y        -1   1  -1   1  -1   1  -1  -1   1  -1  -1   1   1  -1  -1  -1
2        -1   1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1  -1  -1   1
9        -1   1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1  -1  -1  -1
A        -1   1  -1   1   1  -1  -1   1  -1  -1  -1   1   1  -1  -1  -1
Q        -1   1  -1   1   1  -1  -1   1   1  -1  -1   1   1  -1  -1  -1
V        -1   1  -1   1  -1  -1  -1   1  -1  -1  -1   1   1  -1  -1  -1
O        -1   1  -1   1  -1  -1  -1   1   1  -1  -1   1   1  -1  -1  -1
o        -1   1  -1   1  -1  -1  -1   1   1  -1  -1   1   1  -1  -1   1
U        -1   1  -1   1  -1   1  -1   1   1  -1  -1   1   1  -1  -1  -1
u        -1   1  -1   1  -1   1  -1   1   1  -1  -1   1   1  -1  -1   1
Uschwa   -1   1  -1   1  -1  -1   1  -1   1  -1  -1   1   1  -1  -1  -1
3        -1   1  -1   1  -1  -1   1  -1  -1  -1  -1   1   1  -1  -1   1
@        -1   1  -1   1  -1  -1   1  -1  -1  -1  -1   1   1  -1  -1  -1
6        -1   1  -1   1   1  -1   1  -1  -1  -1  -1   1   1  -1  -1  -1
Practical: Discuss one of five feature matrices
CONCLUSION
• Different feature matrices have different implications for relations between phones
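
The conclusion can be checked on a small case. The sketch below copies the rows for /p/, /b/ and /t/ from the IPA and SPE obstruent matrices and lists the features on which two phones disagree. The three-letter column names are a reading of the vertically printed table headers and should be treated as an assumption; the helper name `differing` is invented for this illustration.

```python
# ASSUMPTION: column names are read off the vertically printed headers.
IPA_COLS = "lab den alv pal vel uvu glo plo fri nas lat apr tri voi".split()
SPE_COLS = "cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten".split()

# Rows copied from the obstruent matrices for /p/, /b/ and /t/.
IPA = {
    "p": [1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1],
    "b": [1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1],
    "t": [-1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1],
}
SPE = {
    "p": [1, -1, -1, -1, -1, -1, 0, -1, -1, 1, -1, -1, -1, -1, -1, 1],
    "b": [1, -1, -1, -1, -1, -1, 0, -1, -1, 1, -1, -1, 1, -1, -1, -1],
    "t": [1, -1, -1, -1, -1, -1, 0, -1, -1, 1, 1, -1, -1, -1, -1, 1],
}

def differing(cols, table, a, b):
    """Names of the features on which phones a and b disagree."""
    return [c for c, x, y in zip(cols, table[a], table[b]) if x != y]

for name, cols, table in (("IPA", IPA_COLS, IPA), ("SPE", SPE_COLS, SPE)):
    print(name,
          "p/b:", differing(cols, table, "p", "b"),
          "p/t:", differing(cols, table, "p", "t"))
```

With these rows, the IPA matrix separates /p/ and /b/ by one feature (voi) and /p/ and /t/ by two (lab, alv), whereas the SPE matrix separates /p/ and /b/ by two (voi, ten) and /p/ and /t/ by one (cor). Which pair counts as "closer" therefore depends on the feature system chosen.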
Kohonen networks
• Kohonen networks are unsupervised neural networks
• Our Kohonen networks take vectors of acoustic parameters (MFCC_E_D) as input and output phonetic feature vectors
• Network size: 50 x 50 neurons
Training the Kohonen network
1. Self-organisation results in a phonotopic map
2. Phone calibration attaches array of phones to each winning neuron
3. Feature calibration replaces array of phones by array of phonetic feature vectors
4. Averaging of phonetic feature vectors for each neuron
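
Steps 2 to 4 can be sketched as follows. This is a minimal illustration, assuming that self-organisation (step 1) has already been done and that the winning neuron for every labelled training frame is known; the function name `calibrate`, the toy two-dimensional feature vectors and the neuron coordinates are invented for the example.

```python
from collections import defaultdict

def calibrate(winners, phone_labels, feature_table):
    """Attach an averaged phonetic feature vector to each winning neuron.

    winners:       (row, col) of the winning neuron for each training frame
    phone_labels:  phone label of each training frame (same order)
    feature_table: phone label -> phonetic feature vector
    """
    # Step 2: phone calibration, collect the phones that activated each neuron.
    phones_at = defaultdict(list)
    for neuron, phone in zip(winners, phone_labels):
        phones_at[neuron].append(phone)

    averaged = {}
    for neuron, phones in phones_at.items():
        # Step 3: feature calibration, replace phones by feature vectors.
        vectors = [feature_table[ph] for ph in phones]
        # Step 4: average the feature vectors per neuron, feature by feature.
        averaged[neuron] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return averaged

# Toy data: two features, three labelled frames, two winning neurons.
toy_features = {"p": [1, -1], "b": [1, 1]}
toy_map = calibrate([(0, 0), (0, 0), (3, 4)], ["p", "b", "b"], toy_features)
```

A neuron that wins for both /p/ and /b/ frames ends up with an intermediate value on the feature that distinguishes them, which is one way the many-to-one mapping can express uncertainty.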
Mapping with the Kohonen network
• Acoustic parameter vector belonging to one frame activates neuron
• Weighted average of the phonetic feature vectors attached to the winning neuron and its K nearest neurons is output
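
A hedged sketch of this mapping step is given below. The slide specifies neither how "nearest" is measured nor how the average is weighted; the code assumes that the K+1 neurons whose weight vectors lie closest to the input frame are used and that each is weighted by its inverse distance. Both choices, and all names and toy values, are assumptions made for illustration.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def map_frame(frame, weights, features, k=2):
    """Map one acoustic frame onto a phonetic feature vector.

    weights:  neuron -> weight vector in acoustic space
    features: neuron -> averaged phonetic feature vector (from calibration)
    ASSUMPTION: winner plus K nearest neurons in acoustic space,
    combined with inverse-distance weights.
    """
    ranked = sorted(weights, key=lambda n: euclidean(frame, weights[n]))
    chosen = ranked[: k + 1]            # winning neuron first, then K others
    w = [1.0 / (euclidean(frame, weights[n]) + 1e-9) for n in chosen]
    total = sum(w)
    dims = len(features[chosen[0]])
    return [sum(wi * features[n][d] for wi, n in zip(w, chosen)) / total
            for d in range(dims)]

# Toy map with three neurons and two phonetic features.
toy_weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0], (1, 0): [4.0, 4.0]}
toy_feats = {(0, 0): [1.0, -1.0], (0, 1): [-1.0, 1.0], (1, 0): [-1.0, -1.0]}
winner_only = map_frame([0.1, 0.1], toy_weights, toy_feats, k=0)
blended = map_frame([0.5, 0.5], toy_weights, toy_feats, k=1)
```

With k=0 the output is simply the feature vector of the winning neuron; with larger k the output is smoothed over neighbouring neurons.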
Advantages of Kohonen networks
• Reduction of feature dimensions possible
• Mapping onto linguistically meaningful dimensions (phonetically less severe confusions)
• Many-to-one mapping allows mapping of different allophones (acoustic variability) onto the same phonetic feature values
• Automatic and fast mapping
Disadvantages of Kohonen networks
• They need to be trained on manually segmented and labelled material
• BUT: cross-language training has been shown to be successful
Hybrid ASR system
(diagram: MFCC’s + energy + delta parameters → Kohonen network → phonetic features → hidden Markov modelling → phone; further labels: lexicon, language model, BASELINE)
Practical: Train Kohonen network and perform acoustic-phonetic mapping
CONCLUSION
• Acoustic-phonetic mapping extracts linguistically relevant information from the variable input signal.
ICSLP’98
Exploiting transitions and focussing on linguistic properties for ASR
Jacques Koreman, William J. Barry, Bistra Andreeva
Institute of Phonetics, University of the Saarland, Saarbrücken, Germany
INTRODUCTION
Variation in the speech signal caused by coarticulation between sounds is one of the main challenges in ASR.
• Exploit variation if you cannot reduce it
  Coarticulatory variation causes vowel transitions to be acoustically less homogeneous, but at the same time provides information about neighbouring sounds which can be exploited (experiment 1).
• Reduce variation if you cannot exploit it
  Some of the variation is not relevant for the phonemic identity of the sounds. Mapping of acoustic parameters onto IPA-based phonetic features like [± plosive] and [± alveolar] extracts only linguistically relevant properties before hidden Markov modelling is applied (experiment 2).
INTRODUCTION
No lexicon or language model
The controlled experiments presented here reflect our general aim of using phonetic knowledge to improve the ASR system architecture. In order to evaluate the effect of the changes in bottom-up processing, no lexicon or language model is used. Both improve phone identification in a top-down manner by preventing the identification of inadmissible words (lexical gaps or phonotactic restrictions) or word sequences.
DATA
Texts
English, German, Italian and Dutch texts from the EUROM0 database, read by 2 male + 2 female speakers per language
Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
DATA
Signals
• 12 mel-frequency cepstral coefficients (MFCC’s)
• energy
• corresponding delta parameters
16 kHz microphone signals
Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
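
The analysis settings on this slide translate into sample counts as sketched below. The code covers only pre-emphasis, framing and windowing, not the MFCC computation itself. It assumes the usual first-order pre-emphasis filter y[n] = x[n] - 0.97 x[n-1], which the slide does not spell out; the function names and the test tone are invented for illustration.

```python
import math

SAMPLE_RATE = 16000                      # 16 kHz microphone signals
WIN = SAMPLE_RATE * 15 // 1000           # 15 ms Hamming window -> 240 samples
STEP = SAMPLE_RATE * 5 // 1000           # 5 ms step size -> 80 samples

def pre_emphasis(x, coeff=0.97):
    """ASSUMED filter: y[n] = x[n] - coeff * x[n-1]; first sample unchanged."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]

def hamming(n_points):
    """Standard Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def windowed_frames(x):
    """Pre-emphasise x and cut it into overlapping Hamming-windowed frames."""
    y = pre_emphasis(x)
    win = hamming(WIN)
    return [[s * w for s, w in zip(y[i:i + WIN], win)]
            for i in range(0, len(y) - WIN + 1, STEP)]

# One second of a 440 Hz test tone (illustrative input only).
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
          for n in range(SAMPLE_RATE)]
frames = windowed_frames(signal)
```

With a 15 ms window and a 5 ms step, consecutive frames overlap by two thirds, and one second of signal yields 198 complete frames of 240 samples each.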
DATA
Labels
• Intervocalic consonants labelled with SAMPA symbols, except plosives and affricates, which are divided into closure and frication subphone units
• 35-ms vowel transitions labelled as
  - i_lab, alv_O (experiment 1)
  - V_lab, alv_V (experiment 2)
  where lab, alv = consonant generalized across place; V = generalized vowel
EXPERIMENT 1: SYSTEM
(diagram; labels: MFCC’s + energy + delta parameters (twice), BASELINE, C, Voffset - C - Vonset, hidden Markov modelling, lexicon, language model, consonant)