
Modeling Prosody for Language Identification on Read and Spontaneous Speech



Presentation Transcript


Jean-Luc ROUAS (1), Jérôme FARINAS (1), François PELLEGRINO (2) and Régine ANDRÉ-OBRECHT (1)
{rouas, jfarinas, obrecht}@irit.fr; pellegrino@univ-lyon2.fr
(1) Institut de Recherche en Informatique de Toulouse, UMR 5505 CNRS - Université Paul Sabatier - INP, 31062 Toulouse Cedex 4 - France
(2) Laboratoire Dynamique du Langage, UMR 5596 CNRS - Université Lumière Lyon 2, 69363 Lyon Cedex 7 - France

[Figure: spectrogram (Frequency, 0 to 8 kHz) and waveform (Amplitude) over Time (0 to 1.0 s), segmented into Vowel, Non Vowel and Pause segments and grouped into the pseudo-syllables CCVV, CCV, CV, CCCV, CV]

• Rhythm:
  • Duration C
  • Duration V
  • Complexity C
• Intonation:
  • Skewness(F0)
  • Kurtosis(F0)

Pseudo Syllable
• Speech segmentation: statistical segmentation (André-Obrecht, 1988)
  • Short segments (bursts and transient parts of sounds)
  • Longer segments (steady parts of sounds)
• Speech activity detection and vowel detection (Pellegrino & Obrecht, 2000)
  • Spectral analysis of the signal
  • Language- and speaker-independent algorithm
• Pseudo-syllable segmentation
  • Derived from the most frequent syllable structure in the world: CV
  • The speech signal is parsed into patterns matching the structure C^n V (n is an integer and can be 0)

Duration Parameters
• 3 parameters are computed:
  • Global consonantal segments duration
  • Global vocalic segment duration
  • Syllable complexity (Nc: number of consonantal segments in the pseudo-syllable)

[Diagram block labels: Signal, Speech segmentation, Speech activity detection, Vowel detection, Pseudo syllable generation, Recognition]

Intonation Parameters
• Fundamental frequency extraction: « MESSIGNAIX » toolbox, a combination of three methods (AMDF, spectral comb, autocorrelation)
• Fundamental frequency modeling: computation of statistics on each pseudo-syllable (skewness and kurtosis)

The prosodic modeling uses Gaussian Mixture Models (GMM) on a set of 9 parameters extracted from each pseudo-syllable: Dc, Dv, Nc, F0 mean, F0 variance, F0 skewness, F0 kurtosis, the accent location and the F0 bandwidth.
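As an illustration of how some of the parameters listed above could be computed, here is a minimal sketch that groups labelled segments into C^n V pseudo-syllables and derives Dc, Dv, Nc, plus the skewness and kurtosis of an F0 contour. It is not the authors' implementation. The segment labels ("C", "V", "#"), the function names, the merging of consecutive vowel segments (suggested by the CCVV pattern in the figure), the discarding of consonants left pending before a pause, and the use of non-excess kurtosis are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical segment representation: (label, duration in seconds).
# "C" = non-vowel segment, "V" = vowel segment, "#" = pause.
Segment = Tuple[str, float]


@dataclass
class PseudoSyllable:
    dc: float  # total duration of the consonantal segments (Dc)
    dv: float  # duration of the vocalic part (Dv)
    nc: int    # number of consonantal segments (Nc)


def parse_pseudo_syllables(segments: List[Segment]) -> List[PseudoSyllable]:
    """Group segments into C^n V patterns (n >= 0)."""
    syllables: List[PseudoSyllable] = []
    dc, nc, dv = 0.0, 0, 0.0
    for label, dur in segments:
        if label == "V":
            dv += dur  # assumption: consecutive vowel segments are merged
            continue
        if dv > 0.0:
            # Any non-vowel label closes the current pseudo-syllable.
            syllables.append(PseudoSyllable(dc, dv, nc))
            dc, nc, dv = 0.0, 0, 0.0
        if label == "C":
            dc += dur
            nc += 1
        else:
            # Assumption: a pause discards consonants not followed by a vowel.
            dc, nc = 0.0, 0
    if dv > 0.0:
        syllables.append(PseudoSyllable(dc, dv, nc))
    return syllables


def skewness_kurtosis(f0: List[float]) -> Tuple[float, float]:
    """Skewness and (non-excess) kurtosis of F0 values in one pseudo-syllable."""
    n = len(f0)
    mean = sum(f0) / n
    m2 = sum((x - mean) ** 2 for x in f0) / n
    if m2 == 0.0:
        return 0.0, 0.0  # flat contour: moments are undefined, return zeros
    m3 = sum((x - mean) ** 3 for x in f0) / n
    m4 = sum((x - mean) ** 4 for x in f0) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


if __name__ == "__main__":
    demo = [("C", 0.05), ("C", 0.03), ("V", 0.10), ("C", 0.04), ("V", 0.08)]
    for syllable in parse_pseudo_syllables(demo):
        print(syllable)
    print(skewness_kurtosis([100.0, 110.0, 120.0, 130.0, 140.0]))
```

Each pseudo-syllable then yields one feature vector. The remaining parameters (F0 mean and variance, accent location, F0 bandwidth) are left out because the poster does not define them precisely enough to sketch.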
Language-specific models are learned using VQ and EM algorithms on learning subsets of the corpus.

[Diagram block labels: Duration parameters extraction, Intonation parameters extraction, L1 model, L2 model, Model, Recognition, Item, Language]

Experiments
• Language identification on read speech:
  Experiments were previously carried out on the five languages of the MULTEXT database: English, French, German, Italian and Spanish. Japanese was added thanks to Mr. Kitasawa. The tests use 20-second read speech utterances and consist of a six-way identification task. On the read speech corpus, our system achieves good performance (79% correct identification on six languages). The main confusions are between English and German (both stress-timed languages), and between Spanish and Italian.
  [Figure: Results of the prosodic system on read speech (MULTEXT corpus)]
• Language identification on spontaneous speech:
  Experiments are made on ten languages of the OGI Multilingual Telephone Speech Corpus: English, Farsi, French, German, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese. The tests use the 45-second spontaneous speech utterances and consist of a pair discrimination task. On the spontaneous speech corpus, discrimination is easier between languages that do not belong to the same rhythmic and intonation classes.
  [Figure: Results of the prosodic system on spontaneous telephone speech (OGI MLTS corpus); labels: L1, L2]

Conclusion
Our system performs well on a read speech corpus, but more accurate tools will have to be developed to model spontaneous speech prosody, which seems too complex and to carry too much speaker variability for our features.
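The modeling and recognition stage (one GMM per language, learned with VQ and EM, then maximum-likelihood identification) can be illustrated with the following pure-Python sketch. It is only a toy reconstruction under stated assumptions. It uses k-means as the VQ initialisation, diagonal covariances, two components per model, a variance floor, and synthetic two-dimensional vectors instead of the nine real parameters. None of these choices is specified on the poster, and every function name is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# Diagonal-covariance GMM: one (weight, means, variances) tuple per component.
GMM = List[Tuple[float, List[float], List[float]]]
VAR_FLOOR = 1e-3  # hypothetical floor that keeps variances away from zero


def log_gauss(x: Vector, mean: Vector, var: Vector) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def logsumexp(values: List[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def vq_init(data: List[Vector], k: int, iters: int = 10) -> List[List[float]]:
    """K-means codebook, used here as the VQ initialisation of the means."""
    step = max(1, len(data) // k)  # assumes len(data) >= k
    centroids = [list(data[i * step]) for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(x, centroids[c])))
            clusters[nearest].append(x)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_gmm(data: List[Vector], k: int = 2, iters: int = 20) -> GMM:
    """VQ initialisation followed by EM re-estimation."""
    n, dim = len(data), len(data[0])
    gmean = [sum(col) / n for col in zip(*data)]
    gvar = [max(sum((v - m) ** 2 for v in col) / n, VAR_FLOOR)
            for col, m in zip(zip(*data), gmean)]
    gmm: GMM = [(1.0 / k, mean, list(gvar)) for mean in vq_init(data, k)]
    for _ in range(iters):
        # E-step: responsibility of each component for each vector.
        resp = []
        for x in data:
            logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
            norm = logsumexp(logs)
            resp.append([math.exp(value - norm) for value in logs])
        # M-step: re-estimate weights, means and variances.
        updated: GMM = []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            mean = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj
                    for d in range(dim)]
            var = [max(sum(r[j] * (x[d] - mean[d]) ** 2
                           for r, x in zip(resp, data)) / nj, VAR_FLOOR)
                   for d in range(dim)]
            updated.append((nj / n, mean, var))
        gmm = updated
    return gmm


def log_likelihood(gmm: GMM, vectors: List[Vector]) -> float:
    """Total log-likelihood of an utterance (one vector per pseudo-syllable)."""
    return sum(logsumexp([math.log(w) + log_gauss(x, m, v) for w, m, v in gmm])
               for x in vectors)


def identify(models: Dict[str, GMM], vectors: List[Vector]) -> str:
    """Return the language whose model gives the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(models[lang], vectors))


# Synthetic (Dc, Dv) training vectors for two hypothetical languages.
l1_train = [(0.10 + 0.01 * (i % 5), 0.08 + 0.01 * (i % 4)) for i in range(30)]
l2_train = [(0.20 + 0.01 * (i % 5), 0.15 + 0.01 * (i % 4)) for i in range(30)]
models = {"L1": train_gmm(l1_train), "L2": train_gmm(l2_train)}
print(identify(models, [(0.11, 0.09), (0.12, 0.10)]))
```

In a pair discrimination task such as the one run on the OGI corpus, `models` would hold exactly two entries per test. For the six-way MULTEXT task it would hold six.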
