

  1. ACOUSTIC-PHONETIC UNIT SIMILARITIES FOR CONTEXT DEPENDENT ACOUSTIC MODEL PORTABILITY Viet Bac Le*, Laurent Besacier*, Tanja Schultz** * CLIPS-IMAG Laboratory, UMR CNRS 5524, BP 53, 38041 Grenoble Cedex 9, FRANCE ** Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA, USA Presenter: Hsu-Ting Wei

  2. Outline • 1. Introduction • 2. Acoustic-phonetic unit similarities • 2.1. Phoneme Similarity • 2.2. Polyphone Similarity • 2.3. Clustered Polyphonic Model Similarity • 3. Experiments and results • 4. Conclusion

  3. 1. Introduction • There are at least 6,900 languages in the world. • We are interested in new techniques and tools for rapid portability of automatic speech recognition (ASR) systems when only limited resources are available.

  4. 1. Introduction (cont.) • In crosslingual acoustic modeling, previous approaches have been limited to context-independent models. • Monophone acoustic models in the target language were initialized using seed models from the source language. • Then, these initial models could be rebuilt or adapted using training data from the target language. • Cross-lingual context-dependent acoustic model portability and adaptation can therefore be investigated.

  5. 1. Introduction (cont.) • 1996, J. Kohler • Kohler used HMM distances to calculate the similarity between two monophone models. • Method: • To measure the distance or similarity of two phoneme models, a relative entropy-based distance metric is used. • During training, the parameters of the mixture Laplacian density phoneme models are estimated. • Further, for each phoneme, a set of phoneme tokens X is extracted from a test or development corpus. • Given two phoneme models λi and λj, and given the sets of phoneme tokens (observations) Xi and Xj, the distance between models λi and λj is defined by the metric shown below.
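The equation itself appeared only as an image on the original slide. A plausible reconstruction, assuming Kohler's metric takes the standard symmetric log-likelihood (relative entropy based) form, with Ti and Tj the numbers of frames in the token sets Xi and Xj, is:

d(\lambda_i, \lambda_j) = \frac{1}{T_i}\Big[\log P(X_i \mid \lambda_i) - \log P(X_i \mid \lambda_j)\Big] + \frac{1}{T_j}\Big[\log P(X_j \mid \lambda_j) - \log P(X_j \mid \lambda_i)\Big]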

  6. 1. Introduction (cont.) • 2000, B. Imperl • proposed a triphone similarity estimation method based on phoneme distances and used an agglomerative (bottom-up) clustering process to define a multilingual set of triphones. • Method: confusion matrix

  7. 1. Introduction (cont.) • 2001, T. Schultz • proposed PDTS (Polyphone Decision Tree Specialization) to overcome the problem that the context mismatch across languages increases dramatically for wider contexts. • PDTS: data-driven method (↓) • In PDTS, the clustered multilingual polyphone decision tree is adapted to the target language by restarting the decision tree growing process according to the limited adaptation data in the target language.

  8. 1. Introduction (cont.) • In this paper, we investigate a new method for this crosslingual transfer process. • We do not use the existing decision tree in the source language, but build a new decision tree with just a small amount of data from the target language. • Then, based on the acoustic-phonetic unit similarities, some crosslingual transfer and adaptation processes are applied.

  9. 2. Acoustic-phonetic unit similarities • 2.1. Phoneme Similarity • 2.1.1 Data-driven methods • 2.1.2 Proposed knowledge-based method • 2.1.1 Data-driven methods (↓) • The acoustic similarity between two phonemes can be obtained automatically by calculating the distance between their two acoustic models: • HMM distance (entropy) • Kullback-Leibler distance • Bhattacharyya distance • Euclidean distance • Confusion matrix
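As an illustration of the data-driven direction, below is a minimal sketch (not from the paper) of the Bhattacharyya distance between two single-Gaussian phoneme models; a single Gaussian per phoneme is a simplification, since real acoustic models use HMM states with Gaussian mixtures.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    mu_diff = mu1 - mu2
    cov_avg = 0.5 * (cov1 + cov2)
    # Mahalanobis-like term on the means
    term1 = 0.125 * mu_diff @ np.linalg.solve(cov_avg, mu_diff)
    # Covariance mismatch term
    term2 = 0.5 * np.log(
        np.linalg.det(cov_avg)
        / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))
    )
    return term1 + term2

# Hypothetical 2-dimensional "phoneme models", for illustration only.
mu_a, cov_a = np.array([0.0, 0.0]), np.eye(2)
mu_b, cov_b = np.array([1.0, 0.5]), 1.5 * np.eye(2)
print(bhattacharyya_distance(mu_a, cov_a, mu_b, cov_b))
```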

  10. 2. Acoustic-phonetic unit similarities (cont.) • 2.1.2 Proposed knowledge-based method (↑) • Step 1: Top-down classification • Step 2: Bottom-up estimation • Step 1: Top-down classification • Figure 1 shows a hierarchical graph where each node is classified into different layers. • k = number of layers • Gi = user-defined similarity value for layer i (i = 0...k-1). • In our work, we investigated several settings of k and Gi and set G = {0.9; 0.45; 0.25; 0.1; 0.0} with k = 5 (G0 = 0.9, G1 = 0.45, G2 = 0.25, G3 = 0.1, G4 = 0.0), based on a cross-evaluation in crosslingual acoustic modeling experiments.

  11. 2. Acoustic-phonetic unit similarities (cont.) • Step 2: Bottom-up estimation • The similarity of two phonemes is read bottom-up from the graph as the Gi value of the deepest layer at which the two phonemes share a common node, e.g., d( [i], [u] ) = G2 (= 0.25 in this experiment).
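A minimal sketch of this two-step knowledge-based similarity, assuming the similarity of two phonemes is the G value of the shallowest shared class when walking the classification bottom-up; the phoneme paths below are a toy hierarchy, not the one from the paper's Figure 1.

```python
# Knowledge-based phoneme similarity, step 2 (bottom-up estimation).
G = [0.9, 0.45, 0.25, 0.1, 0.0]  # G0..G4: deeper shared layer = lower similarity

# Each phoneme is described by its class at layer 0 (most specific) up to the
# root layer. These paths are hypothetical, for illustration only.
PATHS = {
    "i": ["front-close", "front-vowel", "vowel", "voiced", "phoneme"],
    "u": ["back-close", "back-vowel", "vowel", "voiced", "phoneme"],
    "p": ["bilabial-stop", "stop", "consonant", "unvoiced", "phoneme"],
}

def phoneme_similarity(a, b):
    """Return Gi for the shallowest layer i at which a and b share a class."""
    for layer, (ca, cb) in enumerate(zip(PATHS[a], PATHS[b])):
        if ca == cb:
            return G[layer]
    return 0.0

print(phoneme_similarity("i", "u"))  # -> 0.25, i.e. G2, matching the slide
print(phoneme_similarity("i", "p"))  # -> 0.0 with this toy hierarchy
```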

  12. 2. Acoustic-phonetic unit similarities (cont.) • 2.2 Polyphone Similarity • Let S be the phoneme set in the source language and T the phoneme set in the target language. • The similarity between a source polyphone and a target polyphone is estimated from the phoneme similarities of their corresponding context phonemes. • Finally, find the nearest source polyphone Ps* for each target polyphone.
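The combination formula shown on the original slide is not reproduced in this transcript; the sketch below assumes a plain average of per-position phoneme similarities over a triphone, which is one plausible reading rather than the paper's exact formula. phoneme_sim can be any phoneme similarity, e.g., the knowledge-based one sketched above.

```python
# Sketch of polyphone (here: triphone) similarity and nearest-polyphone search.

def triphone_similarity(pt, ps, phoneme_sim):
    """pt, ps: (left, center, right) phoneme triples in target / source language."""
    return sum(phoneme_sim(a, b) for a, b in zip(pt, ps)) / 3.0

def nearest_source_polyphone(pt, source_polyphones, phoneme_sim):
    """Return Ps*: the source polyphone with maximal similarity to target pt."""
    return max(source_polyphones,
               key=lambda ps: triphone_similarity(pt, ps, phoneme_sim))
```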

  13. 2. Acoustic-phonetic unit similarities (cont.) • 2.3 Clustered Polyphonic Model Similarity • Since the number of polyphones in a language is very large, a limited training corpus usually does not cover enough occurrences of every polyphone. • A decision tree-based clustering (Figure 3) or an agglomerative clustering procedure (Figure 1) is needed to cluster and model similar polyphones in a clustered polyphonic model.
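As general background (not the paper's implementation or its Figure 3), here is a minimal sketch of how a phonetic decision tree routes a polyphone to a clustered (tied) model: each node asks a question about the phoneme context, and the leaf reached names the shared model; the questions and tree below are illustrative only.

```python
# Generic sketch of decision tree-based polyphone clustering (model tying).
VOWELS = {"a", "e", "i", "o", "u"}

def is_vowel(ph):
    return ph in VOWELS

# Internal node: (question, subtree_if_yes, subtree_if_no); leaf: tied model name.
TREE = (
    lambda tri: is_vowel(tri[0]),        # question on the left context
    (
        lambda tri: is_vowel(tri[2]),    # question on the right context
        "model_V-c-V",                   # vowel on both sides of the center
        "model_V-c-C",
    ),
    "model_C-c-*",
)

def lookup(tree, triphone):
    """Walk the tree with a (left, center, right) triphone to its tied model."""
    if isinstance(tree, str):
        return tree
    question, yes_branch, no_branch = tree
    return lookup(yes_branch if question(triphone) else no_branch, triphone)

print(lookup(TREE, ("a", "b", "i")))  # -> model_V-c-V
print(lookup(TREE, ("t", "b", "i")))  # -> model_C-c-*
```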

  14. 2. Acoustic-phonetic unit similarities (cont.) • 2.3 Clustered Polyphonic Model Similarity (cont.)

  15. 3. Experiments and results • Syllable-based ASR system • The following data were used to build the polyphonic decision tree and to adapt the crosslingual acoustic models: • Baseline • 2.25 hours of Vietnamese speech spoken by 8 speakers were used. • Crosslingual methods • Speech data from six languages were used to build these models: Arabic, Chinese, English, German, Japanese and Spanish. • The models were then adapted with the Vietnamese speech.

  16. 3. Experiments and results (cont.) • Initial model: Vietnamese speakers speaking Vietnamese (2.5 hours), then tested. • New models: speakers of the six source languages speaking their own native languages (perhaps 100 hours), adapted by adding Vietnamese speakers speaking Vietnamese (2.5 hours), then tested.

  17. 3. Experiments and results (cont.) • As the number of clustered sub-models increases, the syllable error rate (SER) of the baseline system increases proportionally, since the amount of data per model decreases due to the limited training data. • However, the crosslingual system is able to overcome this problem by indirectly using data from other languages.

  18. 3. Experiments and results (cont.) • The influence of adaptation data size and the number of speakers was studied on the baseline system and on two methods of phoneme similarity estimation: • knowledge-based • data-driven, using a confusion matrix • We find that the knowledge-based method outperforms the data-driven method.

  19. 4. Conclusion • This paper presents different methods of estimating the similarities between two acoustic-phonetic units. • The potential of our method is demonstrated, even though the use of trigrams in syllable-based language modeling might be insufficient to obtain acceptable error rates.
