Yow-Bang Wang, Lin-Shan Lee, INTERSPEECH 2010. Mandarin Tone Recognition using Affine-Invariant Prosodic Features and Tone Posteriorgram. Speaker: Hsiao-Tsung Hung
Introduction • Tone recognition is influenced by at least the following: • Speaker • The “prosodic state” • Co-articulation effects
Introduction • Although tones depend heavily on many intra-syllabic and prosodic behaviors that are speaker dependent, native speakers of Mandarin recognize them easily. • This implies that tones are classified by some “robust” prosodic cues, which remain useful across many different conditions.
Introduction • In this paper we try to introduce robustness into prosodic features through different feature normalization schemes, based on the affine invariance property proposed in recent years. • We also combine the prosodic features with context information through a tone posteriorgram, analogous to the TANDEM system for speech recognition.
Affine invariance property • Consider an n-dimensional feature vector sequence $x_1, \dots, x_T$ along the time axis. Suppose a certain change of condition over these feature vectors is stationary within some period of time and can be represented as an affine transformation: $x'_t = A\,x_t + b$
Affine invariance property • There may exist some features obtained from the sequence which remain invariant under such a change of condition: $f(x'_1, \dots, x'_T) = f(x_1, \dots, x_T)$, where $f$ is the feature function.
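Affine invariance for normalized pitch features • Assume the transformation between the pitch contours for the same syllable for two speakers, $A$ and $B$, can be approximated by an affine transform: $p_B(t) = a\,p_A(t) + b$ (assume $a > 0$ here)
Affine invariance for normalized pitch features • Relationship between the utterance-level means and standard deviations: $\mu_B = a\,\mu_A + b$, $\sigma_B = a\,\sigma_A$ • The normalized pitch contour is therefore the same for both speakers: $\hat{p}_B(t) = \frac{p_B(t) - \mu_B}{\sigma_B} = \frac{p_A(t) - \mu_A}{\sigma_A} = \hat{p}_A(t)$
Affine invariance for normalized pitch features • Any feature function $M(\cdot)$ applied to this normalized pitch contour is automatically affine-invariant.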
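The invariance claim can be checked numerically. The sketch below is only an illustration, not the authors' code: the pitch values, the affine parameters `a` and `b`, and the helper names `normalize_pitch`, `affine` and `slope` are all assumptions made for the example. It normalizes a syllable's pitch contour with utterance-level mean and standard deviation, then shows that a simple feature function gives the same value before and after an affine change with a positive scale.

```python
from statistics import mean, pstdev


def normalize_pitch(contour, utterance_pitch):
    """Z-score a syllable contour using utterance-level mean and std."""
    mu = mean(utterance_pitch)
    sigma = pstdev(utterance_pitch)
    return [(p - mu) / sigma for p in contour]


def affine(values, a, b):
    """Apply the same affine map a*v + b to every pitch value."""
    return [a * v + b for v in values]


def slope(contour):
    """A toy feature function M(): end value minus start value."""
    return contour[-1] - contour[0]


# Hypothetical pitch values (Hz) for one utterance of speaker A.
utt_a = [180.0, 195.0, 210.0, 200.0, 170.0, 160.0, 185.0, 220.0]
syl_a = utt_a[2:5]  # one syllable inside that utterance

# Hypothetical speaker B: same utterance under an affine change, a > 0.
a, b = 0.6, 15.0
utt_b = affine(utt_a, a, b)
syl_b = affine(syl_a, a, b)

norm_a = normalize_pitch(syl_a, utt_a)
norm_b = normalize_pitch(syl_b, utt_b)

# Largest pointwise gap between the two normalized contours.
max_gap = max(abs(x - y) for x, y in zip(norm_a, norm_b))
print(max_gap < 1e-9)
print(abs(slope(norm_a) - slope(norm_b)) < 1e-9)
```

Both checks pass up to floating-point rounding. With a negative `a` the standard deviation would scale by `|a|` and the normalized contour would flip sign, so a positive scale is needed for the invariance to hold.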
Invariance of duration and energy features • Duration • Energy • Difference between two adjacent syllables
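The slide gives only the item names, so the following is a sketch under a stated assumption rather than the paper's exact feature definition: if the change of condition adds a constant offset to a per-syllable value (for example a recording gain, which is additive in the log-energy domain), then the difference between two adjacent syllables cancels that offset. The values and the helper name `adjacent_differences` are hypothetical.

```python
def adjacent_differences(values):
    """Difference between each syllable and the one before it."""
    return [nxt - cur for cur, nxt in zip(values, values[1:])]


# Hypothetical per-syllable log-energy values (dB) for one utterance.
log_energy = [62.0, 58.5, 64.0, 60.0]

# Assumed change of condition: a constant additive offset.
offset = 6.0
shifted = [e + offset for e in log_energy]

diff_orig = adjacent_differences(log_energy)
diff_shift = adjacent_differences(shifted)

# The offset cancels, so both difference sequences agree.
same = all(abs(x - y) < 1e-9 for x, y in zip(diff_orig, diff_shift))
print(same)
```

A difference removes an additive offset but not a multiplicative scale. Under a pure scaling, a ratio of adjacent values would be the invariant quantity instead.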
Tone recognition • 21-dimensional prosodic feature vector • SVM • Enh1: current syllable • Enh2: current, preceding and following syllables
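The slide does not spell out how Enh1 and Enh2 are built. One possible reading, by analogy with TANDEM, is that tone posteriors from a first-pass classifier are appended to the prosodic features before a second classifier is trained. The sketch below assumes exactly that; the function name `augment_with_posteriors`, the uniform padding at utterance edges, the concatenation order and the toy dimensions (2 prosodic features, 4 tone classes) are all assumptions made for illustration.

```python
def augment_with_posteriors(prosodic, posteriors, context):
    """Append tone posteriors of neighbouring syllables to each feature row.

    context=0 uses only the current syllable (an Enh1-style vector);
    context=1 adds the preceding and following syllables (Enh2-style).
    """
    n_tones = len(posteriors[0])
    # Assumed padding for missing neighbours at utterance edges.
    uniform = [1.0 / n_tones] * n_tones
    augmented = []
    for i, feats in enumerate(prosodic):
        row = list(feats)
        for shift in range(-context, context + 1):
            j = i + shift
            in_range = 0 <= j < len(posteriors)
            row.extend(posteriors[j] if in_range else uniform)
        augmented.append(row)
    return augmented


# Toy data: 3 syllables, 2 prosodic features each, 4 tone classes.
prosodic = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.0]]
posteriors = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]

enh1 = augment_with_posteriors(prosodic, posteriors, context=0)
enh2 = augment_with_posteriors(prosodic, posteriors, context=1)
print(len(enh1[0]), len(enh2[0]))
```

In a full system the rows of `enh1` or `enh2` would be the inputs to the second-stage SVM, with the 21-dimensional prosodic vector in place of the toy 2-dimensional one.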
Corpus and experiment setup • Sinica Continuous Speech Prosody Corpora (COSPRO) • Contains 4,672 utterances (more than 60,000 syllables), produced by 38 male and 40 female native speakers. • SVM tone recognizers.