290 likes | 506 Views
Automatic Phoneme Alignment. Hung-Yi Lo , Jen-Wei Kuo and Hsin-Ming Wang NGASR Summer Meeting 2011 July 12, 2011. attitude. Context. In "A good attitude is unbeatable". Phonetic Transcription. ae dx ax tcl t ux dcl. Acoustic Signal. Spectrogram.
E N D
Automatic Phoneme Alignment Hung-Yi Lo, Jen-Wei Kuo and Hsin-Ming Wang NGASR Summer Meeting 2011 July 12, 2011
attitude Context In "A good attitude is unbeatable" Phonetic Transcription ae dx ax tcl t ux dcl Acoustic Signal Spectrogram Automatic Phoneme Alignment Problem
Outline • A Minimum Boundary Error Framework for Automatic Phoneme Alignment • Phonetic Boundary Refinement Using Support Vector Machine • Phoneme Boundary Refinement Using Ranking Methods
A Minimum Boundary Error Framework for Automatic Phoneme Alignment
boundary error of compared to posterior probability of summation of all possible alignments MBE Discriminative Training • Let be a set of various possible phonetic alignments for a given observation utterance O, and S is the true phonetic alignment in the canonical transcription. • Then, the expected boundary error is expressed as
Various possible hypothesized alignments generated by conventional HMM-based forced alignment approaches We want to select the alignment with the minimal expected boundary error from We examine each hypothesized alignment one by one MBE Forced Alignment (N-best rescoring) …
MBE Forced Alignment … If is the true alignment, then we get the error
MBE Forced Alignment … If is the true alignment, then we get the error
MBE Forced Alignment … The expected boundary error of can be expresses as: The best alignment is found by minimizing the expected error
Experimental Setup • TIMIT corpus • Training set: 4546 utterances (3.87 hrs) • Test set: 1646 utterances (1.41 hrs) • Context independent HMMs are used • 50 phone models (left-to-right topology) • 3 states for each HMM • 5265 mixtures totally • Frame window size and shift are 20 ms and 5 ms, respectively • 12 MFCCs and log energy, and their first and second differences • CMS and CVN are applied • Evaluation metric • Percentage of boundaries placed without a tolerance
Experimental Results • Boundary Accuracy
Phonetic Boundary Refinement Using Support Vector Machine
Acoustic Signal Spectrogram Boundaries detected by HMM-based Forced Alignment Hypothesized Boundaries as Input to SVM SVM Boundary Classifier Output Boundary Refinement Using SVM
Feature Extraction • Each frame represented by a 45-dim vector: • 39 MFCC-based coefficients • Zero crossing rate • Bisector frequency • Burst degree • Spectral entropy • General weighted entropy • Subband energy • Two features are extracted for each hypothesis boundary: • Symmetrical Kullback-Leibler distance • Spectral feature transition rate • Each hypothesized boundary is represented by a 92-dim augmented vector consisting of elements of the feature vectors of the left and right frames next to it and the two boundary features.
Phone-Transition-Dependent Classifier • Training data is always limited • Cannot train a classifier for each type of phone transition • Many phone transitions have similar acoustic characteristics, we can partition them into clusters • The phone transitions with little training data can be covered by the classifiers of the categories they belong to • Two methods for phone transition clustering: • K-means-based Clustering (KM) • Decision-tree-based Clustering (DT)
Training Data for Boundary Classifier • Positive samples: the feature vectors associated with the true phone boundaries • Negative samples: the randomly selected feature vectors at least 20ms away from the true boundaries Positive Instance Negative Instance Negative Instance Classification Model
Phonetic Boundary Refinement Using Ranking Methods
Training Data for Boundary Classifier • Positive samples: the feature vectors associated with the true phone boundaries • Negative samples: the randomly selected feature vectors at least 20ms away from the true boundaries However, this classification-based method has two drawbacks Positive Instance Negative Instance Negative Instance Classification Model
First Drawback: Losing Information (1/2) • Only information about the boundary and far away non-boundary signal characteristics is used • What about the information nearby the boundary? Positive Instance Negative Instance Negative Instance Classification Model
First Drawback: Losing Information (2/2) • Preference ranking • Instances extracted from the true boundaries: high preference • Nearby instances: medium preference • Far away instances: low preference Positive Instance Negative Instance Negative Instance Classification Model Preference Ranking Mid High Mid Low Low
Second Drawback : Imbalanced Training • A lot of negative instances but only a limited amount of positive instances • General classification algorithms will be biased to predict all instances to be negative • Since they are learned to minimize the number of incorrectly classified instances
Boundary Refinement as a Ranking Problem • Learn a function H: X → R where H(xi) > H(xj) means that instance xi is preferred to xj • The hypothesized boundary closed to the true boundary should have higher score • We only care about relative order • Correct order: A-B-C-D • OK: {A:-100, B:-10, C: 0, D:1000} • OK: {A: 0.1, B: 0.3, C: 0.4, D: 0.41} • We exploit two learning-to-rank methods: • Ranking SVM • RankBoost
Generation of Training Pairs • Four ordered ranking lists generated from each true boundary True Boundary …… ……
Generation of Training Pairs • Four ordered ranking lists generated from each true boundary …… …… (1) (2) (3) (4) …… ……
Generation of Training Pairs • Couple the preferred instances with each remaining instance …… …… (1) (2) (3) (4) …… ……
References • Jen-Wei Kuo and Hsin-Min Wang, Minimum Boundary Error Training for Automatic Phonetic Segmentation, ICSLP 2006 • Jen-Wei Kuo and Hsin-Min Wang, A Minimum Boundary Error Framework for Automatic Phonetic Segmentation, ISCSLP 2006 • Hung-Yi Lo and Hsin-Min Wang, Phonetic Boundary Refinement Using Support Vector Machine, ICASSP 2007 • Hung-Yi Lo and Hsin-Min Wang, Phone Boundary Refinement Using Ranking Methods, ISCSLP 2010