Automatic Phoneme Alignment
Hung-Yi Lo, Jen-Wei Kuo and Hsin-Min Wang
NGASR Summer Meeting 2011
July 12, 2011
The Automatic Phoneme Alignment Problem
[Figure: acoustic signal and spectrogram of the word "attitude" (context: "A good attitude is unbeatable"), aligned with its phonetic transcription ae dx ax tcl t ux dcl]
Outline
• A Minimum Boundary Error Framework for Automatic Phoneme Alignment
• Phonetic Boundary Refinement Using Support Vector Machine
• Phoneme Boundary Refinement Using Ranking Methods
A Minimum Boundary Error Framework for Automatic Phoneme Alignment
MBE Discriminative Training
• Let $\Psi$ be the set of possible phonetic alignments for a given observation utterance O, and let S be the true phonetic alignment in the canonical transcription.
• Then, the expected boundary error is expressed as
$$\bar{E} = \sum_{S_i \in \Psi} P(S_i \mid O)\, ER(S_i, S)$$
where the summation runs over all possible alignments, $P(S_i \mid O)$ is the posterior probability of $S_i$, and $ER(S_i, S)$ is the boundary error of $S_i$ compared to S.
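A toy computation may make the expectation concrete (the posteriors and errors below are hypothetical numbers, not from the paper):

```latex
% Suppose \Psi = \{S_1, S_2, S_3\} with posteriors P(S_i|O) = 0.5, 0.3, 0.2
% and boundary errors against the true alignment S of 0, 2, and 5 frames:
\bar{E} = 0.5 \cdot 0 + 0.3 \cdot 2 + 0.2 \cdot 5 = 1.6 \ \text{frames}
```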
MBE Forced Alignment (N-best Rescoring)
• Various possible hypothesized alignments $S_1, \dots, S_N$ are generated by conventional HMM-based forced alignment.
• We want to select the alignment with the minimal expected boundary error from this N-best list.
• We examine each hypothesized alignment one by one.
MBE Forced Alignment
• The true alignment is unknown at decoding time, so each hypothesis is considered in turn: if $S_j$ were the true alignment, hypothesis $S_i$ would incur the error $ER(S_i, S_j)$.
MBE Forced Alignment
• The expected boundary error of $S_i$ can thus be expressed as
$$\bar{E}(S_i) = \sum_{j=1}^{N} P(S_j \mid O)\, ER(S_i, S_j)$$
• The best alignment is found by minimizing the expected error:
$$\hat{S} = \arg\min_{S_i} \bar{E}(S_i)$$
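A minimal Python sketch of this N-best rescoring; the absolute-offset error measure and all names here are my assumptions, not the authors' code:

```python
# Sketch of MBE N-best rescoring (illustrative; not the paper's implementation).

def boundary_error(hyp, ref):
    """Sum of absolute offsets (in frames) between corresponding boundaries."""
    return sum(abs(h - r) for h, r in zip(hyp, ref))

def mbe_rescore(alignments, posteriors):
    """Pick the alignment with the minimal expected boundary error.

    alignments: list of boundary-time lists (the N-best hypotheses)
    posteriors: P(S_j | O) for each hypothesis, assumed to sum to 1
    """
    expected = []
    for s_i in alignments:
        # Expected error of s_i: posterior-weighted error against every
        # hypothesis, since the true alignment is unknown at decoding time.
        e = sum(p * boundary_error(s_i, s_j)
                for s_j, p in zip(alignments, posteriors))
        expected.append(e)
    best = min(range(len(alignments)), key=expected.__getitem__)
    return alignments[best], expected[best]

# Toy usage: three 4-boundary hypotheses (frame indices) with their posteriors.
hyps = [[10, 25, 40, 60], [11, 25, 42, 60], [9, 24, 40, 59]]
post = [0.5, 0.3, 0.2]
best, err = mbe_rescore(hyps, post)
print(best, err)
```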
Experimental Setup
• TIMIT corpus
  • Training set: 4546 utterances (3.87 hrs)
  • Test set: 1646 utterances (1.41 hrs)
• Context-independent HMMs are used
  • 50 phone models (left-to-right topology)
  • 3 states per HMM
  • 5265 mixtures in total
• Frame window size and shift are 20 ms and 5 ms, respectively
• 12 MFCCs and log energy, plus their first and second differences
• CMS and CVN are applied
• Evaluation metric
  • Percentage of boundaries placed within a given tolerance of the true boundaries
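As one concrete realization of this front end, a librosa-based sketch; the 16 kHz sampling rate, the file name, and the use of c0 as the log-energy term are my assumptions:

```python
# Front-end sketch under the stated setup (librosa is my choice of toolkit).
import librosa
import numpy as np

y, sr = librosa.load("utt.wav", sr=16000)  # hypothetical utterance file

# 20 ms window (320 samples) and 5 ms shift (80 samples) at 16 kHz;
# 13 coefficients, with c0 standing in for the log-energy term.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=320, hop_length=80, win_length=320)
d1 = librosa.feature.delta(mfcc)             # first differences
d2 = librosa.feature.delta(mfcc, order=2)    # second differences
feat = np.vstack([mfcc, d1, d2])             # 39 dims per frame

# Per-utterance cepstral mean subtraction and variance normalization (CMS/CVN).
feat = (feat - feat.mean(axis=1, keepdims=True)) \
       / (feat.std(axis=1, keepdims=True) + 1e-8)
```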
Experimental Results
[Table: boundary accuracy; the figures are not recoverable from the extracted slide]
Phonetic Boundary Refinement Using Support Vector Machine
Boundary Refinement Using SVM
[Figure: acoustic signal and spectrogram; boundaries detected by HMM-based forced alignment serve as hypothesized boundaries, which are fed to an SVM boundary classifier whose output gives the refined boundaries]
Feature Extraction
• Each frame is represented by a 45-dim vector:
  • 39 MFCC-based coefficients
  • Zero crossing rate
  • Bisector frequency
  • Burst degree
  • Spectral entropy
  • General weighted entropy
  • Subband energy
• Two features are extracted for each hypothesized boundary:
  • Symmetrical Kullback-Leibler distance
  • Spectral feature transition rate
• Each hypothesized boundary is thus represented by a 92-dim augmented vector: the feature vectors of the frames immediately to its left and right, plus the two boundary features (see the sketch below).
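A sketch of how the 92-dim boundary representation could be assembled (45 + 45 + 2); the per-frame features are assumed precomputed and all names are illustrative:

```python
import numpy as np

def boundary_vector(frames, t, skl_dist, transition_rate):
    """92-dim vector for a hypothesized boundary between frames t-1 and t.
    `frames` is assumed to be a precomputed (45, T) matrix: 39 MFCC-based
    dims plus the six auxiliary measures listed above, per frame."""
    left = frames[:, t - 1]                        # 45 dims (left frame)
    right = frames[:, t]                           # 45 dims (right frame)
    extra = np.array([skl_dist, transition_rate])  # 2 boundary features
    return np.concatenate([left, right, extra])    # 45 + 45 + 2 = 92
```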
Phone-Transition-Dependent Classifiers
• Training data is always limited
  • We cannot train a classifier for each type of phone transition
• Since many phone transitions have similar acoustic characteristics, we can partition them into clusters
  • Phone transitions with little training data are covered by the classifiers of the categories they belong to
• Two methods for phone transition clustering (a KM sketch follows):
  • K-means-based clustering (KM)
  • Decision-tree-based clustering (DT)
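One plausible (assumed) realization of the KM variant: summarize each phone-transition type by the mean of its boundary vectors, then cluster the summaries with k-means so that sparse transitions share a classifier:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_transitions(vectors_by_transition, n_clusters=50):
    """vectors_by_transition: dict mapping (left_phone, right_phone) to an
    (n_i, 92) array of boundary vectors observed for that transition.
    n_clusters must not exceed the number of transition types."""
    keys = sorted(vectors_by_transition)
    # One centroid summary per transition type.
    centroids = np.stack([vectors_by_transition[k].mean(axis=0) for k in keys])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(centroids)
    # Transitions with little data inherit the classifier of their cluster.
    return {k: c for k, c in zip(keys, labels)}
```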
Training Data for the Boundary Classifier
• Positive samples: the feature vectors associated with the true phone boundaries
• Negative samples: randomly selected feature vectors at least 20 ms away from the true boundaries
[Figure: positive and negative instances along the signal feeding a classification model]
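A sketch of this sampling rule; translating 20 ms into a 4-frame margin at the 5 ms frame shift is my inference, and the positive-to-negative ratio is illustrative:

```python
import random

def sample_instances(num_frames, true_boundaries, neg_per_pos=1, margin=4):
    """Pick positive frames at the true boundaries and negatives at least
    `margin` frames (~20 ms at a 5 ms shift) away from every boundary."""
    positives = list(true_boundaries)
    far = [t for t in range(num_frames)
           if all(abs(t - b) > margin for b in true_boundaries)]
    negatives = random.sample(far, min(len(far), neg_per_pos * len(positives)))
    return positives, negatives
```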
Phonetic Boundary Refinement Using Ranking Methods
Training Data for the Boundary Classifier
• Positive samples: the feature vectors associated with the true phone boundaries
• Negative samples: randomly selected feature vectors at least 20 ms away from the true boundaries
• However, this classification-based method has two drawbacks
[Figure: positive and negative instances along the signal feeding a classification model]
First Drawback: Losing Information (1/2)
• Only information about the boundary and far-away non-boundary signal characteristics is used
• What about the information near the boundary?
[Figure: positive and negative instances feeding a classification model]
First Drawback: Losing Information (2/2)
• Preference ranking:
  • Instances extracted at the true boundaries: high preference
  • Nearby instances: medium preference
  • Far-away instances: low preference
[Figure: the same instances re-labeled with graded preferences (high at the boundary, medium nearby, low far away) instead of binary classes]
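A tiny sketch of graded preference assignment; the exact distance thresholds for the medium and low bands are illustrative, not from the paper:

```python
def preference_level(frame, true_boundary):
    """Grade an instance by its distance (in frames) to the true boundary."""
    d = abs(frame - true_boundary)
    if d == 0:
        return 2  # high preference: the true boundary itself
    if d <= 2:
        return 1  # medium preference: nearby instances
    return 0      # low preference: far-away instances
```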
Second Drawback: Imbalanced Training
• There are many negative instances but only a limited number of positive instances
• General classification algorithms will be biased toward predicting all instances as negative
  • Because they are trained to minimize the number of incorrectly classified instances
Boundary Refinement as a Ranking Problem
• Learn a function H: X → R where H(xi) > H(xj) means instance xi is preferred to xj
• The hypothesized boundary closest to the true boundary should receive the highest score
• We only care about the relative order
  • Correct order: A-B-C-D
  • OK: {A: -100, B: -10, C: 0, D: 1000}
  • OK: {A: 0.1, B: 0.3, C: 0.4, D: 0.41}
• We exploit two learning-to-rank methods (a pairwise sketch follows):
  • Ranking SVM
  • RankBoost
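A minimal pairwise ranking sketch in the spirit of Ranking SVM (linear score H(x) = w·x with a hinge loss on preference pairs); the training loop and hyperparameters are my assumptions, and RankBoost would differ in detail:

```python
import numpy as np

def train_pairwise(pairs, dim, epochs=10, lr=0.1, reg=1e-4):
    """pairs: list of (x_pref, x_other) numpy vectors, where x_pref
    should outrank x_other (e.g. it is closer to the true boundary)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_p, x_o in pairs:
            # Hinge loss max(0, 1 - (H(x_p) - H(x_o))): step whenever the
            # preferred instance fails to win by a margin of 1.
            if w @ (x_p - x_o) < 1.0:
                w += lr * (x_p - x_o)
            w -= lr * reg * w  # L2 regularization
    return w

def refine(w, candidates):
    """Return the index of the hypothesized boundary with the highest H(x)."""
    return max(range(len(candidates)), key=lambda i: float(w @ candidates[i]))
```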
Generation of Training Pairs
• Four ordered ranking lists are generated from each true boundary
• The preferred instances are coupled with each remaining instance to form training pairs
[Figure: instances around the true boundary arranged into four ordered lists (1)-(4), from which preference pairs are drawn]
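The exact construction of the four lists is not recoverable from the extracted slide; as an approximation, this sketch builds ordered lists walking away from the true boundary on each side and pairs closer (preferred) instances with farther ones:

```python
def pairs_from_boundary(frames_feat, b, span=4):
    """frames_feat: dict mapping frame index to its feature vector;
    b: true boundary frame. Returns (preferred, other) training pairs."""
    pairs = []
    for step in (-1, 1):  # walk left, then right, from the boundary
        ranked = [frames_feat[b + step * k] for k in range(span + 1)
                  if (b + step * k) in frames_feat]
        # ranked[0] is the true-boundary instance; earlier entries are
        # closer to the boundary and therefore preferred.
        for i in range(len(ranked)):
            for j in range(i + 1, len(ranked)):
                pairs.append((ranked[i], ranked[j]))
    return pairs
```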
References
• Jen-Wei Kuo and Hsin-Min Wang, "Minimum Boundary Error Training for Automatic Phonetic Segmentation," ICSLP 2006.
• Jen-Wei Kuo and Hsin-Min Wang, "A Minimum Boundary Error Framework for Automatic Phonetic Segmentation," ISCSLP 2006.
• Hung-Yi Lo and Hsin-Min Wang, "Phonetic Boundary Refinement Using Support Vector Machine," ICASSP 2007.
• Hung-Yi Lo and Hsin-Min Wang, "Phone Boundary Refinement Using Ranking Methods," ISCSLP 2010.