OSU ASAT Status Report
Jeremy Morris, Yu Wang, Ilana Bromberg, Eric Fosler-Lussier, Keith Johnson
13 October 2006
Personnel changes
• Jeremy and Yu are not currently on the project
  • Jeremy is being funded on an AFRL/DAGSI project (lexicon learning from orthography)
    • However, he is continuing to help in his spare time
  • Yu is currently in transition
• New student (to some!): Ilana Bromberg
  • Technically funded as of 10/1, but ran some experiments for an ICASSP paper in September
  • Still sorting out her project for this year
Future potential changes
• May transition in another student in WI 06
• Carry on further with some of Jeremy’s experiments
What’s new?
• First pass on the parsing framework
  • Last time: talked about different models (Naïve Bayes, Dirichlet modeling, MaxEnt models)
  • This time: settled on a Conditional Random Fields framework
  • Monophone CRF phone recognition beats triphone HTK recognition using attribute detectors
  • Ready for your inputs!
• More boundary work
  • Small improvements seen in integrating boundary information into HMM recognition
  • Still to be seen whether it helps CRFs
Parsing
• Desired: the ability to combine the outputs of multiple, correlated attribute detectors to produce
  • Phone sequences
  • Word sequences
• Handle both semi-static & dynamic events
  • Traditional phonological features
  • Landmarks, boundaries, etc.
• CRFs are a good bet for this
Conditional Random Fields
• A form of discriminative modelling
  • Has been used successfully in various domains, such as part-of-speech tagging and other natural language processing tasks
• Processes evidence bottom-up
  • Combines multiple features of the data
  • Builds the probability P(sequence | data): the conditional probability of the label sequence given the data
• Minimal assumptions about the input
  • Inputs don’t need to be decorrelated (cf. diagonal-covariance HMMs)
Conditional Random Fields
• CRFs are based on the idea of Markov Random Fields
  • Modelled as an undirected graph connecting labels with observations
  • Observations in a CRF are not modelled as random variables
[Figure: an undirected chain of labels /k/ /k/ /iy/ /iy/ /iy/, each connected to an observation X. State functions help determine the identity of each state; transition functions add associations between transitions from one label to another.]
Conditional Random Fields
• A linear-chain CRF takes the form
  P(y | x) = (1/Z(x)) exp( Σ_t [ Σ_j λ_j f_j(x, y_t) + Σ_k μ_k g_k(x, y_{t-1}, y_t) ] )
  where the f_j are state feature functions with weights λ_j, and the g_k are transition feature functions with weights μ_k
  • One possible state feature function for our attributes and labels: f([x is stop], /t/)
  • One possible weight value for that state feature: λ = 10 (strong)
  • One possible transition feature function: g(x, /iy/, /k/), indicating /k/ followed by /iy/
  • One possible weight value for that transition feature: μ = 4
• The Hammersley-Clifford theorem states that a random field is an MRF iff it can be described in the above form
  • The exponential is the sum of the clique potentials of the undirected graph
Conditional Random Fields
• Conceptual overview (a scoring sketch follows below)
  • Each attribute of the data we are trying to model fits into a feature function that associates the attribute with a possible label
    • A positive value if the attribute appears in the data
    • A zero value if the attribute is not in the data
  • Each feature function carries a weight that gives the strength of that feature function for the proposed label
    • High positive weights indicate a good association between the feature and the proposed label
    • High negative weights indicate a negative association between the feature and the proposed label
    • Weights close to zero indicate the feature has little or no impact on the identity of the label
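To make the overview concrete, here is a minimal sketch of linear-chain CRF scoring; it is not the project’s Java implementation, and all function and variable names in it are hypothetical. The brute-force normalizer stands in for the forward-backward recursion a real system would use.

```python
# Illustrative linear-chain CRF scorer (hypothetical names throughout).
import numpy as np
from itertools import product

def crf_sequence_prob(obs_feats, label_seq, state_w, trans_w, label_set):
    """P(label sequence | observations) for a linear-chain CRF.

    obs_feats : list of per-frame feature-function value vectors
    label_seq : proposed label sequence, same length as obs_feats
    state_w   : dict label -> weight vector over state feature functions
    trans_w   : dict (prev_label, label) -> transition feature weight
    """
    def score(seq):
        s = 0.0
        for t, x in enumerate(obs_feats):
            s += np.dot(state_w[seq[t]], x)         # state feature functions
            if t > 0:
                s += trans_w[(seq[t - 1], seq[t])]  # transition feature functions
        return s

    # Z(x): brute-force sum over all label sequences; a real system would
    # compute this with forward-backward dynamic programming.
    Z = sum(np.exp(score(seq))
            for seq in product(label_set, repeat=len(obs_feats)))
    return np.exp(score(tuple(label_seq))) / Z
```

The returned value is exactly the exponentiated sum of clique potentials from the previous slide, normalized over all competing label sequences.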
Experimental Setup
• Attribute detectors: ICSI QuickNet neural networks
• Two different types of attributes
  • Phonological feature detectors
    • Place, manner, voicing, vowel height, backness, etc.
    • Features are grouped into eight classes, each class having a variable number of possible values based on the IPA phonetic chart (illustrated below)
  • Phone detectors
    • Neural network outputs based on the phone labels, one output per label
• The classifiers were applied to 2960 utterances from the TIMIT training set
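As a purely hypothetical illustration of that grouping: the slide names five of the eight feature classes, and every value list below is a placeholder loosely based on the IPA chart, not the actual ASAT inventory.

```python
# Hypothetical sketch of the eight-class attribute grouping; names and
# values are placeholders, not the real detector set.
FEATURE_CLASSES = {
    "place":    ["labial", "dental", "alveolar", "palatal", "velar", "glottal", "n/a"],
    "manner":   ["stop", "fricative", "nasal", "approximant", "vowel", "n/a"],
    "voicing":  ["voiced", "voiceless", "n/a"],
    "height":   ["high", "mid", "low", "n/a"],
    "backness": ["front", "central", "back", "n/a"],
    # ...the remaining three classes are not named on the slide
}
```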
Experimental Setup
• The outputs from the neural nets are themselves treated as feature functions for the observed sequence: each attribute/label combination gives us the value of one feature function
• Note that this makes the feature functions non-binary (see the sketch below)
  • Different from most NLP uses of CRFs
  • Along the lines of Gaussian-based CRFs (e.g., Microsoft)
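A hedged sketch of that feature construction: for a hypothesized label, each attribute’s feature function simply takes the MLP’s posterior for that attribute as its real-valued output. The attribute names below are illustrative, not the actual detector outputs.

```python
import numpy as np

def state_feature_values(attr_posteriors, hyp_label, attr_names):
    """Map one frame's MLP posteriors to real-valued feature functions:
    f_attr(x_t, y_t) = P(attr | x_t) for the hypothesized label y_t."""
    return {(a, hyp_label): float(p)
            for a, p in zip(attr_names, attr_posteriors)}

# Example: a frame the nets consider a voiceless stop, hypothesizing /t/
feats = state_feature_values(np.array([0.85, 0.02, 0.07]), "/t/",
                             ["stop", "fricative", "voiced"])
```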
Experiment 1
• Goal: implement a conditional random field model on ASAT-style phonological feature data
  • Perform phone recognition
  • Compare results to those obtained via a Tandem HMM system
Experiment 1 - Results
• A CRF system trained on monophones with these features achieves accuracy superior to an HMM trained on monophones
• The CRF comes close to the HMM’s triphone accuracy
• The CRF uses far fewer parameters
Experiment 2
• Goals:
  • Apply the CRF model to phone classifier data
  • Apply the CRF model to combined phonological feature classifier data and phone classifier data
  • Perform phone recognition
  • Compare results to those obtained via a Tandem HMM system
Experiment 2 - Results
• Note that the Tandem HMM result is its best result, obtained using only the top 39 features after a principal components analysis
Experiment 3
• Goal:
  • Previous CRF experiments used phone posteriors for the CRF, but linear outputs transformed via a Karhunen-Loeve transform (KLT) for the HMM system
  • This transformation is needed to improve HMM performance by decorrelating the inputs (a sketch of the KLT follows below)
  • Using the same linear outputs as the HMM system, do our results change?
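A minimal sketch of the Karhunen-Loeve transform used for the HMM inputs, assuming the linear MLP outputs are stacked into a frames-by-dimensions matrix (names illustrative): project onto the eigenvectors of the training-set covariance so the resulting dimensions are uncorrelated.

```python
import numpy as np

def klt(train_feats, feats, n_components=None):
    """Decorrelate feats with a KL transform estimated on train_feats.
    Both arrays are (num_frames, dim) matrices of linear MLP outputs."""
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)   # dim x dim covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # highest variance first
    basis = eigvecs[:, order[:n_components]]
    return (feats - mean) @ basis             # decorrelated features
```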
Experiment 3 - Results
• Also shown: adding both feature sets together, i.e., giving the system supposedly redundant information, leads to a gain in accuracy
Experiment 4
• Goal:
  • Previous CRF experiments did not allow for realignment of the training labels
    • Label boundaries provided by the TIMIT hand transcribers were used throughout training
    • HMM systems, by contrast, are allowed to shift boundaries during EM learning
  • If we allow for realignment in our training process, can we improve the CRF results?
Experiment 4 - Results
• Allowing realignment gives accuracy results for a monophone-trained CRF that are superior to a triphone-trained HMM, with fewer parameters
Code status
• Current version: Java-based, multithreaded
  • TIMIT training takes a few days on an 8-processor machine
• At test time, the CRF generates an AT&T FSM lattice
  • We then use the AT&T FSM tools to decode
  • This will (hopefully) make it easier to decode words
• The code is stable enough to try different kinds of experiments quickly
  • Ilana joined the group and ran an experiment within a month
Joint models of attributes
• Monica’s work showed that modeling attribute detection with joint detectors works better
  • e.g., modeling manner and place jointly is better
  • cf. Chang et al.: hierarchical detectors work better
• This study: can we improve phonetic attribute-based detection by using phone classifiers and summing?
  • Phone classifier: the ultimate joint model
Independent vs Joint Feature Modeling
• Baseline 1: 61 phone posteriors (joint modeling)
• Baseline 2: 44 feature posteriors (independent modeling)
• Experiment: feature posteriors derived from the 61 phone posteriors
  • In each frame, the weight for each feature is the summed weight of every phone exhibiting that feature
  • e.g., P(stop) = P(/p/) + P(/t/) + P(/k/) + P(/b/) + P(/d/) + P(/g/)
  • (See the sketch below.)
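A minimal sketch of that summing, assuming a fixed phone-to-feature membership matrix over a toy phone set; the matrix below is a hypothetical fragment, not the full 61-phone/44-feature mapping.

```python
import numpy as np

PHONES = ["p", "t", "k", "b", "d", "g", "iy"]
FEATURES = ["stop", "voiced"]
# membership[i, j] = 1 if phone j exhibits feature i
membership = np.array([
    [1, 1, 1, 1, 1, 1, 0],   # stop: p t k b d g
    [0, 0, 0, 1, 1, 1, 1],   # voiced: b d g iy
])

def feature_posteriors(phone_post):
    """phone_post: (num_frames, len(PHONES)) posteriors per frame.
    Returns (num_frames, len(FEATURES)) summed feature posteriors."""
    return phone_post @ membership.T
```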
Continued work on phone boundary detection
• Basic idea: eventually we want to use these as transition functions in the CRF
  • The CRF was still under development when this study was done
• Added features corresponding to P(boundary | data) to the HMM (a sketch follows below)
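A minimal sketch of that feature concatenation, with illustrative shapes: the MLP’s per-frame boundary posterior is appended to the standard 39-dimensional MFCC vector before HMM training.

```python
import numpy as np

def add_boundary_features(mfccs, boundary_post):
    """mfccs: (num_frames, 39); boundary_post: (num_frames,) P(boundary|data).
    Returns a (num_frames, 40) feature matrix for the HMM front end."""
    return np.hstack([mfccs, boundary_post[:, None]])
```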
Phone Boundary Detection Evaluation and Results
• Input features: phonological features, acoustic features (PLP), and phone classifier outputs
• Classification methods: MLP and metric-based methods
• Results:
  • Using phonological features as the input representation was modestly better than using the phone posterior estimates themselves
  • Phonological feature representations also seemed to edge out direct acoustic representations, though phonological feature MLPs are more complex to train
  • The nonlinear representations learned by the MLP were better for boundary detection than metric-based methods
Proposed Five-state HMM Model & Experiments
• Question: how to incorporate phone boundaries, estimated by a multi-layer perceptron (MLP), into an HMM system
• To integrate phone boundary information into speech recognition, the phone boundary information was concatenated to the MFCCs as additional input features
• A five-state HMM phone model captures the boundary information
  • We explicitly model the entering and exiting states of a phone as separate, one-frame distributions
  • The two additional boundary states are intended to catch phone-boundary transitions, while the three self-looped states in the center model phone-internal information
  • Escape arcs are included to bypass the boundary states for short phones (see the transition-matrix sketch below)
• Experiments
  • For simplicity, the linear outputs from the PLP+MLP detector were used as the phone boundary features, instead of those from the features+MLP detector
  • Several experiments were conducted:
    0) Baseline system: standard 39 MFCCs
    1) MFCCs + phone boundary features (no KLT)
    2) MFCCs + phone boundary features decorrelated using the Karhunen-Loeve transform (KLT)
    3) MFCCs + phone boundary features, with a KL transform over all features
    4) MFCCs (KLTed), to show the effect of the KL transform on the MFCCs alone
  • Training and recognition were conducted with the HTK toolkit on the TIMIT data set
    • At the 4-mixture stage, some experiments failed due to data sparsity; we adopted a hybrid 2/4-mixture strategy, promoting triphones to 4 mixtures when the data was sufficient
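A hedged sketch of the proposed model’s transition structure, written as an HTK-style transition matrix with non-emitting entry and exit states; the probability values are placeholders, not trained numbers.

```python
import numpy as np

# States: 0 = entry (non-emitting), 1 = entering-boundary (one frame),
# 2-4 = self-looped phone-internal states, 5 = exiting-boundary (one
# frame), 6 = exit (non-emitting). Escape arcs 0->2 and 4->6 let short
# phones bypass the boundary states.
A = np.zeros((7, 7))
A[0, 1] = 0.9            # entry -> entering-boundary state
A[0, 2] = 0.1            # escape arc: skip the entering boundary
A[1, 2] = 1.0            # boundary state emits exactly one frame
A[2, 2] = A[3, 3] = 0.6  # phone-internal self-loops
A[2, 3] = A[3, 4] = 0.4
A[4, 4] = 0.6
A[4, 5] = 0.3            # internal -> exiting-boundary state
A[4, 6] = 0.1            # escape arc: skip the exiting boundary
A[5, 6] = 1.0            # exiting boundary emits one frame, then leaves
assert np.allclose(A[:6].sum(axis=1), 1.0)  # each state's arcs sum to 1
```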
Results & Conclusion

Inputs                       3-state phone rec. acc.    5-state phone rec. acc.
Baseline: MFCC               62.37%                     63.41%
1) MFCC + Boundaries         61.25%                     62.79%
2) MFCC + KLT(Boundaries)    62.47% (16-mix: 67.22%)    63.78% (16-mix: 68.02%)
3) KLT(MFCC + Boundaries)    63.20%                     64.38%
4) KLT(MFCC)                 62.70%                     -

Results
• The proposed 5-state HMM models performed better than their conventional 3-state counterparts on all training datasets
• Decorrelation improved recognition accuracy with the binary boundary features
• Including the MFCCs in the decorrelation improved recognition further
• For comparison, several experiments were also conducted on a 5-state HMM with a traditional left-to-right, all-self-loops transition matrix; the results showed vastly increased deletions, indicating a bias against short-duration phones, whereas the proposed model is balanced between insertions and deletions
• Recently, I modified the decision-tree questions in the tied-state triphone step and pushed the models to 16-mixture Gaussians; part of those results are shown in the table above

Conclusion
• Phonological features perform better than acoustic features as inputs to phone boundary classifiers; the results suggest that pattern changes in the phonological feature space may lead to robust boundary detection
• By exploring the potential space of boundary representations, we argue that phonetic transitions are very important for automatic speech recognition
• HMMs can be attuned to phone-boundary transitions by explicitly modeling phone transition states
• The combined strategy of binary boundary features, KLT, and the 5-state representation gives almost a 2% absolute improvement in phone recognition; given that the boundary information we integrated is one of the simplest representations, this result is rather encouraging
• In future work, we hope to integrate phone boundary information as additional features in the CRF