Learning of Word Boundaries in Continuous Speech using Time Delay Neural Networks • Colin Tan, School of Computing, National University of Singapore • ctank@comp.nus.edu.sg
Motivations • Humans are able to automatically segment words and sounds in speech with little difficulty. • The ability to automatically segment words and phonemes is also useful in training speech recognition engines.
Principle • Time-Delay Neural Network • Input nodes have shift registers that allow the TDNN to generalize not only between discrete input-output pairs, but also over time. • Ability to learn true word boundaries given reasonably good initial estimations. • We make use of this property for our work.
Why TDNN? • Representational simplicity • Intuitively easy to understand what inputs to TDNN and outputs to TDNN represent. • Ability to generalize over time • Hidden Markov Models have been left out of this work for now.
Time Delay Neural Networks • Diagram shows a 2-input TDNN node. • Constrained weights allow generalization over time.
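To illustrate how constrained weights give generalization over time, here is a minimal sketch of a single time-delay node. The function name `tdnn_node`, the sigmoid activation, and the example weights are assumptions made for illustration; the slides do not specify them. The same weight set is applied at every time step, so a pattern that arrives one frame later produces the same output one frame later.

```python
import math


def tdnn_node(frames, weights, bias=0.0):
    """Apply one time-delay node to a sequence of feature frames.

    frames  : list of frames, each a list of n_inputs floats
    weights : weights[d][i] is the weight for input i at delay d
    The same weights are reused at every time step (the constraint).
    The sigmoid activation is an assumption, not taken from the slides.
    """
    n_delays = len(weights)
    outputs = []
    for t in range(n_delays - 1, len(frames)):
        total = bias
        for d in range(n_delays):
            frame = frames[t - d]  # the frame delayed by d steps
            total += sum(w * x for w, x in zip(weights[d], frame))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


# A 2-input node that sees the current frame plus two delayed frames.
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
pattern = [[0, 0], [1, 0], [0, 1], [0, 0], [0, 0]]
delayed = [[0, 0], [0, 0], [1, 0], [0, 1], [0, 0]]  # same pattern, one frame later
out_pattern = tdnn_node(pattern, weights)
out_delayed = tdnn_node(delayed, weights)
```

Here `out_delayed` is `out_pattern` shifted by one position, which is the time-invariance property the slide refers to.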
Boundary Shift Algorithm • Initially: • The TDNN is trained on a small, manually segmented set of data. • Given the expected number of words in a new, unseen utterance, the cepstral frames in the utterance are distributed evenly over all the words. • For example, if there are 2,000 frames and 10 expected words, each word is allocated 200 frames. • Convex-Hull and Spectral Variation Function methods may be used to estimate the number of words in the utterance. • For our experiments we manually counted the number of words in each utterance.
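The even initial allocation can be sketched as follows. The function name `initial_boundaries` is hypothetical, and the way leftover frames are handed to the first few words is an assumption; the slides only cover the case where the frame count divides evenly.

```python
def initial_boundaries(n_frames, n_words):
    """Return one (start, end) frame range per word, with end exclusive."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    base, extra = divmod(n_frames, n_words)
    segments = []
    start = 0
    for w in range(n_words):
        # Assumption: any remainder is spread over the first `extra` words.
        length = base + (1 if w < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


segments = initial_boundaries(2000, 10)
```

With 2,000 frames and 10 words, every segment spans 200 frames, matching the example on the slide.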
Boundary Shift Algorithm • The minimally trained TDNN is retrained using both its original data and the new unseen data. • After retraining, a variable-sized window is placed around each boundary. • The window is initially +/- 16 frames. • A search is made within the window for the highest scoring frame. The boundary is shifted to that frame. • The search may extend past boundaries into neighboring words. • The TDNN is retrained using the new boundaries.
Boundary Shift Algorithm • Windows are adjusted by +/- 2 frames (i.e. reduced by a total of 4 frames), and the window search, boundary shift, and retraining steps are repeated. • The algorithm ends when boundary shifts are negligible, or when the windows shrink to 0 frames.
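The search, shift, and retrain loop described on these slides might be sketched as below. This is an illustration under stated assumptions, not the author's implementation: `score_fn` and `retrain_fn` are hypothetical callables standing in for the TDNN, and `min_shift` is an assumed definition of a "negligible" shift. The sketch also does not guard against two boundaries landing on the same frame.

```python
def shift_boundaries(scores, boundaries, half_width):
    """Move each boundary to the highest-scoring frame within +/- half_width."""
    shifted = []
    for b in boundaries:
        lo = max(0, b - half_width)
        hi = min(len(scores) - 1, b + half_width)
        best = max(range(lo, hi + 1), key=lambda t: scores[t])
        shifted.append(best)
    return shifted


def boundary_shift(score_fn, retrain_fn, boundaries,
                   half_width=16, step=2, min_shift=1):
    """Repeat search, shift, and retrain while shrinking the window.

    score_fn()             -> per-frame boundary scores from the current network
    retrain_fn(boundaries) -> retrains the network on the shifted boundaries
    Both callables are hypothetical placeholders for the TDNN.
    """
    while half_width > 0:
        new_boundaries = shift_boundaries(score_fn(), boundaries, half_width)
        largest = max((abs(n - o) for n, o in zip(new_boundaries, boundaries)),
                      default=0)
        boundaries = new_boundaries
        retrain_fn(boundaries)
        if largest < min_shift:  # boundary shifts are negligible
            break
        half_width -= step       # 2 frames off each side, 4 in total
    return boundaries


# Toy stand-in for the network: fixed scores with peaks at frames 5 and 30.
scores = [0.0] * 40
scores[5] = 0.9
scores[30] = 0.8
retrain_log = []
final = boundary_shift(lambda: scores, retrain_log.append, [10, 25])
```

In this toy run the boundaries move from frames 10 and 25 to the two score peaks, and the loop stops on the second pass because nothing moves any further.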
Network Pruning • Limited training data leads to the problem of over-fitting. • Three parameters are used to decide which TDNN nodes to prune. • Significance j(max), which measures how much a particular node contributes to the final answer. A node with a small Significance value contributes little to the final answer and can be pruned.
Network Pruning • Three parameters are used to prune the TDNN: • The variance j, which measures how much a particular node changes over all the inputs. A node that changes very little over all the inputs is not contributing to the learning, and can be removed. • Pairwise node distance ji, which measures how one node changes with respect to another. A node that follows another node closely in value is redundant and can be removed.
Network Pruning • Thresholds are set for each parameter. Nodes with parameters falling below these thresholds are pruned. • Selection of thresholds is critical. • Pruning is performed after the TDNN has been trained on the initial set for about 200 cycles.
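As a rough illustration of threshold-based pruning, the sketch below covers two of the three parameters. The exact formulas are not given in the slides, so both definitions here are assumptions: variance is taken as the population variance of a node's activations over all inputs, and pairwise distance as the mean squared difference between two nodes' activations. Significance is left out because the slides do not say how it is computed. All function names are hypothetical.

```python
def variance(activations):
    """Population variance of one node's activations over all inputs."""
    mean = sum(activations) / len(activations)
    return sum((a - mean) ** 2 for a in activations) / len(activations)


def pairwise_distance(a, b):
    """Mean squared difference between two nodes' activations (assumed form)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_nodes_to_prune(acts, var_threshold, dist_threshold):
    """acts[j] holds the activations of hidden node j over all inputs."""
    pruned = set()
    # Nodes that barely change over the inputs contribute nothing to learning.
    for j, a in enumerate(acts):
        if variance(a) < var_threshold:
            pruned.add(j)
    # Nodes that closely follow an earlier surviving node are redundant.
    for j in range(len(acts)):
        if j in pruned:
            continue
        for i in range(j):
            if i not in pruned and pairwise_distance(acts[j], acts[i]) < dist_threshold:
                pruned.add(j)
                break
    return sorted(pruned)


acts = [
    [0.1, 0.9, 0.2, 0.8],  # informative node
    [0.5, 0.5, 0.5, 0.5],  # never changes
    [0.1, 0.9, 0.2, 0.8],  # copy of node 0
    [0.9, 0.1, 0.7, 0.3],  # informative and distinct
]
to_prune = select_nodes_to_prune(acts, var_threshold=0.01, dist_threshold=0.001)
```

In this toy example the constant node and the duplicate node are selected, while the two distinct, varying nodes are kept. The threshold values are arbitrary; as the slide notes, choosing them well is critical.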
Experiments • TDNN Architecture • 27 inputs: 13 dcep coefficients, 13 ddcep coefficients, and power. • 5 input delays. • 96 hidden nodes: arbitrarily chosen, to be pruned later. • 2 binary output nodes: these represent word start and end boundaries.
Experiments • Data gathered from 6 speakers • 3 male, 3 female. • Solving task similar to CISD Trains Scheduling Problem (Ferguson 96). • About 20-30 minutes of speech used to train TDNN. • 20 utterances, previously unseen, chosen to evaluate performance.
Experiment Results: Performance Before Pruning • Results shown relative to hand-labeled samples. • Inside Test: Precision 66.88%, Recall 67.33%, F-Number 67.07% • Outside Test: Precision 56.22%, Recall 76.69%, F-Number 64.88%
Experiment Results: Performance After Pruning • Inside Test: Precision 66.03%, Recall 61.41%, F-Number 63.61% • Outside Test: Precision 57.10%, Recall 72.16%, F-Number 61.71%
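For readers who want to compute comparable figures, here is a minimal sketch of scoring detected boundaries against hand-labeled ones. It assumes the F-Number is the usual harmonic mean of precision and recall, and that a detected boundary counts as correct when it lies within `tolerance` frames of an unmatched hand-labeled boundary. The slides state neither the matching rule nor the tolerance, so both are assumptions, as are the function names.

```python
def f_number(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_boundaries(predicted, reference, tolerance=0):
    """Return (precision, recall, F) for predicted vs hand-labeled boundaries."""
    unused = sorted(reference)
    hits = 0
    for p in sorted(predicted):
        # Greedily match the earliest unused reference boundary in range.
        match = next((r for r in unused if abs(r - p) <= tolerance), None)
        if match is not None:
            unused.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall, f_number(precision, recall)


# Toy frame indices, not data from the experiments.
precision, recall, f = score_boundaries([10, 52, 90], [10, 50, 70, 91], tolerance=2)
```

In the toy example all three detected boundaries match, but one hand-labeled boundary is missed (a deletion error), giving a precision of 1.0 and a recall of 0.75. Applied to the outside-test figures before pruning, `f_number(56.22, 76.69)` gives about 64.88, in line with the reported value.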
Example Utterances • Subject: CK • Utterance: Ok thanks, now I need to find out how long does it need to travel from Elmira to Corning • Segments: (okay) (th-) (-anks) (now) (i need) (to) (find) (out how) (long) (does it need) (to) (travel) (f-) (-om) (emira) (to c-) (orning)
Example Utterances • Subject: CT • Utterance: May I know how long it takes to travel from Elmira to Corning? • Segments: (may i) (know how) (long) (does it) (take) (to tr-) (-avel) (from) (el-) (-mira) (to) (corn-) (-ning)
Deletion Errors • Most prominent in places framed by plosives. • Algorithm able to detect boundaries at ends of the phrase but not in middle, due to presence of ‘d’ plosives at the ends.
Insertion Errors • Most prominent in places where a vowel is stretched.
Recommendations for Further Work • Results presented are early research results, and are promising. • Future work will combine TDNN with other statistical methods like Expectation Maximization.