Learning of Word Boundaries in Continuous Speech using Time Delay Neural Networks

Colin Tan, School of Computing, National University of Singapore. ctank@comp.nus.edu.sg

Motivations • Humans are able to segment words and sounds in speech automatically, with little difficulty. • The ability to segment words and phonemes automatically is also useful in training speech recognition engines.

Principle • Time-Delay Neural Network • Input nodes have shift registers that allow the TDNN to generalize not only between discrete input-output pairs, but also over time. • The TDNN is able to learn true word boundaries given reasonably good initial estimates. • We make use of this property for our work.

Why TDNN? • Representational simplicity • It is intuitively easy to understand what the inputs and outputs of a TDNN represent. • Ability to generalize over time • Hidden Markov Models have been left out of this work for now.

Time Delay Neural Networks • Diagram shows a 2-input TDNN node. • Constrained weights allow generalization over time.
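The weight sharing described on this slide can be illustrated with a minimal sketch. It assumes a single sigmoid node whose weights are tied across time; the function name `tdnn_node`, the array layout, and the sigmoid activation are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def tdnn_node(frames, weights, bias=0.0):
    """Apply one time-delay node to every window of consecutive frames.

    frames:  array of shape (T, n_inputs), one row per time step
    weights: array of shape (n_delays + 1, n_inputs); row d weights the
             input delayed by d frames (hypothetical layout)
    Returns one activation per complete window (length T - n_delays).
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    span = weights.shape[0]
    outputs = []
    for t in range(span - 1, frames.shape[0]):
        # window[0] is the current frame, window[d] is the frame d steps back
        window = frames[t - span + 1 : t + 1][::-1]
        activation = np.sum(window * weights) + bias
        # sigmoid squashing is an assumption; the slides do not name one
        outputs.append(1.0 / (1.0 + np.exp(-activation)))
    return np.array(outputs)
```

Because the same weights are applied at every time step, shifting an input pattern in time shifts the node's response by the same number of frames, which is one way to read "generalization over time".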

Boundary Shift Algorithm • Initially: • The TDNN is trained on a small manually segmented set of data. • Given the expected number of words in a new, unseen utterance, the cepstral frames in the utterance are distributed evenly over all the words. • For example, if there are 2,000 frames and 10 expected words, each word is allocated 200 frames. • Convex-Hull and Spectral Variation Function methods may be used to estimate the number of words in the utterance. • For our experiments we manually counted the number of words in each utterance.
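The even initial allocation can be sketched as follows. The function name `initial_boundaries` and the half-open (start, end) frame ranges are assumptions made for illustration; the slides only state that frames are divided evenly among the expected words.

```python
def initial_boundaries(num_frames, num_words):
    """Split num_frames evenly into num_words half-open (start, end) ranges."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return [
        (w * num_frames // num_words, (w + 1) * num_frames // num_words)
        for w in range(num_words)
    ]

segments = initial_boundaries(2000, 10)
# Each of the 10 words receives 200 frames, as in the example on the slide.
```

When the frame count does not divide evenly, integer division spreads the remainder so that every frame still belongs to exactly one word.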

Boundary Shift Algorithm • The minimally trained TDNN is retrained using both its original data and the new unseen data. • After retraining, a variable-sized window is placed around each boundary. • The window is initially +/- 16 frames. • A search is made within the window for the highest scoring frame. The boundary is shifted to that frame. • The search may extend past boundaries into neighboring words. • The TDNN is retrained using the new boundaries.

Boundary Shift Algorithm • Windows are adjusted by +/- 2 frames (i.e. reduced by a total of 4 frames), and steps 3 to 5 are repeated. • Algorithm ends when boundary shifts are negligible, or windows shrink to 0 frames.
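The search-and-shrink loop can be sketched roughly as below. This is a simplified illustration, not the author's implementation: `scores_fn` (returning one boundary score per frame from the current network) and `retrain_fn` (retraining on the shifted boundaries) are hypothetical callbacks, and the tie-breaking rule and the `min_shift` stopping threshold are assumptions.

```python
def shift_boundaries(boundaries, scores_fn, retrain_fn, num_frames,
                     half_window=16, shrink=2, min_shift=0):
    """Move each boundary to the best-scoring frame inside a shrinking window."""
    boundaries = list(boundaries)
    while half_window > 0:
        scores = scores_fn()  # hypothetical: per-frame scores from the TDNN
        shifted = []
        for b in boundaries:
            lo = max(0, b - half_window)
            hi = min(num_frames - 1, b + half_window)
            # Highest score wins; ties go to the frame nearest the old boundary.
            best = max(range(lo, hi + 1),
                       key=lambda f: (scores[f], -abs(f - b)))
            shifted.append(best)
        total_shift = sum(abs(n - o) for n, o in zip(shifted, boundaries))
        boundaries = shifted
        retrain_fn(boundaries)  # hypothetical: retrain on the new boundaries
        if total_shift <= min_shift:
            break  # boundary shifts are negligible
        half_window -= shrink  # +/- 2 frames, i.e. 4 frames in total
    return boundaries
```

The sketch does not stop a boundary from crossing its neighbor; a fuller version would need to decide how to handle that case.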

Network Pruning • Limited training data leads to the problem of over-fitting. • Three parameters are used to decide which TDNN nodes to prune. • Significance j(max), which measures how much a particular node contributes to the final answer. A node with a small Significance value contributes little to the final answer and can be pruned.

Network Pruning • Three parameters are used to prune the TDNN: • The variance j, which measures how much a particular node changes over all the inputs. A node that changes very little over all the inputs is not contributing to the learning, and can be removed. • Pairwise node distance ji, which measures how a node changes with respect to another. A node that follows another node closely in value is redundant and can be removed.

Network Pruning • Thresholds are set for each parameter. Nodes with parameters falling below these thresholds are pruned. • Selection of thresholds is critical. • Pruning is performed after the TDNN has been trained on the initial set for about 200 cycles.
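A threshold-based selection of nodes to prune might look like the sketch below. The slides do not give formulas for the three parameters, so the definitions here are assumptions: significance is taken as a node's largest absolute contribution to any output, variance as the variance of its activation over all inputs, and pairwise distance as the mean squared difference between two nodes' activations. The function name and arguments are likewise hypothetical.

```python
import numpy as np

def select_nodes_to_prune(activations, out_weights,
                          min_significance, min_variance, min_distance):
    """Return sorted indices of hidden nodes falling below any threshold.

    activations: (n_samples, n_hidden) hidden-node outputs over the data
    out_weights: (n_hidden, n_outputs) hidden-to-output weights
    """
    activations = np.asarray(activations, dtype=float)
    out_weights = np.asarray(out_weights, dtype=float)
    n_hidden = activations.shape[1]

    # Assumed significance: largest |activation * weight| seen for the node.
    significance = (np.abs(activations).max(axis=0)
                    * np.abs(out_weights).max(axis=1))
    # Assumed variance: how much the node's output changes over the inputs.
    variance = activations.var(axis=0)

    low = (significance < min_significance) | (variance < min_variance)
    prune = set(np.flatnonzero(low).tolist())

    # Assumed pairwise distance: mean squared difference between two nodes.
    for j in range(n_hidden):
        if j in prune:
            continue
        for i in range(j + 1, n_hidden):
            if i in prune:
                continue
            distance = np.mean((activations[:, j] - activations[:, i]) ** 2)
            if distance < min_distance:
                prune.add(i)  # node i closely follows node j: redundant
    return sorted(prune)
```

As the slide notes, the outcome depends heavily on the thresholds; the values passed in would need to be tuned for a given network.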

Experiments • TDNN Architecture • 27 Inputs • 13 dcep coefficients, 13 ddcep coefficients, power. • 5 input delays • 96 Hidden Nodes • Arbitrarily chosen, to be pruned later. • 2 Binary Output Nodes • Represent word start and end boundaries.

Experiments • Data gathered from 6 speakers • 3 male, 3 female. • Solving task similar to CISD Trains Scheduling Problem (Ferguson 96). • About 20-30 minutes of speech used to train TDNN. • 20 utterances, previously unseen, chosen to evaluate performance.

Experiment Results: Performance Before Pruning • Results shown relative to hand-labeled samples. • Inside Test: Precision 66.88%, Recall 67.33%, F-Number 67.07% • Outside Test: Precision 56.22%, Recall 76.69%, F-Number 64.88%

Experiment Results: Performance After Pruning • Inside Test: Precision 66.03%, Recall 61.41%, F-Number 63.61% • Outside Test: Precision 57.10%, Recall 72.16%, F-Number 61.71%
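The slides do not say how the three figures were computed. The sketch below shows one common way to score detected boundaries against hand-labeled ones, assuming that a detected boundary counts as correct when it lies within a small frame tolerance of an unmatched reference boundary, and that the F-number is the harmonic mean of precision and recall. The function name, the matching rule, and the `tolerance` parameter are all assumptions.

```python
def boundary_scores(predicted, reference, tolerance=0):
    """Return (precision, recall, f_number) for boundary frame indices."""
    unmatched = list(reference)
    hits = 0
    for p in predicted:
        # Match each prediction to at most one reference boundary.
        match = next((r for r in unmatched if abs(r - p) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_number = 2 * precision * recall / (precision + recall)
    return precision, recall, f_number

p, r, f = boundary_scores([10, 52, 80, 120], [10, 50, 100], tolerance=2)
```

Under this reading, insertion errors lower precision and deletion errors lower recall.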

Example Utterances • Subject: CK • Utterance: Ok thanks, now I need to find out how long does it need to travel from Elmira to Corning • (okay) (th-) (-anks) (now) (i need) (to) (find) (out how) (long) (does it need) (to) (travel) (f-) (-om) (emira) (to c-) (orning)

Example Utterances • Subject: CT • Utterance: May I know how long it takes to travel from Elmira to Corning? • (may i) (know how) (long) (does it) (take) (to tr-) (-avel) (from) (el-) (-mira) (to) (corn-) (-ning)

Deletion Errors • Most prominent in places framed by plosives. • The algorithm is able to detect boundaries at the ends of the phrase but not in the middle, due to the presence of ‘d’ plosives at the ends.

Insertion Errors • Most prominent in places where a vowel is stretched.

Recommendations for Further Work • Results presented are early research results, and are promising. • Future work will combine TDNN with other statistical methods like Expectation Maximization.
