Classification and Recognition of Adverse Dysarthric (Cerebral Palsy) Speech Using Hidden Markov Models (HMMs) and Artificial Neural Networks (ANNs) by Prasad D Polur. Dissertation Proposal presented to the guidance committee of: Dr. Gerald Miller, Dr. Paul Wetzel, Dr. Martin Lenhardt, Dr. Rosalyn Hobson, Dr. Michael King. Dept. of Biomedical Engineering, VCU, May 14, 2003
Background • Dysarthria is the most common acquired speech disorder, affecting 170 per 100,000 population. • In its severest form, dysarthric speech is unintelligible to others and may consist of vocal utterances rather than words recognizable to unfamiliar communication partners. • Some people with dysarthric speech, such as cerebral palsied individuals, are also severely motor-impaired, with limited or no control of their local environment. • The combination of speech and general physical disability can make it particularly difficult for them to interact with their environment and limits their independence. • Positive intervention/rehabilitation using modern research technology such as ASR would hence be very desirable.
Background • Dysarthric errors result from a disruption of muscular control due to lesions of either the central or peripheral nervous system, which interrupts the transmission of the messages controlling the motor movements for speech. • In the dysarthria exhibited by cerebral palsied individuals, errors are somewhat consistent and predictable, yet there are no islands of clear speech. • Errors are mainly distortions and omissions, with consonants being significantly distorted. • In this work, the term 'dysarthric/cerebral palsy speech' describes speech that is difficult to understand and recognize as a result of the speaker's disability and that is characterized by distortions and omissions. • As the rate of a dysarthric speaker's speech increases, intelligibility decreases proportionally; he/she therefore has to articulate slowly in order to be intelligible.
Background • In general, cerebral palsied individuals (who constitute a very significant percentage of dysarthric speakers) lack articulatory precision. • In intelligibility tests, errors in place of articulation were found to be generally due to consonant confusions involving alveolar, labial (/b/-/d/, /m/-/d/) and velar (/k/-/g/, /k/-/t/) sounds. • Listener intelligibility tests attribute most errors to phonemes that require extreme articulatory positions (e.g., stops such as /t/, /d/, /p/ and fricatives). • Severely impaired and mildly impaired speakers differ in the degree rather than the quality of their disability. • Hence, in view of the nature of the speech and the rate of articulation, an isolated-word ASR system would be able to assist them suitably, if appropriately implemented.
Assistive Speech Technologies • A few assistive technologies are currently being explored for positive intervention to assist motor-impaired people (including CP individuals). • A very promising area is the application of ASR tailored to adverse cerebral palsy (dysarthric) speech, which would enable the people so affected to enhance the intelligibility of their speech to an untrained listener, improve their articulation, and drive other control applications (mobility, appliance control, etc.). • Two popular ASR design types are feature based (MFCC, etc.) and statistical (VQ, etc.). • Feature-based systems compare the unknown word against their stored database. • For the application at hand, a feature-based design was considered appropriate, since it offers sufficient flexibility and performance.
Statement of Problem • A typical speech recognizer is a pattern recognition platform that takes waveform data as input, extracts information (features) from that data, uses the information to hypothesize words chosen from its vocabulary, and outputs the 'recognized' word. • The recognition of any kind of signal depends on the consistent extraction of certain unique features of that signal. • Typically the waveform contains some unwanted, irrelevant, or ambiguous information, which has a negative impact on the recognizer's performance. • The process is more complicated in the case of adverse dysarthric speech due to variations in the production of phonemes, inconsistency of articulation, and non-conformed speech patterns. • Clearly, for efficient ASR application, some level of modification of current technology is required in order to accommodate such individuals.
Summary of Relevant Literature • HMMs and, to some extent, ANNs (a more recent development) have been utilized for recognition of dysarthric speech with a limited level of success. • Deller et al. took an HMM-VQ approach to isolated dysarthric word recognition, with suppression of the transitional acoustics of the vector, and reported positive results. • Jayaram et al. took an ANN approach in which two multilayer neural networks, one taking fast Fourier transform (FFT) coefficients and the other taking formant frequencies as inputs, were developed and tested on isolated words spoken by a dysarthric speaker, and also reported good results. • Currently several projects (like ENABLE, Stardust, etc.) are being pursued (not yet published) in the US and Europe, aimed at improving the intelligibility of dysarthric speech and at control applications, using HMMs and ANNs as investigative tools.
Research Proposal • Two questions arise: Is either of these technologies (HMM or ANN) more tolerant than the other to the speech variability specifically exhibited by dysarthric speech? • Would modification of the dysarthric speech signal enhance the recognition rate over the unmodified signal in these models? • The aim of these two questions is to identify a more robust technique coupled with a more robust technology for the specific application of isolated-word recognition of cerebral palsy (dysarthric) speech. • Isolated-word recognition would be sufficient here, since adverse dysarthric speakers tend to have a very slow rate of articulation.
Research Proposal • In view of the questions raised, I propose the following: • Develop a small-vocabulary HMM-based ASR system with a left-to-right (Bakis) structure (a popular and efficient design for normal speech recognition) that specifically caters to adverse dysarthric (cerebral palsy) speech. • Develop a small-vocabulary ANN-based ASR system with a feedforward architecture (also a popular and efficient design for normal speech recognition) that specifically caters to adverse dysarthric (cerebral palsy) speech.
Research Proposal • Compare and contrast the two systems with the aim of identifying whether either provides greater tolerance to dysarthric speech specifically, as evidenced by a higher recognition percentage on the given test data set. • Such a comparison and its conclusions are valid for this application, since the same data (modified or unmodified) would be used to test both systems and the same or a similar category-search method would be employed to identify the final output. • This result, when available, should not be interpreted as meaning that one system is better than the other in general; it may instead be interpreted narrowly as indicating that one particular (popular) configuration of one system may provide greater resilience than a particular (popular) configuration of the other when specifically applied to the cerebral palsy/dysarthric speech in this study.
Research Proposal • At this stage I would like to verify whether signal modifications such as dynamic range suppression/transitional clipping of the data enhance the recognition of dysarthric speech over unmodified speech data of the same set/type. • Even though this method has been investigated recently using the VQ technique, it has not been pursued for MFCC-based feature systems (as I intend to do), so further verification of this technique is valid for this application given its specialized nature. • The most likely method for this purpose would use some distance/error measure, which needs to be determined through trial.
Methodology • Some of the steps that I intend to take to address the proposed research are as follows: • Data acquisition: The speech data of cerebral palsy patients would be recorded in a noise-free environment (at 44.1 kHz) using a TASCAM DA-P1 digital audio tape recorder, and the data loaded directly into the computer through a DIO 2448 24-bit digital I/O card. • Alternatively, if the subjects are able, the recording may be made directly on the computer using its standard built-in recorder.
Methodology • Frame generation and MFCC extraction: the speech sample (at 11 kHz) is divided into frames, and the first 12 most significant cosine coefficients so obtained from each frame constitute the 12 MFCC coefficients (the acoustic vector), as sketched below.
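For illustration only, here is a minimal sketch of this framing and MFCC step in Python using the modern librosa package (an assumption for demonstration; the frame length, hop size, and file name are likewise illustrative choices, not values fixed by this proposal):

```python
# Minimal sketch of frame generation and 12-coefficient MFCC extraction.
# Uses the third-party librosa package; the file name, ~25 ms frame length,
# and ~10 ms frame shift are illustrative assumptions.
import librosa

signal, sr = librosa.load("utterance.wav", sr=11025)  # resample to ~11 kHz

mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=12,                    # keep the 12 most significant cosine coefficients
    n_fft=int(0.025 * sr),        # ~25 ms analysis frame
    hop_length=int(0.010 * sr),   # ~10 ms frame shift
)
# mfcc has shape (12, n_frames): one 12-dimensional acoustic vector per frame.
print(mfcc.shape)
```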
Methodology • With the acoustic vectors available, we can proceed to the next step, which is to devise a tool for pattern recognition of such vectors. • The two systems under consideration (HMM and ANN) are envisioned to have a few constraints, namely: - Speaker-dependent systems; multiple-speaker recognition is not considered necessary or practical here. - They would cater to isolated speech (as justified above). - Small word vocabulary (since this research is meant to identify, or provide the means to identify, a more effective system for this specific application). • Expansion of the chosen model would remain possible (the design is scalable).
Methodology • Simple illustration of the steps of the recognition process: 1. Data recording 2. Digitized signal 3. Extract features per frame (MFCC) 4. Choose pattern-classifying tool (HMM/ANN) 5. Pattern match – 'Recognized' word.
Methodology • HMM system: • The hidden Markov model (HMM) results from the attempt to model speech generation statistically. • An HMM is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. • Transitions among the states are governed by a set of probabilities called transition probabilities. • In a particular state, an outcome or observation can be generated according to the associated probability distribution. • Only the outcome, not the state, is visible to an external observer; the states are therefore "hidden" from the outside, hence the name hidden Markov model.
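A minimal numerical sketch of this generative view (all probabilities are illustrative values, not fitted parameters): a three-state left-to-right (Bakis) model emits observable symbols while the state sequence itself stays hidden.

```python
# Toy generative HMM: the states are hidden, only the emissions are observed.
# All probabilities here are illustrative values, not fitted parameters.
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1.0, 0.0, 0.0])       # always start in state 0 (left-to-right)
A = np.array([[0.6, 0.4, 0.0],       # transition probabilities (Bakis:
              [0.0, 0.7, 0.3],       # only self-loops and forward moves)
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1],       # emission probabilities over three
              [0.1, 0.8, 0.1],       # discrete observation symbols
              [0.1, 0.1, 0.8]])

state = rng.choice(3, p=pi)
observations = []
for _ in range(10):
    observations.append(rng.choice(3, p=B[state]))  # visible outcome
    state = rng.choice(3, p=A[state])               # hidden transition
print(observations)  # an external observer sees only this symbol sequence
```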
Methodology • Once the HMM has been obtained, there are three problems of interest: • The Evaluation Problem: Given an HMM and a sequence of observations, what is the probability that the observations are generated by the model? This problem would be tackled by the Forward-Backward algorithm. • The Learning Problem: Given a model and a sequence of observations, how should we adjust the model parameters in order to maximize the probability mentioned above? This problem would be tackled by the Baum-Welch algorithm. • The Decoding Problem: Given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations? This problem would be tackled by the Viterbi algorithm.
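As an illustration of the decoding problem, here is a minimal log-domain Viterbi decoder in Python, reusing the illustrative toy model from the previous sketch:

```python
# Minimal log-domain Viterbi decoder for a discrete-observation HMM.
# pi, A, B repeat the illustrative toy model from the previous sketch.
import numpy as np

pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

def viterbi(obs, pi, A, B):
    """Return the most likely hidden-state path for an observation sequence."""
    with np.errstate(divide="ignore"):           # log(0) -> -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    n_states, T = A.shape[0], len(obs)
    delta = np.full((T, n_states), -np.inf)      # best log-prob ending in each state
    psi = np.zeros((T, n_states), dtype=int)     # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # backtrack through the pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1, 2, 2], pi, A, B))     # -> [0, 0, 1, 1, 2, 2]
```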
Methodology • ANN system: • An artificial neural network (ANN) consists of a potentially large number of simple processing elements called nodes, which influence each other's behavior via a network of excitatory or inhibitory weights. • Each node computes a nonlinear weighted sum of its inputs and transmits the result over its outgoing connections to other units. • A training set consists of patterns of values assigned to designated input and/or output units (here the training set would consist of the acoustic vectors, each associated with a designated category). • As patterns are presented from the training set, a learning rule modifies the weights so that the network gradually 'learns' the training set (here the learning rule would be learngdm, gradient descent with momentum, based on previous work).
Methodology • Neural networks have three types of layers (groups of nodes that function in tandem): one input layer, one or more hidden layers, and one output layer. • The input layer would receive the acoustic vectors in this case, and the output layer would communicate the network's decision (category classification). • Between them are one or more hidden layers that communicate with each other and with the output layer (the number of hidden layers would need to be evaluated against performance and speed requirements).
Methodology • In the feedforward network architecture, the input layer supplies the respective elements of the activation pattern (input vector) as an input signal to the second layer, and so on. • In this research the feedforward architecture would be investigated in order to reduce the level of training required and the processing time used (compared to a recurrent network). • The output of such a network would be used to identify the uttered test word in much the same way as with the HMM system (likely using the Viterbi algorithm).
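A minimal sketch of such a feedforward classifier, trained with gradient descent plus momentum (the learngdm-style rule mentioned above); the layer sizes, learning rate, and random training data are demonstration assumptions only:

```python
# Minimal feedforward word classifier trained with gradient descent + momentum.
# Layer sizes, learning rate, and the random training data are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_words = 12, 20, 5       # 12 MFCCs in, 5 vocabulary words out
W1 = rng.normal(0, 0.1, (n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_words))
b2 = np.zeros(n_words)
params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]
lr, momentum = 0.1, 0.9

def forward(x):
    h = np.tanh(x @ W1 + b1)                  # hidden layer: nonlinear weighted sum
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())         # softmax over word categories
    return h, p / p.sum()

# Illustrative training data: one averaged acoustic vector per utterance.
X = rng.normal(size=(100, n_inputs))
y = rng.integers(0, n_words, size=100)

for epoch in range(50):
    for x, target in zip(X, y):
        h, p = forward(x)
        d_logits = p.copy()
        d_logits[target] -= 1.0               # cross-entropy gradient at the output
        d_h = (W2 @ d_logits) * (1 - h**2)    # backpropagate through tanh
        grads = [np.outer(x, d_h), d_h, np.outer(h, d_logits), d_logits]
        for p_, v, g in zip(params, velocity, grads):
            v *= momentum                     # momentum term smooths the updates
            v -= lr * g
            p_ += v                           # in-place so W1, b1, ... stay bound

word_index = int(forward(X[0])[1].argmax())   # classify a test vector
```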
Methodology • Comparison of performance: • The two systems would then be compared and contrasted with the aim of identifying whether either provides greater tolerance or robustness to dysarthric speech specifically, as evidenced by a higher recognition percentage on the given test data set, keeping a cautious interpretation in mind. • Verification/identification of signal modification leading to enhanced performance: • I would like to verify whether signal modifications such as dynamic range suppression/transitional clipping of the data (using a distance/error measure) enhance recognition of dysarthric speech over unmodified speech data of the same set/type (both models will be tested).
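Since the exact modification and distance/error measure remain to be settled by trial, the following is only one plausible sketch of transitional clipping: frames whose MFCC vector changes sharply from the previous frame are treated as transitional and dropped. The Euclidean distance measure and the threshold value are assumptions, not choices made by this proposal.

```python
# Illustrative sketch of one possible "transitional clipping" step: drop
# frames whose acoustic vector changes sharply from the previous frame,
# keeping only the steadier (steady-state) frames. The Euclidean distance
# measure and threshold are assumptions to be settled by trial.
import numpy as np

def clip_transitions(mfcc, threshold=2.0):
    """mfcc: array of shape (n_frames, 12). Returns steady-state frames only."""
    deltas = np.linalg.norm(np.diff(mfcc, axis=0), axis=1)  # frame-to-frame distance
    keep = np.concatenate([[True], deltas < threshold])     # always keep first frame
    return mfcc[keep]

# Usage idea: feed the reduced frame sequence to both the HMM and ANN systems
# and compare recognition rates against the unmodified sequence.
frames = np.random.default_rng(0).normal(size=(50, 12))
print(clip_transitions(frames).shape)
```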
Future Direction • The purpose of this research is to identify, modify, and build a small-vocabulary, robust speech recognition system capable of good performance when applied specifically to cerebral palsy speech. • Such a system, once identified, can be developed further (enhanced robustness, increased vocabulary, etc.) so that it can serve as a driving base for specific control functions. • One promising area, for example, is using this system to interface with a visual display/audio device so that untrained listeners can correctly interpret the CP individual's utterances.