Lecture 16 Speaker Recognition Information College, Shandong University @ Weihai
Definition • A method of recognizing a person from his/her voice. • Depends on speaker-specific characteristics • To determine whether a specified speaker is speaking in a given segment of speech • This task is the one closest to biometric identification using speech
Voice is a popular Biometric • Voice biometric: • Natural signal to produce • Does not require a specialized input device • Can be used on site or remotely • Telephone banking, voice mail browsing, … • Security alternatives: • Keys, cards, … • Passwords, PINs, … • Fingerprints, voiceprints, iris prints, …
Similar Tasks • Speaker Verification • Extracts information from the stream of speech. • Verifies that a person is who he/she claims to be. • One-to-one comparison. • Speaker Identification • Extracts information from the stream of speech. • Assigns an identity to the voice of an unknown person. • One-to-many comparison. • Speech Recognition • Extracts information from the stream of speech. • Figures out what a person is saying.
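As an aside, a minimal sketch of the two decision styles, with a hypothetical `score(model, features)` similarity function standing in for all of the speech processing:

```python
# Sketch: one-to-one verification vs. one-to-many identification.
# `score` is a placeholder similarity function (an assumption),
# returning how well test features match a speaker model.
from typing import Any, Callable, Dict

Score = Callable[[Any, Any], float]  # (model, features) -> similarity

def verify(claimed: str, feats: Any, models: Dict[str, Any],
           score: Score, threshold: float) -> bool:
    """One-to-one: compare only against the claimed speaker's model."""
    return score(models[claimed], feats) >= threshold

def identify(feats: Any, models: Dict[str, Any], score: Score) -> str:
    """One-to-many: compare against every enrolled model, pick the best."""
    return max(models, key=lambda spk: score(models[spk], feats))
```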
Today's Topics • Speech Recognition • History • Scheme • Speaker Features • Methods
Recognition Milestones • 1920s: first electromechanical toy responding to speech, "Radio Rex" (Elmwood Button Co.) • Late 1940s: the US Department of Defense funded automatic translation machines • The project failed, but it sparked research at MIT, CMU, and commercial institutions. • During the 1950s, Bell Labs developed the first system capable of recognizing digits spoken over the telephone. • 1962: "Shoebox" from IBM • In the early 1970s, Carnegie Mellon University built HARPY, a system capable of recognizing sentences over a limited grammar. • HARPY required as much computing power as 50 contemporary computers. • Moreover, the system recognized discrete speech, where words are separated by longer-than-usual pauses.
Recognition Milestones • In the 1980s, significant progress in speech recognition technology: • Word error rates dropped by a factor of 2 every two years. • In 1985, IBM demonstrated real-time recognition of isolated words from a set of 20,000, after 20 minutes of training, with an error rate below 5%. • AT&T built a call-routing system using speaker-independent word-spotting technology for a few key phrases. • Several very large vocabulary dictation systems appeared: • They required speakers to pause between words. • They worked better within a specific domain. • In the 1990s: • VoiceBroker, deployed by the stock brokerage Charles Schwab in 1996. • ViaVoice by IBM, first distributed with the now almost forgotten operating system OS/2 in 1996. • 1997: Dragon introduced Naturally Speaking, the first continuous speech recognition package. • Today: • Airline reservations with British Airways • Train reservations for Amtrak • Weather forecasts and telephone directory information
Terminology of Speech Recognition • Speaker Dependent Recognition • The recognition system is designed to work with just one or a small number of individual speakers • Speaker Independent Recognition • These systems are designed to work with all the speakers from a given linguistic community
Terminology of Speech Recognition • Large Vocabulary Recognition • Examples are domain-specific recognition systems, such as those used by medical consultants for dictating notes on their ward rounds • It is very difficult to make accurate large-vocabulary, speaker-independent systems • Small Vocabulary Recognition • Typically recognition of a few keywords such as digits or a set of commands. • Example: voice-operated telephone number dialing
Terminology of Speech Recognition • Isolated Word Recognition: • Systems which can only recognize individual words that are preceded and followed by a relatively long period of silence • Connected Word Recognition: • Systems which can recognize a limited sequence of words spoken in succession (e.g. “Ninety-eight thirty-five four thousand”) • Continuous Word Recognition: • Systems which can recognize speech as it occurs, in real time. Such systems usually work with a large vocabulary, but with moderate accuracy.
Speech Recognition Scheme • Three steps are performed in ANY speech recognition system: • Feature extraction • Measurement of similarity • Decision making
Recognition Systems • [Block diagram: speech → feature extraction → test pattern → pattern matching (against reference patterns) → decision rule → accept/reject] • Feature extraction derives a compact representation of the speech waveform, e.g., cepstral coefficients c0(t), c1(t), …, cM(t) per frame. • Pattern matching finds the word with the greatest similarity to the input speech; it is constrained in many ways, e.g., by the rules of language (grammar), spelling, and possible pronunciations.
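To make the three-step scheme concrete, a minimal sketch in Python; the energy-based features and the crude similarity measure are placeholders of my own, not the system described above:

```python
# Minimal sketch of the three-step recognition scheme. The feature
# extractor and similarity measure are simplistic stand-ins
# (assumptions), not a specific system's implementation.
import numpy as np

def extract_features(waveform: np.ndarray) -> np.ndarray:
    """Step 1: derive a compact representation (here: log frame energies)."""
    frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]

def similarity(test: np.ndarray, ref: np.ndarray) -> float:
    """Step 2: measure similarity (negative mean distance after length crop)."""
    n = min(len(test), len(ref))
    return -float(np.mean(np.abs(test[:n] - ref[:n])))

def recognize(waveform: np.ndarray, references: dict) -> str:
    """Step 3: decide on the reference pattern with the greatest similarity."""
    test = extract_features(waveform)
    return max(references, key=lambda w: similarity(test, references[w]))
```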
Speaker Recognition Features • The features are low-level speech signal representation parameters that convey complete information about the signal. • High-level characteristics like accent, intonation, etc. are encoded within the representation in a very complex and cryptic manner. • The features contain speaker-dependent components. • The uniqueness and permanence of the features are problematic.
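For illustration, mel-frequency cepstral coefficients (MFCCs) are one common low-level representation of this kind; a minimal sketch assuming the third-party librosa library and a hypothetical file name:

```python
# Sketch: extracting low-level features (MFCCs) from a speech file,
# assuming the librosa library. "utterance.wav" is hypothetical.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # waveform, sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfcc.shape)  # one 13-dimensional feature vector per analysis frame
```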
Questions • Do features that uniquely characterize people exist? • The uniqueness and permanence of most features used in biometric systems have not been proven. • Is the human ability to identify a person a limit that no automatic system can overcome? • Automated systems might be able to identify people better than the average person can. In practice, expert systems do not perform the task better than the experts who built them.
Questions • How important are the algorithms versus the knowledge of features and their relationships for achieving high identification accuracy? • Knowledge of features and their relationships is fundamental for accurate biometric systems. The algorithms play an important, yet secondary, role in the process, as no algorithm can compensate for the lack of adequate features.
Speaker models • Used to represent the speaker specific information conveyed in the feature vectors • Several different modeling techniques have been applied: • Template Matching • Nearest Neighbor • Neural Networks • Hidden Markov Models • State-of-the-art speaker recognition algorithms are based on statistical models of short-term acoustic measurements on the input speech signal
Speaker models • Use long-term averages of acoustic features (spectrum, pitch, …): the first and earliest idea. • Average out the factors influencing intra-speaker variation, leaving only the speaker-dependent component. • Drawback: requires long speech utterances (>20 s) • Train a speaker-dependent (SD) model for each speaker • Explicit segmentation: HMM • Implicit segmentation: VQ, GMM
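A minimal sketch of the long-term-average idea, assuming frame-level feature matrices of shape (n_frames, dim):

```python
# Sketch of the long-term-average approach: each speaker is represented
# by the mean of his/her frame-level feature vectors, and a test
# utterance is assigned to the nearest mean. Shapes are assumptions.
import numpy as np

def long_term_average(features: np.ndarray) -> np.ndarray:
    """features: (n_frames, dim) -> one dim-dimensional speaker template."""
    return features.mean(axis=0)

def nearest_speaker(test_feats: np.ndarray, templates: dict) -> str:
    """templates: speaker name -> precomputed long-term average vector."""
    avg = long_term_average(test_feats)
    return min(templates, key=lambda s: np.linalg.norm(avg - templates[s]))
```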
Speaker models • HMM: • Advantage: text-independent • Drawback: a significant increase in computational complexity • VQ: • Advantage: unsupervised clustering • Drawback: text-dependent • GMM: • Advantages: text-independent, probabilistic framework (robust), computationally efficient, easy to implement.
Speaker models • Discriminative Neural Network • Models the decision function which best discriminates speakers • Advantage: fewer parameters, higher performance compared to the VQ model. • Drawback: the network must be retrained when a new speaker is added to the system.
Progression • [Timeline chart, 1985–1995: the dominant modeling techniques for state-of-the-art speech recognition (VQ, NN, HMM, GMM), plotted from easy tasks to hard tasks by word error rate]
VQ Example • [Figure: two speakers' codebooks in acoustic space; the test sample has less distortion for speaker A than for speaker B, so it is assigned to speaker A.]
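A minimal sketch of this VQ comparison, assuming SciPy's vector-quantization utilities (the slides name no particular library):

```python
# Sketch: VQ speaker comparison. Train one codebook per speaker with
# k-means; score a test utterance by its average quantization distortion
# against each codebook. scipy.cluster.vq is an assumed dependency.
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(train_feats: np.ndarray, k: int = 64) -> np.ndarray:
    """train_feats: (n_frames, dim) -> (k, dim) codeword matrix."""
    codebook, _ = kmeans(train_feats, k)
    return codebook

def distortion(test_feats: np.ndarray, codebook: np.ndarray) -> float:
    """Mean distance from each test frame to its nearest codeword."""
    _, dists = vq(test_feats, codebook)
    return float(dists.mean())   # lower distortion = better match

# The speaker whose codebook yields the least distortion wins.
```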
HMM Example • [Figure: two pronunciation models of the word “tomato” as phoneme chains [t] [ow|ah] [m] [ey|aa] [t] [ow], with branch probabilities 0.5/0.5 and 0.2/0.8] • A word in the vocabulary is represented by its phonemes. • Each phoneme is viewed as an HMM. • A word model is constructed by combining the HMMs for its phonemes.
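To show how such a model assigns a score, a minimal sketch of the forward algorithm on a toy two-state HMM; all numbers here are illustrative, not taken from the figure:

```python
# Sketch: forward algorithm for scoring an observation sequence under
# an HMM. A word model would chain phoneme HMMs; a single toy HMM with
# discrete observations stands in for one such model here.
import numpy as np

A  = np.array([[0.7, 0.3], [0.0, 1.0]])   # state transition probabilities
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(observation symbol | state)
pi = np.array([1.0, 0.0])                 # initial state distribution

def forward_likelihood(obs: list) -> float:
    """P(observation sequence | model), summed over all state paths."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

print(forward_likelihood([0, 0, 1]))  # likelihood of a short toy sequence
```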
Gaussian Mixture Model (GMM) • [Figure: in speech recognition, GMMs model the observation distributions at the HMM state level.]
Gaussian Mixture Model (GMM) • [Figure: in speaker recognition, each speaker k is modeled by his/her own GMM over the acoustic feature space.]
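A minimal sketch of GMM-based speaker identification, with scikit-learn as an assumed dependency (the slides name no library):

```python
# Sketch: GMM speaker identification. One GMM is fitted per enrolled
# speaker; a test utterance is assigned to the speaker whose model
# gives the highest average per-frame log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(train_feats: dict, n_components: int = 16) -> dict:
    """train_feats: speaker name -> (n_frames, dim) feature matrix."""
    models = {}
    for spk, X in train_feats.items():
        models[spk] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(X)
    return models

def identify(test_feats: np.ndarray, models: dict) -> str:
    # GaussianMixture.score() returns the mean per-frame log-likelihood.
    return max(models, key=lambda s: models[s].score(test_feats))
```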
Limits • The best performing algorithms for text-independent speaker verification use Gaussian Mixture Models (GMMs), i.e., single-state HMMs: • The linguistic structure of the speech signal is not taken into account, and all sounds are represented by a single model. • The sequential information is ignored. • There is a recent trend toward using high-level features: • Requires a large-vocabulary continuous speech recognition system. • Good results for a small set of languages. • Needs a huge amount of annotated speech databases (an enormous amount of time and human effort). • Language- and task-dependent.
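For the verification case, a common scoring scheme (an assumption here, not detailed in the slides) compares the claimed speaker's GMM against a background GMM trained on many speakers and thresholds the log-likelihood ratio:

```python
# Sketch: verification decision with speaker GMMs. The background-model
# comparison and the threshold value are illustrative assumptions,
# not taken from the slides. scikit-learn is an assumed dependency.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(X: np.ndarray, n_components: int = 16) -> GaussianMixture:
    """X: (n_frames, dim) training features for one model."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(X)

def verify(test_feats: np.ndarray, claimed: GaussianMixture,
           background: GaussianMixture, threshold: float = 0.0) -> bool:
    """Accept the identity claim if the log-likelihood ratio is high enough."""
    llr = claimed.score(test_feats) - background.score(test_feats)
    return llr >= threshold
```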