Introduction to Biometrics

Introduction to Biometrics Dr. Bhavani Thuraisingham The University of Texas at Dallas Lecture #12 Biometric Technologies: Voice Scan October 3, 2005

Outline • Introduction • How does it work • Components • Voice Scan Process • Template generation and matching • Market and Applications • Strengths and Weaknesses • Research Directions • Summary • Appendix: Banking Application

References • Course Text Book, Chapter 7 • http://www.biometricsinfo.org/voicerecognition.htm

Introduction • Voice recognition is a technology that allows a user to use his/her voice as input. • Voice recognition may be used to dictate text into the computer or to give commands to the computer (such as opening application programs, pulling down menus, or saving work). • Older voice recognition applications require each word to be separated by a distinct pause. • This allows the machine to determine where one word ends and the next begins. • These kinds of voice recognition applications are still used to navigate the computer's system and to operate applications such as web browsers or spreadsheets.

Introduction (Continued) • Newer voice recognition applications allow a user to dictate text fluently into the computer. • These new applications can recognize speech at up to 160 words per minute. • Voice recognition uses a neural net to "learn" to recognize voice. • As you speak, the voice recognition software remembers the way you say each word. • This customization makes recognition possible even though speakers vary in accent and inflection. • In addition to learning how you pronounce words, voice recognition also uses grammatical context and frequency of use to predict the word you wish to input. • While the accuracy of voice recognition has improved, users still experience problems with accuracy, either because of the way they speak or because of the nature of their voice.

How does it work? • Voice recognition technology utilizes the distinctive aspects of the voice to verify the identity of individuals. • Voice recognition is occasionally confused with speech recognition, a technology which translates what a user is saying (a process unrelated to authentication). • Voice recognition technology, by contrast, verifies the identity of the individual who is speaking. • The two technologies are often bundled – speech recognition is used to translate the spoken word into an account number, and voice recognition verifies the vocal characteristics against those associated with this account.

How does it work? (Concluded) • Voice recognition can utilize any audio capture device, including mobile and land telephones and PC microphones. • The performance of voice recognition systems can vary according to the quality of the audio signal as well as variation between enrollment and verification devices. • During enrollment an individual is prompted to select a passphrase or to repeat a sequence of numbers. • The passphrases selected should be approximately 1-1.5 seconds in length: very short passphrases lack enough identifying data, and long passphrases have too much, both resulting in reduced accuracy. • The individual is generally prompted to repeat the passphrase or number set a handful of times, making the enrollment process somewhat longer than for most other biometrics.

Components of the Voice Scan System • User's spoken phrase is converted from analog to digital format and transmitted to a local or central PC • For desktop verification applications, the engine that provides template-based functions may reside on the local or central PC • For telephone-based applications, the software may reside at the institution that users are interacting with • Voice scan comparisons are tied directly to existing authentication systems • May be web-enabled

Process: • Data Acquisition • Audio capture devices include mobile and land telephones and PC microphones • Individual selects a passphrase and repeats it, or repeats a sequence of numbers • Should be long enough • Not too loud or soft • More difficult with PC/mobile phones than with land telephones • Data Processing • The data is processed before template creation • Eliminates gaps and performs filtering
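
The data processing step above (eliminating gaps, then filtering) can be sketched roughly as follows. This is a minimal illustration, not the method of any particular voice scan product: the frame length, the energy threshold, and the choice of a first-order pre-emphasis filter are all assumptions made for the example, and the function names are hypothetical.

```python
def remove_gaps(samples, frame_len=4, threshold=0.01):
    """Drop frames whose mean energy is below `threshold` (an assumed value).

    Low-energy frames are treated as silent gaps between spoken segments.
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept


def pre_emphasize(samples, alpha=0.97):
    """Simple first-order filter y[n] = x[n] - alpha * x[n-1].

    Pre-emphasis is one common speech filter; the slides do not say
    which filter is actually applied, so this choice is an assumption.
    """
    if not samples:
        return []
    filtered = [samples[0]]
    for n in range(1, len(samples)):
        filtered.append(samples[n] - alpha * samples[n - 1])
    return filtered


# Toy signal: speech, a silent gap, then speech again.
signal = [0.5, -0.5, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.4, -0.4, 0.4, -0.4]
speech_only = remove_gaps(signal)
processed = pre_emphasize(speech_only)
```

In practice frames would cover tens of milliseconds of audio rather than four samples; the tiny frame here only keeps the example readable.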

Distinctive Features • Measures vocal qualities not detectable by humans • Pitch and frequency are key features measured • Voice scan algorithms also measure • gain or intensity • short-time spectrum of speech • formant frequencies • linear prediction coefficients • cepstral coefficients • Spectrograms • Nasal coarticulation • Replicable only by human voice and therefore more secure

Template Creation/Generation • Based on a statistical pattern-matching technique called Hidden Markov Models (HMMs) • HMMs are generalized profiles formed by comparing multiple samples to find characteristically repeating patterns • During enrollment, template generation relies on capturing multiple voice samples, which are analyzed to determine the qualities that can be relied upon for later recognition

Template Matching • Production voice scan technologies are not capable of one-to-many identification • They operate in one-to-one authentication mode • When a user attempts verification, the system compares the live submission with the enrolled profile and returns a statistical rating • A user's speech may change between enrollment and verification, which makes matching less reliable

Applications • Voice recognition is a strong solution for implementations in which vocal interaction is already present. • It is not a strong solution when speech is introduced as a new process. • Telephony is the primary growth area for voice recognition, and will likely be by far the most common area of implementation for the technology. • Telephony-based applications for voice recognition include account access for financial services, customer authentication for service calls • These solutions route callers through enrollment and verification subroutines, using vendor-specific hardware and software integrated with an institution's existing infrastructure. • Voice recognition has also been implemented in physical access solutions for border crossing

Deployment • NYC Department of Corrections – NY DOC • Used to check the location of juvenile offenders • The offender is called and he/she has to call back. The voice is verified and the caller ID is also checked • Pilot projects in banking • Ireland, Belgium, South Africa • Technology from T-NETIX and Buytel

Market • Though revenues from the technology are relatively small today, voice recognition will likely draw substantially greater revenues through 2007. • Most likely to be deployed in telephony-based environments (such as account access for financial services and customer authentication for service calls). • Voice recognition revenues are projected to grow from $12.2m in 2002 to $142.1m in 2007. • Voice recognition revenues are expected to comprise approximately 4% of the entire biometric market.
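
As a quick check on the projection above, the growth rate implied by the two revenue figures can be worked out as a compound annual growth rate. This is only arithmetic on the numbers quoted in the slide; the function name is hypothetical.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the slide: $12.2m in 2002, $142.1m projected for 2007.
cagr = compound_annual_growth_rate(12.2, 142.1, 2007 - 2002)
print(f"Implied annual growth: {cagr:.1%}")
```

The result is roughly 63% per year, which shows how steep the projected growth is.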

Strengths of Voice Scan • One of the challenges facing large-scale implementations of biometrics is the need to deploy new hardware to employees, customers and users. • One strength of telephony-based voice recognition implementations is that they are able to circumvent this problem, especially when they are implemented in call center and account access applications. • The ability to use existing telephones means that voice recognition vendors have hundreds of millions of authentication devices available for transactional usage today. • Resistant to Imposters • Imposter may not guess correct passphrases and account numbers

Weaknesses of Voice Scan • There may be noise with the voice • Low accuracy as enrollment voice may differ from verification voice for the same user • Large template size • Does not work well with PC

Research Directions • Improve accuracy • Model variations of voice for the same speaker • Improve performance • Better PC-based methods • Better models: • HMM, neural networks

Technology Comparison • Method | Coded Pattern | Misidentification Rate | Security • Iris Recognition | Iris pattern | 1/1,200,000 • Fingerprinting | Fingerprints | 1/1,000 • Hand Shape | Size, length and thickness of hands | 1/700 • Facial Recognition | Outline, shape and distribution of eyes and nose | 1/100 • Signature | Shape of letters, writing order, pen pressure | 1/100 • Voiceprinting | Voice characteristics | 1/30

Summary • Can be widely used • Telephones available • Low accuracy • People can change voices • Many applications • E.g., Banking Telephony

Introduction to Biometrics Dr. Bhavani Thuraisingham The University of Texas at Dallas Telephone Banking Application of Voice Scan October 3, 2005

The Problem • Telephone banking is increasingly popular with customers, and will be increasingly attractive to banks and other financial institutions as they start to implement highly cost effective automated speech recognition technology to handle routine transactions (the subject of another "financial futures" web page). • But the procedures for verifying customers over the telephone are unsatisfactory, both in terms of customer convenience and also, increasingly, from a security point of view. • The Problem: The usual approach to verifying customers - proving that they are who they claim to be - is to use some sort of PIN or password. To avoid the customer having to say the password out loud, they are usually prompted for, say, the second and fourth letters in the password.

The Problem (Concluded) • There are several problems with this approach: • Firstly, passwords and PINs are difficult to remember and unwieldy for customers to use in this manner. • Secondly, it takes time - identification and verification of the caller is often the lengthiest component of a transaction and this translates directly to the bottom line. • Many customers write down their passwords or reveal them to the operator (in extreme cases they may self select the same PIN that they use for ATM withdrawals). • Many call centers prompt the caller for additional 'secret' items such as their mother's maiden name, but this only exacerbates the other two problems.

The Solution • The Solution? Voice Verification: Technology now exists which enables individuals to be reliably, rapidly and cost-effectively verified on the basis of the physical characteristics of their voice. • Vendors now supply commercial voice verification technology. • A good example is Nuance Communications, based in California, using essentially the same technology which underlies their speaker-independent speech recognition software. • But in this case recognition is speaker dependent - the customer is only allowed to use the system if their individual voiceprint matches their identity (normally established through an account number).

The Solution (Concluded) • A new customer automatically enrolls in the system over the telephone by repeating about 10 four digit numbers or reading a short piece of text. • The software extracts from this a number of physical characteristics which are unique to that voice. • In all subsequent transactions, the caller, once identified, is asked to repeat a couple of randomly generated PINs or, for example, names of cities (this is to prevent imposters tape-recording a customer saying their password or PIN). • If the voiceprint matches the one stored against the account number the transaction proceeds; if not, the customer is referred to a supervisor.
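
The verification flow above (prompt for randomly generated PINs, then proceed or refer to a supervisor) might be sketched as follows. This is an illustration under assumptions, not the vendor's implementation: `generate_challenge` and `route_caller` are hypothetical names, and the two boolean inputs stand in for the outputs of a speech recognizer and a voiceprint matcher that are not shown.

```python
import secrets


def generate_challenge(num_pins=2, digits=4):
    """Create fresh random PINs for the caller to repeat.

    Because the prompt changes on every call, a tape recording of an
    earlier session will not contain the right digits.
    """
    return [
        "".join(secrets.choice("0123456789") for _ in range(digits))
        for _ in range(num_pins)
    ]


def route_caller(spoke_prompted_pins, voiceprint_matches):
    """Decide the next step of the call.

    spoke_prompted_pins: the caller said the PINs that were prompted
        (assumed to come from a speech recognizer).
    voiceprint_matches: the voice matched the print stored against the
        account number (assumed to come from a voice matcher).
    """
    if spoke_prompted_pins and voiceprint_matches:
        return "proceed"
    return "refer to supervisor"


challenge = generate_challenge()
print("Please repeat:", " ".join(challenge))
```

The `secrets` module is used instead of `random` so that the prompted digits are not predictable.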

Details • The speaker-specific characteristics of speech are due to differences in physiological and behavioral aspects of the speech production system in humans. • The main physiological aspect of the human speech production system is the vocal tract shape. • The vocal tract is generally considered as the speech production organ above the vocal folds, which consists of the following: • (i) laryngeal pharynx • (ii) oral pharynx • (iii) oral cavity • (iv) nasal pharynx • (v) nasal cavity

Details (Continued) • The vocal tract modifies the spectral content of an acoustic wave as it passes through it, thereby producing speech. • Hence, it is common in speaker verification systems to make use of features derived only from the vocal tract. • In order to characterize the features of the vocal tract, the human speech production mechanism is represented as a discrete-time system • The acoustic wave is produced when the airflow from the lungs is carried by the trachea through the vocal folds. • This source of excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these.

Details (Continued) • Phonated excitation occurs when the airflow is modulated by the vocal folds. • Whispered excitation is produced by airflow rushing through a small triangular opening between the arytenoid cartilage at the rear of the nearly closed vocal folds. • Frication excitation is produced by constrictions in the vocal tract. • Compression excitation results from releasing a completely closed and pressurized vocal tract. • Vibration excitation is caused by air being forced through a closure other than the vocal folds, especially at the tongue.

Details (Continued) • Speech produced by phonated excitation is called voiced. • Speech produced by phonated excitation plus frication is called mixed voiced. • Speech produced by other types of excitation is called unvoiced. • It is possible to represent the vocal tract in a parametric form as the transfer function H(z). • In order to estimate the parameters of H(z) from the observed speech waveform, it is necessary to assume some form for H(z). • Ideally, the transfer function should contain poles as well as zeros. • However, if only the voiced regions of speech are used, then an all-pole model for H(z) is sufficient.
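
For reference, an all-pole model of the kind mentioned above is commonly written in the following form. The symbols are assumptions introduced here, not taken from the slides: G is a gain, p is the model order, a_k are the predictor coefficients, s[n] is the speech sample and u[n] is the excitation.

```latex
H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}
\qquad\Longleftrightarrow\qquad
s[n] = \sum_{k=1}^{p} a_k \, s[n-k] + G \, u[n]
```

Read in the time domain, each speech sample is predicted as a weighted sum of the previous p samples, which is why the a_k are called linear prediction coefficients.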

Details (Concluded)

Choice of Features • LPC (linear predictive coding) features were very popular in early speech-recognition and speaker-verification systems. • However, comparing two LPC feature vectors requires computationally expensive similarity measures. • Hence LPC features are unsuitable for use in real-time systems. • The cepstrum, defined as the inverse Fourier transform of the logarithm of the magnitude spectrum, has been suggested for speech-recognition applications. • With the cepstrum, the similarity between two cepstral feature vectors can be computed as a simple Euclidean distance. • It has been demonstrated that the cepstrum derived from the LPC features gives the best performance. • Consequently, the LPC-derived cepstrum is generally used in speaker verification systems.
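
To make the feature choice above concrete, here is a small sketch of deriving cepstral coefficients from LPC coefficients and comparing two cepstral vectors by Euclidean distance. It assumes the all-pole convention H(z) = G / (1 - sum_k a_k z^-k) and uses the standard recursion for that convention; the gain term c_0 is omitted, and the function names and sample coefficient values are hypothetical.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a[0..p-1] (a_1..a_p) into c_1..c_n_ceps.

    Recursion used, for H(z) = G / (1 - sum_k a_k z^-k):
        c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}
    where a_n is taken as 0 for n > p and terms with n-k > p are skipped.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


def euclidean_distance(u, v):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Two made-up LPC vectors standing in for two speech frames.
frame_a = lpc_to_cepstrum([0.9, -0.2], n_ceps=4)
frame_b = lpc_to_cepstrum([0.7, -0.1], n_ceps=4)
print(euclidean_distance(frame_a, frame_b))
```

For a single-pole model with coefficient a, the recursion gives c_n = a^n / n, which matches the series expansion of -ln(1 - a z^-1) and is a handy sanity check.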

Speaker Modeling • Using cepstral analysis, an utterance may be represented as a sequence of feature vectors. • Utterances spoken by the same person at different times result in similar, yet different, sequences of feature vectors. • The purpose of voice modeling is to build a model that captures these variations in the extracted set of features. • There are two types of models that have been used extensively in speaker verification and speech recognition systems: • stochastic models and template models.

Speaker Modeling (Continued) • The stochastic model treats the speech production process as a parametric random process and assumes that the parameters of the underlying stochastic process can be estimated in a precise, well defined manner. • The template model attempts to model the speech production process in a non-parametric manner by retaining a number of sequences of feature vectors derived from multiple utterances of the same word by the same person. • Template models dominated early work in speaker verification and speech recognition because the template model is intuitively more reasonable. • However, recent work in stochastic models has demonstrated that these models are more flexible and hence allow for better modeling of the speech production process.

Speaker Modeling (Concluded) • A very popular stochastic model for modeling the speech production process is the Hidden Markov Model (HMM). • HMMs are extensions to the conventional Markov models, wherein the observations are a probabilistic function of the state. • The model is a doubly embedded stochastic process where the underlying stochastic process is not directly observable (it is hidden). • The HMM can only be viewed through another set of stochastic processes that produce the sequence of observations. • Thus, the HMM is a finite-state machine, where a probability density function p(x | s_i) is associated with each state s_i. The states are connected by a transition network, where the state transition probabilities are a_{ij} = p(s_i | s_j).

Pattern Matching • The pattern matching process involves the comparison of a given set of input feature vectors against the speaker model for the claimed identity and computing a matching score. For the Hidden Markov models the matching score is the probability that a given set of feature vectors was generated by the model.
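
The matching score described above, the probability that a sequence of feature vectors was generated by a speaker's HMM, can be computed with the forward algorithm. The sketch below makes several simplifying assumptions: features are one-dimensional, each state has a single Gaussian density, `trans[i][j]` is indexed as the probability of moving from state i to state j, and all model parameters are invented for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian, used as the per-state p(x | s_i)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def forward_likelihood(observations, initial, trans, means, variances):
    """Probability of a non-empty observation sequence under the HMM.

    initial[i]  : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    means[i], variances[i] : Gaussian parameters of state i
    """
    n_states = len(initial)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [
        initial[i] * gaussian_pdf(observations[0], means[i], variances[i])
        for i in range(n_states)
    ]
    for x in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states))
            * gaussian_pdf(x, means[j], variances[j])
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right models for two enrolled speakers.
initial = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
features = [0.1, 0.2, 1.9, 2.1]

score_claimed = forward_likelihood(features, initial, trans, [0.0, 2.0], [1.0, 1.0])
score_other = forward_likelihood(features, initial, trans, [5.0, 7.0], [1.0, 1.0])
print(score_claimed > score_other)
```

For long utterances the raw probabilities underflow, so practical systems work with log-probabilities or scaled forward variables; that detail is left out here to keep the sketch short.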
