“Speech Signal Processing” Presented by: Dr. Buket D. Barkana Department of Electrical Engineering School of Engineering University of Bridgeport May 1, 2008
Outline • Speech Signal Processing (SSP) • Applications • Speech Synthesis (Text-to-Speech) • Speech Coding • Speaker Verification/Identification • Speech Recognition • Speech Enhancement (Audio Noise Reduction) • SSP Studies at UB • Speaker’s Gender Identification • Acoustical Properties of Noise Signals • Audio Signal Laboratory • Conclusion
Speech Signal Processing • Speech processing is the study of speech signals and of the methods used to process them. • The signals are usually processed in a digital representation, so speech processing can be seen as the intersection of digital signal processing and natural language processing. • Speech processing can be divided into the following categories: • Speech recognition, which deals with the analysis of the linguistic content of a speech signal. • Speaker recognition, where the aim is to recognize the identity of the speaker. • Speech enhancement, e.g. audio noise reduction: improving the perceptual quality of a speech signal by removing the destructive effects of noise, limited-capacity recording equipment, impairments, etc. • Speech coding, a specialized form of data compression that is important in telecommunications. • Voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords. • Speech synthesis: the artificial synthesis of speech, which usually means computer-generated speech.
Basic Definitions: • Speech Production: • Speech signals are composed of a sequence of sounds. These sounds, and the transitions between them, serve as a symbolic representation of information. The arrangement of these sounds is governed by the rules of language. • The speech organs are divided into three main groups: • The lungs: the power supply • The larynx: either a periodic, puff-like or a noisy airflow source • The vocal tract: spectral shaping • Depending on the type of excitation, two types of sounds are produced: voiced and unvoiced. • Voiced sounds are produced by forcing air through the glottis, the opening between the vocal folds. The vocal folds vibrate, producing quasi-periodic signals (e.g., the vowel in “Bob”). • Unvoiced sounds are generated by forming a constriction at some point along the vocal tract. The vocal folds do not vibrate, producing noise-like signals (e.g., the “s” in “six”).
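To make the voiced/unvoiced distinction concrete, here is a minimal Python sketch that classifies a single speech frame using short-time energy and zero-crossing rate. The frame-based approach is standard, but the threshold values are illustrative assumptions, not values from the presentation.

```python
import numpy as np

def voiced_unvoiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Classify one speech frame as voiced or unvoiced.

    Voiced frames (vibrating vocal folds) tend to have high energy and a
    low zero-crossing rate; unvoiced frames (noise-like) the opposite.
    Thresholds are illustrative and would be tuned on real data, with the
    input assumed normalized to [-1, 1].
    """
    energy = np.mean(frame ** 2)                         # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # crossings per sample
    if energy > energy_thresh and zcr < zcr_thresh:
        return "voiced"
    return "unvoiced"
```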
Pitch Period or Fundamental Frequency: The pitch period is the smallest repeating unit of a periodic signal; one pitch period thus describes the periodic signal completely. The fundamental frequency is the reciprocal of the pitch period. • Methods: Autocorrelation Function (ACF), Cepstrum Analysis. • Formant Frequencies: • The resonance frequencies of the vocal tract tube are called formants. • The formants depend upon the shape and dimensions of the vocal tract; each shape is characterized by a set of formant frequencies. • A formant appears as a peak in the frequency spectrum of a sound caused by acoustic resonance. • Methods: FFT, spectrogram.
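As an illustration of the ACF method named above, here is a minimal Python sketch of autocorrelation-based pitch estimation. The 50–400 Hz search range is an assumed typical pitch range, not a value from the presentation, and the frame is assumed to span at least two pitch periods.

```python
import numpy as np

def pitch_acf(frame, fs, f_lo=50.0, f_hi=400.0):
    """Estimate the fundamental frequency of a voiced frame via the ACF.

    The pitch period shows up as the lag of the largest autocorrelation
    peak inside the plausible pitch range.
    """
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f_hi)                        # shortest period considered
    lag_max = min(int(fs / f_lo), len(acf) - 1)     # longest period considered
    peak_lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return fs / peak_lag                            # fundamental frequency, Hz
```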
Speech Synthesis • Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. • Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.
A text-to-speech system is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization or pre-processing. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound.
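As a toy illustration of the two front-end tasks just described (text normalization followed by text-to-phoneme conversion), here is a minimal Python sketch. The LEXICON entries and the normalization rules are hypothetical stand-ins for a real pronouncing dictionary and rule set.

```python
import re

# Toy lexicon mapping words to ARPAbet-style phonetic transcriptions;
# a real front-end would use a full pronouncing dictionary plus
# letter-to-sound rules for out-of-vocabulary words.
LEXICON = {"dr": "D AA K T ER", "smith": "S M IH TH",
           "lives": "L IH V Z", "at": "AE T",
           "twenty": "T W EH N T IY", "one": "W AH N"}

def normalize(text):
    """Text normalization: expand numbers and abbreviations into words."""
    text = text.lower().replace("dr.", "dr")
    text = text.replace("21", "twenty one")   # illustrative rule only
    return re.findall(r"[a-z]+", text)

def front_end(text):
    """Return the symbolic linguistic representation: (word, phones) pairs."""
    return [(w, LEXICON.get(w, "<oov>")) for w in normalize(text)]

print(front_end("Dr. Smith lives at 21"))
```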
The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. • Application areas: • Listening to written works on a computer (people with visual impairments or reading disabilities) • Airports • Call centers • Many computer operating systems have included speech synthesizers since the early 1980s. • Text-to-speech (TTS) is the generation of synthesized speech from text. Our goal is to make synthesized speech as intelligible, natural, and pleasant to listen to as human speech, and to have it communicate just as meaningfully.
Speech Coding • Speech coding is the application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation based on audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting model parameters in a compact bit stream. • The two most important applications of speech coding are mobile telephony and Voice over IP. • Speech Coding Methods: • Waveform coders • Vocoders • Hybrid coders • In speech coding, the most important criterion is preservation of the intelligibility and “pleasantness” of speech under a constrained amount of transmitted data. • Intelligibility here includes not only the literal content but also speaker identity, emotions, intonation, timbre, etc.
Ref. David Tipper, Digital Signal Processing, University of Pittsburgh
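As a concrete example of the waveform-coder family listed above, here is a minimal Python sketch of mu-law companding, the quantization scheme used in G.711 narrowband telephony; it is one simple instance of waveform coding, not the only method.

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Mu-law companding: compress amplitudes, then quantize to 8 bits.

    Input x is assumed normalized to [-1, 1]; mu=255 matches the G.711
    mu-law standard used in narrowband telephony.
    """
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(np.uint8)   # 8-bit codes

def mulaw_decode(codes, mu=255):
    """Invert the quantization and the companding curve."""
    y = codes.astype(np.float64) / mu * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```

The companding step spends more quantizer levels on small amplitudes, which dominate speech, so 8 bits per sample preserve intelligibility far better than uniform 8-bit quantization would.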
The applications of speech coding: • Speech coding for efficient transmission and storage of speech • Narrowband and broadband wired telephony • Cellular communications • Voice over IP (VoIP) to utilize the Internet as a real-time communications medium • Secure voice for privacy and encryption for national security applications • Extremely narrowband communications channels, e.g., battlefield applications using HF radio • Storage of speech for telephone answering machines.
Speech Understanding • Extracts and interprets the meaning of recognized speech, supporting complex natural-language dialog services. • Includes a machine-learning capability for understanding what customers are calling about. • Includes a finite-state capability for detecting and extracting named entities (e.g., names, addresses). • Supports rapid prototyping by combining prior knowledge and data. • A world-class speech understanding engine, used for Platinum VoiceTone applications. • Based on AT&T’s pioneering research in machine learning and finite-state automata.
Speaker Identification and Verification • Speaker identification: a type of speaker recognition. It is the problem of identifying a person solely by their voice, e.g. to • support police investigations, • identify talkers in a discussion, • alert speech recognition systems to speaker changes, • check whether a user is already enrolled in a system. • Speaker identification problems generally fall into two categories: • Differentiating multiple speakers while a conversation is taking place. • Identifying an individual’s voice based upon previously supplied data about that voice. • Speaker identification is based on complex voice processing algorithms. • Text-independent systems are most often used for speaker identification, as they require very little, if any, cooperation from the speaker. In this case the text spoken during enrollment and testing is different; in fact, enrollment may happen without the user’s knowledge, and some recorded piece of speech may suffice.
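As one common text-independent approach (not necessarily the one used in any particular system), here is a minimal Python sketch of GMM-based speaker identification over MFCC features. It assumes the librosa and scikit-learn libraries, and the file paths and parameter values are hypothetical.

```python
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_model(wav_path, n_components=16):
    """Fit a GMM to a speaker's MFCC features (text-independent model)."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x coeffs
    return GaussianMixture(n_components=n_components).fit(mfcc)

def identify(wav_path, models):
    """Return the enrolled speaker whose GMM gives the highest likelihood."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    return max(models, key=lambda name: models[name].score(mfcc))

# Hypothetical usage: enroll two speakers, then identify an unknown clip.
models = {"alice": train_speaker_model("alice_enroll.wav"),
          "bob": train_speaker_model("bob_enroll.wav")}
print(identify("unknown.wav", models))
```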
Speaker Verification: the problem of verifying a claimed identity from the speaker’s voice. • Speaker verification is usually used in applications that require secure access. • Methods: • The technologies used to process and store voiceprints include frequency estimation, hidden Markov models, pattern-matching algorithms, neural networks, matrix representation, and decision trees.
Speech Recognition • Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to machine-readable input. • Applications: • Health care (medical documentation) • Military (high-performance fighter aircraft) • the program in France installing speech recognition systems on Mirage aircraft • the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft • setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons-release parameters, and controlling flight displays • Helicopters (background noise) • Battle management (requires rapid access to and control of large, rapidly changing information databases in an eyes-busy environment) • Training air traffic controllers • People with disabilities (those who are unable to use their hands)
Performance of speech recognition systems • The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is usually rated with the word error rate (WER), whereas speed is measured with the real-time factor. • Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called “enrollment”) and may successfully capture continuous speech with a large vocabulary at a normal pace with very high accuracy. Most commercial companies claim that their recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. “Optimal conditions” usually assume that users: • have speech characteristics which match the training data, • can achieve proper speaker adaptation, and • work in a low-noise environment (e.g. a quiet office or laboratory space). • This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected.
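Since WER is the standard accuracy metric named above, here is a minimal Python sketch of its computation via the usual edit-distance dynamic program; the example sentences are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One inserted word against a 3-word reference -> WER of 1/3.
print(word_error_rate("set radio frequency", "set the radio frequency"))
```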
SSP Studies at UB • Speaker’s Gender Identification • Acoustical Properties of Noise Signals • Audio Signal Laboratory
Energy Estimation between Adjacent Formant Frequencies to Identify Speakers’ Gender
Deepawale D.S., Bachu R., Barkana B.D. • A method was developed for gender identification that combines the first three formants, the pitch period, and the energy between each adjacent pair of formants. The information provided by the energy, formant, and pitch estimates is combined using a classifier to identify the gender of the speaker. All parameters are determined from a voiced part of the speech samples.
Spectrogram of the words “four” and “eight” for a female and a male speaker: The distance between the first three formants varies appreciably in the frequency domain. In particular, the average distance between adjacent formants is generally much larger for female speakers than for male speakers. Energy is calculated between each adjacent pair of formants; the 25% of the data between each formant and the start or stop point of the frequency band defining the short-time energy is excluded. A sketch of this band-energy computation follows.
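Below is a rough Python sketch of the band-energy idea just described: compute the spectrum of a voiced frame and sum the energy between adjacent formants, discarding 25% of each band at both edges. It is an illustration of the description above, not the authors’ exact implementation, and it assumes the formant frequencies have already been estimated elsewhere (e.g., by LPC analysis).

```python
import numpy as np

def band_energy(frame, fs, formants, trim=0.25):
    """Energy between adjacent formant pairs, trimming 25% of each band
    at both edges (a sketch of the idea, not the study's implementation).

    `formants` holds the first three formant frequencies in Hz, assumed
    to come from a separate formant estimator.
    """
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    energies = []
    for f1, f2 in zip(formants[:-1], formants[1:]):
        width = f2 - f1
        lo, hi = f1 + trim * width, f2 - trim * width   # drop band edges
        energies.append(spec[(freqs >= lo) & (freqs <= hi)].sum())
    return energies   # [energy F1-F2, energy F2-F3]
```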
Results: The energy between the middle of each adjacent pair of formants is smaller for female speakers than for male speakers. In particular, the difference in energy between the second and third formants for males and females is very clear and sharp for some vowels. This can therefore be a very important parameter for identifying a speaker’s gender.
The acoustical properties of common background noises
Deepawale D.S., Bachu R., Potla S., Barkana B.D. • Commonly encountered background noises: subway, highway, inside a train, inside a car, rain, restaurant, and airport. • People work under background noise • People with hearing-aid devices • Speech signal processing • The prevention or reduction of background noise is important both in the field of speech signal processing and in everyday life. In the past few years, many techniques have been developed for speech enhancement, speech recognition, and hearing aids. The main aim of these techniques is to reduce the background noise.
Existing noise monitoring systems have the shortcoming that, although the intensity, duration, and time of occurrence of noises can be recorded, their source often cannot be identified. Such information would be particularly useful when multiple noise sources are possible. This has led to research directed toward providing an “intelligent” noise monitoring system able to distinguish between the acoustic signatures of different noise sources. Various techniques have been proposed for this purpose; neural networks, linear classifiers, ad-hoc methods, and statistical pattern recognition are among them. • Four physical factors are extracted from the ACF in our study (see the sketch below): • (1) The energy represented at the origin of delay, • (2) The effective duration of the envelope of the normalized ACF, • (3) The amplitude of the first maximum peak of the normalized ACF, and • (4) Its delay time.
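Here is a minimal Python sketch of how these four ACF factors might be extracted from a noise segment. The -10 dB envelope threshold for effective duration and the use of |ACF| as an envelope proxy are simplifying assumptions, not details taken from the study.

```python
import numpy as np

def acf_features(x, fs, env_thresh_db=-10.0):
    """Extract four ACF factors used to characterize a noise signal.

    Returns (energy at zero lag, effective duration of the normalized ACF
    envelope, amplitude of its first maximum peak, that peak's delay time).
    """
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    energy = acf[0]                        # (1) energy at the origin of delay
    nacf = acf / acf[0]                    # normalized ACF
    env_db = 10 * np.log10(np.maximum(np.abs(nacf), 1e-12))
    below = np.where(env_db < env_thresh_db)[0]
    tau_e = below[0] / fs if below.size else len(x) / fs   # (2) effective duration, s
    # (3)-(4): first local maximum of the normalized ACF after lag 0
    peaks = np.where((nacf[1:-1] > nacf[:-2]) & (nacf[1:-1] > nacf[2:]))[0] + 1
    phi1, tau1 = (nacf[peaks[0]], peaks[0] / fs) if peaks.size else (0.0, 0.0)
    return energy, tau_e, phi1, tau1
```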
Audio Processing Lab • 11x TMS320VC5510 DSK • Headphones, microphone • Centered on a set of experiments for the TMS320VC5510 DSP, the goal of this course is to teach how to program the TMS320VC5510 using C++ and MATLAB and to illustrate concepts from the theory of audio/speech signal processing. • Lectures will cover background material pertinent to the lab in these areas: • The acoustics and acoustic analysis of audio/speech • The physiology of audio/speech production • Filter design • Echo cancellation • The perception of audio/speech • Audio/speech disorders • Coding techniques • MP3 encoding/decoding
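Echo cancellation, one of the lecture topics above, is commonly taught with the LMS adaptive filter; below is a minimal Python sketch of that idea (the lab itself targets the TMS320VC5510 in C++ and MATLAB). The tap count and step size are illustrative, not course parameters.

```python
import numpy as np

def lms_echo_canceller(far_end, mic, n_taps=128, mu=0.01):
    """Cancel acoustic echo with an LMS adaptive FIR filter.

    The adaptive filter models the echo path from the far-end
    (loudspeaker) signal; subtracting its output from the microphone
    signal leaves the near-end speech. Signals are assumed normalized to
    [-1, 1]; too large a step size mu can make LMS diverge.
    """
    w = np.zeros(n_taps)                   # adaptive filter weights
    out = np.zeros(len(mic))
    for n in range(n_taps, len(mic)):
        x = far_end[n - n_taps:n][::-1]    # most recent far-end samples
        echo_est = w @ x                   # estimated echo
        e = mic[n] - echo_est              # error = echo-free signal
        w += mu * e * x                    # LMS weight update
        out[n] = e
    return out
```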