Acoustic Databases Jan Odijk ELSNET Summer School, Prague, 2001
Acknowledgements • Part of the slides have been borrowed from or are based on work by: • Bart D’Hoore • Hugo van Hamme • Robrecht Comeyne • Dirk van Compernolle • Bert van Coile
Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications
Linguistic Resources (LRs) • Linguistic Resources are sets of language data in machine-readable form that can be used for developing, improving or evaluating language and speech technologies. • Some language and speech technologies: • Text-To-Speech (TTS) • Automatic Speech Recognition (ASR) • Dictation • Speaker Verification/Recognition • Spoken Dialogue • Audio Mining • Machine Translation • Intelligent Content Management • …
Linguistic Resources (LRs): Major Types • Electronic Text Corpora • Newspapers, magazines, etc. • Usenet texts, e-mail, correspondence • Etc. • Lexical Resources • Monolingual lexicons • Translation lexicons • Thesauri • … • Acoustic Resources • Annotated Speech Recordings • Annotated Recordings of other acoustic signals • Coughing, throat clearing, breathing, … • Door slamming, screeching tires (of a car), …
Types of Linguistic Resources Acoustic Resources • Acoustic Databases (ADBs) • Controlled recordings of human speech or other acoustic signals • Enriched with annotations • Recorded digitally • Representative of the targeted application environment and medium • Balanced for phonemes/phoneme combinations • Speaker parameters, recording quality, environment/medium documented
Types of Linguistic Resources Acoustic Resources • Annotated unstructured recordings • Broadcast material • Recorded conversations/monologues/speeches, etc. • Dictated material • Enriched with annotations
Types of Linguistic Resources Acoustic Resources • In-service data • Recorded sessions of interaction between humans and a running application • Usually obtained by logging a customer system • Enriched with annotations • Used for tuning models, grammars, etc. to a specific application
Types of Linguistic Resources Acoustic Resources • Environments • “Quiet” • Studio • Quiet office • Normal office • Noisy • Public place (street, hotel lobby, station, etc.) • Car (engine running at 0 km/h, city traffic, highway) • Industrial environment
Types of Linguistic Resources Acoustic Resources • Media • HQ close-talk microphone • Desktop Microphones • Telephone • analog or digital • fixed line or mobile • Wide band microphones • Array microphones • PC/PDA etc. low quality microphone
Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications
Acoustic Resources Use • (for speech synthesis modules in TTS systems) • (as acoustic reference material for pronunciation lexicons) • Mainly for speech recognition • Training and test material for research into new recognition engines and engine features • Training and test material for development of acoustic models • Tuning of acoustic models for specific applications
What is speech recognition? • ASR: Automatic speech recognition • Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. • Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech. • Speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples. • Speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples.
Elements of a Recognizer [Block diagram: Speech → Feature Extraction → Pattern Matching → Post Processing → text (Display); the Acoustic Model, the Language Model and Language Data feed Pattern Matching; Natural Language Understanding turns the recognized text into Meaning and an Action]
Feature Extraction • Turning the speech signal into something more manageable • Do an analysis once every 10 ms • Data compression: 220 bytes => 50 bytes => 4 bytes • Sampling of the signal: transforming it into digital form • Extracting relevant parameters from the signal • Spectral information, energy, pitch, ... • Eliminating undesirable elements (normalization) • Noise • Channel properties • Speaker properties (gender)
Feature Extraction: Vectors • Signal is chopped into small pieces (frames), typically 30 ms • Spectral analysis of a speech frame produces a vector representing the signal properties, e.g. [10.3, 1.2, -0.9, …, 0.2] • => result = stream of vectors
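To make the framing step concrete, here is a minimal sketch in plain NumPy (the function name and the exact features are illustrative assumptions, not the actual front-end described in these slides): it chops a signal into 30 ms frames every 10 ms and produces one small spectral vector per frame.

```python
import numpy as np

def frame_features(signal, sample_rate=16000, frame_ms=30, shift_ms=10, n_coeffs=12):
    """Chop a 1-D signal into overlapping frames and return one feature vector per frame.

    Illustrative only: real front-ends use mel filter banks, cepstra, deltas, etc.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 480 samples for 30 ms at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples for a 10 ms frame shift
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum of the frame
        log_energy = np.log(spectrum.sum() + 1e-10)         # overall frame energy
        coeffs = np.log(spectrum[:n_coeffs] + 1e-10)        # crude spectral-shape parameters
        vectors.append(np.concatenate(([log_energy], coeffs)))
    return np.stack(vectors)                                # the "stream of vectors"

# One second of audio at 16 kHz yields roughly 98 feature vectors
features = frame_features(np.random.randn(16000))
```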
Acoustic Model • Split utterance into basic units, e.g. phonemes • The acoustic model describes the typical spectral shape (or typical vectors) for each unit • For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme • Must cope with pronunciation variability • Utterances of the same word by the same speaker are never identical • Differences between speakers • Identical phonemes sound different in different words • => statistical techniques: models are created from a lot of examples
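As a minimal sketch of the statistical idea (a toy stand-in, not the HMM/GMM models used in real recognizers): estimate one Gaussian per phoneme from many labelled example vectors and use its log-likelihood as the per-frame match score.

```python
import numpy as np

class GaussianPhonemeModel:
    """Toy acoustic model: one diagonal Gaussian per phoneme, trained from examples."""

    def __init__(self):
        self.means, self.variances = {}, {}

    def train(self, examples):
        # examples: {phoneme: array of feature vectors observed for that phoneme}
        for phoneme, vectors in examples.items():
            self.means[phoneme] = vectors.mean(axis=0)
            self.variances[phoneme] = vectors.var(axis=0) + 1e-6   # avoid zero variance

    def score(self, frame, phoneme):
        # Log-likelihood of one feature vector under the phoneme's Gaussian:
        # "how well (or how badly) this segment matches the phoneme"
        mean, var = self.means[phoneme], self.variances[phoneme]
        return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (frame - mean) ** 2 / var))
```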
[Figure: the utterance “friendly computers” (f-r-ie-n-d-l-y c-o-m-p-u-t-e-r-s) segmented into a sequence of acoustic units S1 … S13, each stretch of the signal assigned to one unit]
Acoustic Model: Units [Figure: “Stop” and “Start” built from unit sequences, e.g. Stop = S1(S) S2(T) S3(O) S4(P), Start = S6(S) S7(T) S8(A) S9(R) S10(T)] • Phoneme: words share the units that model the same sound • Word: series of units specific to the word
Acoustic Model: Units • Context-dependent phoneme: “Stop” = S|,|T T|S|O O|T|P P|O|, • Diphone: “Stop” = ,S ST TO OP P, • Other sub-word units, e.g. consonant clusters: “Stop” = ST O P
Acoustic Model: Units • Phonemes • Phonemes in context: spectral properties depend on the previous and following phoneme • Diphones • Sub-words: syllables, consonant clusters • Words • Multi-words, e.g. “it is”, “going to” • Combinations of all of the above
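The sub-word inventories above can be derived mechanically from a word's phoneme string. The following sketch (hypothetical helper functions, using "," for silence as in the “Stop” example above) reproduces the diphone and context-dependent units shown earlier.

```python
def diphones(phonemes, sil=","):
    """Diphone units: transitions between adjacent phonemes, padded with silence."""
    padded = [sil] + list(phonemes) + [sil]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

def context_dependent(phonemes, sil=","):
    """Context-dependent phonemes: each phoneme annotated with its left and right neighbour."""
    padded = [sil] + list(phonemes) + [sil]
    return [f"{padded[i]}|{padded[i - 1]}|{padded[i + 1]}" for i in range(1, len(padded) - 1)]

print(diphones("STOP"))           # [',S', 'ST', 'TO', 'OP', 'P,']
print(context_dependent("STOP"))  # ['S|,|T', 'T|S|O', 'O|T|P', 'P|O|,']
```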
Pattern matching • Acoustic Model: returns a score for each incoming feature vector indicating how well the feature corresponds to the model (= local score) • Calculate the score of a word, indicating how well the word matches the string of incoming features (Viterbi algorithm) • Search algorithm: looks for the best-scoring word or word sequence
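A minimal sketch of the Viterbi step, assuming the local scores per frame and per unit have already been computed by the acoustic model; the model topology here (left-to-right, stay or advance one state) is a simplifying assumption.

```python
import numpy as np

def viterbi_word_score(local_scores):
    """Best alignment score of a frame stream against a left-to-right sequence of units.

    local_scores[t][s]: log-score of frame t under unit/state s (from the acoustic model).
    At every frame the path either stays in the current state or advances to the next.
    """
    local_scores = np.asarray(local_scores, dtype=float)
    n_frames, n_states = local_scores.shape
    best = np.full(n_states, -np.inf)
    best[0] = local_scores[0, 0]                          # must start in the first state
    for t in range(1, n_frames):
        stay = best
        advance = np.concatenate(([-np.inf], best[:-1]))  # come from the previous state
        best = np.maximum(stay, advance) + local_scores[t]
    return best[-1]                                       # must end in the last state
```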
Language Model • Describes how words are connected to form a sentence • Limits possible word sequences • Reduces the number of recognition errors by eliminating unlikely sequences • Increases the speed of the recognizer => real-time implementations
Language Model • Two major types • Grammar based:
!start <sentence>;
<sentence>: <yes> | <no>;
<yes>: yes | yep | yes please ;
<no>: no | no thanks | no thank you ;
• Statistical • Probability of single words and 2/3-word sequences • Derived from frequencies in a large corpus
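For the statistical type, a minimal bigram sketch (add-one smoothing is an illustrative assumption): word-pair probabilities are estimated from frequencies in a corpus and can then be used to score candidate word sequences.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams in a corpus of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """P(word | prev) estimated from corpus frequencies, with add-one smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

corpus = [["yes", "please"], ["no", "thank", "you"], ["yes"]]
unigrams, bigrams = train_bigram(corpus)
print(bigram_prob(unigrams, bigrams, "yes", "please"))  # probability of "please" after "yes"
```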
Active Vocabulary • Lists the words that can be recognized by the acoustic model and that are allowed to occur given the language model • Each word is associated with a phonetic transcription • Enumerated, and/or • Generated by a Grapheme-to-Phoneme (G2P) module
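A toy sketch of that lookup (the lexicon entries and the fallback rule are invented for illustration): enumerated transcriptions, possibly several per word, with a G2P fallback for words not in the lexicon.

```python
# Enumerated lexicon: each active-vocabulary word may have several phonetic transcriptions
LEXICON = {
    "stop": ["s t O p"],
    "tomato": ["t @ m A: t oU", "t @ m eI t oU"],   # multiple pronunciations
}

def naive_g2p(word):
    """Placeholder grapheme-to-phoneme rule: one symbol per letter (real G2P is far richer)."""
    return [" ".join(word.lower())]

def transcriptions(word):
    """Transcriptions for a word: enumerated in the lexicon and/or generated by G2P."""
    return LEXICON.get(word.lower(), naive_g2p(word))

print(transcriptions("stop"))   # ['s t O p']      (enumerated)
print(transcriptions("start"))  # ['s t a r t']    (G2P fallback)
```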
Post Processing • Re-ordering of the N-best list using other criteria, e.g. account numbers, telephone numbers • Spelling: name search from a list of known names • Applying NLP techniques that fall outside the scope of the statistical language model • E.g. “three dollars fifty cents” → “$ 3.50” • “doctor Jones” → “Dr. Jones” • Etc.
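A toy sketch of such rewrites (the two patterns are invented for illustration; real post-processing uses much larger rule sets or models):

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "fifty": 50}

def normalize(text):
    """Rewrite a few spoken-form patterns into display form."""
    text = re.sub(r"\bdoctor\b", "Dr.", text, flags=re.IGNORECASE)
    match = re.search(r"(\w+) dollars (\w+) cents", text)
    if match and match.group(1) in NUMBER_WORDS and match.group(2) in NUMBER_WORDS:
        amount = f"$ {NUMBER_WORDS[match.group(1)]}.{NUMBER_WORDS[match.group(2)]:02d}"
        text = text.replace(match.group(0), amount)
    return text

print(normalize("three dollars fifty cents"))  # $ 3.50
print(normalize("doctor Jones"))               # Dr. Jones
```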
Training of Acoustic Models [Diagram: Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model]
Training of Acoustic Models • Database design • Coverage of units: word, phoneme, context-dependent unit • Coverage of population (region, dialect, age, …) • Coverage of environments (car, telephone, office, …) • Database collection and validation • Checking recording quality • Annotation: describing what people said, extra-speech sounds • Dictionaries • Phonetic transcription of words • Multiple transcriptions needed • G2P: automatic transcription
Feature vectors [Figure: a stream of feature vectors, e.g. [2.1, -0.2, 1.9, …, -0.3], [10.3, 1.2, -0.9, …, 0.2], [8.1, -0.5, 1.3, …, 0.2], …]
Example: discrete models • A collection of prototypes is constructed (100 to 250) • Each vector is replaced by its nearest prototype
Feature vectors → Prototypes [Figure: the stream of feature vectors (e.g. [2.1, -0.2, 1.9, …, -0.3], [10.3, 1.2, -0.9, …, 0.2], [8.1, -0.5, 1.3, …, 0.2], …) is replaced by a stream of prototype indices (e.g. 39, 7, …)]
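A minimal sketch of the prototype idea (plain k-means is an assumed way to build the codebook; discrete-model training may construct it differently): build 100-250 prototype vectors, then replace every feature vector by the index of its nearest prototype.

```python
import numpy as np

def nearest_prototype(vectors, prototypes):
    """Index of the closest prototype for each vector (vector quantization)."""
    distances = np.linalg.norm(vectors[:, None, :] - prototypes[None, :, :], axis=2)
    return distances.argmin(axis=1)

def build_codebook(vectors, n_prototypes=128, n_iter=20, seed=0):
    """Very plain k-means to obtain a set of prototype vectors (100 to 250 in the slides)."""
    rng = np.random.default_rng(seed)
    prototypes = vectors[rng.choice(len(vectors), n_prototypes, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_prototype(vectors, prototypes)
        for k in range(n_prototypes):
            members = vectors[labels == k]
            if len(members):
                prototypes[k] = members.mean(axis=0)   # move prototype to cluster centre
    return prototypes
```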
Phoneme assignment [Figure: each speech frame carries a phoneme label and a prototype index, e.g. the phoneme string ffrrEEEnnnnddllIIII,,,kkOOOmmpjjuuuuttt$$$rrzz (“friendly computers”) aligned frame by frame with the prototype indices 2276998900023448889211127780128897791237787622]
Training of Acoustic Models • For all utterances in the database: • Make the phonetic transcription of the sentence • Use the current models to segment the utterance file: assign a phoneme to each speech frame • Collect statistical information: count prototype-phoneme occurrences • Create new models (and repeat)
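A minimal sketch of the counting step, assuming the segmentation already yields one phoneme label and one prototype index per frame (as in the “friendly computers” figure above); the normalized counts then become the discrete emission probabilities of the new models.

```python
from collections import Counter, defaultdict

def count_occurrences(utterances):
    """Count how often each prototype is observed while each phoneme is being spoken.

    utterances: iterable of (phoneme_labels, prototype_indices) pairs,
    one label and one index per speech frame.
    """
    counts = defaultdict(Counter)
    for phonemes, prototypes in utterances:
        for phoneme, prototype in zip(phonemes, prototypes):
            counts[phoneme][prototype] += 1
    return counts

def emission_probabilities(counts, n_prototypes):
    """Turn counts into P(prototype | phoneme); add-one smoothing is an assumption."""
    probs = {}
    for phoneme, c in counts.items():
        total = sum(c.values()) + n_prototypes
        probs[phoneme] = {k: (c[k] + 1) / total for k in range(n_prototypes)}
    return probs
```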
Key Element in ASR • ASR is based on learning from observations • Huge amount of spoken data needed for making acoustic models • Huge amount of text data needed for making language models • => Lots of statistics, few rules
Overview • What is a speech database? • How is it used? • What does it contain? • How is it created? • Industrial needs • Technologies and applications
Contents of an ADB • Utterances of different utterance types • Utterance types suited to the intended application domain • Text balanced for phoneme and/or diphone distribution • All enriched with annotations
Contents of an ADB: Spontaneous vs. Read Utterances • A spontaneous utterance is a response to a question or a request • “In which city do you live?” • “Please spell a letter heading to your secretary” • “Is English your mother tongue?” • “Make a hotel reservation” • A read utterance is an utterance read from a presented text • “London” • “Dear John” • “Yes” • “Please book me a room for 2 persons with bath. We will arrive ….”
Contents of an ADB • Isolated Phonetically Rich Word • Apple Tree, Lobster • Isolated Digit • 5 • Isolated Alphabet • B • Isolated number (natural number) • 4256
Contents of an ADB • Continuous Digits • 9 1 1 • Continuous Alphabet • Y M C A • Commands • Stop, left, print, call, next
Contents of an ADB: Connected Digits • Telephone Numbers • 057/228888 • Credit Card Numbers • 3741 959289 310001 • Pin-codes • 8978 • Social Security Number • 560228 561 80 • Other identification numbers, e.g. sheet id • 012589225712
Contents of an ADB: Time and Date Expressions • Time (“analog”, word style) • A quarter past two • Time (“digital”) • 14:15 • 2:15 PM • Date (“analog”, word style, absolute) • Friday, June 25th, 1999 • Christmas Eve, Easter • Date (“digital”, absolute) • 25/06/99 • Date (“analog”, word style, relative) • Tomorrow, next week, in one month