ASRU 2007 Survey Presented by Shih-Hsiang
Outline • LVCSR • Building a Highly Accurate Mandarin Speech Recognizer • Univ. of Washington, SRI, ICSI, NTU • Development of the 2007 RWTH Mandarin LVCSR System • RWTH • The TITECH Large Vocabulary WFST Speech Recognition System • Tokyo Institute of Technology • Development of a Phonetic System for Large Vocabulary Arabic Speech Recognition • Cambridge • Uncertainty in Training Large Vocabulary Speech Recognizers (Focus on Graphical Model) • Univ. of Washington • Advances in Arabic Broadcast News Transcription at RWTH • RWTH • The IBM 2007 Speech Transcription System for European Parliamentary Speeches (Focus on Language Adaptation) • IBM, Univ. of Southern California • An Algorithm for Fast Composition of Weighted Finite-State Transducers • Univ. of Saarland, Univ. of Karlsruhe
Outline (cont.) • A Mandarin Lecture Speech Transcription System for Speech Summarization • Hong Kong Univ. of Science and Technology
Outline (cont.) • Spoken Document Retrieval and Summarization • Fast Audio Search using Vector-Space Modeling • IBM • Soundbite Identification Using Reference and Automatic Transcripts of Broadcast News Speech • Univ. of Texas at Dallas • A System for Speech Driven Information Retrieval • Universidad de Valladolid • SPEECHFIND for CDP: Advances in Spoken Document Retrieval for the U.S. Collaborative Digitization Program • Univ. of Texas at Dallas • A Study of Lattice-Based Spoken Term Detection for Chinese Spontaneous Speech (Spoken Term Detection) • Microsoft Research Asia
Outline (cont.) • Speaker Diarization • Never-Ending Learning System for On-Line Speaker Diarization • NICT-ATR • Multiple Feature Combination to Improve Speaker Diarization of Telephone Conversations • Centre de Recherche Informatique de Montreal • Efficient Use of Overlap Information in Speaker Diarization • Univ. of Washington • Others • SENSEI: Spoken English Assessment for Call Center Agents • IBM • The LIMSI QAST Systems: Comparison Between Human and Automatic Rules Generation for Question-Answering on Speech Transcriptions • LIMSI • Topic Identification from Audio Recordings using Word and Phone Recognition Lattices • MIT
Reference • [Ref 1] X. Lei, et al., “Improved tone modeling for Mandarin broadcast news speech recognition,” in Proc. Interspeech, 2006 • [Ref 2] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-Shafer theory of evidence,” in Proc. ICASSP, 2007 • [Ref 3] A. Zolnay, et al., “Acoustic feature combination for robust speech recognition,” in Proc. ICASSP, 2005 • [Ref 4] F. Wessel, et al., “Explicit word error minimization using word hypothesis posterior probabilities,” in Proc. ICASSP, 2001
Corpora Description • Acoustic Data • 866 hours of speech data collected by LDC (Training) • Mandarin Hub4 (30 hours), TDT4 (89 hours), and GALE Year 1 (747 hours) corpora for training the acoustic models • Spanning from 1997 through July 2006, from shows on CCTV, RFA, NTDTV, PHOENIX, ANHUI, and so on • Tested on three different test sets • DARPA EARS RT-04 evaluation set (eval04), DARPA GALE 2006 evaluation set (eval06), and GALE 2007 development set (dev07)
Corpora Description (cont.) • Text Corpora • The transcriptions of the acoustic training data, LDC Mandarin Gigaword corpus, GALE-related Chinese web text releases, and so on (1 billion words) • Lexicon • Step 1: Start from the BBN-modified LDC Chinese word lexicon and manually augment it with a few thousand new words (both Chinese and English), giving about 70,000 words • Step 2: Re-segment the text corpora with a longest-first match algorithm (see the sketch below) and train a unigram LM; keep the most frequent 60,000 words • Step 3: Use ML word segmentation on the training text to extract the out-of-vocabulary (OOV) words • Step 4: Retrain the N-gram LMs with modified Kneser-Ney smoothing
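A minimal sketch of one common reading of the longest-first match step above, i.e. greedy left-to-right longest match against the lexicon; the maximum word length, the fallback to single characters, and the toy lexicon are illustrative assumptions rather than the paper's exact segmenter:

def longest_match_segment(text, lexicon, max_word_len=8):
    """Greedy longest-match segmentation: at each position take the longest
    lexicon entry; unmatched characters become single-character tokens
    (candidate OOV material)."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # Try the longest possible span first, then shrink.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            match = text[i]  # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

# Toy example with a hypothetical three-entry lexicon.
print(longest_match_segment("中美根本利益", {"中美", "根本", "利益"}))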
Acoustic Systems • Create two subsystems with approximately the same error rate but with error behaviors as different as possible, so that they compensate for each other • System ICSI • Phoneme Set • 70 phones for pronunciations • Additionally, there is one phone designated for silence, and another one for noises, laughter, and unknown foreign speech (context-independent) • Front-end features (74 dimensions per frame) • 13-dim MFCC, and its first- and second-order derivatives • Spline-smoothed pitch feature, and its first- and second-order derivatives • 32-dim phoneme-posterior features generated by multi-layer perceptrons (MLP)
Acoustic Systems (cont.) System ICSI • Spline-smoothed pitch feature [Ref 1] • Since pitch is present only in voiced segments, F0 needs to be interpolated in unvoiced regions to avoid variance problems in recognition • Interpolate the F0 contour with a piecewise cubic Hermite interpolating polynomial (PCHIP) • PCHIP spline interpolation has no overshoots and less oscillation than conventional spline interpolation • Take the log of F0 • Moving window normalization (MWN) • subtracts the moving average of a long-span window (1-2 secs) • normalizes out phrase-level intonation effects • 5-point moving average (MA) smoothing • smoothing reduces the noise in the F0 features (see the sketch below) • (Figure: raw F0 feature vs. final smoothed pitch feature for the example utterance 符合中美二國根本利益, "in line with the fundamental interests of both China and the US")
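A minimal sketch of this pitch post-processing chain, assuming a 100 frames/sec rate, F0 = 0 marking unvoiced frames, and a 1.5-second normalization window (illustrative choices; the slide only specifies a 1-2 second span):

import numpy as np
from scipy.interpolate import PchipInterpolator

def smooth_pitch(f0, frame_rate=100, norm_window_sec=1.5):
    """PCHIP-interpolate F0 over unvoiced frames, take the log, subtract a
    long-span moving average, then apply 5-point moving-average smoothing."""
    f0 = np.asarray(f0, dtype=float)
    frames = np.arange(len(f0))
    voiced = f0 > 0
    # 1. Interpolate through voiced frames only (extrapolates at the edges).
    interp = PchipInterpolator(frames[voiced], f0[voiced])
    f0_filled = np.maximum(interp(frames), 1.0)  # guard against non-positive values
    # 2. Log compression.
    logf0 = np.log(f0_filled)
    # 3. Moving-window normalization: subtract a long-span moving average.
    win = int(norm_window_sec * frame_rate)
    moving_avg = np.convolve(logf0, np.ones(win) / win, mode="same")
    normed = logf0 - moving_avg
    # 4. 5-point moving-average smoothing.
    return np.convolve(normed, np.ones(5) / 5, mode="same")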
Acoustic Systems (cont.) System ICSI • MLP feature • Provides discriminative phonetic information at the frame level • It involves three main steps • For each frame, concatenate its neighboring 9 frames of PLP and pitch features as the input to an MLP (43*9 inputs, 15,000 hidden units, and 71 output units) • Each output unit of the MLP models the likelihood of the central frame belonging to a certain phone (Tandem phoneme posterior features) • The noise phone is excluded (it is not a very discriminable class) • Next, they separately construct a two-stage MLP where the first stage contains 19 MLPs and the second stage one MLP • The purpose of each MLP in the first stage, with 60 hidden units each, is to identify a different class of phonemes, based on the log energy of a different critical band across a long temporal context (51 frames ~ 0.5 seconds) • The second-stage MLP then combines the information from all of the hidden units (60*19) of the first stage to make a final decision on the phoneme identity of the central frame (8,000 hidden units) (HATs phoneme posterior features) • Finally, the 71-dim Tandem and HATs posterior vectors are combined using the Dempster-Shafer algorithm [Ref 2] (see the sketch below) • A logarithm is then applied to the combined posteriors, followed by principal component analysis (PCA) for decorrelation and dimension reduction
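A minimal sketch of the final combination step: if each posterior vector is treated as a basic probability assignment over singleton phone classes, Dempster's rule of combination reduces to a normalized element-wise product, which is then followed by the log and PCA stages. The actual combination in [Ref 2] is more elaborate, so the dimensions, PCA size, and random stand-in training data below are purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

def combine_posteriors(tandem, hats, eps=1e-10):
    """tandem, hats: (num_frames, 71) posterior matrices for the two MLP streams."""
    prod = tandem * hats                      # element-wise evidence product
    prod /= prod.sum(axis=1, keepdims=True)   # renormalize per frame
    return np.log(prod + eps)                 # log-posterior features

# Fit PCA on (stand-in) training frames, then project to 32 dimensions,
# the MLP feature size used alongside MFCC and pitch features.
log_post_train = combine_posteriors(np.random.dirichlet(np.ones(71), 1000),
                                    np.random.dirichlet(np.ones(71), 1000))
pca = PCA(n_components=32)
mlp_features = pca.fit_transform(log_post_train)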
Acoustic Systems (cont.) System ICSI
Acoustic Systems (cont.) System PLP • System-PLP • The system uses 42-dimensional features: static PLP features plus their first- and second-order derivatives • In order to compete with the ICSI model, which has a stronger feature representation, an fMPE feature transform is learned for the PLP model • The fMPE transform is trained by computing the high-dimensional Gaussian posteriors of 5 neighboring frames, given a 3500x32 cross-word tri-phone ML-trained model with an SAT transform (3500*32*5 = 560K) • To tackle spontaneous speech, they additionally introduce a few diphthongs in the PLP model
Acoustic Systems (cont.) • Acoustic model in more detail • Decision-tree based HMM state clustering • 3500 shared states, each with 128 Gaussians • A cross-word tri-phone model with the ICSI-feature is trained with an MPE objective function • SAT feature transform based on 1-class constrained MLLR
Decoding Architecture (cont.) • Acoustic Segmentation • The segmenter is run with a finite-state grammar • It makes use of broad phonetic knowledge of Mandarin and models the input recording with five words • silence, noise, a Mandarin syllable with a voiceless initial, a Mandarin syllable with a voiced initial, and a non-Mandarin word • Each pronunciation phone (bg, rej, I1, I2, F, forgn) is modeled by a 3-state HMM, with 300 Gaussians per state • The minimum speech duration is reduced to 60 ms
Decoding Architecture (cont.) • Auto Speaker Clustering • Using Gaussian mixture models of static MFCC features and K-means clustering • Search with Trigrams and Cross Adaptation • The decoding is composed of three trigram recognition passes • ICSI-SI • Speaker independent (SI) within-word tri-phone MPE-trained ICSI-model and the highly pruned trigram LM gives a good initial adaptation hypothesis quickly • PLP-Adapt • Use the ICSI hypothesis to learn the speaker-dependent SAT transform and to perform MLLR adaptation per speaker, on the cross-word tri-phone SAT+fMPE MPE trained PLP-model • ICSI-Adapt • Using the top 1 PLP hypothesis to adapt the cross-word tri-phone SAT MPE trained ICSI-model
Decoding Architecture (cont.) • Topic-Based Language Model Adaptation • Using a Latent Dirichlet Allocation (LDA) topic model • During decoding, they infer the topic mixture weights dynamically for each utterance • Then select the top few most relevant topics above a threshold, and use their weights in θ to interpolate with the topic-independent N-gram background language model (see the sketch below) • Weight the words in w based on an N-best-list derived confidence measure • Include words not only from the utterance being rescored but also from surrounding utterances in the same story chunk, via a decay factor • The adapted N-gram is then used to rescore the N-best list • When the entire system is applied to eval06, the CER is 15.3%
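A minimal sketch of the interpolation step, assuming callables that return n-gram probabilities; the threshold, the interpolation mass, and the renormalization of topic weights are illustrative assumptions rather than the paper's exact recipe:

def adapted_prob(word, history, background_lm, topic_lms, topic_weights,
                 threshold=0.1, adapt_mass=0.3):
    """background_lm and topic_lms[t] are callables returning P(word | history);
    topic_weights maps topic id -> inferred LDA mixture weight for the utterance."""
    # Keep only the dominant topics and renormalize their weights.
    active = {t: w for t, w in topic_weights.items() if w > threshold}
    total = sum(active.values())
    p_bg = background_lm(word, history)
    if total == 0:
        return p_bg
    p_topic = sum((w / total) * topic_lms[t](word, history)
                  for t, w in active.items())
    # Linear interpolation with the topic-independent background model.
    return (1 - adapt_mass) * p_bg + adapt_mass * p_topic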
Corpora Description • Phoneme Set • The phoneme set is a subset of SAMPA-C • 14 vowels and 26 consonants • Tone information is included following the main-vowel principle • Tones 3 and 5 are merged for all vowels • For the phoneme @', tones 1 and 2 are merged • The resulting phoneme set consists of 81 tonemes, plus an additional garbage phone and silence • Lexicon • Based on the LC-Star Mandarin lexicon (96k words) • Unknown words are segmented into a sequence of known words by applying a longest-match segmenter • Language models are the same as in the Univ. of Washington/SRI system • Recognition experiments use pruned 4-gram LMs • Word graph rescoring uses the full LMs
Acoustic Modeling • The final system consists of four independent subsystems • System 1 (s1) • MFCC features (+ segment-wise CMVN) • For each frame, its neighboring 9 frames are concatenated and projected to a 45-dimensional feature space by LDA (see the sketch below) • A tone feature and its first and second derivatives are also appended to the feature vector • Systems 2 (s2) and 3 (s3) are equal to s1 apart from the base features • s2 uses PLP features • s3 uses gammatone cepstral coefficients • System 4 (s4) starts with the same acoustic front-end as s1, but the features are augmented with phoneme posterior features produced by a neural network • The inputs to the net are multiple-time-resolution features (MRASTA) • The dimension of the phoneme posterior features is reduced to 24 by PCA
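A minimal sketch of the 9-frame stacking plus LDA projection described for s1; the class labels (stand-ins for tied-state alignments), feature dimension, and random data are assumptions for illustration only:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(feats, context=4):
    """feats: (T, D) per-frame features -> (T, D * (2*context+1)) stacked vectors."""
    T, D = feats.shape
    padded = np.vstack([np.repeat(feats[:1], context, axis=0),
                        feats,
                        np.repeat(feats[-1:], context, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Example: 16-dim base features, 9-frame stacking, LDA projection to 45 dims.
base = np.random.randn(2000, 16)
labels = np.random.randint(0, 100, size=2000)   # stand-in tied-state labels
stacked = stack_frames(base)                    # (2000, 144)
lda = LinearDiscriminantAnalysis(n_components=45)
projected = lda.fit_transform(stacked, labels)  # (2000, 45)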
Acoustic Modeling • Acoustic Training • The acoustic models for all systems are based on tri-phones with cross-word context • Modeled by a 3-state left-to-right HMM • Decision-tree based state tying is applied (4,500 generalized tri-phone states) • The filter banks of the MFCC and PLP feature extraction are normalized by applying 2-pass VTLN (not for the s3 system) • Speaker variations are compensated by using SAT/CMLLR • Additionally, in recognition MLLR is applied to update the means of the AMs • MPE is used for discriminative AM training
System Development • Acoustic Feature Combination • The literature contains several ways to combine different feature streams • Concatenate the individual feature vectors • Feed the feature streams into a single LDA • Perform the integration in a log-linear model (see the sketch below) • For smaller amounts of data, the log-linear model combination gives a nice improvement over the simple concatenation approach • But with more training data the benefit declines, and for the 870-hour setup there is no improvement at all
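A minimal sketch of the log-linear combination idea: per-stream acoustic log-scores are combined with stream-specific exponents, which amounts to a weighted sum in the log domain. The weights and the optional renormalization are illustrative assumptions:

import numpy as np

def log_linear_combine(stream_loglikes, weights):
    """stream_loglikes: list of (num_states,) log-score vectors for one frame,
    one entry per feature stream; weights: matching stream exponents."""
    combined = sum(w * ll for w, ll in zip(weights, stream_loglikes))
    # Optional renormalization so the combined scores behave like log-probs.
    combined -= np.logaddexp.reduce(combined)
    return combined

# Example: MFCC-based and PLP-based state scores with weights 0.6 / 0.4.
mfcc_scores = np.log(np.random.dirichlet(np.ones(10)))
plp_scores = np.log(np.random.dirichlet(np.ones(10)))
print(log_linear_combine([mfcc_scores, plp_scores], [0.6, 0.4]))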
System Development (cont.) • Consensus Decoding and System Combination • min.fWER (minimum time frame error) based consensus decoding • min.fWER combination • ROVER with confidence scores • The approximated character boundary times effectively work as well as the boundaries derived from a forced alignment • For almost all experiments, there is no difference between minimizing WER and CER • Only ROVER seems to benefit from switching to characters
Decoding Framework • Multi-Pass Recognition • Pass 1: no adaptation • Pass 2: 2-pass VTLN • Pass 3: SAT/CMLLR • Pass 4: MLLR • Pass 5: LM rescoring • The five passes result in an overall reduction in CER of about 10% relative for eval06 and about 9% for dev07 • The MPE-trained models give a further reduction in CER, resulting in a 12% to 15% relative decrease over all passes • Adding the 358 hours of extra data to the MPE training slightly decreases the CER; the total relative improvement is about 16% for both corpora • LM.v2 (4-gram) outperforms LM.v1 (5-gram) by about 0.8% absolute in CER, consistently for all systems and passes
Decoding Framework (cont.) • Cross Adaptation • Use s4 to adapt s1, s2 and s3
Introduction • The goal is to build a fast, scalable, flexible decoder that operates on weighted finite-state transducer (WFST) search spaces • WFSTs provide a common and natural representation for HMM models, context dependency, pronunciation dictionaries, grammars, and alternative recognition outputs • Within the WFST paradigm, all the knowledge sources in the search space are combined together to form a static search network (see the composition sketch below) • The composition often happens off-line before decoding, and there exist powerful operations to manipulate and optimise the search networks • The fully composed networks can often be very large, so large amounts of memory can be required at both composition and decode time • Solution: on-the-fly composition of the network, disk-based search networks, and so on
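For reference, a minimal sketch of composition for epsilon-free WFSTs: states of the composed machine are pairs of states, an arc is created wherever an output label in the first machine matches an input label in the second, and weights are added (tropical semiring). Real toolkits such as OpenFst additionally handle epsilon transitions, label sorting, and lazy (on-the-fly) expansion; the dictionary-based representation here is only an illustrative assumption:

from collections import defaultdict

def compose(A, B):
    """A, B: dicts with 'start', 'finals' (set), and 'arcs':
    {state: [(in_label, out_label, weight, next_state), ...]}."""
    start = (A['start'], B['start'])
    arcs = defaultdict(list)
    finals = set()
    stack, visited = [start], {start}
    while stack:
        qa, qb = stack.pop()
        if qa in A['finals'] and qb in B['finals']:
            finals.add((qa, qb))
        for (ia, oa, wa, na) in A['arcs'].get(qa, []):
            for (ib, ob, wb, nb) in B['arcs'].get(qb, []):
                if oa == ib:                     # match A's output with B's input
                    nxt = (na, nb)
                    arcs[(qa, qb)].append((ia, ob, wa + wb, nxt))
                    if nxt not in visited:
                        visited.add(nxt)
                        stack.append(nxt)
    return {'start': start, 'finals': finals, 'arcs': dict(arcs)}

# Toy example: a one-arc "lexicon" composed with a one-arc "grammar".
L = {'start': 0, 'finals': {1}, 'arcs': {0: [('k-ae-t', 'cat', 0.5, 1)]}}
G = {'start': 0, 'finals': {1}, 'arcs': {0: [('cat', 'cat', 1.2, 1)]}}
print(compose(L, G)['arcs'])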
Evaluations • Evaluations were carried out using the Corpus of Spontaneous Japanese (CSJ) • Contains a total of 228 hours of training data from 953 lectures • 38-dimensional feature vectors with a 10ms frame rate and 25ms window size • The language model was a back-off trigram with a vocabulary of 25k words • On the test data, the language model perplexity was 57.8 and the out-of-vocabulary rate was 0.75% • 2,328 utterances spanning 10 lectures • The experiments were conducted on a 2.40GHz Intel Core2 machine with 2GB of memory and an Nvidia 8800GTX graphics processor running Linux
Evaluations (cont.) • HLevel and CLevel WFST Evaluations • CLevel (C ∘ L ∘ G) • HLevel (H ∘ C ∘ L ∘ G) • C: context dependency, L: lexicon, G: language model, H: HMM/acoustic models • Recognition experiments were run with the beam width varied from 100 to 200 • CLevel: 2.1M states and 4.3M arcs, requiring 150-170 MB of memory • HLevel: 6.2M states and 7.7M arcs, requiring 330-400 MB of memory • Julius: required 60-100 MB of memory • *For narrow beams the HLevel decoder was slightly faster and achieved a marginally higher accuracy, showing that the better-optimized HLevel networks can be used with small overhead using singleton arcs
Evaluations (cont.) • Multiprocessor Evaluations • The decoder was additionally run in multi-threaded mode using one and two threads to take advantage of both cores in the processor • The multi-threaded decoder using two threads achieves higher accuracy for the same beam when compared to a single-threaded decoder • This is because there are parts of the decoding where each thread uses a local best cost for pruning, not the absolute best cost at that point in time
Introduction • The authors present a two-stage method for fast audio search and spoken term detection • A vector-space modeling approach retrieves a short list of candidate audio segments for a query • Word-lattice based • The list of candidate segments is then searched using a word-based index for known words and a phone-based index for out-of-vocabulary words
Lattice-Based Indexing for VSM • For vector-space modeling, it is necessary to extract an unordered list of terms of interest from each document in the database • Raw count, TF/IDF, etc. • To accomplish this for lattices, the expected count of each term can be extracted: E[C(wi)] = Σ_{l∈L} P(l|O) · Cl(wi), where Cl(wi) is the count of term wi in path l, L is the complete set of paths in the lattice, and P(l|O) is the posterior probability of path l (see the sketch below) • The training documents use reference transcripts instead of lattices or the 1-best output of a recognizer • The unordered list of terms is also extracted from the most frequently occurring 1-gram tokens in the training documents • However, this does not account for OOV terms in a query
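A minimal sketch of the expected-count computation by explicit path enumeration (only feasible for tiny lattices; real indexers compute the same quantity with forward-backward over lattice arcs). The path-list representation and toy lattice are illustrative assumptions:

import math

def expected_counts(paths):
    """paths: list of (log_score, [word, ...]) covering every path in the lattice."""
    # Turn path log scores into posterior probabilities over paths.
    norm = math.log(sum(math.exp(s) for s, _ in paths))
    expected = {}
    for log_score, words in paths:
        posterior = math.exp(log_score - norm)
        for w in set(words):
            expected[w] = expected.get(w, 0.0) + posterior * words.count(w)
    return expected

# Toy lattice with two competing paths.
paths = [(-1.0, ["fast", "audio", "search"]),
         (-2.0, ["fast", "auto", "search"])]
print(expected_counts(paths))  # "fast" and "search" get count 1.0; the others split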
Experimental Results • Experiment Setup • Broadcast news audio search task • 2.79 hours / 1,107 query terms • 1,408 segments • Two ASR systems • ASR System 1: 250K, SI+SA decode • 6,000 quinphone context-dependent states, 250K Gaussians • ASR System 2: 30K, SI-only decode • 3,000 triphone context-dependent states, 30K Gaussians • Both systems use a 4-gram language model, built from a 54M n-gram corpus
Introduction • Soundbite identification in broadcast news is important for locating information • useful for question answering, mining opinions of a particular person, and enriching speech recognition output with quotation marks • This paper presents a systematic study of this problem under a classification framework • Problem formulation for classification • Feature extraction • The effect of using automatic speech recognition (ASR) output • Automatic sentence boundary detection
Classification Framework for Soundbite Identification • A support vector machine (SVM) classifier is used
Classification Framework for Soundbite Identification (cont.) • Problem formulation • Binary classification • Soundbite versus not • Three-way classification • Anchor, reporter, or soundbite • Feature Extraction (each speech turn is represented as a feature vector) • Lexical features • LF-1 • Unigram and bigram features in the first and the last sentence of the current speech turn (indicative of speaker roles) • LF-2 • Unigram and bigram features from the last sentence of the previous turn and from the first sentence of the following turn (capturing the functional transition among different speakers) • Structural features • Number of words in the current speech turn • Number of sentences in the current speech turn • Average number of words per sentence in the current speech turn
Classification Framework for Soundbite Identification (cont.) • Feature Weighting • Notation • N is the number of speech turns in the training collection • M is the total number of features • fik is the frequency of feature φi in the k-th speech turn • ni denotes the number of speech turns containing feature φi • F(φi) is the frequency of feature φi in the collection • wik is the weight assigned to feature φi in the k-th turn • Weighting schemes compared (see the sketch below) • Frequency weighting • TF-IDF weighting • TF-IWF weighting • Entropy weighting
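For concreteness, a minimal sketch of the listed weighting schemes in their common textbook forms, using the notation above; the paper's exact variants may differ:

import math

def tf_idf(f_ik, N, n_i):
    """Term frequency times inverse turn (document) frequency."""
    return f_ik * math.log(N / n_i) if n_i > 0 else 0.0

def tf_iwf(f_ik, total_tokens, F_i):
    """Inverse word frequency: features that are rare in the whole
    collection get larger weights."""
    return f_ik * math.log(total_tokens / F_i) if F_i > 0 else 0.0

def entropy_weight(f_ik, f_i_per_turn, N):
    """f_i_per_turn: list of f_ij over all turns j; features spread evenly
    across turns (high entropy) are down-weighted."""
    F_i = sum(f_i_per_turn)
    ent = sum((f / F_i) * math.log(f / F_i) for f in f_i_per_turn if f > 0)
    return f_ik * (1.0 + ent / math.log(N))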
Experimental Results • Experimental Setup • TDT4 Mandarin broadcast news data • 335 news shows • Performance Measure • Accuracy, precision, recall, f-measure
Experimental Results • Comparison of Different Weighting Methods • Weightings that use global information generally perform much better than those that simply use local information • Different problem formulations seem to prefer different weighting methods • Entropy-based weighting is more theoretically motivated and seems to be a promising weighting choice • Contribution of Different Types of Features • Adding contextual features improves the performance • Removing low-frequency features (i.e., Cutoff-1) helps in classification
Experimental Results • Impact of Using ASR Output • Speech recognition errors hurt the system performance • Automatic sentence boundary detection degrades performance even more • The three-way classification strategy generally outperforms the binary setup • REF: human transcripts; ASR_ASB: ASR output with automatic sentence segmentation; ASR_RSB: ASR output with reference (manual) sentence segmentation
Introduction • Speech-driven information retrieval is a more difficult task than text-based information retrieval • Spoken queries contain less redundancy to overcome speech recognition errors • Longer queries are more robust to errors than shorter ones • Three types of errors affect retrieval performance • Out-of-vocabulary (OOV) words • Errors produced by words in a foreign language • Regular speech recognition errors • Solutions • OOV problem: a two-pass strategy to adapt the lexicon and LMs • Foreign-word problem: add the pronunciations of foreign words to the pronunciation lexicon
System Overview • IR engine: vector-space model (VSM) with Rocchio's pseudo-relevance feedback (see the sketch below)
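A minimal sketch of Rocchio-style pseudo-relevance feedback in a VSM: the query vector is shifted toward the centroid of the top-ranked (assumed relevant) documents. The alpha/beta weights and the omission of a negative (non-relevant) term are common simplifications, not necessarily this system's exact settings:

import numpy as np

def rocchio_expand(query_vec, doc_vecs, ranked_ids, top_k=10,
                   alpha=1.0, beta=0.75):
    """query_vec: (V,) term-weight vector; doc_vecs: (num_docs, V);
    ranked_ids: document indices sorted by first-pass retrieval score."""
    pseudo_relevant = doc_vecs[ranked_ids[:top_k]]
    centroid = pseudo_relevant.mean(axis=0)
    # Move the query toward the centroid of the pseudo-relevant documents.
    return alpha * query_vec + beta * centroid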