ASRU 2007 Survey Presented by Shih-Hsiang
Outline • LVCSR • Building a Highly Accurate Mandarin Speech Recognizer • Univ. of Washington, SRI, ICSI, NTU • Development of the 2007 RWTH Mandarin LVCSR System • RWTH • The TITECH Large Vocabulary WFST Speech Recognition System • Tokyo Institute of Technology • Development of a Phonetic System for Large Vocabulary Arabic Speech Recognition • Cambridge • Uncertainty in Training Large Vocabulary Speech Recognizers (Focus on Graphical Model) • Univ. of Washington • Advances in Arabic Broadcast News Transcription at RWTH • RWTH • The IBM 2007 Speech Transcription System for European Parliamentary Speeches (Focus on Language Adaptation) • IBM, Univ. of Southern California • An Algorithm for Fast Composition of Weighted Finite-State Transducers • Univ. of Saarland, Univ. of Karlsruhe
Outline (cont.) • A Mandarin Lecture Speech Transcription System for Speech Summarization • Hong Kong Univ. of Science and Technology
Outline (cont.) • Spoken Document Retrieval and Summarization • Fast Audio Search using Vector-Space Modeling • IBM • Soundbite Identification Using Reference and Automatic Transcripts of Broadcast News Speech • Univ. of Texas at Dallas • A System for Speech Driven Information Retrieval • Universidad de Valladolid • SPEECHFIND for CDP: Advances in Spoken Document Retrieval for the U.S. Collaborative Digitization Program • Univ. of Texas at Dallas • A Study of Lattice-Based Spoken Term Detection for Chinese Spontaneous Speech (Spoken Term Detection) • Microsoft Research Asia
Outline (cont.) • Speaker Diarization • Never-Ending Learning System for On-Line Speaker Diarization • NICT-ATR • Multiple Feature Combination to Improve Speaker Diarization of Telephone Conversations • Centre de Recherche Informatique de Montreal • Efficient Use of Overlap Information in Speaker Diarization • Univ. of Washington • Others • SENSEI: Spoken English Assessment for Call Center Agents • IBM • The LIMSI QAST Systems: Comparison Between Human and Automatic Rules Generation for Question-Answering on Speech Transcriptions • LIMSI • Topic Identification from Audio Recordings using Word and Phone Recognition Lattices • MIT
Reference • [Ref 1] X. Lei, et al., “Improved tone modeling for Mandarin broadcast news speech recognition,” in Proc. Interspeech, 2006 • [Ref 2] F. Valente and H. Hermansky, “Combination of acoustic classifiers based on Dempster-Shafer theory of evidence,” in Proc. ICASSP, 2007 • [Ref 3] A. Zolnay, et al., “Acoustic feature combination for robust speech recognition,” in Proc. ICASSP, 2005 • [Ref 4] F. Wessel, et al., “Explicit word error minimization using word hypothesis posterior probabilities,” in Proc. ICASSP, 2001
Corpora Description • Acoustic Data • 866 hours of speech data collected by LDC (Training) • Mandarin Hub4 (30 hours), TDT4 (89 hours), and GALE Year 1 (747 hours) corpora for training the acoustic models • Spanning from 1997 through July 2006, from shows on CCTV, RFA, NTDTV, PHOENIX, ANHUI, and so on • Tested on three different test sets • DARPA EARS RT-04 evaluation set (eval04), DARPA GALE 2006 evaluation set (eval06), and GALE 2007 development set (dev07)
Corpora Description (cont.) • Text Corpora • The transcriptions of the acoustic training data, LDC Mandarin Gigaword corpus, GALE-related Chinese web text releases, and so on (1 billion words) • Lexicon • Step 1: Start from the BBN-modified LDC Chinese word lexicon and manually augment it with a few thousand new words (both Chinese and English), giving about 70,000 words • Step 2: Re-segment the text corpora with a longest-first match algorithm (see the sketch below) and train a unigram LM; keep the most frequent 60,000 words • Step 3: Use ML word segmentation on the training text to extract the out-of-vocabulary (OOV) words • Step 4: Retrain the N-gram LMs with modified Kneser-Ney smoothing
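A minimal sketch of one common reading of the longest-first match step above, i.e. greedy left-to-right longest match against the lexicon; the maximum word length, the fallback to single characters, and the toy lexicon are illustrative assumptions rather than the paper's exact segmenter:

def longest_match_segment(text, lexicon, max_word_len=8):
    """Greedy longest-match segmentation: at each position take the longest
    lexicon entry; unmatched characters become single-character tokens
    (candidate OOV material)."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # Try the longest possible span first, then shrink.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            match = text[i]  # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

# Toy example with a hypothetical three-entry lexicon.
print(longest_match_segment("中美根本利益", {"中美", "根本", "利益"}))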
Acoustic Systems • Create two subsystems with approximately the same error rate but with error behaviors as different as possible, so that they compensate for each other • System ICSI • Phoneme Set • 70 phones for pronunciations • Additionally, there is one phone designated for silence, and another one for noises, laughter, and unknown foreign speech (context-independent) • Front-end features (74 dimensions per frame) • 13-dim MFCC, and its first- and second-order derivatives • Spline-smoothed pitch feature, and its first- and second-order derivatives • 32-dim phoneme-posterior features generated by multi-layer perceptrons (MLP)
Acoustic Systems (cont.) System ICSI • Spline-smoothed pitch feature [Ref 1] • Since pitch is present only in voiced segments, F0 needs to be interpolated in unvoiced regions to avoid variance problems in recognition • Interpolate the F0 contour with a piecewise cubic Hermite interpolating polynomial (PCHIP) • PCHIP spline interpolation has no overshoots and less oscillation than conventional spline interpolation • Take the log of F0 • Moving window normalization (MWN) • subtracts the moving average of a long-span window (1-2 secs) • normalizes out phrase-level intonation effects • 5-point moving average (MA) smoothing • smoothing reduces the noise in the F0 features (see the sketch below) • (Figure: raw F0 feature vs. final smoothed pitch feature for the example utterance 符合中美二國根本利益, "in line with the fundamental interests of both China and the US")
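A minimal sketch of this pitch post-processing chain, assuming a 100 frames/sec rate, F0 = 0 marking unvoiced frames, and a 1.5-second normalization window (illustrative choices; the slide only specifies a 1-2 second span):

import numpy as np
from scipy.interpolate import PchipInterpolator

def smooth_pitch(f0, frame_rate=100, norm_window_sec=1.5):
    """PCHIP-interpolate F0 over unvoiced frames, take the log, subtract a
    long-span moving average, then apply 5-point moving-average smoothing."""
    f0 = np.asarray(f0, dtype=float)
    frames = np.arange(len(f0))
    voiced = f0 > 0
    # 1. Interpolate through voiced frames only (extrapolates at the edges).
    interp = PchipInterpolator(frames[voiced], f0[voiced])
    f0_filled = np.maximum(interp(frames), 1.0)  # guard against non-positive values
    # 2. Log compression.
    logf0 = np.log(f0_filled)
    # 3. Moving-window normalization: subtract a long-span moving average.
    win = int(norm_window_sec * frame_rate)
    moving_avg = np.convolve(logf0, np.ones(win) / win, mode="same")
    normed = logf0 - moving_avg
    # 4. 5-point moving-average smoothing.
    return np.convolve(normed, np.ones(5) / 5, mode="same")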
Acoustic Systems (cont.) System ICSI • MLP feature • Provides discriminative phonetic information at the frame level • It involves three main steps • For each frame, concatenate its neighboring 9 frames of PLP and pitch features as the input to an MLP (43*9 inputs, 15,000 hidden units, and 71 output units) • Each output unit of the MLP models the likelihood of the central frame belonging to a certain phone (Tandem phoneme posterior features) • The noise phone is excluded (it is not a very discriminable class) • Next, they separately construct a two-stage MLP where the first stage contains 19 MLPs and the second stage one MLP • The purpose of each MLP in the first stage, with 60 hidden units each, is to identify a different class of phonemes, based on the log energy of a different critical band across a long temporal context (51 frames ~ 0.5 seconds) • The second-stage MLP then combines the information from all of the hidden units (60*19) of the first stage to make a final decision on the phoneme identity of the central frame (8,000 hidden units) (HATs phoneme posterior features) • Finally, the 71-dim Tandem and HATs posterior vectors are combined using the Dempster-Shafer algorithm [Ref 2] (see the sketch below) • A logarithm is then applied to the combined posteriors, followed by principal component analysis (PCA) for decorrelation and dimension reduction
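A minimal sketch of the final combination step: if each posterior vector is treated as a basic probability assignment over singleton phone classes, Dempster's rule of combination reduces to a normalized element-wise product, which is then followed by the log and PCA stages. The actual combination in [Ref 2] is more elaborate, so the dimensions, PCA size, and random stand-in training data below are purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

def combine_posteriors(tandem, hats, eps=1e-10):
    """tandem, hats: (num_frames, 71) posterior matrices for the two MLP streams."""
    prod = tandem * hats                      # element-wise evidence product
    prod /= prod.sum(axis=1, keepdims=True)   # renormalize per frame
    return np.log(prod + eps)                 # log-posterior features

# Fit PCA on (stand-in) training frames, then project to 32 dimensions,
# the MLP feature size used alongside MFCC and pitch features.
log_post_train = combine_posteriors(np.random.dirichlet(np.ones(71), 1000),
                                    np.random.dirichlet(np.ones(71), 1000))
pca = PCA(n_components=32)
mlp_features = pca.fit_transform(log_post_train)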
Acoustic Systems (cont.) System ICSI
Acoustic Systems (cont.) System PLP • System-PLP • The system uses 42-dimensional features: static PLP features plus their first- and second-order derivatives • In order to compete with the ICSI model, which has a stronger feature representation, an fMPE feature transform is learned for the PLP model • The fMPE transform is trained by computing the high-dimensional Gaussian posteriors of 5 neighboring frames, given a 3500x32 cross-word tri-phone ML-trained model with an SAT transform (3500*32*5 = 560K) • To tackle spontaneous speech, they additionally introduce a few diphthongs in the PLP model
Acoustic Systems (cont.) • Acoustic model in more detail • Decision-tree based HMM state clustering • 3500 shared states, each with 128 Gaussians • A cross-word tri-phone model with the ICSI-feature is trained with an MPE objective function • SAT feature transform based on 1-class constrained MLLR
Decoding Architecture (cont.) • Acoustic Segmentation • The segmenter is run with a finite-state grammar • It makes use of broad phonetic knowledge of Mandarin and models the input recording with five words • silence, noise, a Mandarin syllable with a voiceless initial, a Mandarin syllable with a voiced initial, and a non-Mandarin word • Each pronunciation phone (bg, rej, I1, I2, F, forgn) is modeled by a 3-state HMM, with 300 Gaussians per state • The minimum speech duration is reduced to 60 ms
Decoding Architecture (cont.) • Auto Speaker Clustering • Using Gaussian mixture models of static MFCC features and K-means clustering • Search with Trigrams and Cross Adaptation • The decoding is composed of three trigram recognition passes • ICSI-SI • Speaker independent (SI) within-word tri-phone MPE-trained ICSI-model and the highly pruned trigram LM gives a good initial adaptation hypothesis quickly • PLP-Adapt • Use the ICSI hypothesis to learn the speaker-dependent SAT transform and to perform MLLR adaptation per speaker, on the cross-word tri-phone SAT+fMPE MPE trained PLP-model • ICSI-Adapt • Using the top 1 PLP hypothesis to adapt the cross-word tri-phone SAT MPE trained ICSI-model
Decoding Architecture (cont.) • Topic-Based Language Model Adaptation • Using a Latent Dirichlet Allocation (LDA) topic model • During decoding, they infer the topic mixture weights dynamically for each utterance • Then select the top few most relevant topics above a threshold, and use their weights in θ to interpolate with the topic-independent N-gram background language model (see the sketch below) • Weight the words in w based on an N-best-list derived confidence measure • Include words not only from the utterance being rescored but also from surrounding utterances in the same story chunk, via a decay factor • The adapted N-gram is then used to rescore the N-best list • When the entire system is applied to eval06, the CER is 15.3%
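A minimal sketch of the interpolation step, assuming callables that return n-gram probabilities; the threshold, the interpolation mass, and the renormalization of topic weights are illustrative assumptions rather than the paper's exact recipe:

def adapted_prob(word, history, background_lm, topic_lms, topic_weights,
                 threshold=0.1, adapt_mass=0.3):
    """background_lm and topic_lms[t] are callables returning P(word | history);
    topic_weights maps topic id -> inferred LDA mixture weight for the utterance."""
    # Keep only the dominant topics and renormalize their weights.
    active = {t: w for t, w in topic_weights.items() if w > threshold}
    total = sum(active.values())
    p_bg = background_lm(word, history)
    if total == 0:
        return p_bg
    p_topic = sum((w / total) * topic_lms[t](word, history)
                  for t, w in active.items())
    # Linear interpolation with the topic-independent background model.
    return (1 - adapt_mass) * p_bg + adapt_mass * p_topic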
Corpora Description • Phoneme Set • The phoneme set is a subset of SAMPA-C • 14 vowels and 26 consonants • Tone information is included following the main-vowel principle • Tones 3 and 5 are merged for all vowels • For the phoneme @', tones 1 and 2 are merged • The resulting phoneme set consists of 81 tonemes, plus an additional garbage phone and silence • Lexicon • Based on the LC-Star Mandarin lexicon (96k words) • Unknown words are segmented into a sequence of known words by applying a longest-match segmenter • Language models are the same as in the Univ. of Washington/SRI system • Recognition experiments use pruned 4-gram LMs • Word graph rescoring uses the full LMs
Acoustic Modeling • The final system consists of four independent subsystems • System 1 (s1) • MFCC features (+ segment-wise CMVN) • For each frame, its neighboring 9 frames are concatenated and projected to a 45-dimensional feature space by LDA (see the sketch below) • A tone feature and its first and second derivatives are also appended to the feature vector • Systems 2 (s2) and 3 (s3) are equal to s1 apart from the base features • s2 uses PLP features • s3 uses gammatone cepstral coefficients • System 4 (s4) starts with the same acoustic front-end as s1, but the features are augmented with phoneme posterior features produced by a neural network • The inputs to the net are multiple-time-resolution features (MRASTA) • The dimension of the phoneme posterior features is reduced to 24 by PCA
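A minimal sketch of the 9-frame stacking plus LDA projection described for s1; the class labels (stand-ins for tied-state alignments), feature dimension, and random data are assumptions for illustration only:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(feats, context=4):
    """feats: (T, D) per-frame features -> (T, D * (2*context+1)) stacked vectors."""
    T, D = feats.shape
    padded = np.vstack([np.repeat(feats[:1], context, axis=0),
                        feats,
                        np.repeat(feats[-1:], context, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Example: 16-dim base features, 9-frame stacking, LDA projection to 45 dims.
base = np.random.randn(2000, 16)
labels = np.random.randint(0, 100, size=2000)   # stand-in tied-state labels
stacked = stack_frames(base)                    # (2000, 144)
lda = LinearDiscriminantAnalysis(n_components=45)
projected = lda.fit_transform(stacked, labels)  # (2000, 45)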
Acoustic Modeling • Acoustic Training • The acoustic models for all systems are based on tri-phones with cross-word context • Modeled by a 3-state left-to-right HMM • Decision-tree based state tying is applied (4,500 generalized tri-phone states) • The filter banks of the MFCC and PLP feature extraction are normalized by applying 2-pass VTLN (not for the s3 system) • Speaker variations are compensated by using SAT/CMLLR • Additionally, in recognition MLLR is applied to update the means of the AMs • MPE is used for discriminative AM training
System Development • Acoustic Feature Combination • The literature contains several ways to combine different feature streams • Concatenate the individual feature vectors • Feed the feature streams into a single LDA • Perform the integration in a log-linear model (see the sketch below) • For smaller amounts of data, the log-linear model combination gives a nice improvement over the simple concatenation approach • But with more training data the benefit declines, and for the 870-hour setup there is no improvement at all
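A minimal sketch of the log-linear combination idea: per-stream acoustic log-scores are combined with stream-specific exponents, which amounts to a weighted sum in the log domain. The weights and the optional renormalization are illustrative assumptions:

import numpy as np

def log_linear_combine(stream_loglikes, weights):
    """stream_loglikes: list of (num_states,) log-score vectors for one frame,
    one entry per feature stream; weights: matching stream exponents."""
    combined = sum(w * ll for w, ll in zip(weights, stream_loglikes))
    # Optional renormalization so the combined scores behave like log-probs.
    combined -= np.logaddexp.reduce(combined)
    return combined

# Example: MFCC-based and PLP-based state scores with weights 0.6 / 0.4.
mfcc_scores = np.log(np.random.dirichlet(np.ones(10)))
plp_scores = np.log(np.random.dirichlet(np.ones(10)))
print(log_linear_combine([mfcc_scores, plp_scores], [0.6, 0.4]))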
System Development (cont.) • Consensus Decoding and System Combination • min.fWER (minimum time frame error) based consensus decoding • min.fWER combination • ROVER with confidence scores • The approximated character boundary times effectively work as well as the boundaries derived from a forced alignment • For almost all experiments, there is no difference between minimizing WER and CER • Only ROVER seems to benefit from switching to characters
Decoding Framework • Multi-Pass Recognition • Pass 1: no adaptation • Pass 2: 2-pass VTLN • Pass 3: SAT/CMLLR • Pass 4: MLLR • Pass 5: LM rescoring • The five passes result in an overall reduction in CER of about 10% relative for eval06 and about 9% for dev07 • The MPE-trained models give a further reduction in CER, resulting in a 12% to 15% relative decrease over all passes • Adding the 358 hours of extra data to the MPE training slightly decreases the CER; the total relative improvement is about 16% for both corpora • LM.v2 (4-gram) outperforms LM.v1 (5-gram) by about 0.8% absolute in CER, consistently for all systems and passes
Decoding Framework (cont.) • Cross Adaptation • Use s4 to adapt s1, s2 and s3
Introduction • The goal is to build a fast, scalable, flexible decoder that operates on weighted finite-state transducer (WFST) search spaces • WFSTs provide a common and natural representation for HMM models, context dependency, pronunciation dictionaries, grammars, and alternative recognition outputs • Within the WFST paradigm, all the knowledge sources in the search space are combined together to form a static search network (see the composition sketch below) • The composition often happens off-line before decoding, and there exist powerful operations to manipulate and optimise the search networks • The fully composed networks can often be very large, so large amounts of memory can be required at both composition and decode time • Solution: on-the-fly composition of the network, disk-based search networks, and so on
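For reference, a minimal sketch of composition for epsilon-free WFSTs: states of the composed machine are pairs of states, an arc is created wherever an output label in the first machine matches an input label in the second, and weights are added (tropical semiring). Real toolkits such as OpenFst additionally handle epsilon transitions, label sorting, and lazy (on-the-fly) expansion; the dictionary-based representation here is only an illustrative assumption:

from collections import defaultdict

def compose(A, B):
    """A, B: dicts with 'start', 'finals' (set), and 'arcs':
    {state: [(in_label, out_label, weight, next_state), ...]}."""
    start = (A['start'], B['start'])
    arcs = defaultdict(list)
    finals = set()
    stack, visited = [start], {start}
    while stack:
        qa, qb = stack.pop()
        if qa in A['finals'] and qb in B['finals']:
            finals.add((qa, qb))
        for (ia, oa, wa, na) in A['arcs'].get(qa, []):
            for (ib, ob, wb, nb) in B['arcs'].get(qb, []):
                if oa == ib:                     # match A's output with B's input
                    nxt = (na, nb)
                    arcs[(qa, qb)].append((ia, ob, wa + wb, nxt))
                    if nxt not in visited:
                        visited.add(nxt)
                        stack.append(nxt)
    return {'start': start, 'finals': finals, 'arcs': dict(arcs)}

# Toy example: a one-arc "lexicon" composed with a one-arc "grammar".
L = {'start': 0, 'finals': {1}, 'arcs': {0: [('k-ae-t', 'cat', 0.5, 1)]}}
G = {'start': 0, 'finals': {1}, 'arcs': {0: [('cat', 'cat', 1.2, 1)]}}
print(compose(L, G)['arcs'])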
Evaluations • Evaluations were carried out using the Corpus of Spontaneous Japanese (CSJ) • Contains a total of 228 hours of training data from 953 lectures • 38-dimensional feature vectors with a 10ms frame rate and 25ms window size • The language model was a back-off trigram with a vocabulary of 25k words • On the test data, the language model perplexity was 57.8 and the out-of-vocabulary rate was 0.75% • 2,328 utterances spanning 10 lectures • The experiments were conducted on a 2.40GHz Intel Core2 machine with 2GB of memory and an Nvidia 8800GTX graphics processor running Linux
Evaluations (cont.) • HLevel and CLevel WFST Evaluations • CLevel (C ∘ L ∘ G) • HLevel (H ∘ C ∘ L ∘ G) • C: context dependency, L: lexicon, G: language model, H: HMM/acoustic models • Recognition experiments were run with the beam width varied from 100 to 200 • CLevel: 2.1M states and 4.3M arcs, requiring 150-170 MB of memory • HLevel: 6.2M states and 7.7M arcs, requiring 330-400 MB of memory • Julius: required 60-100 MB of memory • *For narrow beams the HLevel decoder was slightly faster and achieved a marginally higher accuracy, showing that the better-optimized HLevel networks can be used with small overhead using singleton arcs
Evaluations (cont.) • Multiprocessor Evaluations • The decoder was additionally run in multi-threaded mode using one and two threads to take advantage of both cores in the processor • The multi-threaded decoder using two threads achieves higher accuracy for the same beam when compared to a single-threaded decoder • This is because there are parts of the decoding where each thread uses a local best cost for pruning, not the absolute best cost at that point in time
Introduction • The authors present a two-stage method for fast audio search and spoken term detection • A vector-space modeling approach retrieves a short list of candidate audio segments for a query • Word-lattice based • The list of candidate segments is then searched using a word-based index for known words and a phone-based index for out-of-vocabulary words
Lattice-Based Indexing for VSM • For vector-space modeling, it is necessary to extract an unordered list of terms of interest from each document in the database • Raw count, TF/IDF, etc. • To accomplish this for lattices, the expected count of each term can be extracted: E[C(wi)] = Σ_{l∈L} P(l|O) · Cl(wi), where Cl(wi) is the count of term wi in path l, L is the complete set of paths in the lattice, and P(l|O) is the posterior probability of path l (see the sketch below) • The training documents use reference transcripts instead of lattices or the 1-best output of a recognizer • The unordered list of terms is also extracted from the most frequently occurring 1-gram tokens in the training documents • However, this does not account for OOV terms in a query
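A minimal sketch of the expected-count computation by explicit path enumeration (only feasible for tiny lattices; real indexers compute the same quantity with forward-backward over lattice arcs). The path-list representation and toy lattice are illustrative assumptions:

import math

def expected_counts(paths):
    """paths: list of (log_score, [word, ...]) covering every path in the lattice."""
    # Turn path log scores into posterior probabilities over paths.
    norm = math.log(sum(math.exp(s) for s, _ in paths))
    expected = {}
    for log_score, words in paths:
        posterior = math.exp(log_score - norm)
        for w in set(words):
            expected[w] = expected.get(w, 0.0) + posterior * words.count(w)
    return expected

# Toy lattice with two competing paths.
paths = [(-1.0, ["fast", "audio", "search"]),
         (-2.0, ["fast", "auto", "search"])]
print(expected_counts(paths))  # "fast" and "search" get count 1.0; the others split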
Experimental Results • Experiment Setup • Broadcast news audio search task • 2.79 hours / 1,107 query terms • 1,408 segments • Two ASR systems • ASR System 1: 250K, SI+SA decode • 6,000 quinphone context-dependent states, 250K Gaussians • ASR System 2: 30K, SI-only decode • 3,000 triphone context-dependent states, 30K Gaussians • Both systems use a 4-gram language model, built from a 54M n-gram corpus
Introduction • Soundbite identification in broadcast news is important for locating information • useful for question answering, mining opinions of a particular person, and enriching speech recognition output with quotation marks • This paper presents a systematic study of this problem under a classification framework • Problem formulation for classification • Feature extraction • The effect of using automatic speech recognition (ASR) output • Automatic sentence boundary detection
Classification Framework for Soundbite Identification • A support vector machine (SVM) classifier is used
Classification Framework for Soundbite Identification (cont.) • Problem formulation • Binary classification • Soundbite versus not • Three-way classification • Anchor, reporter, or soundbite • Feature Extraction (each speech turn is represented as a feature vector) • Lexical features • LF-1 • Unigram and bigram features in the first and the last sentence of the current speech turn (indicative of speaker roles) • LF-2 • Unigram and bigram features from the last sentence of the previous turn and from the first sentence of the following turn (capturing the functional transition among different speakers) • Structural features • Number of words in the current speech turn • Number of sentences in the current speech turn • Average number of words per sentence in the current speech turn
Classification Framework for Soundbite Identification (cont.) • Feature Weighting • Notation • N is the number of speech turns in the training collection • M is the total number of features • fik is the frequency of feature φi in the k-th speech turn • ni denotes the number of speech turns containing feature φi • F(φi) is the frequency of feature φi in the collection • wik is the weight assigned to feature φi in the k-th turn • Weighting schemes compared (see the sketch below) • Frequency weighting • TF-IDF weighting • TF-IWF weighting • Entropy weighting
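For concreteness, a minimal sketch of the listed weighting schemes in their common textbook forms, using the notation above; the paper's exact variants may differ:

import math

def tf_idf(f_ik, N, n_i):
    """Term frequency times inverse turn (document) frequency."""
    return f_ik * math.log(N / n_i) if n_i > 0 else 0.0

def tf_iwf(f_ik, total_tokens, F_i):
    """Inverse word frequency: features that are rare in the whole
    collection get larger weights."""
    return f_ik * math.log(total_tokens / F_i) if F_i > 0 else 0.0

def entropy_weight(f_ik, f_i_per_turn, N):
    """f_i_per_turn: list of f_ij over all turns j; features spread evenly
    across turns (high entropy) are down-weighted."""
    F_i = sum(f_i_per_turn)
    ent = sum((f / F_i) * math.log(f / F_i) for f in f_i_per_turn if f > 0)
    return f_ik * (1.0 + ent / math.log(N))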
Experimental Results • Experimental Setup • TDT4 Mandarin broadcast news data • 335 news shows • Performance Measure • Accuracy, precision, recall, f-measure
Experimental Results • Comparison of Different Weighting Methods • Weightings that use global information generally perform much better than those that simply use local information • Different problem formulations seem to prefer different weighting methods • Entropy-based weighting is more theoretically motivated and seems to be a promising weighting choice • Contribution of Different Types of Features • Adding contextual features improves the performance • Removing low-frequency features (i.e., Cutoff-1) helps in classification
Experimental Results • Impact of Using ASR Output • Speech recognition errors hurt the system performance • Automatic sentence boundary detection degrades performance even more • The three-way classification strategy generally outperforms the binary setup • REF: human transcripts; ASR_ASB: ASR output with automatic sentence segmentation; ASR_RSB: ASR output with reference (manual) sentence segmentation
Introduction • Speech-driven information retrieval is a more difficult task than text-based information retrieval • Spoken queries contain less redundancy to overcome speech recognition errors • Longer queries are more robust to errors than shorter ones • Three types of errors affect retrieval performance • Out-of-vocabulary (OOV) words • Errors produced by words in a foreign language • Regular speech recognition errors • Solutions • OOV problem: a two-pass strategy to adapt the lexicon and LMs • Foreign-word problem: add the pronunciations of foreign words to the pronunciation lexicon
System Overview • IR engine: vector-space model (VSM) with Rocchio's pseudo-relevance feedback (see the sketch below)
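A minimal sketch of Rocchio-style pseudo-relevance feedback in a VSM: the query vector is shifted toward the centroid of the top-ranked (assumed relevant) documents. The alpha/beta weights and the omission of a negative (non-relevant) term are common simplifications, not necessarily this system's exact settings:

import numpy as np

def rocchio_expand(query_vec, doc_vecs, ranked_ids, top_k=10,
                   alpha=1.0, beta=0.75):
    """query_vec: (V,) term-weight vector; doc_vecs: (num_docs, V);
    ranked_ids: document indices sorted by first-pass retrieval score."""
    pseudo_relevant = doc_vecs[ranked_ids[:top_k]]
    centroid = pseudo_relevant.mean(axis=0)
    # Move the query toward the centroid of the pseudo-relevant documents.
    return alpha * query_vec + beta * centroid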