Speech Recognition: It Takes a Village to Raise a Child

Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center

Special thanks to: Stan Chen, Yuqing Gao, Ramesh Gopinath, Makis Potamianos, Bhuvana Ramabhadran, Bowen Zhou, and Geoff Zweig

http://www.nist.gov/speech/tests/rt/rt2003/spring/presentations/rt03s-stt-results-v9.pdf

Developmental Factors
• Bulk of improvements come from better modeling and more data, closely followed by adaptation (a form of modeling)

Continue the Basics: Advances in Gaussian Modeling
• Most current systems model speech as a mixture of diagonal-covariance Gaussians, but there is a nagging suspicion that full-covariance models would be better.
• Try to approximate full-covariance models with a controlled increase in the number of parameters (Axelrod, 2003):


Advances in Gaussian Modeling
• 10k FC (full-covariance) model is better than the 600k model, with 20% of the parameters
• FC models are clearly prone to overtraining. PCGMM (precision-constrained Gaussian mixture model) helps but still increases the number of parameters
• Clearly need lots more acoustic data to train even PCGMM models, much less FC models
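
The parameter trade-off between diagonal and full-covariance mixtures can be made concrete with a little arithmetic. The sketch below counts free parameters per Gaussian (means, covariance entries, one mixture weight). Two things are assumptions, not facts from the slides: the feature dimension (`DIM = 40`) and the reading of the "600k model" as a diagonal-covariance system. Under those assumptions the full-covariance system lands at roughly 18% of the diagonal system's parameters, in the same range as the 20% figure quoted above.

```python
def diag_params(n_gauss: int, dim: int) -> int:
    """Free parameters of a diagonal-covariance GMM.

    Per Gaussian: dim means + dim variances + 1 mixture weight.
    """
    return n_gauss * (2 * dim + 1)


def full_params(n_gauss: int, dim: int) -> int:
    """Free parameters of a full-covariance GMM.

    Per Gaussian: dim means + dim*(dim+1)/2 unique covariance entries
    (the matrix is symmetric) + 1 mixture weight.
    """
    return n_gauss * (dim + dim * (dim + 1) // 2 + 1)


DIM = 40  # ASSUMPTION: feature dimension is not stated in the slides

fc_total = full_params(10_000, DIM)     # 10k full-covariance Gaussians
diag_total = diag_params(600_000, DIM)  # 600k Gaussians, ASSUMED diagonal
ratio = fc_total / diag_total

print(f"full-covariance: {fc_total:,} parameters")
print(f"diagonal:        {diag_total:,} parameters")
print(f"ratio:           {ratio:.1%}")
```

The per-Gaussian cost grows quadratically with dimension for full covariances, which is why overtraining becomes a concern long before the Gaussian count gets large.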

Teens: Utilizing Linguistic Information in ASR
• Past history is littered with failure
• Standard LVCSR does not explicitly use linguistic information
• Over the last few years the area has begun to show signs of life

Syntactic Structured LM
Exploiting syntactic dependencies (Chelba ACL98, Wu ICASSP00)

$$P(w_i \mid W_{i-1}) = \sum_{T_i \in S_i} P(w_i \mid W_{i-1}, T_i)\,\rho(T_i \mid W_{i-1})$$

$$= \sum_{T_i \in S_i} P(w_i \mid w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})\,\rho(T_i \mid W_{i-1})$$

[Figure: partial parse of "The contract ended with a loss of 7 cents after", with part-of-speech tags DT NN VBD IN DT NN IN CD NNS. The exposed headwords are h_{i-2} = "contract" (nt_{i-2} = NP) and h_{i-1} = "ended" (nt_{i-1} = VP); w_{i-2}, w_{i-1}, and w_i mark the two preceding words and the word being predicted.]

• Observe performance improvements ~1% absolute on SWB/BN
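
The structured LM formula is a weighted sum over candidate partial parses: each parse exposes a context (two previous words, two headwords, two non-terminal labels), and the word probabilities under each context are mixed with the parse weights ρ. A minimal sketch follows; every number in it is made up for illustration, the second parse is hypothetical, and taking w_{i-2} = "7" and w_{i-1} = "cents" is an assumption read off the example sentence.

```python
import math
from typing import Dict, List, Tuple

# Context exposed by one partial parse:
# (w_{i-2}, w_{i-1}, h_{i-2}, h_{i-1}, nt_{i-2}, nt_{i-1})
Context = Tuple[str, str, str, str, str, str]


def structured_lm_prob(
    word: str,
    parses: List[Tuple[Context, float]],
    cond_prob: Dict[Tuple[str, Context], float],
) -> float:
    """P(w_i | W_{i-1}) = sum over parses of P(w_i | context) * rho(parse)."""
    if not math.isclose(sum(rho for _, rho in parses), 1.0):
        raise ValueError("parse weights rho must sum to 1")
    return sum(cond_prob.get((word, ctx), 0.0) * rho for ctx, rho in parses)


# Parse with headwords "contract" (NP) and "ended" (VP), as in the figure.
parse_a: Context = ("7", "cents", "contract", "ended", "NP", "VP")
# HYPOTHETICAL competing parse, included only to show the mixture.
parse_b: Context = ("7", "cents", "loss", "cents", "NP", "NP")

parses = [(parse_a, 0.7), (parse_b, 0.3)]  # illustrative rho values
cond_prob = {                              # illustrative probabilities
    ("after", parse_a): 0.2,
    ("after", parse_b): 0.05,
}

p_after = structured_lm_prob("after", parses, cond_prob)
print(f"P(after | history) ~ {p_after:.3f}")
```

The point of the headword context is visible even in the toy numbers: the parse that exposes "ended" as a head gives "after" a much higher probability than a context built from nearby words alone.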

Semantic Structured LM
Exploiting semantic dependencies (Erdogan ICSLP02)

[Figure: semantic parse of "I want to book a one way ticket to Houston Texas for tomorrow morning". Word-level tags, in order: I/null want/null to/null book/book a/null one/rt-ow way/rt-ow ticket/flight to/null Houston/city Texas/state for/word tomorrow/day morning/timerng. Constituent labels: S, BOOK, SEGMENT, LOC-TO, RT-OW, LOC, DATE, TIME.]

• Reductions in error rate by 20% for limited domain tasks

Super-Structured LM for LVCSR
[Figure: knowledge sources feeding the LM for the word sequence W_1, ..., W_N: Semantic Parser, Dialogue State, World Knowledge, Speaker (turn, gender, ID), Named Entity, Syntactic Parser]
• Such an LM would clearly require substantially more annotated data than currently available

Nutrition: “There’s no data like more data” --Robert L. Mercer
[Figures: LIMSI: Lamel (2002); RT03 Workshop (BBN)]

Nutrition: “There is no data like more data”
• “SuperStructured LM” probably needs 100x current data
• A large number of linguistic knowledge sources are now available
  • Brown corpus, Penn Treebank (syntactically & semantically annotated)
  • WordNet and FrameNet, Cyc ontologies
  • Online dictionaries and thesauri
  • Text data from WWW
• How to provide the necessary annotation at reasonable cost? May require community effort.

Speed, Data & Computing
[Figure: decoding pipeline with per-stage speed. Acoustic signal → Speech/non-speech segmentation (0.01 xRT) → Speaker Independent Decoding (0.11 xRT) → Adaptive Transforms (0.1 xRT) → Speaker-Adaptive Decoding (0.63 xRT) → Words. Total: 0.85 xRT]
• Decoding today is fast
• ML training even faster
• Discriminative training same as decoding
  • ~5-10xRT for numerous iterations
• But
  • The data is growing, e.g. the EARS program aiming for
    • 2000 hrs/year telephony
    • 5000 hrs/year BN
    • ~10x increase from current
  • Evidence suggests that new & costlier algorithms are necessary to exploit more data
• So
  • Need minimum 10x increase in compute power just to track data
  • 100x to run 10xRT programs rather than 1xRT programs
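
The compute argument on this slide is simple multiplication, and it can be checked directly. The sketch below sums the per-stage real-time factors from the pipeline figure and then works out the processing hours implied by the EARS data targets. The function and variable names are illustrative, and the "one pass over one year of data" framing is an assumption made to keep the arithmetic simple.

```python
# Real-time factors (xRT) per stage, as listed on the slide.
STAGE_XRT = {
    "speech/non-speech segmentation": 0.01,
    "speaker-independent decoding": 0.11,
    "adaptive transforms": 0.10,
    "speaker-adaptive decoding": 0.63,
}


def total_xrt(stages: dict) -> float:
    """Overall real-time factor of a sequential pipeline."""
    return sum(stages.values())


def compute_hours(audio_hours: float, xrt: float) -> float:
    """Processing hours for one pass over the audio at a given xRT."""
    return audio_hours * xrt


def compute_scale(data_growth: float, old_xrt: float, new_xrt: float) -> float:
    """Factor by which compute must grow when data and per-hour cost both grow."""
    return data_growth * (new_xrt / old_xrt)


EARS_HOURS_PER_YEAR = 2000 + 5000  # telephony + BN targets from the slide

pipeline = total_xrt(STAGE_XRT)
print(f"pipeline speed: {pipeline:.2f} xRT")
print(f"one pass at pipeline speed: {compute_hours(EARS_HOURS_PER_YEAR, pipeline):,.0f} hours")
print(f"one pass at 10 xRT: {compute_hours(EARS_HOURS_PER_YEAR, 10):,.0f} hours")
print(f"10x data at 10 xRT instead of 1 xRT: {compute_scale(10, 1, 10):.0f}x compute")
```

Training typically makes many passes over the data, so the per-pass figures are a lower bound on what a costlier algorithm would need.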

The BlueGene Frontier
• 200 TeraFlop computer
  • Combined power of top 500 supercomputers in 2001
• 65,000 processors
  • 2GB per processor
  • ~1GHz clock
  • 3D torus interconnection
  • 2 nodes per card; 16 cards per board; 16 boards per plane; 2 planes per rack
• Pieces beginning to be tested
• Intended for molecular dynamics, but available for other uses
• Potential ASR applications
  • Physics based articulatory modeling
  • Brute-force parameter adjustment to minimize WER
  • Large scale neural network modeling
  • Incorporation of Visual Processing

The Village: Collaborative Paradigms
• Originally, progress in ASR was haphazard: there was no way to compare results, and claims generated a lot of skepticism because of NIH (not-invented-here) syndrome
• Evaluation-driven ASR programs (Prior DARPA, DOD Paradigms)
  • Provided a common metric to compare algorithms
  • Funding based on relative performance of each site
    • Sites hope not only to do well, but for other sites to do badly
    • Discourages free exchange of resources between sites
    • Large portion of each site’s effort spent replicating other sites’ algorithms + data
• Non-evaluation driven programs: NGSW, MALACH
• How to encourage collaboration while retaining the motivation of competition?
  • While also retaining objective evaluation of progress
• Recent EARS program a strong step in this direction
• Even broader collaboration possible through sharing resources

Encouraging Collaboration
• Decompose ASR systems into modules
  • Front end; acoustic model; pronunciation model; language model; adaptation; search; etc.
• Sites collaborate to create a single ASR system rather than one per site
• Each site works on writing better ASR modules, rather than complete ASR systems
• Each module (e.g., MMIE, VTLN, etc.) need only be implemented once across all sites
• Progress measured and credit assigned to sites based on how modules affect WER of global system
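
Crediting sites by each module's effect on the global system's WER presupposes a shared WER computation. The sketch below shows the standard word-level edit-distance definition of WER and one possible crediting rule: the absolute WER reduction when a module is swapped into the shared system. The crediting rule, the module names, and the WER values are all assumptions for illustration; the slides do not specify how credit would be computed.

```python
from typing import Dict


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def module_credit(baseline_wer: float, wer_with_module: Dict[str, float]) -> Dict[str, float]:
    """ASSUMED rule: credit = absolute WER reduction relative to the shared baseline."""
    return {name: baseline_wer - w for name, w in wer_with_module.items()}


print(wer("the contract ended with a loss", "the contract ended with loss"))

# Hypothetical numbers: global-system WER with each site's module swapped in.
credits = module_credit(0.30, {"site_A_VTLN": 0.28, "site_B_MMIE": 0.27})
for name, gain in sorted(credits.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:+.3f} absolute WER reduction")
```

A per-module delta like this ignores interactions between modules (gains rarely add up), so a real scheme would likely need a fixed evaluation order or a joint attribution method.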

Existence Proof: UIMA (Unstructured Information Management Architecture)
http://www.research.ibm.com/people/a/aspector/presentations/www2000f.pdf
Charts courtesy of David Ferrucci
• Accelerate Progress in Search and Analysis
• Reuse across teams
• Ease of experimentation
• Combination Hypothesis

Text analysis through a series of annotators
Annotators: Analyze, Recognize & Label specific semantic content for next consumer
[Figure: chain of annotators operating at document level, word level, and phrase level]
• Detagger: Document with HTML tags identified and content extracted
• Language Identification: Document labeled with Language of text
• Tokenizer: Document with tokens (e.g., words) identified
• Part of Speech: Word labeled with its part of speech
• Named-Entities: Document with Name identified
• Semantic Classes: Semantic Classes identified
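
The annotator idea, where each stage labels the document and hands it to the next consumer, can be sketched in a few lines. This is not the UIMA API: the `Document` class, the annotator functions, and `run_pipeline` are hypothetical names invented for illustration, and the "named entity" stage is a toy heuristic standing in for a real annotator.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Raw input plus whatever annotations the stages have added so far."""
    raw: str
    text: str = ""
    annotations: Dict[str, List[str]] = field(default_factory=dict)


Annotator = Callable[[Document], Document]


def detagger(doc: Document) -> Document:
    """Strip HTML-like tags and normalize whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", doc.raw)
    doc.text = " ".join(without_tags.split())
    return doc


def tokenizer(doc: Document) -> Document:
    """Split the detagged text into word tokens."""
    doc.annotations["tokens"] = re.findall(r"\w+", doc.text)
    return doc


def capitalized_names(doc: Document) -> Document:
    """Toy stand-in for a named-entity annotator: capitalized non-initial tokens."""
    tokens = doc.annotations["tokens"]  # consumes the tokenizer's output
    doc.annotations["names"] = [
        tok for i, tok in enumerate(tokens) if i > 0 and tok[0].isupper()
    ]
    return doc


def run_pipeline(raw: str, annotators: List[Annotator]) -> Document:
    """Apply annotators in order; each one sees the labels added before it."""
    doc = Document(raw=raw)
    for annotate in annotators:
        doc = annotate(doc)
    return doc


doc = run_pipeline(
    "<p>I want a ticket to <b>Houston</b> Texas</p>",
    [detagger, tokenizer, capitalized_names],
)
print(doc.text)
print(doc.annotations["tokens"])
print(doc.annotations["names"])
```

Because every stage shares one signature, a site could replace a single annotator without touching the rest of the chain, which is the property the collaboration proposal wants for ASR modules.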

[Figure: UIMA architecture diagram. Labeled components: Acquisition (Crawlers); Unstructured Information; Unstructured Information Analysis (Collection Processing Manager, document-level (Text) Analysis Engines, Collection Level Analyses); Access (Semantic Search Engine, Knowledge Source Access); UIM Application and Client/User; Document, Collection & Metadata Store (Documents, Collections, Metadata, Indices); Structured Information (Knowledge & Data Bases, Indices); Component Discovery (Analysis Engine Directory, Knowledge Source Adapter Directory)]

Possible Collaboration Discussion Points
• Sharing data and object files seems reasonable
• Speech community needs to design:
  • Standard file formats supporting rich annotation
  • Stable, general, open-source C++ interfaces for front end modules, acoustic models, LMs, etc.
  • Rich tool set
    • Port competitive trainer, decoder, adaptation into this framework; create basic file manipulation tools
• Can we ride on top of existing architectures such as UIMA?

Summary
• To help our “speech recognition” child develop, continue basic successful approaches of the past:
  • Better Modeling
  • More Data
• Increasing difficulty of problem requires focus on community-wide collaboration for both algorithms and resources
