EE 516 Lecture 1 Geoffrey Zweig Microsoft Research 4/2/2009
Our Topics Introducing today! From JHU 2002 SuperSID Final Presentation – Reynolds et al.
Topic Coverage By Day
• Data Representations and Models (4/23)
  • Vector Quantization
  • Gaussian Mixtures
  • The EM Algorithm
• Speaker Identification (5/7)
• Language Identification (5/7)
• Hidden Markov Models (5/14)
  • Dynamic Programming
• Building a Speech Recognizer (5/14)
Language Identification – Why Do It?
• Multi-lingual society
  • Applications should be able to deal with anyone
• Businesses
  • Automated help systems
  • Reservations, account access, etc.
• Travel
  • Airport kiosks
  • Train stations
• Government
  • Funds research to identify languages
  • Runs evaluations in it
How Do You Do It? [Figure: parallel acoustic models (English, French, …, Tamil); output the likeliest] Gaussian Mixture Models – 4/23
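The "one acoustic model per language, output the likeliest" idea can be sketched as below. This is a minimal illustration, not the course's implementation: it assumes diagonal-covariance GMMs that are already trained, treats frames as independent, and uses made-up 2-D parameters in place of real acoustic features. All function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_frame_loglik(x, gmm):
    """Log p(x | gmm); gmm is a list of (weight, mean, var) components."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # log-sum-exp keeps the mixture sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))

def utterance_loglik(frames, gmm):
    """Sum of per-frame log-likelihoods (frames assumed independent)."""
    return sum(gmm_frame_loglik(x, gmm) for x in frames)

def identify_language(frames, models):
    """Score the utterance with every language model; return the best and all scores."""
    scores = {lang: utterance_loglik(frames, gmm) for lang, gmm in models.items()}
    return max(scores, key=scores.get), scores

# Toy models with invented parameters (assumption, for illustration only).
models = {
    "english": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])],
    "french": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 7.0], [1.0, 1.0])],
    "tamil": [(0.5, [-5.0, 5.0], [1.0, 1.0]), (0.5, [-7.0, 7.0], [1.0, 1.0])],
}
frames = [[5.2, 4.9], [6.8, 7.1], [5.5, 5.4]]
best, scores = identify_language(frames, models)
print(best)
```

In a real system the frames would be cepstral feature vectors and each GMM would be trained (for example with EM) on speech from one language.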
How Do You Do It? (2) “p ih n s” – probably English… “k r p s t” – probably Czech… Simple HMMs – 5/14 Language Models – 4/30 After Zissman 1996
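A rough sketch of the phonotactic idea: once speech has been turned into a phone string, a per-language phone language model scores how typical that string is. The training strings below are invented toy data, the add-one smoothing is an assumption chosen for brevity, and the function names are hypothetical. In practice the phone strings would come from a phone recognizer rather than being typed in.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Estimate add-one-smoothed phone bigram log-probabilities."""
    pair_counts, ctx_counts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            ctx_counts[prev] += 1
    n_next = vocab_size + 1  # any phone, or the end symbol "</s>"

    def logprob(prev, cur):
        return math.log((pair_counts[(prev, cur)] + 1) / (ctx_counts[prev] + n_next))

    return logprob

def sequence_logprob(seq, logprob):
    """Log-probability of a whole phone string under one bigram model."""
    padded = ["<s>"] + seq + ["</s>"]
    return sum(logprob(p, c) for p, c in zip(padded, padded[1:]))

def identify(seq, lms):
    """Return the language whose bigram model scores the phone string highest."""
    return max(lms, key=lambda lang: sequence_logprob(seq, lms[lang]))

# Invented toy phone strings (assumption, not real transcriptions).
train = {
    "english": ["s p ih n", "t ih n s", "p ih t", "n ih p s"],
    "czech": ["k r k", "s t r k", "p r s t", "k r t"],
}
phones = {p for seqs in train.values() for s in seqs for p in s.split()}
lms = {
    lang: train_bigram([s.split() for s in seqs], len(phones))
    for lang, seqs in train.items()
}
print(identify("p ih n s".split(), lms))
print(identify("k r p s t".split(), lms))
```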
How Do You Do It? (3) Same methods, multiple times. Acero et al., Chapter 4 – 4/23. After Zissman 1996
How Do You Do It? (4) Run a complete speech recognizer in each language And we will see several other ways, and combinations! After Zissman 1996
Gauging Progress – The NIST Evaluations
• National Institute of Standards and Technology
• Has sponsored benchmark tests in multiple language processing areas for over a decade
  • Topic Detection & Tracking
  • Content Extraction
  • Video Analysis
  • Speech Recognition
  • Language Identification
  • Speaker Identification
  • Machine Translation
• http://www.itl.nist.gov/iad/mig/tests/
• Coordination with site funding by Defense Advanced Research Projects Agency (DARPA)
• Along with business interest, the driving force in advancing the State-of-the-Art
Language Identification - How Well Can It Be Done – Who Salutes? From NIST 2007 LRE Website
How Well Can it Be Done – What Languages? From NIST 2007 LRE Website
How Well Can It Be Done? – Testing Conditions
• 26 languages and dialects
• Telephone speech
• Multiple duration conditions
  • 3, 10, 30 seconds
• Detection Error Tradeoff (DET) Curves used to measure performance
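A DET curve plots miss probability against false-alarm probability as the decision threshold sweeps over the detector's scores. The sketch below computes those operating points from toy scores and picks an approximate equal error rate. The scores are invented, the function names are hypothetical, and this is not the NIST scoring tool.

```python
def det_points(target_scores, nontarget_scores):
    """Return (threshold, p_miss, p_fa) for every candidate threshold.

    A trial is accepted when its score is >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for th in thresholds:
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        points.append((th, p_miss, p_fa))
    return points

def equal_error_rate(points):
    """Approximate EER: the operating point where p_miss and p_fa are closest."""
    th, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2.0, th

# Invented detector scores (assumption): higher means "more likely the target language".
target = [2.0, 1.5, 1.2, 0.4, 0.9]
nontarget = [-1.0, 0.1, 0.5, -0.3, 1.0]
points = det_points(target, nontarget)
eer, threshold = equal_error_rate(points)
print(eer, threshold)
```

On an actual DET plot both error rates are drawn on a normal-deviate scale, which makes curves from well-behaved systems look close to straight lines.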
How Well Can it Be Done – Some Numbers From NIST 2007 LRE Website
Language Identification Project
• Build a language ID system with the CallFriend data set
• Implement several of the main techniques
• Set up a demo on your laptop that will recognize someone’s language
Flavors of Speaker Recognition Our Focus! From JHU 2002 SuperSID Final Presentation – Reynolds et al.
Speaker Recognition – Why Do It?
• Personal Applications
  • Voice-print passwords
  • Voicemail transcription – who left that message?
• Business Applications
  • Calling your bank
• Government
  • Is that Osama calling from Pakistan?
  • Prison call monitoring
  • Automated parolee calling – is he where you think?
How Do You Do It? The most basic approach: Gaussian Mixture Models - 4/23 More recently: Support vector machines operating on GMMs (!)
How Do You Do It? (2) Also use high-level information! From JHU 2002 SuperSID Final Presentation – Reynolds et al.
How Well Can It Be Done – Who Salutes? From NIST 2008 SRE Presentation, Martin & Greenberg
More Salutes From NIST 2008 SRE Presentation, Martin & Greenberg
From Europe From NIST 2008 SRE Presentation, Martin & Greenberg
More From Europe From NIST 2008 SRE Presentation, Martin & Greenberg
U.S. Entries From NIST 2008 SRE Presentation, Martin & Greenberg
How Well Can It Be Done – Testing Conditions
• Conditions for different amounts of data
  • 10 sec.
  • 3-5 minutes
  • 8 minutes
• Separate channel and summed channel conditions
• English speakers, non-English speakers, multilingual speakers
Speaker Verification Project
• Implement a Speaker-ID system
  • Template based
  • GMM based
  • SVM based
  • Vector space model
• Demonstrate it:
  • NIST data, e.g. 2001 Evaluation
  • Your own voice – implement on laptop
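One common reading of "template based" is to store an enrollment utterance per speaker and compare a test utterance to each template with dynamic time warping (DTW). The sketch below assumes that reading; the slide does not say which template method is intended. The 1-D feature sequences are invented, and the function names are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def identify_speaker(test, templates):
    """Return the enrolled speaker whose template is closest under DTW."""
    return min(templates, key=lambda spk: dtw_distance(test, templates[spk]))

# Invented 1-D "feature" sequences (assumption); real templates would hold cepstral frames.
templates = {
    "alice": [[0.0], [1.0], [2.0]],
    "bob": [[5.0], [4.0], [3.0]],
}
test = [[0.1], [0.9], [1.1], [2.1]]
print(identify_speaker(test, templates))
```

For verification rather than identification, the DTW cost to the claimed speaker's template would be compared against a threshold instead of taking the minimum over speakers.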
Speech Recognition Project
• Implement an HMM based recognition system
• Use, e.g., Phonebook isolated word data set or Aurora digit set
• Write features with existing front-end
• Build your own HMM trainer/decoder
• Set it up on your laptop for online word recognition (?!)
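The decoder half of an isolated-word HMM recognizer can be sketched as follows: score the observation sequence against each word's HMM with the Viterbi algorithm and pick the best-scoring word. This is a simplified illustration with discrete emissions and hand-set toy parameters (both assumptions). A project system would use continuous features from the front-end and trained models, and the function names here are hypothetical.

```python
import math

def viterbi_logprob(obs, start, trans, emit):
    """Log-probability of the best state path through a discrete HMM.

    start[s], trans[r][s], and emit[s][symbol] are probabilities.
    """
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(start)
    delta = [lg(start[s]) + lg(emit[s].get(obs[0], 0.0)) for s in range(n)]
    for o in obs[1:]:
        # Dynamic programming step: best predecessor for each state
        delta = [
            max(delta[r] + lg(trans[r][s]) for r in range(n)) + lg(emit[s].get(o, 0.0))
            for s in range(n)
        ]
    return max(delta)

def recognize(obs, word_models):
    """Return the word whose HMM gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi_logprob(obs, *word_models[w]))

# Two-state left-to-right toy models over symbols "a", "b", "c" (invented parameters).
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
word_models = {
    "yes": (start, trans, [{"a": 0.8, "b": 0.1, "c": 0.1}, {"a": 0.1, "b": 0.8, "c": 0.1}]),
    "no": (start, trans, [{"a": 0.1, "b": 0.1, "c": 0.8}, {"a": 0.8, "b": 0.1, "c": 0.1}]),
}
obs = ["a", "a", "b", "b"]
print(recognize(obs, word_models))
```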
Highlights of Syllabus
• Required Texts:
  • Huang, Acero, Hon: Spoken Language Processing
  • Deng and O’Shaughnessy, Speech Processing
  • EE516 Reader, at Professional Copy ‘n Print, 4200 University Way
• Grading:
  • Projects: 50%
  • Final Exam: 30%
  • Homework: 20%
• Projects:
  • Small team or individual
  • Teams are self-forming
  • Presentation times TBD
  • Read ahead & pick an area!!!
  • Talk to relevant instructor
  • Suggest deciding no later than 4/30
• Office Hours at end of class and by appointment
• Please sign in on email list!