Mandarin-English Information (MEI)
Johns Hopkins University Summer Workshop 2000
Presented at the TDT-3 Workshop, February 28, 2000
Helen Meng, The Chinese University of Hong Kong
Sanjeev Khudanpur, Johns Hopkins University
Douglas W. Oard, University of Maryland
Hsin-Min Wang, Academia Sinica, Taiwan
Outline
• Background
• The MEI Project
• Multiscale Retrieval
• Multiscale Translation
• Using the TDT-3 collection
• Schedule
Motivation
• Emerging speech retrieval applications
  • E.g., http://speechbot.research.compaq.com
• Increasing need for translingual audio search
  • 1896 Internet-accessible radio & TV stations
  • 529 of these (28%) are not in English
Source: www.real.com
The Big Picture
[Figure: diagram with the labels MEI, Translingual Audio Search, Translingual Audio Browsing, Speech to Speech Translation, Select, Examine, English Query, English Audio]
Related Work
• TREC Spoken Document Retrieval
  • Close coupling of recognition and retrieval
• TREC Cross-Language Retrieval
  • Close coupling of translation and retrieval
• TDT-3
  • Coupling recognition, translation and retrieval
  • Using baseline recognizer transcripts
The MEI Project
• Closely coupling recognition and translation
  • For the purpose of retrieval
• English text queries, Mandarin news audio
• Specific research issues:
  • Multi-scale retrieval
  • Multi-scale translation
Multi-scale Analysis of Mandarin
[Figure: one Mandarin syllable segmented at several scales. Segments shown: /j/ /i/ /a/ /ng/; /ji/ /ang/; /j/ /iang/. Scale labels: Preme/Toneme, Preme/Core Final, Initial/Final]
Multi-scale Retrieval
• Subword-scale
  • Syllable lattice matching [Chen, Wang & Lee, 2000]
  • Overlapping syllable n-grams [Meng et al., 1999]
  • Skipped syllable pairs [Chen, Wang & Lee, 2000]
  • Syllable confusion matrix [Meng et al., 1999]
• Word-scale
  • Structured queries [Pirkola, 1998]
• Multi-scale
  • Unified retrieval using a merged feature set
  • Scale-optimized retrieval with result-set merging
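As an illustration of two of the subword features listed above, the following is a minimal sketch of how overlapping syllable n-grams and skipped syllable pairs might be generated from a syllable sequence. The function names, the `max_skip` parameter, and the exact definition of a "skipped pair" are assumptions made for this example; the cited papers may define these features differently.

```python
from typing import List, Sequence, Tuple


def syllable_ngrams(syllables: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all overlapping n-grams of adjacent syllables, in order."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def skipped_syllable_pairs(syllables: Sequence[str], max_skip: int = 1) -> List[Tuple[str, str]]:
    """Return pairs of syllables separated by 1..max_skip intervening syllables.

    Assumption: a "skipped pair" is (s[i], s[i + skip + 1]); this is one
    plausible reading, not necessarily the definition in the cited work.
    """
    pairs = []
    for skip in range(1, max_skip + 1):
        gap = skip + 1
        for i in range(len(syllables) - gap):
            pairs.append((syllables[i], syllables[i + gap]))
    return pairs


if __name__ == "__main__":
    # Tonal syllables taken from the Kosovo example elsewhere in the deck.
    syllables = ["ke1", "sou3", "wo4"]
    print(syllable_ngrams(syllables, 2))
    print(skipped_syllable_pairs(syllables, max_skip=1))
```

Skipped pairs give a match a chance to survive when the middle syllable is misrecognized, which is presumably why they appear alongside plain n-grams.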
Why Multi-scale Retrieval?
• Word-based retrieval exploits lexical knowledge
  • Enhances precision
• Subword units achieve complete phonological coverage
  • Enhances recall
• Combination of evidence may beat either alone
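One simple way to combine evidence from a word-scale run and a subword-scale run is to normalize each run's scores and interpolate them. The sketch below assumes min-max normalization and a single hypothetical `weight` parameter; the deck does not say which merging method the project uses, so this is only one possible instance of result-set merging.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_results(
    word_scores: Dict[str, float],
    subword_scores: Dict[str, float],
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Linearly interpolate normalized word-scale and subword-scale scores.

    `weight` is the share given to the word-scale run (an assumed tuning knob).
    Stories missing from one run contribute 0.0 from that run.
    """
    w = min_max_normalize(word_scores)
    s = min_max_normalize(subword_scores)
    combined = {
        doc: weight * w.get(doc, 0.0) + (1.0 - weight) * s.get(doc, 0.0)
        for doc in set(w) | set(s)
    }
    # Sort by descending score, breaking ties by story id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    word_run = {"story1": 3.0, "story2": 1.0}
    subword_run = {"story1": 3.0, "story2": 4.0, "story3": 2.0}
    for doc, score in merge_results(word_run, subword_run):
        print(doc, round(score, 3))
```

Here `story3` is found only by the subword run, so it still appears in the merged list (the recall benefit), while `story1` ranks first because both runs support it.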
Multi-scale Translation
• Word-scale
  • Dictionary-based [Levow & Oard, 2000]
  • Parallel corpora [Nie, 1999]
  • Comparable corpora [Fung, 1998]
• Subword-scale
  • Cross-language phonetic map [Knight & Graehl, 1997]
    • /bei2 ai4 er3 lan2/
    • Kosovo (/ke1-sou3-wo4/, /ke1-sou3-fo2/, /ke1-sou3-fu1/, /ke1-sou3-fu2/)
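The Kosovo example shows that one name can have several Mandarin transliterations that differ in a single syllable. The sketch below is not the cross-language phonetic map of the cited work; it swaps in a plain syllable-level Levenshtein distance to show how close the listed variants are to one another. The helper names and the hyphen-or-space parsing convention are assumptions made for this example.

```python
from typing import List, Sequence


def parse_syllables(transcription: str) -> List[str]:
    """Split a string like '/ke1-sou3-wo4/' into tonal syllables.

    Assumption: syllables are separated by hyphens or spaces, as on the slide.
    """
    return transcription.strip("/").replace("-", " ").split()


def syllable_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where the unit of comparison is a whole syllable."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    variants = ["/ke1-sou3-wo4/", "/ke1-sou3-fo2/", "/ke1-sou3-fu1/", "/ke1-sou3-fu2/"]
    reference = parse_syllables(variants[0])
    for v in variants:
        print(v, syllable_edit_distance(reference, parse_syllables(v)))
```

A retrieval system could accept candidates within a small syllable distance of a mapped query term, although treating tone and syllable differences as equally costly is a simplification.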
Using the TDT-3 Collection
• English queries formed from topic descriptions
  • 2-4 words (simulated Web search)
  • Full topic description (simulated routing profile)
• Mandarin broadcast news audio (121 hours)
  • Story-boundary-known condition (4624 stories)
  • Baseline recognizer transcripts provide words
Schedule
[Timeline figure. Axis: Dec, Feb, Apr, Jun, Aug. Event labels as extracted: Six Weeks: Summer Workshop Planning Meeting; Second MEI Team Planning Meeting; First MEI Team Planning Meeting]
Things We Need
• Ideas
  • To sharpen our focus
• Connections
  • To build a community of interest
• Resources
  • To build on what others have done