Mandarin–English Information (MEI): investigating translingual speech retrieval

Mandarin–English Information (MEI): investigatingtranslingual speech retrieval Helen M. Meng a,*, Berlin Chen b, Sanjeev Khudanpur c, Gina-Anne Levow d, Wai-Kit Lo a, Douglas Oard d, Patrick Schone e, Karen Tang f, Hsin-min Wang b, Jianqiang Wang d a Department of Systems Engineering and Engineering Management, Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Shatin, NT, Hong Kong b Academia Sinica, Taiwan c Johns Hopkins University, USA d University of Maryland at College Park, USA e Department of Defense, USA f Princeton University, USA Computer Speech and Language(2004) Received 30 May 2001; received in revised form 8 August 2003; accepted 15 September 2003

Outline • Abstract • Introduction • Previous work • The TDT collection • The multi-scale paradigm • Multi-scale query formulation • Query term select • Query translation • Named entity transliteraion • Multi-scale query construction • Multi-scale audio indexing • Multi-scale retrieval • Experiments • Evaluation criterion • Tuning with the development test set • Experimental results • Phrase-based translation • Multi-scale retrieval • Subword translation • Conclusions

Abstract • First English–Chinese CL-SDR system • The Mandarin–English Information (MEI) project investigated the problem of cross-language spoken document retrieval (CL-SDR). • cross-language / cross-media retrieval task • English news story (text) as query, and retrieves relevant Chinese broadcast news stories (audio) from the document collection. • English queries translated into Chinese • by dictionary-based approach • phrase-based translation with word-by-word translation • Untranslatable named entities are transliterated by a novel subword translation technique

Introduction • multi-scale paradigm for English–Chinese CL-SDR • Phrases ,words ,subwords (Chinese characters and syllables) • problems related to English–Chinese CL-SDR • Multiplicity in translation: • multiple translation alternatives • no translations (e.g., for proper names) • Ambiguity in Chinese homophones: (義異同音) • word-level confusions : ex.,富庶,負數,複數,覆述 • Ambiguity in Chinese word tokenization: • word-level mismatches between queries and documents • Example: 這一晚會如常舉行 , 這一晚會如常舉行 , 這一晚會如常舉行 . • Speech recognition errors: • OOV (out of vocabulary) words • acoustic confusions among in-vocabulary words

Previous Work • Retrieval of German spoken news documents with French text (Sheridan and Ballerini, 1996) • Retrieval of Serbo-Croatian news broadcasts with English text (Hauptmann et al., 1998) • CL-SDR integrates speech recognition, machine translation and information retrieval technologies to accomplish the task. • CLIR to cross language barrier for retrieval: query translation, document translation, interlingual techniques and cognate matching • Buckley et al. (1997) performed English–French CLIR • spoken document retrieval, a popular approach (Garofolo et al., 2000) • (OOV) problem in recognition (Woodland et al., 2000), • indexed with phoneme n-grams (Ng, 2000; Wechsler and Schauble, 1995)

The TDT collection • Using the Topic Detection and Tracking (TDT) collection for this work. • TDT is a DARPA sponsored program where participating sites tackle tasks such as • identifying the first time a story is reported on a given topic • grouping similar topics from audio and textual streams of newswire data. • In recent years, TDT has focused primarily on performing such tasks in both English and Mandarin Chinese. • Most of the Mandarin audio data are transcribed by the Dragon automatic speech recognition system.

The TDT collection • Using the TDT-2 corpus as our development test set, and TDT-3 as our evaluation test set.

The multi-scale paradigm- Multi-scale query formulation (1/6) • Query term selection: • In the MEI task : • The query consists of an entire English news story. • Queries tend to be long, and not all query terms are important for retrieval. • First excluded all stopwords. • List all of the terms in the exemplar • For multi-word used a test in a manner to select these terms. • N :number of documents • r is relevant, n is non-relevant documents • + means the current terms appears in the document, and - indicates that it does not.

The multi-scale paradigm- Multi-scale query formulation (2/6) • Query translation : • Bilingual term list (BTL) • The list is formed by combining LDCs English–Chinese bilingual term list with translations extracted from the CETA (Chinese-English Translation Assistance) dictionaries. • BTL covers 200,000 distinct English terms. • Among these English terms, some have multiple translations and there is a total of approximately 400,000 English-to-Chinese translation pairs. • This approach preserves term frequency information in the query. • Translation proceeds on the phrasal scale, word scale, as well as the subword scale.

The multi-scale paradigm- Multi-scale query formulation (3/6) • Named entity transliteration : • BTL will inevitably be untranslatable terms (names of people, places, locations and organizations) • Named entities need to be salvaged since they tend to be important for retrieval. • For example ‘‘Kosovo’’ : • 科索沃 /ke-suo-wo/ • 科索佛 /ke-suo-fo/ • 科索夫 /ke-suo-fu/ • 科索伏 /ke-suo-fu/ • 柯索佛 /ke-suo-fo/ • subword translation • We applied our knowledge in acoustic–phonetics and phonology related to both English and Chinese. • Also applied machine learning techniques and other techniques used in speech recognition.

The multi-scale paradigm- Multi-scale query formulation (4/6) Query exemplars that are tagged by the BBN Identifinder system, and those absent from our BTL are processed by our transliteration system. Chinese names are often represented in English by “pinyin” transcription. Using letter-to-sound rules acquired by the transformation-based error-driving learning technique. Hand-designing a set of cross-lingual phonological rules that partially transforms an English pronunciation into a Chinese pronunciation.

The multi-scale paradigm- Multi-scale query formulation (5/6) Hypothesized syllable sequence is generated from the English name spelling by our transliteration procedure. Reference syllable sequence is obtained by pronunciation lexicon lookup based on the Chinese name characters.

The multi-scale paradigm- Multi-scale query formulation (6/6) • Multi-scale query construction • Query construction process: • bag of English query terms • Multi-scale query construction • translated phrases, • named entities, • individual translated words • translated syllables.

The multi-scale paradigm- Multi-scale audio indexing • The Dragon large-vocabulary continuous speech recognizer (Zhan et al., 1999) provided Chinese word transcriptions for our Mandarin audio collections (TDT-2 and TDT-3). • For a fraction of the TDT-2 test set (about 23 h) • word error rates : 18.0% • character error rates : 12.1% • syllable error rates : 7.9% • For a fraction of the TDT-3 test set (about 27 h) • word error rates :19.1% • character error rates :13.0% • syllable error rates : 8.6%

The multi-scale paradigm- Multi-scale retrieval (1/2) • InQuery retrieval engine developed by the University of Massachusetts (Callan et al., 1992). • Using a probabilisticbelief network as the main data structure behind its query language. • Key feature is the ‘‘balanced query’’ mechanism. • Suppose English query E1; E2; . . . ; En • E1 has three possible Chinese translations, C11, C12 and C13. • With balanced translation,belief value for E1 in the Chinese document = the mean of the belief values for C11, C12 and C13 in that document. • Repeating the process for additional terms produces a set of belief values foreach English query termwith respect toevery Chinese document.

The multi-scale paradigm- Multi-scale retrieval (2/2) • Balanced translation of the query would be represented as #sum(#sum(C11;C12; C13)#sum(C21;C22)...#sum(Cn1; Cn2;Cn3)) • The outer #sum operator being the typical way of combining belief values across query terms. • The inner #sum operators implementing balanced translation. • Multi-scale retrieval proceeds : • each scale individually • each scale produces its own ranked list of documents • Word scale produces • characters scale produces • Syllables scale produces

Experiments • Evaluation Criterion : • Lis the number of topics • Miis a sample of the exemplars for topic i • Niis number of relevant documents for topic i • rankijkis the rank of the kth relevant document retrieved for exemplar j of topic i

Experiments Based on a paired two-tailed t test on the means across exemplars of each topic, with p < 0.05.

Experiments

Conclusion • English news story (text) as query, and retrieves relevant Chinese broadcast news stories (audio) from the document collection.This is a cross-language and cross-media retrieval task. • The experimented with the TDT collections, which have English newswire from New York Times and Associated Press, and Mandarin Chinese radio news broadcasts from Voice of America. The radio news is transcribed by Dragons large-vocabulary continuous speech recognizer. • Multi-scale approach is promising and applicable to the English–Chinese CL-SDR task. It should also be possible to leverage off of our experience in a translingual setting, which involves SDR across other language pairs.

Mandarin–English Information (MEI): investigating translingual speech retrieval

Mandarin–English Information (MEI): investigating translingual speech retrieval

Presentation Transcript

INFO624 -- Week 9 Effective Information Retrieval

We the News Investigating Blog Punditry

Information Retrieval to Knowledge Retrieval , one more step

Why Inner Speech?

Information Retrieval

Text Information Retrieval and Applications – Advanced Topics

Memory

Parts-of-speech English-German

Text-retrieval Systems

Mandarin II 中文 II E Block Course Objectives Students will:

Mandarin II 中文 II E Block Course Objectives Students will:

Information Retrieval and Recommendation Techniques

Feature Extraction for speech applications

Information Retrieval and Search Engines

Information Retrieval and Search Engines

Conditional Random Fields for Automatic Speech Recognition

Modeling the Internet and the Web: Text Analysis

Text information storage and retrieval and the CDS/ISIS program