Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words. Gina-Anne Levow, University of Chicago, July 7, 2003
Roadmap • Goals of expansion • Expansion points in CL-SDR • Pre- and Post-translation document expansion experiments • Task, query & document processing • Expansion methodology • Results • Discussion & Conclusions
Why Expansion? • Recover terms that could have appeared • Compensate for difference in term choice • Author concepts vs searcher information need • Compensate for noisy processing • ASR transcription errors • Misrecognitions, deletions, missegmentations • Translation errors • Gaps, missegmentations • Context disambiguates
Expansion Opportunities • Query: • (Ballesteros & Croft 1996; McNamee & Mayfield 2002) • Before, after translation; both • Different enhancements to precision/recall • Pre-translation key: provides something to translate • European languages • Document: • Before, after translation; both • Developed for monolingual SDR (Singhal 1999) • CLIR (+SDR) (Levow & Oard 2000) • Post-translation promising
Experimental Configuration: Basic Task • Variant of Topic Detection and Tracking (TDT) • English queries to Mandarin documents • Query-by-example • English newswire or broadcast news stories • Mandarin audio broadcast news documents • Automatically transcribed by Dragon ASR system • Modifications: • Retrospective retrieval • Evaluation metric: Mean Average Precision
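The evaluation metric named on this slide, Mean Average Precision, can be sketched as follows. This is a minimal illustration of the standard definition, not the TDT evaluation software; the function names and the input layout (one ranked list plus one set of relevant document ids per query) are assumptions made for the example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query.

    Precision is measured at the rank of each relevant document that is
    retrieved, and the sum is divided by the total number of relevant
    documents, so relevant documents that are never retrieved count as zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over queries.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a query whose two relevant documents appear at ranks 1 and 3, average precision is (1/1 + 2/3) / 2.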
Experimental Configuration: Query and Document Processing • Query: • Select top 180 positively correlated terms in 4 exemplars • Based on χ² test • 996 prior documents assumed not relevant • Document: • Dictionary-based word-for-word translation • Segmentation: NMSU ch_seg • Translation resource: • Merged bilingual term list: CETA & LDC term list • Translation ranking: • Target language unigram frequency: single words, multi-word
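The query-formation step above can be sketched roughly as follows. The slide does not say how the contingency table was built, so this sketch assumes document-level counts in a 2x2 table (exemplar vs. background, term present vs. absent) and the uncorrected chi-squared statistic; the function names and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    Rows: exemplar docs, background docs. Columns: term present, term absent.
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(exemplars, background, k=180):
    """Return up to k terms positively correlated with the exemplars.

    `exemplars` and `background` are lists of token lists; in the setup on
    the slide these would be the 4 exemplar stories and the 996 prior
    documents assumed not relevant.
    """
    exemplar_df = Counter(t for doc in exemplars for t in set(doc))
    background_df = Counter(t for doc in background for t in set(doc))
    scored = []
    for term, a in exemplar_df.items():
        b = len(exemplars) - a       # exemplar docs lacking the term
        c = background_df[term]      # background docs containing the term
        d = len(background) - c      # background docs lacking the term
        if a * d > b * c:            # keep positive correlations only
            scored.append((chi_squared_2x2(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

A term that occurs in every exemplar and every background document has no positive correlation and is dropped, which is the behaviour one would want for function words.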
Document Expansion: Details • Side collections: • Mandarin: TDT-2 Xinhua, Zaobao newswire • English: TDT-2 New York Times, AP news • Expansion term selection • Top 5 documents • Sort candidate terms by idf • Exclude terms in only one document • Add one term instance per document • Add until document doubled in length
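The term-selection rules on this slide can be sketched as below. Retrieval against the side collection is assumed to have already happened, so the function takes the ranked side-collection documents and an idf table as inputs. Reading "one term instance per document" as one added copy for each top-ranked document containing the term is an interpretation, not something the slide states; all names are hypothetical.

```python
from collections import Counter


def expand_document(doc_tokens, ranked_side_docs, idf, top_n=5):
    """Append enriching terms from the top-ranked side-collection documents.

    doc_tokens:       tokens of the document being expanded
    ranked_side_docs: token lists of side-collection documents, best first
    idf:              mapping term -> inverse document frequency
    """
    top_docs = ranked_side_docs[:top_n]
    # In how many of the top documents does each candidate term occur?
    support = Counter(t for d in top_docs for t in set(d))
    # Exclude terms that occur in only one of the top documents.
    candidates = [t for t, n in support.items() if n >= 2]
    # Highest idf first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda t: (-idf.get(t, 0.0), t))

    budget = len(doc_tokens)  # stop once the document has doubled in length
    expanded = list(doc_tokens)
    for term in candidates:
        if budget <= 0:
            break
        # Assumed reading: one instance per supporting document.
        copies = min(support[term], budget)
        expanded.extend([term] * copies)
        budget -= copies
    return expanded
```

The same routine applies to either expansion point: run it on Mandarin tokens with a Mandarin side collection for pre-translation expansion, or on translated English tokens with an English side collection for post-translation expansion.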
Results • Post-translation significantly outperforms pre-translation expansion
Discussion: Post-translation Effectiveness • Post-translation document expansion significantly improves retrieval effectiveness • Little improvement from pre-translation expansion • Either alone or in conjunction • Expansion introduces key enriching terms • Named entities, alternate forms • E.g. Tariq Aziz, Saddam, Yeltsin, etc. • Available in English (post-translation) collection
Discussion: Pre-translation Limitations • Expansion terms do not exist • Segmentation & transcription rely on term lists • Named entities frequently absent • Cannot extract terms from Mandarin newswire • Expansion terms cannot be translated • Key terms (e.g. named entities) absent from bilingual term lists • All examples on previous slide absent
Discussion: Contrasts • Contradict prior query expansion results • Re: Primacy of pre-translation expansion • Explanation: • Prior languages – mostly European • Common writing system, white-space delimited • Pre-translation expansion produces • -> translatable terms + (possibly) untranslatable cognates • Cognates still match, even without translation • Current experiment: English-Mandarin • Untranslatable cognates useless • Different orthography • Terms not identified - missegmentation
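The cognate argument above can be made concrete with a toy word-for-word translator that passes unknown tokens through unchanged. The term lists and documents here are invented for illustration and are not drawn from the CETA or LDC resources; each entry is assumed to be pre-ranked so that the first translation is the preferred one.

```python
def translate_word_for_word(tokens, term_list):
    """Replace each token by its top-ranked translation.

    Tokens missing from the bilingual term list pass through unchanged,
    which is what lets an untranslated cognate survive into the output.
    """
    return [term_list[t][0] if t in term_list else t for t in tokens]


def matching_terms(query_tokens, doc_tokens):
    """Terms shared by the query and the (translated) document."""
    return set(query_tokens) & set(doc_tokens)


# Hypothetical toy data: neither term list covers the named entity.
query = ["president", "Yeltsin"]

spanish_terms = {"presidente": ["president"]}
spanish_doc = ["presidente", "Yeltsin"]   # same script as the query

mandarin_terms = {"总统": ["president"]}
mandarin_doc = ["总统", "叶利钦"]           # different orthography

spanish_hits = matching_terms(
    query, translate_word_for_word(spanish_doc, spanish_terms))
mandarin_hits = matching_terms(
    query, translate_word_for_word(mandarin_doc, mandarin_terms))
```

In the same-script case the untranslated name still matches the query, so `spanish_hits` contains both terms; in the Mandarin case the passed-through name cannot match, so only the dictionary-covered term survives in `mandarin_hits`.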
Conclusion • Document expansion improves effectiveness • For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports different terms • Post-translation expansion most effective • Translated terms provide context for retrieval • Correct translations/transcriptions coherent; others noise • Enriching terms often absent from term lists • Segmentation, transcription, translation all rely on lists • Expansion in indexing language bypasses barriers • Crucial in languages with segmentation issues and different forms