Gina-Anne Levow University of Chicago July 7, 2003

Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words


Presentation Transcript


  1. Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words • Gina-Anne Levow • University of Chicago • July 7, 2003

  2. Roadmap • Goals of expansion • Expansion points in CL-SDR • Pre- and Post-translation document expansion experiments • Task, query & document processing • Expansion methodology • Results • Discussion & Conclusions

  3. Why Expansion? • Recover terms that could have appeared • Compensate for difference in term choice • Author concepts vs searcher information need • Compensate for noisy processing • ASR transcription errors • Misrecognitions, deletions, missegmentations • Translation errors • Gaps, missegmentations • Context disambiguates

  4. Expansion Opportunities • Query: • (Ballesteros & Croft 1996; McNamee & Mayfield 2002) • Before, after translation; both • Different enhancements to precision/recall • Pre-translation key: provides something to translate • European languages • Document: • Before, after translation; both • Developed for monolingual SDR (Singhal 1999) • CLIR (+SDR) (Levow & Oard 2000) • Post-translation promising

  5. Experimental Configuration: Basic Task • Variant of Topic Detection and Tracking (TDT) • English queries to Mandarin documents • Query-by-example • English newswire or broadcast news stories • Mandarin audio broadcast news documents • Automatically transcribed by Dragon ASR system • Modifications: • Retrospective retrieval • Evaluation metric: Mean Average Precision
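
The evaluation metric named on this slide, Mean Average Precision, can be illustrated with a minimal sketch. The function names and the toy document IDs are assumptions made for illustration; the slides do not show the evaluation code that was actually used.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: mean of precision at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example with hypothetical document IDs.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_average_precision(runs))
```

Averaging per query before averaging across queries keeps topics with many relevant stories from dominating the score.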

  6. Experimental Configuration: Query and Document Processing • Query: • Select top 180 positively correlated terms in 4 exemplars • Based on χ² test • 996 prior documents assumed not relevant • Document: • Dictionary-based word-for-word translation • Segmentation: NMSU ch_seg • Translation resource: • Merged bilingual term list: CETA & LDC term list • Translation ranking: • Target language unigram frequency: single words, multi-word
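
The query construction step on this slide (top positively correlated terms from the exemplars, ranked by a χ² test against prior documents assumed not relevant) might look roughly like the sketch below. It assumes a standard 2x2 χ² statistic over document-level term presence; the slides do not say exactly how the statistic was computed, and the function names and toy data are hypothetical.

```python
def chi_square(a, b, c, d):
    """2x2 chi-square statistic.

    a = exemplars containing the term, b = exemplars without it,
    c = background docs containing the term, d = background docs without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(exemplars, background, k=180):
    """exemplars, background: lists of token lists. Returns up to k terms."""
    ex_sets = [set(doc) for doc in exemplars]
    bg_sets = [set(doc) for doc in background]
    scored = []
    for term in set().union(*ex_sets):
        a = sum(term in s for s in ex_sets)
        c = sum(term in s for s in bg_sets)
        b = len(ex_sets) - a
        d = len(bg_sets) - c
        # Keep only terms positively correlated with the exemplars.
        if a * d > b * c:
            scored.append((chi_square(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Toy data: 2 exemplars and 3 background documents (hypothetical).
exemplars = [["iraq", "talks", "the"], ["iraq", "the"]]
background = [["the", "weather"], ["the", "sports"], ["the", "talks"]]
print(select_query_terms(exemplars, background))
```

In the configuration on the slide, `exemplars` would hold the 4 example stories, `background` the 996 prior documents, and `k` would be 180.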

  7. Experimental Configuration: Document Expansion

  8. Document Expansion: Details • Side collections: • Mandarin: TDT-2 Xinhua, Zaobao newswire • English: TDT-2 New York Times, AP news • Expansion term selection • Top 5 documents • Sort candidate terms by idf • Exclude terms in only one document • Add one term instance per document • Add until document doubled in length
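
The term selection steps listed on this slide can be sketched as follows. This is one reading of the bullets, not the original implementation: it assumes the top-ranked side-collection documents have already been retrieved using the document as a query, and it interprets "add one term instance per document" as adding one copy of a term for each top document that contains it. All names are hypothetical.

```python
import math


def expand_document(doc_tokens, top_docs, doc_freq, n_docs):
    """Expand a document with terms from its top-ranked side-collection matches.

    doc_tokens: tokens of the document being expanded
    top_docs:   token lists of the top-ranked documents (e.g. the top 5)
    doc_freq:   term -> document frequency in the side collection
    n_docs:     number of documents in the side collection
    """
    # Count how many of the top documents contain each term.
    counts = {}
    for doc in top_docs:
        for term in set(doc):
            counts[term] = counts.get(term, 0) + 1

    # Exclude terms that occur in only one of the top documents.
    candidates = [t for t, c in counts.items() if c >= 2]

    def idf(term):
        return math.log(n_docs / doc_freq.get(term, 1))

    # Sort candidates by idf, highest first (ties alphabetical).
    candidates.sort(key=lambda t: (-idf(t), t))

    budget = len(doc_tokens)  # stop once the document has doubled in length
    expanded = list(doc_tokens)
    for term in candidates:
        for _ in range(counts[term]):  # one instance per containing document
            if budget == 0:
                return expanded
            expanded.append(term)
            budget -= 1
    return expanded


# Toy example (hypothetical tokens and frequencies).
doc = ["a", "b", "c"]
top_docs = [["saddam", "iraq", "the"], ["saddam", "the", "talks"], ["iraq", "the"]]
doc_freq = {"saddam": 2, "iraq": 10, "the": 100, "talks": 5}
print(expand_document(doc, top_docs, doc_freq, n_docs=100))
```

Sorting by idf pushes rare, topical terms such as names ahead of function words, so the length budget tends to be spent on the enriching terms rather than on common vocabulary.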

  9. Results • Post-translation significantly outperforms pre-translation expansion

  10. Discussion: Post-translation Effectiveness • Post-translation document expansion significantly improves retrieval effectiveness • Little improvement from pre-translation expansion • Either alone or in conjunction • Expansion introduces key enriching terms • Named entities, alternate forms • E.g. Tariq Aziz, Saddam, Yeltsin, etc. • Available in English (post-translation) collection

  11. Discussion: Pre-translation Limitations • Expansion terms do not exist • Segmentation & transcription rely on term lists • Named entities frequently absent • Cannot extract terms from Mandarin newswire • Expansion terms cannot be translated • Key terms (e.g. named entities) absent from bilingual term lists • All examples on previous slide absent
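
The second limitation above follows directly from dictionary-based word-for-word translation: a source term with no entry in the bilingual term list produces nothing on the target side. The sketch below illustrates this under simplifying assumptions: a single best translation is kept per token, chosen by target-language unigram frequency, and the term list and frequencies are toy data rather than the CETA and LDC resources.

```python
def translate_word_for_word(tokens, term_list, unigram_freq):
    """Translate each token independently using a bilingual term list.

    Returns (translated, untranslated). Tokens missing from the term list
    cannot be translated and are reported separately.
    """
    translated, untranslated = [], []
    for token in tokens:
        candidates = term_list.get(token)
        if not candidates:
            untranslated.append(token)
            continue
        # Rank candidate translations by target-language unigram frequency.
        best = max(candidates, key=lambda w: unigram_freq.get(w, 0))
        translated.append(best)
    return translated, untranslated


# Toy term list and frequencies (hypothetical). The name is deliberately
# missing from the term list, as named entities often are.
term_list = {"总统": ["president"], "会谈": ["talks", "negotiation"]}
unigram_freq = {"president": 50, "talks": 30, "negotiation": 5}
tokens = ["总统", "阿齐兹", "会谈"]
print(translate_word_for_word(tokens, term_list, unigram_freq))
```

Even if pre-translation expansion did add the name to the Mandarin document, it would land in the untranslated list and never reach the English index.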

  12. Discussion: Contrasts • Contradict prior query expansion results • Re: Primacy of pre-translation expansion • Explanation: • Prior languages – mostly European • Common writing system, white-space delimited • Pre-translation expansion produces • -> translatable terms + (possibly) untranslatable cognates • Cognates still match, even without translation • Current experiment: English-Mandarin • Untranslatable cognates useless • Different orthography • Terms not identified - missegmentation

  13. Conclusion • Document expansion improves effectiveness • For CL-SDR case, recovers terms lost by missegmentation, mistranscription, or mistranslation; supports different terms • Post-translation expansion most effective • Translated terms provide context for retrieval • Correct translations/transcriptions coherent; others noise • Enriching terms often absent from term lists • Segmentation, transcription, translation all rely on lists • Expansion in indexing language bypasses barriers • Crucial in languages with segmentation issues and different forms
