Combining Query Translation and Document Translation in Cross-Language Retrieval. Aitao Chen & Fredric C. Gey* School of Information Management and Systems *UC Data Archive & Technical Assistance University of California at Berkeley. CLEF 2003 Workshop: 21-22 August, 2003, Trondheim, Norway.
Talk Outline • Development of new resources • Fast approximate document translation • Combining query translation and document translation • Conclusions
New Resources • Finnish and Swedish stoplists • Base Finnish and Swedish lexicons for decompounding • Statistical translation lexicons derived from parallel texts • Finnish and Swedish statistical stemmers automatically generated from parallel texts • English spelling normalizer
Development of Swedish Stoplist (by someone who doesn’t know Swedish)
Look for Swedish words whose English translations are English stopwords in Swedish textbooks (e.g., grammar) written in English.
• en park (a park)
• ett piano (a piano)
• Jag vet inte mycket om honom (I don’t know much about him)
• efter skolan (after school)
• Hans och Greta (Hans and Greta)
(Source: Swedish: A Comprehensive Grammar by P. Holmes & I. Hinchliffe)
Development of Swedish Base Lexicon
A base lexicon should contain all and only the words, and their variants, that are not compounds.
• Compile a list of Swedish words (e.g., from the Swedish document collection).
• Remove the words that are 4 or fewer characters long.
• Remove the long words that can be decomposed into short words in the initial wordlist.
Example wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus
Compounds removed because they can be decomposed: dator + animation, dator + grafik, dator + teknologi, dator + virus
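The three steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_into` and `build_base_lexicon` are hypothetical, the decomposition uses a simple dynamic program, and linking elements between compound parts are ignored. It is not the authors' implementation.

```python
def split_into(word, vocabulary):
    """Return one split of `word` into two or more vocabulary words, or None."""
    n = len(word)
    best = [None] * (n + 1)  # best[i] = fewest parts that cover word[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            part = word[j:i]
            # A word may not count as a "decomposition" of itself.
            if best[j] is not None and part in vocabulary and part != word:
                candidate = best[j] + [part]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def build_base_lexicon(words, min_length=5):
    """Drop words of 4 or fewer characters, then drop decomposable compounds."""
    candidates = {w for w in words if len(w) >= min_length}
    return sorted(w for w in candidates if split_into(w, candidates) is None)


wordlist = ["en", "ett", "animation", "animationen", "dator", "datoranimation",
            "datorgrafik", "datorteknologi", "datorvirus", "grafik",
            "teknologi", "virus"]
base_lexicon = build_base_lexicon(wordlist)
print(base_lexicon)
```

On the slide's example wordlist, the four dator- compounds are removed and the variant animationen is kept, because "en" was already filtered out as a short word.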
Development of Statistical Translation Lexicons from Parallel Texts
[Figure: parallel texts (EU Official Journal, PDF) → text conversion → paragraph & sentence alignment → statistical association / statistical MT toolkit → statistical translation lexicons]
Language pairs:
• Italian → Spanish
• German → Italian
• Finnish → German
• English → Dutch
• English → Finnish
• English → Swedish
• Dutch → English
• Finnish → English
• Swedish → English
Development of Statistical Stemmers
[Figure: Swedish words are linked to their statistical English translations and grouped into clusters.]
• “computer” cluster: dator, datorn, datorer, datorersom, datornät, datornernä, informatik (English translations: computer, computers)
• “diamond” cluster: diamant, diamanten, diamanterna, diamanter (English translations: diamond, diamonds)
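One way to read this slide is that source-language words are grouped by the English word they most likely translate to. The sketch below illustrates that idea only. The word-to-translation assignments are illustrative, the plural stripping in `normalize_english` is a crude stand-in, and choosing the shortest prefix in a cluster as the stem is an assumption, not the authors' stated method.

```python
from collections import defaultdict


def normalize_english(word):
    """Crude English normalization (assumption): strip a final plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_stemmer(translations):
    """Cluster source words by normalized English translation and pick stems."""
    clusters = defaultdict(list)
    for source_word, english in translations.items():
        clusters[normalize_english(english)].append(source_word)
    stem_of = {}
    for members in clusters.values():
        for word in members:
            # Shortest cluster member that is a prefix of `word`
            # (the word itself always qualifies).
            prefixes = [m for m in members if word.startswith(m)]
            stem_of[word] = min(prefixes, key=len)
    return dict(clusters), stem_of


# Hypothetical most-likely translations taken from a statistical lexicon.
translations = {
    "dator": "computer", "datorn": "computer", "datorer": "computers",
    "informatik": "computer",
    "diamant": "diamond", "diamanten": "diamond",
    "diamanter": "diamonds", "diamanterna": "diamonds",
}
clusters, stem_of = build_stemmer(translations)
print(stem_of["datorer"], stem_of["diamanterna"])
```

Words such as informatik land in the "computer" cluster but share no prefix with the other members, so this sketch leaves them unstemmed.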
Fast Approximate Document Translation
[Figure, four numbered steps: Spanish documents → list of Spanish words → Spanish-English MT → list of English words → bilingual Spanish-English wordlist → word-by-word translation → English translations]
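The idea on this slide is to translate each distinct word once with an MT system and then translate the documents by table lookup. A minimal sketch follows. The `toy_mt` function is a hypothetical stand-in for the Spanish-English MT system, and the lowercasing and whitespace tokenization are simplifying assumptions.

```python
def build_wordlist(documents, translate_word):
    """Translate every distinct word of the collection once."""
    vocabulary = {token for doc in documents for token in doc.lower().split()}
    return {word: translate_word(word) for word in vocabulary}


def translate_document(document, wordlist):
    """Word-by-word translation; words missing from the wordlist are kept."""
    return " ".join(wordlist.get(tok, tok) for tok in document.lower().split())


# Hypothetical stand-in for the MT system: a tiny lookup table.
TOY_TRANSLATIONS = {"dietas": "diets", "para": "for", "celíacos": "celiacs"}


def toy_mt(word):
    return TOY_TRANSLATIONS.get(word, word)


spanish_docs = ["Dietas para celíacos", "dietas"]
wordlist = build_wordlist(spanish_docs, toy_mt)
english_docs = [translate_document(doc, wordlist) for doc in spanish_docs]
print(english_docs)
```

The MT system is called once per distinct word rather than once per document, which is why the approach is fast; the cost is that word order and context are ignored.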
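Query Translation-based Multilingual Retrieval
[Figure: the English query is translated by L&H into French, German and Spanish. Each query (English, French, German, Spanish) is run by IR against the documents in the same language (English docs, French docs, German docs, Spanish docs), and a merger produces a combined ranked list of documents.]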
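The slide does not say how the merger combines the per-language results. As one simple possibility, the sketch below merges the lists by raw retrieval score; this strategy, the function name `merge_ranked_lists`, and the document IDs and scores are all assumptions made for illustration.

```python
def merge_ranked_lists(ranked_lists, limit=1000):
    """Merge per-language (doc_id, score) lists into one list by raw score."""
    merged = [pair for ranked in ranked_lists.values() for pair in ranked]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:limit]


# Hypothetical per-language results for one topic.
results = {
    "English": [("en-1", 0.91), ("en-2", 0.40)],
    "French": [("fr-1", 0.75)],
    "German": [("de-1", 0.62), ("de-2", 0.10)],
    "Spanish": [("es-1", 0.55)],
}
combined = merge_ranked_lists(results, limit=4)
print(combined)
```

Raw-score merging assumes that scores from the separate per-language indexes are comparable. Document translation avoids this step because all documents are indexed together and a single ranked list comes out directly.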
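Document Translation-based Multilingual Retrieval
[Figure: the French, German and Spanish documents are translated into English and pooled with the English documents. The English query is run by IR against this all-English collection, producing a unified ranked list of documents.]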
Evaluation of Multilingual Retrieval Multilingual-4: English, TD Multilingual-8: English, TD
Query Translation vs. Document Translation
Topic 161, “Diets for Celiacs” (English words in topic: diets, celiacs)
• Query translation: Las Dietas para Celiacs (Spanish); Nahrungen für Celiacs (German)
• Document translation (word-by-word): Spanish doc words celíacos, dietas → celiacs, diets; German doc words diät, zöliakie → diets, coeliac diseases
• Average precision: 0.0003 (mul4en1); 0.6750 (mul4en2)
Topic 186 (English words in topic: Dutch, Netherlands)
• Query translation: Hollandais, Hollande (French)
• Document translation (word-by-word): French document words Néerlandais, Pays-Bas → Dutch, Netherlands (figure labels: 0.0, 1.0)
• Average precision: 0.2213 (mul4en1); 0.6167 (mul4en2)
Manual vs. Automatic Stemming
CLEF 2003 (topic fields: TD. No decompounding or query expansion)
CLEF 2001-2002 (topic fields: TD. No query expansion)
Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval
Topics (TD). Percentages are improvements over the baseline.
Features            Dutch            German           Finnish          Swedish
decomp+stem+expan   .5304 (22.16%)   .5678 (52.35%)   .5633 (48.20%)   .5465 (50.55%)
decomp+expan        .4962            .4804            .5541            .4838
stem+expan          .5126            .5473            .4469            .4880
decomp+stem         .4955            .5111            .4972            .4727
stem                .4744            .4294            .4204            .4331
expan               .4673            .4867            .4071            .4224
decomp              .4480            .4220            .4974            .4121
baseline            .4342            .3727            .3801            .3630
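The percentages next to the decomp+stem+expan scores are consistent with relative improvement over the baseline, (score - baseline) / baseline. The short check below recomputes them from the reported scores; the variable names are illustrative.

```python
# Scores reported on the slide (decomp+stem+expan vs. baseline).
full = {"Dutch": 0.5304, "German": 0.5678, "Finnish": 0.5633, "Swedish": 0.5465}
baseline = {"Dutch": 0.4342, "German": 0.3727, "Finnish": 0.3801, "Swedish": 0.3630}


def relative_gain(score, base):
    """Relative improvement over the baseline, in percent."""
    return (score - base) / base * 100


gains = {lang: round(relative_gain(full[lang], baseline[lang]), 2) for lang in full}
for lang, gain in gains.items():
    print(f"{lang}: {gain:.2f}%")
```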
Conclusions • Fast approximate document-translation worked well. Combining document-translation with query-translation was even better. • Decompounding with stemming and query expansion worked well for languages with rich compounds. • Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish and Swedish. But there is still room for improving statistical stemmers.
Software
The Berkeley Text Retrieval System is available for research purposes. Send requests to aitao@sims.berkeley.edu