350 likes | 600 Views
Cross-Lingual Image Search on the Web. Kobi Reiter, Stephen Soderland, Oren Etzioni Turing Center Computer Science and Engineering University of Washington. Limitations to Monolingual Image Search. Limited Resource Languages Slovenian query ‘grenivka’ (grapefruit).
E N D
Cross-Lingual Image Searchon the Web Kobi Reiter, Stephen Soderland, Oren Etzioni Turing Center Computer Science and Engineering University of Washington
Limitations to Monolingual Image Search • Limited Resource LanguagesSlovenian query ‘grenivka’ (grapefruit)
Slovenian query: ‘grevnika’(grapefruit) Only 24 results - 9 show grapefruit
Limitations to Monolingual Image Search • Limited Resource Languages Slovenian query ‘grenivka’ (grapefruit) • Cross-Cultural ImagesSearch for images of ‘food’ in different cultures
English query: ‘food’ Finds hamburgers, hot dogs, etc.
Zulu query: ‘ukudla’ (food) Toto, I don’t think we’re in Kansas anymore.
Limitations to Monolingual Image Search • Limited Resource Languages Slovenian query ‘grenivka’ (grapefruit) • Cross-Cultural Images Search for images of ‘food’ in different cultures • Cross-lingual homonymsSearch for images with Hungarian word for tooth
Hungarian query: ‘fog’ (tooth) Can’t see any teeth in all this fog.
Limitations to Monolingual Image Search • Limited Resource Languages Slovenian query ‘grenivka’ (grapefruit) • Cross-Cultural Images Search for images of ‘food’ in different cultures • Cross-lingual homonyms Search for images with Hungarian word for tooth • Word Sense Ambiguity Search in English for spring (flexible coil)
English query: ‘spring’ Doesn’t find intended sense (flexible coil)
The Solution: PanImages PanImages compiler dictionaries translation graph 1. query PanImages query processor 2. translations translated query user 3. select translation images
Translations of ‘grenivka’ French and English query has over 40,000 images
Translated query for ‘spring’ French ‘ressort’ is unambiguous
Outline of Talk • Overview of PanImages • Building a translation graph • Merging entries from multiple dictionaries • Computing translation probabilities • Image search with a translation graph • Experimental Results • Conclusions
Input from Machine Readable Dictionaries • Multilingual dictionaries: • Each entry has translations in multiple languages • Distinguishes different senses of the word • “Wiktionaries” for 171 languages created by Web volunteers www.wiktionary.org • Esperanto dictionary purl.org/net/voko/revo • Bilingual dictionaries: • Each entry has translations into a single language • May mix together different senses of the word • freedict.org has 64 open source dictionaries
Translation Graph • Nodes in the graph are ordered pairs (w, l) where w is a word in language l • Edges in the graph indicate translations between words • Each edge is labeled with a word sense ID Edges from ‘spring’ from an English dictionary ressort French printemps French 2 1 1 springEnglish 2 … … 2 1 primavera Spanish пружинаRussian
Merging English and French Dictionaries ربيعArabic 3 udaherri Basque printempsFrench 3 … 1 … 3 1 … 1 3 3 1 1 3 … koangaMaori primaveraSpanish 1 springEnglish 2 veer Dutch … рысора Belarusian 2 2 2 4 4 2 … … 4 4 2 … vzmetSlovenian ressort French 4 пружинаRussian 4
Inferring Word Sense Equivalence • Compute where and are word senses • Case1: and are each from dictionaries that • have translations into multiple languages • distinguish word senses • Case2: or are from dictionary that either • have translation into a single language • mix together word senses
Word Sense Equivalence Multilingual dictionaries: is proportional to the degree of overlap between and , where nodes(s) is the set of nodes with edges labeled s.
Word Sense Equivalence Bilingual dictionaries (or not sense distinguished): is high. Estimate this probability empirically. : a triangle from three dictionary entries xuân Vietnamese spring English printemps French
Computing Translation Probabilities • Probability decreases each time the word sense ID changes. • Probability increases with multiple distinct paths. ربيعArabic 3 udaherri Basque printempsFrench 3 … 1 … 3 1 … 1 3 3 1 1 3 … koangaMaori primaveraSpanish 1 springEnglish 2 2 пружина Russian …
Using PanImages: 1:Select language and word Select from 50 source languages automatic word completion
2: Select a Word Sense and Translation Select a word sense Select one or more translations
Graph Statistics • Translation graph from 17 dictionaries: • English Wiktionary: 19,500 words with translations • French Wiktionary: 12,700 words with translations • Esperanto dictionary: 23,000 words with translations • 14 bilingual dictionaries: average 90,000 words each • Graph has: 1.4 million words 957 languages 60 languages have over 1,000 words
Experiment 1: Sense Merging • Graph from 3 dictionaries: • English wiktionary • French wiktionary • Esperanto dictionary • Select random 1,000 English words from graph • Compare number of Hebrew or Russian translations • No sense merging: direct edges in graph • Merging: translation paths where • Measure precision of translations gained by merging
Results of Experiment 1 Sense merging gained: • 90% increase in translations for Hebrew with precision .83 • 79% increase in translations for Russian with precision .68 Recall Gain Merging no merging merged gain Precision Hebrew 80 152 90% 0.83 Russian 503 900 79% 0.68 Error analysis: Strong assumption that dictionaries distinguish word sense. Probability formulas fail when this assumption is violated.
Experiment 2: Image Search • Graph from 3 dictionaries as in Experiment 1 • 10 concepts with distinctive images: ant, clown, fig, lake, sky, train, eat, run, happy, tired • 100 random non-English terms • 10 terms for each concept • Compare results of Google Image search • using non-English term as search query • using PanImages default translation as search query • Metrics: • Number of results • Precision of first page of results
Results of Experiment 2 • 38-fold increase in the number of images returned • Increases precision from .50 to .69 Untranslated query PanImages translation Concept Results Precision Results Precision ant 7,415 0.49 653,000 0.72 clown 91,271 0.80 444,000 1.00 fig 11,905 0.40 3,210,000 0.33 lake 53,826 0.61 3,380,000 1.00 sky 121,373 0.50 2,290,000 0.83 train 34,168 0.84 1,790,000 1.00 eat 28,833 0.51 876,000 0.67 run 30,823 0.23 2,520,000 0.44 happy 37,056 0.21 3,120,000 0.39 tired 65,042 0.44 222,000 0.56 Average 48,171 0.50 1,850,500 0.69
Conclusions • PanImages: a fully-implemented cross-lingual image search system for the Web: www.cs.washington.edu/research/panimages • PanImages boosts recall 38-fold, and raises precision for non-English search queries • We introduced the translation graph • Combines multiple machine readable dictionaries • Probabilistic word-sense merging across dictionaries • Infers translations not found in any source dictionary
Thank you. Demo and poster in afternoon in room CSE 609 Or try it yourself: www.cs.washington.edu/research/panimages
Computing Translation Probabilities • CLISE can find translations that are not in any single source dictionary • Translation probability decreases with each transition to a new word sense ID path P from to …
Probability from Multiple Paths • Probability that is translated as in sense s increases when there are multiple paths between and “Noisy-or” probability model: