Translation of Web Queries Using Anchor Text Mining

Advisor: Dr. Hsu • Graduate: Wen-Hsiang Hu • Authors: Wen-Hsiang Lu • ACM, June 2002

Outline • Motivation • Objective • Introduction • Anchor Text Mining • Probabilistic Inference Model • Query Translation System • Experiments • Discussion • Conclusion • Personal Opinion

Motivation • One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names.

Objective • automatically extracting translations of Web query terms

Introduction • In this paper, we are interested in discovering translations of new terminology and proper names by mining Web anchor texts. • Problems with previous research methods: • parallel corpora for various subjects and multiple languages • lack of parallel correlation between word pairs • short query terms • (Figure: anchor texts pointing to the same page, e.g. "Yahoo", "雅虎", "雅虎搜尋", "美國雅虎".)

Anchor Text Mining • We use a triple form <Uj, Ui, Dk> to indicate that page Uj points to page Ui with description text Dk. • For a Web page (or URL) Ui, its anchor-text set AT(Ui) is defined as all of the anchor texts of the links pointing to Ui, i.e., Ui's inlinks. • For a query term appearing in AT(Ui), it is likely that its corresponding translations also appear together. • (Figure: several pages Uj linking to page Ui.)
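The anchor-text set definition above can be sketched in a few lines of Python. This is only an illustration: the function name `build_anchor_text_sets` and the example URLs are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_anchor_text_sets(triples):
    """Group description texts by the page they point to.

    Each triple is (Uj, Ui, Dk): page Uj links to page Ui with anchor text Dk.
    The result maps every target page Ui to its anchor-text set AT(Ui).
    """
    anchor_sets = defaultdict(list)
    for source_url, target_url, description in triples:
        anchor_sets[target_url].append(description)
    return dict(anchor_sets)


# Toy link data (hypothetical URLs, for illustration only).
triples = [
    ("http://a.example/", "http://portal.example/", "Yahoo"),
    ("http://b.example/", "http://portal.example/", "雅虎"),
    ("http://c.example/", "http://portal.example/", "美國雅虎"),
    ("http://a.example/", "http://news.example/", "news"),
]

anchor_sets = build_anchor_text_sets(triples)
print(anchor_sets["http://portal.example/"])
```

Because English and Chinese anchor texts for the same page end up in one set, a source term and its candidate translations can then be compared by how often they co-occur.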

Probabilistic Inference Model • Asymmetric similarity estimation model: the inductive rule "if Ts then Tt", i.e. $P(T_s \rightarrow T_t) = P(T_t \mid T_s) = \frac{P(T_s \cap T_t)}{P(T_s)}$ • It may cause some common terms to become the best translations. • Symmetric similarity estimation function based on the probabilistic inference model: the inductive rules "if Ts then Tt" and "if Tt then Ts", i.e. $P(T_s \leftrightarrow T_t) = \frac{P(T_s \cap T_t)}{P(T_s \cup T_t)} = \frac{P(T_s \cap T_t)}{P(T_s) + P(T_t) - P(T_s \cap T_t)}$ (2), where Ts is the source term and Tt is the target translation. • Example: 100 anchor texts in total; Ts = Yahoo appears in only one anchor text; Tt = 雅虎 appears in 10 anchor texts (雅虎 Yahoo, 雅虎動物, 雅虎企業, …). Then P(Tt | Ts) = 0.01 / 0.01 = 1 and P(Ts ↔ Tt) = 0.01 / [(0.01 + 0.1) − 0.01] = 0.1.
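The arithmetic of the Yahoo / 雅虎 example can be checked with a small sketch. It assumes the asymmetric score is P(Ts ∩ Tt) / P(Ts) and the symmetric score is P(Ts ∩ Tt) / (P(Ts) + P(Tt) − P(Ts ∩ Tt)), as the numbers on the slide suggest; the function names are hypothetical.

```python
from fractions import Fraction


def asymmetric_similarity(p_s, p_both):
    """P(Tt | Ts): how often Tt appears given that Ts appears."""
    return p_both / p_s


def symmetric_similarity(p_s, p_t, p_both):
    """P(Ts <-> Tt): co-occurrence divided by the probability of the union."""
    return p_both / (p_s + p_t - p_both)


# Counts from the slide: 100 anchor texts, "Yahoo" in 1, "雅虎" in 10,
# and both terms together in 1. Fractions keep the arithmetic exact.
total = 100
p_s = Fraction(1, total)      # P(Ts), Ts = "Yahoo"
p_t = Fraction(10, total)     # P(Tt), Tt = "雅虎"
p_both = Fraction(1, total)   # P(Ts and Tt)

asym = asymmetric_similarity(p_s, p_both)
sym = symmetric_similarity(p_s, p_t, p_both)
print(asym)  # 1
print(sym)   # 1/10
```

The two scores differ because the symmetric form also accounts for the nine anchor texts in which 雅虎 appears without Yahoo.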

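Probabilistic Inference Model (cont.) • Let U = (U1, U2, …, Un) be a concept space (Web page space), consisting of a set of pairwise disjoint basic concepts (Web pages), i.e., Ui ∩ Uj = ∅ for i ≠ j. We can rewrite Eq. (2) as follows: $P(T_s \leftrightarrow T_t) = \frac{\sum_{i=1}^{n} P(T_s \cap T_t \mid U_i)\, P(U_i)}{\sum_{i=1}^{n} P(T_s \cup T_t \mid U_i)\, P(U_i)}$, with $P(U_i) = \frac{L(U_i)}{\sum_{j=1}^{n} L(U_j)}$, where L(Uj) = the number of in-links of page Uj.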

Probabilistic Inference Model (cont.) • We assume that Ts and Tt are independent given Ui; then the joint probability P(Ts∩Tt|Ui) is equal to the product of P(Ts|Ui) and P(Tt|Ui). • The above estimation approach considers the link information and degree of authority among Web pages.

Query Translation System • Three different methods to extract Chinese terms: • PAT-tree-based • check whether the strings of candidate terms are complete within a lexical boundary • decide the importance of a term based on its relative frequency • Query-set-based • take queries from search engines • query sets of different sizes • Tagger-based • use CKIP’s tagger • extract unknown words • (Figure: example for "Yahoo" with the extracted terms 雅虎, 雅虎搜尋, 美國雅虎.)
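As a rough illustration of the query-set-based idea, the sketch below treats a set of known query terms as the vocabulary and counts which of them occur in a page's anchor texts. Plain substring matching, the function name `extract_candidates`, and the toy query set are assumptions for this example; the slides do not describe the matching procedure.

```python
from collections import Counter


def extract_candidates(anchor_texts, query_set):
    """Count how many anchor texts contain each term of the query set."""
    counts = Counter()
    for text in anchor_texts:
        for term in query_set:
            if term in text:  # simple substring match (an assumption)
                counts[term] += 1
    return counts


# Toy data: anchor texts of one page and a small hypothetical query set.
anchor_texts = ["Yahoo", "雅虎", "雅虎搜尋", "美國雅虎"]
query_set = {"雅虎", "搜尋", "美國", "新聞"}

candidates = extract_candidates(anchor_texts, query_set)
print(candidates.most_common())
```

A larger query set would match more strings, which fits the later observation that query-set size affects both coverage and noise.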

Experiments • Experimental Environment • Popular query terms were collected from the logs of Dreamer and GAIS. • These query terms were taken as the major test set in our term translation extraction analysis. • We filtered out the terms that had no corresponding Chinese translations in the anchor-text database and selected 622 English terms as the source query set.

Experiments (cont.) • Evaluation Metric • For a set of test query terms, the top-n inclusion rate is defined as the percentage of the query terms whose effective translation(s) can be found in the top n extracted translations.
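The metric can be written down directly from this definition. In the sketch below, the function name, the data layout (a ranked candidate list per query term and a set of effective translations per term), and the toy data are assumptions; the numbers are not results from the paper. The function returns a fraction, so multiply by 100 for a percentage.

```python
def top_n_inclusion_rate(ranked, effective, n):
    """Fraction of query terms with an effective translation in their top n.

    ranked:    dict mapping a query term to its ranked list of extracted
               translations (best first).
    effective: dict mapping a query term to the set of acceptable translations.
    """
    if not ranked:
        return 0.0
    hits = 0
    for term, candidates in ranked.items():
        if set(candidates[:n]) & effective.get(term, set()):
            hits += 1
    return hits / len(ranked)


# Toy data for illustration only.
ranked = {
    "yahoo": ["雅虎", "搜尋"],
    "sakura": ["蜘蛛網", "櫻花"],
}
effective = {"yahoo": {"雅虎"}, "sakura": {"櫻花"}}

top1 = top_n_inclusion_rate(ranked, effective, 1)
top2 = top_n_inclusion_rate(ranked, effective, 2)
print(top1, top2)  # 0.5 1.0
```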

Experiments (cont.) • Performance with Various Similarity Estimation Models • MA: asymmetric model • MAL: asymmetric model with link information • MS: symmetric model • MSL: symmetric model with link information (the proposed model) • Test setting: 622 English query terms and the query-set-based method

Experiments (cont.) • Performance with Various Term Extraction Methods • use MSL as similarity estimation model

Experiments (cont.) • Performance with Various Query-Set Sizes • The medium-sized query set achieved the best performance. • Example: "sakura" • 9,709 terms: 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving) • 228,566 terms: 庫洛魔法使 (Card Captor Sakura); 櫻花建設 (Sakura Development Corporation); 模仿 (imitation); 櫻花大戰 (Sakura Wars); 美夕 (Miyu, name of an actress); 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving) • A larger query set might also produce more noise.

Discussion • Comparisons with a translation lexicon • Queries suitable for finding translations • Extracting domain-specific translations • Experiments on Simplified Chinese pages

Conclusion • proposing a new and effective approach for mining Web link structures and anchor texts for translations of Web query terms. • Future research • combining more in-depth linguistic knowledge to remove noisy terms.

Personal Opinion • ……..
