This paper discusses the use of anchor text mining to automatically extract translations of web query terms, addressing the lack of appropriate translations in cross-language information retrieval and web search. The paper introduces a probabilistic inference model and a query translation system, and presents experimental results and discussions.
Translation of Web Queries Using Anchor Text Mining • Advisor: Dr. Hsu • Graduate: Wen-Hsiang Hu • Authors: Wen-Hsiang Lu • ACM, June 2002
Outline • Motivation • Objective • Introduction • Anchor Text Mining • Probabilistic Inference Model • Query Translation System • Experiments • Discussion • Conclusion • Personal Opinion
Motivation • One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names.
Objective • automatically extracting translations of Web query terms
Introduction • In this paper, we are interested in discovering translations of new terminology and proper names through mining Web anchor texts. • Problems with previous research methods: • the need for parallel corpora covering various subjects and multiple languages • lack of parallel correlation between word pairs • short query terms • [Figure: anchor texts pointing to the same page, e.g. Yahoo, 雅虎, 雅虎 搜尋, 美國雅虎]
Anchor Text Mining • [Figure: several pages Uj linking to page Ui] • We use a triple form <Uj, Ui, Dk> to indicate that page Uj points to page Ui with description text Dk. • For a Web page (or URL) Ui, its anchor-text set AT(Ui) is defined as all of the anchor texts of the links pointing to Ui, i.e., Ui's in-links. • For a query term appearing in AT(Ui), it is likely that its corresponding translations also appear in AT(Ui).
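The triple form and the anchor-text set AT(Ui) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_anchor_text_sets` and the sample URLs are hypothetical.

```python
from collections import defaultdict


def build_anchor_text_sets(triples):
    """Group anchor texts by the page they point to.

    Each triple is (source_url, target_url, description_text),
    i.e. <Uj, Ui, Dk>. The result maps each target page Ui to
    AT(Ui), the list of anchor texts of its in-links.
    """
    anchor_texts = defaultdict(list)
    for _source_url, target_url, text in triples:
        anchor_texts[target_url].append(text)
    return dict(anchor_texts)


# Hypothetical sample links (URLs are made up for illustration).
triples = [
    ("http://a.example/", "http://www.yahoo.com/", "Yahoo"),
    ("http://b.example/", "http://www.yahoo.com/", "雅虎"),
    ("http://c.example/", "http://www.yahoo.com/", "美國雅虎"),
    ("http://a.example/", "http://other.example/", "櫻花"),
]
at = build_anchor_text_sets(triples)
print(at["http://www.yahoo.com/"])
```

Because the English and Chinese anchor texts end up in the same AT(Ui), a source term and its candidate translations can be compared by how often they share a page.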
Probabilistic Inference Model • Asymmetric similarity estimation model: P(Ts→Tt) = P(Tt|Ts) = P(Ts∩Tt) / P(Ts), the inductive rule "if Ts then Tt", where Ts is the source term and Tt is the target translation. • Drawback: some common terms may become the best translations. • Symmetric similarity estimation function based on the probabilistic inference model: P(Ts↔Tt) = P(Ts∩Tt) / P(Ts∪Tt) = P(Ts∩Tt) / [P(Ts) + P(Tt) − P(Ts∩Tt)] (2), which combines the inductive rules "if Ts then Tt" and "if Tt then Ts". • Example: 100 anchor texts in total; Ts = Yahoo appears in only one anchor text; Tt = 雅虎 appears in 10 anchor texts (e.g. 雅虎 Yahoo, 雅虎 動物, 雅虎 企業, …). Then P(Tt|Ts) = 0.01 / 0.01 = 1 and P(Ts↔Tt) = 0.01 / [(0.01 + 0.1) − 0.01] = 0.1.
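The worked numbers on this slide can be checked with a short sketch. It assumes the asymmetric score is P(Ts∩Tt)/P(Ts) and the symmetric score is P(Ts∩Tt)/P(Ts∪Tt), which is what the slide's arithmetic implies; the function names and the "common term" count of 50 are hypothetical. Counts are used instead of probabilities because the total number of anchor texts cancels out.

```python
def asymmetric_similarity(n_source, n_both):
    """P(Tt | Ts): share of anchor texts containing Ts that also contain Tt."""
    return n_both / n_source if n_source else 0.0


def symmetric_similarity(n_source, n_target, n_both):
    """P(Ts <-> Tt): co-occurrences divided by the size of the union."""
    n_union = n_source + n_target - n_both
    return n_both / n_union if n_union else 0.0


# Slide example: "Yahoo" occurs in 1 anchor text, "雅虎" in 10,
# and they co-occur in 1 (out of 100 anchor texts in total).
asym = asymmetric_similarity(n_source=1, n_both=1)
sym = symmetric_similarity(n_source=1, n_target=10, n_both=1)

# Hypothetical common term that occurs in 50 anchor texts,
# one of which also contains "Yahoo".
asym_common = asymmetric_similarity(n_source=1, n_both=1)
sym_common = symmetric_similarity(n_source=1, n_target=50, n_both=1)

print(asym, sym, asym_common, sym_common)
```

Under the asymmetric score the common term ties with 雅虎, whereas the symmetric score penalizes it for appearing in many anchor texts that do not contain the source term.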
Probabilistic Inference Model (cont.) • Let U = (U1, U2, …, Un) be a concept space (Web page space) consisting of a set of pairwise disjoint basic concepts (Web pages), i.e., Ui∩Uj = ∅ for i≠j. We can rewrite Eq. (2) as follows: P(Ts↔Tt) = Σi P(Ts∩Tt|Ui) P(Ui) / Σi P(Ts∪Tt|Ui) P(Ui) (3), with P(Ui) = L(Ui) / Σj L(Uj), where L(Uj) = the number of in-links of page Uj.
Probabilistic Inference Model (cont.) • We assume that Ts and Tt are independent given Ui; the joint probability P(Ts∩Tt|Ui) is then equal to the product P(Ts|Ui) · P(Tt|Ui), and consequently P(Ts∪Tt|Ui) = P(Ts|Ui) + P(Tt|Ui) − P(Ts|Ui) · P(Tt|Ui). • This estimation approach takes into account the link information and the degree of authority among Web pages.
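A link-weighted symmetric score along these lines might be sketched as below. Several details are assumptions rather than statements from the slides: each page is weighted by its share of all in-links, P(Ui) = L(Ui) / Σj L(Uj); P(T|Ui) is estimated as the fraction of Ui's anchor texts that contain T as a substring; and the names `term_prob` and `symmetric_link_similarity` and the sample data are hypothetical.

```python
def term_prob(term, anchor_texts):
    """Assumed estimate of P(T | Ui): fraction of Ui's anchor texts containing T."""
    if not anchor_texts:
        return 0.0
    return sum(term in text for text in anchor_texts) / len(anchor_texts)


def symmetric_link_similarity(source, target, anchor_sets):
    """Link-weighted symmetric similarity between two terms.

    anchor_sets maps each page Ui to the list of its in-link anchor texts,
    so len(anchor_sets[Ui]) plays the role of L(Ui).
    """
    total_links = sum(len(texts) for texts in anchor_sets.values())
    if total_links == 0:
        return 0.0
    numerator = 0.0
    denominator = 0.0
    for texts in anchor_sets.values():
        p_page = len(texts) / total_links      # assumed P(Ui)
        p_s = term_prob(source, texts)
        p_t = term_prob(target, texts)
        p_joint = p_s * p_t                    # independence given Ui
        p_union = p_s + p_t - p_joint          # inclusion-exclusion
        numerator += p_joint * p_page
        denominator += p_union * p_page
    return numerator / denominator if denominator else 0.0


# Hypothetical anchor-text sets for two pages.
anchor_sets = {
    "yahoo": ["Yahoo", "雅虎", "Yahoo 雅虎", "美國雅虎"],
    "zoo": ["動物", "動物園"],
}
score_good = symmetric_link_similarity("Yahoo", "雅虎", anchor_sets)
score_bad = symmetric_link_similarity("Yahoo", "動物", anchor_sets)
print(score_good, score_bad)
```

Pages with many in-links contribute more to both sums, which is one way to read the slide's remark about link information and authority.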
Query Translation System • Three different methods are used to extract Chinese terms: • PAT-tree-based: checks whether the strings of candidate terms are complete at a lexical boundary, and decides the importance of a term based on its relative frequency • Query-set-based: takes queries from search engines as candidate terms, using query sets of different sizes • Tagger-based: uses the CKIP tagger to extract unknown words • [Figure: example for the query "Yahoo" with anchor texts such as 雅虎, 雅虎 搜尋, 美國雅虎]
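As a rough illustration of the query-set-based idea, the sketch below treats a set of known query terms as the candidate vocabulary and counts how often each one occurs in the anchor texts of pages whose anchor texts also mention the source term. The matching rule (plain substring search), the function name `extract_candidates`, and the sample data are all assumptions made for illustration; the slides do not describe the actual procedure at this level of detail.

```python
from collections import Counter


def extract_candidates(anchor_sets, query_set, source_term):
    """Count query-set terms found in anchor texts of pages that mention source_term."""
    counts = Counter()
    for texts in anchor_sets.values():
        # Skip pages whose anchor-text set never mentions the source term.
        if not any(source_term in text for text in texts):
            continue
        for text in texts:
            for term in query_set:
                if term in text:
                    counts[term] += 1
    return counts


# Hypothetical anchor-text sets and query set.
anchor_sets = {
    "yahoo": ["Yahoo", "雅虎", "雅虎搜尋", "美國雅虎"],
    "zoo": ["動物"],
}
query_set = {"雅虎", "搜尋", "動物"}
candidates = extract_candidates(anchor_sets, query_set, "Yahoo")
print(candidates.most_common())
```

The resulting counts could then feed a similarity model to rank the candidates; a larger query set yields more candidates but also more chances for spurious matches.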
Experiments • Experimental Environment • Popular query terms were collected from the logs of Dreamer and GAIS. • These query terms were taken as the major test set in our term translation extraction analysis. • We filtered out the terms that had no corresponding Chinese translations in the anchor-text database and selected 622 English terms as the source query set.
Experiments (cont.) • Evaluation Metric • For a set of test query terms, the top-n inclusion rate is defined as the percentage of query terms for which at least one effective translation can be found among the top n extracted translations.
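The metric can be written down directly from its definition. In the sketch below, the function name `top_n_inclusion_rate`, the data layout, and the sample rankings are hypothetical and are not results from the paper.

```python
def top_n_inclusion_rate(extracted, gold, n):
    """Fraction of query terms with at least one correct translation in the top n.

    extracted: dict mapping a query term to its ranked list of candidates.
    gold: dict mapping a query term to the set of acceptable translations.
    """
    if not extracted:
        return 0.0
    hits = 0
    for term, ranked in extracted.items():
        acceptable = gold.get(term, set())
        if any(candidate in acceptable for candidate in ranked[:n]):
            hits += 1
    return hits / len(extracted)


# Made-up rankings for three query terms.
extracted = {
    "yahoo": ["雅虎", "搜尋"],
    "sakura": ["蜘蛛網", "櫻花"],
    "nokia": ["手機"],
}
gold = {
    "yahoo": {"雅虎"},
    "sakura": {"櫻花"},
    "nokia": {"諾基亞"},
}
top1 = top_n_inclusion_rate(extracted, gold, n=1)
top2 = top_n_inclusion_rate(extracted, gold, n=2)
print(top1, top2)
```

Multiply the returned fraction by 100 to report it as a percentage, as the definition does.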
Experiments (cont.) • Performance with Various Similarity Estimation Models • MA: asymmetric model • MAL: asymmetric model with link information • MS: symmetric model • MSL: symmetric model with link information (the proposed model) • Test setup: 622 English query terms and the query-set-based method
Experiments (cont.) • Performance with Various Term Extraction Methods • MSL is used as the similarity estimation model
Experiments (cont.) • Performance with Various Query-Set Sizes • The medium-sized query set achieved the best performance. • A larger query set might also produce more noise. • Example: "sakura" • 9,709 terms: 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving) • 228,566 terms: 庫洛魔法使 (Card Captor Sakura); 櫻花建設 (Sakura Development Corporation); 模仿 (imitation); 櫻花大戰 (Sakura Wars); 美夕 (Miyu, name of an actress); 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)
Discussion • Comparisons with a translation lexicon • Queries suitable for finding translations • Extracting domain-specific translations • Experiments on Simplified Chinese pages
Conclusion • The paper proposes a new and effective approach for mining Web link structures and anchor texts for translations of Web query terms. • Future research • Combining more in-depth linguistic knowledge to remove noisy terms.
Personal Opinion • ……..