
Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Pu-Jen Cheng

國立雲林科技大學 National Yunlin University of Science and Technology. Translating unknown queries with web corpora for cross-language information retrieval. Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Pu-Jen Cheng Jei-Wen Teng


Presentation Transcript


  1. 國立雲林科技大學 National Yunlin University of Science and Technology • Translating unknown queries with web corpora for cross-language information retrieval • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, Lee-Feng Chien • Microsoft Research, NLPRS, 2001

  2. Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Review approaches • Search-Result Based approach • Term Extraction • Translation Extraction • Experiments • Conclusions • Opinion

  3. Motivation • Domain-specific corpora might not be applicable • Queries with unknown terms • Real queries are short, e.g. CIH, NBA, Disneyland • Efficiently translate unknown terms in short queries

  4. Objective • Be effective in extracting correct translations of unknown query terms

  5. 1. Introduction • Real queries are often short: 2.3 words in English and 3.18 characters in Chinese • Conventional CLIR approaches might not be applicable to short queries with unknown terms • Translations of query terms can often be found in search-result pages • Term extraction: word segmentation • Translation selection: estimating term similarity

  6. 2. Review of Web-Based Approaches • 1. The parallel-corpus-based approaches: <English, French>, <Chinese, English> • 2. The comparable-corpus-based approaches: use a vector-space model • 3. The anchor-text-based approach

  7. 3.1 Observation

  8. 3.2 Considered Problem and Challenge • Term segmentation • Semantically close translations for each unknown query term

  9. 3.3 Term Extraction • Symmetric conditional probability (SCP) • Concept of context dependency (CD)

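  10. 3.3 Term Extraction • Symmetric conditional probability (SCP), numerator: squared frequency of the candidate term • Example corpus: 賓拉登說了 ("Bin Laden said") ×2, 賓拉登 ("Bin Laden") ×2, 說了 ("said") ×100 • 賓拉登說了: 2² = 4 • 賓拉登: 4² = 16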

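  11. 3.3 Term Extraction • SCP denominator: average, over all split points, of freq(left part) × freq(right part) • 賓拉登說了 = 1/4 × ((4×2) + (4×2) + (4×102) + (2×102)) = 157 • 賓拉登 = 1/2 × ((4×4) + (4×4)) = 16 • Example corpus: 賓拉登說了 ×2, 賓拉登 ×2, 說了 ×100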
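The SCP arithmetic on the two slides above can be checked with a short sketch. It assumes the usual form SCP(w1…wn) = freq(w1…wn)² / ((1/(n−1)) Σ freq(w1…wi) × freq(wi+1…wn)), which matches the numerators and denominators shown; the function names `ngram_counts` and `scp` are illustrative and not taken from the presentation.

```python
from collections import Counter


def ngram_counts(corpus):
    """Count every character n-gram of every length, weighted by string frequency.

    `corpus` maps a string to the number of times it occurs (toy data).
    """
    counts = Counter()
    for text, freq in corpus.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                counts[text[i:j]] += freq
    return counts


def scp(term, counts):
    """Squared frequency of the term divided by the average, over all
    split points, of freq(left part) * freq(right part)."""
    n = len(term)
    if n < 2:
        raise ValueError("SCP needs a term of at least two characters")
    avg = sum(counts[term[:i]] * counts[term[i:]] for i in range(1, n)) / (n - 1)
    return counts[term] ** 2 / avg


# Toy corpus from the slides: two strings seen twice each, one seen 100 times.
corpus = {"賓拉登說了": 2, "賓拉登": 2, "說了": 100}
counts = ngram_counts(corpus)

print(scp("賓拉登", counts))      # 16 / 16
print(scp("賓拉登說了", counts))  # 4 / 157
```

The complete name 賓拉登 scores far higher than the over-long candidate 賓拉登說了, because the very frequent 說了 inflates the denominator of the longer string.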

  12. 3.3 Term Extraction • Concept of context dependency (CD) • LC(w1…wn) is the number of unique left-adjacent words/characters of the n-gram w1…wn in the corpus • Example: 捉賓拉登了 ("caught Bin Laden") contains the trigrams 捉賓拉, 賓拉登, 拉登了

  13. 3.3 Term Extraction • Concept of context dependency (CD) • 3*3 / 3^2 • 對賓拉登說 • 伊賓拉登將 • 對賓拉登的
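As a rough illustration of context dependency, the sketch below assumes CD = LC × RC / freq², where LC and RC are the numbers of unique left- and right-adjacent characters and freq is the term's frequency. This is the form the "3*3 / 3^2" expression suggests. The sentences are illustrative toy data in the spirit of the slide, and `context_dependency` is a hypothetical helper name.

```python
def context_dependency(term, sentences):
    """Return LC * RC / freq**2 for `term` over a list of strings.

    LC / RC: number of distinct characters seen immediately to the
    left / right of the term; freq: number of occurrences of the term.
    """
    left, right, freq = set(), set(), 0
    for s in sentences:
        start = s.find(term)
        while start != -1:
            freq += 1
            if start > 0:
                left.add(s[start - 1])
            end = start + len(term)
            if end < len(s):
                right.add(s[end])
            start = s.find(term, start + 1)
    if freq == 0:
        return 0.0
    return len(left) * len(right) / freq ** 2


# Illustrative sentences (assumption): three occurrences, each with a
# different character on either side of 賓拉登.
sentences = ["對賓拉登說", "伊賓拉登將", "與賓拉登的"]

cd_full = context_dependency("賓拉登", sentences)   # LC=3, RC=3, freq=3
cd_part = context_dependency("賓拉", sentences)     # right neighbour is always 登
print(cd_full, cd_part)
```

A complete term appears in varied contexts and so keeps a high CD, while a fragment such as 賓拉 is always followed by the same character and is penalized.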

  14. 3.4 Translation Extraction • 1. Chi-square method: simpler; depends on the co-occurrences of a query term and its translation candidates on the Web • 2. Context-vector method: extracts a so-called context vector as a feature from the search-result pages for each term

  15. 3.4 Translation Extraction • 1. Chi-square method

  16. 3.4 Translation Extraction • 1. Chi-square method
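The formula on these two slides did not survive in the transcript. As a hedged sketch, the code below uses the standard chi-square statistic for a 2×2 contingency table built from page counts; whether the slides use exactly this form is an assumption. The function name, parameter names, and all counts are hypothetical.

```python
def chi_square(n_s, n_t, n_both, n_total):
    """Standard 2x2 chi-square association score between terms s and t.

    n_s, n_t : number of pages containing s, containing t
    n_both   : number of pages containing both s and t
    n_total  : total number of pages considered
    """
    a = n_both                          # s and t
    b = n_s - n_both                    # s without t
    c = n_t - n_both                    # t without s
    d = n_total - n_s - n_t + n_both    # neither
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom


# Hypothetical page counts: same marginals, different co-occurrence counts.
strong = chi_square(n_s=100, n_t=200, n_both=80, n_total=1000)
weak = chi_square(n_s=100, n_t=200, n_both=40, n_total=1000)
independent = chi_square(n_s=100, n_t=200, n_both=20, n_total=1000)
print(strong, weak, independent)
```

In a search-result-based setting the four counts would come from page counts reported by a search engine for the queries s, t, and "s AND t", so no local corpus is needed.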

  17. 3.4 Translation Extraction • 2. Context-vector method • A query term and its translation tend to share common contextual terms

  18. 3.4 Translation Extraction • 2. Context-vector method
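The idea of comparing terms by their shared contextual terms can be sketched as follows. This is a simplification under stated assumptions: plain term frequencies stand in for whatever weighting the slides use, similarity is cosine, and the pre-tokenized snippets are invented toy data rather than real search results.

```python
import math
from collections import Counter


def context_vector(snippets, term):
    """Count the terms that co-occur with `term` in tokenized snippets."""
    vec = Counter()
    for tokens in snippets:
        vec.update(tok for tok in tokens if tok != term)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical tokenized snippets for a source term and two candidates.
source = context_vector([["NBA", "籃球", "球隊"], ["NBA", "籃球", "喬丹"]], "NBA")
good = context_vector([["美國職籃", "籃球", "喬丹"], ["美國職籃", "球隊"]], "美國職籃")
bad = context_vector([["下載位址", "軟體", "網頁"]], "下載位址")

print(cosine(source, good), cosine(source, bad))
```

The candidate whose snippets share contextual terms with the source term receives the higher similarity, while a candidate with no overlapping context scores zero.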

  19. 3.5 The Combined Approach • Combines the probabilistic inference model with the context-vector and chi-square methods • Linear combination weighting scheme
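A linear combination of the three methods might look like the sketch below. The transcript does not say whether raw scores or ranks are combined, nor which weights are used, so the normalized scores, the equal weights, and the dictionary keys here are all assumptions made for illustration.

```python
def combined_score(scores, weights):
    """Weighted sum of per-method scores for one translation candidate."""
    return sum(weights[m] * scores[m] for m in weights)


def rank_candidates(candidates, weights):
    """Sort candidate translations by combined score, best first."""
    return sorted(candidates,
                  key=lambda c: combined_score(candidates[c], weights),
                  reverse=True)


# Hypothetical weights and scores, each score already normalized to [0, 1].
weights = {"probabilistic_inference": 1 / 3,
           "context_vector": 1 / 3,
           "chi_square": 1 / 3}
candidates = {
    "賓拉登": {"probabilistic_inference": 0.7, "context_vector": 0.8, "chi_square": 0.9},
    "下載位址": {"probabilistic_inference": 0.0, "context_vector": 0.1, "chi_square": 0.2},
}

ranking = rank_candidates(candidates, weights)
print(ranking)
```

Because the methods produce scores on different scales, some normalization (or a rank-based combination) is needed before the weighted sum is meaningful.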

  20. 4. Performance Evaluation • Parallel corpus: the Hong Kong Law parallel text collection, 238,236 English-Chinese text paragraphs • Anchor text: 1,980,816 traditional Chinese web pages collected; 109,416 pages containing both Chinese and English terms selected

  21. 4.1.1 Query Translation • Retrieve Chinese documents using English queries

  22. 4.1.1 Query Translation

  23. 4.2 Translation of Web Query Terms • Web queries collected from Dreamer and GAIS

  24. 5. Discussion • Flexibility for query specification: it is difficult to specify ‘correct’ queries • Translation effectiveness: English-to-Japanese, English-to-Korean • The search-result-based approach does not require crawling the Web and downloading web documents

  25. 6. Conclusion • Translate unknown queries using the dynamic Web as the corpus

  26. Opinion • “Bin Laden” occurs 9,660 times • “Bin Laden” and “稱拉登” co-occur only 54 times (infrequent) • The method can extract “賓拉登” or “本拉登”, but not “稱拉登” • We can use more keywords to crawl documents: 恐怖份子, 恐怖主義, 攻擊美國, 阿巴汗, … • recall

  27. Opinion • <CIH, 病毒,掃毒,防毒 <W, AS,AS,AS W

  28. Opinion • Proper names or geographic names

  29. Opinion

  30. Opinion • bin laden • Automatic Translations • 恐怖; 賓拉登; 拉登; 儲藏箱裝載; 奧薩瑪; 恐怖份子; 份子; 攻擊事件; 收藏室裝載; 奧薩馬; 本拉登; 辯論會手稿; 恐怖行動; 下載位址; 箱子裝載; 日星期二; 恐怖大亨; 研究月報; 沙國王; 箱子結滿果實; 美國遭到恐怖份子攻擊; 號行政長官; 收入於箱收藏室裝載;

  31. Opinion • pork • Automatic Translations • 豬肉; 產銷; 協會; 政府補助金; 美國; 福利正宗; 官職等; 共有頁數; 正宗; 招牌燒肉; 湖州粽; 福利正宗湖州; 灰色同緣色個; 福利正宗湖州粽; 美食休閒; 齋佬; 我收唔; 正宗湖州; 正宗湖州粽; 休閒; 遊戲物品; 豬扒; 回覆;

  32. Opinion • 陳水扁 • Automatic Translations • 陳水扁; 首度約見; 返台後; 羅文嘉; 競選總部; 總統; 陳水扁網路競; 陳水扁總統; 總統陳水扁; 台獨; 夜光新聞; 夜光; 人民網; 專題; 總統就職; 坏萿; 新華網; 新聞網; 相關專題; 陳水扁主持; 分裂國家; 台灣總統陳水扁; 台灣總統;

  33. Opinion • Gorbachev • Automatic Translations • 戈爾巴喬夫; 蘇聯; 戈巴契夫; 上誼; 臺灣麥克; 詳細資料; 書林; 集團解體之後; 人民鬥垮; 總統戈爾巴喬夫; 網頁; 書名; 名存實亡; 領導人可以學戈巴契夫; 自從蘇聯; 人民鬥垮自己; 蘇聯集團解體之後; 自從蘇聯集團; 出版日;

  34. Opinion • Clinton • Automatic Translations • 柯林頓; 希拉; 克林頓; 幫助; 白宮; 總署柯江; 柯林頓性; 美國總統; 總統; 柯林頓性醜聞; 幫助使用頁; 使用頁; 幫助使用; 幫助目錄; 回函; 體說明; 羅竹茜譯; 說明柯林頓性醜聞笑話; 說明柯林頓性醜聞; 軟體名稱; 很像喔; 憮鴝; 說明柯林頓性; 體說明柯林頓性醜聞; 蝗仵蚺;

  35. Opinion • bread • Automatic Translations • 麵包; 靈糧; 每日; 小站; 吐司; 華納; 食物; 麵包合唱團; 網頁; 歡迎祝福; 撰文; 件物品; 鹽巴; 影音; 巧克力麵包; 象徵著平安富足意; 著平安; 人們拿起; 象徵著平安; 俄羅斯人對賓客無上; 發貼; 麵包沾著鹽巴食用; 就象徵著平安富足;

  36. Opinion • Michael JORDAN • Automatic Translations • 喬丹; 麥可‧喬丹; 世界; 籃球; 網站; 酷的; 喬登; 麥可喬登; 芝加哥公牛隊; 品牌; 精彩; 米高佐敦; 得分王; 邁爾克; 妣舅; 電腦遊戲; 短片分享; 歷屆得分王; 圖片集; 第一家; 討論區; 新幻社區; 短片分享區; 歷屆得分;

  37. Opinion • 餐廳 • Automatic Translations • 餐廳; 流行; 岩燒; 悠浮; 主題; 搜尋; 義大利; 主題餐廳; 餐廳介紹; 特約; 料理; 餐廳搜尋; 關鍵字; 中式餐廳; 義大利餐廳; 特約餐廳; 請輸入; 休閒; 餐廳導覽; 導覽; 餐廳小吃; 藥膳; 台北市;

  38. Opinion
