國立雲林科技大學 National Yunlin University of Science and Technology. Translating unknown queries with web corpora for cross-language information retrieval. Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Pu-Jen Cheng Jei-Wen Teng
國立雲林科技大學 National Yunlin University of Science and Technology
Translating unknown queries with web corpora for cross-language information retrieval
• Advisor: Dr. Hsu
• Graduate: Chien-Shing Chen
• Authors: Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, Lee-Feng Chien
Microsoft Research, NLPRS, 2001
Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Review approaches • Search-Result Based approach • Term Extraction • Translation Extraction • Experiments • Conclusions • Opinion
Motivation
• Domain-specific corpora might not be applicable
• Queries often contain unknown terms
• Real queries are short, e.g. CIH, NBA, Disneyland
• Unknown terms in short queries need to be translated efficiently
Objective
• Effectively extract correct translations of unknown query terms
1. Introduction
• Real queries are often short: on average 2.3 words in English and 3.18 characters in Chinese.
• Conventional CLIR approaches might not be applicable to short queries with unknown terms.
• Translations of query terms can often be found in search-result pages.
• Term extraction: word segmentation
• Translation selection: estimating term similarity
2. Review of Web-Based Approaches
1. The parallel-corpus-based approaches
• <English, French>, <Chinese, English>
2. The comparable-corpus-based approaches
• Used a vector-space model
3. The anchor-text-based approach
3.1 Observation
3.2 Considered Problem and Challenge
• Term segmentation
• Finding semantically close translations for each unknown query term
3.3 Term Extraction
• Symmetric conditional probability (SCP)
• Concept of context dependency (CD)
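3.3 Term Extraction
• Symmetric conditional probability (SCP)
• Example corpus: 賓拉登說了 (2 occurrences), 賓拉登 (2 occurrences), 說了 (100 occurrences)
• Squared frequency: 賓拉登說了 = 2^2 = 4; 賓拉登 = 4^2 = 16
• Average frequency product over all two-part splits:
  賓拉登說了 = 1/4 * [(4*2) + (4*2) + (4*102) + (2*102)] = 157
  賓拉登 = 1/2 * [(4*4) + (4*4)] = 16
• SCP(賓拉登說了) = 4/157 ≈ 0.025; SCP(賓拉登) = 16/16 = 1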
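The SCP arithmetic on these slides can be reproduced with a short sketch. It assumes the usual frequency form of SCP (the squared frequency of the n-gram divided by the average frequency product over all two-part splits), which is what the slide's numbers appear to follow. The helper names `freq` and `scp` and the toy corpus are illustrative, not taken from the paper.

```python
from typing import List


def freq(ngram: str, corpus: List[str]) -> int:
    """Count (non-overlapping) occurrences of ngram as a substring of the corpus strings."""
    return sum(text.count(ngram) for text in corpus)


def scp(ngram: str, corpus: List[str]) -> float:
    """Assumed SCP form: freq(w1..wn)^2 divided by the average, over all
    n-1 split points, of freq(w1..wi) * freq(wi+1..wn)."""
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP needs an n-gram of at least two characters")
    split_products = [
        freq(ngram[:i], corpus) * freq(ngram[i:], corpus)
        for i in range(1, n)
    ]
    avg_split = sum(split_products) / (n - 1)
    return freq(ngram, corpus) ** 2 / avg_split


# Toy corpus matching the slide's counts (hypothetical layout of the data).
corpus = ["賓拉登說了"] * 2 + ["賓拉登"] * 2 + ["說了"] * 100

print(scp("賓拉登說了", corpus))  # 4 / 157
print(scp("賓拉登", corpus))      # 16 / 16
```

Because numerator and denominator are both products of two frequencies, dividing every frequency by the corpus size would not change the ratio, so raw counts are enough here.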
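3.3 Term Extraction
• Concept of context dependency (CD)
• LC(w1…wn) is the number of unique left-adjacent words/characters for the n-gram in the corpus; RC(w1…wn) is the corresponding number of unique right-adjacent words/characters.
• Example: 捉賓拉登了 contains the trigrams 捉賓拉, 賓拉登, and 拉登了.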
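3.3 Term Extraction
• Concept of context dependency (CD)
• Example contexts for 賓拉登: 對賓拉登說, 伊賓拉登將, 對賓拉登的
• Unique left neighbors: 對, 伊 (LC = 2); unique right neighbors: 說, 將, 的 (RC = 3); frequency = 3
• CD(賓拉登) = LC * RC / freq^2 = 2*3 / 3^2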
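The context-dependency measure can be sketched in the same spirit. The code assumes CD is the product of the unique left and right neighbor counts divided by the squared frequency of the n-gram, and it runs on the three example contexts from the slide. The function names `context_stats` and `cd` are hypothetical.

```python
from typing import List, Set, Tuple


def context_stats(ngram: str, corpus: List[str]) -> Tuple[int, Set[str], Set[str]]:
    """Return (frequency, unique left neighbors, unique right neighbors) of ngram."""
    count = 0
    left: Set[str] = set()
    right: Set[str] = set()
    for text in corpus:
        start = text.find(ngram)
        while start != -1:
            count += 1
            if start > 0:
                left.add(text[start - 1])
            end = start + len(ngram)
            if end < len(text):
                right.add(text[end])
            start = text.find(ngram, start + 1)
    return count, left, right


def cd(ngram: str, corpus: List[str]) -> float:
    """Assumed CD form: LC * RC / freq^2, with LC and RC counted as unique neighbors."""
    count, left, right = context_stats(ngram, corpus)
    if count == 0:
        return 0.0
    return len(left) * len(right) / count ** 2


corpus = ["對賓拉登說", "伊賓拉登將", "對賓拉登的"]

print(cd("賓拉登", corpus))  # complete term: varied neighbors on both sides
print(cd("賓拉", corpus))    # fragment: always followed by the same character
```

A fragment such as 賓拉 is always followed by 登, so its right-neighbor count stays at one and its CD score is lower than that of the complete term.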
3.4 Translation Extraction
1. Chi-square method
• Simpler; depends on the co-occurrences of a query term and its translation candidates on the Web.
2. Context-vector method
• Extracts a context vector as a feature from the search-result pages for each term.
3.4 Translation Extraction
1. Chi-square method
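The formula from this slide did not survive extraction. As a rough illustration only, the sketch below uses the standard chi-square statistic for a 2x2 contingency table built from web page counts, on the assumption that a search engine can report how many pages contain the source term, the candidate, and both. The function names and all counts are hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard 2x2 chi-square statistic.

    a: pages containing both the source term and the candidate
    b: pages containing the source term but not the candidate
    c: pages containing the candidate but not the source term
    d: pages containing neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def from_page_counts(pages_s: int, pages_t: int, pages_both: int, total_pages: int) -> float:
    """Build the contingency table from search-engine page counts (assumed inputs)."""
    a = pages_both
    b = pages_s - pages_both
    c = pages_t - pages_both
    d = total_pages - a - b - c
    return chi_square(a, b, c, d)


# Hypothetical counts: the first candidate co-occurs far more often than chance.
print(from_page_counts(pages_s=100, pages_t=80, pages_both=60, total_pages=1000))
print(from_page_counts(pages_s=100, pages_t=80, pages_both=8, total_pages=1000))
```

In the second call the co-occurrence count equals what independence would predict (100 * 80 / 1000 = 8), so the statistic is zero.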
3.4 Translation Extraction
2. Context-vector method
• A query term and its translation tend to share common contextual terms.
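The idea of shared contextual terms can be illustrated with a minimal sketch. It assumes search-result snippets are already tokenized, that both vectors are expressed over the same vocabulary of contextual terms, and that raw co-occurrence counts compared by cosine similarity are an acceptable stand-in for whatever weighting the original slide showed. The snippets and function names are made up for illustration.

```python
import math
from collections import Counter
from typing import Iterable, List


def context_vector(snippets: Iterable[List[str]], term: str) -> Counter:
    """Count the terms that co-occur with `term` in tokenized snippets."""
    vec: Counter = Counter()
    for tokens in snippets:
        if term in tokens:
            vec.update(tok for tok in tokens if tok != term)
    return vec


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tokenized snippets from search-result pages.
source_snippets = [["bin laden", "恐怖", "攻擊"], ["bin laden", "恐怖", "美國"]]
good_snippets = [["賓拉登", "恐怖", "攻擊"], ["賓拉登", "美國", "恐怖"]]
bad_snippets = [["儲藏箱", "箱子", "收藏"]]

v_source = context_vector(source_snippets, "bin laden")
v_good = context_vector(good_snippets, "賓拉登")
v_bad = context_vector(bad_snippets, "儲藏箱")

print(cosine(v_source, v_good))  # high: same contextual terms
print(cosine(v_source, v_bad))   # zero: no shared contextual terms
```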
3.5 The Combined Approach
• Combines the probabilistic inference model with the context-vector and chi-square methods.
• Uses a linear-combination weighting scheme.
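The slide does not give the weighting formula, so the following is only one plausible reading of a linear combination: each method ranks the candidates, and a candidate's combined score is a weighted sum of its reciprocal ranks. The method names, weights, and scores are all hypothetical.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each candidate to its 1-based rank under one method (highest score first)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cand: i + 1 for i, cand in enumerate(ordered)}


def combine(
    method_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted sum of reciprocal ranks; a candidate missing from a method adds nothing."""
    combined: Dict[str, float] = {}
    for method, scores in method_scores.items():
        for cand, rank in ranks(scores).items():
            combined[cand] = combined.get(cand, 0.0) + weights[method] / rank
    return combined


# Hypothetical per-method scores for candidate translations of "bin laden".
method_scores = {
    "chi_square": {"賓拉登": 12.0, "恐怖": 15.0, "箱子": 1.0},
    "context_vector": {"賓拉登": 0.9, "恐怖": 0.4, "箱子": 0.1},
    "anchor_text": {"賓拉登": 0.7, "本拉登": 0.5},
}
weights = {"chi_square": 1 / 3, "context_vector": 1 / 3, "anchor_text": 1 / 3}

combined = combine(method_scores, weights)
best = max(combined, key=combined.get)
print(best, combined[best])
```

Working with ranks rather than raw scores sidesteps the fact that chi-square values and cosine similarities live on very different scales.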
4. Performance Evaluation
• Parallel corpus: the Hong Kong Law parallel text collection
• 238,236 English-Chinese text paragraphs
• Anchor text
• 1,980,816 traditional Chinese web pages collected
• 109,416 pages containing both Chinese and English terms selected
4.1.1 Query Translation
• Retrieve Chinese documents using English queries
4.2 Translation of Web Query Terms
• Web queries were collected from Dreamer and GAIS.
5. Discussion
• Flexibility for query specification
• It is difficult to specify "correct" queries.
• Translation effectiveness
• English-to-Japanese
• English-to-Korean
• The search-result-based approach does not require crawling the Web and downloading web documents.
6. Conclusion
• Unknown queries can be translated using the dynamic Web as the corpus.
Opinion
• "Bin Laden" has 9660 occurrences.
• "Bin Laden" together with "稱拉登" has only 54 (infrequent).
• The method can extract "賓拉登" or "本拉登", but not "稱拉登".
• We can use more keywords to crawl documents:
• 恐怖份子 (terrorists), 恐怖主義 (terrorism), 攻擊美國 (attack on America), 阿富汗 (Afghanistan), ...
• Aim: higher recall
Opinion
• <CIH, 病毒, 掃毒, 防毒> (virus, virus scanning, antivirus)
• <W, AS, AS, AS>
Opinion
• Proper names or geographic names
Opinion
• Query: bin laden
• Automatic translations: 恐怖; 賓拉登; 拉登; 儲藏箱裝載; 奧薩瑪; 恐怖份子; 份子; 攻擊事件; 收藏室裝載; 奧薩馬; 本拉登; 辯論會手稿; 恐怖行動; 下載位址; 箱子裝載; 日星期二; 恐怖大亨; 研究月報; 沙國王; 箱子結滿果實; 美國遭到恐怖份子攻擊; 號行政長官; 收入於箱收藏室裝載
Opinion
• Query: pork
• Automatic translations: 豬肉; 產銷; 協會; 政府補助金; 美國; 福利正宗; 官職等; 共有頁數; 正宗; 招牌燒肉; 湖州粽; 福利正宗湖州; 灰色同緣色個; 福利正宗湖州粽; 美食休閒; 齋佬; 我收唔; 正宗湖州; 正宗湖州粽; 休閒; 遊戲物品; 豬扒; 回覆
Opinion
• Query: 陳水扁
• Automatic translations: 陳水扁; 首度約見; 返台後; 羅文嘉; 競選總部; 總統; 陳水扁網路競; 陳水扁總統; 總統陳水扁; 台獨; 夜光新聞; 夜光; 人民網; 專題; 總統就職; 坏萿; 新華網; 新聞網; 相關專題; 陳水扁主持; 分裂國家; 台灣總統陳水扁; 台灣總統
Opinion
• Query: Gorbachev
• Automatic translations: 戈爾巴喬夫; 蘇聯; 戈巴契夫; 上誼; 臺灣麥克; 詳細資料; 書林; 集團解體之後; 人民鬥垮; 總統戈爾巴喬夫; 網頁; 書名; 名存實亡; 領導人可以學戈巴契夫; 自從蘇聯; 人民鬥垮自己; 蘇聯集團解體之後; 自從蘇聯集團; 出版日
Opinion
• Query: Clinton
• Automatic translations: 柯林頓; 希拉; 克林頓; 幫助; 白宮; 總署柯江; 柯林頓性; 美國總統; 總統; 柯林頓性醜聞; 幫助使用頁; 使用頁; 幫助使用; 幫助目錄; 回函; 體說明; 羅竹茜譯; 說明柯林頓性醜聞笑話; 說明柯林頓性醜聞; 軟體名稱; 很像喔; 憮鴝; 說明柯林頓性; 體說明柯林頓性醜聞; 蝗仵蚺
Opinion
• Query: bread
• Automatic translations: 麵包; 靈糧; 每日; 小站; 吐司; 華納; 食物; 麵包合唱團; 網頁; 歡迎祝福; 撰文; 件物品; 鹽巴; 影音; 巧克力麵包; 象徵著平安富足意; 著平安; 人們拿起; 象徵著平安; 俄羅斯人對賓客無上; 發貼; 麵包沾著鹽巴食用; 就象徵著平安富足
Opinion
• Query: Michael JORDAN
• Automatic translations: 喬丹; 麥可‧喬丹; 世界; 籃球; 網站; 酷的; 喬登; 麥可喬登; 芝加哥公牛隊; 品牌; 精彩; 米高佐敦; 得分王; 邁爾克; 妣舅; 電腦遊戲; 短片分享; 歷屆得分王; 圖片集; 第一家; 討論區; 新幻社區; 短片分享區; 歷屆得分
Opinion
• Query: 餐廳
• Automatic translations: 餐廳; 流行; 岩燒; 悠浮; 主題; 搜尋; 義大利; 主題餐廳; 餐廳介紹; 特約; 料理; 餐廳搜尋; 關鍵字; 中式餐廳; 義大利餐廳; 特約餐廳; 請輸入; 休閒; 餐廳導覽; 導覽; 餐廳小吃; 藥膳; 台北市