
Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Jiangfeng Gao Jian Zhang

國立雲林科技大學 National Yunlin University of Science and Technology. On the Use of Words and N-grams for Chinese Information Retrieval. Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Jiangfeng Gao Jian Zhang Ming Zhou.


Presentation Transcript


  1. 國立雲林科技大學 National Yunlin University of Science and Technology • On the Use of Words and N-grams for Chinese Information Retrieval • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Jiangfeng Gao, Jian Zhang, Ming Zhou • November 2000, Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages

  2. Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Chinese Segmentation • Words, characters, longest-matching algorithm, full-segmentation, n-grams, bi-grams, uni-grams, TFIDF • Experiments • Conclusions • Opinion

  3. Motivation • Both words and n-grams have been used as indexing units for Chinese IR • The paper experiments with different ways of using each, and with combining words with n-grams • How much does the accuracy of word segmentation matter? • Is it worthwhile to combine words with n-grams? Factors to weigh: time, space, retrieval performance, unknown words

  4. Objective • Present results on the relationship between word segmentation, n-grams, and the performance of Chinese IR • Find a good way to index Chinese texts

  5. 1-1. Introduction • Chinese text has no spaces between words, so sentences have to be segmented into shorter units that can be indexed • Example: “美國派團到伊拉克蒐集海珊犯罪證據,除了國際恐怖集團…” • Sentence to segment: “美國派團到伊拉克蒐集海珊犯罪證據” (“The United States sends a team to Iraq to collect evidence of Hussein's crimes”) • Segmented terms: 「美國」「伊拉克」「蒐集」「海珊」「犯罪」「證據」

  6. 1-2. Introduction • Topics covered: words, characters, the longest-matching algorithm, full segmentation, n-grams, bi-grams, uni-grams, TFIDF • Combining words with n-grams

  7. 2. Chinese Segmentation • A continuous character string can be segmented into shorter units in two ways: • Using n-grams • Using words

  8. 2-1-1. Segmentation: N-grams • The string ABCD can be segmented into: • Uni-grams: A, B, C, D • Bi-grams: AB, BC, CD • The cost of indexing in IR is much higher, because many more possible units have to be considered

  9. 2-1-2. Segmentation: N-grams • Bi-grams can successfully cover most words: the average length of words in usage is 1.59 characters • Example: “美國派團到伊拉克蒐集海珊犯罪證據” yields the bi-grams 美國, 國派, 派團, …, 蒐集, 集海, 海珊
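The uni-gram and bi-gram segmentation described on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The slide's abstract example.
print(char_ngrams("ABCD", 1))  # ['A', 'B', 'C', 'D']
print(char_ngrams("ABCD", 2))  # ['AB', 'BC', 'CD']

# The first bi-grams of the slide's Chinese example.
print(char_ngrams("美國派團", 2))  # ['美國', '國派', '派團']
```

A string of k characters produces k - 1 bi-grams, most of which are not real words, which is one reason the indexing cost is higher than with word-based segmentation.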

  10. 2-2-1. Segmentation: Words • Segmenting Chinese sentences into words requires linguistic knowledge, which can come from: 1. A dictionary that stores a set of known words, applied e.g. to “美國派團到伊拉克蒐集海珊犯罪證據” 2. Heuristic rules on word formation 3. Statistical measures

  11. 2-2-2. Segmentation: Words • Statistical measures are based on co-occurrences of characters • Example: 一套功能強大之資料探勘元件,能幫助軟體開發者依企業之商業智慧需求,快速地開發出具有強大之資料探勘功能的應用軟體… • “資料探勘” (data mining) occurs frequently, which suggests that it is a word
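As a rough illustration of the co-occurrence idea, the sketch below counts how often each 4-character string occurs in the slide's example sentence. Raw frequency counting is an assumption made here for simplicity; the slide does not say which statistical measure is used, and the name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


text = ("一套功能強大之資料探勘元件,能幫助軟體開發者依企業之商業智慧需求,"
        "快速地開發出具有強大之資料探勘功能的應用軟體")

counts = ngram_counts(text, 4)
print(counts["資料探勘"])  # 2: the candidate word repeats
print(counts["探勘元件"])  # 1: a chance neighbour does not
```

Frequency alone does not settle word boundaries: in this sentence the longer string 強大之資料探勘 also occurs twice, so a practical measure has to compare a candidate against its neighbours rather than rely on raw counts.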

  12. 2-2-3. Segmentation: Words • The approaches may be combined in different ways: • A dictionary combined with heuristic rules • A statistical approach combined with heuristic rules

  13. 2-3. Segmentation • No single approach has been shown to be clearly superior to the others • Most approaches achieve an accuracy of over 90% • A few segmentation errors would not have a significant impact, unless they concern critical words • Example with critical words such as 口袋怪獸 (Pokémon) and 皮卡丘 (Pikachu): 口袋怪獸在全世界魅力無法擋,不但任天堂靠皮卡丘賺進大把鈔票外,連獲得任天堂獨家授權獲得口袋怪獸肖像權的美國公司Forkids Enterinment股票也是連番看漲

  14. 2-4. Segmentation • Two kinds of segmentation ambiguity: 1. Combinatory ambiguity: a string AB (A and B being strings or characters) may be considered a single word, or be separated into A and B, e.g. 「操作系統」 (operating system) => 「操作」「系統」 2. Overlapping ambiguity: a string ABC may be segmented as AB C or as A BC, e.g. 「…操作系統整合…」 => 「操作系統」 or 「系統整合」 (system integration)

  15. 2-5-1. Segmentation: Longest-matching algorithm • When two words A and B combine into AB, and AB is stored in the dictionary, AB is selected • Preferring the longer word is well accepted • “操作系統” is a compound word composed of the shorter words “操作” and “系統”

  16. 2-5-2. Segmentation: Longest-matching algorithm • Applying the longest-matching algorithm: 1. Combinatory ambiguity AB: 「操作」「系統」 => 「操作系統」 2. Overlapping ambiguity ABC: forward matching selects AB C => 「操作系統」; backward matching selects A BC => 「系統整合」
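The forward and backward variants of longest matching can be sketched as follows. This is a minimal illustration under assumptions: the toy dictionary, the `max_len` cap of 4 characters, and the rule that an unmatched character becomes a one-character token are all choices made for this example, not details taken from the slides.

```python
def forward_longest_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_longest_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


# Toy dictionary (an assumption for this example).
dictionary = {"操作", "系統", "整合", "操作系統", "系統整合"}

print(forward_longest_match("操作系統", dictionary))       # ['操作系統']
print(forward_longest_match("操作系統整合", dictionary))   # ['操作系統', '整合']
print(backward_longest_match("操作系統整合", dictionary))  # ['操作', '系統整合']
```

The two directions resolve the overlapping ambiguity differently, which matches the slide: forward matching keeps 操作系統, backward matching keeps 系統整合.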

  17. 2-6. Segmentation: Suffix structures • Some structures are recognized using heuristic rules: • Date expressions (e.g. 一九九八年, “the year 1998”) • Suffix structures (e.g. 使用者, “user”) • Dictionary-based segmentation is often complemented by a set of heuristic rules

  18. 3-1. Impacts of segmentation • The longest-matching approach yields words with more precise meanings • But if a long word contains several short words, only the long word is identified • “操作系統” will be identified; “操作” and “系統” will be ignored

  19. 3-2-1. Impacts of segmentation • 1. Full segmentation: extract the short words contained within long words • a. Consider the sentence as a string • b. Extract every word that appears at the beginning of the string • c. Remove the first character of the string • d. Repeat until the string is empty

  20. 3-2-2. Impacts of segmentation • 「操作系統」 => 「操」「操作」「操作系」 • 「作系統」 => 「作」「作系」「作系統」 • 「系統」 => 「系」「系統」 • 「統」 => 「統」 • Empty string: stop • How can 「操作系統」, 「操作」 and 「系統」 all be produced? • Combine full segmentation with the longest-matching algorithm
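Steps a to d of full segmentation can be sketched as below. The assumption here is that a prefix counts as a "word" only when it is found in a dictionary; the toy dictionary and the `max_len` cap are illustrative choices, not values from the slides.

```python
def full_segmentation(text, dictionary, max_len=4):
    """Extract every dictionary word that starts at any position of `text`."""
    words = []
    while text:                                   # (d) repeat until the string is empty
        for size in range(1, min(max_len, len(text)) + 1):
            prefix = text[:size]                  # (b) candidate word at the beginning
            if prefix in dictionary:
                words.append(prefix)
        text = text[1:]                           # (c) remove the first character
    return words


# Toy dictionary (an assumption for this example).
dictionary = {"操作", "系統", "操作系統"}
print(full_segmentation("操作系統", dictionary))  # ['操作', '操作系統', '系統']
```

With this dictionary the long word and both short words inside it are all extracted, which is the result the slide asks for.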

  21. 3-3. Impacts of segmentation • 2. Combine longest words with characters (uni-grams) • A reasonable compromise between precision and recall • [PIRCS] (ad hoc + occurrence + threshold) • Example: 口袋怪獸在全世界魅力無法擋 • Occurrence counts: 口: 1000 times; 口袋: 900 times; 口袋怪: 850 times; 口袋怪獸: 830 times; 口袋怪獸在: 5 times

  22. 3-4-1. Impacts of segmentation • 3. Combine bi-grams with uni-grams (characters) • Some single characters are completely meaningful alone (e.g. 造), yet with bi-grams only they are forced to combine with another character: 造紙, 造船, 造車, 造物 • The combination can also account for unknown words

  23. 3-4-2. Impacts of segmentation • Unknown word: 大海灣 (not in the dictionary) • Using uni-grams: 大, 海, 灣 • Using bi-grams: 大海, 海灣 (both bi-grams occur in the same document) • Occurrence counts: 大: 1000 times; 海: 1000 times; 灣: 1000 times; 大海: 1000 times; 海灣: 1000 times
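The mixed uni-gram and bi-gram indexing of an unknown word can be sketched as follows. The function name `index_terms` and the ordering of the output (characters first, then bi-grams) are choices made for this example.

```python
def index_terms(text: str) -> list[str]:
    """Index `text` by its characters (uni-grams) followed by its overlapping bi-grams."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams


print(index_terms("大海灣"))  # ['大', '海', '灣', '大海', '海灣']
```

No dictionary is consulted, so a word that no dictionary contains is still represented by terms that a query for the same word would also produce.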

  24. 3-5. Impacts of segmentation • 4. Combine words and bi-grams • Words rely on linguistic knowledge • Bi-grams rely on statistical information • Example: “貪玩的皮卡丘跑到大城市去逛街”

  25. 3-6. Impacts of segmentation • The goal is to create a representation of a text or query from: • Keywords • Compound terms • Certain combinations of these

  26. 4-1. Experiment settings • Data: TREC Chinese corpus • Documents: • People's Daily from 1991 to 1993 • Part of the news released by the Xinhua News Agency in 1994 and 1995 • A set of 54 queries was set up and evaluated by people at NIST

  27. 4-2-1. Experiment settings • The indexing result for a document is a vector of weights: $D_i = (d_{i1}, d_{i2}, \ldots, d_{im})$ • where $d_{ik}$ ($1 \le k \le m$) is the weight of the term $t_k$ in the document $D_i$, and $m$ is the size of the vector space

  28. 4-2-2. Experiment settings • With TFIDF weighting, many meaningless bi-grams can be identified • Example: “智慧型行銷資料探勘分析系統- UniMarketing™ 產品 … 探宇科技在資料探勘分析預測具有大的專業經驗,已涵蓋於金融、零售、醫療、製造等領域,所建立之資料探勘模型,經實際導入後成效是十分大的…”

  29. 4-4-3. Chinese IR using n-grams • idf(探勘) = log2(4/1) = 2 • idf(大的) = log2(4/4) = 0 • Term frequencies: D1: 探勘 20, 大的 20; D2: 探勘 0, 大的 35; D3: 探勘 0, 大的 5; D4: 探勘 0, 大的 21
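The idf values on this slide can be reproduced with the short sketch below. Two assumptions are made: the logarithm is base 2 (the only base for which log(4/1) = 2), and the weight is the plain product tf × idf, which is one common TFIDF variant and not necessarily the one used in the paper.

```python
import math

# Term frequencies from the slide: four documents, two bi-grams.
docs = {
    "D1": {"探勘": 20, "大的": 20},
    "D2": {"探勘": 0, "大的": 35},
    "D3": {"探勘": 0, "大的": 5},
    "D4": {"探勘": 0, "大的": 21},
}


def idf(term: str) -> float:
    """log2(N / df), where df is the number of documents containing `term`."""
    df = sum(1 for tf in docs.values() if tf.get(term, 0) > 0)
    return math.log2(len(docs) / df)


def tfidf(term: str, doc_id: str) -> float:
    """Plain tf * idf weight (an assumed weighting scheme)."""
    return docs[doc_id].get(term, 0) * idf(term)


print(idf("探勘"), idf("大的"))                  # 2.0 0.0
print(tfidf("探勘", "D1"), tfidf("大的", "D1"))  # 40.0 0.0
```

大的 appears in every document, so its idf is 0 and it contributes nothing to any document vector, however high its term frequency is.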

  30. 4-3. Experiment settings • The indexing result for a query is also a vector of weights • The similarity between the document $D_i$ and the query vector is then calculated
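The transcript does not preserve the similarity formula. As an illustration only, the sketch below uses cosine similarity, a common choice in vector-space retrieval; treating it as the measure used here is an assumption, and the sparse-dictionary representation and example weights are made up.

```python
import math


def cosine_similarity(doc_vec: dict, query_vec: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = doc_vec.keys() & query_vec.keys()
    dot = sum(doc_vec[t] * query_vec[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


# Hypothetical weights, for illustration only.
document = {"資料": 3.0, "探勘": 4.0}
query = {"探勘": 1.0}
print(cosine_similarity(document, query))  # 0.8
```

Normalizing by vector length keeps long documents from scoring higher merely because they contain more terms.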

  31. 5. Experiments • 1. Longest matching with two dictionaries • 2. Combining characters (uni-grams) with longest matching • 3. Full segmentation, with or without adding characters • 4. Using bi-grams and characters • 5. Combining words with bi-grams and characters • 6. Adding unknown word detection

  32. 5-1. Longest matching with two dictionaries • Results: • Small dictionary (65,502 entries): average precision of 0.3797 • Large dictionary (220K entries): average precision of 0.3907 • Example: 「作業系統」 • Dictionary size is not the only factor

  33. 5-2. Single characters with longest words • Motivation: short words included in long words are ignored by longest matching • Results: • Small dictionary: 0.4058 (improvement of 6.9%) • Large dictionary: 0.4290 (improvement of 9.8%) • Example: 「作業系統」, 「作業」, 「系統」 • Adding characters is a more effective way to improve performance than increasing the dictionary size
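The improvement percentages follow from the average precision figures on this slide and the previous one, read as relative gains over longest matching alone. A quick check (the helper name is made up):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a percentage."""
    return (improved - baseline) / baseline * 100


small = relative_improvement(0.3797, 0.4058)  # small dictionary
large = relative_improvement(0.3907, 0.4290)  # large dictionary
print(f"{small:.1f}% {large:.1f}%")  # 6.9% 9.8%
```

Adding characters to the small dictionary (0.4058) already beats the large dictionary without characters (0.3907), which supports the slide's closing point.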

  34. 5-3. Full segmentation • Extracts the short words implied in long words • Experiment run with the large dictionary • Result: 0.4090 (higher than longest matching alone, but lower than characters with longest matching) • High recall but low precision • Examples: 「作業系統」, 「作業」, 「業系」, 「系統」; 「意外事故」, 「意外」, 「外事」, 「事故」 • Combining with single characters increases the result to 0.4117

  35. 5-4-1. Chinese IR using n-grams • Bi-grams are combined with uni-grams • Average precision is 0.4254 • Many bi-grams are meaningless, especially those containing function characters (e.g. 的): 「大的」, 「小的」, 「好的」, 「你的」… • Example: 智慧型行銷資料探勘分析系統- UniMarketing™ 產品 … 探宇科技在資料探勘分析預測具有大的專業經驗,已涵蓋於金融、零售、醫療、製造等領域,所建立之資料探勘模型,經實際導入後成效是十分大的…

  36. 5-4-2. Chinese IR using n-grams • With TFIDF weighting, many meaningless bi-grams can be identified • Term frequencies: D1: 探勘 20, 大的 20; D2: 探勘 0, 大的 35; D3: 探勘 0, 大的 5; D4: 探勘 0, 大的 21

  37. 5-4-3. Chinese IR using n-grams • Disadvantages of bi-grams compared with words: • Indexing time (from 2 hours to more than 5 hours) • Disk space • Larger documents

  38. 5-5. Combining words with n-grams • 1. Combining (longest matching + characters) and (bi-grams + characters): average precision is 0.4260 • 2. Combining (full segmentation) and (bi-grams + characters): average precision is 0.4400

  39. 5-6-1. Impact of unknown word detection • An unknown word is a word that is not stored in the dictionary • 「皮納圖博火山」 (Mount Pinatubo) was segmented as 「皮」「納」「圖」「博」「火」「山」 • 「蜂窩式」 (cellular) was segmented as 「蜂窩」「式」 • NLPWin, an NLP analyzer developed at Microsoft, is used to recognize such unknown words

  40. 5-6-2. Impact of unknown word detection

  41. 5-7-1. Summary

  42. 6. Conclusion • For the best retrieval performance: combine words and n-grams • When time and space matter: use words (and characters) • When unknown words exist: use longest matching with single characters • There is no effective means to translate English words into Chinese bi-grams

  43. Opinion • Information Retrieval
