國立雲林科技大學 National Yunlin University of Science and Technology
On the Use of Words and N-grams for Chinese Information Retrieval
• Advisor: Dr. Hsu
• Graduate student: Chien-Shing Chen
• Authors: Jianfeng Gao, Jian Zhang, Ming Zhou
• November 2000, Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages
Outline
• Motivation
• Objective
• Introduction
• Chinese Segmentation: words, characters, longest-matching algorithm, full segmentation, n-grams, bi-grams, uni-grams, TFIDF
• Experiments
• Conclusions
• Opinion
Motivation
• Both words and n-grams have been used to index Chinese text
• Experiments compare the different indexing units and combine words with n-grams
• How much does the accuracy of word segmentation matter for retrieval?
• Is it worthwhile to combine words with n-grams, considering time and space costs and unknown words?
Objective
• Obtain results on the relationship between word segmentation and retrieval performance
• Obtain results on n-grams and the performance of Chinese IR
• Find a good way to index Chinese texts
1-1. Introduction
• Chinese text has no word delimiters, so segmentation has to be done to break sentences into shorter units that can be indexed
• Instance: “美國派團到伊拉克蒐集海珊犯罪證據，除了國際恐怖集團…”
• Segmented sentence: “美國派團到伊拉克蒐集海珊犯罪證據”
• Segmented terms: 「美國」「伊拉克」「蒐集」「海珊」「犯罪」「證據」
1-2. Introduction
• Indexing units examined: words, characters, longest-matching algorithm, full segmentation, n-grams, bi-grams, uni-grams, TFIDF
• Also examined: combining words with n-grams
2. Chinese Segmentation
• Segmenting a continuous character string into shorter units:
• Using n-grams
• Using words
2-1-1. Segmentation: N-grams
• For the string ABCD:
  Uni-grams: A B C D
  Bi-grams: AB BC CD
• The indexing cost in IR is much higher, because many more possible units have to be considered
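A minimal sketch of the character n-gram extraction described above; the example sentence comes from the slides, while the function name and everything else is illustrative:

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "美國派團到伊拉克蒐集海珊犯罪證據"
print(char_ngrams(sentence, 1))  # uni-grams: 美 國 派 團 ...
print(char_ngrams(sentence, 2))  # bi-grams: 美國 國派 派團 ...
```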
2-1-2. Segmentation: N-grams
• Bi-grams successfully cover most words, since the average length of Chinese words in actual usage is 1.59 characters
• “美國派團到伊拉克蒐集海珊犯罪證據” → 美國 國派 派團 … 蒐集 集海 海珊
2-2-1. Segmentation: Words
• The segmentation of Chinese sentences into words requires linguistic knowledge, which may come from:
  1. A dictionary storing a set of known words, e.g. for “美國派團到伊拉克蒐集海珊犯罪證據”
  2. Heuristic rules on word formation
  3. Statistical measures
2-2-2. Segmentation: Words
• Statistical measures are based on co-occurrences of characters
• Instance: 一套功能強大之資料探勘元件，能幫助軟體開發者依企業之商業智慧需求，快速地開發出具有強大之資料探勘功能的應用軟體…
• “資料探勘” occurs frequently, so it can be recognized as a word
2-2-3. Segmentation: Words
• The knowledge sources may be combined in different ways:
• A dictionary combined with heuristic rules
• A statistical approach combined with heuristic rules
2-3. Segmentation
• No single approach has been shown to be clearly superior to the others
• Most approaches can achieve an accuracy of over 90%
• A few segmentation errors would not have a significant impact, but errors on critical words are a concern:
• 口袋怪獸在全世界魅力無法擋，不但任天堂靠皮卡丘賺進大把鈔票外，連獲得任天堂獨家授權獲得口袋怪獸肖像權的美國公司Forkids Enterinment股票也是連番看漲
2-4. Segmentation
• Two kinds of segmentation ambiguities:
  1. Combinatory ambiguity: a string AB (where A and B are words or characters) may be considered a single word, or separated into A and B
     「操作系統」=>「操作」「系統」
  2. Overlapping ambiguity: a string ABC may be segmented as AB C or as A BC
     「…操作系統整合…」=>「操作系統」/「系統整合」
2-5-1. Segmentation: Longest matching algorithm
• Given two words A and B whose combination AB is also stored in the dictionary, the longer word AB is preferred
• The longer word is generally well accepted: “操作系統” is a compound word composed of shorter words
2-5-2. Segmentation: Longest matching algorithm
• Using the longest matching algorithm:
  1. Combinatory ambiguity: 「操作」「系統」=>「操作系統」
  2. Overlapping ambiguity:
     forward matching: AB C => 「操作系統」
     backward matching: A BC => 「系統整合」
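A minimal sketch of forward (greedy) longest matching against a set-based dictionary; the function name, maximum word length, and toy dictionary are assumptions, not the paper's implementation:

```python
def forward_longest_match(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

dictionary = {"操作", "系統", "操作系統", "整合", "系統整合"}
print(forward_longest_match("操作系統整合", dictionary))
# ['操作系統', '整合'] -- forward matching prefers 操作系統;
# backward matching would instead scan from the end and prefer 系統整合.
```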
2-6. Segmentation: Suffix structures
• Some units are recognized using heuristic rules:
• Date expressions (e.g. 一九九八年)
• Suffix structures (e.g. 使用者)
• Dictionary-based segmentation is often complemented by a set of such heuristic rules
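As an illustration only, one heuristic rule of this kind could be a pattern for date expressions; the regular expression below is a guess at such a rule, not the rule set actually used in the paper:

```python
import re

# Illustrative heuristic: a run of Chinese numerals followed by 年 (year)
DATE_PATTERN = re.compile(r"[〇一二三四五六七八九十]+年")

print(DATE_PATTERN.findall("一九九八年的報告"))  # ['一九九八年']
```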
3-1. Impacts of segmentation
• Words found by the longest matching approach have more precise meanings
• But if a long word contains several short words, only the long word will be identified
• “操作系統” will be identified; “操作” and “系統” will be ignored
3-2-1. Impacts of segmentation
1. Full segmentation: extract the short words contained within long words
   a. Consider the sentence as a string
   b. Extract every word that appears at the beginning of the string
   c. Remove the first character of the string
   d. Repeat until the string is empty
3-2-2. Impacts of segmentation
• Full segmentation of 操 作 系 統:
  「操作系統」=>「操」「操作」「操作系」
  「作系統」=>「作」「作系」「作系統」
  「系統」=>「系」「系統」
  「統」=>「統」
  (empty)
• How to produce exactly 「操作系統」「操作」「系統」?
• Combine full segmentation with the longest matching algorithm (a sketch follows below)
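A minimal sketch of full segmentation restricted to dictionary words, which with the toy dictionary below yields exactly 「操作系統」「操作」「系統」; the function name and the dictionary are illustrative:

```python
def full_segmentation(text, dictionary):
    """Extract every dictionary word starting at the beginning of the
    string, then drop the first character and repeat."""
    found = []
    while text:
        for end in range(1, len(text) + 1):
            prefix = text[:end]
            if prefix in dictionary:
                found.append(prefix)
        text = text[1:]  # remove the first character and repeat
    return found

dictionary = {"操作", "系統", "操作系統"}
print(full_segmentation("操作系統", dictionary))
# ['操作', '操作系統', '系統']
```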
3-3. Impacts of segmentation
2. Combine the longest words with single characters (uni-grams)
• A reasonable compromise between precision and recall [PIRCS] (ad-hoc segmentation based on occurrence counts and a threshold; a sketch follows below)
• 口 袋 怪 獸 在 全 世 界 魅 力 無 法 擋
  口: 1000 times
  口袋: 900 times
  口袋怪: 850 times
  口袋怪獸: 830 times
  口袋怪獸在: 5 times
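The PIRCS-style procedure is only hinted at on the slide; the sketch below illustrates the general idea of growing a unit while its occurrence count stays above a threshold. The function name, threshold value, and frequency table are assumptions:

```python
def longest_frequent_unit(text, start, freq, max_len=5, threshold=100):
    """At a given position, extend the unit while its corpus frequency
    stays above a threshold; keep the longest such unit."""
    best = text[start]  # single-character fallback
    for end in range(start + 2, min(start + max_len, len(text)) + 1):
        candidate = text[start:end]
        if freq.get(candidate, 0) >= threshold:
            best = candidate
        else:
            break
    return best

freq = {"口": 1000, "口袋": 900, "口袋怪": 850, "口袋怪獸": 830, "口袋怪獸在": 5}
print(longest_frequent_unit("口袋怪獸在全世界魅力無法擋", 0, freq))
# 口袋怪獸 -- 口袋怪獸在 falls below the threshold
```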
3-4-1. Impacts of segmentation
3. Combine bi-grams with uni-grams (characters)
• Some single characters are completely meaningful alone (e.g. 造) but are forced into combination with a neighbouring character
• Such combinations may also cover unknown words: 造紙 造船 造車 造物
3-4-2. Impacts of segmentation
• Unknown word: 大海灣 (does not exist in the dictionary)
• 大, 海, 灣 (using uni-grams)
• 大海, 海灣 (using bi-grams; both bi-grams occur in the same document)
  大: 1000 times, 海: 1000 times, 灣: 1000 times
  大海: 1000 times, 海灣: 1000 times
3-5. Impacts of segmentation
4. Combine words and bi-grams
• Words: rely on linguistic knowledge
• Bi-grams: rely on statistical information
• “貪玩的皮卡丘跑到大城市去逛街”
3-6. Impacts of segmentation
• The goal is to create a representation of a text or query from:
• Keywords
• Compound terms
• Certain combinations of the two
4-1. Experiment settings
• Data: TREC Chinese corpus
• Documents:
  People's Daily from 1991 to 1993
  A part of the news released by the Xinhua News Agency in 1994 and 1995
• A set of 54 queries has been set up and evaluated by people at NIST
4-2-1. Experiment settings
• The indexing result for a document is a vector of weights:
  D_i = (d_i1, d_i2, …, d_im)
• where d_ik (1 ≤ k ≤ m) is the weight of the term t_k in the document D_i, and m is the size of the vector space
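The slide does not show the weighting formula itself; one common TFIDF form, given here only as an assumed example rather than the paper's exact formula, is:

```latex
d_{ik} = tf_{ik} \cdot \log\frac{N}{n_k}
```

where tf_ik is the frequency of term t_k in document D_i, N is the number of documents in the collection, and n_k is the number of documents containing t_k.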
4-2-2. Experiment settings
• Using TFIDF, many meaningless bi-grams can be identified and given low weight
• “智慧型行銷資料探勘分析系統- UniMarketing™ 產品…探宇科技在資料探勘分析預測具有大的專業經驗，已涵蓋於金融、零售、醫療、製造等領域，所建立之資料探勘模型，經實際導入後成效是十分大的…”
4-4-3. Chinese IR using n-grams
• idf(探勘) = log(4/1) = 2
• idf(大的) = log(4/4) = 0
  D1: 探勘 20, 大的 20
  D2: 探勘 0, 大的 35
  D3: 探勘 0, 大的 5
  D4: 探勘 0, 大的 21
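A quick check of the idf arithmetic above, assuming a base-2 logarithm and the four-document toy collection from the table:

```python
import math

N = 4  # number of documents in the toy collection
df = {"探勘": 1, "大的": 4}  # document frequencies from the table above

for term, n_k in df.items():
    idf = math.log2(N / n_k)
    print(term, idf)  # 探勘 2.0, 大的 0.0
```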
4-3. Experiment settings
• The indexing result for a query is likewise a vector of weights:
  Q_j = (q_j1, q_j2, …, q_jm)
• The similarity between D_i and the query is then calculated
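The similarity measure is not spelled out on the slide; a common vector-space choice is cosine similarity over the sparse term-weight vectors, sketched here with hypothetical names:

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine similarity between two sparse term-weight vectors."""
    shared = set(doc_weights) & set(query_weights)
    dot = sum(doc_weights[t] * query_weights[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

doc = {"探勘": 2.0, "資料": 1.5}
query = {"資料": 1.0, "探勘": 1.0}
print(cosine_similarity(doc, query))
```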
5. Experiments
1. Longest matching with two dictionaries
2. Combining characters (uni-grams) with longest matching
3. Full segmentation, with or without adding characters
4. Using bi-grams and characters
5. Combining words with bi-grams and characters
6. Adding unknown word detection
5-1. Longest matching with two dictionaries
• Results:
  Small dictionary (65,502 entries): average precision of 0.3797
  Large dictionary (about 220K entries): average precision of 0.3907
• Instance: 「作業系統」
• Dictionary size is not the only factor
5-2. Single characters with longest words
• Motivation: short words included in long words are ignored by longest matching alone
• Results:
  Small dictionary: 0.4058 (improvement of 6.9%)
  Large dictionary: 0.4290 (improvement of 9.8%)
• Instance: 「作業系統」,「作業」,「系統」
• Adding characters is a more effective way to improve performance than increasing the dictionary size
5-3. Full segmentation
• Extract the short words implied in long words; experiment with the large dictionary
• Result: 0.4090 (higher than longest matching alone, but lower than characters with longest matching)
• High recall but low precision
• Instance: 「作業系統」,「作業」,「業系」,「系統」;「意外事故」,「意外」,「外事」,「事故」
• Combining with single characters increases the result to 0.4117
5-4-1. Chinese IR using n-grams
• Bi-grams are combined with uni-grams
• Average precision is 0.4254
• Many bi-grams are meaningless, especially bi-grams containing functional characters (e.g. 的): 「大的」,「小的」,「好的」,「你的」…
• 智慧型行銷資料探勘分析系統- UniMarketing™ 產品…探宇科技在資料探勘分析預測具有大的專業經驗，已涵蓋於金融、零售、醫療、製造等領域，所建立之資料探勘模型，經實際導入後成效是十分大的…
5-4-2. Chinese IR using n-grams
• Using TFIDF, many meaningless bi-grams can be identified and given low weight:
  D1: 探勘 20, 大的 20
  D2: 探勘 0, 大的 35
  D3: 探勘 0, 大的 5
  D4: 探勘 0, 大的 21
5-4-3. Chinese IR using n-grams
• The disadvantages of bi-grams with respect to words:
  Indexing time (from 2 hours to more than 5 hours)
  Disc space
  Larger document representations
5-5. Combining words with n-grams
1. Combining (longest matching + characters) with (bi-grams + characters): average precision is 0.4260
2. Combining (full segmentation) with (bi-grams + characters): average precision is 0.4400
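The slide does not say how the two index representations are combined; one plausible reading is to merge the two term-weight vectors for each document (another is to interpolate the two retrieval scores). A hedged sketch of the vector-merging variant, with all names and the weighting parameter assumed:

```python
def merge_indexes(word_index, ngram_index, word_weight=0.5):
    """Merge two sparse term-weight vectors for the same document,
    linearly weighting the word-based and n-gram-based terms."""
    merged = {}
    for term, w in word_index.items():
        merged[term] = merged.get(term, 0.0) + word_weight * w
    for term, w in ngram_index.items():
        merged[term] = merged.get(term, 0.0) + (1.0 - word_weight) * w
    return merged

word_index = {"操作系統": 2.0, "整合": 1.0}
ngram_index = {"操作": 1.0, "作系": 0.5, "系統": 1.0}
print(merge_indexes(word_index, ngram_index))
```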
5-6-1. Impact of unknown word detection
• An unknown word is a word not stored in the dictionary
• 「皮納圖博火山」has been segmented as 「皮」「納」「圖」「博」「火」「山」
• 「蜂窩式」has been segmented as 「蜂窩」「式」
• NLPWin, an NLP analyzer developed at Microsoft, is used to recognize such unknown words
5-6-2. Impact of unknown word detection
5-7-1. Summary
6. Conclusion
• For the best retrieval performance: combine words and n-grams
• When time and space are a concern: use words (and characters)
• When many unknown words exist: use longest matching with single characters
• When English queries must be translated, words remain necessary, since there is no effective means to translate English words into Chinese bi-grams
Opinion
• Information Retrieval