國立雲林科技大學 National Yunlin University of Science and Technology
On the Use of Words and N-grams for Chinese Information Retrieval
• Advisor: Dr. Hsu
• Graduate student: Chien-Shing Chen
• Authors: Jianfeng Gao, Jian Zhang, Ming Zhou
• November 2000, Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages
Outline
• Motivation
• Objective
• Introduction
• Chinese Segmentation: words, characters, longest-matching algorithm, full segmentation, n-grams, bi-grams, uni-grams, TFIDF
• Experiments
• Conclusions
• Opinion
Motivation
• Both words and n-grams have been used to index Chinese text
• Experiments compare the different indexing units and combine words with n-grams
• How much does the accuracy of word segmentation matter for retrieval?
• Is it worthwhile to combine words with n-grams, considering time and space costs and unknown words?
Objective
• Obtain results on the relationship between word segmentation and retrieval performance
• Obtain results on n-grams and the performance of Chinese IR
• Find a good way to index Chinese texts
1-1. Introduction
• Chinese text has no word delimiters, so segmentation has to be done to break sentences into shorter units that can be indexed
• Instance: “美國派團到伊拉克蒐集海珊犯罪證據，除了國際恐怖集團…”
• Segmented sentence: “美國派團到伊拉克蒐集海珊犯罪證據”
• Segmented terms: 「美國」「伊拉克」「蒐集」「海珊」「犯罪」「證據」
1-2. Introduction
• Indexing units examined: words, characters, longest-matching algorithm, full segmentation, n-grams, bi-grams, uni-grams, TFIDF
• Also examined: combining words with n-grams
2. Chinese Segmentation
• Segmenting a continuous character string into shorter units:
• Using n-grams
• Using words
2-1-1. Segmentation: N-grams
• For the string ABCD:
  Uni-grams: A B C D
  Bi-grams: AB BC CD
• The indexing cost in IR is much higher, because many more possible units have to be considered
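A minimal sketch of the character n-gram extraction described above; the example sentence comes from the slides, while the function name and everything else is illustrative:

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "美國派團到伊拉克蒐集海珊犯罪證據"
print(char_ngrams(sentence, 1))  # uni-grams: 美 國 派 團 ...
print(char_ngrams(sentence, 2))  # bi-grams: 美國 國派 派團 ...
```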
2-1-2. Segmentation: N-grams
• Bi-grams successfully cover most words, since the average length of Chinese words in actual usage is 1.59 characters
• “美國派團到伊拉克蒐集海珊犯罪證據” → 美國 國派 派團 … 蒐集 集海 海珊
2-2-1. Segmentation: Words
• The segmentation of Chinese sentences into words requires linguistic knowledge, which may come from:
  1. A dictionary storing a set of known words, e.g. for “美國派團到伊拉克蒐集海珊犯罪證據”
  2. Heuristic rules on word formation
  3. Statistical measures
2-2-2. Segmentation: Words
• Statistical measures are based on co-occurrences of characters
• Instance: 一套功能強大之資料探勘元件，能幫助軟體開發者依企業之商業智慧需求，快速地開發出具有強大之資料探勘功能的應用軟體…
• “資料探勘” occurs frequently, so it can be recognized as a word
2-2-3. Segmentation: Words
• The knowledge sources may be combined in different ways:
• A dictionary combined with heuristic rules
• A statistical approach combined with heuristic rules
2-3. Segmentation
• No single approach has been shown to be clearly superior to the others
• Most approaches can achieve an accuracy of over 90%
• A few segmentation errors would not have a significant impact, but errors on critical words are a concern:
• 口袋怪獸在全世界魅力無法擋，不但任天堂靠皮卡丘賺進大把鈔票外，連獲得任天堂獨家授權獲得口袋怪獸肖像權的美國公司Forkids Enterinment股票也是連番看漲
2-4. Segmentation
• Two kinds of segmentation ambiguities:
  1. Combinatory ambiguity: a string AB (where A and B are words or characters) may be considered a single word, or separated into A and B
     「操作系統」=>「操作」「系統」
  2. Overlapping ambiguity: a string ABC may be segmented as AB C or as A BC
     「…操作系統整合…」=>「操作系統」/「系統整合」
2-5-1. Segmentation: Longest matching algorithm
• Given two words A and B whose combination AB is also stored in the dictionary, the longer word AB is preferred
• The longer word is generally well accepted: “操作系統” is a compound word composed of shorter words
2-5-2. Segmentation: Longest matching algorithm
• Using the longest matching algorithm:
  1. Combinatory ambiguity: 「操作」「系統」=>「操作系統」
  2. Overlapping ambiguity:
     forward matching: AB C => 「操作系統」
     backward matching: A BC => 「系統整合」
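A minimal sketch of forward (greedy) longest matching against a set-based dictionary; the function name, maximum word length, and toy dictionary are assumptions, not the paper's implementation:

```python
def forward_longest_match(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        match = text[i]  # single-character fallback
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

dictionary = {"操作", "系統", "操作系統", "整合", "系統整合"}
print(forward_longest_match("操作系統整合", dictionary))
# ['操作系統', '整合'] -- forward matching prefers 操作系統;
# backward matching would instead scan from the end and prefer 系統整合.
```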
2-6. Segmentation: Suffix structures
• Some units are recognized using heuristic rules:
• Date expressions (e.g. 一九九八年)
• Suffix structures (e.g. 使用者)
• Dictionary-based segmentation is often complemented by a set of such heuristic rules
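As an illustration only, one heuristic rule of this kind could be a pattern for date expressions; the regular expression below is a guess at such a rule, not the rule set actually used in the paper:

```python
import re

# Illustrative heuristic: a run of Chinese numerals followed by 年 (year)
DATE_PATTERN = re.compile(r"[〇一二三四五六七八九十]+年")

print(DATE_PATTERN.findall("一九九八年的報告"))  # ['一九九八年']
```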
3-1. Impacts of segmentation
• Words found by the longest matching approach have more precise meanings
• But if a long word contains several short words, only the long word will be identified
• “操作系統” will be identified; “操作” and “系統” will be ignored
3-2-1. Impacts of segmentation
1. Full segmentation: extract the short words contained within long words
   a. Consider the sentence as a string
   b. Extract every word that appears at the beginning of the string
   c. Remove the first character of the string
   d. Repeat until the string is empty
3-2-2. Impacts of segmentation
• Full segmentation of 操 作 系 統:
  「操作系統」=>「操」「操作」「操作系」
  「作系統」=>「作」「作系」「作系統」
  「系統」=>「系」「系統」
  「統」=>「統」
  (empty)
• How to produce exactly 「操作系統」「操作」「系統」?
• Combine full segmentation with the longest matching algorithm (a sketch follows below)
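A minimal sketch of full segmentation restricted to dictionary words, which with the toy dictionary below yields exactly 「操作系統」「操作」「系統」; the function name and the dictionary are illustrative:

```python
def full_segmentation(text, dictionary):
    """Extract every dictionary word starting at the beginning of the
    string, then drop the first character and repeat."""
    found = []
    while text:
        for end in range(1, len(text) + 1):
            prefix = text[:end]
            if prefix in dictionary:
                found.append(prefix)
        text = text[1:]  # remove the first character and repeat
    return found

dictionary = {"操作", "系統", "操作系統"}
print(full_segmentation("操作系統", dictionary))
# ['操作', '操作系統', '系統']
```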
3-3. Impacts of segmentation
2. Combine the longest words with single characters (uni-grams)
• A reasonable compromise between precision and recall [PIRCS] (ad-hoc segmentation based on occurrence counts and a threshold; a sketch follows below)
• 口 袋 怪 獸 在 全 世 界 魅 力 無 法 擋
  口: 1000 times
  口袋: 900 times
  口袋怪: 850 times
  口袋怪獸: 830 times
  口袋怪獸在: 5 times
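The PIRCS-style procedure is only hinted at on the slide; the sketch below illustrates the general idea of growing a unit while its occurrence count stays above a threshold. The function name, threshold value, and frequency table are assumptions:

```python
def longest_frequent_unit(text, start, freq, max_len=5, threshold=100):
    """At a given position, extend the unit while its corpus frequency
    stays above a threshold; keep the longest such unit."""
    best = text[start]  # single-character fallback
    for end in range(start + 2, min(start + max_len, len(text)) + 1):
        candidate = text[start:end]
        if freq.get(candidate, 0) >= threshold:
            best = candidate
        else:
            break
    return best

freq = {"口": 1000, "口袋": 900, "口袋怪": 850, "口袋怪獸": 830, "口袋怪獸在": 5}
print(longest_frequent_unit("口袋怪獸在全世界魅力無法擋", 0, freq))
# 口袋怪獸 -- 口袋怪獸在 falls below the threshold
```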
3-4-1. Impacts of segmentation
3. Combine bi-grams with uni-grams (characters)
• Some single characters are completely meaningful alone (e.g. 造) but are forced into combination with a neighbouring character
• Such combinations may also cover unknown words: 造紙 造船 造車 造物
3-4-2. Impacts of segmentation
• Unknown word: 大海灣 (does not exist in the dictionary)
• 大, 海, 灣 (using uni-grams)
• 大海, 海灣 (using bi-grams; both bi-grams occur in the same document)
  大: 1000 times, 海: 1000 times, 灣: 1000 times
  大海: 1000 times, 海灣: 1000 times
3-5. Impacts of segmentation
4. Combine words and bi-grams
• Words: rely on linguistic knowledge
• Bi-grams: rely on statistical information
• “貪玩的皮卡丘跑到大城市去逛街”
3-6. Impacts of segmentation
• The goal is to create a representation of a text or query from:
• Keywords
• Compound terms
• Certain combinations of the two
4-1. Experiment settings
• Data: TREC Chinese corpus
• Documents:
  People's Daily from 1991 to 1993
  A part of the news released by the Xinhua News Agency in 1994 and 1995
• A set of 54 queries has been set up and evaluated by people at NIST
4-2-1. Experiment settings
• The indexing result for a document is a vector of weights:
  D_i = (d_i1, d_i2, …, d_im)
• where d_ik (1 ≤ k ≤ m) is the weight of the term t_k in the document D_i, and m is the size of the vector space
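The slide does not show the weighting formula itself; one common TFIDF form, given here only as an assumed example rather than the paper's exact formula, is:

```latex
d_{ik} = tf_{ik} \cdot \log\frac{N}{n_k}
```

where tf_ik is the frequency of term t_k in document D_i, N is the number of documents in the collection, and n_k is the number of documents containing t_k.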
4-2-2. Experiment settings
• Using TFIDF, many meaningless bi-grams can be identified and given low weight
• “智慧型行銷資料探勘分析系統- UniMarketing™ 產品…探宇科技在資料探勘分析預測具有大的專業經驗，已涵蓋於金融、零售、醫療、製造等領域，所建立之資料探勘模型，經實際導入後成效是十分大的…”
4-4-3. Chinese IR using n-grams
• idf(探勘) = log(4/1) = 2
• idf(大的) = log(4/4) = 0
  D1: 探勘 20, 大的 20
  D2: 探勘 0, 大的 35
  D3: 探勘 0, 大的 5
  D4: 探勘 0, 大的 21
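A quick check of the idf arithmetic above, assuming a base-2 logarithm and the four-document toy collection from the table:

```python
import math

N = 4  # number of documents in the toy collection
df = {"探勘": 1, "大的": 4}  # document frequencies from the table above

for term, n_k in df.items():
    idf = math.log2(N / n_k)
    print(term, idf)  # 探勘 2.0, 大的 0.0
```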
4-3. Experiment settings
• The indexing result for a query is likewise a vector of weights:
  Q_j = (q_j1, q_j2, …, q_jm)
• The similarity between D_i and the query is then calculated
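The similarity measure is not spelled out on the slide; a common vector-space choice is cosine similarity over the sparse term-weight vectors, sketched here with hypothetical names:

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine similarity between two sparse term-weight vectors."""
    shared = set(doc_weights) & set(query_weights)
    dot = sum(doc_weights[t] * query_weights[t] for t in shared)
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

doc = {"探勘": 2.0, "資料": 1.5}
query = {"資料": 1.0, "探勘": 1.0}
print(cosine_similarity(doc, query))
```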
5. Experiments
1. Longest matching with two dictionaries
2. Combining characters (uni-grams) with longest matching
3. Full segmentation, with or without adding characters
4. Using bi-grams and characters
5. Combining words with bi-grams and characters
6. Adding unknown word detection
5-1. Longest matching with two dictionaries
• Results:
  Small dictionary (65,502 entries): average precision of 0.3797
  Large dictionary (about 220K entries): average precision of 0.3907
• Instance: 「作業系統」
• Dictionary size is not the only factor
5-2. Single characters with longest words
• Motivation: short words included in long words are ignored by longest matching alone
• Results:
  Small dictionary: 0.4058 (improvement of 6.9%)
  Large dictionary: 0.4290 (improvement of 9.8%)
• Instance: 「作業系統」,「作業」,「系統」
• Adding characters is a more effective way to improve performance than increasing the dictionary size
5-3. Full segmentation
• Extract the short words implied in long words; experiment with the large dictionary
• Result: 0.4090 (higher than longest matching alone, but lower than characters with longest matching)
• High recall but low precision
• Instance: 「作業系統」,「作業」,「業系」,「系統」;「意外事故」,「意外」,「外事」,「事故」
• Combining with single characters increases the result to 0.4117
5-4-1. Chinese IR using n-grams
• Bi-grams are combined with uni-grams
• Average precision is 0.4254
• Many bi-grams are meaningless, especially bi-grams containing functional characters (e.g. 的): 「大的」,「小的」,「好的」,「你的」…
• 智慧型行銷資料探勘分析系統- UniMarketing™ 產品…探宇科技在資料探勘分析預測具有大的專業經驗，已涵蓋於金融、零售、醫療、製造等領域，所建立之資料探勘模型，經實際導入後成效是十分大的…
5-4-2. Chinese IR using n-grams
• Using TFIDF, many meaningless bi-grams can be identified and given low weight:
  D1: 探勘 20, 大的 20
  D2: 探勘 0, 大的 35
  D3: 探勘 0, 大的 5
  D4: 探勘 0, 大的 21
5-4-3. Chinese IR using n-grams
• The disadvantages of bi-grams with respect to words:
  Indexing time (from 2 hours to more than 5 hours)
  Disc space
  Larger document representations
5-5. Combining words with n-grams
1. Combining (longest matching + characters) with (bi-grams + characters): average precision is 0.4260
2. Combining (full segmentation) with (bi-grams + characters): average precision is 0.4400
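The slide does not say how the two index representations are combined; one plausible reading is to merge the two term-weight vectors for each document (another is to interpolate the two retrieval scores). A hedged sketch of the vector-merging variant, with all names and the weighting parameter assumed:

```python
def merge_indexes(word_index, ngram_index, word_weight=0.5):
    """Merge two sparse term-weight vectors for the same document,
    linearly weighting the word-based and n-gram-based terms."""
    merged = {}
    for term, w in word_index.items():
        merged[term] = merged.get(term, 0.0) + word_weight * w
    for term, w in ngram_index.items():
        merged[term] = merged.get(term, 0.0) + (1.0 - word_weight) * w
    return merged

word_index = {"操作系統": 2.0, "整合": 1.0}
ngram_index = {"操作": 1.0, "作系": 0.5, "系統": 1.0}
print(merge_indexes(word_index, ngram_index))
```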
5-6-1. Impact of unknown word detection
• An unknown word is a word not stored in the dictionary
• 「皮納圖博火山」has been segmented as 「皮」「納」「圖」「博」「火」「山」
• 「蜂窩式」has been segmented as 「蜂窩」「式」
• NLPWin, an NLP analyzer developed at Microsoft, is used to recognize such unknown words
5-6-2. Impact of unknown word detection
5-7-1. Summary
6. Conclusion
• For the best retrieval performance: combine words and n-grams
• When time and space are a concern: use words (and characters)
• When many unknown words exist: use longest matching with single characters
• When English queries must be translated, words remain necessary, since there is no effective means to translate English words into Chinese bi-grams
Opinion
• Information Retrieval