Feature Selection on Chinese Text Classification Using Character N-grams. Tongji University, Key Laboratory "Embedded System and Service Computing", Ministry of Education. Ph.D.: Zhihua WEI. Supervisors: Duoqian MIAO, Jean-Hugues CHAUCHAT
Outline • Text representation • What is an n-gram? Why do we use n-grams? • Feature selection • Introduction of experiments • Results and discussion
Problems of Chinese text representation. The problems of Chinese text processing: Chinese text has no explicit word boundaries. • Word segmentation is necessary before any other preprocessing, and it requires a dictionary. • Word sense disambiguation (WSD) and unknown-word recognition are two difficulties of word segmentation.
An example: two segmentation possibilities. Sentence: 流感到冬天很普遍。 It can be segmented as: • 流感 / 到 / 冬天 / 很 / 普遍 / 。 • 流 / 感到 / 冬天 / 很 / 普遍 / 。
An example: two segmentation possibilities • 流感 / 到 / 冬天 / 很 / 普遍 / 。 (correct) Gloss: flu / arrive / winter / very / common. Meaning: "Flu is very common in winter." • *流 / 感到 / 冬天 / 很 / 普遍 / 。 (incorrect) Gloss: flow / feel / winter / very / common. Meaning: "Flow feels that winter is very common." Here “流感” is closely related to the document theme (e.g. medicine), so the incorrect segmentation loses a topical feature.
What is a word n-gram? There are two kinds of n-gram: word n-grams and character n-grams. Example sentence: 我们明天去北京。 After word segmentation: 我们 | 明天 | 去 | 北京. 1-grams: {我们; 明天; 去; 北京} 2-grams: {我们明天; 明天去; 去北京} 3-grams: {我们明天去; 明天去北京}
What is a character n-gram? Character n-grams of 我们明天去北京。 1-grams: {我; 们; 明; 天; 去; 北; 京} 2-grams: {我们; 们明; 明天; 天去; 去北; 北京} 3-grams: {我们明; 们明天; 明天去; 天去北; 去北京} In our work, we use character n-grams.
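The character n-gram lists above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `char_ngrams` is hypothetical, and dropping punctuation and whitespace before windowing is an assumption made so that the output matches the slide, which omits the final "。".

```python
import unicodedata

def char_ngrams(text, n):
    """Return the character n-grams of `text` in order of appearance.

    Assumption: punctuation and whitespace are dropped first, so that
    the final full stop does not produce n-grams of its own.
    """
    chars = [ch for ch in text
             if not ch.isspace()
             and not unicodedata.category(ch).startswith("P")]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

sentence = "我们明天去北京。"
for n in (1, 2, 3):
    print(n, char_ngrams(sentence, n))
```

No dictionary or segmenter is involved, which is the practical appeal of the character-level representation.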
Advantages and disadvantages of character n-grams. Advantages: • Language-independent • Avoids the word segmentation problem. Disadvantage: • Yields a large number of n-grams
Text representation by vectors. We adopt the VSM (vector space model).
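As an illustration of the vector space model with character n-grams, the sketch below turns a few documents into a "document by feature" table. The helper names (`ngram_counts`, `to_vectors`) are hypothetical, and using raw occurrence counts as weights is an assumption, since the slide does not state the weighting scheme.

```python
from collections import Counter

def ngram_counts(text, ns=(1, 2)):
    """Count every character n-gram of the sizes in `ns` in one text."""
    counts = Counter()
    for n in ns:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

def to_vectors(docs, ns=(1, 2)):
    """Build a shared vocabulary and one count vector per document.

    Assumption: features are weighted by raw counts.
    """
    doc_counts = [ngram_counts(d, ns) for d in docs]
    vocab = sorted(set().union(*doc_counts))
    vectors = [[c[g] for g in vocab] for c in doc_counts]
    return vocab, vectors

vocab, vectors = to_vectors(["流感很普遍", "冬天很冷"])
print(vocab)
print(vectors)
```

Each row is one document and each column one n-gram; most cells are zero, which is why sparseness matters later in the talk.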
Choose “n” in n-gram • Results of Zhou Shuigeng et al. (2001), ranked from best to worst: 1-,2-,3-,4-grams (best result); 1-,2-grams; 2-grams; 2-,3-,4-grams; 1-grams (worst result) • Result of Lelu et al. (1998): 2-grams
Choose “n” in n-gram • Choice of “n” in our experiments: (1) 1-,2-grams (2) 1-,2-,3-grams
Some definitions. Each text in corpus D belongs to one class c_i, where c_i ∈ C and C = {c_1, c_2, …, c_i, …, c_n} is the class set defined before classification. • Text_freq_ij = the number of texts in class c_i that include n-gram j. • Text_freq_relative_ij = Text_freq_ij / N_i, where N_i is the number of texts in class c_i in the training set. • Gram_freq_ij = the number of occurrences of n-gram j in all texts of class c_i in the training set. • Gram_freq_relative_ij = Gram_freq_ij / N'_i, where N'_i is the total number of occurrences of all n-grams in all texts of class c_i in the training set.
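The four frequencies defined above can be computed in one pass over a labelled training set. The following is a sketch under stated assumptions: the function name `class_frequencies`, the `(label, text)` input format, and the use of 1- and 2-grams by default are all illustrative choices, not taken from the slides.

```python
from collections import Counter, defaultdict

def class_frequencies(train, ns=(1, 2)):
    """Compute per-class n-gram statistics.

    train: list of (class_label, text) pairs (hypothetical format).
    Returns {label: {"text_freq", "text_freq_relative",
                     "gram_freq", "gram_freq_relative"}}.
    """
    n_texts = Counter()                # N_i: texts per class
    text_freq = defaultdict(Counter)   # texts of class i containing n-gram j
    gram_freq = defaultdict(Counter)   # occurrences of n-gram j in class i
    for label, text in train:
        n_texts[label] += 1
        grams = [text[i:i + n] for n in ns
                 for i in range(len(text) - n + 1)]
        gram_freq[label].update(grams)
        text_freq[label].update(set(grams))  # count each text once
    stats = {}
    for label in n_texts:
        total = sum(gram_freq[label].values())  # N'_i
        stats[label] = {
            "text_freq": dict(text_freq[label]),
            "text_freq_relative": {g: f / n_texts[label]
                                   for g, f in text_freq[label].items()},
            "gram_freq": dict(gram_freq[label]),
            "gram_freq_relative": {g: f / total
                                   for g, f in gram_freq[label].items()},
        }
    return stats

stats = class_frequencies([("med", "流感"), ("med", "流"), ("sport", "球")])
print(stats["med"]["text_freq_relative"])
```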
Feature selection • Two steps: • Inter-class feature number reduction • Cross-class feature selection
Inter-class feature number reduction (1). Relative text frequency is used as the threshold. Before selection: more than 15,000 features per class. After removing n-grams with Text_freq_relative_ij < 0.02: about 7,000 features per class remain. After removing n-grams with Text_freq_relative_ij < 0.03: about 4,000 features per class remain.
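The thresholding step can be sketched as a simple filter. This reads the thresholds above as "drop n-grams whose relative text frequency falls below the threshold", an interpretation that fits the reported counts (the higher threshold keeps fewer features). The function name `reduce_features` and the toy frequencies are hypothetical.

```python
def reduce_features(text_freq_relative, threshold=0.02):
    """Keep the n-grams of one class whose relative text frequency
    reaches `threshold`.

    Assumption: n-grams strictly below the threshold are discarded.
    """
    return {g for g, f in text_freq_relative.items() if f >= threshold}

# Toy relative text frequencies for one class (made-up values).
freqs = {"流感": 0.40, "感到": 0.025, "到冬": 0.01}
print(reduce_features(freqs, 0.02))
print(reduce_features(freqs, 0.03))
```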
Inter-class feature number reduction (2)
Cross-class feature selection (1) • Cross-class feature selection by chi-square. Here, the observed value O_ij is chosen as one of: 1. Text_freq_relative_ij; 2. Gram_freq_relative_ij; 3. Text_freq_ij; 4. Gram_freq_ij.
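The chi-square formula itself did not survive in this transcript. The sketch below assumes the standard Pearson statistic on the class-by-n-gram table, with expected value E_ij = (row total × column total) / grand total and a per-n-gram score summed over classes; this is an assumption, not a confirmed reconstruction of the authors' formula. The function name `chi_square_scores` is hypothetical, and any of the four observed quantities can be passed in as O_ij.

```python
def chi_square_scores(observed):
    """Score each n-gram by its Pearson chi-square contribution.

    observed: {class_label: {ngram: O_ij}}, where O_ij is one of the
    four frequency measures. Assumes the standard Pearson statistic:
    sum over classes of (O_ij - E_ij)^2 / E_ij.
    """
    classes = list(observed)
    grams = set().union(*(observed[c] for c in classes))
    row_tot = {c: sum(observed[c].values()) for c in classes}
    col_tot = {g: sum(observed[c].get(g, 0) for c in classes)
               for g in grams}
    grand = sum(row_tot.values())
    if grand == 0:
        return {g: 0.0 for g in grams}
    scores = {}
    for g in grams:
        chi2 = 0.0
        for c in classes:
            expected = row_tot[c] * col_tot[g] / grand
            if expected > 0:
                chi2 += (observed[c].get(g, 0) - expected) ** 2 / expected
        scores[g] = chi2
    return scores

table = {"med": {"流感": 20, "很": 10}, "sport": {"流感": 0, "很": 10}}
ranked = sorted(chi_square_scores(table).items(), key=lambda kv: -kv[1])
print(ranked)
```

N-grams whose distribution across classes departs most from independence score highest and would be kept as features.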
Cross-class feature selection (2)
Experiment dataset: the distribution of TanCorp-12 (M = megabyte)
Experiment scenarios
Classifier • C-SVC classifier (as implemented in LIBSVM), an SVM algorithm that handles multi-class classification tasks. • Platform: TANAGRA (developed by Ricco Rakotomalala)
Evaluation. F1 measure for a binary classifier: F1 = 2 × Precision × Recall / (Precision + Recall). F1 in our work (more than two classes): Micro-F1 = F1 computed from counts pooled over all documents and classes; Macro-F1 = average of the within-category F1 values.
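The two multi-class averages can be illustrated as follows. This sketch uses the standard definitions (micro: pool true positives, false positives and false negatives over all classes; macro: average the per-class F1 values); the function names are hypothetical and single-label classification is assumed.

```python
def f1(tp, fp, fn):
    """F1 from counts; equals 2PR / (P + R). Returns 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0
    counts = {c: [0, 0, 0] for c in labels}  # [tp, fp, fn] per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted class
            counts[t][2] += 1  # false negative for the true class
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    tp, fp, fn = (sum(counts[c][k] for c in labels) for k in range(3))
    return f1(tp, fp, fn), macro

micro, macro = micro_macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"])
print(micro, macro)
```

Micro-F1 is dominated by the large classes, while Macro-F1 gives every class equal weight, so reporting both is informative on an unbalanced corpus.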
1-,2-grams vs. 1-,2-,3-grams: Ex_3 > Ex_5, Ex_2 > Ex_6. 1-,2-grams are better!
N-gram frequency vs. text frequency: Ex_1 > Ex_2, Ex_3 > Ex_4, Ex_5 > Ex_6. N-gram frequency is better!
Absolute freq. vs. relative freq.: Ex_1 ≈ Ex_3, Ex_2 ≈ Ex_4. The two are similar!
Sparseness comparison. [Šilić et al., 2007] show that computation time is linked more closely to the number of non-zero values in the cross-table (document by feature) than to its number of columns (features).
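The quantity being compared here, the number of non-zero cells in the document-by-feature table, is easy to measure. A small sketch follows, with a hypothetical function name and a dense list-of-lists table assumed for simplicity.

```python
def sparseness(matrix):
    """Return (non-zero cell count, density) of a document-by-feature
    table given as a list of equal-length rows."""
    n_cells = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for v in row if v != 0)
    density = nonzero / n_cells if n_cells else 0.0
    return nonzero, density

table = [[1, 0, 2],
         [0, 0, 3]]
print(sparseness(table))
```

Two feature sets with the same number of columns can differ widely in non-zero count, which is why the comparison is made on this measure rather than on the feature count alone.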
Conclusion • In our experiments, feature selection methods based on n-gram frequency (absolute or relative) consistently give better results than those based on text frequency (absolute or relative). • Relative frequency is no better than absolute frequency. • Methods based on n-gram frequency also produce denser “document by feature” matrices.
Confusion Matrix
Future work • Better methods for n-gram feature selection. • Test the results on hierarchical classification tasks.
Thank you! Questions?