80 likes | 224 Views
Extracting key-substring-group features for text classification. Advisor : Dr. Hsu Graduate : Chen, Shao-Pei Authors : Dell Zhang, Wee Sun Lee 2006.SIGKDD.10. Outline. Motivation Objective Methodology Experimental Results Conclusion.
E N D
Extracting key-substring-group features for text classification Advisor : Dr. Hsu Graduate : Chen, Shao-Pei Authors : Dell Zhang, Wee Sun Lee 2006.SIGKDD.10
Outline • Motivation • Objective • Methodology • Experimental Results • Conclusion
The huge number of substrings in the training corpus D is an obstacle to making use of discriminative learning methods for text classification. A document of length |d| has at least |d|(|d|+1)/2 matches of substring with itself. Motivation abc
We proposed key-substring-group feature extraction technique solves the effectiveness and efficiency problems of string kernel, and opens numerous promising directions. We proposed a suffix tree based algorithm that can extract such features in linear time. Objective abc bc c
Methodology-suffix tree 1xabxa$ 2abxa$ 3bxa$ 4xa$ 5a$ 6$ 1xabxa$ 4xa$ 2abxa$ 5a$ 3bxa$ 6$ Theorem 2. Given an arbitrary string P, we can find all occurrences of P in S in O(|P|) time taking advantage of the suffix tree T for S. Theorem 3. If P is a substring of S, the occurrence frequency of P in S would be equal to the number of leaves in the subtree of T rooted at P. Theorem 5. Ukkonen’s algorithm can construct the suffix tree T for a string S of length n, along with all its suffix links, in O(n) time. Theorem 8. Suppose the corpus is of size n . Then there are n trivial groups whose substrings occur only once in D, and at most n-1 non-trivial groups. 5
Methodology-algorithm l: the minimum frequency. h: the maximum frequency. b: the minimum number of branches (children). p: the maximum parent-child conditional probability. q: the maximum suffix-link conditional probability. 6
Experimental Results l, h, b, p and q Initially we took the default values for those parameters which would not filter any nontrivial substring-group out. Then we reduced the number of features gradually by adjusting the parameters. English Text Topic Classification Chinese Text Topic Classification:87.3% Greek Text Authorship Classification:92% Greek Text Genre Classification:94%
Conclusion Our proposed key-substring-group feature extraction technique solves the effectiveness and efficiency problems of string kernel, and opens numerous promising directions.