
A Boosting Algorithm for Classification of Semi-Structured Text


Presentation Transcript


  1. A Boosting Algorithm for Classification of Semi-Structured Text Taku Kudo * # Yuji Matsumoto * * Nara Institute of Science and Technology # Currently, NTT Communication Science Labs.

  2. Backgrounds • Text Classification using Machine Learning • categories: topics (sports, finance, politics…) • features: bag-of-words (BOW) • methods: SVM, Boosting, Naïve Bayes • Changes in categories • modalities, subjectivities, or sentiments • Changes in text size • document (large) → passage, sentence (small) Our Claim: BOW is not sufficient

  3. Backgrounds, cont. • Straightforward extensions • Add some structural features, e.g., fixed-length N-grams or fixed-length syntactic relations • But… • Ad-hoc and task-dependent • requires careful feature selection • How to determine the optimal size (length)? • Use of larger substructures yields inefficiency • Use of smaller substructures is the same as BOW

  4. Our approach • Semi-structured text • assume that text is represented as a tree • word sequence, dependency tree, base-phrases, XML • Propose a new ML algorithm that can automatically capture relevant substructures in a semi-structured text • Characteristics: • Instance is not a numerical vector but a tree • Use all subtrees as features without any constraints • A compact and relevant feature set is automatically selected

  5. Classifier for Trees

  6. Tree classification problem • Goal: induce a mapping from given training data • Training data: a set T of pairs of a tree x and a class label y (+1 or −1) • (Figure: an example training set T of four small labeled trees, two labeled +1 and two labeled −1)

  7. Labeled ordered tree, subtree • Labeled ordered tree (or simply tree) • labeled: each node is associated with a label • ordered: siblings are ordered • Subtree • preserves the parent-daughter relation • preserves the sibling relation • preserves the labels • (Figure: two example trees A and B, where B is a subtree of A and A is a supertree of B)
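
As one reading of this containment relation, the check below is a minimal sketch (it assumes a simple Node(label, children) representation, which is not part of the original slides; the exact matching semantics in the paper may differ in details such as whether matched siblings must be contiguous):

    # Hypothetical tree representation and subtree check (illustration only).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        label: str
        children: List["Node"] = field(default_factory=list)

    def matches_at(pattern: Node, node: Node) -> bool:
        # The pattern root must carry the same label, and every pattern child
        # must embed into the node's children while preserving their order.
        if pattern.label != node.label:
            return False
        i = 0
        for p_child in pattern.children:
            while i < len(node.children) and not matches_at(p_child, node.children[i]):
                i += 1
            if i == len(node.children):
                return False
            i += 1  # later pattern children must match later node children
        return True

    def contains_subtree(tree: Node, pattern: Node) -> bool:
        # The pattern may be rooted at any node of the tree.
        return matches_at(pattern, tree) or any(
            contains_subtree(c, pattern) for c in tree.children
        )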

  8. Decision stumps for trees • A simple rule-based classifier • ⟨t, y⟩ is a parameter (rule) of a decision stump • (Figure: two example rules, ⟨t1, +1⟩ and ⟨t2, −1⟩, each applied to an input tree x, with the resulting outputs h⟨t1,y⟩(x) and h⟨t2,y⟩(x))
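
A minimal sketch of such a rule, reusing the hypothetical Node and contains_subtree from the sketch above: the rule ⟨t, y⟩ answers y when t occurs in x and −y otherwise.

    def stump_predict(t: Node, y: int, x: Node) -> int:
        # h_<t,y>(x): answer y if subtree t occurs in tree x, otherwise -y.
        return y if contains_subtree(x, t) else -y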

  9. Decision stumps for trees, cont. • Training: select the optimal rule that maximizes the gain (or accuracy) • F: feature set (a set of all subtrees)
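
The gain appeared as a formula on the original slide and is not reproduced in this transcript; given the stump definition above and training pairs (x_i, y_i), it is presumably the agreement between the rule and the labels, maximized over all subtrees and both polarities:

    \mathrm{gain}(\langle t, y\rangle) \;=\; \sum_{i=1}^{L} y_i \, h_{\langle t, y\rangle}(x_i),
    \qquad
    \langle \hat{t}, \hat{y}\rangle \;=\; \operatorname*{argmax}_{t \in \mathcal{F},\; y \in \{\pm 1\}} \mathrm{gain}(\langle t, y\rangle)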

  10. Decision stumps for trees, cont. • Select the optimal rule that yields the maximum gain • (Figure: a table of candidate rules ⟨t, y⟩ such as ⟨a, +1⟩, ⟨a, −1⟩, ⟨b, +1⟩, …, each with its prediction on the four training trees and the resulting gain; the rule with the maximum gain is selected)

  11. Boosting • Decision stumps are too weak • Boosting [Schapire97]: 1. build a weak learner (decision stump) Hj • 2. re-weight instances with respect to error rates • 3. repeat steps 1 to 2 K times • output a linear combination of H1 ~ HK • Redefine the gain to use Boosting (a sketch of the loop follows below)
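
A rough, hedged sketch of this loop (AdaBoost-style re-weighting; select_best_stump is a stand-in for the rule search described on the following slides, and Node/stump_predict come from the earlier sketches):

    import math

    def boost(train, K):
        # train: list of (tree, label) pairs with label in {+1, -1}.
        # Returns a list of (alpha, t, y) triples: a linear combination of stumps.
        L = len(train)
        d = [1.0 / L] * L                                  # instance weights
        rules = []
        for _ in range(K):
            t, y = select_best_stump(train, d)             # 1. build a weak learner (stump)
            preds = [stump_predict(t, y, x) for x, _ in train]
            err = sum(w for w, (_x, lab), p in zip(d, train, preds) if p != lab)
            alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-12))
            d = [w * math.exp(-alpha * lab * p)            # 2. re-weight w.r.t. errors
                 for w, (_x, lab), p in zip(d, train, preds)]
            z = sum(d)                                     # renormalize the weights
            d = [w / z for w in d]
            rules.append((alpha, t, y))                    # 3. repeat K times
        return rules

    def classify(rules, x):
        # Output of the linear combination H1 ~ HK.
        score = sum(alpha * stump_predict(t, y, x) for alpha, t, y in rules)
        return 1 if score >= 0 else -1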

  12. Efficient Computation

  13. How to find the optimal rule? • F is too huge to be enumerated explicitly • Need to find the optimal rule efficiently • Our answer: a variant of Branch-and-Bound • Define a search space in which the whole set of subtrees is given • Find the optimal rule by traversing this search space • Prune the search space by proposing a criterion

  14. Rightmost extension [Asai02, Zaki02] • extend a given tree of size (n−1) by adding a new node, to obtain trees of size n • the new node is added to the rightmost path • the new node is added as the rightmost sibling • (Figure: a 7-node tree with its rightmost path highlighted and the candidate attachment positions for the new node)

  15. Rightmost extension, cont. • Recursive applications of rightmost extensions create a search space (see the sketch below)
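
A small sketch of one extension step under the same hypothetical Node representation: follow the rightmost path from the root through last children, and attach a new node as the new rightmost child of each node on that path.

    import copy

    def rightmost_extensions(tree, labels):
        # Yield every tree obtained from `tree` by adding one new node, with any
        # of the given labels, as the rightmost child of a node on the rightmost path.
        path_len = 1
        node = tree
        while node.children:                    # length of the rightmost path
            node = node.children[-1]
            path_len += 1
        for depth in range(path_len):
            for label in labels:
                new_tree = copy.deepcopy(tree)
                target = new_tree
                for _ in range(depth):          # walk down the rightmost path in the copy
                    target = target.children[-1]
                target.children.append(Node(label))
                yield new_tree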

  16. Pruning strategy • For every subtree t, propose an upper bound μ(t) such that the gain of any supertree t′ ⊇ t satisfies gain(⟨t′, y⟩) ≤ μ(t) • e.g., μ(t) = 0.4 implies the gain of any supertree of t is no greater than 0.4 • Can prune the node t if μ(t) < τ, where τ is the suboptimal (best-so-far) gain

  17. Upper bound of the gain (an extension of [Morishita 02])
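
The bound itself was given as a formula on the slide and is not reproduced in the transcript. Using the boosting instance weights d_i and the gain above, a Morishita-style bound presumably takes the following form, exploiting the fact that any supertree t′ ⊇ t matches only a subset of the instances matched by t:

    \mu(t) \;=\; \max\Bigl(
      2 \sum_{i:\, t \subseteq x_i,\; y_i = +1} d_i \;-\; \sum_{i=1}^{L} y_i d_i,\;\;
      2 \sum_{i:\, t \subseteq x_i,\; y_i = -1} d_i \;+\; \sum_{i=1}^{L} y_i d_i
    \Bigr)
    \;\ge\; \mathrm{gain}(\langle t', y\rangle)
    \quad \text{for all } t' \supseteq t,\; y \in \{\pm 1\}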

  18. Relation to SVMs with Tree Kernel

  19. Classification algorithm • Modeled as a linear classifier • wt : weight of tree t • -b : bias (default class label)
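
The formula on the original slide is not reproduced in this transcript; from the description it is presumably of the form

    f(x) \;=\; \mathrm{sgn}\Bigl(\sum_{t \in \mathcal{F}} w_t \, I(t \subseteq x) \;-\; b\Bigr)

where I(·) indicates whether subtree t occurs in the input tree x.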

  20. SVMs and Tree Kernel [Collins 02] • Tree Kernel: all subtrees are expanded implicitly • SVM and Boosting feature spaces are essentially the same • Learning strategies are different • (Figure: an example tree mapped to a sparse indicator vector {0,…,1,…,1,…,1,…,0,…} over all subtrees)

  21. SVM vs. Boosting [Rätsch 01] • Both are known as Large Margin Classifiers • Metric of margin is different • SVM: L2-norm margin • w is expressed in a small number of examples (support vectors) • sparse solution in the example space • Boosting: L1-norm margin • w is expressed in a small number of features • sparse solution in the feature space

  22. SVM vs. Boosting, cont. • Accuracy is task-dependent • Practical advantages of Boosting: • Good interpretability • Can analyze how the model performs or what kinds of features are useful • Compact features (rules) are easy to deal with • Fast classification • Complexity depends on the small number of rules • Kernel methods are too heavy

  23. Experiments

  24. Sentence classifications • PHS: cell phone review classification (5,741 sent.) • domain: Web-based BBS on PHS, a sort of cell phone • categories: positive review or negative review • MOD: modality identification (1,710 sent.) • domain: editorial news articles • categories: assertion, opinion, or description • Examples: • positive: It is useful that we can know the date and time of E-Mails. • negative: I feel that the response is not so good. • assertion: We should not hold an optimistic view of the success of POKEMON. • opinion: I think that now is the best time for developing the blue print. • description: Social function of education has been changing.

  25. Sentence representations • N-gram tree • each word simply modifies the next word • subtree is an N-gram (N is unrestricted) • dependency tree • word-based dependency tree • A Japanese dependency parser, CaboCha, is used • bag-of-words (baseline) • (Figure: N-gram tree and dependency tree for the sentence "response is very good")
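
A small sketch of the N-gram-tree reading, again using the hypothetical Node class: each word is attached as the child of the next word, so the sentence becomes a single chain whose connected chains are exactly its N-grams, with N unrestricted.

    def ngram_tree(words):
        # Chain tree in which each word modifies (is the child of) the next word,
        # so the final word of the sentence becomes the root.
        root = None
        for word in words:
            root = Node(word, [root] if root is not None else [])
        return root

    # e.g. ngram_tree("response is very good".split())
    # yields good(very(is(response))); its connected chains are the N-grams.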

  26. Results • subtree features outperform the baseline (BOW) • dep vs. n-gram: comparable (no significant difference) • SVMs show worse performance depending on the task • overfitting

  27. Interpretability • PHS dataset with dependency
  A: subtrees that include "hard, difficult"
   0.0004 be hard to hang up
  -0.0006 be hard to read
  -0.0007 be hard to use
  -0.0017 be hard to …
  B: subtrees that include "use"
   0.0027 want to use
   0.0002 use
   0.0002 be in use
   0.0001 be easy to use
  -0.0001 was easy to use
  -0.0007 be hard to use
  -0.0019 is easier to use than …
  C: subtrees that include "recharge"
   0.0028 recharging time is short
  -0.0041 recharging time is long

  28. Interpretability, cont. • PHS dataset with dependency • Input: The LCD is large, beautiful and easy to see
  weight w / subtree t
   0.00368 be easy to
   0.00353 beautiful
   0.00237 be easy to see
   0.00174 is large
   0.00107 The LCD is large
   0.00074 The LCD is …
   0.00057 The LCD
   0.00036 see
  -0.00001 large

  29. Advantages • Compact feature set • Boosting extracts only 1,783 unique features • The set sizes of distinct 1-gram, 2-gram, and 3-gram are 4,211, 24,206, and 43,658 respectively • SVMs implicitly use a huge number of features • Fast classification • Boosting: 0.531 sec. / 5,741 instances • SVM: 255.42 sec. / 5,741 instances • Boosting is about 480 times faster than SVMs

  30. Conclusions • Assume that text is represented in a tree • Extension of decision stumps • all subtrees are potentially used as features • Boosting • Branch and bound • makes it possible to find the optimal rule efficiently • Advantages: • good interpretability • fast classification • comparable accuracy to SVMs with kernels

  31. Future work • Other applications • Information extraction • semantic-role labeling • parse tree re-ranking • Confidence rated predictions for decision stumps

  32. Thank you! • An implementation of our method is available as open-source software at: http://chasen.naist.jp/~taku/software/bact/
