A Boosting Algorithm for Classification of Semi-Structured Text Taku Kudo * # Yuji Matsumoto * * Nara Institute of Science and Technology # Currently, NTT Communication Science Labs.
Backgrounds • Text Classification using Machine Learning • categories: topics (sports, finance, politics…) • features: bag-of-words (BOW) • methods: SVM, Boosting, Naïve Bayes • Changes in categories • modalities, subjectivities, or sentiments • Changes in text size • document (large) → passage, sentence (small) Our Claim: BOW is not sufficient
Backgrounds, cont. • Straightforward extensions • Add some structural features, e.g., fixed-length N-grams or fixed-length syntactic relations • But… • Ad-hoc and task-dependent • requires careful feature selection • How to determine the optimal size (length)? • Use of larger substructures yields inefficiency • Use of smaller substructures is the same as BOW
Our approach • Semi-structured text • assume that text is represented as a tree • word sequence, dependency tree, base phrases, XML • Propose a new ML algorithm that can automatically capture relevant substructures in a semi-structured text • Characteristics: • An instance is not a numerical vector but a tree • Uses all subtrees as features, without any constraints • A compact and relevant feature set is automatically selected
Tree classification problem • Goal: induce a mapping from given training data • Training data: a set T of pairs of a tree x and a class label y (+1 or -1) • [Figure: an example training set T of four small trees over the labels a, b, c, d, each annotated with +1 or -1]
Labeled ordered tree, subtree • Labeled ordered tree (or simply tree) • labeled: each node is associated with a label • ordered: siblings are ordered • Subtree • preserves the parent-daughter relation • preserves the sibling relation • preserves the labels • [Figure: two example trees A and B; B is a subtree of A, and A is a supertree of B]
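A minimal sketch of this tree and subtree machinery (not the authors' implementation): the matching below preserves labels, parent-daughter links, and left-to-right sibling order, and assumes that gaps between matched siblings are allowed; all names are ours.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of a labeled ordered tree."""
    label: str
    children: list = field(default_factory=list)   # ordered siblings

def _match_here(t: Node, x: Node) -> bool:
    """Does pattern t match with its root mapped exactly onto node x?"""
    if t.label != x.label:
        return False
    i = 0                                   # next candidate child of x
    for tc in t.children:                   # keep the sibling order of t
        while i < len(x.children) and not _match_here(tc, x.children[i]):
            i += 1                          # skip non-matching children of x
        if i == len(x.children):
            return False
        i += 1
    return True

def contains(x: Node, t: Node) -> bool:
    """True if the pattern tree t occurs somewhere inside the tree x (t ⊆ x)."""
    return _match_here(t, x) or any(contains(c, t) for c in x.children)
```

For example, contains(Node("a", [Node("b"), Node("c")]), Node("a", [Node("c")])) is True: the single-child pattern skips the sibling "b" while keeping the parent-daughter link.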
Decision stumps for trees • A simple rule-based classifier • <t, y> is a parameter (rule) of the decision stump: given an input tree x, the stump answers y if t is a subtree of x, and -y otherwise • [Figure: two example rules <t1, +1> and <t2, -1> evaluated on an input tree x]
Decision stumps for trees, cont. • Training: select the optimal rule that maximizes the gain (or accuracy) • F: feature set (a set of all subtrees)
Decision stumps for trees, cont. • Select the optimal rule <t, y> that yields the maximum gain • [Figure: a table scoring every candidate rule, e.g. <a, +1>, <a, -1>, <b, +1>, …, against the four training trees; the rule with the largest gain is selected]
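A hedged sketch of the stump and of exhaustive rule selection, reusing the Node/contains helpers sketched earlier. The gain written here, sum_i d_i·y_i·h_<t,y>(x_i) with optional instance weights d, is one plausible reading of "gain (or accuracy)"; the authors' exact definition may differ.

```python
def stump(t, y):
    """The decision stump h_<t,y>: answers y if t ⊆ x, otherwise -y."""
    return lambda x: y if contains(x, t) else -y

def best_rule(candidates, data, d=None):
    """Pick <t, y> maximizing gain(<t,y>) = sum_i d_i * y_i * h_<t,y>(x_i).

    candidates: iterable of pattern trees (the feature set F)
    data:       list of (tree, label) pairs with label in {+1, -1}
    d:          optional instance weights (uniform if omitted); boosting
                later re-runs this selection with non-uniform weights
    """
    if d is None:
        d = [1.0 / len(data)] * len(data)
    best, best_gain = None, float("-inf")
    for t in candidates:
        for y in (+1, -1):
            h = stump(t, y)
            gain = sum(di * yi * h(xi) for di, (xi, yi) in zip(d, data))
            if gain > best_gain:
                best, best_gain = (t, y), gain
    return best, best_gain
```

With uniform weights this gain is 2·accuracy − 1, so maximizing it is the same as maximizing the training accuracy of the stump.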
Boosting • Decision stumps alone are too weak • Boosting [Schapire97] • 1. build a weak learner (a decision stump) Hj • 2. re-weight the instances with respect to the error rate • repeat steps 1–2 K times • output a linear combination of H1 … HK • The gain is redefined to use the boosting weights
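A hedged sketch of the boosting loop around the stump learner: plain AdaBoost, not necessarily the exact variant in the paper, and with the candidate subtrees enumerated explicitly for clarity (the branch-and-bound search on the following slides avoids that enumeration). It reuses stump and best_rule from the sketches above.

```python
import math

def boost(candidates, data, K=10):
    """Return a classifier that is a linear combination of K decision stumps."""
    n = len(data)
    d = [1.0 / n] * n                              # instance weights
    ensemble = []                                  # list of (alpha, t, y)
    for _ in range(K):
        # 1. build a weak learner: the stump with the maximum weighted gain
        (t, y), _ = best_rule(candidates, data, d)
        h = stump(t, y)
        # 2. re-weight instances with respect to the weighted error rate
        eps = sum(di for di, (xi, yi) in zip(d, data) if h(xi) != yi)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)    # keep the log finite
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        d = [di * math.exp(-alpha * yi * h(xi)) for di, (xi, yi) in zip(d, data)]
        z = sum(d)
        d = [di / z for di in d]                   # renormalize
        ensemble.append((alpha, t, y))
    # output: the sign of the linear combination of the K stumps
    return lambda x: 1 if sum(a * stump(t, y)(x) for a, t, y in ensemble) >= 0 else -1
```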
How to find the optimal rule? • F is too huge to be enumerated explicitly • Need to find the optimal rule efficiently • A variant of Branch-and-Bound • Define a search space in which the whole set of subtrees is given • Find the optimal rule by traversing this search space • Prune the search space with a proposed criterion
Rightmost extension [Asai02, Zaki02] • extend a given tree of size (n-1) by adding a new node to obtain trees of size n • the new node is attached to a node on the rightmost path • the new node is added as the rightmost sibling • [Figure: a tree and the candidate trees produced by attaching a new node, in turn, to each node on its rightmost path]
Rightmost extension, cont. • Recursive application of rightmost extensions creates a search space covering the whole set of subtrees
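A minimal sketch of one rightmost-extension step, using the Node class from the earlier sketch; every tree of size n is produced from a tree of size n-1 by attaching one new node, with any label from the alphabet, as the new rightmost child of some node on the rightmost path.

```python
import copy

def rightmost_extensions(t, labels):
    """Yield every tree obtained from t by a single rightmost extension."""
    # length of the rightmost path: root, its last child, that node's last child, ...
    path_len, node = 1, t
    while node.children:
        node = node.children[-1]
        path_len += 1
    for depth in range(path_len):          # attachment point on the rightmost path
        for lab in labels:                 # label of the new node
            new_tree = copy.deepcopy(t)
            node = new_tree
            for _ in range(depth):
                node = node.children[-1]
            node.children.append(Node(lab))   # added as the rightmost sibling
            yield new_tree
```

Starting from all single-node trees and applying this step recursively enumerates each labeled ordered tree exactly once, which is the search space referred to above.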
Pruning strategy • Pruning: for every supertree t' of t, propose an upper bound μ(t) such that gain(<t', y>) ≤ μ(t) • e.g., μ(t) = 0.4 implies the gain of any supertree of t is no greater than 0.4 • Can prune the node t if μ(t) < τ, where τ is the suboptimal (best-so-far) gain
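A hedged sketch of the pruned search. The bound below is the Morishita-style one that follows from the fact that any instance containing a supertree t' ⊇ t must also contain t; the paper's exact bound may be stated differently. τ (tau) is the best gain found so far, and the helpers are the ones sketched earlier.

```python
def upper_bound(t, data, d):
    """mu(t): an upper bound on gain(<t', y>) for every supertree t' of t."""
    pos = sum(di for di, (xi, yi) in zip(d, data) if yi == +1 and contains(xi, t))
    neg = sum(di for di, (xi, yi) in zip(d, data) if yi == -1 and contains(xi, t))
    total = sum(di * yi for di, (xi, yi) in zip(d, data))
    return max(2.0 * pos - total, 2.0 * neg + total)

def find_optimal_rule(seed_trees, labels, data, d):
    """Branch-and-bound over the rightmost-extension search space."""
    best, tau = None, float("-inf")
    stack = list(seed_trees)                    # e.g. all single-node trees
    while stack:
        t = stack.pop()
        for y in (+1, -1):                      # evaluate the candidate itself
            h = stump(t, y)
            gain = sum(di * yi * h(xi) for di, (xi, yi) in zip(d, data))
            if gain > tau:
                best, tau = (t, y), gain
        if upper_bound(t, data, d) < tau:       # no supertree of t can beat tau
            continue                            # prune: do not extend this node
        stack.extend(rightmost_extensions(t, labels))
    return best, tau
```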
Classification algorithm • Modeled as a linear classifier • wt : weight of tree t • -b : bias (default class label)
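The boosted stumps fold directly into this linear form: since h_<t,y>(x) = y·(2·I(t ⊆ x) − 1), the weighted vote sum_k α_k·h_k(x) equals sum_t w_t·I(t ⊆ x) − b, with w_t collecting 2·α·y for every stump built on t and b = sum_k α_k·y_k, so the default label when no subtree matches is sgn(−b). A hedged sketch, using the (alpha, t, y) ensemble from the boosting sketch above:

```python
def classify(x, ensemble):
    """f(x) = sgn( sum_t w_t * I(t ⊆ x) - b ), folded from the boosted stumps."""
    b = sum(alpha * y for alpha, t, y in ensemble)            # bias
    score = sum(2.0 * alpha * y                               # w_t contributions
                for alpha, t, y in ensemble if contains(x, t)) - b
    return 1 if score >= 0 else -1
```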
SVMs and Tree Kernel [Collins 02] • Tree Kernel: all subtrees are expanded implicitly, i.e., a tree is mapped to a vector {0,…,1,…,1,…,1,…,0,…} indexed by subtrees • The feature spaces are essentially the same as ours • The learning strategies are different
SVM vs. Boosting [Rätsch 01] • Both are known as Large Margin Classifiers • The metric of the margin is different • SVM: L2-norm margin • w is expressed with a small number of examples (support vectors) • sparse solution in the example space • Boosting: L1-norm margin • w is expressed with a small number of features • sparse solution in the feature space
SVM vs. Boosting, cont. • Accuracy is task-dependent • Practical advantages of Boosting: • Good interpretability • Can analyze how the model performs or what kinds of features are useful • Compact features (rules) are easy to deal with • Fast classification • Complexity depends on the small number of rules • Kernel methods are too heavy
Sentence classifications • PHS: cell phone review classification (5,741 sent.) • domain: Web-based BBS on PHS, a sort of cell phone • categories: positive review or negative review • MOD: modality identification (1,710 sent.) • domain : editorial news articles • categories: assertion, opinion, or description positive: It is useful that we can know the date and time of E-Mails. negative: I feel that the response is not so good. assertion: We should not hold an optimistic view of the success of POKEMON. opinion: I think that now is the best time for developing the blue print. description: Social function of education has been changing.
Sentence representations • N-gram tree • each word simply modifies the next word • a subtree is an N-gram (N is unrestricted) • dependency tree • word-based dependency tree • a Japanese dependency parser, CaboCha, is used • bag-of-words (baseline) • [Figure: "response is very good" represented as an N-gram tree and as a dependency tree]
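One plausible way to build the N-gram tree representation (a sketch with our own Node class, not the authors' preprocessing): each word modifies the next word, so the sentence becomes a chain rooted at the last word, and every connected subtree of the chain is a contiguous word N-gram with unrestricted N.

```python
def ngram_tree(words):
    """Chain tree for a word sequence: word i is a child of word i+1."""
    root = Node(words[-1])
    node = root
    for w in reversed(words[:-1]):
        child = Node(w)
        node.children.append(child)
        node = child
    return root

# e.g. ngram_tree(["response", "is", "very", "good"]) yields the chain
# good -> very -> is -> response; the subtree very -> is corresponds to
# the bigram "is very".
```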
Results • Our method outperforms the baseline (BOW) • dep vs. n-gram: comparable (no significant difference) • SVMs show worse performance on some tasks (overfitting)
Interpretability • PHS dataset with dependency
A: subtrees that include "hard, difficult"
0.0004 be hard to hang up
-0.0006 be hard to read
-0.0007 be hard to use
-0.0017 be hard to …
B: subtrees that include "use"
0.0027 want to use
0.0002 use
0.0002 be in use
0.0001 be easy to use
-0.0001 was easy to use
-0.0007 be hard to use
-0.0019 is easier to use than…
C: subtrees that include "recharge"
0.0028 recharging time is short
-0.0041 recharging time is long
Interpretability, cont. • PHS dataset with dependency • Input: The LCD is large, beautiful and easy to see
weight w    subtree t
0.00368     be easy to
0.00353     beautiful
0.00237     be easy to see
0.00174     is large
0.00107     The LCD is large
0.00074     The LCD is …
0.00057     The LCD
0.00036     see
-0.00001    large
Advantages • Compact feature set • Boosting extracts only 1,783 unique features • The set sizes of distinct 1-gram, 2-gram, and 3-gram are 4,211, 24,206, and 43,658 respectively • SVMs implicitly use a huge number of features • Fast classification • Boosting: 0.531 sec. / 5,741 instances • SVM: 255.42 sec. / 5,741 instances • Boosting is about 480 times faster than SVMs
Conclusions • Assume that text is represented as a tree • Extension of decision stumps • all subtrees are potentially used as features • Boosting • Branch and bound • makes it possible to find the optimal rule efficiently • Advantages: • good interpretability • fast classification • comparable accuracy to SVMs with kernels
Future work • Other applications • Information extraction • semantic-role labeling • parse tree re-ranking • Confidence rated predictions for decision stumps
Thank you! • An implementation of our method is available as open source software at: http://chasen.naist.jp/~taku/software/bact/