Multi-label Classification without Multi-label Cost – Multi-label Random Decision Tree Classifier
IBM Research – China | IBM T.J. Watson Research Center
Presenter: Xiatian Zhang (xiatianz@cn.ibm.com)
Authors: Xiatian Zhang, Quan Yuan, Shiwan Zhao, Wei Fan, Wentao Zheng, Zhong Wang
Multi-label Classification
• Classical Classification (Single-label Classification)
• The classes are exclusive: if an example belongs to one class, it cannot belong to the others
• Multi-label Classification
• A picture, video, or article may belong to several compatible categories (e.g., an image tagged Tree, Winter, Park, Ice, Lake)
• A gene can control several biological functions
Existing Multi-label Classification Methods
• Grigorios Tsoumakas et al. [2007] summarize the existing methods for multi-label classification
• Two strategies:
• Problem Transformation
• Transform the multi-label classification problem into single-label classification problems
• Algorithm Adaptation
• Adapt single-label classifiers to solve the multi-label classification problem directly
• Typically with high complexity
Problem Transformation Approaches
• Label Powerset (LP)
• Label Powerset treats each unique subset of labels that occurs in the multi-label dataset (e.g., {L1, L2}, {L1, L3}, {L3, L4}) as a single class
• Binary Relevance (BR)
• Binary Relevance learns one binary classifier for each label (Classifier1 for L1+/L1-, Classifier2 for L2+/L2-, Classifier3 for L3+/L3-), as sketched below
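To make the two transformations concrete, here is a minimal sketch (not the authors' code), assuming X is a feature matrix and Y a 0/1 indicator matrix with one column per label; scikit-learn's DecisionTreeClassifier is only a placeholder base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def binary_relevance_fit(X, Y):
    """Binary Relevance: one independent binary classifier per label column."""
    return [DecisionTreeClassifier().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    # Stack the per-label predictions back into an indicator matrix.
    return np.column_stack([m.predict(X) for m in models])

def label_powerset_fit(X, Y):
    """Label Powerset: each distinct label subset becomes a single class."""
    subsets = [tuple(row) for row in Y]
    class_of = {s: i for i, s in enumerate(sorted(set(subsets)))}
    y_single = np.array([class_of[s] for s in subsets])
    return DecisionTreeClassifier().fit(X, y_single), class_of
```

The sketch shows why both strategies struggle with many labels: BR trains |L| separate models, while LP fragments the training data across every observed label subset.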
Large Number of Labels Problem
• Hundreds of labels or even more
• Text categorization
• Protein function classification
• Semantic annotation of multimedia
• The impact on multi-label classification methods
• Label Powerset: the number of training examples for each particular label subset becomes much smaller
• Binary Relevance: the computational complexity grows linearly with the number of labels
• Algorithm Adaptation: even worse than Binary Relevance
HOMER for the Large Number of Labels Problem
• HOMER (Hierarchy Of Multilabel classifERs) was developed by Grigorios Tsoumakas et al., 2008
• The HOMER algorithm constructs a hierarchy of multi-label classifiers, each one dealing with a much smaller set of labels
Our Method – Without Label Cost
• Without label cost
• Training time is almost independent of the number of labels |L|
• But with reliable quality
• Classification quality is comparable to mainstream methods over different datasets
• How do we achieve this?
Our Method – Without Label Cost (cont.)
• A Binary Relevance method based on Random Decision Trees
• Random Decision Tree [Fan et al., 2003]
• The training process does not depend on label information
• Random construction with very low cost
• Stable quality on many applications
Random Decision Tree – Tree Construction
• At each node, an unused feature is chosen randomly
• A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node
• A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen
• Construction stops when one of the following happens:
• A node becomes too small (<= 4 examples)
• The total height of the tree exceeds some limit, such as the total number of features
• The construction process does not depend on label information (see the sketch below)
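A minimal sketch of this construction procedure, written as illustrative Python rather than the authors' implementation. Rows are plain feature vectors indexed by feature number; labels are never consulted while building the structure. For simplicity a chosen feature is dropped from the rest of the path even when it is continuous, whereas the paper allows continuous features to be reused with new thresholds.

```python
import random

def build_tree(rows, features, depth=0, min_size=4, max_depth=None):
    max_depth = len(features) if max_depth is None else max_depth
    if len(rows) <= min_size or depth >= max_depth or not features:
        return {"leaf": True, "rows": rows}            # statistics are filled in later
    f = random.choice(features)                        # pick an unused feature at random
    threshold = random.choice([r[f] for r in rows])    # random threshold among observed values
    left  = [r for r in rows if r[f] <  threshold]
    right = [r for r in rows if r[f] >= threshold]
    if not left or not right:                          # degenerate split: make a leaf instead
        return {"leaf": True, "rows": rows}
    remaining = [g for g in features if g != f]
    return {"leaf": False, "feature": f, "threshold": threshold,
            "left":  build_tree(left,  remaining, depth + 1, min_size, max_depth),
            "right": build_tree(right, remaining, depth + 1, min_size, max_depth)}
```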
Random Decision Tree – Node Statistics
(Example tree: the root splits on F1 < 0.5, internal nodes split on F2 > 0.7 and F3 > 0.3; one leaf holds +:200 / -:10, another holds +:30 / -:70)
• Classification and probability estimation:
• Each node of the tree keeps the number of examples belonging to each class
• Collecting these node statistics costs very little computation
Random Decision Tree – Classification
• During classification, each tree outputs a posterior probability estimate from the leaf the example falls into, e.g. P(+|x) = 30/100 = 0.3 for the leaf holding +:30 / -:70
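Continuing the sketch above, a leaf simply stores per-class counts, and classification returns the leaf's class proportion; label_of is a hypothetical accessor mapping a training row to its class.

```python
from collections import Counter

def fill_statistics(node, label_of):
    """Store per-class example counts at every leaf of a built tree."""
    if node["leaf"]:
        node["counts"] = Counter(label_of(r) for r in node["rows"])
    else:
        fill_statistics(node["left"], label_of)
        fill_statistics(node["right"], label_of)

def predict_proba(node, x, positive="+"):
    """Route x to a leaf and return its positive-class proportion, e.g. 30/100 = 0.3."""
    if node["leaf"]:
        total = sum(node["counts"].values()) or 1
        return node["counts"][positive] / total
    branch = "left" if x[node["feature"]] < node["threshold"] else "right"
    return predict_proba(node[branch], x, positive)
```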
Random Decision Tree – Ensemble
• For an instance x, average the estimated probabilities of the individual trees and take the average as the predicted probability for x
• Example: one tree's matching leaf holds +:30 / -:70, so P(+|x) = 30/100 = 0.3; another tree's leaf holds +:30 / -:20, so P'(+|x) = 30/50 = 0.6; the ensemble outputs (P(+|x) + P'(+|x))/2 = 0.45
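In the same sketch, the ensemble step is just a plain average of the per-tree estimates, matching the (0.3 + 0.6)/2 = 0.45 example above.

```python
def ensemble_proba(trees, x, positive="+"):
    """Average the posterior estimates of all trees in the ensemble."""
    return sum(predict_proba(t, x, positive) for t in trees) / len(trees)
```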
Multi-label Random Decision Tree
• The trees are built exactly as before, but each leaf keeps the counts for every label (e.g. one tree's leaf holds L1+:30 / L1-:70 and L2+:50 / L2-:50; another tree's leaf holds L1+:30 / L1-:20 and L2+:20 / L2-:80)
• First tree: P(L1+|x) = 30/100 = 0.3, P(L2+|x) = 50/100 = 0.5
• Second tree: P'(L1+|x) = 30/50 = 0.6, P'(L2+|x) = 20/100 = 0.2
• Ensemble: (P(L1+|x) + P'(L1+|x))/2 = 0.45, (P(L2+|x) + P'(L2+|x))/2 = 0.35
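In the multi-label variant of the sketch, the tree structure is unchanged, but each leaf keeps one counter per label, so a single traversal yields estimates for all labels at once; labels_of is a hypothetical accessor returning the set of labels a training row carries.

```python
from collections import Counter

def fill_multilabel_statistics(node, labels_of):
    """Store per-label positive counts (and leaf size) at every leaf."""
    if node["leaf"]:
        node["n"] = len(node["rows"])
        node["label_counts"] = Counter()
        for r in node["rows"]:
            node["label_counts"].update(labels_of(r))
    else:
        fill_multilabel_statistics(node["left"], labels_of)
        fill_multilabel_statistics(node["right"], labels_of)

def predict_all_labels(node, x):
    """Route x to a leaf and return an estimate for every label in one pass."""
    if node["leaf"]:
        n = node["n"] or 1
        return {label: count / n for label, count in node["label_counts"].items()}
    branch = "left" if x[node["feature"]] < node["threshold"] else "right"
    return predict_all_labels(node[branch], x)
```

This is where the "without multi-label cost" property comes from in the sketch: the label set only affects the cheap counting at the leaves, not the random construction of the trees.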
Why Does RDT Work?
• Ensemble learning view
• Our analysis
• Other explanations
• Non-parametric estimation
Complexity of Multi-label Random Decision Tree
• Training complexity:
• m is the number of trees and n is the number of training instances
• t is the average number of labels on each leaf node, with t << n and t << |L|
• It is independent of the number of labels |L|
• Complexity of C4.5: V_i is the number of values of the i-th attribute
• Complexity of HOMER:
• Test complexity:
• q is the average depth of the tree branches
• It is also independent of the number of labels |L|
Experiment – Metrics and Datasets
• Quality metrics:
• Datasets:
Future Work
• Leverage the relationships among labels
• Apply ML-RDT to recommendation
• Parallelization and streaming implementation