An Ensemble-based Approach to Fast Classification of Multi-label Data Streams
Xiangnan Kong, Philip S. Yu
Dept. of Computer Science, University of Illinois at Chicago
Introduction: Data Stream
• Data Stream: a high-speed flow of data that arrives continuously and changes over time
• Applications:
  • online message classification
  • network traffic monitoring
  • credit card transaction classification
Introduction: Stream Classification
• Stream Classification:
  • Construct a classification model on past stream data
  • Use the model to predict the class labels of incoming data
[Figure: labeled training data from the stream are used to train a classification model, which then classifies incoming unlabeled data]
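To make this cycle concrete, below is a minimal Python sketch of the usual test-then-train loop for evaluating a stream classifier. The `model` object and its `predict()`/`update()` methods are hypothetical names for illustration, not from the paper's code.

```python
# Minimal test-then-train loop for stream classification (a sketch, not the
# paper's code). `model` is any incremental classifier with hypothetical
# predict()/update() methods; `stream` yields (features, true_labels) pairs.
def run_stream(model, stream):
    predictions = []
    for x, y in stream:
        predictions.append(model.predict(x))  # classify before the label is revealed
        model.update(x, y)                    # then train incrementally (one pass)
    return predictions
```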
Multi-Label Stream Data
• Conventional Stream Classification: single-label settings, which assume each stream object can have only one label
• In many real applications, one stream object can have multiple labels
[Figure: a news article and emails, each tagged with several labels at once, e.g. Company, Legendary, Sad]
Multi-Label Stream Classification
• Traditional Stream Classification: each object (instance) in the stream is associated with a single label
• Multi-label Stream Classification: each object (instance) in the stream can be associated with multiple labels
The Problem
• Stream Data
  • Huge data volume + limited memory: cannot store the entire dataset for training; requires a one-pass algorithm over the stream
  • High speed: data must be processed promptly
  • Concept drifts: old data become outdated
• Multi-label Classification
  • Large number of possible label sets (exponential in the number of labels)
  • Conventional multi-label classification approaches focus on offline settings and cannot be applied here
Our Solution
• Random trees: very fast in training and testing
• Ensemble of multiple trees: effective, and reduces the prediction variance
• Statistics of multiple labels on the tree nodes: enables effective training and testing on multiple labels
• Fading function: reduces the influence of old data
Multi-label Random Tree
• Conventional Decision Trees:
  • Multiple passes over the dataset
  • Variable selection at each node split
  • Single-label prediction
  • Static updates; use the entire dataset, including outdated data
• Multi-label Random Tree:
  • Single pass over the data
  • Split each node on a random variable with a random threshold
  • Ensemble of multiple trees
  • Multi-label predictions
  • Fading out old data
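To make the contrast concrete, here is a minimal Python sketch of such a random tree: the structure is generated with random split variables and thresholds, and each node carries the multi-label statistics listed on the next slide. Feature values are assumed to be scaled to [0, 1], and the fixed-depth, built-up-front construction is an illustrative simplification, not necessarily how the paper's trees are grown.

```python
import random

class Node:
    """One node of a multi-label random tree: a random (feature, threshold)
    split plus aggregated multi-label statistics updated from the stream."""
    def __init__(self, n_features, n_labels, depth, max_depth):
        self.left = self.right = None
        if depth < max_depth:
            self.feature = random.randrange(n_features)  # random split variable
            self.threshold = random.random()             # random threshold in [0, 1]
            self.left = Node(n_features, n_labels, depth + 1, max_depth)
            self.right = Node(n_features, n_labels, depth + 1, max_depth)
        self.label_counts = [0.0] * n_labels  # aggregated label relevance vector
        self.n_instances = 0.0                # aggregated number of instances
        self.cardinality = 0.0                # aggregated label set cardinalities
        self.last_update = 0.0                # time stamp of the latest update

def build_ensemble(n_trees, n_features, n_labels, max_depth=8):
    """An ensemble is simply several independently randomized trees."""
    return [Node(n_features, n_labels, 0, max_depth) for _ in range(n_trees)]
```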
Training: Update Trees
• Each incoming instance is routed through every tree in the ensemble, and the node statistics along its path are updated
[Figure: an instance traversing Tree 1 … Tree Nt, updating node statistics at each visited node]
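A sketch of this update step, building on the `Node` class above: the instance is routed down each tree by the random splits, and every node on the path accumulates the instance's labels. (Fading, described on the next slide, would be applied to each node before the increments.)

```python
def update_tree(root, x, labels, t):
    """Route one instance (feature vector x, label set `labels`, arrival
    time t) down a random tree, updating statistics on every node visited."""
    node = root
    while node is not None:
        node.n_instances += 1.0
        node.cardinality += len(labels)       # size of this instance's label set
        for l in labels:
            node.label_counts[l] += 1.0       # label relevance counts
        node.last_update = t
        if node.left is None:                 # reached a leaf
            break
        node = node.left if x[node.feature] <= node.threshold else node.right

def update_ensemble(trees, x, labels, t):
    for root in trees:
        update_tree(root, x, labels, t)
```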
On the Tree Nodes
• Statistics on each node:
  • Aggregated label relevance vector
  • Aggregated number of instances
  • Aggregated label set cardinalities
  • Time stamp of the latest update
• Fading function:
  • The statistics are rescaled with a time fading function
  • Reduces the effect of old data on the node statistics
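The fading can be applied lazily: whenever a node is touched, its statistics are first rescaled by how much time has passed since its last update. The exponential form and the decay constant in this sketch are assumptions for illustration; the paper's exact fading function may differ.

```python
FADE = 0.995  # per-time-step decay factor (illustrative value, not from the paper)

def fade(node, t):
    """Rescale a node's statistics by a time fading function so that old
    data contribute less; call this before reading or updating the node."""
    decay = FADE ** (t - node.last_update)   # assumed form: f(dt) = lambda ** dt
    node.n_instances *= decay
    node.cardinality *= decay
    node.label_counts = [c * decay for c in node.label_counts]
    node.last_update = t
```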
Prediction
• Route the incoming instance to a leaf in each tree and aggregate the predictions across the ensemble
• Use the aggregated label relevance to rank all possible labels
• Use the aggregated set cardinality to decide how many labels are included in the predicted label set
[Figure: an unlabeled instance traversing Tree 1 … Tree Nt; the per-tree predictions are aggregated]
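Putting the pieces together, a sketch of prediction using the `Node` and `fade` sketches above: each tree routes the instance to a leaf, the faded leaf statistics are aggregated, labels are ranked by aggregated relevance, and the ranking is cut off at the average observed label-set cardinality.

```python
def predict(trees, x, t):
    """Rank labels by aggregated relevance across the ensemble and keep the
    top k, where k is the aggregated average label set cardinality."""
    n_labels = len(trees[0].label_counts)
    relevance = [0.0] * n_labels
    total_card = total_n = 0.0
    for root in trees:
        node = root
        while node.left is not None:          # descend to the leaf for x
            node = node.left if x[node.feature] <= node.threshold else node.right
        fade(node, t)                         # discount outdated statistics first
        n = max(node.n_instances, 1e-9)
        for l in range(n_labels):
            relevance[l] += node.label_counts[l] / n  # per-leaf label relevance
        total_card += node.cardinality
        total_n += n
    k = int(round(total_card / max(total_n, 1e-9)))   # predicted label set size
    ranking = sorted(range(n_labels), key=lambda l: -relevance[l])
    return set(ranking[:k]), relevance        # label set plus ranking scores
```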
Experiment Setup
• Three methods are compared:
  • Stream Multi-lAbel Random Tree (SMART): multi-label stream classification with random trees [this paper]
  • SMART (static): SMART without the fading function; keeps updating the trees without fading
  • Multi-label kNN: a state-of-the-art multi-label classification method + sliding window
Data Sets
• Three multi-label stream classification datasets:
  • MediaMill: video annotation task, from the "MediaMill Challenge"
  • TMC2007: text classification task, from the SDM text mining competition
  • RCV1-v2: large-scale text classification task, from the Reuters dataset
[Table: number of instances, labels, and features, and label density for each dataset]
Evaluation
• Multi-Label Metrics [Elisseeff & Weston, NIPS'02]
  • Ranking Loss ↓
    • Evaluates performance on the probability (ranking) outputs
    • Average fraction of label pairs that are ranked incorrectly
    • The smaller the better
  • Micro F1 ↑
    • Evaluates performance on the predicted label sets
    • Considers the micro-averages of both precision and recall
    • The larger the better
• Sequential evaluation with concept drifts, simulated by mixing two streams
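For reference, minimal Python implementations of the two metrics as commonly defined (following Elisseeff & Weston); the evaluation code used in the paper may differ in details such as tie handling.

```python
def ranking_loss(scores, relevant, n_labels):
    """Fraction of (relevant, irrelevant) label pairs ordered incorrectly."""
    irrelevant = set(range(n_labels)) - relevant
    if not relevant or not irrelevant:
        return 0.0
    bad = sum(1 for r in relevant for i in irrelevant if scores[r] <= scores[i])
    return bad / (len(relevant) * len(irrelevant))

def micro_f1(pred_sets, true_sets):
    """Micro-averaged F1: pool true/false positives and false negatives
    over all instances in the stream, then compute 2TP / (2TP + FP + FN)."""
    tp = sum(len(p & t) for p, t in zip(pred_sets, true_sets))
    fp = sum(len(p - t) for p, t in zip(pred_sets, true_sets))
    fn = sum(len(t - p) for p, t in zip(pred_sets, true_sets))
    return 2 * tp / max(2 * tp + fp + fn, 1)
```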
Effectiveness: MediaMill Dataset (Ranking Loss)
• Our approach with multi-label streaming random trees performed best on the MediaMill dataset
[Figure: ranking loss (lower is better) over the stream position (× 4,300 instances) for SMART, SMART (static, without fading function), and Multi-Label kNN with window sizes w = 100, 200, 400]
Effectiveness: MediaMill Dataset (Micro F1)
[Figure: Micro F1 (higher is better) over the stream position (× 4,300 instances) for SMART, SMART (static, without fading function), and Multi-Label kNN with window sizes w = 100, 200, 400]
Experiment Results
[Figures: Micro F1 and Ranking Loss on the MediaMill, RCV1-v2, and TMC2007 datasets]
Conclusions
• An ensemble-based approach to fast classification of multi-label data streams
  • Ensemble-based approach: effective
  • Predicts multiple labels
  • Very fast in training (updating node statistics) and in prediction, thanks to the random trees: efficient
Thank you!