An Ensemble-based Approach to Fast Classification of Multi-label Data Streams

Xiangnan Kong, Philip S. Yu
Dept. of Computer Science, University of Illinois at Chicago

Introduction: Data Stream
• Data stream: a high-speed flow of data that arrives continuously and changes over time
• Applications:
  • online message classification
  • network traffic monitoring
  • credit card transaction classification

Introduction: Stream Classification
• Stream classification:
  • Construct a classification model on past stream data
  • Use the model to predict the class label for incoming data
[Figure: a classification model is trained on labeled training data (+/-) and classifies incoming, unlabeled data (?) from the data stream]

Multi-Label Stream Data
• Conventional stream classification assumes a single-label setting: one stream object can have only one label
• In many real applications, one stream object can have multiple labels
[Figure: example stream objects (news articles, emails), each with several labels such as "Company", "Legendary", "Sad", …]

Multi-Label Stream Classification
• Traditional stream classification: each object (instance) in the stream has one label
• Multi-label stream classification: each object (instance) in the stream has a set of labels

The Problem
• Stream data
  • Huge data volume + limited memory: the entire dataset cannot be stored for training, so a one-pass algorithm over the stream is required
  • High speed: data must be processed promptly
  • Concept drift: old data become outdated
• Multi-label classification
  • The number of possible label sets is large (exponential in the number of labels)
  • Conventional multi-label classification approaches focus on offline settings and cannot be applied here

Our Solution
• Random trees: very fast in training and testing
• Ensemble of multiple trees: effective, and reduces the prediction variance
• Statistics of multiple labels on the tree nodes: effective training/testing on multiple labels
• Fading function: reduces the influence of old data

Multi-label Random Tree
• Conventional decision trees
  • Multiple passes over the dataset
  • Variable selection at each node split
  • Single-label prediction
  • Static updates that use the entire dataset, including outdated data
• Multi-label random tree
  • Single pass over the data
  • Each node is split on a random variable with a random threshold
  • Ensemble of multiple trees
  • Multi-label predictions
  • Fading out old data
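The random-split idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the fixed `max_depth`, and the assumption that every feature is scaled to [0, 1] are hypothetical choices made only for illustration. The point it shows is that the tree structure is grown without inspecting any data, so no variable selection pass is needed.

```python
import random

class Node:
    """One node of a random tree (hypothetical structure for illustration)."""
    def __init__(self):
        self.feature = None    # index of the randomly chosen split variable
        self.threshold = None  # randomly chosen split threshold
        self.left = None
        self.right = None

def build_random_tree(n_features, max_depth, rng, depth=0):
    """Grow a tree without looking at data: every split is random.

    Assumes (for this sketch only) that features are scaled to [0, 1].
    """
    node = Node()
    if depth < max_depth:
        node.feature = rng.randrange(n_features)
        node.threshold = rng.uniform(0.0, 1.0)
        node.left = build_random_tree(n_features, max_depth, rng, depth + 1)
        node.right = build_random_tree(n_features, max_depth, rng, depth + 1)
    return node

def path_to_leaf(root, x):
    """Return the nodes visited by instance x, from the root to a leaf."""
    path, node = [], root
    while True:
        path.append(node)
        if node.feature is None:  # leaf reached
            return path
        node = node.left if x[node.feature] < node.threshold else node.right

rng = random.Random(0)
forest = [build_random_tree(n_features=4, max_depth=3, rng=rng) for _ in range(5)]
x = [0.2, 0.9, 0.5, 0.1]
path_lengths = [len(path_to_leaf(tree, x)) for tree in forest]
```

Because each tree is a full binary tree of depth 3 in this sketch, every instance visits exactly four nodes per tree; in a streaming setting those visited nodes are where per-node label statistics would be updated.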

Training: Update Trees
[Figure: each training instance is routed through Tree 1 … Tree Nt, and the node statistics are updated in every tree]

On the Tree Nodes
• Statistics kept on each node
  • Aggregated label relevance vector
  • Aggregated number of instances
  • Aggregated label set cardinalities
  • Time stamp of the latest update
• Fading function
  • The statistics are rescaled with a time fading function
  • This reduces the effect of old data on the node statistics
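A minimal Python sketch of such node statistics follows. The slides do not give the form of the fading function, so the exponential decay with a `half_life` parameter used here is an assumption, as are the class and attribute names. The sketch stores the four statistics listed above and rescales the three aggregates by the decay factor before each update.

```python
class NodeStats:
    """Per-node statistics with time fading (hypothetical sketch)."""

    def __init__(self, n_labels, half_life=100.0):
        self.relevance = [0.0] * n_labels  # aggregated label relevance vector
        self.count = 0.0                    # aggregated number of instances
        self.cardinality = 0.0              # aggregated label set cardinalities
        self.last_update = 0.0              # time stamp of the latest update
        self.half_life = half_life          # assumed decay parameter

    def _fade(self, now):
        # Assumed fading function: weights halve every `half_life` time units.
        factor = 2.0 ** (-(now - self.last_update) / self.half_life)
        self.relevance = [r * factor for r in self.relevance]
        self.count *= factor
        self.cardinality *= factor
        self.last_update = now

    def update(self, label_set, now):
        """Fade the old statistics, then add one instance with its label set."""
        self._fade(now)
        for label in label_set:
            self.relevance[label] += 1.0
        self.count += 1.0
        self.cardinality += len(label_set)

stats = NodeStats(n_labels=3, half_life=10.0)
stats.update({0, 2}, now=0.0)   # first instance, labels 0 and 2
stats.update({2}, now=10.0)     # one half-life later: old statistics are halved
```

After the second update the first instance counts only half as much as the second, so `stats.relevance` is `[0.5, 0.0, 1.5]` and `stats.count` is `1.5`. Storing the time stamp lets the rescaling happen lazily, only when a node is touched.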

Prediction
[Figure: the predictions of Tree 1 … Tree Nt are aggregated for each incoming instance]
• Use the aggregated label relevance to rank all possible labels
• Use the aggregated label set cardinality to decide how many labels are included in the predicted label set
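The two prediction steps can be sketched in Python as follows. The exact aggregation rule is not given on the slide, so this sketch assumes that the statistics of the node reached in each tree are summed across trees and normalized by the summed instance count; the function name `predict` and the tuple layout are hypothetical.

```python
def predict(node_stats):
    """Aggregate per-tree node statistics into scores and a label set.

    node_stats: one (relevance_vector, instance_count, cardinality_sum)
    tuple per tree (assumed layout for this sketch).
    """
    n_labels = len(node_stats[0][0])
    total_count = sum(count for _, count, _ in node_stats)
    if total_count == 0:
        return [0.0] * n_labels, []
    # Step 1: aggregated relevance score per label, used for ranking.
    scores = [sum(rel[j] for rel, _, _ in node_stats) / total_count
              for j in range(n_labels)]
    # Step 2: aggregated average cardinality decides how many labels to keep.
    avg_cardinality = sum(card for _, _, card in node_stats) / total_count
    k = min(n_labels, max(0, round(avg_cardinality)))
    ranking = sorted(range(n_labels), key=lambda j: -scores[j])
    return scores, sorted(ranking[:k])

node_stats = [
    ([3.0, 1.0, 2.0], 4.0, 6.0),    # statistics from the node reached in tree 1
    ([5.0, 1.0, 4.0], 6.0, 10.0),   # statistics from the node reached in tree 2
]
scores, labels = predict(node_stats)
```

Here the aggregated scores are roughly `[0.8, 0.2, 0.6]` and the average cardinality is 1.6, which rounds to two labels, so the predicted label set is `[0, 2]`.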

Experiment Setup
• Three methods are compared:
  • Stream Multi-lAbel Random Tree (SMART): multi-label stream classification with random trees [this paper]
  • SMART without fading function, i.e. SMART (static): keeps updating the trees without fading
  • Multi-label kNN: state-of-the-art multi-label classification method + sliding window

Data Sets
• Three multi-label stream classification datasets:
  • MediaMill: video annotation task, from the "MediaMill Challenge"
  • TMC2007: text classification task, from the SDM text mining competition
  • RCV1-v2: large-scale text classification task, from the Reuters dataset
[Table: # instances, # labels, # features, and label density for each dataset]

Evaluation
• Multi-label metrics [Elisseeff & Weston, NIPS’02]
  • Ranking Loss ↓
    • Evaluates the performance on the probability outputs
    • Average fraction of label pairs that are ranked incorrectly
    • The smaller the better
  • Micro F1 ↑
    • Evaluates the performance on label set prediction
    • Considers the micro-averages of both precision and recall
    • The larger the better
• Sequential evaluation with concept drifts
  • Mixing two streams
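Both metrics can be computed with a few lines of Python. The sketch below follows the common definitions: ranking loss is the fraction of (relevant, irrelevant) label pairs in which the relevant label does not score strictly higher, averaged over instances, and micro F1 pools true positives, false positives, and false negatives over all instances and labels. Counting ties as errors and skipping instances with no relevant or no irrelevant labels are conventions assumed here, and the function names are illustrative.

```python
def ranking_loss(scores, true_sets):
    """Average fraction of (relevant, irrelevant) label pairs ranked incorrectly."""
    losses = []
    for s, y in zip(scores, true_sets):
        relevant = [j for j in range(len(s)) if j in y]
        irrelevant = [j for j in range(len(s)) if j not in y]
        if not relevant or not irrelevant:
            continue  # no pairs to compare for this instance
        wrong = sum(1 for a in relevant for b in irrelevant if s[a] <= s[b])
        losses.append(wrong / (len(relevant) * len(irrelevant)))
    return sum(losses) / len(losses) if losses else 0.0

def micro_f1(pred_sets, true_sets):
    """F1 computed from counts pooled over all instances and labels."""
    tp = sum(len(p & y) for p, y in zip(pred_sets, true_sets))
    fp = sum(len(p - y) for p, y in zip(pred_sets, true_sets))
    fn = sum(len(y - p) for p, y in zip(pred_sets, true_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

scores = [[0.9, 0.7, 0.6], [0.1, 0.7, 0.3]]   # per-label scores for two instances
true_sets = [{0, 2}, {1}]                     # ground-truth label sets
pred_sets = [{0, 1}, {1}]                     # predicted label sets
rl = ranking_loss(scores, true_sets)
f1 = micro_f1(pred_sets, true_sets)
```

In this toy example one of the two label pairs of the first instance is mis-ordered and none of the second, giving a ranking loss of 0.25; with two true positives, one false positive, and one false negative, micro F1 is 2/3.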

Throughput / Efficiency

Effectiveness: MediaMill Dataset (Ranking Loss)
• Our approach with multi-label streaming random trees performed best on the MediaMill dataset
[Figure: ranking loss (lower is better) over the stream (x 4,300 instances) for SMART, SMART (static) without fading function, and Multi-Label kNN (w=100, w=200, w=400)]

Effectiveness: MediaMill Dataset (Micro F1)
[Figure: Micro F1 (higher is better) over the stream (x 4,300 instances) for SMART, SMART (static) without fading function, and Multi-Label kNN (w=100, w=200, w=400)]

Experiment Results
[Figure: Micro F1 and ranking loss on the MediaMill, RCV1-v2, and TMC2007 datasets]

Conclusions
• An ensemble-based approach for fast classification of multi-label data streams
  • Ensemble-based approach (effective)
  • Predicts multiple labels
  • Very fast in training/updating node statistics and in prediction, using random trees (efficient)
Thank you!
