Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams
Traditional Machine Learning
• Create train/test splits of the data (possibly via cross-validation)
• Load ALL the training data into main memory
• Compute a model from the training data (may involve multiple passes)
• Load all the test data into main memory
• Compute accuracy measures from the test data
Consequences
• Data is processed one instance at a time
• Very few incremental methods exist – none used seriously in practice
• Many existing techniques don't scale
• Machine Learning is perceived as a small-to-medium data set field of study
• Larger data sets are tackled through sampling, or by building several models on separate portions and combining their predictions
Data Streams
• Takes the "stream" view; the source may be finite or infinite
• The concept of train/test is less well defined: we could train for a while, then test for a while – but what is the definition of "a while"?
• Whatever you do, you can be sure that ALL the data will NOT fit in main memory
Data Stream Constraints
• Cannot store instances (not all of them, anyway)
• Cannot use more than the available memory – no swapping to disk
• Cannot dwell too long on an instance – must keep up with the incoming rate of instances
• Cannot wait to make predictions – need to be ready to make predictions at any time
Scaling up existing methods
• Could learn models using existing methods in batches and then merge the models
• Could merge instances (meta-instances)
• Could use a cache model where we keep a set of models and update the cache over time – e.g. least-recently-used or least-frequently-used strategies
• Could do the above but use performance measures to decide the make-up of the cache
Caching in Data Communications
• Web proxy caches provide a good model for what we need to satisfy the stream constraints
• Real caches are hierarchical (e.g. Squid)
• The hierarchy provides a mechanism for sharing the load and increasing the likelihood of a page hit
• When full, a cache needs a replacement policy
• To replicate this system we need to design a hierarchy, fill it (with models) and implement a model replacement policy
General CHIC Architecture
• Idea: build a hierarchy of N levels as follows:
• Level zero: a data buffer fed from the stream
• Level one: build models from the data at level zero
• Levels two to N-1: fill with the "best" models from the lower levels
• Level N: adopt models from level N-1, but also discard models so that new ones can enter
• For prediction, use all models in the hierarchy and vote
A minimal sketch of this control loop follows.
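The following is a minimal sketch of the buffering and level-one step, not the original implementation: the class and method names (CHIC, observe, _process_buffer) and the (model, tag) level representation are illustrative assumptions, and the promotion/replacement step is deferred to a later sketch.

```python
# Minimal sketch of the CHIC control loop. Class and method names are
# illustrative, not taken from the original implementation.

class CHIC:
    def __init__(self, learners, n_levels=6, buffer_size=1000):
        self.learners = learners            # dict: tag -> model factory
        self.buffer = []                    # level zero: raw instances
        self.buffer_size = buffer_size
        self.levels = [[] for _ in range(n_levels)]  # lists of (model, tag)

    def observe(self, x, y):
        """Consume one labelled instance from the stream."""
        self.buffer.append((x, y))
        if len(self.buffer) == self.buffer_size:
            self._process_buffer()
            self.buffer = []

    def _process_buffer(self):
        # level one: build one model per base learner from the buffer
        X = [x for x, _ in self.buffer]
        y = [t for _, t in self.buffer]
        for tag, make_model in self.learners.items():
            self.levels[0].append((make_model().fit(X, y), tag))
        # promotion/replacement between levels would run here
        # (see the promotion sketch later in the document)
```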
Features
• Can use any machine learning algorithm to build models
• Implements a form of batch-incremental learning
• The replacement policy can be performance based
• As with the web cache, CHIC fills up initially and then keeps the best-performing models
• If a variety of models is used at the lower levels, then it is possible to adapt to the source of the data
Experimental Design
• Try to demonstrate adaptation to the data source
• Learn a mixture of models at levels one and two and let the performance-based promotion mechanism take over
• Evaluation raises two issues:
• Need a performance measure (for model promotion/deletion)
• Overall performance of the hierarchy (adopt a test-then-train approach, sketched below)
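A test-then-train loop can be sketched as follows; `chic` is the CHIC object sketched earlier, and `predict_fn` is any voting predictor over the hierarchy (one is sketched near the end of this document). The function name and interface are assumptions for illustration.

```python
# Sketch of test-then-train evaluation: each labelled instance is first
# used to test the current hierarchy, then handed to it for training.
def test_then_train(chic, stream, predict_fn):
    correct = seen = 0
    for x, y in stream:
        if any(chic.levels):              # score only once models exist
            correct += int(predict_fn(chic.levels, x) == y)
            seen += 1
        chic.observe(x, y)                # then train on the instance
    return correct / max(seen, 1)
```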
Data Sources
• Random Naïve Bayes: randomly weight attribute labels and class values (here we use 5 classes, 5 nominal and 5 numeric attributes)
• Random Tree: choose a depth and randomly assign splitter nodes; here 5 nodes deep, leaves starting from depth 3, with the same number of attributes/classes as above
• Random RBF: a random set of centers for each class; each center is weighted and has a standard deviation – here 50 centers, 10 numeric attributes and 5 classes (a generator sketch follows this list)
• Real data: Forest covertype (UCI repository), 7 classes, 500K instances
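As a concrete illustration of the Random RBF source, here is a minimal generator sketch. Only the counts (50 centers, 10 numeric attributes, 5 classes) come from the description above; the seed, the uniform placement of centers, and the noise-scale range are assumptions.

```python
import numpy as np

# Hypothetical Random RBF stream source: 50 weighted centres, each with
# its own standard deviation and class label, over 10 numeric attributes.
rng = np.random.default_rng(seed=1)
n_centres, n_attrs, n_classes = 50, 10, 5
centres = rng.uniform(0, 1, size=(n_centres, n_attrs))
stddevs = rng.uniform(0.05, 0.2, size=n_centres)       # assumed range
weights = rng.uniform(0, 1, size=n_centres)
weights /= weights.sum()
labels = rng.integers(0, n_classes, size=n_centres)    # class per centre

def next_instance():
    """Pick a centre by weight, then add Gaussian noise around it."""
    c = rng.choice(n_centres, p=weights)
    x = centres[c] + rng.normal(0, stddevs[c], size=n_attrs)
    return x, labels[c]
```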
Specific CHIC Architecture
• Six levels with 16 models per level
• Data buffer of size 1000 instances
• The first level uses four algorithms to generate models:
• Naïve Bayes (N), C4.5 (J), linear SVM (L), RBF-kernel SVM (R)
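For concreteness, here are scikit-learn stand-ins for the four base learners; these approximate, but are not, the original implementations (DecisionTreeClassifier, for example, is a CART-style stand-in for C4.5).

```python
# Stand-ins for the four level-one learners, keyed by their one-letter
# tags. Each value is a zero-argument factory for a fresh model.
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

LEARNERS = {
    "N": GaussianNB,                      # Naïve Bayes
    "J": DecisionTreeClassifier,          # C4.5-style decision tree
    "L": LinearSVC,                       # linear SVM
    "R": lambda: SVC(kernel="rbf"),       # RBF-kernel SVM
}
```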
Example
• Read 1000 instances into the buffer and build 4 models; repeat on the next 3 buffers – level one is now full
• Read the next 1000 instances and build 4 more models
• Using the buffer data as test data, evaluate all models at level 1; promote the best 4 (groupwise) to level 2 and replace the worst 4 with the new ones (see the sketch below)
• Continue in this manner (at 12,000 instances level 2 will be full)
• Note: levels 1 and 2 always hold 4 models from each of the 4 groups
• From level 2 upwards, ONLY the best models are promoted (this is what adapts to the source)
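The level-one update can be sketched like this. The helper names (accuracy, best_per_group, update_level_one) are illustrative, each level is assumed to be a list of (model, tag) pairs as in the earlier sketch, and `buffer` holds the latest 1000 labelled instances.

```python
# Sketch of one level-one promotion/replacement step.
def accuracy(model, buffer):
    X = [x for x, _ in buffer]
    y = [t for _, t in buffer]
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

def best_per_group(level, buffer):
    """Best model of each group (N, J, L, R) by accuracy on the buffer."""
    best = {}
    for model, tag in level:
        acc = accuracy(model, buffer)
        if tag not in best or acc > best[tag][0]:
            best[tag] = (acc, model)
    return [(model, tag) for tag, (acc, model) in best.items()]

def update_level_one(level, level_two, new_models, buffer):
    # copies of the groupwise best four move up to level two ...
    level_two.extend(best_per_group(level, buffer))
    # ... and the worst four are replaced by the freshly built models
    level.sort(key=lambda mt: accuracy(mt[0], buffer))
    level[:4] = new_models
```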
Example Continued
• Four models are promoted to level 3 and the 4 worst are deleted, freeing 8 spaces
• Levels 3, 4 and 5 work on the same basis; level 6 simply deletes the worst 4 models to free up space
• At prediction time all models have an equal vote and the classification decision is arrived at by majority (see the voting sketch below)
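Majority voting over the whole hierarchy is then straightforward; this sketch assumes the (model, tag) level structure used in the earlier sketches.

```python
from collections import Counter

# Equal-vote prediction across every model in the hierarchy;
# `levels` is a list of lists of (model, tag) pairs, as above.
def predict(levels, x):
    votes = Counter(model.predict([x])[0]
                    for level in levels for model, _ in level)
    return votes.most_common(1)[0][0]
```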
Results
Each snapshot shows the hierarchy contents, level 1 first and levels separated by "/", with one character per model slot: N = Naïve Bayes, J = C4.5, L = linear SVM, R = RBF-kernel SVM, - = empty slot.
• After 1000 instances: N---J---L---R---
• After 2000 instances: NN--JJ--LL--RR--
• After 3000 instances: NNN-JJJ-LLL-RRR-
• After 4000 instances: NNNNJJJJLLLLRRRR
• After 5000 instances: N-NNJJ-JL-LLR-RR / NJLR------------
• After 6000 instances: NNNNJJJJLLLLRRRR / NJLR------------
• After 7000 instances: NNN-J-JJL-LLR-RR / NJLRNJLR--------
• After 8000 instances: NNNNJJJJLLLLRRRR / NJLRNJLR--------
Continued
• After 9000 instances: NN-N-JJJLL-L-RRR / NJLRNJLRNJLR----
• After 10000 instances: NNNNJJJJLLLLRRRR / NJLRNJLRNJLR----
• After 11000 instances: -NNN-JJJLLL-RR-R / NJLRNJLRNJLRNJLR
• After 12000 instances: NNNNJJJJLLLLRRRR / NJLRNJLRNJLRNJLR
• After 13000 instances: NN-NJJ-JLL-LRR-R / NNLJNLLRN-L-N-L- / JJJJ------------
Model Adaptation to Source
Hierarchy contents per source, one row per level (level 1 at the top, level 6 at the bottom). On the random tree source the upper levels fill with C4.5 (J) models; on the random naive Bayes source they fill with Naïve Bayes (N) models.

Random tree:
N-NN-JJJLL-LR-RR
NNNNLNNLJLRNNJLR
JJJJJNNLJJ---JL-
JJJJJJJJJJJJJJJJ
JJJJ--JJ-JJJJJ-J
JJJJJJJJJJJJJJJJ

Random naive Bayes:
-NNNJJ-JLLL-R-RR
NJJJLLLLRNJJJLLR
NNNNNJJNJJ-NN---
NNNNNNNNNNNNNNNN
NNNNNNNN---NNNN-
NNNNNNNNNNNNNNNN
Continued
The random RBF source fills the upper levels with RBF-kernel SVMs (R); covertype favours the linear SVM (L).

Random RBF:
NNN--JJJLL-LRR-R
NNJJLRNJNNJJNJLR
RRJRJJRRRRJRRJJJ
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR

Covertype:
N-NN-JJJ-LLLR-RR
NNJJJRJLNRJ---R
LJJLLJLJLJL--N--
LLLLLLLLLLLLJLLJ
LLLLLLLLJL-J-L--
LLLLLLLLLLLLLLLL
Conclusion
• Novel architecture for data streams
• Operates much like a web cache (hierarchy and replacement policy)
• Provides a scaling-up mechanism for arbitrary classifiers (batch-incremental)
• Can be adapted to clustering, regression, and association rule learning
• Thousands of options still to explore!