Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams
Traditional Machine Learning
• Create train/test splits of the data (possibly via cross-validation)
• Load ALL the training data into main memory
• Compute a model from the training data (may involve multiple passes)
• Load all the test data into main memory
• Compute accuracy measures from the test data
Consequences
• Data is processed one instance at a time
• Very few incremental methods exist – none used seriously in practice
• Many existing techniques don't scale
• Machine Learning is perceived as a small-to-medium data set field of study
• Larger data sets are tackled through sampling, or by building several models on separate portions and combining their predictions
Data Streams
• Takes the "stream" view; the source may be finite or infinite
• The concept of train/test is less well defined: we could train for a while, then test for a while – but what is the definition of "a while"?
• Whatever you do, you can be sure that ALL the data will NOT fit in main memory
Data Stream Constraints
• Cannot store instances (not all of them, anyway)
• Cannot use more than the available memory – no swapping to disk
• Cannot dwell too long on an instance – must keep up with the incoming rate of instances
• Cannot wait to make predictions – need to be ready to make predictions at any time
Scaling up existing methods
• Could learn models using existing methods in batches and then merge the models
• Could merge instances (meta-instances)
• Could use a cache model where we keep a set of models and update the cache over time – e.g. least-recently-used or least-frequently-used strategies
• Could do the above but use performance measures to decide the make-up of the cache
Caching in Data Communications
• Web proxy caches provide a good model for what we need to satisfy the stream constraints
• Real caches are hierarchical (e.g. Squid)
• The hierarchy provides a mechanism for sharing the load and increasing the likelihood of a page hit
• When full, a cache needs a replacement policy
• To replicate this system we need to design a hierarchy, fill it (with models) and implement a model replacement policy
General CHIC Architecture
• Idea: build a hierarchy of N levels as follows:
• Level zero: a data buffer fed from the stream
• Level one: build models from the data at level zero
• Levels two to N-1: fill with the "best" models from the lower levels
• Level N: adopt models from level N-1, but also discard models so that new ones can enter
• For prediction, use all models in the hierarchy and vote
A minimal sketch of this control loop follows.
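The following is a minimal sketch of the buffering and level-one step, not the original implementation: the class and method names (CHIC, observe, _process_buffer) and the (model, tag) level representation are illustrative assumptions, and the promotion/replacement step is deferred to a later sketch.

```python
# Minimal sketch of the CHIC control loop. Class and method names are
# illustrative, not taken from the original implementation.

class CHIC:
    def __init__(self, learners, n_levels=6, buffer_size=1000):
        self.learners = learners            # dict: tag -> model factory
        self.buffer = []                    # level zero: raw instances
        self.buffer_size = buffer_size
        self.levels = [[] for _ in range(n_levels)]  # lists of (model, tag)

    def observe(self, x, y):
        """Consume one labelled instance from the stream."""
        self.buffer.append((x, y))
        if len(self.buffer) == self.buffer_size:
            self._process_buffer()
            self.buffer = []

    def _process_buffer(self):
        # level one: build one model per base learner from the buffer
        X = [x for x, _ in self.buffer]
        y = [t for _, t in self.buffer]
        for tag, make_model in self.learners.items():
            self.levels[0].append((make_model().fit(X, y), tag))
        # promotion/replacement between levels would run here
        # (see the promotion sketch later in the document)
```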
Features
• Can use any machine learning algorithm to build models
• Implements a form of batch-incremental learning
• The replacement policy can be performance based
• As with the web cache, CHIC fills up initially and then keeps the best-performing models
• If a variety of models is used at the lower levels, then it is possible to adapt to the source of the data
Experimental Design
• Try to demonstrate adaptation to the data source
• Learn a mixture of models at levels one and two and let the performance-based promotion mechanism take over
• Evaluation raises two issues:
• Need a performance measure (for model promotion/deletion)
• Overall performance of the hierarchy (adopt a test-then-train approach, sketched below)
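A test-then-train loop can be sketched as follows; `chic` is the CHIC object sketched earlier, and `predict_fn` is any voting predictor over the hierarchy (one is sketched near the end of this document). The function name and interface are assumptions for illustration.

```python
# Sketch of test-then-train evaluation: each labelled instance is first
# used to test the current hierarchy, then handed to it for training.
def test_then_train(chic, stream, predict_fn):
    correct = seen = 0
    for x, y in stream:
        if any(chic.levels):              # score only once models exist
            correct += int(predict_fn(chic.levels, x) == y)
            seen += 1
        chic.observe(x, y)                # then train on the instance
    return correct / max(seen, 1)
```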
Data Sources
• Random Naïve Bayes: randomly weight attribute labels and class values (here we use 5 classes, 5 nominal and 5 numeric attributes)
• Random Tree: choose a depth and randomly assign splitter nodes; here 5 nodes deep, leaves starting from depth 3, with the same number of attributes/classes as above
• Random RBF: a random set of centers for each class; each center is weighted and has a standard deviation – here 50 centers, 10 numeric attributes and 5 classes (a generator sketch follows this list)
• Real data: Forest covertype (UCI repository), 7 classes, 500K instances
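As a concrete illustration of the Random RBF source, here is a minimal generator sketch. Only the counts (50 centers, 10 numeric attributes, 5 classes) come from the description above; the seed, the uniform placement of centers, and the noise-scale range are assumptions.

```python
import numpy as np

# Hypothetical Random RBF stream source: 50 weighted centres, each with
# its own standard deviation and class label, over 10 numeric attributes.
rng = np.random.default_rng(seed=1)
n_centres, n_attrs, n_classes = 50, 10, 5
centres = rng.uniform(0, 1, size=(n_centres, n_attrs))
stddevs = rng.uniform(0.05, 0.2, size=n_centres)       # assumed range
weights = rng.uniform(0, 1, size=n_centres)
weights /= weights.sum()
labels = rng.integers(0, n_classes, size=n_centres)    # class per centre

def next_instance():
    """Pick a centre by weight, then add Gaussian noise around it."""
    c = rng.choice(n_centres, p=weights)
    x = centres[c] + rng.normal(0, stddevs[c], size=n_attrs)
    return x, labels[c]
```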
Specific CHIC Architecture
• Six levels with 16 models per level
• Data buffer of size 1000 instances
• The first level uses four algorithms to generate models:
• Naïve Bayes (N), C4.5 (J), linear SVM (L), RBF-kernel SVM (R)
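For concreteness, here are scikit-learn stand-ins for the four base learners; these approximate, but are not, the original implementations (DecisionTreeClassifier, for example, is a CART-style stand-in for C4.5).

```python
# Stand-ins for the four level-one learners, keyed by their one-letter
# tags. Each value is a zero-argument factory for a fresh model.
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

LEARNERS = {
    "N": GaussianNB,                      # Naïve Bayes
    "J": DecisionTreeClassifier,          # C4.5-style decision tree
    "L": LinearSVC,                       # linear SVM
    "R": lambda: SVC(kernel="rbf"),       # RBF-kernel SVM
}
```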
Example
• Read 1000 instances into the buffer and build 4 models; repeat on the next 3 buffers – level one is now full
• Read the next 1000 instances and build 4 more models
• Using the buffer data as test data, evaluate all models at level 1; promote the best 4 (groupwise) to level 2 and replace the worst 4 with the new ones (see the sketch below)
• Continue in this manner (at 12,000 instances level 2 will be full)
• Note: levels 1 and 2 always hold 4 models from each of the 4 groups
• From level 2 upwards, ONLY the best models are promoted (this is what adapts to the source)
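The level-one update can be sketched like this. The helper names (accuracy, best_per_group, update_level_one) are illustrative, each level is assumed to be a list of (model, tag) pairs as in the earlier sketch, and `buffer` holds the latest 1000 labelled instances.

```python
# Sketch of one level-one promotion/replacement step.
def accuracy(model, buffer):
    X = [x for x, _ in buffer]
    y = [t for _, t in buffer]
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

def best_per_group(level, buffer):
    """Best model of each group (N, J, L, R) by accuracy on the buffer."""
    best = {}
    for model, tag in level:
        acc = accuracy(model, buffer)
        if tag not in best or acc > best[tag][0]:
            best[tag] = (acc, model)
    return [(model, tag) for tag, (acc, model) in best.items()]

def update_level_one(level, level_two, new_models, buffer):
    # copies of the groupwise best four move up to level two ...
    level_two.extend(best_per_group(level, buffer))
    # ... and the worst four are replaced by the freshly built models
    level.sort(key=lambda mt: accuracy(mt[0], buffer))
    level[:4] = new_models
```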
Example Continued
• Four models are promoted to level 3 and the 4 worst are deleted, freeing 8 spaces
• Levels 3, 4 and 5 work on the same basis; level 6 simply deletes the worst 4 models to free up space
• At prediction time all models have an equal vote and the classification decision is arrived at by majority (see the voting sketch below)
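Majority voting over the whole hierarchy is then straightforward; this sketch assumes the (model, tag) level structure used in the earlier sketches.

```python
from collections import Counter

# Equal-vote prediction across every model in the hierarchy;
# `levels` is a list of lists of (model, tag) pairs, as above.
def predict(levels, x):
    votes = Counter(model.predict([x])[0]
                    for level in levels for model, _ in level)
    return votes.most_common(1)[0][0]
```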
Results
Each snapshot shows the hierarchy contents, level 1 first and levels separated by "/", with one character per model slot: N = Naïve Bayes, J = C4.5, L = linear SVM, R = RBF-kernel SVM, - = empty slot.
• After 1000 instances: N---J---L---R---
• After 2000 instances: NN--JJ--LL--RR--
• After 3000 instances: NNN-JJJ-LLL-RRR-
• After 4000 instances: NNNNJJJJLLLLRRRR
• After 5000 instances: N-NNJJ-JL-LLR-RR / NJLR------------
• After 6000 instances: NNNNJJJJLLLLRRRR / NJLR------------
• After 7000 instances: NNN-J-JJL-LLR-RR / NJLRNJLR--------
• After 8000 instances: NNNNJJJJLLLLRRRR / NJLRNJLR--------
Continued
• After 9000 instances: NN-N-JJJLL-L-RRR / NJLRNJLRNJLR----
• After 10000 instances: NNNNJJJJLLLLRRRR / NJLRNJLRNJLR----
• After 11000 instances: -NNN-JJJLLL-RR-R / NJLRNJLRNJLRNJLR
• After 12000 instances: NNNNJJJJLLLLRRRR / NJLRNJLRNJLRNJLR
• After 13000 instances: NN-NJJ-JLL-LRR-R / NNLJNLLRN-L-N-L- / JJJJ------------
Model Adaptation to Source
Hierarchy contents per source, one row per level (level 1 at the top, level 6 at the bottom). On the random tree source the upper levels fill with C4.5 (J) models; on the random naive Bayes source they fill with Naïve Bayes (N) models.

Random tree:
N-NN-JJJLL-LR-RR
NNNNLNNLJLRNNJLR
JJJJJNNLJJ---JL-
JJJJJJJJJJJJJJJJ
JJJJ--JJ-JJJJJ-J
JJJJJJJJJJJJJJJJ

Random naive Bayes:
-NNNJJ-JLLL-R-RR
NJJJLLLLRNJJJLLR
NNNNNJJNJJ-NN---
NNNNNNNNNNNNNNNN
NNNNNNNN---NNNN-
NNNNNNNNNNNNNNNN
Continued
The random RBF source fills the upper levels with RBF-kernel SVMs (R); covertype favours the linear SVM (L).

Random RBF:
NNN--JJJLL-LRR-R
NNJJLRNJNNJJNJLR
RRJRJJRRRRJRRJJJ
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR

Covertype:
N-NN-JJJ-LLLR-RR
NNJJJRJLNRJ---R
LJJLLJLJLJL--N--
LLLLLLLLLLLLJLLJ
LLLLLLLLJL-J-L--
LLLLLLLLLLLLLLLL
Conclusion
• Novel architecture for data streams
• Operates much like a web cache (hierarchy and replacement policy)
• Provides a scaling-up mechanism for arbitrary classifiers (batch-incremental)
• Can be adapted to clustering, regression, and association rule learning
• Thousands of options still to explore!