1 / 21

Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams

Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams. Traditional Machine Learning. Create train/test splits of the data (possibly via cross-validation) Load ALL the training data into main memory Compute a model from the training data (may involve multiple passes)

tomai
Download Presentation

Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams

  2. Traditional Machine Learning • Create train/test splits of the data (possibly via cross-validation) • Load ALL the training data into main memory • Compute a model from the training data (may involve multiple passes) • Load all the test data into main memory • Compute accuracy measures from the test data

  3. Consequences • Data is processed one instance at a time • Very few incremental methods – none used seriously in practice. • Many existing techniques don’t scale • Machine Learning is perceived as a small to medium data set field of study. • Larger data sets are tackled through sampling or building several models on separate portions and combining their predictions

  4. Data Streams • Takes the “stream” view, source maybe finite or infinite • Concept of train/test less well defined, could train for a while then test for a while – what is the definition of “a while”? • What ever you do you can be sure that ALL the data will NOT fit in main memory

  5. Data Stream Constraints • Cannot store instances (not all anyway) • Cannot use more than available memory – no swapping to disk • Cannot dwell too long over an instance - must keep up with the incoming rate of instances • Cannot wait to make predictions – need to be ready to make predictions at any time

  6. Scaling up existing methods • Could learn models using existing methods in batches and then merge models • Could merge instances (meta-instances) • Could use a cache model where we keep a set of models and update the cache in time – eg use least recently used, least frequently used type strategies • Could do the above but use performance measures to decide the make-up of the cache.

  7. Caching in Data Communications • Web proxy caches provide a good model for what we need to satisfy stream constraints • Real caches are hierarchical (Squid) • The hierarchy provides a mechanism for sharing the load and increasing the likelihood of a page hit • When full a cache needs a replacement policy • To replicate this system we need to design a hierarchy, fill it (with models) and implement a model replacement policy

  8. General CHIC Architecture • Idea: Build a hierarchy of levels (N) as follows: • Level Zero: data buffer from stream • Level One: Build models from data at level zero • Level Two to N-1: Fill with “best” models from lower levels • Level N: Adopt models from level N-1 but also discard models so that new ones can be entered • For prediction use all models in hierarchy and vote

  9. Features • Can use any machine learning algorithm to build models • Implements a form of batch-incremental learning • Replacement policy can be performance based • As with the web cache CHIC fills up initially and then keeps the best performing models • If a variety of models is used at the lower levels then it is possible to adapt to the source of the data.

  10. Experimental Design • Try to demonstrate adaptation to data source • Learn a mixture of models at levels one and two and let the performance based promotion mechanism take over • Evaluate: two issues • Need performance measure (model promotion/deletion) • Overall performance of hierarchy (adopt a first test then train approach)

  11. Data Sources • Random Naïve Bayes • Weight attribute labels and class values (here we use 5 classes, 5 nominal and 5 numeric attributes) • Random Tree • Choose a depth and randomly assign splitter nodes, here 5 nodes deep, leaves starting from depth 3, as above number of attributes/classes. • Random RBF • Random set of centers for each class, center is weighted and has a standard deviation – here 50 centers, 10 numeric attributes and 5 classes • Real data • Forest covertype (UCI repository), 7 classes, 500K instances.

  12. Specific CHIC Architecture • Six levels with 16 models per level • Data buffer of size 1000 instances • First level uses four algorithms to generate models • Naïve Bayes(N), C4.5(J), linear SVM(L), RBF Kernel SVM(R)

  13. Example • Read 1000 instances into buffer, build 4 models, repeat on next 3 buffers – leads to level one full • Read next 1000 and build 4 more models • Using the buffer data as test data evaluate all models at level 1 – promote best 4 (groupwise) to level 2 and replace the worst 4 with the new ones. • Continue in this manner (at 12,000 instances level 2 will be full) • Note: Levels 1&2 always have 4 models from the 4 groups • From level 2 ONLY promote the best (adapt to source)

  14. Example Contd • Four models are promoted to level 3 the 4 worst deleted freeing 8 spaces. • Levels 3, 4 & 5 work on the same basis. Level 6 simply deletes the worst 4 models to free up space. • At prediction time all models have an equal vote and the classification decision is arrived at by majority.

  15. Results after 1000 instances: after 2000 instances: N---J---L---R--- NN--JJ--LL--RR-- after 3000 instances: after 4000 instances: NNN-JJJ-LLL-RRR- NNNNJJJJLLLLRRRR after 5000 instances: after 6000 instances: N-NNJJ-JL-LLR-RR NNNNJJJJLLLLRRRR NJLR------------ NJLR------------ after 7000 instances: after 8000 instances: NNN-J-JJL-LLR-RR NNNNJJJJLLLLRRRR NJLRNJLR-------- NJLRNJLR--------

  16. Continued after 9000 instances: after 10000 instances: NN-N-JJJLL-L-RRR NNNNJJJJLLLLRRRR NJLRNJLRNJLR---- NJLRNJLRNJLR---- after 11000 instances: after 12000 instances: -NNN-JJJLLL-RR-R NNNNJJJJLLLLRRRR NJLRNJLRNJLRNJLR NJLRNJLRNJLRNJLR after 13000 instances: NN-NJJ-JLL-LRR-R NNLJNLLRN-L-N-L- JJJJ------------

  17. Model Adaptation to Source random treerandom naive Bayes N-NN-JJJLL-LR-RR -NNNJJ-JLLL-R-RR NNNNLNNLJLRNNJLR NJJJLLLLRNJJJLLR JJJJJNNLJJ---JL- NNNNNJJNJJ-NN--- JJJJJJJJJJJJJJJJ NNNNNNNNNNNNNNNN JJJJ--JJ-JJJJJ-J NNNNNNNN---NNNN- JJJJJJJJJJJJJJJJ NNNNNNNNNNNNNNNN 

  18. Continued random RBFcovertype NNN--JJJLL-LRR-R N-NN-JJJ-LLLR-RR NNJJLRNJNNJJNJLR NNJJJRJLNRJ---R RRJRJJRRRRJRRJJJ LJJLLJLJLJL--N-- RRRRRRRRRRRRRRRR LLLLLLLLLLLLJLLJ RRRRRRRRRRRRRRRR LLLLLLLLJL-J-L-- RRRRRRRRRRRRRRRR LLLLLLLLLLLLLLLL

  19. Learning Curve – Random Tree

  20. Learning Curve - CoverType

  21. Conclusion • Novel architecture for data streams • Operates much like a web cache (hierarchy and replacement policy) • Provides scaling-up mechanism for arbitrary classifiers (batch-incremental) • Can be adapted to clustering, regression, association rule learning • Thousands of options still to explore!

More Related