
Shared Memory Parallelization of Data Mining Algorithms: Techniques and Performance

This paper explores techniques and a programming interface for parallelizing data mining algorithms on shared memory machines, focusing on processing large datasets. The authors present a case study on decision tree construction and provide experimental results.


Presentation Transcript


  1. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface and Performance. Ruoming Jin and Gagan Agrawal, Department of Computer and Information Sciences, Ohio State University

  2. Motivation
  • Frequently need to mine very large datasets
  • Large and powerful SMP machines are becoming available
  • Vendors often target data mining and data warehousing as the main market
  • Explicitly writing shared memory programs can be difficult, especially if large datasets need to be processed
  • Can we provide a common set of techniques and a programming interface to create shared memory implementations?

  3. Context
  • Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system
  • Support parallelization on shared-nothing configurations
  • Support parallelization on shared memory configurations
  • Support processing of large datasets
  • Previously reported our work on distributed memory parallelization and processing of disk-resident datasets (SDM 01, IPDPS 01 workshop)
  • Focus here: techniques and programming interface for shared memory parallelization

  4. Outline
  • Key observation from mining algorithms
  • Parallelization challenges, techniques and trade-offs
  • Programming interface
  • Experimental results
    • K-means
    • Apriori
  • A detailed case study: decision tree construction
    • Parallel algorithms
    • Experimental results
  • Summary and future work

  5. Common Processing Structure
  • Structure of common data mining algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

  • Applies to major association mining, clustering and decision tree construction algorithms
  • How to parallelize it on a shared memory machine?
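To make the structure concrete, here is a minimal C++ sketch of the same loop; the Element type, process() and update_and_check() are hypothetical placeholders for illustration, not part of the FREERIDE API.

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Element { /* one input record */ };

    // Maps an element to the reduction-object index it updates and the
    // value to combine in; the index is only known after processing.
    std::pair<std::size_t, double> process(const Element &e);

    // Per-pass bookkeeping (e.g., generating new candidates); returns
    // true when the outer loop should stop.
    bool update_and_check(std::vector<double> &reduc);

    void mine(const std::vector<Element> &data, std::vector<double> &reduc) {
        bool done = false;
        while (!done) {                          // outer sequential loop
            for (const Element &e : data) {      // reduction loop
                auto [i, val] = process(e);
                reduc[i] += val;                 // "op" specialized to += here
            }
            done = update_and_check(reduc);
        }
    }

The key point is that the index i is only known once the element has been processed, which drives the parallelization challenges on the next slide.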

  6. Challenges in Parallelization
  • Statically partitioning the reduction object to avoid race conditions is generally impossible
  • Runtime preprocessing or scheduling cannot be applied either
    • Can't tell what needs to be updated without processing the element
  • The size of the reduction object means replication carries significant memory overheads
  • Locking and synchronization costs can be significant because of the fine-grained updates to the reduction object

  7. Parallelization Techniques
  • Full Replication: create a copy of the reduction object for each thread
  • Full Locking: associate a lock with each element
  • Optimized Full Locking: put the element and its lock on the same cache block
  • Fixed Locking: use a fixed number of locks
  • Cache-Sensitive Locking: one lock for all elements in a cache block

  8. Memory Layout for Various Locking Schemes
  [Figure: memory layouts for fixed locking, full locking, optimized full locking and cache-sensitive locking; the legend distinguishes locks from reduction elements]
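A rough C++ rendering of the two cache-conscious layouts, assuming 64-byte cache blocks and 8-byte locks and elements (the sizes are illustrative assumptions, not the paper's parameters):

    #include <cstdint>

    constexpr int kCacheBlock = 64;

    // Optimized full locking: lock and element are adjacent, so each
    // 16-byte pair stays within a single cache block and an update
    // touches only that block.
    struct LockedElement {
        volatile std::uint64_t lock;   // per-element spin lock
        double value;                  // the reduction element
    };

    // Cache-sensitive locking: one lock guards all elements packed into
    // the same cache block, cutting lock memory roughly in proportion to
    // the number of elements per block.
    struct alignas(kCacheBlock) CsBlock {
        volatile std::uint64_t lock;             // guards the whole block
        double values[kCacheBlock / 8 - 1];      // 7 elements per block
    };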

  9. Tradeoffs Among Techniques

  10. Programming Interface: k-means example
  • Initialization Function

    void Kmeans::initialize() {
        for (int i = 0; i < k; i++) {
            // Each cluster gets ndim slots for coordinate sums, plus
            // one for the point count and one for the distance sum
            clusterID[i] = reductionobject->alloc(ndim + 2);
        }
        /* Initialize Centers */
    }

  11. k-means example (contd.)
  • Local Reduction Function

    void Kmeans::reduction(void *point) {
        double *p = (double *)point;
        min = FLT_MAX;   // reset before the search
        // Find the cluster center closest to this point
        for (int i = 0; i < k; i++) {
            dis = distance(p, i);
            if (dis < min) {
                min = dis;
                min_index = i;
            }
        }
        // Accumulate into the winning cluster's reduction group
        objectID = clusterID[min_index];
        for (int j = 0; j < ndim; j++)
            reductionobject->Add(objectID, j, p[j]);
        reductionobject->Add(objectID, ndim, 1);
        reductionobject->Add(objectID, ndim + 1, min);
    }

  12. Implementation from the Common Specification

    template <class T>
    inline void Reducible<T>::Reduc(int ObjectID, int Offset,
                                    void (*func)(void *, void *), int *param) {
        T *group_address = reducgroup[ObjectID];
        switch (TECHNIQUE) {
        case FULL_REPLICATION:
            // Each thread has its own copy; no synchronization needed
            func(&group_address[Offset], param);
            break;
        case FULL_LOCKING: {
            // One lock per element, held in a separate lock array
            int offset = abs_offset(ObjectID, Offset);
            S_LOCK(&locks[offset]);
            func(&group_address[Offset], param);
            S_UNLOCK(&locks[offset]);
            break;
        }
        case OPTIMIZED_FULL_LOCKS:
            // Locks interleaved with elements: even slots hold locks,
            // odd slots the corresponding reduction elements
            S_LOCK(&group_address[Offset * 2]);
            func(&group_address[Offset * 2 + 1], param);
            S_UNLOCK(&group_address[Offset * 2]);
            break;
        }
    }
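A user-level operation such as the Add() in the k-means example can then lower onto this generic Reduc(). The wrapper below is an illustrative assumption about that mapping, not FREERIDE's actual code; a double-valued variant would pass a different combining function.

    // Combining function matching the void (*)(void *, void *) parameter
    // of Reduc(): adds the parameter value into the target element.
    static void add_func(void *target, void *param) {
        *static_cast<double *>(target) += *static_cast<int *>(param);
    }

    template <class T>
    inline void Reducible<T>::Add(int ObjectID, int Offset, int value) {
        Reduc(ObjectID, Offset, add_func, &value);
    }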

  13. Experimental Platform
  • Small SMP machine: Sun Ultra Enterprise 450
    • 4 x 250 MHz Ultra-II processors
    • 1 GB of 4-way interleaved main memory
  • Large SMP machine: Sun Fire 6800
    • 24 x 900 MHz Sun UltraSparc III processors
    • 96 KB L1 cache and 64 MB L2 cache per processor
    • 24 GB main memory

  14. Results (1): 1 GB dataset, N1000, L15, support = 0.5

  15. Results: 500 MB dataset, N2000, L20, 4 threads

  16. Results: Scalability and Middleware Overhead for Apriori, 4-Processor SMP Machine

  17. Results: Scalability and Middleware Overhead for Apriori, Large SMP Machine

  18. Results: Scalability and Middleware Overhead for K-means, 4-Processor SMP Machine (200 MB dataset, k = 1000)

  19. Results: Scalability and Middleware Overhead for K-means, Large SMP Machine

  20. A Case Study: Decision Tree Construction
  • Question: can we parallelize decision tree construction using the same framework?
  • Most existing parallel algorithms have a fairly different structure (sorting, writing back, ...)
  • Being able to support decision tree construction would significantly add to the usefulness of the framework

  21. Approach
  • Implemented the RainForest framework (Gehrke et al.)
  • Current focus on RF-read
  • Overview of the algorithm:
    • While the stopping condition is not satisfied:
      • read the data
      • build the AVC-groups for the active nodes
      • choose the splitting attributes and split the nodes
      • select a new set of nodes to process, as many as main memory can hold
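For concreteness, a condensed C++ sketch of one such pass; every type and helper below is a hypothetical placeholder around the loop above, not the paper's implementation.

    #include <vector>

    struct Record;
    struct TreeNode;

    // Route a record to the active node whose subtree it falls into.
    TreeNode *classify_to_node(const Record &r,
                               const std::vector<TreeNode *> &active);
    // Update the node's AVC-group: per (attribute, value, class) counts
    // used to evaluate candidate splits.
    void update_avc_group(TreeNode *n, const Record &r);
    void split_on_best_attribute(TreeNode *n);
    std::vector<TreeNode *> nodes_fitting_in_memory(
        const std::vector<TreeNode *> &just_split);

    void rf_read_pass(const std::vector<Record> &data,
                      std::vector<TreeNode *> &active) {
        // One full read of the data: pure counting, so it matches the
        // Reduc(i) = Reduc(i) op val structure from slide 5
        for (const Record &r : data) {
            TreeNode *n = classify_to_node(r, active);
            update_avc_group(n, r);
        }
        for (TreeNode *n : active)
            split_on_best_attribute(n);
        active = nodes_fitting_in_memory(active);
    }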

  22. Parallelization Strategies
  • Pure approach: apply only one of full replication, optimized full locking and cache-sensitive locking
  • Vertical approach: use replication at the top levels, locking at the lower ones
  • Horizontal approach: use replication for attributes with a small number of distinct values, locking otherwise
  • Mixed approach: combine the above two (see the sketch after this list)
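As an illustration of the mixed approach, a per-AVC-group technique chooser might look like the sketch below; the depth and distinct-value thresholds are invented for illustration, not values from the paper.

    enum Technique { FULL_REPLICATION, OPTIMIZED_FULL_LOCKING };

    Technique choose_technique(int node_depth, int distinct_values) {
        // Vertical rule: there are few nodes near the root, so
        // replicating their AVC-groups per thread is affordable
        if (node_depth <= 3)
            return FULL_REPLICATION;
        // Horizontal rule: attributes with few distinct values have
        // small counting arrays, so replication stays cheap; large
        // ones fall back to locking
        if (distinct_values <= 32)
            return FULL_REPLICATION;
        return OPTIMIZED_FULL_LOCKING;
    }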

  23. Results: Performance of pure versions; 1.3 GB dataset with 32 million training records, function 7, decision tree depth = 16

  24. Results: Combining full replication and full locking

  25. Results: Combining full replication and cache-sensitive locking

  26. Summary
  • A common set of techniques can be used for shared memory parallelization of different mining algorithms
  • A programming interface can be offered that simplifies programming without significant performance overheads
  • Important to support different parallelization techniques, depending on the size of the reduction object
  • Excellent performance for apriori and k-means, quite competitive for decision tree construction

  27. Future work
  • Decision tree construction: beyond RF-read
  • Other mining algorithms
  • Performance modeling and prediction (paper at SIGMETRICS 2002)
  • Release of the FREERIDE software
