This paper explores techniques and a programming interface for parallelizing data mining algorithms on shared memory machines, focusing on processing large datasets. The authors present a case study on decision tree construction and provide experimental results.
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface and Performance Ruoming Jin Gagan Agrawal Department of Computer and Information Sciences Ohio State University
Motivation • Frequently need to mine very large datasets • Large and powerful SMP machines are becoming available • Vendors often target data mining and data warehousing as the main market • Explicitly writing shared memory programs can be difficult, especially if large datasets need to be processed • Can we provide a common set of techniques and a programming interface to create shared memory implementations?
Context • Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system • Support parallelization on shared-nothing configurations • Support parallelization on shared memory configurations • Support processing of large datasets • Previously reported our work for distributed memory parallelization and processing of disk-resident datasets (SDM 01, IPDPS 01 workshop) • Focus on techniques and programming interface for shared memory parallelization
Outline • Key observation from mining algorithms • Parallelization challenge, techniques and trade-offs • Programming Interface • Experimental Results • K- means • Apriori • A detailed case study: decision tree construction • Parallel algorithms • Experimental results • Summary and future work
Common Processing Structure
• Structure of Common Data Mining Algorithms

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

• Applies to major association mining, clustering and decision tree construction algorithms
• How to parallelize it on a shared memory machine?
Challenges in Parallelization • Statically partitioning the reduction object to avoid race conditions is generally impossible • Runtime preprocessing or scheduling cannot be applied either: without processing an element, we cannot tell which parts of the reduction object it updates • The size of the reduction object means that replicating it incurs significant memory overheads • Locking and synchronization costs can be significant because updates to the reduction object are fine-grained
Parallelization Techniques • Full Replication: create a copy of the reduction object for each thread • Full Locking: associate a lock with each element • Optimized Full Locking: put the element and corresponding lock on the same cache block • Fixed Locking: use a fixed number of locks • Cache Sensitive Locking: one lock for all elements in a cache block
Memory Layout for Various Locking Schemes (figure): Fixed Locking, Full Locking, Optimized Full Locking, Cache-Sensitive Locking. Legend: Lock / Reduction Element.
Programming Interface: k-means example
• Initialization Function

void Kmeans::initialize() {
    for (int i = 0; i < k; i++) {
        clusterID[i] = reductionobject->alloc(ndim + 2);
    }
    /* Initialize Centers */
}
k-means example (contd.)
• Local Reduction Function

void Kmeans::reduction(void *point) {
    for (int i = 0; i < k; i++) {
        dis = distance(point, i);
        if (dis < min) {
            min = dis;
            min_index = i;
        }
    }
    objectID = clusterID[min_index];
    for (int j = 0; j < ndim; j++)
        reductionobject->Add(objectID, j, point[j]);
    reductionobject->Add(objectID, ndim, 1);
    reductionobject->Add(objectID, ndim + 1, min);
}
Implementation from the Common Specification

template<class T>
inline void Reducible<T>::Reduc(int ObjectID, int Offset,
                                void (*func)(void *, void *), int *param) {
    T *group_address = reducgroup[ObjectID];
    switch (TECHNIQUE) {
    case FULL_REPLICATION:
        func(&group_address[Offset], param);
        break;
    case FULL_LOCKING:
        offset = abs_offset(ObjectID, Offset);
        S_LOCK(&locks[offset]);
        func(&group_address[Offset], param);
        S_UNLOCK(&locks[offset]);
        break;
    case OPTIMIZED_FULL_LOCKING:
        S_LOCK(&group_address[Offset * 2]);
        func(&group_address[Offset * 2 + 1], param);
        S_UNLOCK(&group_address[Offset * 2]);
        break;
    }
}
Experimental Platform • Small SMP machine • Sun Ultra Enterprise 450 • 4 X 250 MHz Ultra-II processors • 1 GB of 4-way interleaved main memory • Large SMP machine • Sun Fire 6800 • 24 X 900 MHz Sun UltraSparc III • A 96KB L1 cache and a 64 MB L2 cache per processor • 24 GB main memory
Results (1) 1GB dataset, N1000, L15, support=0.5
Results 500 MB dataset, N2000, L20, 4 threads
Results Scalability and Middleware Overhead for Apriori: 4 Processor SMP Machine
Results Scalability and Middleware Overhead for Apriori: Large SMP Machine
Results Scalability and Middleware Overhead for K-means: 4 Processor SMP Machine, 200MB dataset, k=1000
Results Scalability and Middleware Overhead for K-means: Large SMP Machine
A Case Study: Decision Tree Construction • Question: can we parallelize decision tree construction using the same framework? • Most existing parallel algorithms have a fairly different structure (sorting, writing back …) • Being able to support decision tree construction will significantly add to the usefulness of the framework
Approach
• Implemented the RainForest framework (Gehrke)
• Current focus is on RF-read
• Overview of the algorithm:
  • While the stopping condition is not satisfied
    • read the data
    • build the AVC-groups for the nodes being processed
    • choose the splitting attribute to split each node
    • select a new set of nodes to process, as many as main memory can hold
Parallelization Strategies • Pure approach: apply only one of full replication, optimized full locking and cache-sensitive locking • Vertical approach: use replication at the top levels of the tree, locking at the lower levels • Horizontal approach: use replication for attributes with a small number of distinct values, locking for the others • Mixed approach: combine the vertical and horizontal approaches
Results Performance of pure versions: 1.3 GB dataset with 32 million records in the training set, function 7, decision tree depth = 16
Results Combining full replication and full locking
Results Combining full replication and cache-sensitive locking
Summary • A set of common techniques can be used for shared memory parallelization of different mining algorithms • A programming interface can be offered, simplifying programming without significant performance overheads • It is important to support different parallelization techniques, depending upon the size of the reduction object • Excellent performance for Apriori and k-means, quite competitive for decision tree construction
Future work • Decision tree construction: beyond RF-read • Other mining algorithms • Performance modeling and prediction (paper at SIGMETRICS 2002) • Release of FREERIDE software