This paper explores techniques and a programming interface for parallelizing data mining algorithms on shared memory machines, focusing on processing large datasets. The authors present a case study on decision tree construction and provide experimental results.
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface and Performance Ruoming Jin Gagan Agrawal Department of Computer and Information Sciences Ohio State University
Motivation • Frequently need to mine very large datasets • Large and powerful SMP machines are becoming available • Vendors often target data mining and data warehousing as the main market • Explicitly writing shared memory programs can be difficult, especially if large datasets need to be processed • Can we provide a common set of techniques and a programming interface to create shared memory implementations?
Context • Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system • Support parallelization on shared-nothing configurations • Support parallelization on shared memory configurations • Support processing of large datasets • Previously reported our work for distributed memory parallelization and processing of disk-resident datasets (SDM 01, IPDPS 01 workshop) • Focus on techniques and programming interface for shared memory parallelization
Outline • Key observation from mining algorithms • Parallelization challenge, techniques and trade-offs • Programming Interface • Experimental Results • K- means • Apriori • A detailed case study: decision tree construction • Parallel algorithms • Experimental results • Summary and future work
Common Processing Structure
• Structure of Common Data Mining Algorithms

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

• Applies to major association mining, clustering and decision tree construction algorithms
• How to parallelize it on a shared memory machine?
Challenges in Parallelization • Statically partitioning the reduction object to avoid race conditions is generally impossible • Runtime preprocessing or scheduling cannot be applied either: without processing an element, we cannot tell which parts of the reduction object it updates • The size of the reduction object means that replicating it incurs significant memory overheads • Locking and synchronization costs can be significant because updates to the reduction object are fine-grained
Parallelization Techniques • Full Replication: create a copy of the reduction object for each thread • Full Locking: associate a lock with each element • Optimized Full Locking: put the element and corresponding lock on the same cache block • Fixed Locking: use a fixed number of locks • Cache Sensitive Locking: one lock for all elements in a cache block
Memory Layout for Various Locking Schemes (figure): Fixed Locking, Full Locking, Optimized Full Locking, Cache-Sensitive Locking. Legend: Lock / Reduction Element.
Programming Interface: k-means example
• Initialization Function

void Kmeans::initialize() {
    for (int i = 0; i < k; i++) {
        clusterID[i] = reductionobject->alloc(ndim + 2);
    }
    /* Initialize Centers */
}
k-means example (contd.)
• Local Reduction Function

void Kmeans::reduction(void *point) {
    for (int i = 0; i < k; i++) {
        dis = distance(point, i);
        if (dis < min) {
            min = dis;
            min_index = i;
        }
    }
    objectID = clusterID[min_index];
    for (int j = 0; j < ndim; j++)
        reductionobject->Add(objectID, j, point[j]);
    reductionobject->Add(objectID, ndim, 1);
    reductionobject->Add(objectID, ndim + 1, min);
}
Implementation from the Common Specification

template<class T>
inline void Reducible<T>::Reduc(int ObjectID, int Offset,
                                void (*func)(void *, void *), int *param) {
    T *group_address = reducgroup[ObjectID];
    switch (TECHNIQUE) {
    case FULL_REPLICATION:
        func(&group_address[Offset], param);
        break;
    case FULL_LOCKING:
        offset = abs_offset(ObjectID, Offset);
        S_LOCK(&locks[offset]);
        func(&group_address[Offset], param);
        S_UNLOCK(&locks[offset]);
        break;
    case OPTIMIZED_FULL_LOCKING:
        S_LOCK(&group_address[Offset * 2]);
        func(&group_address[Offset * 2 + 1], param);
        S_UNLOCK(&group_address[Offset * 2]);
        break;
    }
}
Experimental Platform • Small SMP machine • Sun Ultra Enterprise 450 • 4 X 250 MHz Ultra-II processors • 1 GB of 4-way interleaved main memory • Large SMP machine • Sun Fire 6800 • 24 X 900 MHz Sun UltraSparc III • A 96KB L1 cache and a 64 MB L2 cache per processor • 24 GB main memory
Results (1) 1GB dataset, N1000, L15, support=0.5
Results 500 MB dataset, N2000, L20, 4 threads
Results Scalability and Middleware Overhead for Apriori: 4 Processor SMP Machine
Results Scalability and Middleware Overhead for Apriori: Large SMP Machine
Results Scalability and Middleware Overhead for K-means: 4 Processor SMP Machine, 200MB dataset, k=1000
Results Scalability and Middleware Overhead for K-means: Large SMP Machine
A Case Study: Decision Tree Construction • Question: can we parallelize decision tree construction using the same framework? • Most existing parallel algorithms have a fairly different structure (sorting, writing back …) • Being able to support decision tree construction will significantly add to the usefulness of the framework
Approach
• Implemented the RainForest framework (Gehrke)
• Current focus is on RF-read
• Overview of the algorithm:
  • While the stopping condition is not satisfied
    • read the data
    • build the AVC-groups for the nodes being processed
    • choose the splitting attribute to split each node
    • select a new set of nodes to process, as many as main memory can hold
Parallelization Strategies • Pure approach: apply only one of full replication, optimized full locking and cache-sensitive locking • Vertical approach: use replication at the top levels of the tree, locking at the lower levels • Horizontal approach: use replication for attributes with a small number of distinct values, locking for the others • Mixed approach: combine the vertical and horizontal approaches
Results Performance of pure versions: 1.3 GB dataset with 32 million records in the training set, function 7, decision tree depth = 16
Results Combining full replication and full locking
Results Combining full replication and cache-sensitive locking
Summary • A set of common techniques can be used for shared memory parallelization of different mining algorithms • A programming interface can be offered, simplifying programming without significant performance overheads • It is important to support different parallelization techniques, depending upon the size of the reduction object • Excellent performance for Apriori and k-means, quite competitive for decision tree construction
Future work • Decision tree construction: beyond RF-read • Other mining algorithms • Performance modeling and prediction (paper at SIGMETRICS 2002) • Release of FREERIDE software