SPRINT: A Scalable Parallel Classifier for Data Mining

John Shafer, Rakesh Agrawal, Manish Mehta

Pathway
• Terms
• Partition Algorithm
• Data Structures
• Performing Split
• Serial SPRINT
• Parallel SPRINT
• Results

Terms
• Training Data Set
• Attributes: Categorical and Continuous
• Class Label

Partition Algorithm
Partition(Data S) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate split on attribute A
    find best split
    partition S into S1 and S2
    call Partition(S1)
    call Partition(S2)
}

Data Structures
• Attribute Lists
• Histograms: Continuous and Categorical

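Finding Split Point
• Gini of a set: $\mathrm{Gini}(S) = 1 - \sum_j p_j^2$, where $p_j$ is the relative frequency of class $j$ in $S$
• Gini Index of a split of $S$ into $S_1$ and $S_2$: $\mathrm{GiniIndex}(S) = \frac{n_1}{n}\,\mathrm{Gini}(S_1) + \frac{n_2}{n}\,\mathrm{Gini}(S_2)$, where $n_1$ and $n_2$ are the sizes of $S_1$ and $S_2$ and $n = n_1 + n_2$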
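The two formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names `gini` and `gini_split` are assumptions chosen here, with `gini_split` standing for the slide's "Gini Index" of a split.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left_labels, right_labels):
    """Weighted Gini of a binary split: (n1/n)*Gini(S1) + (n2/n)*Gini(S2)."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)
```

A pure partition scores 0, and lower values mean a better split, so the split search looks for the minimum.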

Split on Continuous Attributes
• Threshold value: evaluated with the class histograms Cabove and Cbelow
• Attribute list sorted once, then scanned sequentially
• Deallocation of Cabove and Cbelow
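The sequential scan over a pre-sorted attribute list might look like the sketch below. It assumes each attribute-list entry is a `(value, class_label, rid)` tuple and that candidate thresholds are midpoints between consecutive distinct values; the function names are hypothetical and this is not the authors' implementation.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini of a set described only by its class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_continuous_split(attr_list):
    """attr_list: (value, class_label, rid) tuples, already sorted by value.

    Returns (best_score, threshold); threshold is None if no split exists.
    """
    n = len(attr_list)
    c_below = Counter()                                  # records already scanned
    c_above = Counter(label for _, label, _ in attr_list)  # records not yet scanned
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _rid = attr_list[i]
        c_below[label] += 1
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue  # no threshold can separate equal values
        n_below = i + 1
        score = (n_below / n) * gini_from_counts(c_below) \
            + ((n - n_below) / n) * gini_from_counts(c_above)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best
```

Because the two histograms are updated incrementally, one pass over the list evaluates every candidate threshold without re-sorting or re-counting.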

Split on Categorical Attributes
• Create Count-Matrix
• All subsets of attribute values are candidate split points
• Compute Gini Index for each candidate
• Gini computed from the Count Matrix only
• Memory deallocation
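As a rough sketch of the categorical case, the code below builds a count matrix in one pass and then scores every non-empty proper subset of values using only that matrix. The tuple layout `(value, class_label, rid)` and all function names are assumptions made for illustration; exhaustive enumeration is only practical when the attribute has few distinct values.

```python
from collections import Counter, defaultdict
from itertools import combinations


def gini_from_counts(counts):
    """Gini of a set described only by its class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def build_count_matrix(attr_list):
    """One scan of the attribute list: value -> class histogram."""
    matrix = defaultdict(Counter)
    for value, label, _rid in attr_list:
        matrix[value][label] += 1
    return matrix


def best_categorical_split(matrix):
    """Try every non-empty proper subset of values as the left branch."""
    values = sorted(matrix)
    total = Counter()
    for v in values:
        total.update(matrix[v])
    n = sum(total.values())
    best = (float("inf"), None)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left  # Counter subtraction keeps positive counts only
            n_left = sum(left.values())
            score = (n_left / n) * gini_from_counts(left) \
                + ((n - n_left) / n) * gini_from_counts(right)
            if score < best[0]:
                best = (score, frozenset(subset))
    return best
```

Note that the attribute list is read only once, to fill the matrix; the subset search never touches the records again.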

Perform Split and Partitioning
• Select splitting attribute and splitting value
• Create two child nodes and divide data on RIDs
• Optimization using Hashing <RID,child-ptr>
• Optimization depending on number of RIDs
• Partitioned Hashing for large hash-table
• Create new histogram and count-matrix of children
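The RID-based split can be illustrated as follows. This sketch assumes the attribute lists are held in a dictionary keyed by attribute name, uses an ordinary Python `dict` as the <RID, child> hash table, and takes the split test as a predicate; these names and shapes are hypothetical, and the partitioned-hashing case for tables that do not fit in memory is not covered.

```python
def split_attribute_lists(attr_lists, split_attr, goes_left):
    """Divide every attribute list between two children.

    attr_lists: dict mapping attribute name -> list of (value, label, rid)
    split_attr: name of the winning splitting attribute
    goes_left:  predicate on a value of split_attr (True -> child 0)
    """
    # Pass 1: scan the splitting attribute's list and record each RID's child.
    rid_to_child = {}
    for value, _label, rid in attr_lists[split_attr]:
        rid_to_child[rid] = 0 if goes_left(value) else 1

    # Pass 2: probe the hash table to route the entries of every list.
    children = ({}, {})
    for name, entries in attr_lists.items():
        parts = ([], [])
        for entry in entries:
            parts[rid_to_child[entry[2]]].append(entry)
        children[0][name], children[1][name] = parts
    return children
```

Entries are appended in their original order, so a sorted continuous attribute list stays sorted in both children and never needs to be sorted again.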

Parallel SPRINT
• Environment: Shared nothing
• Data placement and workload balancing
• Parallel computation of categorical attribute lists

Repartition of Continuous Attributes
• Global Sort
• Equal re-partitioning
• Relation between Cabove and Cbelow and processor number
• Parallel computation of split index
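One way to picture the relation between the histograms and the processor number: after the global sort, each processor holds a contiguous section of the list, so its Cbelow must start with the class counts of all lower-ranked sections and its Cabove with its own section plus all higher-ranked ones. The sketch below simulates this in a single process; the function name and the list-of-histograms input are assumptions, and a real shared-nothing system would exchange these counts by message passing.

```python
from collections import Counter


def initial_histograms(local_counts):
    """local_counts[p]: class histogram of processor p's sorted section.

    Returns, for each processor p, its starting (c_below, c_above) pair.
    """
    total = Counter()
    for counts in local_counts:
        total.update(counts)

    result = []
    below = Counter()  # running sum over lower-ranked processors
    for counts in local_counts:
        above = total - below  # own section plus all higher-ranked sections
        result.append((Counter(below), above))
        below.update(counts)
    return result
```

With these starting values, each processor can scan its own section independently and still score split points against the whole data set.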

Split point for Categorical Attributes
• Create global matrix at coordinator
• Compute split-index
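Building the global matrix amounts to an element-wise sum of the per-processor count matrices. The following single-process sketch assumes each local matrix maps an attribute value to a class histogram; `merge_count_matrices` is a hypothetical name, and the communication step that ships local matrices to the coordinator is left out.

```python
from collections import Counter, defaultdict


def merge_count_matrices(local_matrices):
    """Sum local count matrices (value -> class histogram) into a global one."""
    global_matrix = defaultdict(Counter)
    for matrix in local_matrices:
        for value, class_counts in matrix.items():
            global_matrix[value].update(class_counts)
    return global_matrix
```

The coordinator can then compute the split index from the merged matrix exactly as in the serial case.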

Partitioning
• Collect RIDs of the splitting attribute from processors
• Exchange RIDs

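[Figure: tree node with the split Age < 27.5; nodes labeled 0, 1, 2]

[Figure: attribute list with class histograms Cbelow and Cabove (columns H, L), shown at cursor position 0 and position 3]

[Figure: attribute list and its count matrix, with columns H, L and rows family, sport, truck]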

Breakdown of Response Time

Scaleup of SPRINT

Speedup of SPRINT

Sizeup of SPRINT

Example: Decision Tree
[Figure: decision tree with tests Age < 25 and CarType = sports; leaf labels High, High, Low]
