SPRINT: A Scalable Parallel Classifier for Data Mining
John Shafer, Rakesh Agrawal, Manish Mehta
PATHWAY
• Terms
• Partition Algorithm
• Data Structures
• Performing Split
• Serial SPRINT
• Parallel SPRINT
• Results
Terms
• Training Data Set
• Attributes: Categorical and Continuous
• Class Label
Partition Algorithm
Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate splits on attribute A
    find the best split
    partition S into S1 and S2 using the best split
    call Partition( S1 )
    call Partition( S2 )
}
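The recursion above can be sketched in Python. This is a deliberately naive, in-memory illustration, not the SPRINT implementation: it handles numeric attributes only and re-sorts values at every node, which is exactly the cost the attribute lists on the following slides are meant to avoid. The names `gini`, `best_split`, and `partition`, and the dictionary-based tree, are assumptions made for this sketch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every midpoint between distinct values of every numeric attribute.

    Returns (weighted_gini, attribute_index, threshold), or None when no
    attribute has two distinct values.
    """
    n = len(rows)
    best = None
    for a in range(len(rows[0])):
        values = sorted({row[a] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[a] < t]
            right = [y for row, y in zip(rows, labels) if row[a] >= t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best


def partition(rows, labels):
    """Recursive Partition(S); assumes rows is non-empty."""
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}          # all points in the same class
    split = best_split(rows, labels)
    if split is None:                        # identical rows, mixed labels
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[a] < t]
    right = [(row, y) for row, y in zip(rows, labels) if row[a] >= t]
    return {
        "attr": a,
        "threshold": t,
        "left": partition([r for r, _ in left], [y for _, y in left]),
        "right": partition([r for r, _ in right], [y for _, y in right]),
    }
```

Calling `partition(rows, labels)` with a list of numeric tuples and a parallel list of class labels returns a nested dictionary whose inner nodes hold the test `row[attr] < threshold`.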
Data Structures
• Attribute Lists
• Histograms: Continuous and Categorical
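Finding Split Point
• Gini of a data set: $\mathrm{Gini}(S) = 1 - \sum_j p_j^2$, where $p_j$ is the relative frequency of class $j$ in $S$
• Gini index of a split of $S$ into $S_1$ and $S_2$: $\mathrm{GiniIndex}(S) = \frac{n_1}{n}\,\mathrm{Gini}(S_1) + \frac{n_2}{n}\,\mathrm{Gini}(S_2)$, where $n_1$ and $n_2$ are the sizes of $S_1$ and $S_2$ and $n = n_1 + n_2$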
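As a quick check of the two formulas, here is a minimal Python sketch. The function names `gini` and `gini_index` are chosen for illustration and do not come from the slides.

```python
from collections import Counter


def gini(labels):
    """Gini(S) = 1 - sum_j p_j^2, with p_j the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_index(s1, s2):
    """Weighted Gini of a binary split of S into s1 and s2."""
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    return (n1 / n) * gini(s1) + (n2 / n) * gini(s2)
```

A pure partition scores 0, and a two-class partition with equal counts scores 0.5, so lower values indicate a better split.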
Split on Continuous Attributes
• Threshold value: Cabove and Cbelow
• Sorted Once and Sequential Scan
• Deallocation of Cabove and Cbelow
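The sequential scan can be sketched as follows, assuming an attribute list is a Python list of `(value, class_label, rid)` tuples already sorted by value, and that candidate thresholds are midpoints between consecutive distinct values. `c_below` and `c_above` play the role of the Cbelow and Cabove histograms; the function names are hypothetical.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini computed from a class histogram (Counter of label -> count)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_continuous_split(attr_list):
    """Scan a sorted attribute list once; return (gini_index, threshold)."""
    n = len(attr_list)
    c_below = Counter()                                     # nothing seen yet
    c_above = Counter(label for _, label, _ in attr_list)   # everything ahead
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label, _rid = attr_list[i]
        # Move the current record from "above" to "below" the cursor.
        c_below[label] += 1
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue                  # no threshold separates equal values
        n_below = i + 1
        score = (n_below / n) * gini_from_counts(c_below) \
            + ((n - n_below) / n) * gini_from_counts(c_above)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


# Illustrative toy data (an assumption, chosen to match the Age < 27.5 test).
age_list = [(17, "H", 1), (20, "H", 5), (23, "H", 0),
            (32, "L", 4), (43, "H", 2), (68, "L", 3)]
score, threshold = best_continuous_split(age_list)  # threshold is 27.5
```

Because the list is sorted once up front, each node needs only one pass and two small histograms, which can be discarded as soon as the best threshold is known.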
Split on Categorical Attributes
• Create Count-Matrix
• All subsets of attribute values as possible split points
• Compute Gini Index
• Gini from Count Matrix only
• Memory deallocation
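A minimal sketch of the categorical case, assuming the same `(value, class_label, rid)` attribute-list layout. It builds the count matrix in one pass and then scores every non-empty proper subset of values using only the matrix. The function names are hypothetical, and the exhaustive enumeration shown here grows exponentially with the number of distinct values.

```python
from collections import Counter, defaultdict
from itertools import combinations


def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def build_count_matrix(attr_list):
    """One scan of the attribute list: value -> class histogram."""
    matrix = defaultdict(Counter)
    for value, label, _rid in attr_list:
        matrix[value][label] += 1
    return matrix


def best_categorical_split(matrix):
    """Return (gini_index, subset); records with a value in subset go left."""
    values = sorted(matrix)
    total = Counter()
    for row in matrix.values():
        total.update(row)
    n = sum(total.values())
    best = (float("inf"), None)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left          # histogram of the remaining values
            n1 = sum(left.values())
            score = (n1 / n) * gini_from_counts(left) \
                + ((n - n1) / n) * gini_from_counts(right)
            if score < best[0]:
                best = (score, set(subset))
    return best
```

Once the matrix is built, the attribute list itself is not touched again while evaluating subsets, which is why the Gini index can be computed from the count matrix alone.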
Perform Split and Partitioning
• Select splitting attribute and splitting value
• Create two child nodes and divide data on RIDs
• Optimization using Hashing <RID,child-ptr>
• Optimization depending on number of RIDs
• Partitioned Hashing for large hash-table
• Create new histogram and count-matrix of children
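The RID-based split can be sketched as below, assuming all attribute lists fit in memory and are stored as a dict from attribute name to a list of `(value, class_label, rid)` tuples. A Python dict stands in for the `<RID, child-ptr>` hash table. The function name and the `goes_left` predicate are hypothetical, and partitioned hashing for oversized tables is not shown.

```python
def split_attribute_lists(attr_lists, split_attr, goes_left):
    """Divide every attribute list between two children.

    attr_lists: dict name -> list of (value, class_label, rid)
    split_attr: name of the winning splitting attribute
    goes_left:  predicate on a value of split_attr (the split test)
    Returns (left_lists, right_lists) with the same dict layout.
    """
    # Step 1: scan the splitting attribute's list and record, per RID,
    # which child (0 = left, 1 = right) the record belongs to.
    rid_to_child = {}
    for value, _label, rid in attr_lists[split_attr]:
        rid_to_child[rid] = 0 if goes_left(value) else 1

    # Step 2: scan every list in order and probe the table by RID.
    # Appending in scan order keeps sorted lists sorted in both children.
    children = ({}, {})
    for name, entries in attr_lists.items():
        parts = ([], [])
        for entry in entries:
            parts[rid_to_child[entry[2]]].append(entry)
        children[0][name] = parts[0]
        children[1][name] = parts[1]
    return children
```

For example, `split_attribute_lists(lists, "age", lambda v: v < 27.5)` sends each record to the left child when its age is below the threshold, and the other attributes follow via the RID lookup rather than by re-evaluating the test.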
Parallel SPRINT
• Environment: Shared nothing
• Data placement and workload balancing
• Parallel computation of categorical attribute lists
Repartition of Continuous Attributes
• Global Sort
• Equal re-partitioning
• Relation between Cabove and Cbelow and processor number
• Parallel computation of split index
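One way to read the relation between the histograms and the processor number: after the global sort, each processor holds a contiguous section of the attribute list, so its Cbelow must start with the class counts of all lower-ranked sections, and its Cabove with the counts of its own section plus all higher-ranked ones. The sketch below simulates this on a single machine. The function name is hypothetical, and a real shared-nothing system would exchange the local counts by message passing instead of looping over a list.

```python
from collections import Counter


def initial_histograms(sections):
    """sections[p] is processor p's contiguous, sorted slice of the list.

    Returns a list of (c_below, c_above) starting histograms, one pair
    per processor, in processor order.
    """
    # Each processor counts the classes in its own section.
    local = [Counter(label for _, label, _ in s) for s in sections]
    total = Counter()
    for counts in local:
        total.update(counts)

    result = []
    c_below = Counter()
    for counts in local:
        c_above = total - c_below      # own section plus higher ranks
        result.append((Counter(c_below), c_above))
        c_below.update(counts)         # becomes "below" for the next rank
    return result
```

With these starting values, every processor can scan its own section independently and evaluate candidate thresholds exactly as in the serial scan.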
Split point for Categorical Attributes
• Create global matrix at coordinator
• Compute split-index
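A single-machine sketch of the coordinator step, assuming each processor first builds a count matrix over its local section and the coordinator then sums them cell by cell. The function names are hypothetical, and the communication between processors is not modeled.

```python
from collections import Counter, defaultdict


def local_count_matrix(section):
    """Count matrix over one processor's section: value -> class histogram."""
    matrix = defaultdict(Counter)
    for value, label, _rid in section:
        matrix[value][label] += 1
    return matrix


def merge_count_matrices(local_matrices):
    """Coordinator step: add the local matrices into one global matrix."""
    global_matrix = defaultdict(Counter)
    for matrix in local_matrices:
        for value, row in matrix.items():
            global_matrix[value].update(row)
    return global_matrix
```

The merged matrix has the same shape as the serial count matrix, so the split index can be computed from it without looking at any attribute list again.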
Partitioning
• Collect RIDs of splitting attributes from processors
• Exchange RIDs
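[Figure: split test Age < 27.5, with record positions 0, 1, 2]
[Figure: attribute list with class histograms Cbelow and Cabove (columns H, L) at cursor positions 0 and 3]
[Figure: attribute list and its count matrix (columns H, L; rows family, sport, truck)]
[Figure: Example decision tree with tests Age < 25 and CarType = sports; leaves High, High, Low]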