Scalable Mining For Classification Rules in Relational Databases Min Wang Bala Iyer Jeffrey Scott Vitter Presented by: Nadav Grossaug
Abstract • Problem: the ever-increasing size of training sets • MIND (MINing in Databases) classifier • Can be implemented easily in SQL • Other classifiers need O(N) space in memory • MIND scales well with: • I/O • number of processors
Overview Introduction Algorithm Database Implementation Performance Experimental Results Conclusions
Introduction - Classification Problem DETAIL TABLE → CLASSIFIER Example decision tree:
age <= 30?
  yes → salary <= 62K?
          yes → risky
          no  → safe
  no  → safe
Introduction - Scalability in Classification Importance of scalability: • Use a very large training set – data is not memory-resident. • Number of CPUs – better use of resources.
Introduction - Scalability in Classification Properties of MIND: • Scalable in memory • Scalable in CPUs • Uses SQL • Easy to implement Assumptions: • Attribute values are discrete • We focus on the growth phase (no pruning)
The Algorithm - Data Structure Data in the DETAIL table: DETAIL(attr1, attr2, …, class, leaf_num) • attri = value of the i-th attribute • class = class label • leaf_num = the leaf the record belongs to (computable from the tree built so far)
The Algorithm - gini index S - data set C - number of classes pi - relative frequency of class i in S gini index: gini(S) = 1 - Σ pi² (sum over i = 1..C)
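The formula above is straightforward to compute from class frequencies; a minimal Python sketch (the `gini` helper name is ours, not from the slides):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_i p_i^2, where p_i is the relative
    frequency of class i among the labels in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node has gini 0; an even two-class mix has gini 0.5.
print(gini(["safe", "safe", "safe"]))            # 0.0
print(gini(["safe", "risky", "safe", "risky"]))  # 0.5
```

Lower gini means a purer partition, which is why the algorithm picks the split minimizing it.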
The Algorithm
GrowTree(DETAIL TABLE)
  Initialize tree T and put all records of DETAIL in root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate gini index for each non-STOP leaf at each
        split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it
    partition the records and grow the tree for one more
      level according to best splits
    mark all small or pure leaves as STOP nodes
  return T
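The inner evaluation step can be sketched in memory as follows; this is only an illustration of what the gini evaluation computes per leaf. In MIND itself this loop is expressed as SQL over the DIM/UP/DOWN tables, not as in-memory code, and the record/attribute names here are made up:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, attr):
    """Try every split value v of `attr` (partition: attr <= v vs. attr > v)
    and return (weighted_gini, best_value) for one leaf's records."""
    n = len(records)
    best = (float("inf"), None)
    for v in sorted({r[attr] for r in records}):
        left = [r["class"] for r in records if r[attr] <= v]
        right = [r["class"] for r in records if r[attr] > v]
        g = len(left) / n * gini(left) + len(right) / n * gini(right)
        if g < best[0]:
            best = (g, v)
    return best

records = [
    {"age": 23, "class": "risky"}, {"age": 30, "class": "risky"},
    {"age": 40, "class": "safe"},  {"age": 55, "class": "safe"},
]
print(best_split(records, "age"))  # (0.0, 30): age <= 30 separates perfectly
```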
Database Implementation - Dimension table • For each attribute i and each level of the tree:
INSERT INTO DIMi
SELECT leaf_num, class, attri, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri
Size of DIMi = #leaves × #distinct values of attri × #classes
Database Implementation - Dimension table SQL
INSERT INTO DIM1
SELECT leaf_num, class, attr1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2
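The dimension-table step can be reproduced with an in-memory SQLite database. The schema and sample rows below are illustrative (one attribute only), and encoding STOP leaves as leaf_num = -1 is our assumption, not something the slides specify:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# DETAIL(attr1, class, leaf_num) -- a single attribute, for brevity.
cur.execute("CREATE TABLE DETAIL (attr1 INTEGER, class TEXT, leaf_num INTEGER)")
rows = [(25, "risky", 0), (28, "risky", 0), (40, "safe", 0), (50, "safe", 0)]
cur.executemany("INSERT INTO DETAIL VALUES (?, ?, ?)", rows)

# Dimension table for attribute 1: class histogram per (leaf, value).
# STOP leaves are assumed to carry leaf_num = -1.
cur.execute("""
    CREATE TABLE DIM1 AS
    SELECT leaf_num, class, attr1, COUNT(*) AS count
    FROM DETAIL
    WHERE leaf_num <> -1
    GROUP BY leaf_num, class, attr1
""")
print(cur.execute("SELECT * FROM DIM1 ORDER BY attr1").fetchall())
# [(0, 'risky', 25, 1), (0, 'risky', 28, 1), (0, 'safe', 40, 1), (0, 'safe', 50, 1)]
```

Note that DIM1 is tiny relative to DETAIL: one row per (leaf, class, value) combination, matching the size bound on the slide.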
Database Implementation - UP/DOWN - split For each attribute we find all possible split points:
INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
 AND d2.attri <= d1.attri
 AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class
(DOWN is built the same way with d2.attri > d1.attri.)
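The self-join can be demonstrated in SQLite with the DIM1 rows from the previous example. For brevity this sketch uses a plain inner join, so it omits the FULL OUTER JOIN handling of (class, value) combinations absent from DIMi; the data is again illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, count INTEGER)")
cur.executemany("INSERT INTO DIM1 VALUES (?, ?, ?, ?)",
                [(0, "risky", 25, 1), (0, "risky", 28, 1),
                 (0, "safe", 40, 1), (0, "safe", 50, 1)])

# UP(leaf, value, class): number of records of that class with
# attr1 <= value, i.e. the class histogram of the left partition.
cur.execute("""
    CREATE TABLE UP AS
    SELECT d1.leaf_num, d1.attr1, d1.class, SUM(d2.count) AS count
    FROM DIM1 d1, DIM1 d2
    WHERE d1.leaf_num = d2.leaf_num
      AND d2.attr1 <= d1.attr1
      AND d1.class = d2.class
    GROUP BY d1.leaf_num, d1.attr1, d1.class
""")
print(cur.execute(
    "SELECT attr1, class, count FROM UP ORDER BY attr1, class").fetchall())
# [(25, 'risky', 1), (28, 'risky', 2), (40, 'safe', 1), (50, 'safe', 2)]
```

The counts are cumulative in attr1, which is exactly what the gini computation needs for every candidate split point at once.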
Database Implementation - Class View Create a view for each class k and attribute i:
CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k
Database Implementation - GINI VALUE Create a view of all gini values:
CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, fgini(u1.count, .., uc.count, d1.count, .., dc.count)
FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
WHERE u1.attri = .. = uc.attri = .. = dc.attri
  AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
(fgini denotes the gini expression computed from the class counts.)
Database Implementation - MIN GINI VALUE Create a table of minimum gini values for attribute i:
INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini)
                FROM GINI_VALUE b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini)
                FROM MIN_GINI b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - Partitioning Build new nodes by splitting old nodes according to the BEST_SPLIT values. Assign each record to its new node: leaf_num is recomputed by a user function, so there is no need to UPDATE the data or the DB.
Performance I/O cost of MIND: [formula given as a figure] I/O cost of SPRINT: [formula given as a figure]
Experimental Results • Normalized time to finish building the tree • Normalized time to build the tree per example • Normalized time to build the tree per number of processors • Time to build the tree vs. training-set size
Conclusions • MIND works over a DB • MIND works well because: • MIND rephrases classification as a DB problem • MIND avoids UPDATEs to the DETAIL table • Parallelism and scaling are achieved through the RDBMS • MIND uses a user function to gain performance in the DIMi creation