Scalable Mining For Classification Rules in Relational Databases Min Wang Bala Iyer Jeffrey Scott Vitter Presented by: Nadav Grossaug
Abstract • Problem: the ever-increasing size of training sets • MIND (MINing in Databases) classifier • Can be implemented easily in SQL • Other classifiers need O(N) space in memory • MIND scales well with: • I/O • number of processors
Overview Introduction Algorithm Database Implementation Performance Experimental Results Conclusions
Introduction - Classification Problem DETAIL TABLE → CLASSIFIER Example decision tree:
age <= 30?
  yes → salary <= 62K?
          yes → risky
          no  → safe
  no  → safe
Introduction - Scalability in Classification Importance of scalability: • Use a very large training set – data is not memory-resident. • Number of CPUs – better use of resources.
Introduction - Scalability in Classification Properties of MIND: • Scalable in memory • Scalable in CPUs • Uses SQL • Easy to implement Assumptions: • Attribute values are discrete • We focus on the growth phase (no pruning)
The Algorithm - Data Structure Data in the DETAIL table: DETAIL(attr1, attr2, …, class, leaf_num) • attri = value of the i-th attribute • class = class label • leaf_num = the leaf the record belongs to (computable from the tree built so far)
The Algorithm - gini index S - data set C - number of classes pi - relative frequency of class i in S gini index: gini(S) = 1 - Σ pi² (sum over i = 1..C)
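The formula above is straightforward to compute from class frequencies; a minimal Python sketch (the `gini` helper name is ours, not from the slides):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_i p_i^2, where p_i is the relative
    frequency of class i among the labels in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node has gini 0; an even two-class mix has gini 0.5.
print(gini(["safe", "safe", "safe"]))            # 0.0
print(gini(["safe", "risky", "safe", "risky"]))  # 0.5
```

Lower gini means a purer partition, which is why the algorithm picks the split minimizing it.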
The Algorithm
GrowTree(DETAIL TABLE)
  Initialize tree T and put all records of DETAIL in root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate gini index for each non-STOP leaf at each
        split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it
    partition the records and grow the tree for one more
      level according to best splits
    mark all small or pure leaves as STOP nodes
  return T
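The inner evaluation step can be sketched in memory as follows; this is only an illustration of what the gini evaluation computes per leaf. In MIND itself this loop is expressed as SQL over the DIM/UP/DOWN tables, not as in-memory code, and the record/attribute names here are made up:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, attr):
    """Try every split value v of `attr` (partition: attr <= v vs. attr > v)
    and return (weighted_gini, best_value) for one leaf's records."""
    n = len(records)
    best = (float("inf"), None)
    for v in sorted({r[attr] for r in records}):
        left = [r["class"] for r in records if r[attr] <= v]
        right = [r["class"] for r in records if r[attr] > v]
        g = len(left) / n * gini(left) + len(right) / n * gini(right)
        if g < best[0]:
            best = (g, v)
    return best

records = [
    {"age": 23, "class": "risky"}, {"age": 30, "class": "risky"},
    {"age": 40, "class": "safe"},  {"age": 55, "class": "safe"},
]
print(best_split(records, "age"))  # (0.0, 30): age <= 30 separates perfectly
```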
Database Implementation - Dimension table • For each attribute i and each level of the tree:
INSERT INTO DIMi
SELECT leaf_num, class, attri, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attri
Size of DIMi = #leaves × #distinct values of attri × #classes
Database Implementation - Dimension table SQL
INSERT INTO DIM1
SELECT leaf_num, class, attr1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr1

INSERT INTO DIM2
SELECT leaf_num, class, attr2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr2
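The dimension-table step can be reproduced with an in-memory SQLite database. The schema and sample rows below are illustrative (one attribute only), and encoding STOP leaves as leaf_num = -1 is our assumption, not something the slides specify:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# DETAIL(attr1, class, leaf_num) -- a single attribute, for brevity.
cur.execute("CREATE TABLE DETAIL (attr1 INTEGER, class TEXT, leaf_num INTEGER)")
rows = [(25, "risky", 0), (28, "risky", 0), (40, "safe", 0), (50, "safe", 0)]
cur.executemany("INSERT INTO DETAIL VALUES (?, ?, ?)", rows)

# Dimension table for attribute 1: class histogram per (leaf, value).
# STOP leaves are assumed to carry leaf_num = -1.
cur.execute("""
    CREATE TABLE DIM1 AS
    SELECT leaf_num, class, attr1, COUNT(*) AS count
    FROM DETAIL
    WHERE leaf_num <> -1
    GROUP BY leaf_num, class, attr1
""")
print(cur.execute("SELECT * FROM DIM1 ORDER BY attr1").fetchall())
# [(0, 'risky', 25, 1), (0, 'risky', 28, 1), (0, 'safe', 40, 1), (0, 'safe', 50, 1)]
```

Note that DIM1 is tiny relative to DETAIL: one row per (leaf, class, value) combination, matching the size bound on the slide.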
Database Implementation - UP/DOWN - split For each attribute we find all possible split points:
INSERT INTO UP
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM DIMi d1 FULL OUTER JOIN DIMi d2
  ON d1.leaf_num = d2.leaf_num
 AND d2.attri <= d1.attri
 AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attri, d1.class
(DOWN is built the same way with d2.attri > d1.attri.)
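The self-join can be demonstrated in SQLite with the DIM1 rows from the previous example. For brevity this sketch uses a plain inner join, so it omits the FULL OUTER JOIN handling of (class, value) combinations absent from DIMi; the data is again illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE DIM1 (leaf_num INTEGER, class TEXT, attr1 INTEGER, count INTEGER)")
cur.executemany("INSERT INTO DIM1 VALUES (?, ?, ?, ?)",
                [(0, "risky", 25, 1), (0, "risky", 28, 1),
                 (0, "safe", 40, 1), (0, "safe", 50, 1)])

# UP(leaf, value, class): number of records of that class with
# attr1 <= value, i.e. the class histogram of the left partition.
cur.execute("""
    CREATE TABLE UP AS
    SELECT d1.leaf_num, d1.attr1, d1.class, SUM(d2.count) AS count
    FROM DIM1 d1, DIM1 d2
    WHERE d1.leaf_num = d2.leaf_num
      AND d2.attr1 <= d1.attr1
      AND d1.class = d2.class
    GROUP BY d1.leaf_num, d1.attr1, d1.class
""")
print(cur.execute(
    "SELECT attr1, class, count FROM UP ORDER BY attr1, class").fetchall())
# [(25, 'risky', 1), (28, 'risky', 2), (40, 'safe', 1), (50, 'safe', 2)]
```

The counts are cumulative in attr1, which is exactly what the gini computation needs for every candidate split point at once.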
Database Implementation - Class View Create a view for each class k and attribute i:
CREATE VIEW Ck_UP(leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM UP
WHERE class = k
Database Implementation - GINI VALUE Create a view of all gini values:
CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri, fgini(u1.count, .., uc.count, d1.count, .., dc.count)
FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
WHERE u1.attri = .. = uc.attri = .. = dc.attri
  AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
(fgini denotes the gini expression computed from the class counts.)
Database Implementation - MIN GINI VALUE Create a table of minimum gini values for attribute i:
INSERT INTO MIN_GINI
SELECT leaf_num, i, attri, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini)
                FROM GINI_VALUE b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini)
                FROM MIN_GINI b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - Partitioning Build new nodes by splitting old nodes according to the BEST_SPLIT values. Assign each record to its new node: leaf_num is recomputed by a user function, so there is no need to UPDATE the data or the DB.
Performance I/O cost of MIND: [formula given as a figure] I/O cost of SPRINT: [formula given as a figure]
Experimental Results • Normalized time to finish building the tree • Normalized time to build the tree per example • Normalized time to build the tree per number of processors • Time to build the tree vs. training-set size
Conclusions • MIND works over a DB • MIND works well because: • MIND rephrases classification as a DB problem • MIND avoids UPDATEs to the DETAIL table • Parallelism and scaling are achieved through the RDBMS • MIND uses a user function to gain performance in the DIMi creation