Implementation of GUHA decision trees – initial remarks Martin Ralbovský KIZI FIS VŠE 6.12.2007
The GUHA method • Provides a general framework for retrieving interesting information from data • Strong foundations in logic and statistics • One of the main principles of the method is to provide “everything interesting” to the user.
Decision trees • One of the best-known classification methods • Several algorithms exist for constructing decision trees (ID3, C4.5…) • Algorithm outline: iterate through the attributes, in each step choose the best attribute for branching and create a node from it • Only the single best decision tree appears in the output
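For illustration, a minimal sketch of the “choose the best attribute” step used by classical algorithms such as ID3, written in C# to match the pseudocode style used later in these slides; the data representation and every name below are assumptions of this sketch, not Ferda code.

// Illustrative sketch (not Ferda code): ID3-style choice of the single best
// attribute by entropy-based information gain over categorical data.
using System;
using System.Collections.Generic;
using System.Linq;

static class BestAttribute
{
    // Shannon entropy of a class-label distribution.
    static double Entropy(IEnumerable<string> labels)
    {
        double[] counts = labels.GroupBy(l => l).Select(g => (double)g.Count()).ToArray();
        double total = counts.Sum();
        return -counts.Sum(c => (c / total) * Math.Log(c / total, 2.0));
    }

    // Returns the index of the attribute with the highest information gain.
    public static int Choose(List<(string[] values, string label)> rows, int attributeCount)
    {
        double baseEntropy = Entropy(rows.Select(r => r.label));
        int best = 0;
        double bestGain = double.NegativeInfinity;
        for (int a = 0; a < attributeCount; a++)
        {
            // Weighted entropy of the partition induced by attribute a.
            double remainder = rows
                .GroupBy(r => r.values[a])
                .Sum(g => ((double)g.Count() / rows.Count) * Entropy(g.Select(r => r.label)));
            double gain = baseEntropy - remainder;
            if (gain > bestGain) { bestGain = gain; best = a; }
        }
        return best;
    }
}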
Making decision trees GUHA (Petr Berka) • A decision tree can be viewed as a GUHA verification/hypothesis • But classical algorithms put only one tree in the output • Modification of the initial algorithm – the ETree procedure • We do not branch according to the best attribute, but according to the n best attributes • In each iteration, nodes suitable for branching are selected from the existing trees and branched • Only sound decision trees go to the output
ETree parameters (Petr Berka) • Criterion for attribute ordering – χ² • Trees: maximal tree depth (parameter), allow only full-length trees, number of attributes for branching • Branching: minimal node frequency, minimal node purity, stopping branching criterion (frequency, purity, frequency OR purity) • Sound trees: confusion matrix, F-measure + any 4ft-quantifier in Ferda
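To make the parameter list above concrete, a purely illustrative C# sketch of a structure holding the ETree parameters and of the stopping-criterion test; the type and member names are assumptions, not the actual Ferda API.

// Illustrative sketch: the ETree parameters listed on this slide and one possible
// reading of the stopping criterion for a node.
enum StoppingCriterion { Frequency, Purity, FrequencyOrPurity }

class EtreeParameters
{
    public int MaximalTreeDepth;         // maximal tree depth (parameter)
    public bool OnlyFullLengthTrees;     // allow only full-length trees
    public int AttributesForBranching;   // number of (best) attributes used for branching
    public int MinimalNodeFrequency;     // minimal number of rows in a node
    public double MinimalNodePurity;     // minimal share of the majority class in a node
    public StoppingCriterion Stop;
}

static class Stopping
{
    // True when the node should not be branched any further.
    public static bool StopBranching(EtreeParameters p, int nodeFrequency, double nodePurity)
    {
        bool frequencyTooLow = nodeFrequency < p.MinimalNodeFrequency;
        bool pureEnough = nodePurity >= p.MinimalNodePurity;
        switch (p.Stop)
        {
            case StoppingCriterion.Frequency: return frequencyTooLow;
            case StoppingCriterion.Purity: return pureEnough;
            default: return frequencyTooLow || pureEnough;
        }
    }
}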
How to branch I • Attribute1 = {A, B, C}, Attribute2 = {1, 2} • [diagram comparing branching variants a) and b) on the categories A, B, C of Attribute1]
How to branch II • Attribute1 = {A, B, C}, Attribute2 = {1, 2} • [diagram comparing branching variants a) and b): extending the Attribute1 nodes with the categories 1, 2 of Attribute2]
Pseudocode algorithm
Stack<Tree> stack = new Stack<Tree>();   // LIFO stack of candidate trees
stack.Push(MakeSeedTree());
while (stack.Count > 0)
{
    Tree processTree = stack.Pop();
    foreach (Node n in NodesForBranching(processTree))
    {
        stack.Push(CreateTree(processTree, n));
    }
    if (QualityTree(processTree))
    {
        PutToOutput(processTree);
    }
}
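A hedged sketch of what the NodesForBranching helper from the pseudocode could look like, reusing the EtreeParameters and Stopping sketch above; the Tree and Node members are assumptions made only for this illustration.

// Illustrative sketch: one possible reading of "nodes suitable for branching" -
// leaves that have not reached the maximal depth and for which the stopping
// criterion does not yet apply.
using System.Collections.Generic;

class Node { public int Depth; public int Frequency; public double Purity; }
class Tree { public List<Node> Leaves = new List<Node>(); }

static class Branching
{
    public static IEnumerable<Node> NodesForBranching(Tree tree, EtreeParameters p)
    {
        foreach (Node leaf in tree.Leaves)
        {
            if (leaf.Depth >= p.MaximalTreeDepth) continue;                       // already at maximal depth
            if (Stopping.StopBranching(p, leaf.Frequency, leaf.Purity)) continue; // stopping criterion reached
            yield return leaf;
        }
    }
}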
Implementation in Ferda • Instead of creating a new DM tool, the modularity of Ferda was used • Existing data preparation boxes • 4ft-quantifiers can be used to measure the quality of trees • MiningProcessor (the bit string generation engine) is reused
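As an example of a tree-quality measure computed from the confusion matrix (the F-measure mentioned among the ETree parameters), a short illustrative C# function; the signature is an assumption, not the Ferda 4ft-quantifier interface.

// Illustrative sketch: F-measure of a two-class confusion matrix as the harmonic
// mean of precision and recall.
static class Quality
{
    public static double FMeasure(long truePositives, long falsePositives, long falseNegatives)
    {
        if (truePositives == 0) return 0.0;   // avoids division by zero
        double precision = truePositives / (double)(truePositives + falsePositives);
        double recall = truePositives / (double)(truePositives + falseNegatives);
        return 2.0 * precision * recall / (precision + recall);
    }
}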
ETree task example • [screenshot of a Ferda task setting: existing data preparation boxes feeding the ETree task box, with 4ft-quantifiers attached]
Experiment 1 – Barbora • Barbora bank, approx. 6,100 clients, classification of client status from: loan amount, client district, loan duration, client salary • Number of attributes for branching = 4 • Minimal node purity = 0.8 • Minimal node frequency = 61 (1% of the data)
Results – Barbora • Performance: 36 verifications/sec
Experiment 2: Forest tree cover • UCI KDD dataset for classification (10K sample) • Classification of tree cover based on characteristics: wilderness area, elevation, slope, horizontal + vertical distance to hydrology, horizontal distance to fire point • Number of attributes for branching: 1, 3, 5 • Minimal node purity: 0.8 • Minimal node frequency: 100 (1% of the sample)
Results – Forest tree cover • Attributes for branching: 1 – performance: 39 VPS • Attributes for branching: 3 – performance: 86 VPS • Attributes for branching: 5 – performance: 71 VPS
Experiment 3: Forest tree cover • Construction of trees for the whole dataset (approx. 600K rows) • Does increasing the number of attributes for branching result in better trees? • Tree length = 3, the other parameters same as in experiment 2 • Number of attributes for branching = 1: best hypothesis 0.30, 6 VPS (bit strings in cache) • Number of attributes for branching = 4: best hypothesis 0.52, 2 VPS (bit strings in cache)
Verifications: 4FT vs. ETree • On tasks with similar data table length: 4FT (in Ferda) approx. 5,000 VPS, ETree about 70 VPS • The ETree verification is far more complicated: in addition to computing the quantifier, χ² is computed for each node suitable for branching • Hard operations (sums) instead of easy operations (conjunctions…) • Not only verification of a tree, but also construction of the trees derived from it
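A sketch illustrating the cost difference: a 4ft verification is roughly a bit-string conjunction plus a bit count, whereas ETree additionally needs a χ² statistic over a contingency table (sums) for every node suitable for branching. The bit-string layout and all names are assumptions of this sketch.

// Illustrative sketch of the two kinds of operations compared on this slide.
using System;
using System.Numerics;

static class Costs
{
    // Cheap 4ft-style operation: AND two bit strings and count the set bits.
    public static long ConjunctionCount(ulong[] a, ulong[] b)
    {
        long count = 0;
        for (int i = 0; i < a.Length; i++)
            count += BitOperations.PopCount(a[i] & b[i]);
        return count;
    }

    // Expensive ETree-style operation: chi-squared over an R x C contingency table.
    public static double ChiSquared(long[,] observed)
    {
        int rows = observed.GetLength(0), cols = observed.GetLength(1);
        double[] rowSums = new double[rows];
        double[] colSums = new double[cols];
        double total = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                rowSums[r] += observed[r, c];
                colSums[c] += observed[r, c];
                total += observed[r, c];
            }
        double chi2 = 0;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                double expected = rowSums[r] * colSums[c] / total;
                if (expected > 0)
                    chi2 += Math.Pow(observed[r, c] - expected, 2) / expected;
            }
        return chi2;
    }
}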
Further work • How new/known is the method? • Boxes for attribute selection criteria • Classification box • Better result browsing + result reduction • Optimization • Elective classification – each tree has a vote (Petr Berka) • Experiments with various data sources • Decision trees from fuzzy attributes • Better estimation of the number of relevant questions