RainForest is a framework for decision tree classifiers that addresses scalability and quality issues in tree construction algorithms for large databases.
RainForest – A Framework for Fast Decision Tree Construction of Large Datasets. Authors: Johannes Gehrke, Raghu Ramakrishnan, Venkatesh Ganti. Presented by: Xin Li & Omid Rouhani
Outline • The very first impression of the RainForest • Decision tree classifiers • Formal problem definition • Dealing with large databases: Sprint • The RainForest framework • Top-Down Decision Tree Induction Schema • RainForest refinement • Main steps of RainForest algorithms • The RainForest family of algorithms • RF-Write, RF-Read, RF-Hybrid, RF-Vertical • Experimental results • Conclusions
The Very First Impression of RainForest Q: What’s in a rainforest? (Monteverde Costa Rica rainforest)
The Very First Impression of RainForest Q: What’s in a rainforest? A: Trees! All kinds of trees, and they all grow fast in the rainforest! (Monteverde Costa Rica rainforest)
The Very First Impression of RainForest Q: What’s in a rainforest? A: Trees! All kinds of trees, and they all grow fast in the rainforest! Similarly, RainForest is a unifying framework for decision tree classifiers, under which we can independently deal with the scalability issue from the quality issue of the tree construction algorithms. (Monteverde Costa Rica rainforest)
Decision Tree Classifiers (1)-- Formal problem definition • Family = F(n): all records in the database that correspond to node n
Decision Tree Classifiers (1)-- Formal problem definition • Family = F(n), Example 1: split Age > 20 / Age <= 20; F(n) = all entries in the database corresponding to people over 20 years old
Decision Tree Classifiers (1)-- Formal problem definition • Family = F(n), Example 2: at the root, F(n) = the entire database
Decision Tree Classifiers (1)-- Formal problem definition • Splitting criterion = crit(n) • Which attribute to split on • Which values each branch corresponds to • Example: Age > 20 / Age <= 20
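To make the notion of a splitting criterion concrete, here is a minimal Python sketch that scores a candidate split by weighted Gini impurity, one common utility measure (the algorithms discussed in the deck may use information gain or other measures instead). All function names and the toy counts are illustrative, not from the paper.

```python
from collections import Counter

def gini(class_counts):
    """Gini impurity of a node given its class-label counts."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())

def split_impurity(partitions):
    """Weighted impurity of a candidate split: `partitions` is a list of
    Counters (one per branch) mapping class label -> record count."""
    total = sum(sum(p.values()) for p in partitions)
    return sum(sum(p.values()) / total * gini(p) for p in partitions)

# Toy example for a split on Age > 20 vs Age <= 20:
older = Counter({"yes": 8, "no": 2})     # branch Age > 20
younger = Counter({"yes": 1, "no": 9})   # branch Age <= 20
print(split_impurity([older, younger]))  # roughly 0.25; lower is better
```

A pure branch (one class only) has impurity 0, so the best crit(n) is the split minimizing this weighted score.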
Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Basic features of Sprint: • Creates binary trees • Removes any dependence of tree construction on the dataset fitting in main memory • Requires access to F(n) in sorted order (Build diagram: the database is large and disk-resident while main memory is small; Sprint keeps one sorted attribute list per attribute, and after a split such as age > 20 each child partition's attribute lists remain sorted.)
Decision Tree Classifiers (2)-- Dealing with large databases: Sprint • Summary of the Sprint algorithm • Binary splits • Scales well to large databases • With the RainForest framework • We can use other algorithms (C4.5, ID3, FACT, etc.) • These also scale up well to large databases
The RainForest Framework (1)-- Top-down decision tree induction schema • At the root r, examine the database and compute the best crit(r). • Recursively, at a non-root node n, examine F(n) and compute crit(n), until the class label of F(n) can be determined.
The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 1: All existing decision tree algorithms (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, Sprint and QUEST) proceed according to this generic schema. • Top-Down Decision Tree Induction Schema • Input: node n, partition D, classification algorithm CL • Output: decision tree for D rooted at n • BuildTree(Node n, dataPartition D, algorithm CL) • apply CL to D to find crit(n) // Step 1: decide the splitting criterion • let k be the number of children of n • if (k > 0) • Create k children c1, …, ck of n • Use the best split to partition D into D1, …, Dk // Step 2: create the child partitions • for (i = 1; i ≤ k; i++) • BuildTree(ci, Di, CL) // Step 3: recursively build sub-trees • endfor • endif
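The generic schema can be sketched as runnable Python. `OneAttrCL`, `find_best_criterion`, and the record layout are illustrative stand-ins for the pluggable algorithm CL, not APIs from the paper:

```python
from collections import Counter

class Node:
    def __init__(self):
        self.criterion = None   # (attribute index, threshold); None for a leaf
        self.children = []
        self.label = None

def majority_label(records):
    return Counter(label for _, label in records).most_common(1)[0][0]

def build_tree(node, records, cl):
    """Generic top-down schema: `records` is a list of (attributes, label)
    pairs; `cl` supplies the splitting criterion (the pluggable CL)."""
    if not records:
        return
    crit = cl.find_best_criterion(records)      # Step 1: compute crit(n)
    if crit is None:                            # pure partition: make a leaf
        node.label = majority_label(records)
        return
    node.criterion = crit
    attr, threshold = crit
    left = [r for r in records if r[0][attr] <= threshold]
    right = [r for r in records if r[0][attr] > threshold]
    for part in (left, right):                  # Steps 2-3: partition, recurse
        child = Node()
        node.children.append(child)
        build_tree(child, part, cl)

class OneAttrCL:
    """Toy stand-in for CL: splits on attribute 0 at its mean value."""
    def find_best_criterion(self, records):
        if len({label for _, label in records}) <= 1:
            return None
        values = [r[0][0] for r in records]
        return (0, sum(values) / len(values))

root = Node()
data = [((15,), "no"), ((18,), "no"), ((25,), "yes"), ((40,), "yes")]
build_tree(root, data, OneAttrCL())
print(root.criterion)  # (0, 24.5): split on age at the mean
```

Swapping in a real CL (C4.5-style gain, CART's Gini, etc.) changes only `find_best_criterion`; the recursion is shared by all of them, which is exactly Finding 1.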
The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 2: At each node n, the utility of a predictor attribute A as a possible splitting attribute is examined independently of the other attributes: how well can the data be separated using A alone? Example: Utility(Age) = 2, Utility($$) = 1, so crit(n) = bestPartition(Age)
The RainForest Framework (1)-- Top-down decision tree induction schema • Finding 3: At each node n, to compute the utility of a predictor attribute A as a possible splitting attribute, the class label distribution for each distinct value of A is sufficient. This distribution is exactly the AVC-set of A: Class distribution of Age → Utility(Age)
The RainForest Framework (2)-- RainForest refinement • AVC-set of a predictor attribute A at node n: • The projection of F(n) onto A and the class label, whereby the counts of the individual class labels are aggregated • AVC = Attribute-Value, Class-label. From F(n) we build AVC-set($$) and AVC-set(Age)
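A minimal sketch of AVC-set and AVC-group construction in Python, assuming records are dicts with a `class` key; the function names `avc_set` and `avc_group` are illustrative:

```python
from collections import defaultdict

def avc_set(records, attr):
    """AVC-set of predictor attribute `attr` over partition F(n):
    maps attribute value -> {class label: count}, in one pass."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in records:
        counts[row[attr]][row["class"]] += 1
    return {v: dict(c) for v, c in counts.items()}

def avc_group(records, predictors):
    """AVC-group of a node: the AVC-sets of all its predictor attributes."""
    return {a: avc_set(records, a) for a in predictors}

F_n = [
    {"age": 25, "salary": "high", "class": "yes"},
    {"age": 25, "salary": "low",  "class": "no"},
    {"age": 40, "salary": "high", "class": "yes"},
]
print(avc_set(F_n, "age"))  # {25: {'yes': 1, 'no': 1}, 40: {'yes': 1}}
```

Note that the AVC-set keeps only aggregated counts per distinct value, never the records themselves, which is what makes it so much smaller than F(n).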
The RainForest Framework (2)-- RainForest refinement • AVC-group of a node n: the set of all AVC-sets at node n r n1 n2 AVC-group(r) AVC-group(n1) AVC-group(n2) AVC-set(age) AVC-set($$)
The RainForest Framework (2)-- RainForest refinement • Refined RainForest schema: BuildTree(Node n, dataPartition D, algorithm CL) (1a) for each predictor attribute p (1b) call CL.find_best_partitioning(AVC-set of p) (1c) endfor (2a) k = CL.decide_splitting_criterion(); Computing the utility of each attribute p (steps 1a-1c) is now separated from deciding the splitting criterion (step 2a)
The RainForest Framework (2)-- RainForest refinement • Main memory requirement of the generic RainForest tree induction schema is determined by the AVC-sets: • C4.5 with the RainForest schema needs memory proportional to the # of distinct attribute values and the # of class labels in F(n) • Plain C4.5 needs memory proportional to the # of data records in F(n)
The RainForest Framework (2)-- RainForest refinement • Fitting the AVC-group into main memory • For most real-life datasets, AVC-group(r) is expected to fit entirely in main memory • OR, at least each single AVC-set of the root node fits in main memory • The size of the AVC-sets of non-root nodes is bounded by that of the root node • Different algorithms are proposed depending on the amount of available main memory
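Why the AVC-group tends to fit in memory follows from a back-of-the-envelope bound: each AVC-set has at most (# distinct attribute values) x (# class labels) entries, independent of the number of records in F(n). A hypothetical helper making that concrete:

```python
def avc_group_entries(records, predictors, class_attr="class"):
    """Upper bound on AVC-group size, in entries per predictor:
    (# distinct values) x (# class labels). Independent of |F(n)|."""
    n_classes = len({r[class_attr] for r in records})
    return {a: len({r[a] for r in records}) * n_classes for a in predictors}

# 1000 records but only 3 distinct ages and 2 class labels:
F_n = [{"age": a % 3, "class": a % 2} for a in range(1000)]
print(avc_group_entries(F_n, ["age"]))  # {'age': 6} despite 1000 records
```

This is the sense in which RainForest decouples memory from database size: memory scales with attribute cardinality, not record count.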
The RainForest Framework (3)-- Main steps of RainForest algorithms • For each tree node n • Step 1: AVC-group construction • Needs one scan of the data partition • Step 2: choose the splitting attribute and the partitioning criterion on that attribute • Computation is based on the AVC-sets alone • Step 3: partition D across the child nodes • Read and write once, partitioning D into child "buckets" • If memory is sufficient, build AVC-groups for one or more children during the same pass as an optimization
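Step 3 can be sketched as a single read/write pass over the partition; `crit` and `write_bucket` here are illustrative stand-ins (the real algorithms write to disk-resident buckets, not Python lists):

```python
def partition_step(read_records, crit, write_bucket):
    """Step 3: one sequential read of D, one write per record into its
    child bucket. `crit(record)` returns the child index; `write_bucket`
    appends a record to bucket i."""
    for record in read_records:
        write_bucket(crit(record), record)

# In-memory stand-in for the on-disk child buckets:
buckets = {0: [], 1: []}
crit = lambda r: 0 if r["age"] <= 20 else 1
partition_step([{"age": 15}, {"age": 30}], crit,
               lambda i, r: buckets[i].append(r))
print(len(buckets[0]), len(buckets[1]))  # 1 1
```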
Algorithms in RainForest (1)-- Overview • RF-Write, RF-Read and RF-Hybrid • Require that the AVC-group of the root node fits in memory • RF-Vertical • Requires that each single AVC-set of the root node fits in memory
Algorithms in RainForest (1)-- RF-Write • Scan the database and construct the AVC-group for the root (scan 1). • Use the underlying algorithm (for example C4.5 or ID3) to create k partitions (k = 2 for a binary tree). • Scan the database again and write each record to its partition. • Recurse on each child node. Per node, the database is read 2 times and written once.
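A tiny cost model of the I/O the slide describes (two reads and one write of each node's partition), under the simplifying assumption that every level of the tree together processes the whole dataset once; the function name is illustrative:

```python
def rf_write_io(levels):
    """RF-Write I/O in whole-dataset passes: per tree level the data is
    read twice (AVC-group scan + partitioning scan) and written once
    (child buckets), so total passes grow linearly with tree depth."""
    reads = 2 * levels
    writes = 1 * levels
    return reads, writes

print(rf_write_io(3))  # (6, 3): a depth-3 tree costs 6 read + 3 write passes
```

This linear growth in write traffic is what the other family members (RF-Read, RF-Hybrid, RF-Vertical) trade off differently.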
Algorithms in RainForest (2)-- RF-Read • Does not write partitions back to the database. • Only reads from the database.