Tree-based methods, neural networks Lecture 10
Tree-based methods Statistical methods in which the input space (feature space) is partitioned into a set of cuboids (rectangles), and then a simple model is set up in each one
Why decision trees • Compact representation of data • Possibility to predict outcome of new observations
Tree structure • Root • Nodes • Leaves (terminal nodes) • Parent-child relationship • Condition • Label is assigned to a leaf [Figure: a tree whose internal nodes carry conditions Cond.1–Cond.6 and whose leaves are N4–N7]
Example [Figure: decision tree for classifying mammals] • Root node: Body temperature? Warm / Cold; Cold → Non-mammals • Internal node: Gives birth? Yes / No • Leaf nodes: Yes → Mammals, No → Non-mammals
How to build a decision tree: Hunt’s algorithm Proc Hunt(Dt, t) • Given data set Dt = {(X1i, …, Xpi, Yi), i = 1…n} and current node t • If all Yi are equal, mark t as a leaf with label Yi • Otherwise, use the test condition to split Dt into Dt1 … Dtn, create children t1 … tn and run Hunt(Dt1, t1), …, Hunt(Dtn, tn)
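A minimal Python sketch of this recursive procedure (the binary numeric test and the best_split helper are assumptions for illustration, not part of the slides):

```python
from collections import Counter

def hunt(records, labels, best_split):
    """records: list of attribute dicts, labels: parallel list of classes,
    best_split: helper returning an (attribute, value) test or None."""
    # Stopping rule: all Yi are equal -> leaf with that label
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    test = best_split(records, labels)
    if test is None:                      # no useful split (e.g. identical attributes)
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, value = test
    left = [i for i, r in enumerate(records) if r[attr] <= value]
    right = [i for i, r in enumerate(records) if r[attr] > value]
    if not left or not right:             # empty child -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"test": (attr, value),
            "left":  hunt([records[i] for i in left],  [labels[i] for i in left],  best_split),
            "right": hunt([records[i] for i in right], [labels[i] for i in right], best_split)}
```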
Hunt’s algorithm example [Figure: a 20 × 20 region of the (X1, X2) plane partitioned by successive splits – first on X1 (< 9 vs ≥ 9), then on X2 (< 16 vs ≥ 16 and < 7 vs ≥ 7), then on X1 (< 15 vs ≥ 15) – with each resulting rectangle labelled 0 or 1]
Hunt’s algorithm What if some combinations of attribute values are missing? • Empty node: it is assigned the label of the majority class among the records (instances, objects, cases) in its parent node • All records in a node have identical attributes: the node is declared a leaf node with the class label of the majority class of this node
CART: Classification and regression trees Regression trees • Given Dt={(X1i,..Xpi, Yi), i=1..n}, Y – continuous, build a tree that will fit the data best Classification trees • Given Dt={(X1i,..Xpi, Yi), i=1..n}, Y – categorical, build a tree that will classify the observations best
A CART algorithm: Regression trees Aim: want to find the model $f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$ minimizing $\sum_i (y_i - f(x_i))^2$; it is computationally expensive to test all possible partitions. Instead: Splitting variables and split points Consider a splitting variable $j$ and a split point $s$, and define the pair of half-planes $R_1(j,s) = \{X \mid X_j \le s\}$ and $R_2(j,s) = \{X \mid X_j > s\}$. We seek the splitting variable $j$ and split point $s$ that solve $\min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$
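A sketch of this exhaustive search over (j, s), assuming X is an n × p NumPy array and y a length-n vector (illustrative only):

```python
import numpy as np

def best_regression_split(X, y):
    """Exhaustive search for the (j, s) minimising the summed squared error."""
    n, p = X.shape
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(p):                        # candidate splitting variable
        for s in np.unique(X[:, j]):          # candidate split point
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if left.size == 0 or right.size == 0:
                continue
            # c1, c2 are the region means; SSE is the quantity being minimised
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s
```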
Post-pruning How large a tree to grow? Too large – overfitting! Grow a large tree $T_0$, then prune it using post-pruning. Define a subtree $T \subset T_0$ and index its terminal nodes by $m$, with node $m$ representing region $R_m$. Let $|T|$ denote the number of terminal nodes in $T$ and set $C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$, where $N_m = \#\{x_i \in R_m\}$, $\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$ and $Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$. Then minimize this expression, using cross-validation to select the factor $\alpha$ that penalizes complex trees.
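In practice the penalty factor can be tuned with an off-the-shelf implementation; a sketch with scikit-learn, where the synthetic data and parameter choices are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# Candidate alphas come from the pruning path of the fully grown tree T0
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validation selects the alpha that trades fit against tree size |T|
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"ccp_alpha": path.ccp_alphas},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```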
CART: Classification trees • For each node $m$, representing region $R_m$ with $N_m$ observations, define the class proportions $\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$ • Define a measure of node impurity, e.g. the misclassification error $1 - \hat{p}_{mk(m)}$ (where $k(m) = \arg\max_k \hat{p}_{mk}$), the Gini index $\sum_k \hat{p}_{mk}(1 - \hat{p}_{mk})$, or the cross-entropy $-\sum_k \hat{p}_{mk} \log \hat{p}_{mk}$
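For illustration, the three impurity measures written for a vector of class proportions (a NumPy sketch, not code from the course):

```python
import numpy as np

def misclassification_error(p):
    return 1.0 - np.max(p)

def gini(p):
    return float(np.sum(p * (1.0 - p)))      # equivalently 1 - sum(p**2)

def entropy(p):
    p = p[p > 0]                              # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log2(p)))

p = np.array([0.7, 0.3])                      # a node with a 70% / 30% class mix
print(misclassification_error(p), gini(p), entropy(p))
```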
Design issues of decision tree induction • How to split the training records? We need a measure for evaluating the goodness of various test conditions • How to terminate the splitting procedure? 1) Continue expanding nodes until either all the records belong to the same class or all the records have identical attribute values, or 2) define criteria for early termination
How to split: CART Select the split with maximum information gain $\Delta = I(\text{parent}) - \sum_{j} \frac{N(v_j)}{N} I(v_j)$, where $I(\cdot)$ is the impurity measure of a given node, $N$ is the total number of records at the parent node, and $N(v_j)$ is the number of records associated with the child node $v_j$
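A small sketch of this gain computation, here using the Gini index as the impurity measure I(·) (helper names are invented for illustration):

```python
import numpy as np

def proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(p):
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, child_label_lists):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j), with I = Gini index."""
    N = len(parent_labels)
    weighted = sum(len(c) / N * gini(proportions(c)) for c in child_label_lists)
    return gini(proportions(parent_labels)) - weighted

# Splitting a 50/50 parent into two pure children gives the full gain of 0.5
print(information_gain([0, 0, 0, 1, 1, 1], [[0, 0, 0], [1, 1, 1]]))
```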
How to split: C4.5 Impurity measures such as the Gini index tend to favour attributes that have a large number of distinct values Strategy 1: Restrict the test conditions to binary splits only Strategy 2: Use the gain ratio $\Delta_{\text{info}} / \text{SplitInfo}$, with $\text{SplitInfo} = -\sum_j P(v_j)\log_2 P(v_j)$, as the splitting criterion
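A sketch of the gain ratio, assuming entropy as the underlying impurity measure (function names are illustrative):

```python
import numpy as np

def proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio(parent_labels, child_label_lists):
    N = len(parent_labels)
    weights = np.array([len(c) / N for c in child_label_lists])
    gain = entropy(proportions(parent_labels)) - sum(
        w * entropy(proportions(c)) for w, c in zip(weights, child_label_lists))
    split_info = entropy(weights)             # -sum_j P(v_j) log2 P(v_j)
    return gain / split_info if split_info > 0 else 0.0
```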
Constructing decision trees [Figure: example tree for predicting loan default] • Home owner? Yes → Defaulted = No; No → test Marital status • Marital status? Married → Defaulted = No; Not married → test Income • Income? ≤ 100 K vs > 100 K → Defaulted = ?
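A tree of this kind can be grown with scikit-learn; the toy loan records below are invented purely to mirror the slide's attributes:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records with the slide's three attributes
df = pd.DataFrame({
    "home_owner": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "married":    [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
    "income_k":   [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "defaulted":  [0, 0, 0, 0, 1, 0, 0, 1, 0, 1],
})
X, y = df.drop(columns="defaulted"), df["defaulted"]

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))   # text rendering of the tree
```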
Expressing attribute test conditions • Binary attributes: binary splits • Nominal attributes: binary or multiway splits • Ordinal attributes: binary or multiway splits honoring the order of the attribute values • Continuous attributes: binary or multiway splits into disjoint intervals
Characteristics of decision tree induction • Nonparametric approach (no underlying probability model) • Computationally inexpensive techniques have been developed for constructing decision trees. Once a decision tree has been built, classification is extremely fast • The presence of redundant attributes will not adversely affect the accuracy of decision trees • The presence of irrelevant attributes can lower the accuracy of decision trees, especially if no measures are taken to avoid overfitting • At the leaf nodes, the number of records may be too small (data fragmentation)
Neural networks • Joint theoretical framework for prediction and classification
Principal components regression (PCR) Extract principal components (transformations of the inputs) as derived features, and then model the target (response) as a linear function of these features [Diagram: inputs x1, x2, …, xp → derived features z1, z2, …, zM → output y]
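A minimal PCR sketch with scikit-learn (the synthetic data and the choice of M = 3 components are assumptions for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))            # p = 10 inputs
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Step 1: derive the components z_1..z_M; step 2: linear regression of y on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))                         # R^2 on the training data
```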
Neural networks with a single target Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of a sigmoid function of these features [Diagram: inputs x1, x2, …, xp → hidden units z1, z2, …, zM → output y]
Artificial neural networks Introduction from biology: • Neurons • Axons • Dendrites • Synapses Capabilities of neural networks: • Memorization (robust to noisy and fragmentary input!) • Classification
Terminology Feed-forward neural network • Input layer • [Hidden layer(s)] • Output layer [Diagram: inputs x1, x2, …, xp → hidden units z1, z2, …, zM → outputs f1, …, fK]
Terminology • Feed-forward network: nodes in one layer are connected only to the nodes in the next layer • Recurrent network: nodes in one layer may also be connected to nodes in a previous layer or within the same layer
Terminology Formulas for the multilayer perceptron (MLP): hidden units $z_m = \varsigma\big(\alpha_{0m} + \sum_{i=1}^{p} \alpha_{im} x_i\big)$, outputs $f_k = g\big(\beta_{0k} + \sum_{j=1}^{M} \beta_{jk} z_j\big)$ • $C_1$, $C_2$ – combination functions (the weighted sums inside the parentheses) • $g$, $\varsigma$ – activation functions • $\alpha_{0m}$, $\beta_{0k}$ – biases of the hidden and output units • $\alpha_{im}$, $\beta_{jk}$ – weights of the connections
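A NumPy sketch of the forward pass these formulas describe (the shapes, random weights, and softmax output function are assumptions for illustration):

```python
import numpy as np

def sigmoid(v):                        # hidden-unit activation (varsigma)
    return 1.0 / (1.0 + np.exp(-v))

def softmax(t):                        # output activation g for K classes
    e = np.exp(t - t.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
p, M, K = 4, 5, 3                      # inputs, hidden units, outputs
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))   # hidden biases / weights
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))     # output biases / weights

X = rng.normal(size=(8, p))            # a batch of 8 input vectors
Z = sigmoid(alpha0 + X @ alpha.T)      # z_m = sigma(alpha_0m + sum_i alpha_im x_i)
T = beta0 + Z @ beta.T                 # combination: beta_0k + sum_j beta_jk z_j
f = softmax(T)                         # f_k = g(T_k)
print(f.shape, f.sum(axis=1))          # (8, 3); each row sums to 1
```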
Recommended reading Start with: • Tree-based methods: Book, paragraph 9.2; EM Reference: Tree node • Neural networks: Book, paragraph 11; EM Reference: Neural Network node