Learning, in a rather broad sense: improvement of performance on the basis of experience

What is learning? • Learning, in a rather broad sense: improvement of performance on the basis of experience • Machine learning: improve at task T, with respect to performance measure P, based on experience E

Learning from Observations. • There are 3 main types of learning: • Supervised learning – used in environments where an action is followed by immediate feedback • Reinforcement learning – used in environments where feedback on actions is not immediate • Unsupervised learning – used where there isn’t any feedback on actions!

Inductive learning is defined to be the process of learning from pre-classified examples. • Training set T = {e1, e2, ..., en}, where each example ei = (a, o) = (a1 a2 ... am, o) pairs an attribute vector a with an output o • 1. Choose a hypothesis h such that the error of h on T is minimized • 2. Hypothesis "goodness"

Inductive Learning – Supervised Learning • Gather a set of input-output examples from some application: the Training Set, e.g. stock forecasting • Train the learning model (decision tree, neural network, etc.) on the training set until “done” • The goal is to “generalize” on novel data not yet seen • Gather a further set of input-output examples from the same application: the Test Set, in order to validate what the system is doing • Use the learning system on actual data • Formally, given a function f and a set of examples (x, f(x)), produce h such that h approximates f
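The train/test procedure above can be sketched in a few lines. This is a minimal illustration, not a real application: the target function `f`, the one-parameter threshold "model", and the 150/50 split are all assumptions made for the example.

```python
import random

def f(x):
    # Hypothetical target function: label 1 when x exceeds 0.5.
    return 1 if x > 0.5 else 0

def predict(threshold, x):
    # The hypothesis h: a single learned threshold.
    return 1 if x > threshold else 0

def train_threshold(examples):
    # Choose the candidate threshold with the fewest training errors.
    candidates = sorted(x for x, _ in examples)
    def errors(t):
        return sum(predict(t, x) != y for x, y in examples)
    return min(candidates, key=errors)

def accuracy(threshold, examples):
    correct = sum(predict(threshold, x) == y for x, y in examples)
    return correct / len(examples)

rng = random.Random(0)
data = [(x, f(x)) for x in (rng.random() for _ in range(200))]
train_set, test_set = data[:150], data[150:]  # assumed split sizes

t = train_threshold(train_set)
print("train accuracy:", accuracy(t, train_set))
print("test accuracy:", accuracy(t, test_set))
```

The test set is never seen during training, so its accuracy is an estimate of how well h approximates f on novel data.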

Motivation • Cost and Errors in Programming A Solution • Domain knowledge limited - financial forecasting • Encoding/extracting of domain knowledge may be expensive • Augment existing domain knowledge • Adaptability • General, easy-to-use mechanism for a large set of applications • Do better than current approaches

Learn This!

Which?

What explanations do we prefer? Common Biases in learning • Minimize error on known examples • Information gain • Ockham’s Razor – prefer the simplest hypothesis that describes the data (MML, MDL). • Without bias, you cannot learn! • Bias influences what you will learn. Sometimes these biases are inherent in the basic learning algorithm you choose, sometimes they are implicit in the error function you are using. • So which biases are the “good” biases? • The conservation law of generalization performance [Schaffer, 1994] proves that no learning bias (algorithm) can outperform any other bias (algorithm) over the space of all possible learning tasks

Decision Trees • Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no answer. • Typically, each internal node in the tree corresponds to a test on a single attribute. However, you could have more complicated tests than that! And there are models that split on more than just a single attribute. The test need only split the data reaching the node in some way. • The branches emanating from a decision node are labeled with the possible values of the test for that branch. • Leaf nodes are where classification takes place. Leaf nodes are labeled with the Boolean value that should be returned if the node is reached (yes/no). Could also label with a probability.
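One way to picture the structure described above is as nested data. The sketch below is illustrative only: the tuple-and-dict representation and the attribute names `outlook` and `windy` are assumptions, not part of the slides.

```python
# A node is either a leaf (a Boolean answer) or a tuple
# (attribute, {attribute value: subtree}).
tree = ("outlook", {
    "sunny": ("windy", {True: False, False: True}),
    "overcast": True,
    "rainy": False,
})

def classify(node, example):
    # Follow the branch labeled with the example's value for the tested
    # attribute until a leaf is reached.
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(tree, {"outlook": "sunny", "windy": False}))
```

Each internal node tests a single attribute here; a leaf could just as well hold a probability instead of a yes/no value.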

Decision Tree Expressiveness ·Any Boolean function can be represented by a decision tree. ·Each tree represents a disjunction of conjunctions of constraints on the attribute values of instances ·In the worst case, the size of the tree needs to be exponential in the number of variables – i.e. a full-on representation of the truth table. Example – parity. ·Can we do better than that? Is there a representation for Boolean functions that has a worst case performance better than a decision tree? Wouldn’t it be great if there was?
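The parity example can be checked by brute force. The sketch below (helper names are hypothetical) tests whether fixing a subset of the variables ever pins down the parity output. For any proper subset it does not, so every root-to-leaf path of a single-attribute-test tree must test all n variables, which gives 2^n leaves.

```python
from itertools import product

def parity(bits):
    return sum(bits) % 2

def determined_by_subset(n, fixed_positions):
    """True if every assignment to fixed_positions determines parity."""
    free = [i for i in range(n) if i not in fixed_positions]
    for fixed_values in product([0, 1], repeat=len(fixed_positions)):
        outputs = set()
        for free_values in product([0, 1], repeat=len(free)):
            bits = [0] * n
            for i, v in zip(fixed_positions, fixed_values):
                bits[i] = v
            for i, v in zip(free, free_values):
                bits[i] = v
            outputs.add(parity(bits))
        if len(outputs) > 1:
            # Both outputs are still possible: a leaf here would be wrong
            # for some example.
            return False
    return True

print(determined_by_subset(3, [0, 1]))     # two of three variables fixed
print(determined_by_subset(3, [0, 1, 2]))  # all variables fixed
```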

·The problem is, there are a whole lot of Boolean functions on n variables. ·A truth table with n variables has 2^n rows, 1 row for each possible truth setting of the variables. ·For each row, the output of the function can be either a 0 or a 1. ·So, we have 2^n bits that can be set to either 0 or 1. ·That means there are 2^(2^n) possible Boolean functions on n variables.
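The counting argument is easy to confirm for small n. A minimal sketch (function names are hypothetical) that computes the count and also enumerates every truth table explicitly:

```python
from itertools import product

def count_boolean_functions(n):
    rows = 2 ** n       # one truth-table row per setting of the n variables
    return 2 ** rows    # each row's output is independently 0 or 1

def enumerate_boolean_functions(n):
    # Yield every Boolean function on n variables as a truth table
    # mapping input tuples to an output bit.
    inputs = list(product([0, 1], repeat=n))
    for outputs in product([0, 1], repeat=len(inputs)):
        yield dict(zip(inputs, outputs))

for n in range(1, 5):
    print(n, count_boolean_functions(n))
```

Already at n = 5 there are 2^32 functions, so no representation can give every one of them a short description.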

ID3/C4.5 ·Top down induction of decision trees ·Non-incremental, though there are incremental extensions ·Highly used and successful ·Attributes (features) are discrete, and the output is also discrete ·Search for the smallest tree is too complex ·Use a greedy iterative approach

ID3 Learning Approach ·C is a set of examples ·A test on attribute A partitions C into {C1, C2,...,Cw} where w is number of states of A ·First find good A for root - Attribute which is "most important" ·Continue recursively until the training set is unambiguously classified or there are no more "relevant" features - C4.5 actually expands and then prunes back statistically irrelevant partitions afterwards

Choosing variables to split on - what should our Bias be? ·Bias: Ockham’s razor – find the smallest possible tree ·Roughly, we could accomplish this if we try to minimize depth of tree ·How? Many different ways. How about by picking an attribute that maximizes classification accuracy for that step (Greedy approach)? If the first attribute classifies everything correctly, we’re done (depth=1)!

Information Theory ·How much information do you need to be given in order to answer a yes/no question? 1 bit. ·How much information do you need to answer a yes/no question if you know that the yes answer has probability 1? 0 bits. You already know the answer. ·So, a yes/no question where each answer has probability 0.5 requires 1 bit of information to answer. ·And, a yes/no question where one answer has probability 1 requires 0 bits.
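These two cases are the endpoints of the standard Shannon entropy, H = -Σ p_i log2(p_i), which measures the information needed in bits. A minimal sketch:

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits; terms with p = 0 contribute nothing.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # both answers equally likely
print(entropy([1.0, 0.0]))  # the answer is already known
print(entropy([0.9, 0.1]))  # somewhere in between
```

Any other split of the probabilities falls between 0 and 1 bit, which is what lets information gain rank attributes.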

ID3 Learning Algorithm
1. S = training set
2. Calculate the gain for each remaining attribute
3. Select the attribute with the highest gain and create a new node for each partition
4. For each partition:
   - if it contains one class, stop
   - else if it contains more than one class, go to 2 with the remaining attributes
   - else if it is empty, label it with the most common class of the parent (or set it to null)
5. If the attributes are exhausted (this will only happen for an inconsistent S), label with the majority class
·Attributes which best discriminate between classes are chosen
·If the same class ratios are found in each partitioned set, then the gain is 0
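The steps above can be sketched as a short recursive function. This is an illustrative version under assumptions, not Quinlan's implementation: examples are (attribute dict, label) pairs, the toy weather data is invented, and branches are only created for attribute values that actually occur, so the empty-partition case of step 4 never arises here.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def partition(examples, attribute):
    # Split the examples by their value for the given attribute.
    parts = {}
    for attrs, label in examples:
        parts.setdefault(attrs[attribute], []).append((attrs, label))
    return parts

def gain(examples, attribute):
    # Information gain = entropy before the split minus the
    # size-weighted entropy of the partitions.
    remainder = sum(
        len(part) / len(examples) * entropy([label for _, label in part])
        for part in partition(examples, attribute).values()
    )
    return entropy([label for _, label in examples]) - remainder

def majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def id3(examples, attributes):
    labels = {label for _, label in examples}
    if len(labels) == 1:            # step 4: one class, stop
        return labels.pop()
    if not attributes:              # step 5: attributes exhausted
        return majority(examples)
    best = max(attributes, key=lambda a: gain(examples, a))  # steps 2-3
    remaining = [a for a in attributes if a != best]
    return (best, {value: id3(part, remaining)
                   for value, part in partition(examples, best).items()})

# Invented toy data for illustration.
examples = [
    ({"outlook": "sunny", "windy": False}, "yes"),
    ({"outlook": "sunny", "windy": True}, "no"),
    ({"outlook": "rainy", "windy": False}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast", "windy": False}, "yes"),
    ({"outlook": "overcast", "windy": True}, "yes"),
]
tree = id3(examples, ["outlook", "windy"])
print(tree)
```

On this data `outlook` has the higher gain, so it becomes the root, and only the mixed `sunny` partition needs a second test.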

Over-fitting • Definition? • Over-fitting: hypothesis h1 over-fits if it fits the training data better than some alternative hypothesis h2, but performs worse than h2 on new data • How do we avoid it?
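The definition translates directly into a check on error rates. In this sketch the data is invented: the assumed true rule is "x >= 4", two training labels are noisy, h1 memorizes the training set, and h2 is the simple threshold rule.

```python
def error_rate(h, examples):
    return sum(h(x) != y for x, y in examples) / len(examples)

def overfits(h1, h2, training, new):
    # h1 over-fits relative to h2: better on training data, worse on new data.
    return (error_rate(h1, training) < error_rate(h2, training)
            and error_rate(h1, new) > error_rate(h2, new))

# Invented data: the labels for x = 3 and x = 4 are noise.
training = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1)]
new_data = [(0, 0), (3, 0), (4, 1), (7, 1)]

memorized = dict(training)

def h1(x):
    # Memorizes the training set, noise included; defaults to 0 elsewhere.
    return memorized.get(x, 0)

def h2(x):
    # Simple threshold rule.
    return 1 if x >= 4 else 0

print(overfits(h1, h2, training, new_data))
```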

ID3 Noise Handling Mechanisms – Early Stopping ·Could only allow attributes with info gains exceeding some threshold in order to sift noise. However, empirically tends to disallow relevant attribute tests. ·Use statistical (such as Chi-square) test to decide confidence in whether attribute is irrelevant. Best ID3 results. (Takes amount of data into account which is not done by above) ·Use a separate set of examples (holdout set) to determine when to stop ·When you decide to not split on any more attributes, label node with either most common, or with probability of most common (good for distribution vs function)
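The chi-square test compares the class counts observed in each branch with the counts expected if the attribute were irrelevant. A minimal sketch for two classes, assuming both classes occur in the data and no branch is empty (otherwise the expected counts would be zero); the constant name is hypothetical:

```python
def chi_square_statistic(partitions):
    # partitions: one (positives, negatives) count pair per branch.
    p = sum(pi for pi, _ in partitions)
    n = sum(ni for _, ni in partitions)
    statistic = 0.0
    for pi, ni in partitions:
        size = pi + ni
        # Counts expected if the split were unrelated to the class.
        expected_p = p * size / (p + n)
        expected_n = n * size / (p + n)
        statistic += ((pi - expected_p) ** 2 / expected_p
                      + (ni - expected_n) ** 2 / expected_n)
    return statistic

# Critical value for 1 degree of freedom (a two-way split) at the 5% level.
CRITICAL_5PCT_DF1 = 3.841

for split in ([(5, 5), (5, 5)], [(10, 0), (0, 10)]):
    stat = chi_square_statistic(split)
    print(split, stat, "split" if stat > CRITICAL_5PCT_DF1 else "stop")
```

Because the statistic grows with the number of examples, the same class ratios are more convincing on a large node than on a small one, which a fixed gain threshold cannot capture.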

ID3 Noise Handling Mechanisms – Post Pruning • Consider each node, remove the subtree centered on it and test • Use a separate set of examples to prune • Use statistical tests • Rule Post Pruning • Convert tree to rules (1 rule for each path from a root to a leaf) • Generalize each rule by considering each of its pre-conditions • Sort rules according to estimated accuracy, and consider them in this order
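Rule post-pruning can be sketched as follows. The tree representation (tuples and dicts), the toy tree, and the validation examples are assumptions for illustration, and the greedy "drop a pre-condition if accuracy does not fall" loop is one simple way to generalize a rule, not necessarily the one C4.5 uses.

```python
def tree_to_rules(node, conditions=()):
    # One rule per root-to-leaf path: (pre-conditions, predicted label).
    if not isinstance(node, tuple):
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

def rule_accuracy(conditions, label, examples):
    covered = [y for x, y in examples
               if all(x.get(a) == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(y == label for y in covered) / len(covered)

def generalize(conditions, label, examples):
    # Repeatedly drop any pre-condition whose removal does not lower
    # the rule's accuracy on the separate example set.
    conditions = list(conditions)
    changed = True
    while changed:
        changed = False
        base = rule_accuracy(conditions, label, examples)
        for c in conditions:
            trial = [d for d in conditions if d != c]
            if rule_accuracy(trial, label, examples) >= base:
                conditions, changed = trial, True
                break
    return tuple(conditions)

tree = ("outlook", {
    "sunny": ("windy", {True: "no", False: "yes"}),
    "rainy": "no",
})
validation = [
    ({"outlook": "sunny", "windy": True}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "sunny", "windy": False}, "yes"),
    ({"outlook": "rainy", "windy": False}, "no"),
]
rules = tree_to_rules(tree)
pruned = [(generalize(c, label, validation), label) for c, label in rules]
# Sort by estimated accuracy, best first.
pruned.sort(key=lambda r: rule_accuracy(r[0], r[1], validation), reverse=True)
print(pruned)
```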

ID3 - Missing Attribute Values - Learning ·Throw out data with missing attributes - but missing values are too common, the data could be important, and the tree is then not prepared to generalize with missing attributes ·Set attribute to the most probable attribute value ·Set attribute to the most probable attribute value given the example class - similar performance ·Use a learning scheme (ID3, etc) to fill in the attribute value, where the training set is made up of complete examples and the initial output class is just another attribute. Better, but not always empirically convincing ·Let unknown be just another attribute value - for ID3 this has the anomaly of apparently higher gain due to more attribute values, can fix with gain ratio
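Gain ratio divides the information gain by the entropy of the attribute's own value distribution (the split information), which penalizes attributes with many values. A minimal sketch with invented data, where `values` holds one attribute value per example and `labels` the matching classes:

```python
import math
from collections import Counter

def entropy(items):
    total = len(items)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(items).values())

def gain_ratio(values, labels):
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(values)   # entropy of the attribute values themselves
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
print(gain_ratio(["a", "a", "b", "b"], labels))  # two-valued attribute
print(gain_ratio(["1", "2", "3", "4"], labels))  # one value per example
```

Both attributes have the same raw gain on this data, but the many-valued one is scored lower by gain ratio.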

ID3 - Missing Attribute Values - Execution ·When arriving at an attribute test for which the attribute is missing during execution ·Each branch has a probability of being taken based on what percentage of TS examples went down each branch ·Take all branches, but carry a weight representing the probability. Weights could be further modified (multiplied) by other missing attributes in current test example as they continue down the tree. ·Results in multiple active leaf nodes. Set output as leaf with highest weight, or sum weights for each output class, and output the class with the largest sum
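The weighted traversal can be sketched as below, using the "sum weights for each output class" option. The representation is an assumption: each branch stores the fraction of training examples that went down it, and the tree and fractions are invented.

```python
from collections import defaultdict

def class_weights(node, example, weight=1.0, totals=None):
    # A node is a leaf label or (attribute, {value: (fraction, subtree)}).
    if totals is None:
        totals = defaultdict(float)
    if not isinstance(node, tuple):
        totals[node] += weight
        return totals
    attribute, branches = node
    value = example.get(attribute)
    if value is None:
        # Attribute missing: take all branches, scaling the weight by the
        # fraction of training examples that went down each one.
        for fraction, subtree in branches.values():
            class_weights(subtree, example, weight * fraction, totals)
    else:
        _, subtree = branches[value]
        class_weights(subtree, example, weight, totals)
    return totals

def classify(node, example):
    totals = class_weights(node, example)
    return max(totals, key=totals.get)

tree = ("outlook", {
    "sunny": (0.5, ("windy", {True: (0.4, "no"), False: (0.6, "yes")})),
    "rainy": (0.5, "no"),
})
print(classify(tree, {}))                    # both attributes missing
print(classify(tree, {"outlook": "sunny"}))  # only windy missing
```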

ID3 - Conclusions ·Good Empirical Results ·Comparable application robustness and accuracy with neural networks - faster learning (though NN are better with continuous - both input and output) ·Most used and well known of current systems - used widely to aid in creating rules for expert systems
