Machine Learning CPS4801
Research Day • Keynote Speaker • Tuesday 9:30-11:00 STEM Lecture Hall (2nd floor) • Meet-and-Greet 11:30 STEM 512 • Faculty Presentation • Tuesday 11:00-3:00 STEM • Prof. Liou 2:00 Room 415 • Student Poster • Wednesday 10:00-3:00 • Computer Science 10:00-12:00 STEM Atrium • Schedule: http://orsp.kean.edu/ResearchDays_Schedule.html
Outline • Introduction • Decision tree learning • Clustering • Artificial Neural Networks • Genetic algorithms
Learning from Examples • An agent is learning if it improves its performance on future tasks after making observations about the world. • One class of learning problem: • from a collection of input-output pairs, learn a function that predicts the output for new inputs.
Why learning? • The designer cannot anticipate all possible situations • A robot designed to navigate mazes must learn the layout of each new maze. • The designer cannot anticipate all changes • A program designed to predict tomorrow’s stock market prices must learn to adapt when conditions change. • Programmers sometimes have no idea how to program a solution • recognizing faces
Types of Learning • Supervised learning • the agent observes example input-output pairs and learns a function that maps inputs to outputs • Unsupervised learning • no correct answers are given • clustering: a taxi agent must develop a concept of “good traffic days” and “bad traffic days” • Reinforcement learning • the agent learns from rewards or punishments • taxi agent: the lack of a tip • chess game: two points for a win
Supervised Learning • Learning a function/rule from specific input-output pairs is also called inductive learning. • Given a training set of N example pairs: • (x1, y1), (x2, y2), ..., (xN, yN) • each yj was generated by an unknown target function y = f(x) • Problem: find a hypothesis h such that h ≈ f • h generalizes well if it correctly predicts the value of y for novel examples (the test set).
Supervised Learning • When the output y is one of a finite set of values (sunny, cloudy, rainy), the learning problem is called classification • Boolean or binary classification when there are only two values • When y is a number (tomorrow’s temperature), the problem is called regression.
Inductive learning method • The points are in the (x,y) plane, where y = f(x). • We approximate f with h selected from a hypothesis space H. • Construct/adjust h to agree with f on training set
Inductive learning method • Construct/adjust h to agree with f on training set • E.g., curve fitting:
Inductive learning method • Construct/adjust h to agree with f on training set • (h is consistent if it agrees with f on all examples) • E.g., curve fitting:
Inductive learning method • Construct/adjust h to agree with f on training set • (h is consistent if it agrees with f on all examples) • E.g., curve fitting: • How to choose from among multiple consistent hypotheses?
Inductive learning method • Ockham’s razor: prefer the simplest hypothesis consistent with data (14th-century English philosopher William of Ockham) • There is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.
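A minimal sketch of this tradeoff in Python (the data points and polynomial degrees below are invented purely for illustration, not taken from the slides): both a straight line and a degree-7 polynomial can be made to agree closely with the training set, but Ockham's razor prefers the simpler hypothesis, which tends to predict better on novel inputs.

```python
# Sketch: fitting hypotheses of different complexity to the same training set.
# The data points below are made up purely for illustration.
import numpy as np

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y_train = 0.5 * x_train + np.array([0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1])

# Two hypotheses h drawn from different hypothesis spaces H:
h_simple = np.polyfit(x_train, y_train, deg=1)   # straight line
h_complex = np.polyfit(x_train, y_train, deg=7)  # degree-7 polynomial (fits every point)

# Both agree closely with f on the training set...
print("train error (simple): ", np.mean((np.polyval(h_simple, x_train) - y_train) ** 2))
print("train error (complex):", np.mean((np.polyval(h_complex, x_train) - y_train) ** 2))

# ...but the simpler hypothesis usually generalizes better to novel x values.
x_new = np.array([8.0, 9.0])
print("simple h at new x: ", np.polyval(h_simple, x_new))
print("complex h at new x:", np.polyval(h_complex, x_new))
```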
Cross-Validation • Labeled data (1566 examples) • Split into 10 folds • Train the model on 9 folds (approx. 1409 examples) • Evaluate on the remaining 1 fold (approx. 157 examples) • Lather, rinse, repeat (10 times) • Report the average
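A minimal sketch of this 10-fold procedure written with plain NumPy; the dataset of 1566 examples and the trivial "model" below are placeholders standing in for whatever labeled data and learner are actually used.

```python
# Sketch of 10-fold cross-validation (the model here is a trivial placeholder).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1566, 4))           # 1566 labeled examples, 4 attributes each
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels (made up for illustration)

indices = rng.permutation(len(X))
folds = np.array_split(indices, 10)      # split into 10 folds (~157 examples each)

scores = []
for k in range(10):                      # lather, rinse, repeat (10 times)
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(10) if j != k])

    # "Train": a placeholder model that just memorizes the majority class.
    majority = int(round(y[train_idx].mean()))

    # "Evaluate": accuracy on the held-out fold.
    scores.append(np.mean(y[test_idx] == majority))

print("average accuracy over 10 folds:", np.mean(scores))
```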
Learning decision trees • One of the simplest and yet most successful forms of machine learning. • A decision tree represents a function that takes as input a vector of attribute values and returns a “decision” – a single output. • discrete input, Boolean classification
Learning decision trees Problem: decide whether to wait for a table at a restaurant, based on the following attributes: • Alternate: is there an alternative restaurant nearby? • Bar: is there a comfortable bar area to wait in? • Fri/Sat: is today Friday or Saturday? • Hungry: are we hungry? • Patrons: number of people in the restaurant (None, Some, Full) • Price: price range ($, $$, $$$) • Raining: is it raining outside? • Reservation: have we made a reservation? • Type: kind of restaurant (French, Italian, Thai, Burger) • WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
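As a sketch of how one of these training examples might be represented in code (the particular attribute values below are invented for illustration, not taken from the actual training table):

```python
# One training example for the restaurant problem: attribute values plus the label WillWait.
example = {
    "Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
    "Patrons": "Full", "Price": "$", "Raining": False, "Reservation": False,
    "Type": "Thai", "WaitEstimate": "30-60",
}
will_wait = False   # the "decision" (output) recorded for this situation
```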
Decision trees • One possible representation for hypotheses (this tree uses neither Price nor Type) • the “true” tree for deciding whether to wait:
Expressiveness • Decision trees can express any function of the input attributes. • E.g., for Boolean functions, truth table row → path to leaf: • Goal <==> (Path1 v Path2 v Path3 v ...) • Trivially, there is a consistent decision tree for any training set with one path to leaf for each example. • Prefer to find more compact decision trees
Constructing the Decision Tree • Goal: find the smallest decision tree consistent with the examples • Divide-and-conquer: test the most important attribute first; this divides the problem into smaller subproblems that can be solved recursively • “Most important”: the attribute that best splits the examples • Form a tree with root = best attribute • For each value vi (or range) of the best attribute: • Select those examples with best = vi • Construct subtreei by recursively calling decision tree learning with that subset of examples and all attributes except best • Add a branch to the tree with label = vi and subtree = subtreei • (A code sketch of this procedure follows below.)
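A compact sketch of this divide-and-conquer procedure, assuming each example is an (attribute-dict, label) pair as in the representation sketch above; the choice of "most important" attribute is left as a function argument here, since the following slides define it via information gain.

```python
# Sketch of recursive decision-tree construction (DTL), following the steps above.
from collections import Counter

def majority_label(examples):
    """Most common label among the examples (used when we must guess)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, choose_best_attribute, default=None):
    if not examples:
        return default                                  # no examples left: use parent majority
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()                             # all examples agree: return that label
    if not attributes:
        return majority_label(examples)                 # attributes exhausted: majority vote

    best = choose_best_attribute(examples, attributes)  # "most important" attribute
    tree = {best: {}}
    values = {ex[best] for ex, _ in examples}
    for v in values:                                    # for each value v of the best attribute
        subset = [(ex, label) for ex, label in examples if ex[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = learn_tree(subset, remaining, choose_best_attribute,
                                   default=majority_label(examples))
    return tree
```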
Decision tree learning • Aim: find a small tree consistent with the training examples • Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Choosing an attribute • Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative" • Which is a better choice?
Attribute-based representations • Examples described by attribute values • A training set of 12 examples • E.g., situations where I will/won't wait for a table: • Classification of examples is positive (T) or negative (F)
Choosing the Best Attribute: Binary Classification • Want a formal measure that returns a maximum value when the attribute makes a perfect split and a minimum value when it makes no distinction • Information theory (Shannon and Weaver, 1949) • Entropy: a measure of the uncertainty of a random variable • A coin that always comes up heads --> 0 bits • A flip of a fair coin (heads or tails) --> 1 bit • The roll of a fair four-sided die --> 2 bits • Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute
Formula for Entropy • For a variable with values v_k occurring with probabilities P(v_k): H = -Σ_k P(v_k) log2 P(v_k) • Example: a collection of 10 examples, 5 positive and 5 negative: H(1/2, 1/2) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 bit • Example: a collection of 100 examples, 1 positive and 99 negative: H(1/100, 99/100) = -0.01 log2(0.01) - 0.99 log2(0.99) ≈ 0.08 bits
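The worked examples above (and the die example from the previous slide) can be checked with a few lines of Python; this is just a verification sketch, not part of the original slides.

```python
# Entropy of a distribution: H = -sum_k p_k * log2(p_k)
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([5/10, 5/10]))          # 5 positive, 5 negative  -> 1.0 bit
print(entropy([1/100, 99/100]))       # 1 positive, 99 negative -> about 0.08 bits
print(entropy([1/4, 1/4, 1/4, 1/4]))  # fair four-sided die     -> 2.0 bits
```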
Information gain • Information gain (from an attribute test) = the difference between the original information requirement and the new requirement after the test • Information Gain (IG), or reduction in entropy, from a test on attribute A: Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A), where Remainder(A) = Σ_k (p_k + n_k)/(p + n) · I(p_k/(p_k+n_k), n_k/(p_k+n_k)) • Choose the attribute with the largest IG
Information gain • For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit • Consider the attributes Patrons and Type (and others too): • Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
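A sketch of this comparison in code, using per-value positive/negative counts assumed from the standard AIMA restaurant training set (the example table itself is not reproduced on these slides):

```python
# Information gain = entropy before the attribute test minus expected entropy after it.
import math

def entropy2(p, n):
    """Entropy (in bits) of a set with p positive and n negative examples."""
    total = p + n
    return -sum(q * math.log2(q) for q in (p / total, n / total) if q > 0)

def information_gain(splits, p=6, n=6):
    """splits: list of (p_k, n_k) counts, one pair per attribute value."""
    remainder = sum((pk + nk) / (p + n) * entropy2(pk, nk) for pk, nk in splits)
    return entropy2(p, n) - remainder

# Per-value counts assumed from the AIMA restaurant example:
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
rest_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print("Gain(Patrons):", information_gain(patrons))    # about 0.541 bits
print("Gain(Type):   ", information_gain(rest_type))  # 0.0 bits
```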
Example contd. • Decision tree learned from the 12 examples: • Substantially simpler than the “true” tree