Machine Learning

Machine Learning Part I: Classification and Bayesian Learning Ref: E. Alpaydin, Intro to Machine Learning, MIT 2004

Machine Learning • Machine learning is programming computers to optimize a performance criterion using example data or past experience • Inference from samples • There is a process that explains the data we observe, but we don’t know the details of how the data are generated • Examples: Internet requests, failure events, etc. • It is hard to identify (model) the process completely, but we can construct a good and useful approximation that detects certain patterns. Such patterns help us understand the process and make predictions about the future.

Types of Machine Learning • Supervised learning: create a function from training data. The training data consist of pairs of input objects (typically vectors) and desired outputs. • Classification: the output is a class label for the input object; in the two-class case it is Boolean (yes/no) • Regression: the label is a numerical value; learn the function f(x) that best explains the input instances • Unsupervised learning: manual labels of inputs are not used • Clustering: partition a data set into subsets (clusters) so that the data in each subset share some common trait • Semi-supervised learning: make use of both labeled and unlabeled data for training • Reinforcement learning • Learning a policy: a sequence of outputs; no supervised output, but a delayed reward • Examples: game playing, robot navigation

Supervised Learning • Use of Supervised Learning • Classification • Regression • Evaluation Methodology • Bayesian Learning for Classification

Why Supervised Learning? • Prediction of future cases: use the rule to predict the output for future inputs • Knowledge extraction: the rule is easy to understand • Compression: the rule is simpler than the data it explains • Outlier detection: exceptions that are not covered by the rule, e.g., fraud

E.g.: Credit Scoring • Differentiating between low-risk and high-risk customers from their income and savings • Rule-based prediction • Classification discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
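
The discriminant above can be written directly as a small function. This is a minimal sketch: the function name and the threshold values are hypothetical and chosen only for illustration. In practice θ1 and θ2 would be learned from labeled customer data.

```python
def credit_risk(income, savings, theta1, theta2):
    """Rule from the slide: low-risk only if BOTH thresholds are exceeded."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"


# Hypothetical thresholds, not taken from the source.
THETA1 = 30_000  # income threshold
THETA2 = 5_000   # savings threshold

print(credit_risk(45_000, 8_000, THETA1, THETA2))  # both conditions hold
print(credit_risk(45_000, 2_000, THETA1, THETA2))  # savings too low
```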

Learning a Class from Examples • Given a set of examples of cars, with a label of “family car” or not according to a survey, class learning is to find a description that is shared by all positive examples. • Use of the class info • Prediction: Is car x a family car? • Knowledge extraction: What do people expect from a family car?

Training Set X • Input representation: two attributes, price and engine power • Label of each instance: family car (positive) or not (negative)

Hypothesis Class • Class: C • Most specific hypothesis: S • Most general hypothesis: G • Learning is to find a particular hypothesis h that approximates C
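
As an illustration of the most specific hypothesis S, the sketch below assumes the hypothesis class is the set of axis-aligned rectangles in the (price, engine power) plane, so S is the tightest rectangle that encloses all positive examples. The function names and the data values are hypothetical.

```python
def most_specific_rectangle(examples):
    """Tightest axis-aligned rectangle around the positive examples.

    `examples` is a list of ((price, power), label) pairs with label 1 or 0.
    Returns (price_min, price_max, power_min, power_max).
    """
    positives = [x for x, label in examples if label == 1]
    if not positives:
        raise ValueError("need at least one positive example")
    prices = [p for p, _ in positives]
    powers = [e for _, e in positives]
    return (min(prices), max(prices), min(powers), max(powers))


def predict(rect, x):
    """Return 1 if x falls inside the rectangle (boundaries included), else 0."""
    p_min, p_max, e_min, e_max = rect
    price, power = x
    return 1 if (p_min <= price <= p_max and e_min <= power <= e_max) else 0


# Hypothetical training data: ((price, engine power), is_family_car)
data = [((15, 100), 1), ((20, 120), 1), ((18, 110), 1),
        ((40, 300), 0), ((5, 50), 0)]

S = most_specific_rectangle(data)
print(S)
print(predict(S, (17, 105)), predict(S, (40, 300)))
```

Any rectangle between S and the most general hypothesis G is consistent with the training set. Choosing among them is part of the inductive bias.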

Hypothesis h and Empirical Error Error of h:
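
The formula for the error of h is not reproduced in the text above. A common definition, assumed here, counts the training instances on which the prediction h(x) differs from the label r. The sketch below uses a hypothetical rectangle hypothesis and made-up data to show the computation.

```python
def empirical_error(h, training_set):
    """Number of training instances (x, r) that h misclassifies."""
    return sum(1 for x, r in training_set if h(x) != r)


def h(x):
    """Hypothetical hypothesis: family car iff price and power fall in fixed ranges."""
    price, power = x
    return 1 if (10 <= price <= 25 and 90 <= power <= 150) else 0


# Hypothetical labeled instances: ((price, engine power), label)
X = [((15, 100), 1), ((20, 120), 1), ((30, 140), 1),  # third one lies outside h
     ((40, 300), 0), ((12, 95), 0)]                   # last one lies inside h

errors = empirical_error(h, X)
print(errors, errors / len(X))  # error count and error rate
```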

Model Selection & Generalization • Learning is an ill-posed problem: the data are not sufficient to find a unique solution • Limited number of sample data • Some data might be noisy due to imprecision in recording, labeling, or hidden (latent, unobservable) attributes that affect the label of instances • The need for inductive bias: assumptions about the class structure H • Why a rectangle, not a circle or an irregular shape? • What degree of tightness of fit? • Generalization: how well a model performs on new data

Noise and Model Complexity • A simple model is preferred • Easy to use (check) (lower time complexity) • Easy to train (lower space complexity) • Easy to explain (more interpretable) • Easy to generalize (lower variance) • Noise: any anomaly in the data that makes it infeasible to reach zero-error classification with a simple hypothesis class

Probably Approximately Correct (PAC) Learning • How many training examples N should we have, such that with probability at least 1 - δ, h has error at most ε? • Each strip has probability at most ε/4 • Pr that a random instance misses a strip: 1 - ε/4 • Pr that N instances all miss a strip: (1 - ε/4)^N • Pr that N instances miss any of the 4 strips: at most 4(1 - ε/4)^N • Require 4(1 - ε/4)^N ≤ δ and use (1 - x) ≤ exp(-x) • It suffices that 4 exp(-εN/4) ≤ δ, i.e. N ≥ (4/ε) log(4/δ)
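
The final bound can be evaluated numerically. The sketch below assumes that log is the natural logarithm (consistent with the exp step in the derivation) and rounds N up to an integer. The function name and the example values of ε and δ are hypothetical.

```python
import math


def pac_sample_size(epsilon, delta):
    """Smallest integer N with N >= (4/epsilon) * ln(4/delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))


# Example: error at most 0.1 with probability at least 0.95.
eps, delta = 0.1, 0.05
n = pac_sample_size(eps, delta)
print(n)

# Sanity check of the inequality the bound was derived from.
print(4 * (1 - eps / 4) ** n <= delta)
```

The required N grows linearly in 1/ε but only logarithmically in 1/δ, so demanding higher confidence is much cheaper than demanding lower error.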

2-Class vs K-Class • A K-class problem can be viewed as K 2-class problems • Train hypotheses hi(x), i = 1,...,K:
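
One way to set up the K two-class problems is to relabel the training set once per class: instances of class i become positive for hypothesis hi and all other instances become negative. The sketch below shows only this relabeling step. The function name and the class names are hypothetical.

```python
def one_vs_rest_labels(labels, classes):
    """For each class c, build the 0/1 label vector used to train h_c."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical 3-class labels for five training instances.
labels = ["family", "sports", "luxury", "family", "sports"]
classes = ["family", "sports", "luxury"]

binary_problems = one_vs_rest_labels(labels, classes)
for c in classes:
    print(c, binary_problems[c])
```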

Regression • Examples • Price of a used car • Speed of Top500 • x: car attributes; y: price • y = g(x | θ), where g() is the model and θ its parameters • Linear regression: y = wx + w0

Basic Concepts • Interpolation • Find a function that best fits a training set in the absence of noise • r = f(x) • Extrapolation • Predict the output for any x that is NOT in the training set • Regression • The noise factor must be considered • r = f(x) + ε, or there are hidden variables we cannot observe: r = f(x, z)

Regression • For a given training set, find g() that minimizes the empirical error
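
For the linear model y = wx + w0, the minimizer has a closed form if the empirical error is taken to be the mean squared difference between r and g(x). That choice of error is an assumption here, since the formula is not reproduced in the text. A minimal sketch with hypothetical data and function names:

```python
def fit_line(xs, rs):
    """Least-squares estimates (w, w0) for the model g(x) = w*x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    w = sxr / sxx
    w0 = mean_r - w * mean_x
    return w, w0


def mean_squared_error(w, w0, xs, rs):
    """Average squared difference between r and g(x) = w*x + w0."""
    return sum((r - (w * x + w0)) ** 2 for x, r in zip(xs, rs)) / len(xs)


# Hypothetical noisy samples scattered around r = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
rs = [1.1, 2.9, 5.2, 6.8, 9.0]

w, w0 = fit_line(xs, rs)
print(w, w0, mean_squared_error(w, w0, xs, rs))
```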

Underfitting vs Overfitting • Underfitting: the hypothesis class (H) is less complex than the actual model (C) • Example: using a line to fit data sampled from a 3rd-order polynomial • Accuracy increases with more sample data, but that may not be enough if the hypothesis is too complex • Overfitting: H is more complex than C • Having more training data helps, but only up to a certain point

Triple Trade-Off • Trade-off between three factors: • Complexity of the hypothesis class H, c(H): capacity of the hypothesis class • Training set size, N • Generalization error, E, on new examples • As N↑, E↓ • As c(H)↑, first E↓ and then E↑ (the error of an over-complex hypothesis can be kept in check by increasing the amount of training data, but only up to a point)

Cross-Validation • To estimate generalization error, we need data unseen during training • Three types of data in cross-validation: • Training set (50%) • Validation set (25%) • Test (publication) set (25%) • Resampling when there are few data
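
The 50/25/25 split can be sketched as a shuffle followed by slicing. The function name and the fixed seed are assumptions made for reproducibility of the example. Any remainder from integer division goes to the test set.

```python
import random


def split_dataset(data, seed=0):
    """Shuffle and split into training (50%), validation (25%), test (rest)."""
    items = list(data)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    n = len(items)
    n_train = n // 2
    n_val = n // 4
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The validation set is used to choose among models, and the test set is touched only once, to report the final error.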

Dimensions of a Supervised Learner: Summary • Model g() and parameters θ • Loss function L(): difference between the desired output and the approximation • Optimization procedure: return the argument θ* that minimizes the total loss
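
The three dimensions can be made concrete with a toy example. The sketch below assumes a linear model, a squared loss, and, as the simplest possible optimization procedure, an exhaustive search over a small grid of candidate parameters. All three choices, along with the names and data, are illustrative assumptions. Real learners typically use analytical solutions or gradient-based methods instead of a grid.

```python
def g(x, theta):
    """Model: a line with parameters theta = (w, w0)."""
    w, w0 = theta
    return w * x + w0


def loss(r, y):
    """Loss: squared difference between desired output r and approximation y."""
    return (r - y) ** 2


def total_loss(theta, data):
    """Sum of the per-instance losses over the data set."""
    return sum(loss(r, g(x, theta)) for x, r in data)


def argmin_over_grid(data, candidates):
    """Optimization: return the candidate theta with the smallest total loss."""
    return min(candidates, key=lambda theta: total_loss(theta, data))


# Hypothetical data generated from r = 2x + 1, and a coarse parameter grid.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
grid = [(i * 0.5, j * 0.5) for i in range(9) for j in range(9)]

theta_star = argmin_over_grid(data, grid)
print(theta_star, total_loss(theta_star, data))
```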
