Chapter 1: Introduction / Chapter 2: Overview of Supervised Learning • 2006.01.20
Supervised learning • Training data set: several features and an outcome for each observation • Build a learner based on the training data set • Predict the unseen outcome of future data from its observed features
An example of supervised learning: email spam • [Figure: a learner is trained on known normal and spam emails, then labels new, unknown emails as spam or normal]
Input & Output • Input = predictor = independent variable • Output = response = dependent variable
Output Types • Quantitative >> regression • Ex) stock price, temperature, age • Qualitative >> classification • Ex) Yes/No, spam/normal email
Input Types • Quantitative • Qualitative • Ordered categorical • Ex) small, medium, big
Terminology • X : input variable • Xj : the j-th component of X • X (bold) : matrix of inputs • xj : the j-th observed value • Y : quantitative output • Ŷ : prediction of Y • G : qualitative output (Ĝ : its prediction)
General model • Given input X and output Y, related by an unknown function f • Want to estimate f based on a known data set (the training data)
Two simple methods • Linear model, linear regression • Nearest neighbor method
Linear model • Given a vector of input features X = (X1, …, Xp) • Assume the linear relationship $\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$ • Least squares criterion: choose $\beta$ to minimize $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$ (a fitting sketch follows below)
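To make the least squares fit concrete, here is a minimal sketch in NumPy; the data, true coefficients, and noise level are invented for illustration.

```python
# A minimal sketch of fitting the linear model by least squares with NumPy.
# The design matrix, coefficients, and noise below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                      # N observations, p input features
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.3, size=N)

# Prepend a column of ones so beta_hat[0] plays the role of the intercept.
X1 = np.column_stack([np.ones(N), X])

# Least squares: minimize RSS(beta) = ||y - X beta||^2.
# The closed form is beta_hat = (X^T X)^{-1} X^T y; lstsq solves it stably.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat              # fitted values
rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
print(beta_hat, rss)
```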
Nearest neighbor method • Classify a new point by majority vote within its k nearest neighbors • [Figure: the new point's predicted class changes with k, e.g. k = 1: brown, k = 3: green] (see the sketch below)
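A minimal sketch of the majority-vote rule, with made-up training points chosen so that, as in the slide's figure, k = 1 and k = 3 give different answers.

```python
# A minimal sketch of k-nearest-neighbor classification by majority vote.
# Training points, labels, and the query point are invented for illustration.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.2, 0.2], [-1.0, -1.0],
                    [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])
y_train = np.array(["brown", "brown", "green", "green", "green"])
x_new = np.array([0.3, 0.3])

print(knn_classify(X_train, y_train, x_new, k=1))  # brown: the single nearest point decides
print(knn_classify(X_train, y_train, x_new, k=3))  # green: two of the three neighbors are green
```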
Linear model vs. k-nearest neighbor • Linear model: p parameters; stable, smooth fit; low variance, high bias • k-nearest neighbor: effectively N/k parameters; unstable, wiggly fit; high variance, low bias • Each method has its own situations for which it works best.
Enhanced Methods • Kernel methods using weights (a sketch follows below) • Modifying the distance metric with kernels • Locally weighted least squares • Expansion of inputs for arbitrarily complex models • Projection pursuit & neural networks
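As one concrete instance of a kernel method using weights, here is a sketch of a Nadaraya-Watson style weighted average; the Gaussian kernel, bandwidth, and toy data are illustrative choices, not prescribed by the slides.

```python
# A minimal sketch of a kernel-weighted (Nadaraya-Watson) fit: the prediction
# at x0 is a weighted average of training responses, with weights that decay
# smoothly with distance from x0. Kernel, bandwidth, and data are made up.
import numpy as np

def nadaraya_watson(x_train, y_train, x0, bandwidth):
    weights = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 80))
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=80)

x0 = np.pi / 2
print(nadaraya_watson(x_train, y_train, x0, bandwidth=0.3))  # roughly sin(pi/2) = 1
```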
Statistical decision theory (1) • Given input X in R^p, output Y in R • Joint distribution: Pr(X, Y) • Looking for a predicting function f(X) • Squared error loss: $\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2$, minimized by the conditional mean $f(x) = \mathrm{E}(Y \mid X = x)$ • Nearest-neighbor methods approximate this by $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$ (a small sketch follows below)
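A small sketch of the nearest-neighbor approximation to the conditional mean; the toy regression data and the choice k = 15 are illustrative assumptions.

```python
# A minimal sketch of the nearest-neighbor estimate of the regression function
# f(x) = E(Y | X = x): average the responses of the k closest training points.
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()      # Ave(y_i | x_i in N_k(x0))

rng = np.random.default_rng(2)
x_train = rng.uniform(-2, 2, 200)
y_train = x_train ** 2 + rng.normal(scale=0.1, size=200)

print(knn_regress(x_train, y_train, x0=1.0, k=15))  # approximately E(Y | X = 1) = 1
```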
Statistical decision theory (2) • k-nearest neighbor: if $N, k \to \infty$ with $k/N \to 0$, then $\hat f(x) \to \mathrm{E}(Y \mid X = x)$ • But samples are often insufficient: the curse of dimensionality! • Linear model: assume $f(x) \approx x^T \beta$, giving $\beta = [\mathrm{E}(X X^T)]^{-1}\, \mathrm{E}(X Y)$ • But the true function might not be linear!
Statistical decision theory (3) • If we use the $L_1$ loss $\mathrm{E}\lvert Y - f(X)\rvert$, the solution is the conditional median $\hat f(x) = \mathrm{median}(Y \mid X = x)$ • More robust than the conditional mean • But the $L_1$ criterion is discontinuous in its derivatives (see the illustration below)
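A tiny numeric illustration, with made-up numbers, of why the conditional median behind the L1 loss is more robust than the mean behind squared error.

```python
# One outlier shifts the mean a lot but barely moves the median.
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.2, 0.8])
y_outlier = np.append(y, 10.0)

print(y.mean(), np.median(y))                  # both near 1
print(y_outlier.mean(), np.median(y_outlier))  # mean jumps to 2.5, median stays at 1.05
```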
Statistical decision theory (4) • G : categorical output variable • L : loss function (a K x K matrix for K classes) • $\mathrm{EPE} = \mathrm{E}[L(G, \hat G(X))]$ • With 0-1 loss this is minimized by the Bayes classifier: $\hat G(x) = \arg\max_g \Pr(g \mid X = x)$ (a toy sketch follows below)
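A toy sketch of the Bayes classifier for a case where the class-conditional densities are assumed known (two 1-D Gaussians with equal priors); in practice Pr(G | X) is unknown and must be estimated.

```python
# A minimal sketch of the Bayes classifier with known class densities and priors.
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

classes = ["A", "B"]
priors = {"A": 0.5, "B": 0.5}
params = {"A": (-1.0, 1.0), "B": (1.0, 1.0)}   # (mean, std) of each class density

def bayes_classify(x):
    """Pick the class g maximizing Pr(g | X = x), proportional to prior * density."""
    posteriors = {g: priors[g] * gauss_pdf(x, *params[g]) for g in classes}
    return max(posteriors, key=posteriors.get)

print(bayes_classify(-0.5))  # "A": the point sits closer to class A's mean
print(bayes_classify(2.0))   # "B"
```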
References • Reading group on "Elements of Statistical Learning" – overview.ppt • http://sifaka.cs.uiuc.edu/taotao/stat.html • Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf • http://www.stat.ohio-state.edu/~goel/STATLEARN/ • The Matrix Cookbook • http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf • A First Course in Probability
2.5 Local Methods in High Dimensions • With a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging. • The curse of dimensionality • To capture 1% of the data to form a local average with p = 10 inputs, we must cover 63% of the range of each input variable. • The expected edge length: $e_p(r) = r^{1/p}$ • All sample points are close to an edge of the sample. • Median distance from the origin to the closest of N data points uniform in the unit ball: $d(p, N) = \bigl(1 - (1/2)^{1/N}\bigr)^{1/p}$ (a numeric check follows below)
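A quick numeric check of the two formulas above; the values p = 10 and N = 500 follow the example in the ESL text and are assumptions here, since the slide only quotes the 1%/63% figure.

```python
# Numeric check of the curse-of-dimensionality formulas (assumed p = 10, N = 500).
p, N = 10, 500

# Edge length of a hypercube neighborhood capturing a fraction r of uniform data:
# e_p(r) = r^(1/p).
r = 0.01
print(r ** (1 / p))                       # ~0.63: 1% of the data needs 63% of each axis

# Median distance from the origin to the closest of N points uniform in the
# unit ball: d(p, N) = (1 - (1/2)^(1/N))^(1/p).
print((1 - 0.5 ** (1 / N)) ** (1 / p))    # ~0.52: closer to the boundary than to the origin
```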
2.5 Local Methods in High Dimensions • Example: 1-NN vs. linear model • 1-NN: as p increases, the MSE and the squared bias tend to 1.0, while the variance stays small • Linear model: averaging over x0, the expected EPE grows only linearly as a function of p • By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error of 1-nearest neighbor is larger. (A simulation sketch of the 1-NN case follows below.)
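A simulation sketch of the 1-NN part of this example, assuming the setup used in the ESL example (not stated on the slide): X uniform on [-1, 1]^p, deterministic target f(X) = exp(-8 ||X||^2), and prediction at x0 = 0.

```python
# As p grows, the nearest neighbor drifts away from the origin, so the 1-NN
# prediction of f(0) = 1 collapses toward 0 and the squared bias (hence MSE)
# approaches 1. Sample sizes and repetitions are illustrative choices.
import numpy as np

def one_nn_mse_at_origin(p, n_train=1000, n_reps=200, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.uniform(-1, 1, size=(n_train, p))
        nearest = np.argmin(np.linalg.norm(X, axis=1))     # neighbor closest to x0 = 0
        preds[r] = np.exp(-8 * np.sum(X[nearest] ** 2))    # 1-NN prediction of f(0) = 1
    bias2 = (preds.mean() - 1.0) ** 2
    var = preds.var()
    return bias2 + var, bias2, var                          # MSE, squared bias, variance

for p in (1, 2, 5, 10):
    print(p, one_nn_mse_at_origin(p))
```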
2.6 Statistical Models, Supervised Learning and Function Approximation • Finding a useful approximation to the function that underlies the predictive relationship between the inputs and outputs. • Supervised learning: the machine learning point of view • Function approximation: the mathematics and statistics point of view
2.7 Structured Regression Models • Nearest-neighbor and other local methods face problems in high dimensions. • They may be inappropriate even in low dimensions. • Need for structured approaches. • Difficulty of the problem • Infinitely many solutions to minimizing RSS. • Unique solution comes from restrictions on f.
2.8 Classes of Restricted Estimators • Methods categorized by the nature of the restrictions. • Roughness penalty and Bayesian methods • Penalizing functions that vary too rapidly over small regions of input space. • Kernel methods and local regression • Explicitly specifying the nature of the local neighborhood (kernel function). • Need adaptation in high dimensions. • Basis functions and dictionary methods • Linear expansion of basis functions (a sketch follows below).
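As a concrete instance of the basis-function idea, here is a sketch that fits a small polynomial dictionary by least squares; the basis, toy data, and target function are illustrative choices.

```python
# A minimal sketch of a linear expansion of basis functions: choose a small
# dictionary h_m(x) and fit the coefficients by least squares.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = np.cos(2 * x) + rng.normal(scale=0.1, size=100)

# Dictionary of basis functions: h_1(x)=1, h_2(x)=x, h_3(x)=x^2, h_4(x)=x^3.
H = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
theta, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit f(x) = sum_m theta_m h_m(x)

x0 = 0.5
f0 = theta @ np.array([1.0, x0, x0 ** 2, x0 ** 3])
print(f0, np.cos(2 * x0))   # fitted value at x0 vs. the true cos(2 * 0.5)
```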
2.9 Model Selection and the Bias-Variance Tradeoff • All models have a smoothing or complexity parameter to be determined • Multiplier of the penalty term • Width of the kernel • Number of basis functions
Bias-Variance tradeoff • $\mathrm{EPE}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))$ • The first term comes with ε and is essential: there is no way to reduce it • Reducing one of the bias and variance terms tends to increase the other. Tradeoff! (A simulation sketch follows below.)
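A simulation sketch of the decomposition above, using an invented setup (sine target, Gaussian noise, k-NN fit at a fixed point) to show how the neighborhood size k trades bias against variance while the noise term stays irreducible.

```python
# Estimate EPE(x0) = sigma^2 + Bias^2 + Variance by simulation for a k-NN fit.
# Target function, noise level, and sample sizes are illustrative assumptions.
import numpy as np

def simulate(k, x0=1.0, sigma=0.3, n_train=50, n_reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, np.pi, n_train)
        y = np.sin(x) + rng.normal(scale=sigma, size=n_train)
        nearest = np.argsort(np.abs(x - x0))[:k]
        preds[r] = y[nearest].mean()                 # k-NN estimate of f(x0) = sin(x0)
    bias2 = (preds.mean() - np.sin(x0)) ** 2
    var = preds.var()
    return sigma ** 2 + bias2 + var, bias2, var      # EPE(x0), squared bias, variance

for k in (1, 5, 25):
    print(k, simulate(k))   # larger k: variance falls, bias grows
```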
[Figure: prediction error versus model complexity, showing training error and test error curves. Low complexity: high bias, low variance; high complexity: low bias, high variance.]