Chapter 1: Introduction / Chapter 2: Overview of Supervised Learning • 2006.01.20
Supervised learning • Training data set: several features and an outcome for each observation • Build a learner based on the training data set • Predict the unseen outcome of future data from its observed features
An example of supervised learning: email spam • [Figure: a learner is trained on known normal and spam emails, then labels new, unknown emails as spam or normal]
Input & Output • Input = predictor = independent variable • Output = response = dependent variable
Output Types • Quantitative >> regression • Ex) stock price, temperature, age • Qualitative >> classification • Ex) Yes/No, spam/normal email
Input Types • Quantitative • Qualitative • Ordered categorical • Ex) small, medium, big
Terminology • X : input variable • Xj : the j-th component of X • X (bold) : matrix of inputs • xj : the j-th observed value • Y : quantitative output • Ŷ : prediction of Y • G : qualitative output (Ĝ : its prediction)
General model • Given input X and output Y, related by an unknown function f • Want to estimate f based on a known data set (the training data)
Two simple methods • Linear model, linear regression • Nearest neighbor method
Linear model • Given a vector of input features X = (X1, …, Xp) • Assume the linear relationship $\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$ • Least squares criterion: choose $\beta$ to minimize $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2$ (a fitting sketch follows below)
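To make the least squares fit concrete, here is a minimal sketch in NumPy; the data, true coefficients, and noise level are invented for illustration.

```python
# A minimal sketch of fitting the linear model by least squares with NumPy.
# The design matrix, coefficients, and noise below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3                      # N observations, p input features
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.3, size=N)

# Prepend a column of ones so beta_hat[0] plays the role of the intercept.
X1 = np.column_stack([np.ones(N), X])

# Least squares: minimize RSS(beta) = ||y - X beta||^2.
# The closed form is beta_hat = (X^T X)^{-1} X^T y; lstsq solves it stably.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat              # fitted values
rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
print(beta_hat, rss)
```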
Nearest neighbor method • Classify a new point by majority vote within its k nearest neighbors • [Figure: the new point's predicted class changes with k, e.g. k = 1: brown, k = 3: green] (see the sketch below)
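A minimal sketch of the majority-vote rule, with made-up training points chosen so that, as in the slide's figure, k = 1 and k = 3 give different answers.

```python
# A minimal sketch of k-nearest-neighbor classification by majority vote.
# Training points, labels, and the query point are invented for illustration.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.2, 0.2], [-1.0, -1.0],
                    [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])
y_train = np.array(["brown", "brown", "green", "green", "green"])
x_new = np.array([0.3, 0.3])

print(knn_classify(X_train, y_train, x_new, k=1))  # brown: the single nearest point decides
print(knn_classify(X_train, y_train, x_new, k=3))  # green: two of the three neighbors are green
```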
Linear model vs. k-nearest neighbor • Linear model: p parameters; stable, smooth fit; low variance, high bias • k-nearest neighbor: effectively N/k parameters; unstable, wiggly fit; high variance, low bias • Each method has its own situations for which it works best.
Enhanced Methods • Kernel methods using weights (a sketch follows below) • Modifying the distance metric with kernels • Locally weighted least squares • Expansion of inputs for arbitrarily complex models • Projection pursuit & neural networks
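As one concrete instance of a kernel method using weights, here is a sketch of a Nadaraya-Watson style weighted average; the Gaussian kernel, bandwidth, and toy data are illustrative choices, not prescribed by the slides.

```python
# A minimal sketch of a kernel-weighted (Nadaraya-Watson) fit: the prediction
# at x0 is a weighted average of training responses, with weights that decay
# smoothly with distance from x0. Kernel, bandwidth, and data are made up.
import numpy as np

def nadaraya_watson(x_train, y_train, x0, bandwidth):
    weights = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 80))
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=80)

x0 = np.pi / 2
print(nadaraya_watson(x_train, y_train, x0, bandwidth=0.3))  # roughly sin(pi/2) = 1
```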
Statistical decision theory (1) • Given input X in R^p, output Y in R • Joint distribution: Pr(X, Y) • Looking for a predicting function f(X) • Squared error loss: $\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2$, minimized by the conditional mean $f(x) = \mathrm{E}(Y \mid X = x)$ • Nearest-neighbor methods approximate this by $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$ (a small sketch follows below)
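A small sketch of the nearest-neighbor approximation to the conditional mean; the toy regression data and the choice k = 15 are illustrative assumptions.

```python
# A minimal sketch of the nearest-neighbor estimate of the regression function
# f(x) = E(Y | X = x): average the responses of the k closest training points.
import numpy as np

def knn_regress(x_train, y_train, x0, k):
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()      # Ave(y_i | x_i in N_k(x0))

rng = np.random.default_rng(2)
x_train = rng.uniform(-2, 2, 200)
y_train = x_train ** 2 + rng.normal(scale=0.1, size=200)

print(knn_regress(x_train, y_train, x0=1.0, k=15))  # approximately E(Y | X = 1) = 1
```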
Statistical decision theory (2) • k-nearest neighbor: if $N, k \to \infty$ with $k/N \to 0$, then $\hat f(x) \to \mathrm{E}(Y \mid X = x)$ • But samples are often insufficient: the curse of dimensionality! • Linear model: assume $f(x) \approx x^T \beta$, giving $\beta = [\mathrm{E}(X X^T)]^{-1}\, \mathrm{E}(X Y)$ • But the true function might not be linear!
Statistical decision theory (3) • If we use the $L_1$ loss $\mathrm{E}\lvert Y - f(X)\rvert$, the solution is the conditional median $\hat f(x) = \mathrm{median}(Y \mid X = x)$ • More robust than the conditional mean • But the $L_1$ criterion is discontinuous in its derivatives (see the illustration below)
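A tiny numeric illustration, with made-up numbers, of why the conditional median behind the L1 loss is more robust than the mean behind squared error.

```python
# One outlier shifts the mean a lot but barely moves the median.
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.2, 0.8])
y_outlier = np.append(y, 10.0)

print(y.mean(), np.median(y))                  # both near 1
print(y_outlier.mean(), np.median(y_outlier))  # mean jumps to 2.5, median stays at 1.05
```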
Statistical decision theory (4) • G : categorical output variable • L : loss function (a K x K matrix for K classes) • $\mathrm{EPE} = \mathrm{E}[L(G, \hat G(X))]$ • With 0-1 loss this is minimized by the Bayes classifier: $\hat G(x) = \arg\max_g \Pr(g \mid X = x)$ (a toy sketch follows below)
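A toy sketch of the Bayes classifier for a case where the class-conditional densities are assumed known (two 1-D Gaussians with equal priors); in practice Pr(G | X) is unknown and must be estimated.

```python
# A minimal sketch of the Bayes classifier with known class densities and priors.
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

classes = ["A", "B"]
priors = {"A": 0.5, "B": 0.5}
params = {"A": (-1.0, 1.0), "B": (1.0, 1.0)}   # (mean, std) of each class density

def bayes_classify(x):
    """Pick the class g maximizing Pr(g | X = x), proportional to prior * density."""
    posteriors = {g: priors[g] * gauss_pdf(x, *params[g]) for g in classes}
    return max(posteriors, key=posteriors.get)

print(bayes_classify(-0.5))  # "A": the point sits closer to class A's mean
print(bayes_classify(2.0))   # "B"
```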
References • Reading group on "Elements of Statistical Learning" – overview.ppt • http://sifaka.cs.uiuc.edu/taotao/stat.html • Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf • http://www.stat.ohio-state.edu/~goel/STATLEARN/ • The Matrix Cookbook • http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf • A First Course in Probability
2.5 Local Methods in High Dimensions • With a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging. • The curse of dimensionality • To capture 1% of the data to form a local average with p = 10 inputs, we must cover 63% of the range of each input variable. • The expected edge length: $e_p(r) = r^{1/p}$ • All sample points are close to an edge of the sample. • Median distance from the origin to the closest of N data points uniform in the unit ball: $d(p, N) = \bigl(1 - (1/2)^{1/N}\bigr)^{1/p}$ (a numeric check follows below)
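A quick numeric check of the two formulas above; the values p = 10 and N = 500 follow the example in the ESL text and are assumptions here, since the slide only quotes the 1%/63% figure.

```python
# Numeric check of the curse-of-dimensionality formulas (assumed p = 10, N = 500).
p, N = 10, 500

# Edge length of a hypercube neighborhood capturing a fraction r of uniform data:
# e_p(r) = r^(1/p).
r = 0.01
print(r ** (1 / p))                       # ~0.63: 1% of the data needs 63% of each axis

# Median distance from the origin to the closest of N points uniform in the
# unit ball: d(p, N) = (1 - (1/2)^(1/N))^(1/p).
print((1 - 0.5 ** (1 / N)) ** (1 / p))    # ~0.52: closer to the boundary than to the origin
```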
2.5 Local Methods in High Dimensions • Example: 1-NN vs. linear model • 1-NN: as p increases, the MSE and the squared bias tend to 1.0, while the variance stays small • Linear model: averaging over x0, the expected EPE grows only linearly as a function of p • By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error of 1-nearest neighbor is larger. (A simulation sketch of the 1-NN case follows below.)
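A simulation sketch of the 1-NN part of this example, assuming the setup used in the ESL example (not stated on the slide): X uniform on [-1, 1]^p, deterministic target f(X) = exp(-8 ||X||^2), and prediction at x0 = 0.

```python
# As p grows, the nearest neighbor drifts away from the origin, so the 1-NN
# prediction of f(0) = 1 collapses toward 0 and the squared bias (hence MSE)
# approaches 1. Sample sizes and repetitions are illustrative choices.
import numpy as np

def one_nn_mse_at_origin(p, n_train=1000, n_reps=200, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.uniform(-1, 1, size=(n_train, p))
        nearest = np.argmin(np.linalg.norm(X, axis=1))     # neighbor closest to x0 = 0
        preds[r] = np.exp(-8 * np.sum(X[nearest] ** 2))    # 1-NN prediction of f(0) = 1
    bias2 = (preds.mean() - 1.0) ** 2
    var = preds.var()
    return bias2 + var, bias2, var                          # MSE, squared bias, variance

for p in (1, 2, 5, 10):
    print(p, one_nn_mse_at_origin(p))
```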
2.6 Statistical Models, Supervised Learning and Function Approximation • Finding a useful approximation to the function that underlies the predictive relationship between the inputs and outputs. • Supervised learning: the machine learning point of view • Function approximation: the mathematics and statistics point of view
2.7 Structured Regression Models • Nearest-neighbor and other local methods face problems in high dimensions. • They may be inappropriate even in low dimensions. • Need for structured approaches. • Difficulty of the problem • Infinitely many solutions to minimizing RSS. • Unique solution comes from restrictions on f.
2.8 Classes of Restricted Estimators • Methods categorized by the nature of the restrictions. • Roughness penalty and Bayesian methods • Penalizing functions that vary too rapidly over small regions of input space. • Kernel methods and local regression • Explicitly specifying the nature of the local neighborhood (kernel function). • Need adaptation in high dimensions. • Basis functions and dictionary methods • Linear expansion of basis functions (a sketch follows below).
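As a concrete instance of the basis-function idea, here is a sketch that fits a small polynomial dictionary by least squares; the basis, toy data, and target function are illustrative choices.

```python
# A minimal sketch of a linear expansion of basis functions: choose a small
# dictionary h_m(x) and fit the coefficients by least squares.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = np.cos(2 * x) + rng.normal(scale=0.1, size=100)

# Dictionary of basis functions: h_1(x)=1, h_2(x)=x, h_3(x)=x^2, h_4(x)=x^3.
H = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
theta, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit f(x) = sum_m theta_m h_m(x)

x0 = 0.5
f0 = theta @ np.array([1.0, x0, x0 ** 2, x0 ** 3])
print(f0, np.cos(2 * x0))   # fitted value at x0 vs. the true cos(2 * 0.5)
```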
2.9 Model Selection and the Bias-Variance Tradeoff • All models have a smoothing or complexity parameter to be determined • Multiplier of the penalty term • Width of the kernel • Number of basis functions
Bias-Variance tradeoff • $\mathrm{EPE}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0))$ • The first term comes with ε and is essential: there is no way to reduce it • Reducing one of the bias and variance terms tends to increase the other. Tradeoff! (A simulation sketch follows below.)
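A simulation sketch of the decomposition above, using an invented setup (sine target, Gaussian noise, k-NN fit at a fixed point) to show how the neighborhood size k trades bias against variance while the noise term stays irreducible.

```python
# Estimate EPE(x0) = sigma^2 + Bias^2 + Variance by simulation for a k-NN fit.
# Target function, noise level, and sample sizes are illustrative assumptions.
import numpy as np

def simulate(k, x0=1.0, sigma=0.3, n_train=50, n_reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, np.pi, n_train)
        y = np.sin(x) + rng.normal(scale=sigma, size=n_train)
        nearest = np.argsort(np.abs(x - x0))[:k]
        preds[r] = y[nearest].mean()                 # k-NN estimate of f(x0) = sin(x0)
    bias2 = (preds.mean() - np.sin(x0)) ** 2
    var = preds.var()
    return sigma ** 2 + bias2 + var, bias2, var      # EPE(x0), squared bias, variance

for k in (1, 5, 25):
    print(k, simulate(k))   # larger k: variance falls, bias grows
```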
[Figure: prediction error versus model complexity, showing training error and test error curves. Low complexity: high bias, low variance; high complexity: low bias, high variance.]