Goodfellow: Chap 5 Machine Learning Basics Dr. Charles Tappert The information here, although greatly condensed, comes almost entirely from the chapter content.
Chapter Sections • Introduction • 1 Learning Algorithms • 2 Capacity, Overfitting and Underfitting • 3 Hyperparameters and Validation Sets • 4 Estimators, Bias and Variance • 5 Maximum Likelihood Estimation • 6 Bayesian Statistics (not covered in this course) • 7 Supervised Learning Algorithms • 8 Unsupervised Learning Algorithms (later in course) • 9 Stochastic Gradient Descent • 10 Building a Machine Learning Algorithm • 11 Challenges Motivating Deep Learning
Introduction • Machine Learning • Form of applied statistics with extensive use of computers to estimate complicated functions • Two central approaches • Frequentist estimators and Bayesian inference • Two categories of machine learning • Supervised learning and unsupervised learning • Most machine learning algorithms based on stochastic gradient descent optimization • Focus on building machine learning algorithms
1 Learning Algorithms • A machine learning algorithm learns from data • A computer program learns from • experience E • with respect to some class of tasks T • and performance measure P • if its performance P on tasks T improves with experience E
1.1 The Task, T • Machine learning tasks are too difficult to solve with fixed programs designed by humans • Machine learning tasks are usually described in terms of how the system should process an example where • An example is a collection of features measured from an object or event • Many kinds of tasks • Classification, regression, transcription, anomaly detection, imputation of missing values, etc.
1.2 The Performance Measure, P • Performance P is usually specific to the Task T • For classification tasks, P is usually accuracy • The proportion of examples correctly classified • Here we are interested in the accuracy on data not seen before – a test set
1.3 The Experience, E • Most learning algorithms in this book are allowed to experience an entire dataset • Unsupervised learning algorithms • Learn useful structural properties of the dataset • Supervised learning algorithms • Experience a dataset containing features with each example associated with a label or target • For example, target provided by a teacher • The line between supervised and unsupervised learning is often blurred
1.4 Example: Linear Regression • Linear regression takes a vector x as input and predicts the value of a scalar y as its output • Task: predict scalar y from vector x • Experience: set of vectors X and vector of targets y • Performance: mean squared error => normal equations • Solution of the normal equations is w = (X^T X)^-1 X^T y
Pseudoinverse Method (Intro to Algorithms by Cormen, et al.) • General linear regression • Assume polynomial functions • Minimizing mean squared error => normal equations • Solution of the normal equations is w = (A^T A)^-1 A^T y = A^+ y, where A is the design matrix of polynomial features and A^+ is its pseudoinverse
Example: Pseudoinverse Method • Simple linear regression homework example • Find • Given points
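Below is a minimal numpy sketch of the pseudoinverse/normal-equations solution for a straight-line fit; the data points are made up, since the homework values are not reproduced on the slide.

```python
import numpy as np

# Hypothetical data points (the slide's actual homework values are not shown here).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.1, 2.9, 4.2])

# Design matrix for a straight-line fit: columns [1, x].
X = np.column_stack([np.ones_like(x), x])

# Normal-equations solution w = (X^T X)^-1 X^T y, computed via the pseudoinverse.
w = np.linalg.pinv(X) @ y
print("intercept, slope:", w)

# Mean squared error of the fit on the given points.
print("train MSE:", np.mean((X @ w - y) ** 2))
```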
Figure 5.1: Linear regression example. Left: the learned line y vs. x1. Right: optimization of w, plotting MSE(train) as a function of w1. (Goodfellow 2016)
2 Capacity, Overfitting and Underfitting • The central challenge in machine learning is to perform well on new, previously unseen inputs, and not just on the training inputs • We want the generalization error (test error) to be low as well as the training error • This is what separates machine learning from optimization
2 Capacity, Overfitting and Underfitting • How can we affect performance on the test set when we get to observe only the training set? • Statistical learning theory provides answers • We assume the train and test sets are generated independently and identically distributed (i.i.d.) • Thus, the expected training error equals the expected test error
2 Capacity, Overfitting and Underfitting • Because the model is developed from the training set, the expected test error is usually greater than the expected training error • The factors determining how well a machine learning algorithm performs are its ability to • Make the training error small • Make the gap between training and test error small • These two factors correspond to the two central challenges in machine learning • Underfitting and overfitting
2 Capacity, Overfitting and Underfitting • Underfitting and overfitting can be controlled somewhat by altering the model’s capacity • Capacity = model’s ability to fit variety of functions • Low capacity models struggle to fit the training data • High capacity models can overfit the training data • One way to control the capacity of a model is by choosing its hypothesis space • For example, simple linear vs quadratic regression (see the sketch below)
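As a rough illustration of the hypothesis-space idea, the sketch below (with made-up quadratic data) fits polynomials of degree 1, 2, and 9 and prints their train and test MSE; the degrees, noise level, and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed quadratic ground truth, chosen only for illustration.
    return 1.5 * x**2 - x + 0.5

x_train = rng.uniform(-1, 1, 10)
y_train = true_fn(x_train) + rng.normal(0, 0.1, x_train.size)
x_test = rng.uniform(-1, 1, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.1, x_test.size)

# Capacity is controlled by the degree of the polynomial hypothesis space.
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Degree 1 underfits (high train and test error), degree 2 matches the data-generating family, and degree 9 tends to drive train error down while test error grows.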
Figure 5.2: Underfitting, appropriate capacity, and overfitting in polynomial estimation; three fits of y vs. x0. (Goodfellow 2016)
2 Capacity, Overfitting and Underfitting • Model capacity can be changed in many ways – above we changed the number of parameters • We can also change the family of functions – called the model’s representational capacity • Statistical learning theory provides various means of quantifying model capacity • The most well-known is the Vapnik-Chervonenkis dimension • The older, non-statistical principle of Occam’s razor chooses the simplest of competing hypotheses
2 Capacity, Overfitting and Underfitting • While simpler functions are more likely to generalize (have a small gap between training and test error), we still want low training error • Training error usually decreases until it asymptotes to the minimum possible error • Generalization error usually has a U-shaped curve as a function of model capacity
Figure 5.3: Generalization and capacity. Training error and generalization error as a function of capacity, with the underfitting zone, overfitting zone, generalization gap, and optimal capacity marked. (Goodfellow 2016)
2 Capacity, Overfitting and Underfitting • Non-parametric models can reach arbitrarily high capacities • For example, the complexity of the nearest neighbor algorithm is a function of the training set size • Training and generalization error vary as the size of the training set varies • Test error decreases with increased training set size
Figure 5.4: Effect of training set size. Top: train and test error (MSE) for a quadratic model and test error at the optimal capacity versus the number of training examples, approaching the Bayes error. Bottom: optimal capacity (polynomial degree) versus the number of training examples. (Goodfellow 2016)
2.1 No Free Lunch Theorem • Averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points • In other words, no machine learning algorithm is universally any better than any other • However, by making good assumptions about the encountered probability distributions, we can design high-performing learning algorithms • This is the goal of machine learning research
2.2 Regularization • The no free lunch theorem implies that we must design our learning algorithm to perform well on a specific task • Thus, we build preferences into the algorithm • Regularization is a way to do this, and weight decay is a method of regularization
2.2 Regularization • Applying weight decay to linear regression means minimizing J(w) = MSE(train) + λ w^T w • This expresses a preference for smaller weights • λ controls the preference strength, with λ = 0 imposing no preference • The next figure shows a 9th-degree polynomial trained with various values of λ • The true function is quadratic
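A small sketch of weight decay in closed form on the 9th-degree polynomial setting described above; the data, λ values, and sample size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed quadratic ground truth, fit with a 9th-degree polynomial as on the slide.
x = rng.uniform(-1, 1, 15)
y = x**2 + rng.normal(0, 0.1, x.size)
X = np.vander(x, 10)  # polynomial features: columns x^9, x^8, ..., x^0

def weight_decay_fit(X, y, lam):
    """Closed-form minimizer of MSE(train) + lam * w^T w."""
    n, n_features = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(n_features), X.T @ y)

for lam in (0.0, 0.1, 100.0):  # roughly: overfitting, appropriate, underfitting
    w = weight_decay_fit(X, y, lam)
    print(f"lambda = {lam}: train MSE {np.mean((X @ w - y) ** 2):.4f}, ||w||^2 = {w @ w:.3f}")
```

Larger λ shrinks the weights toward zero (eventually underfitting), while λ = 0 leaves the 9th-degree fit free to overfit.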
Weight decay on a 9th-degree polynomial fit: overfitting (λ → 0), appropriate weight decay (medium λ), underfitting (excessive λ); plots of y vs. x. (Goodfellow 2016)
2.2 Regularization • More generally, we can add a penalty called a regularizer to the cost function • Expressing preferences for one function over another is a more general way of controlling a model’s capacity, rather than including or excluding members from the hypothesis space • Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error
3 Hyperparameters and Validation Sets • Most machine learning algorithms have several settings, called hyperparameters, to control the behavior of the algorithm • Examples • In polynomial regression, the degree of the polynomial is a capacity hyperparameter • In weight decay, λ is a hyperparameter
3 Hyperparameters and Validation Sets • To avoid overfitting the hyperparameters during training, the training set is often split into two sets • Called the training set and the validation set • Typically, 80% of the training data is used for training and 20% for validation • Cross-Validation • For small datasets, k-fold cross-validation partitions the data into k non-overlapping subsets • k trials are run, training on (k-1)/k of the data and testing on the remaining 1/k • Test error is estimated by averaging the error across the k runs
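A minimal k-fold cross-validation sketch; fit and predict are hypothetical callables standing in for whatever learning algorithm is being evaluated.

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, k=5, seed=0):
    """Estimate test error by averaging the held-out error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])        # train on (k-1)/k of the data
        preds = predict(model, X[test_idx])            # test on the remaining 1/k
        errors.append(np.mean((preds - y[test_idx]) ** 2))
    return float(np.mean(errors))
```

A typical use is to run this for each candidate hyperparameter value (e.g., each λ or polynomial degree) and keep the value with the lowest cross-validation error.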
4 Estimators, Bias and Variance • 4.1 Point Estimation • A point estimate is a single best prediction of a quantity of interest – e.g., a maximum likelihood estimate • 4.2 Bias • 4.3 Variance and Standard Error
4 Estimators, Bias and Variance • 4.4 Trading off Bias and Variance to Minimize Mean Squared Error • The most common way to negotiate this trade-off is to use cross-validation • Can also compare the MSE of the estimates, which decomposes as MSE = Bias^2 + Variance (see the sketch below) • The relationship between bias and variance is tightly linked to capacity, underfitting, and overfitting
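The sketch below is a small Monte Carlo check of MSE = Bias^2 + Variance, comparing the sample mean with an arbitrarily shrunken mean; the distribution and shrinkage factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean = 1.0

# Repeatedly draw samples and record two estimators of the mean:
# the sample mean (unbiased) and a shrunken mean (biased, lower variance).
sample_means, shrunk_means = [], []
for _ in range(10_000):
    sample = rng.normal(true_mean, 1.0, size=10)
    sample_means.append(sample.mean())
    shrunk_means.append(0.8 * sample.mean())  # shrinkage factor chosen arbitrarily

for name, estimates in (("sample mean", sample_means), ("shrunken mean", shrunk_means)):
    estimates = np.asarray(estimates)
    bias = estimates.mean() - true_mean
    variance = estimates.var()
    mse = np.mean((estimates - true_mean) ** 2)
    print(f"{name}: bias^2 + variance = {bias**2 + variance:.4f}, MSE = {mse:.4f}")
```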
Bias and variance as a function of capacity: as capacity increases, bias decreases and variance increases; generalization error is minimized at the optimal capacity, with the underfitting zone to the left and the overfitting zone to the right. (Goodfellow 2016)
5 Maximum Likelihood Estimation • Covered in Duda textbook
6 Bayesian Statistics • Not covered in this course
7 Supervised Learning Algorithms • 7.1 Probabilistic Supervised Learning • Bayes decision theory • 7.2 Support Vector Machines • Also covered by Duda – maybe later in semester • 7.3 Other Simple Supervised Learning Algorithms • k-nearest neighbor algorithm (sketched below) • Decision trees – break the input space into regions
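A brute-force k-nearest-neighbor classifier sketch (Euclidean distance, majority vote), just to make the algorithm concrete:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest training points."""
    predictions = []
    for q in X_query:
        distances = np.linalg.norm(X_train - q, axis=1)       # distance to every training point
        nearest_labels = y_train[np.argsort(distances)[:k]]   # labels of the k closest
        values, counts = np.unique(nearest_labels, return_counts=True)
        predictions.append(values[np.argmax(counts)])          # majority vote
    return np.array(predictions)
```

With k = 1, each training example defines its own prediction region, which is why nearest neighbor can reach arbitrarily high capacity as the training set grows.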
Figure 5.7: Decision trees. The tree recursively splits the input space into axis-aligned regions, with each node and region identified by its binary code (0, 1, 00, 01, ..., 1111). (Goodfellow 2016)
8 Unsupervised Learning Algorithms • Algorithms with unlabeled examples • 8.1 Principal Component Analysis (PCA) • Covered by Duda – later in semester • 8.2 k-means Clustering • Covered by Duda – later in semester
Figure 5.8: Principal components analysis. Left: the original data in x1-x2 space. Right: the same data rotated into the principal component space z1-z2. (Goodfellow 2016)
9 Stochastic Gradient Descent • Nearly all of deep learning is powered by stochastic gradient descent (SGD) • SGD is an extension of the gradient descent introduced in Section 4.3
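A minimal minibatch SGD sketch for linear regression with MSE; the learning rate, batch size, and epoch count are illustrative defaults, not values from the book.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100, batch_size=8, seed=0):
    """Fit a linear model by minimizing MSE with minibatch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))                      # reshuffle the examples each epoch
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient of MSE on the minibatch
            w -= lr * grad                                   # take a step downhill
    return w
```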
10 Building a Machine Learning Algorithm • Deep learning algorithms use a simple recipe • Combine a specification of a dataset, a cost function, an optimization procedure, and a model
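A compact sketch of that recipe: a dataset, a linear model, a cost function (MSE plus weight decay), and gradient descent as the optimization procedure; every numeric value here is chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Dataset: features and targets from an assumed linear generator plus noise.
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=200)

def model(X, w):              # model: linear
    return X @ w

def cost(X, y, w, lam=0.01):  # cost function: MSE + weight decay
    return np.mean((model(X, w) - y) ** 2) + lam * w @ w

# Optimization procedure: full-batch gradient descent on the cost.
w, lam, lr = np.zeros(3), 0.01, 0.05
for _ in range(500):
    grad = 2 * X.T @ (model(X, w) - y) / len(y) + 2 * lam * w
    w -= lr * grad

print("learned w:", np.round(w, 3), "final cost:", float(cost(X, y, w)))
```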
11 Challenges Motivating Deep Learning • The simple machine learning algorithms work well on many important problems • But they don’t solve the central problems of AI • Recognizing objects, etc. • Deep learning was motivated by this failure
11.1 Curse of Dimensionality • Machine learning problems get difficult when the number of dimensions in the data is high • This phenomenon is known as the curse of dimensionality • Next figure shows • 10 regions of interest in 1D • 100 regions in 2D • 1000 regions in 3D
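Those counts come from splitting each axis into 10 bins, giving 10^d cells in d dimensions; the tiny sketch below just prints how quickly that grows.

```python
# Number of cells when each of d dimensions is split into 10 bins; covering
# them with even one example per cell quickly becomes infeasible.
for d in (1, 2, 3, 10):
    print(f"d = {d}: {10 ** d:,} cells")
```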
Figure 5.9: Curse of dimensionality. (Goodfellow 2016)
11.2 Local Constancy and Smoothness Regularization • To generalize well, machine learning algorithms need to be guided by prior beliefs • One of the most widely used “priors” is the smoothness prior or local constancy prior • Function should change little within a small region • To distinguish O(k) regions in input space requires O(k) samples • For the kNN algorithm, each training sample defines at most one region
Figure 5.10: Nearest neighbor. (Goodfellow 2016)
11.3 Manifold Learning • A manifold is a connected region • Manifold learning algorithms reduce the space of interest, for example by assuming that most of n-dimensional Euclidean space consists of invalid inputs and that interesting variation occurs only along a lower-dimensional manifold • Probability distributions over images, text strings, and sounds that occur in real life are highly concentrated • Figure 5.11 shows a manifold, Figure 5.12 shows uniformly sampled images (static noise), and Figure 5.13 shows the manifold structure of a dataset of faces
Figure 5.11: Manifold learning. Data sampled in two dimensions that is concentrated near a one-dimensional manifold. (Goodfellow 2016)
Figure 5.12: Uniformly sampled images (static noise). (Goodfellow 2016)
Figure 5.13: QMUL dataset. Manifold structure of a dataset of faces. (Goodfellow 2016)