Machine Learning SCE 5820: Machine Learning Instructor: Jinbo Bi Computer Science and Engineering Dept.
Course Information • Instructor: Dr. Jinbo Bi • Office: ITEB 233 • Phone: 860-486-1458 • Email: jinbo@engr.uconn.edu • Web: http://www.engr.uconn.edu/~jinbo/ • Time: Tue / Thur 2:00pm – 3:15pm • Location: BCH 302 • Office hours: Thur 3:15–4:15pm • HuskyCT • http://learn.uconn.edu • Login with your NetID and password • Illustration
Regression and classification • Both regression and classification problems are typically supervised learning problems • The main property of supervised learning: each training example contains the input variables and the corresponding target label • The goal is to find a good mapping from the input variables to the target variable
Classification: Definition • Given a collection of examples (the training set) • Each example contains a set of input variables (features) and a target variable: the class • Find a model for the class attribute as a function of the values of the other variables • Goal: previously unseen examples should be assigned a class as accurately as possible • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
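As a concrete illustration of this train/test workflow, here is a minimal sketch using scikit-learn on synthetic data (the toy dataset, the k-nearest-neighbor model, and the 70/30 split are assumptions made for the example, not part of the slides):

```python
# Illustrative sketch: split labeled data into training and test sets,
# fit a classifier on the training set, and measure accuracy on the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # input variables (features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # target class label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # build model on training set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # validate on test set
```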
Classification: Application 1 • Fraud detection – Goal: predict fraudulent cases in credit card transactions • Past transaction records (with categorical and continuous attributes) are labeled and form the training set; a classifier is learned from the training set; the resulting model is applied to current data (the test set) to predict the class • [Diagram: training set → learn classifier → model → test set]
Classification: Application 2 • Handwritten Digit Recognition • Goal: Identify the digit of a handwritten number • Approach: • Align all images to derive the features • Model the class (identity) based on these features
Classification algorithms • K-Nearest-Neighbor classifiers • Naïve Bayes classifier • Neural Networks • Linear Discriminant Analysis (LDA) • Support Vector Machines (SVM) • Decision Tree • Logistic Regression • Graphical models
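To make one entry of this list concrete, here is a minimal k-nearest-neighbor classifier sketched directly in NumPy (an illustrative toy implementation, not the one used in the course; Euclidean distance and majority vote are the usual textbook choices):

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict class labels by majority vote among the k nearest training points."""
    preds = []
    for x in X_query:
        dists = np.linalg.norm(X_train - x, axis=1)            # distance to every training point
        nearest = np.argsort(dists)[:k]                         # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])                 # majority vote among the neighbors
    return np.array(preds)
```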
Regression: Definition • Goal: predict the value of one or more continuous target attributes given the values of the input attributes • The difference between classification and regression lies only in the target attribute • Classification: discrete or categorical target • Regression: continuous target • Widely studied in statistics and in the neural network field
Regression application 1 • Continuous target (Loss) • Goal: predict the possible loss from a customer • Past transaction records (with categorical and continuous attributes) are labeled and form the training set; a regressor is learned from the training set and applied to current data (the test set) to predict the loss • [Diagram: training set → learn regressor → model → test set] • Example training records:

Tid | Refund | Marital Status | Taxable Income | Loss
1   | Yes    | Single         | 125K           | 100
2   | No     | Married        | 100K           | 120
3   | No     | Single         | 70K            | -200
4   | Yes    | Married        | 120K           | -300
5   | No     | Divorced       | 95K            | -400
6   | No     | Married        | 60K            | -500
7   | Yes    | Divorced       | 220K           | -190
8   | No     | Single         | 85K            | 300
9   | No     | Married        | 75K            | -240
10  | No     | Single         | 90K            | 90
Regression applications • Examples: • Predicting sales amounts of new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices.
Regression algorithms • Least squares methods • Regularized linear regression (ridge regression) • Neural networks • Support vector machines (SVM) • Bayesian linear regression
Practical issues in the training • Underfitting • Overfitting • Before introducing these important concepts, let us study a simple regression algorithm – linear regression
Least squares • We wish to use some real-valued input variables x to predict the value of a target y • We collect training data of pairs (x_i, y_i), i = 1, …, N • Suppose we have a model f that maps each example x to a predicted value y' = f(x) • Sum of squares function: E = Σ_i (y_i − f(x_i))² • i.e., the sum of the squared deviations between the observed target value y and the predicted value y'
Least squares • Find a function f such that the sum of squares is minimized • For example, when the function is linear in the parameters w: f(x) = wᵀx • Least squares with a linear function of the parameters w is called “linear regression”
Linear regression • Linear regression has a closed-form solution for w • The minimum is attained where the derivative of the sum of squares with respect to w equals zero
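Spelling out that zero-derivative step in matrix form (with the training inputs stacked as rows of a matrix X and the targets in a vector y, this is the standard normal-equation derivation):

```latex
E(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2, \qquad
\nabla_{\mathbf{w}} E = -2\,X^\top(\mathbf{y} - X\mathbf{w}) = \mathbf{0}
\;\Longrightarrow\;
X^\top X\,\mathbf{w} = X^\top \mathbf{y}
\;\Longrightarrow\;
\mathbf{w} = \big(X^\top X\big)^{-1} X^\top \mathbf{y}.
```

(This assumes XᵀX is invertible; otherwise a pseudo-inverse is used.)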
Polynomial Curve Fitting • x is evenly spaced on [0,1] • y = f(x) + random error • y = sin(2πx) + ε, ε ~ N(0,σ)
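A short NumPy sketch of this toy setup, fitting polynomials of a few different orders by least squares (the sample size, noise level, and polynomial orders are illustrative choices, not the exact settings used on the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0, 1, N)                               # x evenly spaced on [0, 1]
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)      # y = sin(2*pi*x) + Gaussian noise

x_test = np.linspace(0, 1, 100)                        # held-out points from the true curve
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                  # least-squares polynomial fit
    rms_train = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    rms_test = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train RMS {rms_train:.3f}, test RMS {rms_test:.3f}")
```

A very low-order fit underfits (high error everywhere), while the highest-order fit drives the training error down but typically does much worse on the held-out points – the over-fitting behavior discussed next.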
Over-fitting • Root-Mean-Square (RMS) Error: E_RMS = √( (1/N) Σ_i (f(x_i) − y_i)² )
Data Set Size: 9th Order Polynomial • [Plots: the 9th-order polynomial fit to training sets of different sizes; with more data points, over-fitting becomes less severe]
Regularization • Penalize large coefficient values: minimize Σ_i (y_i − f(x_i))² + λ‖w‖² • This penalized least-squares problem is called ridge regression
Ridge Regression • Derive the analytic solution to the optimization problem for ridge regression • Using the first-order (KKT) condition: set the derivative with respect to w to 0
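Setting that first-order derivative to zero gives w = (XᵀX + λI)⁻¹Xᵀy. A minimal NumPy sketch of the closed-form solution (the regularization weight λ is a placeholder to be tuned):

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

For λ = 0 this reduces to ordinary least squares; the λI term keeps the matrix invertible and shrinks large coefficient values.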
Neural networks • Introduction • Different designs of NN • Feed-forward Network (MLP) • Network Training • Error Back-propagation • Regularization
Introduction • Neuroscience studies how networks of neurons produce intellectual behavior, cognition, emotion and physiological responses • Computer science studies how to simulate knowledge from cognitive science, including the way neurons process signals • Artificial neural networks simulate the connectivity of the neural system and the way signals pass through it, and mimic the massively parallel operations of the human brain
Common features • [Diagram of a biological neuron, with its dendrites labeled]
Different types of NN • Adaptive NN: have a set of adjustable parameters that can be tuned • Topological NN • Recurrent NN
Different types of NN • Feed-forward NN • Multi-layer perceptron • Linear perceptron • [Diagram: layered network with an input layer, hidden layer, and output layer]
Different types of NN • Radial basis function NN (RBFN)
Multi-Layer Perceptron • Layered perceptron networks can realize any logical function; however, there is no simple way to estimate their parameters or to generalize the (single-layer) perceptron convergence procedure • Multi-layer perceptron (MLP) networks are a class of models formed from layers of sigmoidal nodes, and they can be used for regression or classification • They are commonly trained using gradient descent on a mean-squared-error performance function, with a technique known as error back-propagation used to calculate the gradients • They have been widely applied to many prediction and classification problems over the past 15 years
Linear perceptron • [Diagram: inputs x1, x2, …, xt with weights w1, w2, …, wt feed a summation node Σ that produces the output y] • y = w1·x1 + w2·x2 + … + wt·xt • Input layer → output layer • Many functions cannot be approximated by a perceptron
Multi-Layer Perceptron • XOR (exclusive OR) problem: 0+0=0, 1+1=2=0 mod 2, 1+0=1, 0+1=1 • The perceptron does not work here! A single layer generates only a linear decision boundary, and XOR is not linearly separable (see the sketch below)
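A tiny NumPy sketch of why a second layer helps: no single threshold unit separates XOR, but two hidden threshold units (acting as OR and NAND) combined by an output unit do. The weights below are hand-picked for illustration, not learned:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)          # threshold (Heaviside) activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: first unit computes OR(x1, x2), second computes NAND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [-1.0, -1.0]])
b1 = np.array([-0.5, 1.5])
h = step(X @ W1.T + b1)

# Output layer: AND of the two hidden units -> XOR of the original inputs
w2 = np.array([1.0, 1.0])
b2 = -1.5
y = step(h @ w2 + b2)
print(y)   # [0 1 1 0], which matches XOR
```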
Multi-Layer Perceptron • [Diagram: input layer x1, …, xt; hidden layer of nodes f(Σ) connected by weights W^(1); output layer y connected by weights W^(2)] • Each link is associated with a weight, and these weights are the tuning parameters to be learned • Each neuron, except those in the input layer, receives inputs from the previous layer and reports an output to the next layer
Each neuron computes a weighted summation of its inputs (with weights w1, w2, …, wn) followed by an activation function f • The activation function f can be: • Identity function f(x) = x • Sigmoid function f(x) = 1/(1 + e^(−x)) • Hyperbolic tangent f(x) = tanh(x)
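For reference, the three activation functions written out in NumPy (a plain transcription of the formulas above):

```python
import numpy as np

def identity(x):
    return x                              # f(x) = x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes input into (0, 1)

def tanh(x):
    return np.tanh(x)                     # squashes input into (-1, 1)
```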
Universal Approximation of MLP • [Diagram: network with a 1st, 2nd, and 3rd layer] • Universal Approximation: a three-layer network can in principle approximate any function to arbitrary accuracy!
Feed-forward network function • [Diagram: signals flow from the inputs x1, x2, …, xt through the hidden nodes to the output y] • The output from each hidden node: z_j = f( Σ_i w_ji^(1) x_i ) • The final output: y = f( Σ_j w_j^(2) z_j )
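A minimal forward pass for one hidden layer, following the two formulas above (the tanh hidden activation, identity output activation, and the weight shapes are assumptions made for the sketch):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Feed-forward pass: hidden activations z, then network output y.

    W1 has shape (hidden, inputs); W2 has shape (outputs, hidden)."""
    z = np.tanh(W1 @ x + b1)     # output from each hidden node
    y = W2 @ z + b2              # final output (identity activation, e.g. for regression)
    return y, z
```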
Network Training • A supervised neural network is a function h(x;w) that maps inputs x to a target y • Usually training a NN does not involve changing the NN structure (such as how many hidden layers or how many hidden nodes it has) • Training a NN refers to adjusting the values of the connection weights w so that h(x;w) adapts to the problem • Use the sum of squares E(w) = Σ_i (h(x_i;w) − y_i)² as the error metric • Use gradient descent to minimize E(w)
Gradient descent • Review of gradient descent: an iterative algorithm consisting of many iterations • In each iteration, the weights w receive a small update • Terminate when the network is stable, in other words when the training error cannot be reduced further (E(w_new) < E(w) no longer holds) • Or terminate when the error on a validation set starts to climb (early stopping)
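A generic gradient-descent loop, sketched in NumPy (the learning rate, iteration cap, stopping tolerance, and the gradient function grad_E are placeholders to be supplied by the specific model):

```python
import numpy as np

def gradient_descent(grad_E, w0, lr=0.01, max_iters=10_000, tol=1e-8):
    """Repeatedly take small steps against the gradient until the error stops decreasing."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        step = lr * grad_E(w)             # small update for the weights
        w = w - step
        if np.linalg.norm(step) < tol:    # training has effectively stabilized
            break
    return w
```

In practice, early stopping replaces the tolerance test: keep a validation set aside and stop once its error starts to climb.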
Error Back-propagation • [Diagram: signals flow forwards from the inputs x1, x2, …, xt through the hidden layer (weights W_ij) to the output y = h(x;w) (weights W_jk); learning proceeds backwards] • The update of the weights goes backwards because we have to use the chain rule to evaluate the gradient of E(w)
Error Back-propagation • Learning is backwards: update the weights in the output layer first, then propagate errors from the higher layers to the lower layers • [Diagram: the same network, with error signals flowing from the output y = h(x;w) back toward the inputs]
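A sketch of one back-propagation step for a single-hidden-layer network with a tanh hidden layer and sum-of-squares error (layer sizes, the learning rate, and in-place updates are illustrative assumptions, not the slides' exact notation):

```python
import numpy as np

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One gradient step on E = 0.5 * ||y - t||^2 via the chain rule."""
    # Forward pass
    a1 = W1 @ x + b1
    z = np.tanh(a1)                        # hidden activations
    y = W2 @ z + b2                        # network output (identity output units)

    # Backward pass: output-layer error first, then propagate it to the hidden layer
    delta2 = y - t                         # dE/dy for sum-of-squares error
    delta1 = (W2.T @ delta2) * (1 - z**2)  # chain rule through tanh (tanh' = 1 - tanh^2)

    # Gradient updates (outer products give dE/dW for each layer)
    W2 -= lr * np.outer(delta2, z)
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)
    b1 -= lr * delta1
    return W1, b1, W2, b2
```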