Machine Learning: Neural Networks. Eric Xing, Lecture 3, August 12, 2010. Reading: Chap. 5, CB.
Learning highly non-linear functions f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
Examples: the XOR gate; speech recognition.
Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
• Activation function
• Artificial neural networks
• Supervised learning
• Gradient descent
Connectionist Models
Consider humans:
• Neuron switching time ~ 0.001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ 0.1 second
• 100 inference steps doesn't seem like enough → much parallel computation
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
The Abdominal Pain Perceptron
[Figure: input units for Age, Male, Temp, WBC, Pain Intensity, and Pain Duration (example values 1, 1, 20, 37, 10, 1) feed through adjustable weights into output units for candidate diagnoses: Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, and Pancreatitis; in the example only one of the seven outputs fires (0 0 0 1 0 0 0).]
Myocardial Infarction Network
[Figure: input units for Pain Intensity, Pain Duration, ECG: ST Elevation, Smoker, Age, and Male (example values 2, 1, 1, 50, 1, 4) feed a network whose single output, 0.8, is read as the "probability" of myocardial infarction.]
Perceptrons
[Figure: a perceptron with input units (Headache, Cough) and output units (Pneumonia, Flu, Meningitis, No disease). The error is "what we got - what we wanted", and the Δ rule changes the weights to decrease the error.]
Perceptrons
• Input to unit i: $a_i$ = measured value of variable i
• Input to unit j: $a_j = \sum_i w_{ij}\, a_i$
• Output of unit j: $o_j = \dfrac{1}{1 + e^{-(a_j + \theta_j)}}$
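As a concrete reading of these formulas, here is a minimal sketch of a single sigmoid unit in Python (the function and variable names are illustrative, not from the slides):

```python
import math

def sigmoid_unit(inputs, weights, theta):
    """Output of unit j: o_j = 1 / (1 + exp(-(a_j + theta_j))),
    where a_j = sum_i w_ij * a_i is the weighted sum of the unit's inputs."""
    a_j = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))

# Two measured input values, illustrative weights and bias
print(sigmoid_unit([1.0, 0.5], weights=[0.4, -0.2], theta=0.1))  # ~0.60
```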
Logistic Regression Model (the sigmoid unit)
• Independent variables: x1, x2, x3
• Coefficients: a, b, c
• Dependent variable: p (the prediction)
[Figure: a single sigmoid unit (Σ) takes the inputs Age = 34, Gender = 1, Stage = 4 and outputs 0.6, the "probability of being alive".]
Jargon pseudo-correspondence:
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
The perceptron learning algorithm
• Recall the nice property of the sigmoid function
• Consider the regression problem f: X → Y, for scalar Y
• Let's maximize the conditional data likelihood
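Written out in their standard forms, the "nice property" of the sigmoid and the resulting likelihood gradient are:

$$\sigma(x) = \frac{1}{1+e^{-x}}, \qquad \frac{d\,\sigma(x)}{dx} = \sigma(x)\bigl(1-\sigma(x)\bigr)$$

For $o_d = \sigma(\mathbf{w}\cdot\mathbf{x}_d)$ and binary targets $t_d$, the conditional log-likelihood and its gradient are

$$\ell(\mathbf{w}) = \sum_d \bigl[t_d \ln o_d + (1-t_d)\ln(1-o_d)\bigr], \qquad \frac{\partial \ell(\mathbf{w})}{\partial w_i} = \sum_d x_{i,d}\,(t_d - o_d).$$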
Gradient Descent
(Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$)
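For a sigmoid unit trained on squared error, the standard derivation gives

$$E[\mathbf{w}] = \frac{1}{2}\sum_{d\in D}(t_d - o_d)^2, \qquad \frac{\partial E}{\partial w_i} = -\sum_{d\in D}(t_d - o_d)\,o_d(1-o_d)\,x_{i,d}, \qquad \Delta w_i = -\eta\,\frac{\partial E}{\partial w_i}.$$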
The perceptron learning rules
(Notation: $x_d$ = input, $t_d$ = target output, $o_d$ = observed unit output, $w_i$ = weight $i$)
Batch mode, do until convergence:
1. compute the gradient $\nabla E_D[\mathbf{w}]$
2. $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E_D[\mathbf{w}]$
Incremental mode, do until convergence: for each training example $d$ in $D$:
1. compute the gradient $\nabla E_d[\mathbf{w}]$
2. $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E_d[\mathbf{w}]$
where $E_D[\mathbf{w}] = \frac{1}{2}\sum_{d\in D}(t_d - o_d)^2$ and $E_d[\mathbf{w}] = \frac{1}{2}(t_d - o_d)^2$.
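A minimal sketch of both modes for a single sigmoid unit, assuming the squared-error gradient above (all names are illustrative):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def grad_single(w, x, t):
    """Gradient of E_d[w] = 1/2 (t_d - o_d)^2 for one example."""
    o = predict(w, x)
    return [-(t - o) * o * (1 - o) * xi for xi in x]

def batch_step(w, data, eta):
    """Batch mode: one update using the gradient summed over all of D."""
    g = [0.0] * len(w)
    for x, t in data:
        for i, gi in enumerate(grad_single(w, x, t)):
            g[i] += gi
    return [wi - eta * gi for wi, gi in zip(w, g)]

def incremental_step(w, data, eta):
    """Incremental mode: update after each training example d in D."""
    for x, t in data:
        w = [wi - eta * gi for wi, gi in zip(w, grad_single(w, x, t))]
    return w

# Toy data: (input vector including a constant bias feature, target)
data = [([1.0, 0.0, 1.0], 1.0), ([1.0, 1.0, 0.0], 0.0)]
w = [random.uniform(-0.1, 0.1) for _ in range(3)]
for _ in range(1000):
    w = incremental_step(w, data, eta=0.5)
print([round(predict(w, x), 2) for x, _ in data])  # approaches [1.0, 0.0]
```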
MLE vs MAP • Maximum conditional likelihood estimate • Maximum a posteriori estimate
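In the standard notation, these two estimates are

$$\hat{\mathbf{w}}_{MLE} = \arg\max_{\mathbf{w}} \sum_d \ln P(y_d \mid \mathbf{x}_d, \mathbf{w}), \qquad \hat{\mathbf{w}}_{MAP} = \arg\max_{\mathbf{w}} \Bigl[\sum_d \ln P(y_d \mid \mathbf{x}_d, \mathbf{w}) + \ln p(\mathbf{w})\Bigr],$$

and with a zero-mean Gaussian prior $p(\mathbf{w})$ the log-prior reduces to a squared-weight penalty (the regularization view returned to in the summary slide).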
What decision surface does a perceptron define? (NAND)
Unit: $y = f(x_1 w_1 + x_2 w_2)$ with threshold $\theta = 0.5$, where $f(a) = 1$ for $a > \theta$ and $f(a) = 0$ for $a \le \theta$.
Truth table: $f(0w_1 + 0w_2) = 1$, $f(0w_1 + 1w_2) = 1$, $f(1w_1 + 0w_2) = 1$, $f(1w_1 + 1w_2) = 0$.
Some possible values for $(w_1, w_2)$: (0.20, 0.35), (0.20, 0.40), (0.25, 0.30), (0.40, 0.20).
What decision surface does a perceptron define?
Same unit: $y = f(x_1 w_1 + x_2 w_2)$, $\theta = 0.5$, $f(a) = 1$ for $a > \theta$, else 0.
Truth table: $f(0w_1 + 0w_2) = 0$, $f(0w_1 + 1w_2) = 1$, $f(1w_1 + 0w_2) = 1$, $f(1w_1 + 1w_2) = 0$; this is the XOR table.
Some possible values for $w_1$ and $w_2$: none; no single perceptron can realize this table.
What decision surface does a perceptron define?
A two-layer network of the same threshold units ($\theta = 0.5$ for all units, $f(a) = 1$ for $a > \theta$, else 0) can. A possible set of values for $(w_1, w_2, w_3, w_4, w_5, w_6)$ is (0.6, -0.6, -0.7, 0.8, 1, 1); with $w_1, w_2$ feeding one hidden unit, $w_3, w_4$ the other, and $w_5, w_6$ the output unit, this reproduces the XOR table above.
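To make these truth tables concrete, a minimal check in Python, assuming that wiring (the function names are mine, not the slides'):

```python
def f(a, theta=0.5):
    """Threshold activation: 1 for a > theta, 0 for a <= theta."""
    return 1 if a > theta else 0

def unit_gate(w1, w2):
    """Truth table of y = f(x1*w1 + x2*w2) for a single threshold unit."""
    return [f(x1 * w1 + x2 * w2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]

def two_layer(w1, w2, w3, w4, w5, w6):
    """Truth table of the two-layer net: two hidden units feeding one output unit."""
    out = []
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h1 = f(x1 * w1 + x2 * w2)
        h2 = f(x1 * w3 + x2 * w4)
        out.append(f(h1 * w5 + h2 * w6))
    return out

print(unit_gate(0.20, 0.40))                  # [0, 0, 0, 1]: AND with these weights
print(two_layer(0.6, -0.6, -0.7, 0.8, 1, 1))  # [0, 1, 1, 0]: XOR from the slide's weights
```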
Neural Network Model
[Figure: the inputs Age = 34, Gender = 2, Stage = 4 (independent variables) feed through a first layer of weights into three hidden sigmoid units (Σ), whose outputs feed through a second layer of weights into an output sigmoid unit giving the prediction 0.6, the "probability of being alive" (dependent variable).]
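A sketch of what this figure computes, i.e. a forward pass through one hidden layer of sigmoid units into a sigmoid output (the weight values here are placeholders, not read off the figure):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, w_out):
    """Forward pass: 3 inputs -> 3 hidden sigmoid units -> 1 sigmoid output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    return sigmoid(sum(w * hi for w, hi in zip(w_out, h)))

# Inputs as in the figure: Age = 34, Gender = 2, Stage = 4
x = [34, 2, 4]
W_hidden = [[0.06, 0.2, 0.1],   # illustrative first-layer weights (one row per hidden unit)
            [0.04, 0.3, 0.2],
            [0.02, 0.8, 0.2]]
w_out = [0.5, 0.1, 0.7]         # illustrative second-layer weights
print(round(forward(x, W_hidden, w_out), 2))  # the predicted "probability of being alive"
```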
“Combined logistic models”
[Figures: three builds of the same network, each highlighting one hidden unit and its weights; every hidden unit is itself a logistic (sigmoid) model of the inputs (Age, Gender, Stage), and the output unit is a logistic model of the hidden units, again yielding the 0.6 "probability of being alive".]
Not really, no target for hidden units...
[Figure: the full network once more; the hidden units have no observed target values, so the layers cannot simply be fit as separate logistic regressions, which is what motivates backpropagation.]
Hidden Units and Backpropagation
Backpropagation Algorithm
(Notation: $x_{i,j}$ = the $i$-th input to unit $j$, $t_k$ = target output, $o_k$ = observed unit output, $w_{i,j}$ = weight from $i$ to $j$)
• Initialize all weights to small random numbers.
• Until convergence, do:
  • Input the training example to the network and compute the network outputs
  • For each output unit $k$: $\delta_k = o_k(1 - o_k)(t_k - o_k)$
  • For each hidden unit $h$: $\delta_h = o_h(1 - o_h)\sum_{k \in outputs} w_{h,k}\,\delta_k$
  • Update each network weight $w_{i,j}$: $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta\,\delta_j\,x_{i,j}$
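A minimal Python sketch of these update rules for one hidden layer of sigmoid units (network size, learning rate, and epoch count are illustrative):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_backprop(data, n_in, n_hidden, n_out, eta=0.5, epochs=10000):
    """Backpropagation for an n_in -> n_hidden -> n_out sigmoid network (plus bias inputs)."""
    # Initialize all weights to small random numbers (the extra column is the bias weight)
    W1 = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in data:
            # Input the training example and compute the network outputs
            xb = x + [1.0]
            h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W1]
            hb = h + [1.0]
            o = [sigmoid(sum(w * hi for w, hi in zip(row, hb))) for row in W2]
            # For each output unit k: delta_k = o_k (1 - o_k) (t_k - o_k)
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # For each hidden unit j: delta_j = h_j (1 - h_j) sum_k w_{j,k} delta_k
            d_hid = [h[j] * (1 - h[j]) * sum(W2[k][j] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # Update each network weight: w_{i,j} += eta * delta_j * x_{i,j}
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    W2[k][j] += eta * d_out[k] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    W1[j][i] += eta * d_hid[j] * xb[i]
    return W1, W2

# XOR, the running example a single perceptron cannot represent
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
W1, W2 = train_backprop(xor, n_in=2, n_hidden=3, n_out=1)
```

(As the next slide notes, gradient descent can land in a local minimum; rerunning with fresh random weights usually fixes it.)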
More on Backpropagation
• It performs gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
• In practice, often works well (can run multiple times)
• Often includes a weight momentum α
• Minimizes error over the training examples
• Will it generalize well to subsequent testing examples?
• Training can take thousands of iterations (very slow!)
• Using the network after training is very fast
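The weight momentum term is conventionally written as

$$\Delta w_{i,j}(n) = \eta\,\delta_j\,x_{i,j} + \alpha\,\Delta w_{i,j}(n-1), \qquad 0 \le \alpha < 1,$$

so each update partly repeats the previous one, which smooths the descent and can speed it up along shallow directions of the error surface.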
Learning Hidden Layer Representation • A network: • A target function: • Can this be learned?
Learning Hidden Layer Representation • A network: • Learned hidden layer representation:
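The network and target function on these two slides are given only as figures; a classic instance (assumed here) is the 8-3-8 "identity" task, in which eight one-hot inputs must be reproduced at eight outputs through only three hidden units, forcing the hidden layer to invent a roughly binary code. A sketch under that assumption, reusing the train_backprop sketch from the backpropagation slide above (so this snippet is not standalone):

```python
import math

# Assumed example: 8 one-hot input patterns, each to be reproduced at the output
identity_data = [([1.0 if j == i else 0.0 for j in range(8)],
                  [1.0 if j == i else 0.0 for j in range(8)]) for i in range(8)]

W1, W2 = train_backprop(identity_data, n_in=8, n_hidden=3, n_out=8,
                        eta=0.3, epochs=5000)

def hidden_code(x, W1):
    """Hidden-unit activations for input x (rounded for inspection)."""
    xb = x + [1.0]
    return [round(1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, xb)))), 2)
            for row in W1]

for x, _ in identity_data:
    print(hidden_code(x, W1))   # tends toward a distinct, roughly binary code per input
```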
Expressive Capabilities of ANNs
• Boolean functions:
  • Every Boolean function can be represented by a network with a single hidden layer
  • But it might require a number of hidden units exponential in the number of inputs
• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Application: ANN for Face Recognition
[Figures: the network model, and the learned hidden unit weights.]
Minimizing the Error
[Figure: a one-dimensional error surface over a weight w. Starting from the initial w (initial error), gradient descent moves against the derivative (here the negative derivative gives a positive change in w) until it reaches a local minimum, the trained w with the final error.]
Overfitting in Neural Nets
[Figures: (left) an overfitted model versus the "real" model of CHD against age; (right) error versus training cycles, with training error still falling while holdout error turns upward, i.e. the overfitted regime.]
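The holdout curve motivates early stopping: keep the weights from the training cycle with the lowest holdout error. A schematic sketch (train_step and holdout_error are placeholders for whatever training and evaluation routines you use):

```python
import copy

def train_with_early_stopping(weights, train_step, holdout_error, max_cycles=10000):
    """Return the weights seen so far with the lowest holdout error.

    train_step(weights)    -> weights after one more cycle of training
    holdout_error(weights) -> error on a held-out set never used for training
    """
    best_w, best_err = copy.deepcopy(weights), holdout_error(weights)
    for _ in range(max_cycles):
        weights = train_step(weights)
        err = holdout_error(weights)
        if err < best_err:                      # holdout error still improving
            best_w, best_err = copy.deepcopy(weights), err
    return best_w, best_err
```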
Alternative Error Functions
• Penalize large weights
• Train on target slopes as well as values
• Tie together weights (e.g., in phoneme recognition networks)
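The weight-penalty error function in the first bullet is, in its standard form,

$$E(\mathbf{w}) = \frac{1}{2}\sum_{d\in D}\sum_{k\in outputs}\bigl(t_{kd}-o_{kd}\bigr)^2 \;+\; \gamma \sum_{i,j} w_{ji}^2,$$

which corresponds to the MAP / Gaussian-weight-prior view in the summary slide.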
Artificial neural networks: what you should know
• Highly expressive non-linear functions
• Highly parallel networks of logistic function units
• Minimizing the sum of squared training errors gives MLE estimates of the network weights, if we assume zero-mean Gaussian noise on the output values
• Minimizing the sum of squared errors plus the sum of squared weights (regularization) gives MAP estimates, assuming zero-mean Gaussian priors on the weights
• Gradient descent as the training procedure, and how to derive your own gradient descent procedure
• Networks discover useful representations at the hidden units
• Local minima are the greatest problem
• Overfitting, regularization, early stopping