Announcements • HW1 assigned due Monday • HW2 may be assigned prior to next class, check email • Reading Assignment • Chapter 3 of the text book (decision trees) • Midterm exam • In class on Oct 25
Last Time • How accurate is algorithm A? What are its 95% confidence intervals? • Apply CLT to get Gaussian PDF for acc. • Find Z statistic to get CI • Do performances of Algos. A & B differ? • Use 10 fold CV to get deltas • Estimate the mean and std dev of delta • Compute t stat and get CI for delta • Does it contain 0? • How can we eval. Algorithms if FN and FP costs differ? • ROC curves
Today’s Topics • Finish ROC + Precision-Recall curves • Our next SL algorithm, Logistic Regression • Discriminative vs Generative • Perceptrons, Neural networks
Plot ROC Curve
Example   ML Algo Output (Sorted)   Correct Category   ROC point reached
Ex 9      .99                       +
Ex 7      .98                       +                  TPR=(2/5) FPR=(0/5)
Ex 1      .72                       -                  TPR=(2/5) FPR=(1/5)
Ex 2      .7                        +
Ex 6      .65                       +                  TPR=(4/5) FPR=(1/5)
Ex 10     .51                       -
Ex 3      .39                       -                  TPR=(4/5) FPR=(3/5)
Ex 5      .24                       +                  TPR=(5/5) FPR=(3/5)
Ex 4      .11                       -
Ex 3      .01                       -                  TPR=(5/5) FPR=(5/5)
[Figure: ROC curve. y-axis: TP rate = Prob (alg outputs + | + is correct), 0 to 1.0; x-axis: FP rate = Prob (alg outputs + | - is correct), 0 to 1.0]
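The plotting procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `roc_points` is made up here, the scores and labels are copied from the table, and it assumes no two examples share the same score (ties would need to be processed together).

```python
# Scores and true labels copied from the example slide.
examples = [
    (0.99, "+"), (0.98, "+"), (0.72, "-"), (0.70, "+"), (0.65, "+"),
    (0.51, "-"), (0.39, "-"), (0.24, "+"), (0.11, "-"), (0.01, "-"),
]

def roc_points(scored):
    """Return (FPR, TPR) pairs obtained by lowering the threshold one example at a time."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == "+")
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called +
    for _, label in ranked:
        if label == "+":
            tp += 1  # one more positive is now correctly called +
        else:
            fp += 1  # one more negative is now wrongly called +
        points.append((fp / n_neg, tp / n_pos))
    return points

points = roc_points(examples)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

Each positive example moves the curve up by 1/5 and each negative moves it right by 1/5, which is why the curve is a staircase.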
Area Under ROC Curve • A common metric for experiments is to numerically integrate the ROC curve [Figure: ROC curve; x-axis FP Rate, y-axis TP Rate, both from 0 to 1.0]
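As a rough sketch of what "numerically integrate" can mean here, the snippet below applies the trapezoid rule to a list of (FPR, TPR) points. The helper name `auc_trapezoid` is an assumption for illustration; the corner points are read off the TPR/FPR values of the 10-example ROC slide.

```python
def auc_trapezoid(points):
    """Integrate TPR over FPR with the trapezoid rule; points are (FPR, TPR) sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Corner points of the example curve (FPR, TPR), starting at the origin.
roc = [(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.8),
       (0.6, 0.8), (0.6, 1.0), (1.0, 1.0)]
area = auc_trapezoid(roc)
print(round(area, 3))
```

For this example the area comes out at about 0.8, which matches the fraction of (positive, negative) pairs in which the positive example receives the higher score (20 of 25).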
Precision vs Recall • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • Notice that TN is not used in either formula
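A small sketch of the two formulas, using counts taken from the 10-example ROC table when every example scoring above 0.6 is called +. The function name and the conventions for empty denominators are assumptions made for this illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from confusion-matrix counts; TN is never needed."""
    # Assumed conventions for empty denominators (not from the slides).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Predicted + = {.99, .98, .72, .70, .65}: four true positives, one false positive.
# The positive scored .24 is missed, so there is one false negative.
precision, recall = precision_recall(tp=4, fp=1, fn=1)
print(precision, recall)
```

Because TN does not appear, adding thousands of easily rejected negative examples leaves both numbers unchanged, while it would shrink the FP rate used in an ROC curve.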
ROC vs Recall-Precision • You can get very different visual results on the same data • The reason is that there may be lots of negative (-) examples [Figure: an ROC plot of P ( + | + ) vs P ( + | - ) beside a Precision vs Recall plot]
Recall-Precision Curve • You cannot simply connect the dots in Recall-Precision curves • See Goadrich, Oliphant, & Shavlik ILP’04 [Figure: Precision vs Recall plot with a marked point]
Exp Methodology Wrapup • Never train on test sets (use tune sets) • Use the central limit theorem to place confidence intervals on measurements • Paired t-tests provide a sensitive way to judge whether two algorithms perform differently • The t-test is a useful heuristic for guiding research • Use a two-tailed test • ROC curves are more informative than accuracy alone, especially when FP and FN costs differ
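The paired t-test recipe (per-fold deltas, their mean and standard deviation, a t statistic, and a confidence interval checked against 0) might look like the sketch below. The per-fold differences are invented purely for illustration and are not results from any experiment; the critical value 2.262 is the two-tailed 95% value of Student's t with 9 degrees of freedom.

```python
import math

def paired_t_interval(deltas, t_crit):
    """Return (mean delta, t statistic, confidence interval) for paired differences."""
    n = len(deltas)
    mean = sum(deltas) / n
    variance = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    std_err = math.sqrt(variance / n)                          # std dev of the mean
    t_stat = mean / std_err
    half_width = t_crit * std_err
    return mean, t_stat, (mean - half_width, mean + half_width)

# Hypothetical accuracy differences (algorithm A minus algorithm B) on 10 CV folds.
deltas = [0.02, 0.01, 0.03, -0.01, 0.02, 0.00, 0.04, 0.01, 0.02, 0.01]
T_CRIT_95_DF9 = 2.262  # two-tailed 95% critical value, 9 degrees of freedom

mean, t_stat, (low, high) = paired_t_interval(deltas, T_CRIT_95_DF9)
print(f"mean delta = {mean:.4f}, t = {t_stat:.2f}, 95% CI = ({low:.4f}, {high:.4f})")
print("CI contains 0" if low <= 0.0 <= high else "CI excludes 0")
```

If the interval excludes 0, the two algorithms are judged to differ at the 95% level; pairing by fold removes the fold-to-fold variation that both algorithms share.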
The Logistic Function (also called sigmoid) • Sigmoid dates back to the 19th century • Originally used to model growth of populations • Logistic function: $y = \frac{1}{1 + e^{-x}}$ [Figure: S-shaped curve of y against x]
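The Logistic Function (also called sigmoid) • Logistic Regression assumes the conditional Pr(C=1|F) is a sigmoid of a linear function of the features: $\Pr(C=1 \mid F) = \frac{1}{1 + e^{-(w_0 + \sum_{i=1}^{N} w_i f_i)}}$ • F is a real-valued feature vector $(f_1, \ldots, f_N)$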
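Logistic Regression • This gives us $\Pr(C=0 \mid F) = \frac{e^{-(w_0 + \sum_i w_i f_i)}}{1 + e^{-(w_0 + \sum_i w_i f_i)}}$, since Pr(C=1|F) + Pr(C=0|F) = 1.0 • So the odds are $\frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = e^{w_0 + \sum_i w_i f_i}$ • And $\log \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = w_0 + \sum_i w_i f_i$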
LR Decision Rule • Predict class is + if $w_0 + \sum_i w_i f_i > \Theta$ • $\Theta$ is the threshold (0 if FP and FN costs are equal)
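A minimal sketch of the model and decision rule, assuming the standard parameterization in which Pr(C=1|F) is the sigmoid of w0 + Σ wi·fi, so the log-odds equal that linear function. The function names and the example weights are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_odds(weights, features):
    """Linear function of the features; weights[0] is the bias w0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def predict_proba(weights, features):
    """Pr(C=1 | F) under the assumed logistic model."""
    return sigmoid(log_odds(weights, features))

def predict(weights, features, threshold=0.0):
    """Predict + when the log-odds exceed the threshold (0 for equal FP/FN costs)."""
    return "+" if log_odds(weights, features) > threshold else "-"

weights = [-1.0, 2.0, 0.5]          # hypothetical w0, w1, w2
p = predict_proba(weights, [2.0, 0.0])
print(p, predict(weights, [2.0, 0.0]))
```

Raising the threshold above 0 makes the classifier more reluctant to predict +, which is one way to account for false positives costing more than false negatives.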
The Decision Boundary of Logistic Regression is a hyperplane (line in 2D) • If $w_0 + \sum_i w_i f_i > 0$ predict +, otherwise predict - [Figure: 2D scatter of + and - examples separated by a line]
Encoding Nominal Features • Logistic Regression requires that examples be represented as a vector of real values (as do perceptrons, neural nets, SVMs, …) • How can we transform feature vectors with nominal features into real values?
Two Possibilities • For a nominal feature with M possible values: • Option 1: assign each value an integer between 1 and M, e.g. Color = {Red, Green, Blue} becomes Color = {1, 2, 3}. Not a good idea (why?) • Option 2: create M binary features, where for each example (without missing features) exactly one derived feature has a value of 1 and the other M-1 features have a value of 0, e.g. Color = {Red, Green, Blue} becomes isRed={0,1}, isGreen={0,1}, isBlue={0,1}
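The binary-feature encoding can be sketched as follows; the function name `one_hot` is an assumption for this example. The integer encoding is problematic because it imposes an ordering and spacing (Red < Green < Blue, with Green "halfway" between the others) that a linear model will treat as meaningful.

```python
def one_hot(value, values):
    """Encode a nominal value as M binary features, exactly one of which is 1."""
    if value not in values:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if value == v else 0.0 for v in values]

COLORS = ["Red", "Green", "Blue"]   # derived features: isRed, isGreen, isBlue

encoded = one_hot("Green", COLORS)
print(encoded)
```

With this encoding each value gets its own weight, so the model can assign Green any contribution it likes, independent of Red and Blue.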
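Representation of Pr(C|F) is the same for LR and Naïve Bayes with nominal features • Naïve Bayes: $\log \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = \log \frac{\Pr(C=1)}{\Pr(C=0)} + \sum_i \log \frac{\Pr(f_i \mid C=1)}{\Pr(f_i \mid C=0)} = \log \frac{\Pr(C=1)}{\Pr(C=0)} + \sum_i \sum_v I(f_i = v) \log \frac{\Pr(f_i = v \mid C=1)}{\Pr(f_i = v \mid C=0)}$ (sum over features i, sum over values v; $I(f_i = v)$ is the derived binary feature) • Logistic Regression: $\log \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = w_0 + \sum_i \sum_v w_{i,v} \, I(f_i = v)$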
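What about real valued features? • If $\Pr(f_i \mid C=c) = \mathcal{N}(\mu_{i,c}, \sigma_i^2)$ (same variance for both classes) then $\log \frac{\Pr(f_i \mid C=1)}{\Pr(f_i \mid C=0)} = \frac{\mu_{i,1} - \mu_{i,0}}{\sigma_i^2} f_i + \frac{\mu_{i,0}^2 - \mu_{i,1}^2}{2\sigma_i^2}$ • The log of the ratio of two Gaussians with equal variance is a line (see text for details)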
Learning Task • Given: • Labeled examples {(C,F)} • Do: • Find a good setting of the weights W
Learning Parameters for Logistic Regression • Typically, we want the weights W that maximize the conditional log-likelihood: $W^* = \arg\max_W \sum_k \log \Pr(C_k \mid F_k, W)$ (sum over training examples k) • Since each setting of the weights has an associated likelihood, we can view the likelihood as a function of the weights
“Weight Space” • Given feature representations, the weights W are free parameters that define a space • Each point in “weight space” corresponds to an LR model • Associated with each point is a conditional log likelihood • One way to do LR learning is to perform “gradient ascent” in the weight space • For LR, L(W) is a concave function (it has a single global maximum), so we are guaranteed to find the global maximum [Figure: L(W) plotted against W, with the goal at the peak]
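The Gradient-Ascent Rule • The “gradient”: $\nabla L(W) = \left[ \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_N} \right]$ • The direction of the gradient at W is the direction of fastest increase • The magnitude of the gradient at W is the rate of fastest increase • Since we want to increase L(W), we want to go “up hill” • We’ll take a finite step in weight space: $\Delta W = \eta \nabla L(W)$, or $\Delta w_i = \eta \frac{\partial L}{\partial w_i}$, where “delta” = change to W and $\eta$ is the step size [Figure: surface L(W) over weights W1 and W2]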
“On Line” vs. “Batch” Updates • We can either update W after each example is examined (on-line / stochastic updates) or after the entire training set is examined (batch updates) • On-line is typically much faster • But it is dependent on the order of the examples • Will it converge to the same spot? • For non-concave “objective functions”, on-line and batch processing will typically end up in different places
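Logistic Regression Update Rule • $\Delta w_i = \eta \sum_k \left( C_k - \Pr(C=1 \mid F_k, W) \right) f_{k,i}$ (sum over training examples k; $C_k - \Pr(C=1 \mid F_k, W)$ is the prediction error for example k) • Note that this is the batch update rule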
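A sketch of batch gradient ascent for logistic regression, assuming the standard update in which each weight changes by η times the sum over examples of (prediction error × feature value), with a constant feature of 1 for the bias. The toy data set, the step size, and all function names are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prob_positive(weights, feats):
    """Pr(C=1 | F) with weights[0] acting on a constant feature of 1."""
    x = [1.0] + list(feats)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def log_likelihood(weights, data):
    """Conditional log-likelihood: sum over examples of log Pr(C_k | F_k, W)."""
    total = 0.0
    for label, feats in data:
        p = prob_positive(weights, feats)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total

def batch_update(weights, data, eta):
    """One batch step: w_i += eta * sum_k (C_k - Pr(C=1|F_k)) * f_ki."""
    grads = [0.0] * len(weights)
    for label, feats in data:
        error = label - prob_positive(weights, feats)  # prediction error for example k
        for i, xi in enumerate([1.0] + list(feats)):
            grads[i] += error * xi
    return [w + eta * g for w, g in zip(weights, grads)]

# Hypothetical 1-feature data set (C, F); the classes overlap, so the optimum is finite.
data = [(0, [0.0]), (0, [1.0]), (1, [1.5]), (0, [2.0]), (1, [3.0]), (1, [4.0])]
weights = [0.0, 0.0]
history = []
for _ in range(200):
    history.append(log_likelihood(weights, data))
    weights = batch_update(weights, data, eta=0.1)
print(weights, history[0], history[-1])
```

The on-line (stochastic) variant would apply the same error × feature update after every single example instead of summing over the whole training set first.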
Two Decisions Needed for Learning Procedure • What are we trying to optimize? • How are we going to carry out the optimization?
Discriminant Models • Can classify instances into categories • Captures differences between categories • May not describe all features • Example: Decision trees (covered later) • Efficient and simple
Generative Models • Can create complete input feature vectors • Describes distributions of all features • Stochastically creates a plausible vector • Example: Bayes net (from above)
Using Generative Models • Make a model to generate positives • Make a model to generate negatives • Classify a test example based on which is more likely to generate it • The Naïve Bayes ratio does this
Some Properties (Ng & Jordan NIPS ‘02) • If the NB assumption holds, the asymptotic accuracy of NB and LR is the same • Otherwise, LR acc > NB acc as the number of training examples increases • NB converges to its asymptotic performance with fewer examples; LR takes more • NB is faster to train
Perceptrons • [Figure: input units f1, f2, …, fN plus a constant input F0=1, connected by weights w0, w1, w2, …, wN to a single summing (∑) output unit] • The decision rule for perceptrons has the same form as the decision rule for logistic regression and naïve Bayes • So perceptrons are linear separators
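Training Perceptrons • Perceptron training rule: $w_i \leftarrow w_i + \eta \, (t - o) \, f_i$, where t is the target output, o is the perceptron’s output, and $\eta$ is the learning rate • If the training data is not linearly separable, training may not converge • Delta Rule: a gradient descent rule derived from an objective function based on minimizing the squared error of a linear output unit. Almost identical to the LR on-line training rule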
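A minimal sketch of perceptron training, assuming the standard rule w_i ← w_i + η (t − o) f_i with a thresholded output and a constant input of 1 for the bias weight. The function names, the learning rate of 1, and the epoch cap are choices made for this illustration.

```python
def perceptron_output(weights, feats):
    """Thresholded unit: 1 if w0 + sum_i w_i * f_i > 0, else 0."""
    x = [1.0] + list(feats)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Apply the perceptron rule example by example; stop after an error-free epoch."""
    n_features = len(data[0][1])
    weights = [0.0] * (n_features + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for target, feats in data:
            out = perceptron_output(weights, feats)
            if out != target:
                mistakes += 1
                x = [1.0] + list(feats)
                weights = [w + eta * (target - out) * xi for w, xi in zip(weights, x)]
        if mistakes == 0:
            return weights, True      # converged: every example classified correctly
    return weights, False             # gave up: data may not be linearly separable

# OR is linearly separable; XOR is not.
or_data = [(0, [0, 0]), (1, [0, 1]), (1, [1, 0]), (1, [1, 1])]
xor_data = [(0, [0, 0]), (1, [0, 1]), (1, [1, 0]), (0, [1, 1])]

or_weights, or_converged = train_perceptron(or_data)
xor_weights, xor_converged = train_perceptron(xor_data)
print(or_weights, or_converged, xor_converged)
```

On the OR data the loop stops after a handful of epochs, while on XOR it runs until the epoch cap because no weight setting classifies all four examples.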
Should you? [Cartoon caption: "Fenwick here is biding his time waiting for neural networks."]
Concept Learning • Learning systems differ in how they represent concepts [Figure: training examples are passed to Backpropagation, to C4.5 / CART, and to AQ / FOIL; labels include the rule fragment X^Y, Z]
Advantages of Neural Networks • Provide best predictive accuracy for some problems • Being supplanted by SVMs? • Can represent a rich class of concepts [Figure: outputs labeled positive / negative, and probabilistic predictions such as “Saturday: 40% chance of rain, Sunday: 25% chance of rain”]
Artificial Neural Networks (ANNs) [Figure: a network with input units, hidden units, and output units; labels mark a weight, a recurrent link, and the error at the outputs]
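ANNs (continued) • Individual units: $\text{output}_i = F\left( \sum_j \text{weight}_{i,j} \times \text{output}_j \right)$ • Where $F(\text{input}_i) = \frac{1}{1 + e^{-(\text{input}_i - \text{bias}_i)}}$ [Figure: a single unit with its inputs, bias, and outputs]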
Perceptron Convergence Theorem (Rosenblatt, 1957) • Perceptron = no hidden units • If a set of examples is learnable, the DELTA rule will eventually find the necessary weights • However, a perceptron can only learn/represent linearly separable datasets
Linear Separability • Consider a perceptron. Its output is 1 if $W_1X_1 + W_2X_2 + \ldots + W_nX_n > \Theta$, and 0 otherwise • In terms of feature space, the boundary is $W_iX_i + W_jX_j = \Theta$, so $X_j = \frac{\Theta - W_iX_i}{W_j} = -\frac{W_i}{W_j}X_i + \frac{\Theta}{W_j}$ [y = mx + b] • Hence, can only classify examples if a “line” (hyperplane) can separate them [Figure: scatter of + and - examples in the $X_i$, $X_j$ plane]
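The XOR Problem • Exclusive OR (XOR):
      Input (X1 X2)   Output
a)    0 0             0
b)    0 1             1
c)    1 0             1
d)    1 1             0
• Not linearly separable [Figure: points a, b, c, d plotted in the X1, X2 plane, axes from 0 to 1] • A Neural Network Solution [Figure: network over X1 and X2 with link weights 1, 1, -1, -1, 1, 1] Let Θ = 0!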
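The original network figure is not recoverable, so the wiring below is an assumption: it is one arrangement that uses exactly the weights listed on the slide (1, 1, -1, -1, 1, 1) with every threshold set to 0, and it does compute XOR. The function names are hypothetical.

```python
def step(total, theta=0.0):
    """Threshold unit: output 1 if the weighted sum exceeds theta, else 0."""
    return 1 if total > theta else 0

def xor_net(x1, x2):
    """Two hidden threshold units feeding one output unit (assumed wiring)."""
    h1 = step(1 * x1 + -1 * x2)   # fires only for input (1, 0)
    h2 = step(-1 * x1 + 1 * x2)   # fires only for input (0, 1)
    return step(1 * h1 + 1 * h2)  # fires if either hidden unit fired

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

Each hidden unit draws its own line through the input space; the output unit then combines the two half-planes, which a single perceptron cannot do.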
The Need for Hidden Units • If there is one layer of enough hidden units (possibly $2^N$ for Boolean functions), the input can be recoded (N = number of input units) • This recoding allows any mapping to be represented (Minsky & Papert) • Question: How to provide an error signal to the interior units?
Hidden Units • One View: allow a system to create its own internal representation, one for which problem solving is easy [Figure label: A perceptron]
Reformulating XOR • [Figure: a unit over inputs X1, X2, and X3, where X3 = X1 ^ X2] • Or: [Figure: a network over inputs X1 and X2 only] • So, if a hidden unit can learn to represent X1 ^ X2, the solution is easy
Backpropagation • Backpropagation involves a generalization of the delta rule • Rumelhart, Parker, and Le Cun (and Bryson & Ho (1969), Werbos (1974)) independently developed (1985) a technique for determining how to adjust the weights of interior (“hidden”) units • Derivation involves partial derivatives (hence, the threshold function must be differentiable) • Error signal: $\frac{\partial E}{\partial W_{i,j}}$
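To make the role of the partial derivatives concrete, here is a sketch of backpropagation for a tiny network (two inputs, two sigmoid hidden units, one sigmoid output) with squared error on a single example. Everything here is an illustrative assumption: the network size, the weights, the error function E = ½(t − o)², and the convention that biases are added to the weighted sum. The last lines compare one backpropagated derivative with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return (hidden activations, output) for params = (W_h, b_h, w_o, b_o)."""
    W_h, b_h, w_o, b_o = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_h, b_h)]
    out = sigmoid(sum(w * h for w, h in zip(w_o, hidden)) + b_o)
    return hidden, out

def error(params, x, target):
    """Squared error E = 0.5 * (target - output)^2 on one example."""
    _, out = forward(params, x)
    return 0.5 * (target - out) ** 2

def backprop(params, x, target):
    """Return dE/dW for every weight and bias, using the chain rule."""
    W_h, b_h, w_o, b_o = params
    hidden, out = forward(params, x)
    delta_out = (out - target) * out * (1.0 - out)   # dE/d(net input of output unit)
    grad_w_o = [delta_out * h for h in hidden]
    grad_b_o = delta_out
    # Error signal passed back to each hidden unit through its outgoing weight.
    delta_h = [delta_out * w * h * (1.0 - h) for w, h in zip(w_o, hidden)]
    grad_W_h = [[d * xi for xi in x] for d in delta_h]
    grad_b_h = list(delta_h)
    return grad_W_h, grad_b_h, grad_w_o, grad_b_o

# Hypothetical weights and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, target = [1.0, 0.5], 1.0
grad_W_h, grad_b_h, grad_w_o, grad_b_o = backprop(params, x, target)

# Finite-difference check of dE/dW for the first hidden-layer weight.
eps = 1e-6
W_h, b_h, w_o, b_o = params
plus = ([[W_h[0][0] + eps, W_h[0][1]], W_h[1]], b_h, w_o, b_o)
minus = ([[W_h[0][0] - eps, W_h[0][1]], W_h[1]], b_h, w_o, b_o)
numeric = (error(plus, x, target) - error(minus, x, target)) / (2 * eps)
print(grad_W_h[0][0], numeric)
```

A gradient descent step would then subtract a small multiple of each derivative from the corresponding weight. The `h * (1 - h)` factors are the derivative of the sigmoid, which is why a hard threshold unit cannot be used here.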
Weight Space • Given a network layout, the weights and biases are free parameters that define a space • Each point in this weight space (w) specifies a network • Associated with each point is an error rate, E, over the training data • BackProp performs gradient descent in weight space
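Gradient descent in weight space [Figure: error surface E over weights W1 and W2, with the gradient dE/dw marked; a second panel shows the W1, W2 plane]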