Announcements

Announcements • HW1 assigned, due Monday • HW2 may be assigned prior to next class, check email • Reading Assignment • Chapter 3 of the textbook (decision trees) • Midterm exam • In class on Oct 25

Last Time • How accurate is algorithm A? What are its 95% confidence intervals? • Apply CLT to get Gaussian PDF for acc. • Find Z statistic to get CI • Do performances of Algos. A & B differ? • Use 10 fold CV to get deltas • Estimate the mean and std dev of delta • Compute t stat and get CI for delta • Does it contain 0? • How can we eval. Algorithms if FN and FP costs differ? • ROC curves
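
The first recap item (a 95% confidence interval on accuracy via the CLT) can be sketched as follows. This is a minimal illustration, not the course's own code: the function name, the accuracy of 0.80, and the test-set size of 100 are all assumptions made up for the example, and z = 1.96 is the usual two-sided 95% value for a Gaussian.

```python
import math

def accuracy_confidence_interval(accuracy, n_test, z=1.96):
    """Normal-approximation (CLT) interval: accuracy +/- z * standard error."""
    # Standard error of a proportion measured on n_test independent test examples.
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return accuracy - z * std_err, accuracy + z * std_err

# Hypothetical numbers: 80% accuracy measured on 100 test examples.
low, high = accuracy_confidence_interval(0.80, 100)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

The interval narrows as the square root of the number of test examples grows, which is why small test sets give wide intervals.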

Today's Topics • Finish ROC + Precision-Recall curves • Our next SL algorithm, Logistic Regression • Discriminative vs Generative • Perceptrons, Neural networks

Plot ROC Curve
Example   ML Algo Output (Sorted)   Correct Category
Ex 9      .99                       +
Ex 7      .98                       +
Ex 1      .72                       -
Ex 2      .7                        +
Ex 6      .65                       +
Ex 10     .51                       -
Ex 3      .39                       -
Ex 5      .24                       +
Ex 4      .11                       -
Ex 3      .01                       -
TP rate = Prob (alg outputs + | + is correct); FP rate = Prob (alg outputs + | - is correct)
Points on the curve, moving down the sorted list: TPR=(2/5) FPR=(0/5) • TPR=(2/5) FPR=(1/5) • TPR=(4/5) FPR=(1/5) • TPR=(4/5) FPR=(3/5) • TPR=(5/5) FPR=(3/5) • TPR=(5/5) FPR=(5/5)
[Figure: ROC curve, TP rate vs FP rate, both axes from 0 to 1.0]
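
The construction on this slide can be sketched in a few lines: sort by the algorithm's output, then walk down the list, moving up for each + and right for each -. The function name `roc_points` is hypothetical, and the sketch assumes no two examples share the same score (ties would need to be grouped into a single step).

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points from sweeping a threshold down the ranked list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == "+")
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing is called +
    for _, label in ranked:
        if label == "+":
            tp += 1  # one more true positive: step up
        else:
            fp += 1  # one more false positive: step right
        points.append((fp / n_neg, tp / n_pos))
    return points

# Outputs and correct categories from the slide's sorted list.
scores = [0.99, 0.98, 0.72, 0.70, 0.65, 0.51, 0.39, 0.24, 0.11, 0.01]
labels = ["+", "+", "-", "+", "+", "-", "-", "+", "-", "-"]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

The slide lists only the corner points of this staircase; the code also emits the intermediate points along each vertical or horizontal run.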

Area Under ROC Curve • A common metric for experiments is to numerically integrate the ROC curve • [Figure: ROC curve, TP rate (0 to 1.0) vs FP rate (0 to 1.0)]
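
One simple way to do the numerical integration is the trapezoid rule over the (FPR, TPR) points. The sketch below is illustrative only: `auc_trapezoid` is a hypothetical name, and the point list is a hand-written staircase chosen for the example.

```python
def auc_trapezoid(points):
    """Integrate TPR over FPR with the trapezoid rule; points are (FPR, TPR) pairs."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # vertical segments contribute zero
    return area

# Hypothetical staircase ROC curve given by its corner points.
corners = [(0.0, 0.0), (0.0, 0.4), (0.2, 0.4), (0.2, 0.8),
           (0.6, 0.8), (0.6, 1.0), (1.0, 1.0)]
area = auc_trapezoid(corners)
diagonal = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])  # random guessing
print(area, diagonal)
```

A curve hugging the top-left corner approaches an area of 1.0, while the diagonal of a random ranker gives 0.5.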

Precision vs Recall • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • Notice that TN is not used in either formula
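
A small sketch of the two formulas, with made-up counts (TP=4, FP=2, FN=1) used purely for illustration. The function name and the convention of returning 0.0 when a denominator is zero are assumptions, not something the slides specify.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN). TN never appears."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical confusion counts for one threshold.
precision, recall = precision_recall(tp=4, fp=2, fn=1)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Because TN is absent from both formulas, adding more correctly rejected negatives leaves precision and recall unchanged, whereas it would lower the FP rate on a ROC curve.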

ROC vs Recall-Precision • You can get very different visual results on the same data • ROC plots P ( + | + ) vs P ( + | - ); the other curve plots Precision vs Recall • The reason is that there may be lots of - examples

Recall-Precision Curve • You cannot simply connect the dots in Recall-Precision curves. • See Goadrich, Oliphant, & Shavlik ILP'04 • [Figure: Precision vs Recall plot]

Exp Methodology Wrapup • Never train on test sets (use tune sets) • Use the central-limit theorem to place confidence intervals on measurements • Paired t-tests provide a sensitive way to judge whether two algorithms perform differently • The t-test is a useful heuristic for guiding research • Use a two-tailed test • ROC curves are more informative than accuracy when FP and FN costs differ

Next Topic: Logistic Regression

The Logistic Function (also called sigmoid) • Sigmoid dates back to the 19th century • Originally used to model growth of populations • Logistic function: $y = \frac{1}{1 + e^{-x}}$

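The Logistic Function (also called sigmoid) • Logistic Regression assumes the conditional Pr(C=1|F) is a sigmoid of a linear function of the features: $\Pr(C=1 \mid F) = \frac{1}{1 + e^{-(w_0 + \sum_{i=1}^{N} w_i f_i)}}$ • F is a real-valued feature vector $(f_1, \ldots, f_N)$ • In the plot, x is the linear function of the features and y is the resulting probability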

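Logistic Regression • $\Pr(C=1 \mid F) = \frac{1}{1 + e^{-z}}$, where $z = w_0 + \sum_{i=1}^{N} w_i f_i$ • This gives us $\Pr(C=0 \mid F) = \frac{e^{-z}}{1 + e^{-z}}$, since Pr(C=1|F) + Pr(C=0|F) = 1.0 • So the odds are $\frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = e^{z}$ • And the log odds are $\ln \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = w_0 + \sum_{i=1}^{N} w_i f_i$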

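LR Decision Rule • Predict class is + if $\ln \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = w_0 + \sum_{i=1}^{N} w_i f_i > \text{threshold}$ • Threshold: 0 if FP and FN costs are equal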
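
The decision rule can be sketched as a comparison of the log odds against a threshold. The function names and the example weights below are hypothetical; `weights[0]` is assumed to play the role of the bias weight w0.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(weights, features):
    """Linear function w0 + sum_i w_i * f_i; weights[0] is the bias w0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))

def predict(weights, features, threshold=0.0):
    """Predict '+' when the log odds exceed the threshold (0 for equal costs)."""
    return "+" if log_odds(weights, features) > threshold else "-"

# Hypothetical weights and feature vector, chosen only for illustration.
weights = [-1.0, 2.0, 0.5]
features = [1.0, 0.0]
z = log_odds(weights, features)      # -1.0 + 2.0*1.0 + 0.5*0.0
p = sigmoid(z)                       # Pr(C=1|F) under the model
print(z, p, predict(weights, features))
```

Raising the threshold above 0 makes the classifier more reluctant to predict +, which is one way to reflect a higher cost for false positives.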

The Decision Boundary of Logistic Regression is a hyperplane (line in 2D) • If $w_0 + \sum_{i=1}^{N} w_i f_i > 0$ predict +, otherwise predict - • [Figure: + and - examples in a 2D feature space, separated by a line]

Encoding Nominal Features • Logistic Regression requires that examples be represented as a vector of real values (as do perceptrons, Neural Nets, SVMs, …) • How can we transform feature vectors (FVs) with nominal features to real values?

Two Possibilities • For a nominal feature with M possible values: • (1) Assign each value an integer between 1 and M • Color = {Red, Green, Blue} becomes Color = {1, 2, 3} • Not a good idea (why?) • (2) Create M binary features where for each example (without missing features) exactly one derived feature has a value of 1 and M-1 features have a value of 0 • Color = {Red, Green, Blue} becomes isRed={0,1}, isGreen={0,1}, isBlue={0,1}
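
The second option (M derived binary features) can be sketched as below. The helper name `one_hot` is hypothetical, and the sketch assumes the feature value is present and is one of the listed possibilities.

```python
def one_hot(value, possible_values):
    """Map a nominal value to M binary features, exactly one of which is 1."""
    return [1.0 if value == candidate else 0.0 for candidate in possible_values]

colors = ["Red", "Green", "Blue"]
encoded = one_hot("Green", colors)   # [isRed, isGreen, isBlue]
print(encoded)
```

Unlike the integer encoding, this does not impose an artificial ordering or spacing on the values (Blue is not "three times" Red), which matters for any model that multiplies features by weights.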

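Representation of Pr(C|F) is the same for LR and Naïve Bayes with nominal features • Naïve Bayes: $\ln \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = \ln \frac{\Pr(C=1)}{\Pr(C=0)} + \sum_{i} \ln \frac{\Pr(f_i \mid C=1)}{\Pr(f_i \mid C=0)}$ (sum over features) $= \ln \frac{\Pr(C=1)}{\Pr(C=0)} + \sum_{i} \sum_{v} I(f_i = v) \ln \frac{\Pr(f_i = v \mid C=1)}{\Pr(f_i = v \mid C=0)}$ (sum over features, sum over values; $I(f_i = v)$ is the derived binary feature) • Logistic Regression: $\ln \frac{\Pr(C=1 \mid F)}{\Pr(C=0 \mid F)} = w_0 + \sum_{i} \sum_{v} w_{i,v} \, I(f_i = v)$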

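What about real valued features? • If $\Pr(f_i \mid C=c)$ is Gaussian with mean $\mu_{i,c}$ and the same variance $\sigma_i^2$ for both classes, then $\ln \frac{\Pr(f_i \mid C=1)}{\Pr(f_i \mid C=0)} = \frac{\mu_{i,1} - \mu_{i,0}}{\sigma_i^2} f_i + \frac{\mu_{i,0}^2 - \mu_{i,1}^2}{2 \sigma_i^2}$ • The log of the ratio of two Gaussians with equal variance is a line in $f_i$ • See text for details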

Learning Task • Given: • Labeled examples {(C,F)} • Do: • Find a good setting of the weights W

Learning Parameters for Logistic Regression • Typically, we want the weights W that maximize the conditional log-likelihood: $W^{*} = \arg\max_{W} \sum_{k} \ln \Pr(C^{k} \mid F^{k}, W)$ (sum over training examples k) • Since each setting of the weights has an associated likelihood, we can view the likelihood as a function of the weights

“Weight Space” • Given feature representations, the weights W are free parameters that define a space • Each point in “weight space” corresponds to an LR model • Associated with each point is a conditional log likelihood • One way to do LR learning is to perform “gradient ascent” in the weight space • [Figure: L(W) plotted against W, with the maximum marked as the goal] • For LR, L(W) is a concave function (any local maximum is a global maximum), so gradient ascent with a suitable step size is not trapped in local maxima

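The Gradient-Ascent Rule • The “gradient”: $\nabla L(W) = \left[ \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_N} \right]$ • The direction of the gradient at W is the direction of fastest increase • The magnitude of the gradient at W is the rate of fastest increase • Since we want to increase L(W), we want to go “up hill” • We’ll take a finite step in weight space: $\Delta W = \eta \, \nabla L(W)$, or $\Delta w_i = \eta \, \frac{\partial L}{\partial w_i}$, where “delta” = change to W and $\eta$ is the step size • [Figure: surface of L over weights W1 and W2]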

“On Line” vs. “Batch” Updates • We can either update W after each example is examined (on-line / stochastic updates) or after the entire training set is examined (batch updates) • On-line is typically much faster • But it is dependent on the order of the examples • Will it converge to the same spot? • For non-concave “objective functions”, on-line and batch processing will typically end up in different places

Computing the LR Gradient (Page 11)

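Logistic Regression Update Rule • $\Delta w_i = \eta \sum_{k} \left( C^{k} - \Pr(C=1 \mid F^{k}, W) \right) f_i^{k}$ • The sum is over training examples k • $C^{k} - \Pr(C=1 \mid F^{k}, W)$ is the prediction error for example k • Note that this is the batch update rule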
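
A minimal sketch of batch gradient ascent using the standard logistic regression gradient (prediction error times feature value, summed over the training set). Everything named here is an assumption for illustration: the function names, the step size `eta`, the fixed number of epochs, the zero initialization, and the tiny one-feature data set. Class labels are taken to be 0 or 1, and a constant feature f0 = 1 carries the bias weight w0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient_ascent(examples, n_features, eta=0.1, epochs=1000):
    """examples: list of (label, features) with label in {0, 1}."""
    weights = [0.0] * (n_features + 1)          # weights[0] is the bias w0
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for label, features in examples:
            x = [1.0] + list(features)          # prepend constant f0 = 1
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = label - p                   # prediction error for this example
            for i, xi in enumerate(x):
                gradient[i] += error * xi       # accumulate over the whole batch
        # One update per pass through the training set (batch rule).
        weights = [w + eta * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical 1-D training set: small values are class 0, large values class 1.
examples = [(0, [0.0]), (0, [1.0]), (1, [3.0]), (1, [4.0])]
weights = batch_gradient_ascent(examples, n_features=1)
probs = [sigmoid(weights[0] + weights[1] * f[0]) for _, f in examples]
print(weights, probs)
```

An on-line variant would apply `eta * error * xi` to the weights immediately after each example instead of accumulating the gradient first.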

Two Decisions Needed for Learning Procedure • What are we trying to optimize? • How are we going to carry out the optimization?

Discriminant Models • Can classify instances into categories • Captures differences between categories • May not describe all features • Example: Decision trees (covered later) • Efficient and simple

Generative Models • Can create complete input feature vectors • Describes distributions of all features • Stochastically creates a plausible vector • Example: Bayes net (from above)

Using Generative Models • Make a model to generate positives • Make a model to generate negatives • Classify a test example based on which is more likely to generate it • The Naïve Bayes ratio does this

Some Properties (Ng & Jordan, NIPS '02) • If the NB assumption holds, the asymptotic accuracy of NB and LR is the same • Otherwise, LR acc > NB acc as the number of training examples increases • NB converges to its asymptotic performance with fewer examples; LR takes more • NB is faster to train

Neural Nets

Perceptrons • [Figure: input units F0=1, f1, f2, …, fN connected by weights w0, w1, w2, …, wN to a single output unit that computes the weighted sum ∑] • The decision rule for perceptrons has the same form as the decision rule for logistic regression and naïve Bayes • So perceptrons are linear separators

Training Perceptrons • Perceptron training rule: $\Delta w_i = \eta \, (C - o) \, f_i$, where C is the correct output and o is the perceptron's output for the example • If the training data is not linearly separable, training may not converge • Delta Rule • A gradient descent rule derived from an objective function that minimizes the squared error of a linear output unit • Almost identical to the LR on-line training rule
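
A sketch of the classic perceptron training rule (adjust each weight by the step size times the output error times the input) on a linearly separable problem. The function names, the step size, the epoch cap, the 0/1 output convention with a threshold at 0, and the choice of Boolean AND as training data are all assumptions made for this illustration.

```python
def perceptron_output(weights, features):
    """Threshold unit: 1 if w0 + sum_i w_i * f_i > 0, else 0."""
    net = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    return 1 if net > 0 else 0

def train_perceptron(examples, n_features, eta=1.0, max_epochs=100):
    weights = [0.0] * (n_features + 1)          # weights[0] is the bias w0
    for _ in range(max_epochs):
        mistakes = 0
        for target, features in examples:
            output = perceptron_output(weights, features)
            if output != target:
                mistakes += 1
                x = [1.0] + list(features)      # constant input f0 = 1
                for i, xi in enumerate(x):
                    weights[i] += eta * (target - output) * xi
        if mistakes == 0:                       # every example classified correctly
            break
    return weights

# Boolean AND is linearly separable, so training stops with zero mistakes.
and_examples = [(0, [0, 0]), (0, [0, 1]), (0, [1, 0]), (1, [1, 1])]
weights = train_perceptron(and_examples, n_features=2)
predictions = [perceptron_output(weights, f) for _, f in and_examples]
print(weights, predictions)
```

On data that is not linearly separable (XOR, for instance) the loop would simply run until `max_epochs` without ever reaching zero mistakes.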

Should you? • [Cartoon caption: "Fenwick here is biding his time waiting for neural networks."]

Concept Learning • Learning systems differ in how they represent concepts • [Figure: Training Examples fed to Backpropagation, C4.5, CART, AQ, and FOIL; one learned representation shown is the rule X ^ Y → Z]

Advantages of Neural Networks • Provide best predictive accuracy for some problems • Being supplanted by SVMs? • Can represent a rich class of concepts • [Figure: example concepts, with regions labeled Positive and negative, and predictions such as "Saturday: 40% chance of rain, Sunday: 25% chance of rain"]

Artificial Neural Networks (ANNs) • Networks • [Figure: a network with input units, hidden units, and output units; labels mark a weight, a recurrent link, and the error at the output]

ANNs (continued) • Individual units • $output_i = F\left( \sum_{j} weight_{i,j} \times output_j \right)$ • Where $F(input_i) = \frac{1}{1 + e^{-(input_i - bias_i)}}$ • [Figure: a single unit with inputs, a bias, and outputs]
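
The computation performed by one unit can be sketched directly from the formula: a weighted sum of the incoming outputs, shifted by the bias and passed through the sigmoid. The function name and the example numbers are hypothetical.

```python
import math

def unit_output(weights, incoming_outputs, bias):
    """output_i = F(sum_j weight_ij * output_j), F(x) = 1 / (1 + e^-(x - bias_i))."""
    net_input = sum(w * o for w, o in zip(weights, incoming_outputs))
    return 1.0 / (1.0 + math.exp(-(net_input - bias)))

# Hypothetical unit with two incoming connections.
balanced = unit_output([1.0, 1.0], [0.5, 0.5], bias=1.0)   # net input equals bias
active = unit_output([1.0, 1.0], [1.0, 1.0], bias=1.0)     # net input above bias
print(balanced, active)
```

The bias acts as the point where the unit's output crosses 0.5, so it plays the role that the threshold plays in a perceptron.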

Perceptron Convergence Theorem (Rosenblatt, 1957) • Perceptron = no hidden units • If a set of examples is learnable, the DELTA rule will eventually find the necessary weights • However, a perceptron can only learn/represent linearly separable datasets.

Linear Separability • Consider a perceptron. Its output is 1 if $W_1 X_1 + W_2 X_2 + \ldots + W_n X_n > \Theta$, and 0 otherwise • In terms of feature space, the boundary is $W_i X_i + W_j X_j = \Theta$, so $X_j = \frac{\Theta - W_i X_i}{W_j} = -\frac{W_i}{W_j} X_i + \frac{\Theta}{W_j}$ [y = mx + b] • Hence, a perceptron can only classify examples if a “line” (hyperplane) can separate them • [Figure: + and - examples plotted on axes Xi and Xj, separated by a line]

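The XOR Problem
Exclusive OR (XOR)
      Input (X1 X2)   Output
a)    0 0             0
b)    0 1             1
c)    1 0             1
d)    1 1             0
Not linearly separable: [Figure: points a, b, c, d plotted on axes X1 and X2, each running from 0 to 1]
A Neural Network Solution: [Figure: network over inputs X1 and X2 with connection weights 1, 1, -1, -1, 1, 1] Let Θ = 0!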
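
One network that solves XOR with threshold units and Θ = 0 is sketched below. The exact wiring of the slide's figure is not recoverable from the transcript, so this topology is an assumption: two hidden units compute X1 - X2 > 0 and X2 - X1 > 0, and the output unit fires if either hidden unit is on. It uses the same set of weights listed on the slide (1, 1, -1, -1, 1, 1).

```python
def step(net_input, theta=0.0):
    """Threshold unit: output 1 if the weighted sum exceeds theta, else 0."""
    return 1 if net_input > theta else 0

def xor_network(x1, x2):
    # Hidden layer (assumed wiring): each unit detects "one input on, the other off".
    h1 = step(1 * x1 + -1 * x2)    # fires only for (1, 0)
    h2 = step(-1 * x1 + 1 * x2)    # fires only for (0, 1)
    # Output unit: fires if either hidden unit fired.
    return step(1 * h1 + 1 * h2)

truth_table = [(x1, x2, xor_network(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
for x1, x2, out in truth_table:
    print(x1, x2, "->", out)
```

The hidden units recode the four inputs so that the output unit faces a linearly separable problem, which is the point made on the hidden-unit slides.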

The Need for Hidden Units • If there is one layer of enough hidden units (possibly $2^N$ for Boolean functions), the input can be recoded (N = number of input units) • This recoding allows any mapping to be represented (Minsky & Papert) • Question: How to provide an error signal to the interior units?

Hidden Units • One View: • Allow a system to create its own internal representation, for which problem solving is easy • [Figure: a perceptron]

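Reformulating XOR • Add a third input X3 = X1 ^ X2 • [Figure: a unit with inputs X1, X2, and X3; or, a network over X1 and X2 in which X3 is computed internally] • So, if a hidden unit can learn to represent X1 ^ X2, the solution is easy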

Backpropagation

Backpropagation • Backpropagation involves a generalization of the delta rule • Rumelhart, Parker, and Le Cun (and Bryson & Ho (1969), Werbos (1974)) independently developed (1985) a technique for determining how to adjust weights of interior (“hidden”) units • Derivation involves partial derivatives of the error signal, $\frac{\partial E}{\partial W_{i,j}}$ (hence, the threshold function must be differentiable)

Weight Space • Given a network layout, the weights and biases are free parameters that define a space • Each point in this Weight Space (w) specifies a network • Associated with each point is an error rate, E, over the training data • BackProp performs gradient descent in weight space

Gradient descent in weight space • [Figure: error surface E over weights W1 and W2, with a step along dE/dw]
