Announcements
• Nearing the end (gasp…)
• Two lectures on miscellaneous topics
• One lecture on evaluation of learning systems
• Topics
  • Topics / confusions in statistics
  • Artificial Neural Nets
• Any specific topics desired for Wednesday?
• Final Exam: Dec. 12, 7:00–8:15 PM (not cumulative)
Religious Wars among Statisticians: What's Real, the Distribution or the Data?
• Frequentist / objectivist / Fisherian / classical statistics
  • Probability = limiting relative frequency as the sample size increases
  • Bayes' Theorem is OK, but not so central
  • Priors and inference must be "objective", rooted in counting
  • Ronald Fisher
• Bayesian
  • Rev. Thomas Bayes (also Laplace)
  • Inference should also reflect beliefs ("subjective priors")
  • Bayes' theorem specifies how to use subjective priors
  • Probabilities capture uncertainties
• They often agree on conclusions
  • Methods often differ
  • Bayesians will claim inferences that frequentists eschew
Are You a Frequentist?
• The world is a distribution
• Data are a random sample; our beliefs are irrelevant
• We can come to know the world via data
  • The distribution is primary
  • The particular data are incidental and not important
  • Different samples have the same expected information (assuming independent samples and the same sample size)
• Hypothesize; observe; evaluate
  • Changing the hypothesis after seeing the data taints the data
  • Examples: baseball, the lottery, the stock scam (testing the wrong hypothesis)
Are You a Bayesian?
• Evidence is primary
• Evidence can be objective (data) or subjective
• Evidence (even as data) can testify for / against different distributions at the same time
• Will my plane crash?
• What is the chance of rain tomorrow (or yesterday)?
• Subjective uncertainty as a statistical distribution
• The two meter problem
Two Envelope Problem
• I have two envelopes
• They each contain money
• I offer you one
• You can't tell from the outside, but I put twice as much in one as in the other…
• What's the analysis for this problem? (One standard setup is sketched below.)
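A sketch of the standard (and deliberately flawed) switching argument the slide invites; the resolution noted afterward is the textbook one, not from the slides. If your envelope holds x and the other is equally likely to hold 2x or x/2, the naive expected value of switching is

\[
E[\text{switch}] = \tfrac{1}{2}(2x) + \tfrac{1}{2}\left(\tfrac{x}{2}\right) = \tfrac{5}{4}x > x,
\]

so it seems you should always switch, and by symmetry switch back again, forever. The flaw: x is not a single fixed quantity across the two cases, and no proper prior over the amounts makes the 1/2–1/2 split valid for every observed x.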
Simpson's Paradox
• Dr. Bayes: an implemented statistical inference system
• There is a dreaded disease with two treatments, A and B
• Dr. Bayes has seen some training data
• We observe this dialog:
  • Should Patient 1 take A or B?
    Dr. B.: Is Patient 1 male or female? Male.
    Dr. B.: Patient 1 should take A.
  • Should Patient 2 take A or B?
    Dr. B.: Is Patient 2 male or female? Female.
    Dr. B.: Patient 2 should take A.
  • Should Patient 3 take A or B?
    Dr. B.: Is Patient 3 male or female? Unknown.
    Dr. B.: Patient 3 should take B.
• Do we look for a bug in Dr. Bayes?
Dr. Bayes
• Three Boolean random variables: Gender (M/F), Treatment (A/B), Improvement (Y/N)
• 100 patients:
  • 50 M, 50 F
  • 50 A, 50 B
• Want the probability of improvement given what we know: gender and treatment
• P(Y|M,A) = 0.625   P(Y|M,B) = 0.5
• P(Y|F,A) = 0.9     P(Y|F,B) = 0.8
• P(Y|A) = 0.68      P(Y|B) = 0.74
• (A reconstruction of the underlying counts appears below.)
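A minimal sketch reconstructing Dr. Bayes's training data. The per-cell counts are not given on the slide, but they are forced by its probabilities and marginals (50 M / 50 F, 50 A / 50 B), so they are derived rather than invented:

```python
# counts[(gender, treatment)] = (improved, total); derived from the slide's
# numbers: e.g. P(Y|A) = 0.68 over 50 A-patients forces 40 of them to be male.
counts = {
    ("M", "A"): (25, 40),   # P(Y|M,A) = 25/40 = 0.625
    ("M", "B"): (5, 10),    # P(Y|M,B) = 5/10  = 0.5
    ("F", "A"): (9, 10),    # P(Y|F,A) = 9/10  = 0.9
    ("F", "B"): (32, 40),   # P(Y|F,B) = 32/40 = 0.8
}

def p_improve(treatment, gender=None):
    """P(improvement | treatment), optionally conditioning on gender too."""
    cells = [(g, t) for (g, t) in counts
             if t == treatment and (gender is None or g == gender)]
    improved = sum(counts[c][0] for c in cells)
    total = sum(counts[c][1] for c in cells)
    return improved / total

for g in ("M", "F"):
    print(f"P(Y|{g},A) = {p_improve('A', g):.3f}   "
          f"P(Y|{g},B) = {p_improve('B', g):.3f}")
print(f"P(Y|A) = {p_improve('A'):.3f}   P(Y|B) = {p_improve('B'):.3f}")
```

Within each gender A beats B, yet aggregated B beats A, which is exactly Dr. Bayes's advice to Patient 3. The cause is confounding: 40 of the 50 A-patients are male, and males improve less often under either treatment.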
Simpson's Paradox
• Real-life examples
  • Quality of health care in hospitals
  • Gender discrimination in engineering-college admissions
… Perceptron to ANN
• Very limited expressiveness
  • Can't do XOR on two Booleans
• If only we could stack them…
• What functions could we represent? (See the XOR sketch below.)
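A minimal sketch (mine, not the lecture's) of what stacking buys: two hidden threshold units computing OR and NAND feed an output unit computing AND, which yields XOR. The weights are hand-picked for illustration, not learned:

```python
import itertools

def step(z):
    """Threshold activation: fires iff the weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: one unit computes OR, the other NAND.
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_nand = step(-x1 - x2 + 1.5)    # fires unless both inputs are 1
    # Output unit: AND of the two hidden units gives XOR.
    return step(h_or + h_nand - 1.5)

for x1, x2 in itertools.product((0, 1), repeat=2):
    print(x1, x2, "->", xor_net(x1, x2))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```

No single threshold unit can separate XOR's positive and negative examples with one hyperplane; one hidden layer already can.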
What Can Multi-Layer Perceptrons (ANNs) Represent?
[Figure: multi-layer network diagrams]
• What if we change the topology?
• More levels = more expressiveness?
Can We Still Learn Efficiently?
(Is there a generalized perceptron convergence theorem?)
[Figure: multi-layer network diagram]
• Now any assignment of labels (any function) can be represented
• Is this a good thing?
No*
• Minsky and Papert suspected as much in Perceptrons (1969)
• This largely killed off research interest
• Minsky and Papert were right
* But for a slightly modified linear device, the answer becomes yes, quite easily
Why "No"?
[Figure: step vs. sigmoid activation functions]
• Threshold (step) function: discontinuous, non-differentiable
• Sigmoid function: differentiable
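Standard definitions of the two activations in the figure, with the derivative identity the next slide uses (the derivation is textbook, not from the slides):

\[
\text{step}(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}
\qquad
g(z) = \frac{1}{1+e^{-z}},
\qquad
g'(z) = \frac{e^{-z}}{(1+e^{-z})^{2}} = g(z)\bigl(1-g(z)\bigr).
\]

The step function's derivative is zero wherever it exists, so no gradient signal can flow back through hidden units; the sigmoid's everywhere-nonzero derivative is what makes gradient descent through hidden layers possible.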
Back-Propagation
• Hinton, Rumelhart, …
• Common sigmoid: g(x) = (1 + e^{-x})^{-1}
  • Then g' = g(1 - g)
  • This is the missing factor in our original gradient weight-update expression
• Now internal gradients exist (and are easily calculated)
• Standard gradient descent works quite well (see the sketch after this slide)
  • Can get caught in local extrema
    • Boltzmann machine
    • Add a hidden node
    • Random restarts
• Need to limit hidden nodes. Why?
  • Suppose we learn to 100% accuracy on the training data
• Interpreting hidden nodes / extracting rules from ANNs
• New resurgence of interest from statistical learning; the word "neural" is largely gone
• Think of it as a nonlinear multidimensional optimization device
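A minimal back-propagation sketch, assuming a squared-error loss and a 2-3-1 sigmoid network trained on XOR; the architecture, learning rate, and seed are illustrative choices, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    """Sigmoid: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# XOR training set
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# 2-3-1 network
W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)
lr = 1.0

for _ in range(10000):
    # Forward pass
    h = g(X @ W1 + b1)       # hidden activations
    out = g(h @ W2 + b2)     # network output

    # Backward pass: each delta carries the g' = g(1 - g) factor,
    # the "missing factor" the slide mentions.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient-descent updates
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(g(g(X @ W1 + b1) @ W2 + b2).ravel().round(2))
# Typically converges to roughly [0, 1, 1, 0]; an unlucky initialization can
# stall in a local minimum, which is why the slide lists random restarts.
```

This is where the differentiability of the sigmoid pays off: the chain rule gives every internal weight a well-defined gradient, something the step function never allowed.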