Final Exam (not cumulative)
• Next Tuesday
• Dec. 12, 7-8:15 PM
• 1105 SC (This Room)
Topics Since Midterm
• Statistical Learning: Parameterized Models, Generative Models, Discriminative Models
• Bayes: Bayes Rule, Bayes Networks, Naïve Bayes
• Estimation: Likelihood function, Maximum Likelihood, Maximum A Posteriori, Conjugate Priors
• K-means Clustering, Expectation Maximization, Jordan's talk
• Logistic Regression
• Information Theory: Conditional Information, Mutual Information, KL Divergence
• Ensembles: Bayes Optimal Classifier, Bagging, Boosting, Weak Learning, Margin Distribution
• Frequentist / Bayesian Statistics
• ANNs: Backpropagation
• Nonparametric classifiers: Nearest Neighbor, k Nearest Neighbor, Kernel Smoothing, Bias / Variance
• Dimensionality Reduction: LDA, PCA, MDS, FA, Local Linear Embedding, Neural Network Derived Features
• Model Selection: Fit / Regularization, AIC, BIC, Kolmogorov Complexity, MDL
• K-fold Cross Validation, Leave-one-out Cross Validation
• Learning Curve, ROC Curve
Dimensionality Reduction
• Many approaches
• Supervised
   • Feature selection
   • Linear Discriminant Analysis (LDA/Fisher; we've seen this already)
   • …
• Unsupervised - what does this mean?
   • Principal Component Analysis (PCA)
   • Multidimensional Scaling
   • Factor Analysis
   • …
• Also some nonlinear dimensionality reduction
   • Neural Net application
Linear Discriminant Analysis (LDA)
• Introduce a "new" feature
   • A linear combination of the old features
• The new feature should maximize the distance between classes
• And simultaneously minimize the variance within classes:
   • Maximize |μP − μN|² / (σP² + σN²)
• Project points into the subspace and repeat (see the sketch below)
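A minimal sketch of this criterion in code, assuming NumPy and two hypothetical class matrices Xp and Xn (one example per row); the closed-form maximizer of the Fisher ratio is w ∝ Sw⁻¹(μP − μN), where Sw is the within-class scatter matrix:

```python
import numpy as np

def fisher_direction(Xp, Xn):
    """Direction w maximizing |mu_P - mu_N|^2 / (sigma_P^2 + sigma_N^2)
    for the projected 1-D data (two-class Fisher LDA)."""
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices
    Sp = (Xp - mu_p).T @ (Xp - mu_p)
    Sn = (Xn - mu_n).T @ (Xn - mu_n)
    Sw = Sp + Sn
    # Closed-form solution: w proportional to Sw^{-1} (mu_P - mu_N)
    w = np.linalg.solve(Sw, mu_p - mu_n)
    return w / np.linalg.norm(w)

# Hypothetical toy data: two Gaussian clusters in 2-D
rng = np.random.default_rng(0)
Xp = rng.normal([2.0, 0.0], 1.0, size=(100, 2))
Xn = rng.normal([-2.0, 0.0], 1.0, size=(100, 2))
w = fisher_direction(Xp, Xn)
print("LDA direction:", w)
print("Projected class means:", (Xp @ w).mean(), (Xn @ w).mean())
```

Projecting onto w gives the "new" feature; repeating in the orthogonal complement yields further discriminant directions.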
Unsupervised – do not use the class label
• Principal Component Analysis (PCA)
   • Like LDA, but maximize the variance in X
• Factor Analysis
   • The observed raw features are imagined to be derived
   • Introduce K latent features that "cause" the observed features
   • Estimate them
• Multidimensional Scaling
   • Suppose we know the distances between examples
   • Find a map in K dimensions that places the examples so as to reproduce the desired distances
Principal Component Analysis (PCA)
[Figure: data in the X–Y plane with the first (1) and second (2) principal directions]
• Find the direction of maximum variance
• Project
• Repeat
• This is an eigenvalue problem; we are looking for eigenvectors
Principal Component Analysis
• The eigenvector of the largest eigenvalue is the direction of greatest variance
• The second largest is the direction of greatest remaining variance, etc.
• They are orthogonal and form the new (linear) features
• Use the eigenvectors of the largest k eigenvalues, or keep eigenvalues down to some relative size
• Previous LDA example:
   • First LDA component
   • First PCA component
   • Is this good?
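A minimal NumPy sketch of this eigenvalue view of PCA (the data matrix X and the choice k = 2 below are hypothetical):

```python
import numpy as np

def pca(X, k):
    """Top-k principal directions (eigenvectors of the covariance matrix,
    ordered by decreasing eigenvalue) and the projected data."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]        # largest variance first
    W = eigvecs[:, order[:k]]                # orthogonal new linear features
    return W, Xc @ W

# Hypothetical example: reduce 5-D data to its 2 leading components
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
W, Z = pca(X, k=2)
print(W.shape, Z.shape)   # (5, 2) (200, 2)
```

In practice one keeps as many components as needed to cover most of the total variance, i.e. until the remaining eigenvalues are small relative to the largest.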
Multidimensional Scaling (MDS)
• Given the distances between examples
• Position the examples in a lower-dimensional metric space faithful to these distances
• Classic example: flying times between pairs of cities
   • Works particularly well here. Why?
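A sketch of classical (metric) MDS, assuming a symmetric distance matrix D; the toy 4×4 matrix below is made up for illustration. Double-centering the squared distances gives a Gram matrix whose top eigenvectors provide the low-dimensional coordinates:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in k dimensions so that pairwise Euclidean
    distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J             # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]   # keep the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale        # coordinates in k dimensions

# Hypothetical symmetric distances between 4 "cities"
D = np.array([[0.0, 3.0, 4.0, 5.0],
              [3.0, 0.0, 5.0, 4.0],
              [4.0, 5.0, 0.0, 3.0],
              [5.0, 4.0, 3.0, 0.0]])
coords = classical_mds(D, k=2)
print(np.round(coords, 2))
```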
Factor Analysis (FA)
• Assume the observed features are really manifestations of some number k of more primitive factors
• These factors are unobservable
• Assume they are uncorrelated linear combinations of the original features
• In matrix form: X = μ + LF + ε, where
   • X is an example
   • μ is the mean of each feature
   • L is a matrix of factor loadings
   • F are the factors
   • ε is a noise term
• Similar to PCA except for ε, which can aid in post hoc interpretation of the factors
[Figure: observed Features, latent Factors, and Class]
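A brief sketch using scikit-learn's FactorAnalysis; the data here are synthetic, with 6 observed features generated from 2 hypothetical latent factors plus noise, mirroring X = μ + LF + ε:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: 6 observed features driven by 2 latent factors plus noise
rng = np.random.default_rng(0)
F = rng.normal(size=(300, 2))                   # latent factors (unobservable)
L = rng.normal(size=(2, 6))                     # true factor loadings
X = F @ L + 0.3 * rng.normal(size=(300, 6))     # X = mu + L F + eps (mu = 0 here)

fa = FactorAnalysis(n_components=2)
F_hat = fa.fit_transform(X)        # estimated factors for each example
print(fa.components_.shape)        # estimated loading matrix: (2, 6)
print(fa.noise_variance_)          # per-feature estimate of the noise term eps
```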
Non-linear Dimensionality Reduction
• Linear is limited: many classifiers easily handle linear transformations anyway
• Potential big gains in nonlinear transformations
   • But a rich space with huge potential for overfitting
• Kernel PCA – PCA with kernel functions
• Local Linear Embedding (LLE) – currently very popular
   • Note that distance is measured along the lower-dimensional manifold; assumptions: smoothness, density
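A short sketch of both nonlinear methods using scikit-learn on a synthetic "swiss roll"; all dataset and parameter choices here (e.g. gamma, n_neighbors) are illustrative rather than prescribed by the lecture:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

# 3-D "swiss roll": the points lie on a curled-up 2-D manifold
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Kernel PCA: ordinary PCA in the implicit feature space of an RBF kernel
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05)
X_kpca = kpca.fit_transform(X)

# LLE: preserves local neighborhoods, so distance is effectively measured
# along the manifold (requires smoothness and enough sampling density)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_lle = lle.fit_transform(X)

print(X_kpca.shape, X_lle.shape)   # (1000, 2) (1000, 2)
```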
Using Neural Networks
[Figure: Features → Hidden Layer → Features]
• Use hidden layers to learn lower-dimensional features
• Couple the example features to both the input and the output
• The network learns to reproduce the input features
• The hidden layer is floating
• Limit the number of hidden units
• Hidden nodes learn the best nonlinear transformations that reproduce the input features
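A minimal NumPy sketch of this idea: a one-hidden-layer network trained by backpropagation to reproduce its input through a 2-unit bottleneck. The data, layer sizes, and learning rate below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 5 observed features that really live on a 2-D nonlinear surface
Z_true = rng.normal(size=(500, 2))
X = np.tanh(Z_true @ rng.normal(size=(2, 5)))

n_in, n_hidden, lr = X.shape[1], 2, 0.05        # 2-unit bottleneck
W1 = 0.1 * rng.normal(size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.normal(size=(n_hidden, n_in)); b2 = np.zeros(n_in)

for epoch in range(2000):
    H = np.tanh(X @ W1 + b1)            # hidden layer = learned low-D features
    X_hat = H @ W2 + b2                 # output layer tries to reproduce the input
    err = X_hat - X                     # reconstruction error
    # Backpropagate the squared reconstruction error
    gW2 = H.T @ err / len(X);  gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)  # tanh derivative
    gW1 = X.T @ dH / len(X);   gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

codes = np.tanh(X @ W1 + b1)            # the learned nonlinear 2-D features
print("reconstruction MSE:", np.mean((codes @ W2 + b2 - X) ** 2))
print("codes shape:", codes.shape)      # (500, 2)
```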
Model Selection
• Comparing error rate alone on models of different complexity is not fair
• Consider the likelihood function
   • It will tend to prefer a more complex model
   • Why? Overfitting
• Need regularization
   • A penalty to compensate for complexity
• Richer model families are more likely to find a good fit by accident
Information Criteria
• L is the likelihood; k is the number of parameters; N is the number of examples
• Prefer the higher-scoring models
• Akaike Information Criterion (AIC)
   • AIC = ln(L) − k   (fit − complexity penalty)
• Bayesian Information Criterion (BIC)
   • BIC = ln(L) − k ln(N)/2
• Minimum Description Length (MDL) is the same as BIC
• Kolmogorov complexity
   • Learning = Data Compression
• Compression bounds; bound test accuracy from training alone
   • Luckiness Framework; PAC-Bayes
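A sketch comparing polynomial regression fits with these criteria; the cubic-plus-noise data are synthetic, the Gaussian log-likelihood plugs in the maximum-likelihood noise variance, and the "higher is better" forms above are used (BIC penalty k ln(N)/2):

```python
import numpy as np

def gaussian_loglik(y, y_hat):
    """Log-likelihood of a regression fit with Gaussian noise,
    plugging in the ML noise variance RSS/N."""
    N = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)
    return -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)

# Synthetic data: cubic signal plus noise; compare polynomial degrees
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = 1 + 2 * x - 3 * x ** 3 + 0.2 * rng.normal(size=60)

for degree in range(1, 8):
    coeffs = np.polyfit(x, y, degree)
    logL = gaussian_loglik(y, np.polyval(coeffs, x))
    k = degree + 2                         # (degree+1) coefficients + noise variance
    aic = logL - k                         # fit minus complexity penalty
    bic = logL - 0.5 * k * np.log(len(y))  # heavier penalty as N grows
    print(f"degree {degree}: logL={logL:7.1f}  AIC={aic:7.1f}  BIC={bic:7.1f}")
```

The raw likelihood keeps increasing with the degree, while AIC and BIC typically peak near the true complexity.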
Cross Validation
• Good for
   • Setting parameters
   • Choosing models
   • Evaluating a learner
• A data resampling technique
   • Different partition sets of the training data are somewhat independent
   • Overlap introduces some bias, which can be estimated if necessary
• In statistics: bootstrap, jackknife
Computing a Learning Curve
• Classifier performance as a function of the amount of training data
• Desire confidences
   • Perhaps error bars: usually the 95% standard error of the mean
• Need multiple runs, but have limited data
   • Each point is generated by cross validation
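One way to generate such a curve with cross-validated error bars, sketched here with scikit-learn's learning_curve utility on synthetic data (the classifier and all parameter choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification task; each curve point averages 10 CV folds
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=10)

mean = test_scores.mean(axis=1)
sem = test_scores.std(axis=1, ddof=1) / np.sqrt(test_scores.shape[1])
for n, m, s in zip(train_sizes, mean, sem):
    # Mean accuracy with an approximate 95% error bar (1.96 * SEM)
    print(f"n = {int(n):4d}   accuracy = {m:.3f} +/- {1.96 * s:.3f}")
```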
Cross Validation
• k-fold cross validation
   • Partition the data D into k equal disjoint sets: d1, d2, …, dk
   • For i = 1 to k: train on D − di, test on di
• Generates a population of results
   • Can compute average performance
   • Confidence measures
• Most popular / most standard is k = 10
• When k = |D| it is called "leave-one-out cross validation"
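A minimal sketch of the loop above; train_fn and test_fn are hypothetical callables that wrap whatever learner is being evaluated:

```python
import numpy as np

def k_fold_cv(X, y, train_fn, test_fn, k=10, seed=0):
    """Partition the data into k disjoint folds d_1..d_k; for each i,
    train on D - d_i and test on d_i. Returns the mean score and its
    standard error over the k folds (a simple confidence measure)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)      # d_1, ..., d_k
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])              # train on D - d_i
        scores.append(test_fn(model, X[test_idx], y[test_idx]))   # test on d_i
    scores = np.array(scores)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(k)
```

Setting k = |D| (one example per fold) gives leave-one-out cross validation.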
Cross Validation
• Every example gets used as a test example exactly once
• Every example gets used as a training example k − 1 times
• Test sets are independent, but training sets overlap significantly
• The hypotheses are generated using (k − 1)/k of the training data
• With resampling, a "paired" statistical design can be used to compare two or more learners
   • Paired tests are statistically stronger since outcome variations due to the test set are identical in each fold
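A brief sketch of such a paired comparison using SciPy's paired t-test; the per-fold accuracies below are invented, and the point is simply that both learners are scored on the same k = 10 folds:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies for two learners on the SAME 10 folds
acc_a = np.array([0.81, 0.84, 0.79, 0.86, 0.82, 0.85, 0.80, 0.83, 0.84, 0.82])
acc_b = np.array([0.78, 0.82, 0.77, 0.84, 0.80, 0.83, 0.79, 0.80, 0.82, 0.80])

# Pairing by fold cancels the variation due to each test set, so the test
# focuses on the per-fold differences between the two learners
t_stat, p_value = ttest_rel(acc_a, acc_b)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```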
ROC Curve
• Often a classifier can be adjusted to have more false positives or more false negatives
   • This can be used to hide weaknesses of the classifier
• Receiver Operating Characteristic curve
   • Probability of a true positive vs. probability of a false positive as the sensitivity is increased
• [Figure] The − (left peak) and + (right peak) populations overlap; the classification boundary is the vertical line; the relevant areas are labeled:
   • TP: true positives = Red + Purple
   • FP: false positives = Pink + Purple
   • TN: true negatives = Dark Blue + Light Blue
   • FN: false negatives = Light Blue
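A sketch of how the curve is traced out by sweeping the classification boundary across two overlapping score populations (the Gaussian scores below are synthetic):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold over the classifier scores and return
    (false positive rate, true positive rate) pairs -- the ROC curve."""
    order = np.argsort(-scores)             # most confident "+" predictions first
    labels = labels[order]
    tp = np.cumsum(labels == 1)             # true positives at each cutoff
    fp = np.cumsum(labels == 0)             # false positives at each cutoff
    tpr = tp / max((labels == 1).sum(), 1)  # P(predict + | actually +)
    fpr = fp / max((labels == 0).sum(), 1)  # P(predict + | actually -)
    return fpr, tpr

# Synthetic overlapping populations: "+" scores tend to be higher than "-"
rng = np.random.default_rng(0)
labels = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(-1.0, 1.0, 100)])

fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid-rule area
print(f"AUC = {auc:.3f}")
```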