This text discusses the importance of experimentation protocols and performance measures in machine learning, including the distinction between training and test sets, the use of k-fold evaluation, and the leave-one-out method. It also covers performance evaluation measures such as precision, recall, and confusion matrices, as well as learning curves and the prevention of overfitting. The text explores ensemble methods, learning methods, memory-based learning, and unsupervised learning techniques.
Learning from observations (b)
KI1 / L. Schomaker - 2007

• How good is a machine learner?
• Experimentation protocols
• Performance measures
• Academic benchmarks vs real life
Experimentation protocols

• Fooling yourself: training a decision tree on 100 example instances from Earth and then sending the robot to Mars
• Keep the training set / test set distinction
• Both must be of sufficient size:
  • a large training set for a reliable 'h' (coefficients etc.)
  • a large test set for a reliable prediction of real-life performance
Experimentation protocols

• One training set / one test set over a four-year PhD project: still fooling yourself!
• Solution:
  • training set
  • test set
  • final evaluation set with real-life data
• k-Fold evaluation: draw k subsets from a large data base and measure the standard deviation of performance over the k experiments
Experimentation protocols

• What to do if you don't have enough data?
• Solution, leave-one-out:
  • use N-1 samples for training
  • use the Nth (left-out) sample for testing
  • repeat for all samples
  • compute the average performance
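A minimal sketch of k-fold evaluation, added here (it is not on the original slides): split the data into k subsets, let each subset serve once as the test set, and report the mean and standard deviation of the test performance; choosing k equal to the number of samples gives leave-one-out. The fit/predict learner interface and the make_learner factory are hypothetical placeholders.

    import random
    import statistics

    def k_fold_scores(samples, labels, k, make_learner):
        """Per-fold accuracies; k = len(samples) gives leave-one-out."""
        indices = list(range(len(samples)))
        random.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]            # k roughly equal subsets
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            learner = make_learner()                         # hypothetical: a fresh, untrained learner
            learner.fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
            predictions = learner.predict([samples[i] for i in test_idx])
            correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
            scores.append(100.0 * correct / len(test_idx))   # P [% OK] on this fold
        return scores

    # scores = k_fold_scores(X, y, k=10, make_learner=MyDecisionTree)
    # print(statistics.mean(scores), statistics.stdev(scores))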
Performance

• Example: the percentage of correctly classified samples (P)
• Ptrain: performance on the training set
• Ptest: performance on the test set
• Preal ≈ Ptest: the test-set score is the estimate of real-life performance
Performance, two-class

• Precision = 100 * #correct_hits / #says_Yes [%]
• Recall = 100 * #correct_hits / #is_Yes [%]
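A small sketch, added here, of these two measures computed from raw counts; the argument names follow the slide's notation.

    def precision_recall(correct_hits, says_yes, is_yes):
        """Precision and recall in percent, from the counts used above."""
        precision = 100.0 * correct_hits / says_yes   # of all 'Yes' answers, how many were right
        recall = 100.0 * correct_hits / is_yes        # of all true 'Yes' cases, how many were found
        return precision, recall

    # Example: 40 correct hits, the system said 'Yes' 50 times, 80 true 'Yes' cases in the test set
    # precision_recall(40, 50, 80) -> (80.0, 50.0)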
Performance, multi-class

• Confusion matrix: each row counts how often samples of one true class were assigned to each of the possible answer classes
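A minimal sketch (added, not from the slides) of building such a matrix from true and predicted labels:

    from collections import Counter

    def confusion_matrix(true_labels, predicted_labels, classes):
        """matrix[t][p] = number of samples of true class t classified as class p."""
        counts = Counter(zip(true_labels, predicted_labels))
        return [[counts[(t, p)] for p in classes] for t in classes]

    # confusion_matrix(['a', 'a', 'b'], ['a', 'b', 'b'], classes=['a', 'b'])
    # -> [[1, 1], [0, 1]]   (diagonal = correct; one 'a' was confused with 'b')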
Rankings / hit lists

• Given a query Q, the system returns a hit list of matches M: an ordered set, with instances i in decreasing likelihood of correctness
• Precision: the proportion of correct instances in the hit list M
• Recall: the proportion of correct instances retrieved, out of the total number of target samples in the database
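The same two measures for a ranked hit list, as a small added sketch; 'relevant' stands for the (hypothetical) set of correct targets for the query.

    def hitlist_precision_recall(hitlist, relevant):
        """Precision and recall (in %) of a ranked hit list against the set of correct targets."""
        hits = sum(1 for item in hitlist if item in relevant)
        precision = 100.0 * hits / len(hitlist)     # correct instances in the hit list
        recall = 100.0 * hits / len(relevant)       # correct instances found, of all targets in the database
        return precision, recall

    # hitlist_precision_recall(['d3', 'd7', 'd1'], relevant={'d3', 'd1', 'd9', 'd4'})
    # -> approximately (66.7, 50.0)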
Function approximation

• For, e.g., regression models: learning an 'analog' output
• Example: target function t(x), obtained output function o(x)
• For performance evaluation, compute the root-mean-square (RMS) error:
  RMS = sqrt( sum_i (o(x_i) - t(x_i))^2 / N )
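A one-function sketch, added here, of that RMS error:

    import math

    def rms_error(outputs, targets):
        """Root-mean-square error between obtained outputs o(x) and target values t(x)."""
        n = len(targets)
        return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n)

    # rms_error([1.1, 1.9, 3.2], [1.0, 2.0, 3.0]) -> approximately 0.141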
Learning curves

(figure: P [% OK] plotted against #epochs, i.e. presentations of the training set; performance on the training set keeps climbing towards 100%, while performance on the test set flattens and then falls off: no generalization, overfit. Stop training at that point.)
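A sketch, added here, of the stopping rule the curves suggest: train epoch by epoch and stop once test-set performance stops improving. The train_one_epoch and evaluate methods are hypothetical placeholders for whatever learner is used.

    def train_with_early_stopping(learner, train_set, test_set, max_epochs=100, patience=5):
        """Stop training when test-set performance has not improved for `patience` epochs."""
        best_p_test = 0.0
        epochs_without_gain = 0
        for epoch in range(max_epochs):
            learner.train_one_epoch(train_set)     # hypothetical: one presentation of the training set
            p_test = learner.evaluate(test_set)    # hypothetical: returns P [% OK] on the test set
            if p_test > best_p_test:
                best_p_test = p_test
                epochs_without_gain = 0
            else:
                epochs_without_gain += 1
                if epochs_without_gain >= patience:
                    break                          # Stop! further epochs only overfit the training set
        return best_p_test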
Overfitting

• The learner learns the training set, even perfectly, like a lookup table (LUT):
  • memorizing training instances
  • without correctly handling unseen data
• Usual cause: more free parameters in the learner than the data can constrain
Preventing Overfit

• For good generalization, the number of training examples must be much larger than the number of attributes (features):
  Nsamples / Nattr >> 1
Preventing Overfit

• For good generalization, also: Nsamples >> Ncoefficients
  e.g. solving a linear equation: 2 coefficients need at least 2 data points in 2D
• Coefficients: model parameters, weights etc.
Preventing Overfit

• For good generalization: Ndatavalues >> Ncoefficients
• Coefficients: model parameters, weights etc.
• Ndatavalues = Nsamples * Nattributes
• e.g. use the ratio Ndatavalues / Ncoefficients to compare systems
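A trivial added sketch of that comparison ratio; larger values mean the model is better constrained by the data:

    def data_to_coefficient_ratio(n_samples, n_attributes, n_coefficients):
        """Ndatavalues / Ncoefficients, for comparing how well-constrained different systems are."""
        return (n_samples * n_attributes) / n_coefficients

    # 1000 samples with 20 attributes each, against a model with 50 coefficients:
    # data_to_coefficient_ratio(1000, 20, 50) -> 400.0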
Example: machine-print OCR

• Very accurate today, but:
  • needs 5000 examples of each character
  • printed on ink-jet printers, laser printers, matrix printers and fax copies
  • of many brands of printers
  • on many paper types
  • for 1 font & point size!
Ensemble methods

• Boosting (a sketch follows below):
  • train a learner h[m]
  • weigh each of the instances
  • weigh the method m
  • train a new learner h[m+1]
  • perform majority voting on the ensemble opinions
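A heavily reduced sketch of that loop, added here and not taken from the slides: misclassified instances get larger weights so that the next learner concentrates on them, and each learner is weighted for the final vote. make_weak_learner and its weighted fit are hypothetical; the update rules follow the general AdaBoost idea for labels in {-1, +1}.

    import math

    def boost(samples, labels, make_weak_learner, rounds):
        """Simplified boosting for labels in {-1, +1}: returns (learner, vote weight) pairs."""
        n = len(samples)
        w = [1.0 / n] * n                              # instance weights, initially uniform
        ensemble = []
        for _ in range(rounds):
            h = make_weak_learner()                    # hypothetical: fit/predict, accepting instance weights
            h.fit(samples, labels, weights=w)
            pred = h.predict(samples)
            err = sum(wi for wi, p, y in zip(w, pred, labels) if p != y)
            if err == 0.0 or err >= 0.5:
                break
            alpha = 0.5 * math.log((1 - err) / err)    # weigh the method: better learners vote louder
            ensemble.append((h, alpha))
            w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, labels, pred)]
            total = sum(w)
            w = [wi / total for wi in w]               # re-weigh instances: mistakes become more important
        return ensemble

    def vote(ensemble, x):
        """Weighted majority vote of the ensemble on a single instance x."""
        s = sum(alpha * h.predict([x])[0] for h, alpha in ensemble)
        return 1 if s >= 0 else -1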
The advantage of democracy: partly intelligent, independent deciders
Learning methods

• Gradient descent, parameter finding (multi-layer perceptron, regression); a small sketch follows below
• Expectation Maximization (smart Monte Carlo search for the best model, given the data)
• Knowledge-based, symbolic learning (Version Spaces)
• Reinforcement learning
• Bayesian learning
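As an illustration of only the first item in the list, a minimal added sketch of gradient descent on a one-parameter squared-error cost; the cost function is a made-up example:

    def gradient_descent(gradient, w0, learning_rate=0.1, steps=100):
        """Follow the negative gradient of a cost function to find a parameter value."""
        w = w0
        for _ in range(steps):
            w -= learning_rate * gradient(w)
        return w

    # Example cost: E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3); the minimum is at w = 3.
    # gradient_descent(lambda w: 2 * (w - 3), w0=0.0) -> approximately 3.0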
Memory-based 'learning'

• Lookup table (LUT)
• Nearest neighbour: argmin(dist)
• k-Nearest neighbour: majority(N_argmin(dist, k))
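A compact added sketch of the k-nearest-neighbour rule: store the training set and classify a new point by majority vote among the k stored points at the smallest distance.

    from collections import Counter
    import math

    def knn_classify(train_points, train_labels, query, k=3):
        """Label of the majority among the k training points nearest to the query."""
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        nearest = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))[:k]
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

    # knn_classify([(0, 0), (1, 0), (5, 5)], ['A', 'A', 'B'], query=(0.5, 0.2), k=3) -> 'A'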
Unsupervised learning

• K-means clustering
• Kohonen self-organizing maps (SOM)
• Hierarchical clustering
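For the first item only, a very small added sketch of K-means on one-dimensional data: assign each point to its nearest mean, recompute the means, repeat.

    def k_means_1d(points, means, iterations=20):
        """Very small K-means for 1-D data: returns the refined cluster means."""
        for _ in range(iterations):
            clusters = [[] for _ in means]
            for p in points:
                nearest = min(range(len(means)), key=lambda j: abs(p - means[j]))
                clusters[nearest].append(p)                      # assignment step
            means = [sum(c) / len(c) if c else means[j]          # update step
                     for j, c in enumerate(clusters)]
        return means

    # k_means_1d([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], means=[0.0, 6.0]) -> roughly [1.0, 5.07]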
Summary (1)

• Learning is needed for unknown environments (and/or lazy designers)
• Learning agent = performance element + learning element
• The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation
Summary (2)

• For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples
• Decision tree learning using information gain: entropy-based
• Learning performance = prediction accuracy measured on test set(s)