Machine Learning Georg Pölzlbauer December 11, 2006
Outline • Exercises • Data Preparation • Decision Trees • Model Selection • Random Forests • Support Vector Machines
Exercises • Groups of 2 or 3 students • UCI ML Repository: pick 3 data sets with different characteristics (e.g. number of samples, number of dimensions, number of classes) • Estimate the classification error with 3 classifiers of your choice; compare the results • Estimate appropriate parameters for these classifiers • Implement in Matlab, R, WEKA, YALE, or KNIME
Exercises: Software • Matlab • YALE http://rapid-i.com/ • WEKA http://www.cs.waikato.ac.nz/ml/weka/ • KNIME http://www.knime.org/ • R http://www.r-project.org/
Exercises: Software • WEKA: recommended; easy to use, easy to learn, no programming required • KNIME, YALE: also easy to use • R: the most advanced and powerful option; do not use it unless you already know R well! • Matlab: not recommended; requires installing additional packages from the internet
Exercises: Written Report • The report should be 5-10 pages • Discuss the characteristics of the data sets (e.g. handling of missing values, scaling) • Summarize the classifiers used (one paragraph each) • Discuss the experimental results (tables, figures) • Do not include code in the report
Exercises: How to proceed • It is not necessary to implement anything yourself; rely on existing libraries, modules etc. • UCI ML Repository: http://www.ics.uci.edu/~mlearn/MLSummary.html • Import the data file, scale the data, apply model selection, and write down any problems/findings
Grading • No written/oral exam • Submission of the report at the end of January • Approx. 15 minutes discussion of results and code (individually for each group) • Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices
Questions? • Questions regarding theory: • poelzlbauer@ifs.tuwien.ac.at • musliu@dbai.tuwien.ac.at • Questions regarding R, Weka, …: • Forum
Machine Learning: Setting • (diagram: training data is used to train an ML model)
Data Preparation • Running example: adult census data • Data in table format (data matrix) • Missing values • Categorical data • Quantitative (continuous) data with different scales
Categorical variables • Non-numeric variables with a finite number of levels • E.g. "red", "blue", "green" • Some ML algorithms can only handle numeric variables • Solution: 1-to-N coding
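As an illustration, a minimal sketch of 1-to-N (one-hot) coding in R; the data frame and its color variable are made up for this example:

```r
# hypothetical data frame with one categorical variable
df <- data.frame(color = factor(c("red", "blue", "green", "red")))

# 1-to-N coding: one 0/1 indicator column per level ("- 1" drops the intercept)
onehot <- model.matrix(~ color - 1, data = df)
# columns: colorblue, colorgreen, colorred; each row has exactly one 1
```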
Scaling of continuous variables • Many ML algorithms rely on measuring the distance between two samples • There should be no difference whether a length variable is measured in cm, inches, or km • To remove the unit of measure (e.g. kg, mph, …), each variable dimension is normalized: • subtract the mean • divide by the standard deviation
Scaling of continuous variables • Each variable now has mean 0 and variance 1 • Chebyshev's inequality: • at least 75% of the data lies between -2 and +2 • at least 89% between -3 and +3 • at least 94% between -4 and +4
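A minimal sketch of this normalization in R; the data matrix and its columns are made up for illustration:

```r
# hypothetical data matrix: two variables on very different scales
X <- cbind(height_cm = c(170, 182, 165, 190, 174),
           income    = c(21000, 48000, 35000, 90000, 27000))

# z-score normalization: subtract the mean, divide by the standard deviation
X_scaled <- scale(X)        # center = TRUE, scale = TRUE by default
colMeans(X_scaled)          # approximately 0 in every dimension
apply(X_scaled, 2, sd)      # 1 in every dimension
```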
Output variables • ML classification requires categorical output (continuous output = regression) • ML methods can be applied by binning continuous output (at the cost of some prediction accuracy) • Example: household income from $10,000 to $200,000 binned into "very low", "low", "average", "high", "very high"
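A minimal sketch of such binning in R; the break points and example incomes are made up for illustration:

```r
# hypothetical household incomes in dollars
income <- c(12000, 28000, 55000, 95000, 180000)

# turn the continuous output into 5 classes (break points chosen arbitrarily here)
income_class <- cut(income,
                    breaks = c(0, 20000, 40000, 80000, 150000, Inf),
                    labels = c("very low", "low", "average", "high", "very high"))
income_class   # very low, low, average, high, very high (a factor with 5 levels)
```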
Binary Decision Trees • Rely on Information Theory (Shannon) • Recursive algorithm that splits the feature space into 2 regions at each recursion step • Classification works by passing a sample down the tree from the root node until it arrives at a leaf node
Information Theory, Entropy • Introduced by Claude Shannon • Applications in data compression • Concerned with measuring actual information vs. redundancy • Measures information in bits
What is "Entropy"? • In Machine Learning, Entropy is a measure of the impurity of a set • High Entropy => bad for prediction • High Entropy => needs to be reduced (Information Gain)
H(X): Case studies • I: p(x_red) = 0.5, p(x_blue) = 0.5, H(X) = 1 • II: p(x_red) = 0.3, p(x_blue) = 0.7, H(X) = 0.88 • III: p(x_red) = 0.7, p(x_blue) = 0.3, H(X) = 0.88 • IV: p(x_red) = 0, p(x_blue) = 1, H(X) = 0
H(X): Relative vs. absolute frequencies • (figure: two sets X_I and X_II with different numbers of samples but the same class proportions) • => H(X_I) = H(X_II) • Only relative frequencies matter!
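A minimal sketch in R that reproduces the case-study values above; the Shannon entropy H(X) = -Σ p(x) log2 p(x) is computed from the relative class frequencies only:

```r
# Shannon entropy in bits: H(X) = -sum_x p(x) * log2(p(x)), with 0 * log2(0) := 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))   # 1.00  (case I)
entropy(c(0.3, 0.7))   # 0.88  (cases II and III)
entropy(c(0.0, 1.0))   # 0.00  (case IV)
```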
Information Gain • Given a set and a choice between possible splits into subsets: which split is preferable? • Information Gain: prefer the split whose subsets reduce the Entropy by the largest amount • (figure: a set with H(X) = 1 and two candidate splits)
Information Gain (Properties) • IG is at most as large as the Entropy of the original set • IG is the amount by which the original Entropy is reduced by splitting into subsets • IG is at least zero (zero if the Entropy is not reduced) • 0 <= IG <= H(X)
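A minimal sketch of computing the Information Gain of a binary split in R; the label vector and the split condition below are made up for illustration:

```r
# IG(X, split) = H(X) - sum_k (|X_k| / |X|) * H(X_k)
info_gain <- function(y, cond) {
  h <- function(v) {                         # entropy of a label vector
    p <- as.numeric(table(v)) / length(v)
    p <- p[p > 0]
    -sum(p * log2(p))
  }
  parts <- split(y, cond)                    # the two subsets induced by the split
  h(y) - sum(sapply(parts, function(s) length(s) / length(y) * h(s)))
}

# toy example: a perfect split reduces H(X) = 1 to 0, so IG = 1
y    <- rep(c("red", "blue"), each = 4)
cond <- rep(c(TRUE, FALSE), each = 4)
info_gain(y, cond)   # 1
```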
Building (binary) Decision Trees • Data set: categorical or quantitative variables • Iterate over the variables and calculate the IG for every possible split • categorical variables: one level vs. the rest • quantitative variables: sort the values, try a split between each pair of adjacent values • recurse until the prediction is good enough
Decision Trees: Quantitative variables • (figure: the sorted values of one quantitative variable with the Information Gain of each candidate split point; original H: 0.99)
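As an illustration, a minimal sketch of growing and applying such a tree with the R package rpart; the data frames adult and adult_test and the target column income are hypothetical placeholders:

```r
library(rpart)

# grow a binary classification tree; "information" selects the entropy /
# Information Gain splitting criterion (the default is Gini)
fit <- rpart(income ~ ., data = adult, method = "class",
             parms = list(split = "information"))

# classify new samples by passing them from the root down to a leaf node
pred <- predict(fit, newdata = adult_test, type = "class")
```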
Decision Trees: Overfitting • Fully grown trees are usually too complicated: they fit the noise in the training data and generalize poorly
Decision Trees: Stopping Criteria • Stop when the absolute number of samples in a node is low (below a threshold) • Stop when the Entropy is already low (below a threshold) • Stop if the IG is low • Stop if the split could be due to chance (Chi-Square test) • The threshold values are hyperparameters
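A minimal sketch of how such thresholds appear as hyperparameters in rpart; the data frame adult and the exact values are made up, and rpart does not implement the Chi-Square criterion:

```r
library(rpart)

ctrl <- rpart.control(minsplit  = 20,     # do not split nodes with fewer than 20 samples
                      minbucket = 7,      # every leaf must contain at least 7 samples
                      cp        = 0.01,   # ignore splits that improve the fit by less than 1%
                      maxdepth  = 10)     # limit the depth of the tree

fit <- rpart(income ~ ., data = adult, method = "class", control = ctrl)
```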
Decision Trees: Pruning • "Pruning" means removing nodes from a tree after training has finished • Stopping criteria are sometimes referred to as "pre-pruning" • Redundant nodes are removed; sometimes the tree is restructured
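A minimal sketch of one common form of pruning (cost-complexity pruning as implemented in rpart); the data frame adult and the target income are again hypothetical placeholders:

```r
library(rpart)

# grow a deliberately over-complex tree (very small cp), then prune it back
fit <- rpart(income ~ ., data = adult, method = "class",
             control = rpart.control(cp = 0.0001))

printcp(fit)                          # cross-validated error of each candidate subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # remove the redundant nodes
```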