Machine Learning Georg Pölzlbauer December 11, 2006
Outline • Exercises • Data Preparation • Decision Trees • Model Selection • Random Forests • Support Vector Machines
Exercises • Groups of 2 or 3 students • UCI ML Repository: pick 3 data sets with different characteristics (e.g. number of samples, number of dimensions, number of classes) • Estimate the classification error with 3 classifiers of your choice; compare the results • Estimate appropriate parameters for these classifiers • Implement in Matlab, R, WEKA, YALE, or KNIME
Exercises: Software • Matlab • YALE http://rapid-i.com/ • WEKA http://www.cs.waikato.ac.nz/ml/weka/ • KNIME http://www.knime.org/ • R http://www.r-project.org/
Exercises: Software • WEKA: recommended; easy to use, easy to learn, no programming required • KNIME, YALE: also easy to use • R: the most advanced and powerful option; do not use it unless you already know R well! • Matlab: not recommended; requires installing additional packages from the internet
Exercises: Written Report • The report should be 5-10 pages • Discuss the characteristics of the data sets (e.g. handling of missing values, scaling) • Summarize the classifiers used (one paragraph each) • Discuss the experimental results (tables, figures) • Do not include code in the report
Exercises: How to proceed • It is not necessary to implement anything yourself; rely on existing libraries, modules etc. • UCI ML Repository: http://www.ics.uci.edu/~mlearn/MLSummary.html • Import the data file, scale the data, apply model selection, and write down any problems/findings
Grading • No written/oral exam • Submission of the report at the end of January • Approx. 15 minutes discussion of results and code (individually for each group) • Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices
Questions? • Questions regarding theory: • poelzlbauer@ifs.tuwien.ac.at • musliu@dbai.tuwien.ac.at • Questions regarding R, Weka, …: • Forum
Machine Learning: Setting • (diagram: training data is used to train an ML model)
Data Preparation • Running example: adult census data • Data in table format (data matrix) • Missing values • Categorical data • Quantitative (continuous) data with different scales
Categorical variables • Non-numeric variables with a finite number of levels • E.g. "red", "blue", "green" • Some ML algorithms can only handle numeric variables • Solution: 1-to-N coding
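As an illustration, a minimal sketch of 1-to-N (one-hot) coding in R; the data frame and its color variable are made up for this example:

```r
# hypothetical data frame with one categorical variable
df <- data.frame(color = factor(c("red", "blue", "green", "red")))

# 1-to-N coding: one 0/1 indicator column per level ("- 1" drops the intercept)
onehot <- model.matrix(~ color - 1, data = df)
# columns: colorblue, colorgreen, colorred; each row has exactly one 1
```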
Scaling of continuous variables • Many ML algorithms rely on measuring the distance between two samples • There should be no difference whether a length variable is measured in cm, inches, or km • To remove the unit of measure (e.g. kg, mph, …), each variable dimension is normalized: • subtract the mean • divide by the standard deviation
Scaling of continuous variables • Each variable now has mean 0 and variance 1 • Chebyshev's inequality: • at least 75% of the data lies between -2 and +2 • at least 89% between -3 and +3 • at least 94% between -4 and +4
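A minimal sketch of this normalization in R; the data matrix and its columns are made up for illustration:

```r
# hypothetical data matrix: two variables on very different scales
X <- cbind(height_cm = c(170, 182, 165, 190, 174),
           income    = c(21000, 48000, 35000, 90000, 27000))

# z-score normalization: subtract the mean, divide by the standard deviation
X_scaled <- scale(X)        # center = TRUE, scale = TRUE by default
colMeans(X_scaled)          # approximately 0 in every dimension
apply(X_scaled, 2, sd)      # 1 in every dimension
```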
Output variables • ML classification requires categorical output (continuous output = regression) • ML methods can be applied by binning continuous output (at the cost of some prediction accuracy) • Example: household income from $10,000 to $200,000 binned into "very low", "low", "average", "high", "very high"
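A minimal sketch of such binning in R; the break points and example incomes are made up for illustration:

```r
# hypothetical household incomes in dollars
income <- c(12000, 28000, 55000, 95000, 180000)

# turn the continuous output into 5 classes (break points chosen arbitrarily here)
income_class <- cut(income,
                    breaks = c(0, 20000, 40000, 80000, 150000, Inf),
                    labels = c("very low", "low", "average", "high", "very high"))
income_class   # very low, low, average, high, very high (a factor with 5 levels)
```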
Binary Decision Trees • Rely on Information Theory (Shannon) • Recursive algorithm that splits the feature space into 2 regions at each recursion step • Classification works by passing a sample down the tree from the root node until it arrives at a leaf node
Information Theory, Entropy • Introduced by Claude Shannon • Applications in data compression • Concerned with measuring actual information vs. redundancy • Measures information in bits
What is "Entropy"? • In Machine Learning, Entropy is a measure of the impurity of a set • High Entropy => bad for prediction • High Entropy => needs to be reduced (Information Gain)
H(X): Case studies • I: p(x_red) = 0.5, p(x_blue) = 0.5, H(X) = 1 • II: p(x_red) = 0.3, p(x_blue) = 0.7, H(X) = 0.88 • III: p(x_red) = 0.7, p(x_blue) = 0.3, H(X) = 0.88 • IV: p(x_red) = 0, p(x_blue) = 1, H(X) = 0
H(X): Relative vs. absolute frequencies • (figure: two sets X_I and X_II with different numbers of samples but the same class proportions) • => H(X_I) = H(X_II) • Only relative frequencies matter!
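A minimal sketch in R that reproduces the case-study values above; the Shannon entropy H(X) = -Σ p(x) log2 p(x) is computed from the relative class frequencies only:

```r
# Shannon entropy in bits: H(X) = -sum_x p(x) * log2(p(x)), with 0 * log2(0) := 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))   # 1.00  (case I)
entropy(c(0.3, 0.7))   # 0.88  (cases II and III)
entropy(c(0.0, 1.0))   # 0.00  (case IV)
```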
Information Gain • Given a set and a choice between possible splits into subsets: which split is preferable? • Information Gain: prefer the split whose subsets reduce the Entropy by the largest amount • (figure: a set with H(X) = 1 and two candidate splits)
Information Gain (Properties) • IG is at most as large as the Entropy of the original set • IG is the amount by which the original Entropy is reduced by splitting into subsets • IG is at least zero (zero if the Entropy is not reduced) • 0 <= IG <= H(X)
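A minimal sketch of computing the Information Gain of a binary split in R; the label vector and the split condition below are made up for illustration:

```r
# IG(X, split) = H(X) - sum_k (|X_k| / |X|) * H(X_k)
info_gain <- function(y, cond) {
  h <- function(v) {                         # entropy of a label vector
    p <- as.numeric(table(v)) / length(v)
    p <- p[p > 0]
    -sum(p * log2(p))
  }
  parts <- split(y, cond)                    # the two subsets induced by the split
  h(y) - sum(sapply(parts, function(s) length(s) / length(y) * h(s)))
}

# toy example: a perfect split reduces H(X) = 1 to 0, so IG = 1
y    <- rep(c("red", "blue"), each = 4)
cond <- rep(c(TRUE, FALSE), each = 4)
info_gain(y, cond)   # 1
```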
Building (binary) Decision Trees • Data set: categorical or quantitative variables • Iterate over the variables and calculate the IG for every possible split • categorical variables: one level vs. the rest • quantitative variables: sort the values, try a split between each pair of adjacent values • recurse until the prediction is good enough
Decision Trees: Quantitative variables • (figure: the sorted values of one quantitative variable with the Information Gain of each candidate split point; original H: 0.99)
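As an illustration, a minimal sketch of growing and applying such a tree with the R package rpart; the data frames adult and adult_test and the target column income are hypothetical placeholders:

```r
library(rpart)

# grow a binary classification tree; "information" selects the entropy /
# Information Gain splitting criterion (the default is Gini)
fit <- rpart(income ~ ., data = adult, method = "class",
             parms = list(split = "information"))

# classify new samples by passing them from the root down to a leaf node
pred <- predict(fit, newdata = adult_test, type = "class")
```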
Decision Trees: Overfitting • Fully grown trees are usually too complicated: they fit the noise in the training data and generalize poorly
Decision Trees: Stopping Criteria • Stop when the absolute number of samples in a node is low (below a threshold) • Stop when the Entropy is already low (below a threshold) • Stop if the IG is low • Stop if the split could be due to chance (Chi-Square test) • The threshold values are hyperparameters
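A minimal sketch of how such thresholds appear as hyperparameters in rpart; the data frame adult and the exact values are made up, and rpart does not implement the Chi-Square criterion:

```r
library(rpart)

ctrl <- rpart.control(minsplit  = 20,     # do not split nodes with fewer than 20 samples
                      minbucket = 7,      # every leaf must contain at least 7 samples
                      cp        = 0.01,   # ignore splits that improve the fit by less than 1%
                      maxdepth  = 10)     # limit the depth of the tree

fit <- rpart(income ~ ., data = adult, method = "class", control = ctrl)
```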
Decision Trees: Pruning • "Pruning" means removing nodes from a tree after training has finished • Stopping criteria are sometimes referred to as "pre-pruning" • Redundant nodes are removed; sometimes the tree is restructured
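A minimal sketch of one common form of pruning (cost-complexity pruning as implemented in rpart); the data frame adult and the target income are again hypothetical placeholders:

```r
library(rpart)

# grow a deliberately over-complex tree (very small cp), then prune it back
fit <- rpart(income ~ ., data = adult, method = "class",
             control = rpart.control(cp = 0.0001))

printcp(fit)                          # cross-validated error of each candidate subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)   # remove the redundant nodes
```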