Machine Learning: Regression or Classification? J.-S. Roger Jang (張智星) jang@mirlab.org http://mirlab.org/jang MIR Lab, CSIE Dept. National Taiwan University
GHW Dataset • GHW: Gender-height-weight • Source • A table of 10000 entries with 3 columns • Gender: Categorical • Height: Numerical • Weight: Numerical
Multiple Ways of Prediction • Multiple ways of prediction based on the GHW dataset • Classification: Height, weight → gender • Regression: Gender, height → weight • Regression: Gender, weight → height • In general • Classification: the output is categorical • Regression: the output is numerical • Sometimes it's not so obvious!
K-Nearest Neighbor Classifiers(KNNC) J.-S. Roger Jang (張智星) jang@mirlab.org http://mirlab.org/jang MIR Lab, CSIE Dept. National Taiwan University
Concept of KNNC Quiz! • Concept: a point takes on the class of the company it keeps (近朱者赤、近墨者黑: "near vermilion one turns red; near ink one turns black") • Steps: • Find the k nearest neighbors of a given point. • Determine the class of the given point by voting among the k nearest neighbors. • Figure: class-A and class-B points on the feature-1 vs. feature-2 plane; the 3 nearest neighbors of a point of unknown class are found, and the point is classified as B via 3NNC.
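The two steps above can be sketched in a few lines of stdlib Python (the training points are made up for illustration):

```python
import math
from collections import Counter

def knn_classify(x, train, k=3):
    """train: list of (feature_vector, label) pairs; x: query feature vector."""
    # Step 1: find the k nearest neighbors (Euclidean distance)
    neighbors = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    # Step 2: majority vote among the neighbors' labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
knn_classify((4, 4), train, k=3)   # -> "B"
```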
Flowchart for KNNC • General flowchart of PR vs. the KNNC instance: • Feature extraction: from raw data to features • Model construction: clustering (optional) • Model evaluation: KNNC evaluation on the test dataset
Decision Boundary for 1NNC • Voronoi diagram: piecewise linear boundary Quiz!
Characteristics of KNNC • Strengths of KNNC • Intuitive • No computation for model construction • Weaknesses of KNNC • Massive computation required when the dataset is big • No straightforward way • To determine the value of k • To rescale the dataset along each dimension Quiz!
Preprocessing/Variants for KNNC Quiz! • Preprocessing: • Z-normalization to have zero mean and unit variance along each feature • Value of k obtained via trial and error • Variants: • Weighted votes • Nearest-prototype classification • Edited nearest neighbor classification • k+k-nearest neighbor
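The z-normalization step can be sketched as follows (stdlib only; `pstdev` is the population standard deviation, so each column ends up with mean 0 and unit population variance):

```python
import statistics

def z_normalize(data):
    """Scale each feature (column) to zero mean and unit variance."""
    cols = list(zip(*data))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in data]

data = [(160, 50), (170, 60), (180, 70)]
normed = z_normalize(data)   # each column now has mean 0 and unit std
```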
Demos by Cleve Moler • Cleve's demos of Delaunay triangulation and Voronoi diagrams • books/dcpr/example/cleve/vshow.m
1NNC Decision Boundaries
Using Prototypes in KNNC • Number of prototypes per class: 4
Decision Boundaries of Different Classifiers • Quadratic classifier • 1NNC • Naive Bayes classifier
Naive Bayes Classifiers(NBC) J.-S. Roger Jang (張智星) jang@mirlab.org http://mirlab.org/jang MIR Lab, CSIE Dept. National Taiwan University
Assumptions & Characteristics • Assumptions: • Statistical independence between features • Statistical independence between samples • Each feature governed by a feature-wise parameterized PDF (usually a 1D Gaussian) • Characteristics: • Simple and easy (that's why it's called "naive") • Highly successful in real-world applications despite the strong assumptions
Training and Test Stages of NBC Quiz! • Training stage • Identify the PDF of each class, as follows: • Identify each feature's PDF by MLE for 1D Gaussians • The class PDF is the product of the corresponding feature PDFs: p(x|c) = ∏ᵢ p(xᵢ|c) • Test stage • Assign a sample x to the class by taking the class prior into consideration: c* = arg maxᶜ P(c) p(x|c)
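A minimal Gaussian NBC along the lines of the two stages above (pure Python; `pvariance` is the biased MLE variance; the toy 2-class, 2-feature data is made up for illustration):

```python
import math
import statistics
from collections import defaultdict

class NaiveBayes:
    """Gaussian naive Bayes: one 1D Gaussian per (class, feature), fit by MLE."""

    def fit(self, X, y):
        self.priors, self.params = {}, {}
        by_class = defaultdict(list)
        for xi, yi in zip(X, y):
            by_class[yi].append(xi)
        for c, pts in by_class.items():
            self.priors[c] = len(pts) / len(y)
            cols = list(zip(*pts))
            # MLE per feature: sample mean and biased (divide-by-n) variance
            self.params[c] = [(statistics.mean(col), statistics.pvariance(col))
                              for col in cols]
        return self

    def predict(self, x):
        def log_score(c):   # log P(c) + sum of per-feature log-likelihoods
            s = math.log(self.priors[c])
            for v, (m, var) in zip(x, self.params[c]):
                s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            return s
        return max(self.priors, key=log_score)

X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2),
     (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
y = ["A", "A", "A", "B", "B", "B"]
nbc = NaiveBayes().fit(X, y)
nbc.predict((1.1, 2.0))   # -> "A"
```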
NBC for Iris Dataset (1/3) Scatter plot of the Iris dataset (using only the last two dimensions) PDF of each feature for each class
NBC for Iris Dataset (2/3) PDF for each class
NBC for Iris Dataset (3/3) Decision boundaries (using the last 2 features)
Strengths and Weaknesses of NBC Quiz! • Strengths • Fast computation during training and evaluation • More robust than QC • Weaknesses • Not able to deal with bi-modal data correctly • Class boundaries not as flexible as QC's
Quadratic Classifiers(QC) J.-S. Roger Jang (張智星) jang@mirlab.org http://mirlab.org/jang MIR Lab, CSIE Dept. National Taiwan University
Review: PDF Modeling • Goal: • Find a PDF (probability density function) that best describes a given dataset • Steps: • Select a class of parameterized PDFs • Identify the parameters via MLE (maximum likelihood estimation) based on a given set of sample data • Commonly used PDFs: • Multi-dimensional Gaussian PDF • Gaussian mixture models (GMM)
PDF Modeling for Classification • Procedure for classification based on PDFs • Training stage: PDF modeling of each class based on the training dataset • Test stage: for each entry in the test dataset, pick the class with the maximum PDF value • Commonly used classifiers: • Quadratic classifiers, with D-dim. Gaussian PDFs • Gaussian-mixture-model classifiers, with GMM PDFs
1D Gaussian PDF Modeling • 1D Gaussian PDF: g(x; μ, σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)) • MLE of μ (mean) and σ² (variance): μ = (1/n) Σᵢ xᵢ, σ² = (1/n) Σᵢ (xᵢ − μ)²
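The MLE formulas above correspond directly to the sample mean and the biased (divide-by-n) variance; a quick check with made-up data:

```python
import math
import statistics

def gauss_pdf(x, mu, var):
    """1D Gaussian PDF g(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

data = [4.8, 5.1, 5.0, 4.9, 5.2]       # illustrative sample
mu = statistics.mean(data)              # MLE of the mean
var = statistics.pvariance(data)        # MLE of the variance (divide by n)
# The fitted PDF peaks at the sample mean
```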
D-dim. Gaussian PDF Modeling • D-dim. Gaussian PDF: g(x; μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ)) • MLE of μ (mean vector) and Σ (covariance matrix): μ = (1/n) Σᵢ xᵢ, Σ = (1/n) Σᵢ (xᵢ − μ)(xᵢ − μ)ᵀ
2D Gaussian PDF Contours • PDF and contours of a bivariate normal distribution
Training and Test Stages of QC Quiz! • Training stage • Identify the Gaussian PDF of each class via MLE • Test stage • Assign a sample point x to a class by taking the class prior into consideration: c* = arg maxᶜ P(c) g(x; μᶜ, Σᶜ)
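A sketch of both stages for the 2D case, in pure Python (the 2×2 inverse is written out by hand; the two point clouds are made up; equal priors are assumed):

```python
import math

def fit_gaussian(points):
    """MLE of the mean vector and 2x2 covariance matrix for 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def log_gauss2d(x, mean, cov):
    """Log of the 2D Gaussian PDF g(x; mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det   # 2x2 inverse
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    mahal = dx * (ia * dx + ib * dy) + dy * (ic * dx + id_ * dy)
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * mahal

def qc_predict(x, models, priors):
    """Pick the class maximizing log P(c) + log g(x; mu_c, Sigma_c)."""
    return max(models, key=lambda c: math.log(priors[c])
                                     + log_gauss2d(x, *models[c]))

models = {"A": fit_gaussian([(0, 0), (1, 0), (0, 1), (1, 1)]),
          "B": fit_gaussian([(4, 4), (5, 4), (4, 5), (5, 5)])}
priors = {"A": 0.5, "B": 0.5}
qc_predict((0.6, 0.4), models, priors)   # -> "A"
```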
Characteristics of QC Quiz! • If each class is modeled by a Gaussian PDF, the decision boundary between any two classes is a quadratic function. • That is why it is called a quadratic classifier. • How to prove it?
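A sketch of the proof asked for above: equate the two classes' log-scores and expand the Gaussian log-PDFs.

```latex
% Boundary between classes i and j: equal (prior-weighted) log-likelihoods
\begin{aligned}
&\log P(c_i)\,g(\mathbf{x};\boldsymbol{\mu}_i,\Sigma_i)
 = \log P(c_j)\,g(\mathbf{x};\boldsymbol{\mu}_j,\Sigma_j) \\
\Longleftrightarrow\;
&-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)
 +\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^{\top}\Sigma_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)
 = \tfrac{1}{2}\log\frac{|\Sigma_i|}{|\Sigma_j|}-\log\frac{P(c_i)}{P(c_j)}
\end{aligned}
```

The left-hand side is quadratic in x, with quadratic term ½ xᵀ(Σⱼ⁻¹ − Σᵢ⁻¹)x plus linear terms, so the boundary is a quadratic surface; if Σᵢ = Σⱼ, the quadratic term cancels and the boundary degenerates to a linear one.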
QC on Iris Dataset (1/3) • Scatter plot of Iris dataset (with only the last two dim.)
QC on Iris Dataset (2/3) • PDF for each class:
QC on Iris Dataset (3/3) • Decision boundaries among classes:
Strengths and Weaknesses of QC Quiz! • Strengths • Easy computation when the dimension d is small • Efficient way to compute leave-one-out cross validation • Weaknesses • The covariance matrix (d by d) is big when the dimension d is moderately large • The inverse of the covariance matrix may not exist • Cannot handle bi-modal data correctly
Modified Versions of QC Quiz! • How to modify QC so that it need not deal with a big covariance matrix? • Make the covariance matrix diagonal → equivalent to naive Bayes classifiers (proof?) • Make the covariance matrix a constant times the identity matrix
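The diagonal-covariance equivalence can be checked numerically: with Σ = diag(σ₁², σ₂²), the QC log-likelihood decouples into a sum of per-feature 1D Gaussian log-likelihoods, which is exactly the NBC class log-PDF (sample point and parameters below are made up):

```python
import math

# Hypothetical 2D sample and per-feature Gaussian parameters for one class
x = (1.3, 0.7)
means = (1.0, 0.5)
variances = (0.04, 0.09)

# NBC view: sum of per-feature 1D Gaussian log-likelihoods
nbc = sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
          for xi, m, v in zip(x, means, variances))

# QC view with diagonal covariance diag(variances):
# log|Sigma| = sum of log variances; the Mahalanobis term decouples per feature
logdet = sum(math.log(v) for v in variances)
mahal = sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))
qc = -math.log(2 * math.pi) - 0.5 * logdet - 0.5 * mahal

# The two log-likelihoods coincide, so diagonal-covariance QC = NBC
```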
Performance Evaluation: Accuracy Estimate for Classification and Regression J.-S. Roger Jang (張智星) jang@mirlab.org http://mirlab.org/jang MIR Lab, CSIE Dept. National Taiwan University
Introduction to Performance Evaluation • Performance evaluation: an objective procedure to derive the performance index (or figure of merit) of a given model • Typical performance indices • Accuracy • Recognition rate: ↗ (larger is better) • Error rate: ↘ (smaller is better) • RMSE (root mean squared error): ↘ (smaller is better) • Computation load: ↘ (smaller is better)
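Two of the indices above, sketched directly from their definitions (stdlib only; the sample values are illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: smaller is better."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def accuracy(y_true, y_pred):
    """Recognition rate: larger is better."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

rmse([1, 2, 3], [1, 2, 5])                 # -> sqrt(4/3), about 1.155
accuracy(["M", "F", "M"], ["M", "F", "F"])  # -> 2/3
```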
Synonyms • Sets of synonyms to be used interchangeably (since we are focusing on classification): • Classifiers, models • Recognition rate, accuracy • Training, design-time • Test, run-time
Performance Indices for Classifiers Quiz! • Performance indices of a classifier • Recognition rate • How to derive it via an objective method or procedure • Computation load • Design-time computation (training) • Run-time computation (test) • Our focus • Recognition rate and the procedures to derive it • The estimated accuracy depends on • Dataset partitioning • Model (type and complexity)
Methods for Performance Evaluation • Methods to derive the recognition rates • Inside test (resubstitution accuracy) • One-side holdout test • Two-side holdout test • M-fold cross validation • Leave-one-out cross validation
Dataset Partitioning • Data partitioning: to make the best use of the dataset • Training set only (for model construction) • Training and test sets (training for model construction, test for model evaluation) • Training, validating, and test sets (the validating set is for model selection) • All sets are disjoint!
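A minimal sketch of the three-way split (the 60/20/20 fractions are illustrative, not prescribed by the slides):

```python
import random

def split(dataset, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle, then cut into disjoint training/validating/test sets."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = split(list(range(10)))
# The three sets are disjoint and together cover the whole dataset
```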
Why Is It So Complicated? • To avoid overfitting! • Overfitting in regression • Overfitting in classification
Inside Test (1/2) • Dataset partitioning • Use the whole dataset for both training and evaluation • Recognition rate (RR) • Inside-test or resubstitution recognition rate
Inside Test (2/2) • Characteristics • Too optimistic, since the RR tends to be higher than the true RR • For instance, 1NNC always has an inside-test RR of 100%! • Can be used as an upper bound of the true RR • Potential reasons for a low inside-test RR: • Bad features in the dataset • Bad method for model construction, such as • Bad results from neural network training • Bad results from k-means clustering
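The 1NNC claim above is easy to verify: under an inside test, each point's nearest neighbor is itself (distance 0), so the resubstitution RR is always 100%. A tiny check with made-up points:

```python
import math

def one_nn(x, train):
    """1NNC: return the label of the nearest training point."""
    return min(train, key=lambda p: math.dist(x, p[0]))[1]

train = [((1, 1), "A"), ((2, 1), "A"), ((5, 5), "B"), ((6, 6), "B")]
# Inside test: evaluate on the very same data used for "training";
# every point finds itself at distance 0, hence a perfect score
inside_rr = sum(one_nn(x, train) == y for x, y in train) / len(train)  # -> 1.0
```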
One-side Holdout Test (1/2) • Dataset partitioning • Training set for model construction • Test set for performance evaluation • Recognition rate • Inside-test RR • Outside-test RR