The basic notions related to machine learning.
Feature extraction
• It is a vital step before the actual learning: we have to create the input feature vector
• Obviously, the optimal feature set is task-dependent
• Ideally, the features are recommended by an expert of the given domain
• In practice, however, we (engineers) usually have to solve it ourselves
• A good feature set contains relevant features, and only a few of them
• In many practical tasks it is not clear which features are relevant
• E.g. influenza – fever: relevant, eye color: irrelevant, age: ???
• When we are unsure, we might simply include the feature
• It is not that simple: including irrelevant features makes learning more difficult for two reasons
• Curse of dimensionality
• Irrelevant features introduce noise into the data, which many algorithms have difficulty handling
Curse of Dimensionality
• Too many features make learning more difficult
• Number of features = dimension of the feature space
• Learning becomes harder in higher-dimensional spaces
• Example: let's consider the following simple algorithm
• Learning: we divide the feature space into little hypercubes and count the examples falling into each of them. We label each cube by the class that has the most examples in it
• Classification: a new test case is labeled by the label of the cube it falls into
• The number of cubes increases exponentially with the number of dimensions!
• With a fixed number of examples, more and more cubes remain empty
• More and more examples are required to reach a certain density of examples
• Real learning algorithms are cleverer, but the problem is the same
• More features mean that many more training examples are needed (see the sketch below)
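A minimal sketch of the hypercube argument (not from the slides; the sample size and grid resolution are arbitrary choices): with a fixed number of random examples, the total number of cells explodes with the dimension, so the fraction of occupied cells collapses.

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 1000          # fixed number of training examples
    cells_per_axis = 10       # each feature axis is split into 10 intervals

    for dim in (1, 2, 3, 5, 10):
        X = rng.random((n_samples, dim))              # examples in the unit hypercube
        cell_ids = np.floor(X * cells_per_axis).astype(int)
        occupied = len({tuple(row) for row in cell_ids})
        total = cells_per_axis ** dim                 # grows exponentially with dim
        print(f"dim={dim:2d}  cells={total:.0e}  occupied fraction={occupied / total:.2e}")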
The effect of irrelevant features
• Irrelevant features may make the learning algorithms less efficient
• Example: the nearest neighbor method
• Learning: we simply store the training examples
• Classification: we label a new example with the label of its nearest neighbor
• Good features: the points of the same class fall close to each other
• What if we include a noise-like feature? The points get randomly scattered along the new dimension, and the distance relations fall apart (see the sketch below)
• Most learning algorithms are cleverer than this, but their operation is also disturbed by an irrelevant (noise-like) feature
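An illustrative sketch (assuming scikit-learn is available; the data set is synthetic): a 1-nearest-neighbor classifier is trained on two informative features, then on the same data padded with purely random features, and the test accuracy typically degrades as noise dimensions are added.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    for n_noise in (0, 5, 20):
        # pad the informative features with purely random (irrelevant) ones
        X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))]) if n_noise else X
        X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y, test_size=0.3, random_state=0)
        knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)   # "learning" = storing the examples
        print(f"{n_noise:2d} noise features -> test accuracy {knn.score(X_te, y_te):.2f}")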
Optimizing the feature space
• We usually try to pick the best features manually
• But of course, there are also automatic methods for this
• Feature selection algorithms
• They retain M<N features from the original set of N features
• We can reduce the feature space not only by throwing away the less relevant features, but also by transforming the feature space
• Feature space transformation methods
• The new features are obtained as some combination of the old features
• We usually also reduce the number of dimensions at the same time (the new feature space has fewer dimensions than the old one)
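Both ideas have off-the-shelf implementations; a hedged sketch with scikit-learn (the data set and the choice of M are placeholders): SelectKBest keeps the M highest-scoring original features, while PCA builds M new features as linear combinations of all the old ones.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)        # N = 4 original features

    # Feature selection: keep the M = 2 features with the best class-separation score
    X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Feature space transformation: M = 2 new features, each a combination of all 4 old ones
    X_pca = PCA(n_components=2).fit_transform(X)

    print(X.shape, X_sel.shape, X_pca.shape)  # (150, 4) (150, 2) (150, 2)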
Evaluating the trained model
• Based on the training examples, the algorithm constructs a model (hypothesis) of the function (x1,…,xN)→c
• This model can guess the value of the function for any (x1,…,xN)
• Our main goal is not to perfectly learn the labels of the training samples, but to generalize to examples not seen during training
• How can we give an estimate of the generalization ability?
• We leave out a subset of the training examples during training → test set
• Evaluation:
• We evaluate the model on the test set → estimated class labels
• We compare the estimated labels with the correct labels
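A minimal sketch of this held-out evaluation (scikit-learn, with an arbitrary classifier standing in for "the algorithm"): part of the labeled data is set aside as a test set, the model is fit on the rest, and its predictions are compared with the correct test labels.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_digits(return_X_y=True)

    # Leave out 25% of the labeled examples as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # learn on the training part only
    y_pred = model.predict(X_test)                           # estimated labels for unseen examples

    print("generalization estimate:", accuracy_score(y_test, y_pred))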
Evaluating the trained model 2
• How to quantify the estimation error for a regression task:
• Example: the algorithm outputs a straight line – the error is shown by the yellow arrows in the figure (the distances between the points and the line)
• Summarizing the error indicated by the yellow arrows:
• Mean squared error or Root-mean-squared error
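As a concrete illustration (the target values and predictions below are made up), the "yellow arrows" correspond to the residuals, and MSE / RMSE summarize them as follows:

    import numpy as np

    y_true = np.array([1.0, 2.0, 3.5, 5.0])     # correct target values
    y_pred = np.array([1.2, 1.8, 3.9, 4.4])     # values predicted by the fitted line

    residuals = y_true - y_pred                  # the "yellow arrows"
    mse = np.mean(residuals ** 2)                # Mean squared error
    rmse = np.sqrt(mse)                          # Root-mean-squared error
    print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")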
Evaluating the trained model 3
• Quantifying the error for a classification task:
• Simplest solution: the classification error rate
• Number of incorrectly classified test samples / Number of all test samples
• More detailed error analysis: with the help of the confusion matrix
• It helps us understand which classes are missed by the algorithm
• It also allows defining an error function that counts different mistakes with different weights
• For this we can define a weight matrix over the cells of the confusion matrix
• „0-1 loss”: it weights the elements of the main diagonal by 0 and the other cells by 1
• This is the same as the classification error rate
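A hedged sketch of this weighting idea (the labels and the cost values are invented): the confusion matrix is computed with scikit-learn, and a weight matrix turns it into either the plain error rate (0-1 loss) or a cost-weighted error.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
    y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 2])

    cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
    print(cm)

    # "0-1 loss": 0 on the main diagonal, 1 elsewhere -> classification error rate
    zero_one = 1 - np.eye(3)
    error_rate = np.sum(cm * zero_one) / np.sum(cm)

    # A custom weight matrix could, e.g., penalize confusing class 2 with class 0 more heavily
    custom = np.array([[0, 1, 1],
                       [1, 0, 1],
                       [5, 1, 0]])
    weighted_error = np.sum(cm * custom) / np.sum(cm)
    print(error_rate, weighted_error)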
Evaluating the trained model 4
• We can also weight the different mistakes differently
• This is most common when we have only two classes
• Example: diagnosing an illness
• The cost matrix is of size 2x2:
• Error 1: False negative: the patient is ill, but the machine said no
• Error 2: False positive: the machine said yes, but the patient is not ill
• These have different costs!
• Metrics: see fig.
• Metrics preferred by doctors:
• Sensitivity: tp/(tp+fn)
• Specificity: tn/(tn+fp)
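A small sketch of these two metrics (the example counts are invented; 1 = ill, 0 = healthy): tp, fn, fp and tn are read off the 2x2 confusion matrix and plugged into the formulas above.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 1 = ill, 0 = healthy
    y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    sensitivity = tp / (tp + fn)   # how many of the ill patients were detected
    specificity = tn / (tn + fp)   # how many of the healthy patients were cleared
    print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")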
„No Free Lunch” theorem
• There is no universal learning algorithm that outperforms all other algorithms on all possible tasks
• The optimal learning algorithm is always task-dependent
• For every learning algorithm one can find tasks on which it performs well, and tasks on which it performs poorly
• Demonstration: the hypotheses of Method 1 and Method 2 fitted on the same examples: Which hypothesis is correct? It depends on the real distribution (see fig.)
„No Free Lunch” theorem 2
• Put another way: the average performance (over „all possible tasks”) of all learning algorithms is the same
• OK, but then… what is the point of constructing machine learning algorithms?
• We should concentrate on one type of task rather than trying to solve all tasks with one algorithm!
• It makes sense to look for a good algorithm for, e.g., speech recognition or face recognition
• You should be very careful when making claims like „algorithm A is better than algorithm B”
• Machine learning databases: for the objective evaluation of machine learning algorithms over a broad range of tasks
• E.g.: UCI Machine Learning Repository
Generalization vs. overfitting
• No Free Lunch theorem: we can never be sure that the trained model generalizes correctly to the cases not seen during training
• But then, how should we choose from the possible hypotheses?
• Experience: increasing the complexity of the model increases its flexibility, so it becomes more and more accurate on the training examples
• However, at some point its performance starts dropping on the test examples!
• This phenomenon is called overfitting: after learning the general properties, the model starts to learn the peculiarities of the given finite training set
The „Occam’s razor” heuristic
• Experience: usually the simpler model generalizes better
• But of course, a model that is too simple is not good either
• Einstein: „Things should be explained as simply as possible. But no simpler.” – this is practically the same as the Occam’s razor heuristic
• The optimal model complexity is different for each task
• How can we find the optimum point shown in the figure?
• Theoretical approach: we formalize the complexity of a hypothesis
• Minimum Description Length principle: we seek the hypothesis h for which K(h,D)=K(h)+K(D|h) is minimal
• K(h): the complexity of hypothesis h
• K(D|h): the complexity of representing the data set D with the help of hypothesis h
• K(): Kolmogorov complexity
Bias and variance
• Another formalism for a model being „too simple” or „too complex”
• For the case of regression
• Example: we fit the red polynomial to the blue points, green is the optimal solution
• Polynomial of too low a degree: cannot fit the examples → bias
• Too high a degree: fits the examples, but oscillates in between them → variance
• Formally:
• Let's select a random training set D with n elements, and run the training on it
• Repeat this many times, and analyze the expectation of the squared error between the approximation g(x;D) and the original function F(x) at a given point x
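Written out (a standard reformulation rather than a formula taken from the slides, with F(x) treated as the noise-free target), this expected squared error at a fixed point x splits into a bias term and a variance term:

    \mathbb{E}_D\!\left[(g(x;D) - F(x))^2\right]
      = \underbrace{\left(\mathbb{E}_D[g(x;D)] - F(x)\right)^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}_D\!\left[\left(g(x;D) - \mathbb{E}_D[g(x;D)]\right)^2\right]}_{\text{variance}}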
Bias-variance trade-off
• Bias: the difference between the average of the estimates and F(x)
• If it is not 0, then the model is biased: it has a tendency to over- or under-estimate F(x)
• By increasing the model complexity (in our example the degree of the polynomial) the bias decreases
• Variance: the variance of the estimates (their average squared deviation from the average estimate)
• A large variance is not good (we get quite different estimates depending on the choice of D)
• Increasing the model complexity increases the variance
• Optimum: somewhere in between
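A rough numerical sketch of the trade-off (the target function, noise level, sample size and degrees below are all made-up choices): many random training sets are drawn, a polynomial of a given degree is fitted to each, and the bias² and variance of the resulting estimates are measured at a single point x0.

    import numpy as np

    rng = np.random.default_rng(0)
    F = np.sin                     # the "true" function F(x)
    x0 = 1.3                       # the point where we measure bias and variance

    for degree in (1, 3, 9):
        estimates = []
        for _ in range(500):                       # many random training sets D
            x = rng.uniform(0, 3, size=15)
            y = F(x) + rng.normal(scale=0.2, size=15)
            coefs = np.polyfit(x, y, deg=degree)   # fit a polynomial of the given degree
            estimates.append(np.polyval(coefs, x0))
        estimates = np.array(estimates)
        bias2 = (estimates.mean() - F(x0)) ** 2
        variance = estimates.var()
        print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")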
Finding the optimal complexity – A practical approach
• (Almost) all machine learning algorithms have meta-parameters
• These allow us to tune the complexity of the model
• E.g. polynomial fitting: the degree of the polynomial
• They are called meta-parameters (or hyperparameters) to distinguish them from the real parameters (e.g. for polynomials: the coefficients)
• Different meta-parameter values result in slightly different models
• How can we find the optimal meta-parameters?
• We separate a small validation (also called development) set from the training set
• Overall, our data is divided into train-dev-test sets
• We repeat training on the train set several times with different meta-parameter values
• We evaluate the models obtained on the dev set (to estimate the red curve of the figure)
• Finally, the model that performed best on the dev set is evaluated on the test set (see the sketch below)
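A hedged sketch of this train-dev-test recipe (the data, noise level, split sizes and candidate degrees are placeholders): each candidate degree is fitted on the training part only, the dev error picks the winner, and only the winning model is scored on the test set.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 3, size=120)
    y = np.sin(x) + rng.normal(scale=0.2, size=120)

    # Split into train / dev / test (here 80 / 20 / 20 examples)
    x_tr, y_tr = x[:80], y[:80]
    x_dev, y_dev = x[80:100], y[80:100]
    x_te, y_te = x[100:], y[100:]

    def rmse(coefs, xs, ys):
        return np.sqrt(np.mean((np.polyval(coefs, xs) - ys) ** 2))

    # The degree is the meta-parameter; the coefficients are the real parameters
    results = {}
    for degree in range(1, 10):
        coefs = np.polyfit(x_tr, y_tr, deg=degree)     # train on the training set only
        results[degree] = (rmse(coefs, x_dev, y_dev), coefs)

    best_degree = min(results, key=lambda d: results[d][0])   # pick the winner on the dev set
    print("best degree:", best_degree,
          "test RMSE:", rmse(results[best_degree][1], x_te, y_te))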
Finding the optimal complexity – Example
• Let's assume that we have two classes, and we want to separate them by a polynomial
• The coefficients of the polynomial are the parameters of the model
• The degree of the polynomial is the meta-parameter
• What happens if we increase the degree?
• The optimal degree can be estimated with the help of the independent development set (see the sketch below)
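One possible realization of this example with scikit-learn (the data set and the candidate degrees are invented): logistic regression on polynomial features gives a polynomial decision boundary, and the degree is chosen by dev-set accuracy before the final test-set evaluation.

    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_dev, X_te, y_dev, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    best_degree, best_model, best_acc = None, None, -1.0
    for degree in range(1, 8):                      # the meta-parameter
        model = make_pipeline(PolynomialFeatures(degree),
                              LogisticRegression(max_iter=2000)).fit(X_tr, y_tr)
        acc = model.score(X_dev, y_dev)             # the dev set picks the degree
        if acc > best_acc:
            best_degree, best_model, best_acc = degree, model, acc

    print("best degree:", best_degree, "test accuracy:", best_model.score(X_te, y_te))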