The basic notions related to machine learning.
Feature extraction
• It is a vital step before the actual learning: we have to create the input feature vector
• Obviously, the optimal feature set is task-dependent
• Ideally, the features are recommended by an expert of the given domain
• In practice, however, we (engineers) usually have to solve it ourselves
• A good feature set contains relevant features, and only a few of them
• In many practical tasks it is not clear which features are relevant
• E.g. influenza – fever: relevant, eye color: irrelevant, age: ???
• When we are unsure, we might simply include the feature
• It is not that simple: including irrelevant features makes learning more difficult for two reasons
• Curse of dimensionality
• Irrelevant features introduce noise into the data, which many algorithms have difficulty handling
Curse of Dimensionality
• Too many features make learning more difficult
• Number of features = dimension of the feature space
• Learning becomes harder in higher-dimensional spaces
• Example: let's consider the following simple algorithm
• Learning: we divide the feature space into little hypercubes and count the examples falling into each of them. We label each cube by the class that has the most examples in it
• Classification: a new test case is labeled by the label of the cube it falls into
• The number of cubes increases exponentially with the number of dimensions!
• With a fixed number of examples, more and more cubes remain empty
• More and more examples are required to reach a certain density of examples
• Real learning algorithms are cleverer, but the problem is the same
• More features mean that many more training examples are needed (see the sketch below)
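A minimal sketch of the hypercube argument (not from the slides; the sample size and grid resolution are arbitrary choices): with a fixed number of random examples, the total number of cells explodes with the dimension, so the fraction of occupied cells collapses.

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 1000          # fixed number of training examples
    cells_per_axis = 10       # each feature axis is split into 10 intervals

    for dim in (1, 2, 3, 5, 10):
        X = rng.random((n_samples, dim))              # examples in the unit hypercube
        cell_ids = np.floor(X * cells_per_axis).astype(int)
        occupied = len({tuple(row) for row in cell_ids})
        total = cells_per_axis ** dim                 # grows exponentially with dim
        print(f"dim={dim:2d}  cells={total:.0e}  occupied fraction={occupied / total:.2e}")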
The effect of irrelevant features
• Irrelevant features may make the learning algorithms less efficient
• Example: the nearest neighbor method
• Learning: we simply store the training examples
• Classification: we label a new example with the label of its nearest neighbor
• Good features: the points of the same class fall close to each other
• What if we include a noise-like feature? The points get randomly scattered along the new dimension, and the distance relations fall apart (see the sketch below)
• Most learning algorithms are cleverer than this, but their operation is also disturbed by an irrelevant (noise-like) feature
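An illustrative sketch (assuming scikit-learn is available; the data set is synthetic): a 1-nearest-neighbor classifier is trained on two informative features, then on the same data padded with purely random features, and the test accuracy typically degrades as noise dimensions are added.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    for n_noise in (0, 5, 20):
        # pad the informative features with purely random (irrelevant) ones
        X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))]) if n_noise else X
        X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y, test_size=0.3, random_state=0)
        knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)   # "learning" = storing the examples
        print(f"{n_noise:2d} noise features -> test accuracy {knn.score(X_te, y_te):.2f}")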
Optimizing the feature space
• We usually try to pick the best features manually
• But of course, there are also automatic methods for this
• Feature selection algorithms
• They retain M<N features from the original set of N features
• We can reduce the feature space not only by throwing away the less relevant features, but also by transforming the feature space
• Feature space transformation methods
• The new features are obtained as some combination of the old features
• We usually also reduce the number of dimensions at the same time (the new feature space has fewer dimensions than the old one)
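Both ideas have off-the-shelf implementations; a hedged sketch with scikit-learn (the data set and the choice of M are placeholders): SelectKBest keeps the M highest-scoring original features, while PCA builds M new features as linear combinations of all the old ones.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)        # N = 4 original features

    # Feature selection: keep the M = 2 features with the best class-separation score
    X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

    # Feature space transformation: M = 2 new features, each a combination of all 4 old ones
    X_pca = PCA(n_components=2).fit_transform(X)

    print(X.shape, X_sel.shape, X_pca.shape)  # (150, 4) (150, 2) (150, 2)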
Evaluating the trained model
• Based on the training examples, the algorithm constructs a model (hypothesis) of the function (x1,…,xN)→c
• This model can guess the value of the function for any (x1,…,xN)
• Our main goal is not to perfectly learn the labels of the training samples, but to generalize to examples not seen during training
• How can we give an estimate of the generalization ability?
• We leave out a subset of the training examples during training → test set
• Evaluation:
• We evaluate the model on the test set → estimated class labels
• We compare the estimated labels with the correct labels
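A minimal sketch of this held-out evaluation (scikit-learn, with an arbitrary classifier standing in for "the algorithm"): part of the labeled data is set aside as a test set, the model is fit on the rest, and its predictions are compared with the correct test labels.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_digits(return_X_y=True)

    # Leave out 25% of the labeled examples as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)   # learn on the training part only
    y_pred = model.predict(X_test)                           # estimated labels for unseen examples

    print("generalization estimate:", accuracy_score(y_test, y_pred))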
Evaluating the trained model 2
• How to quantify the estimation error for a regression task:
• Example: the algorithm outputs a straight line – the error is shown by the yellow arrows in the figure (the distances between the points and the line)
• Summarizing the error indicated by the yellow arrows:
• Mean squared error or Root-mean-squared error
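As a concrete illustration (the target values and predictions below are made up), the "yellow arrows" correspond to the residuals, and MSE / RMSE summarize them as follows:

    import numpy as np

    y_true = np.array([1.0, 2.0, 3.5, 5.0])     # correct target values
    y_pred = np.array([1.2, 1.8, 3.9, 4.4])     # values predicted by the fitted line

    residuals = y_true - y_pred                  # the "yellow arrows"
    mse = np.mean(residuals ** 2)                # Mean squared error
    rmse = np.sqrt(mse)                          # Root-mean-squared error
    print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")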
Evaluating the trained model 3
• Quantifying the error for a classification task:
• Simplest solution: the classification error rate
• Number of incorrectly classified test samples / Number of all test samples
• More detailed error analysis: with the help of the confusion matrix
• It helps us understand which classes are missed by the algorithm
• It also allows defining an error function that counts different mistakes with different weights
• For this we can define a weight matrix over the cells of the confusion matrix
• „0-1 loss”: it weights the elements of the main diagonal by 0 and the other cells by 1
• This is the same as the classification error rate
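A hedged sketch of this weighting idea (the labels and the cost values are invented): the confusion matrix is computed with scikit-learn, and a weight matrix turns it into either the plain error rate (0-1 loss) or a cost-weighted error.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
    y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 2])

    cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class
    print(cm)

    # "0-1 loss": 0 on the main diagonal, 1 elsewhere -> classification error rate
    zero_one = 1 - np.eye(3)
    error_rate = np.sum(cm * zero_one) / np.sum(cm)

    # A custom weight matrix could, e.g., penalize confusing class 2 with class 0 more heavily
    custom = np.array([[0, 1, 1],
                       [1, 0, 1],
                       [5, 1, 0]])
    weighted_error = np.sum(cm * custom) / np.sum(cm)
    print(error_rate, weighted_error)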
Evaluating the trained model 4
• We can also weight the different mistakes differently
• This is most common when we have only two classes
• Example: diagnosing an illness
• The cost matrix is of size 2x2:
• Error 1: False negative: the patient is ill, but the machine said no
• Error 2: False positive: the machine said yes, but the patient is not ill
• These have different costs!
• Metrics: see fig.
• Metrics preferred by doctors:
• Sensitivity: tp/(tp+fn)
• Specificity: tn/(tn+fp)
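A small sketch of these two metrics (the example counts are invented; 1 = ill, 0 = healthy): tp, fn, fp and tn are read off the 2x2 confusion matrix and plugged into the formulas above.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 1 = ill, 0 = healthy
    y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    sensitivity = tp / (tp + fn)   # how many of the ill patients were detected
    specificity = tn / (tn + fp)   # how many of the healthy patients were cleared
    print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")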
„No Free Lunch” theorem
• There is no universal learning algorithm that outperforms all other algorithms on all possible tasks
• The optimal learning algorithm is always task-dependent
• For every learning algorithm one can find tasks on which it performs well, and tasks on which it performs poorly
• Demonstration: the hypotheses of Method 1 and Method 2 fitted on the same examples: Which hypothesis is correct? It depends on the real distribution (see fig.)
„No Free Lunch” theorem 2
• Put another way: the average performance (over „all possible tasks”) of all learning algorithms is the same
• OK, but then… what is the point of constructing machine learning algorithms?
• We should concentrate on one type of task rather than trying to solve all tasks with one algorithm!
• It makes sense to look for a good algorithm for, e.g., speech recognition or face recognition
• You should be very careful when making claims like „algorithm A is better than algorithm B”
• Machine learning databases: for the objective evaluation of machine learning algorithms over a broad range of tasks
• E.g.: UCI Machine Learning Repository
Generalization vs. overfitting
• No Free Lunch theorem: we can never be sure that the trained model generalizes correctly to the cases not seen during training
• But then, how should we choose from the possible hypotheses?
• Experience: increasing the complexity of the model increases its flexibility, so it becomes more and more accurate on the training examples
• However, at some point its performance starts dropping on the test examples!
• This phenomenon is called overfitting: after learning the general properties, the model starts to learn the peculiarities of the given finite training set
The „Occam’s razor” heuristic
• Experience: usually the simpler model generalizes better
• But of course, a model that is too simple is not good either
• Einstein: „Things should be explained as simply as possible. But no simpler.” – this is practically the same as the Occam’s razor heuristic
• The optimal model complexity is different for each task
• How can we find the optimum point shown in the figure?
• Theoretical approach: we formalize the complexity of a hypothesis
• Minimum Description Length principle: we seek the hypothesis h for which K(h,D)=K(h)+K(D|h) is minimal
• K(h): the complexity of hypothesis h
• K(D|h): the complexity of representing the data set D with the help of hypothesis h
• K(): Kolmogorov complexity
Bias and variance
• Another formalism for a model being „too simple” or „too complex”
• For the case of regression
• Example: we fit the red polynomial to the blue points, green is the optimal solution
• Polynomial of too low a degree: cannot fit the examples → bias
• Too high a degree: fits the examples, but oscillates in between them → variance
• Formally:
• Let's select a random training set D with n elements, and run the training on it
• Repeat this many times, and analyze the expectation of the squared error between the approximation g(x;D) and the original function F(x) at a given point x
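Written out (a standard reformulation rather than a formula taken from the slides, with F(x) treated as the noise-free target), this expected squared error at a fixed point x splits into a bias term and a variance term:

    \mathbb{E}_D\!\left[(g(x;D) - F(x))^2\right]
      = \underbrace{\left(\mathbb{E}_D[g(x;D)] - F(x)\right)^2}_{\text{bias}^2}
      + \underbrace{\mathbb{E}_D\!\left[\left(g(x;D) - \mathbb{E}_D[g(x;D)]\right)^2\right]}_{\text{variance}}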
Bias-variance trade-off
• Bias: the difference between the average of the estimates and F(x)
• If it is not 0, then the model is biased: it has a tendency to over- or under-estimate F(x)
• By increasing the model complexity (in our example the degree of the polynomial) the bias decreases
• Variance: the variance of the estimates (their average squared deviation from the average estimate)
• A large variance is not good (we get quite different estimates depending on the choice of D)
• Increasing the model complexity increases the variance
• Optimum: somewhere in between
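A rough numerical sketch of the trade-off (the target function, noise level, sample size and degrees below are all made-up choices): many random training sets are drawn, a polynomial of a given degree is fitted to each, and the bias² and variance of the resulting estimates are measured at a single point x0.

    import numpy as np

    rng = np.random.default_rng(0)
    F = np.sin                     # the "true" function F(x)
    x0 = 1.3                       # the point where we measure bias and variance

    for degree in (1, 3, 9):
        estimates = []
        for _ in range(500):                       # many random training sets D
            x = rng.uniform(0, 3, size=15)
            y = F(x) + rng.normal(scale=0.2, size=15)
            coefs = np.polyfit(x, y, deg=degree)   # fit a polynomial of the given degree
            estimates.append(np.polyval(coefs, x0))
        estimates = np.array(estimates)
        bias2 = (estimates.mean() - F(x0)) ** 2
        variance = estimates.var()
        print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")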
Finding the optimal complexity – A practical approach
• (Almost) all machine learning algorithms have meta-parameters
• These allow us to tune the complexity of the model
• E.g. polynomial fitting: the degree of the polynomial
• They are called meta-parameters (or hyperparameters) to distinguish them from the real parameters (e.g. for polynomials: the coefficients)
• Different meta-parameter values result in slightly different models
• How can we find the optimal meta-parameters?
• We separate a small validation (also called development) set from the training set
• Overall, our data is divided into train-dev-test sets
• We repeat training on the train set several times with different meta-parameter values
• We evaluate the models obtained on the dev set (to estimate the red curve of the figure)
• Finally, the model that performed best on the dev set is evaluated on the test set (see the sketch below)
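A hedged sketch of this train-dev-test recipe (the data, noise level, split sizes and candidate degrees are placeholders): each candidate degree is fitted on the training part only, the dev error picks the winner, and only the winning model is scored on the test set.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 3, size=120)
    y = np.sin(x) + rng.normal(scale=0.2, size=120)

    # Split into train / dev / test (here 80 / 20 / 20 examples)
    x_tr, y_tr = x[:80], y[:80]
    x_dev, y_dev = x[80:100], y[80:100]
    x_te, y_te = x[100:], y[100:]

    def rmse(coefs, xs, ys):
        return np.sqrt(np.mean((np.polyval(coefs, xs) - ys) ** 2))

    # The degree is the meta-parameter; the coefficients are the real parameters
    results = {}
    for degree in range(1, 10):
        coefs = np.polyfit(x_tr, y_tr, deg=degree)     # train on the training set only
        results[degree] = (rmse(coefs, x_dev, y_dev), coefs)

    best_degree = min(results, key=lambda d: results[d][0])   # pick the winner on the dev set
    print("best degree:", best_degree,
          "test RMSE:", rmse(results[best_degree][1], x_te, y_te))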
Finding the optimal complexity – Example
• Let's assume that we have two classes, and we want to separate them by a polynomial
• The coefficients of the polynomial are the parameters of the model
• The degree of the polynomial is the meta-parameter
• What happens if we increase the degree?
• The optimal degree can be estimated with the help of the independent development set (see the sketch below)
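One possible realization of this example with scikit-learn (the data set and the candidate degrees are invented): logistic regression on polynomial features gives a polynomial decision boundary, and the degree is chosen by dev-set accuracy before the final test-set evaluation.

    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_dev, X_te, y_dev, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    best_degree, best_model, best_acc = None, None, -1.0
    for degree in range(1, 8):                      # the meta-parameter
        model = make_pipeline(PolynomialFeatures(degree),
                              LogisticRegression(max_iter=2000)).fit(X_tr, y_tr)
        acc = model.score(X_dev, y_dev)             # the dev set picks the degree
        if acc > best_acc:
            best_degree, best_model, best_acc = degree, model, acc

    print("best degree:", best_degree, "test accuracy:", best_model.score(X_te, y_te))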