  1. Machine learning methods – Introduction: The main properties of learning algorithms

  2. The goal of machine learning
  • Goal: to construct programs that are able to improve their performance using the experience collected during their operation
  • Learning algorithm: an algorithm that is able to deduce regularities and relationships from a set of training examples
  • Note 1: the main aim is not to memorize the actual training examples, but to generalize correctly to other samples not seen during training (also known as inductive learning)
  • Assumption: the examples faithfully represent the relationship we are trying to learn
  • Note 2: we can never be 100% sure that the relationship we found will generalize to unseen data
  • Because of this, we call the found relationship a "hypothesis"
  • After receiving further examples the algorithm may refine the hypothesis

  3. The main types of learning tasks
  • Supervised learning: the correct answer is also given with the training examples
  • The most common task: classification
  • Example: character recognition: a 16x16-pixel image → a letter
  • 16x16 pixels: input features
  • Letter: class label
  • In practice, we have to learn a function from examples
  • This will be the dominant topic of this semester
  • Unsupervised learning: no helping information is given
  • The most common task: clustering
  • Mapping data points into automatically found classes based on some kind of similarity measure (see the sketch below)
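A minimal sketch (not from the slides) contrasting the two settings with scikit-learn; the toy two-feature data below is invented. The classifier sees the class labels during training, while the clustering algorithm gets only the feature vectors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data: six 2-dimensional feature vectors and their class labels.
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
              [0.9, 0.8], [0.8, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised learning: the correct answers (y) are given with the examples.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0.25, 0.3]]))   # predicted class label for a new point

# Unsupervised learning: only X is given; groups are found from similarity.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                   # automatically found cluster indices
```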

  4. The main types of learning tasks 2
  • Modelling processes over time
  • In the classic function learning task we assume that consecutive samples are independent, or at least arrive in a random order
  • In contrast, when modelling time series we assume that the order carries crucial information that must be modelled
  • Examples: speech recognition, text analysis, modelling stock exchange data
  • Reinforcement learning
  • Example: artificial living "creatures" – autonomous agents
  • Interaction with the environment, collection of experiences
  • The experiences carry no labels in themselves; only a long-term goal is defined
  • A special sub-field within machine learning
  • Other special learning tasks

  5. Supervised learning of functions
  • The input of the function: a vector of some measurement data
  • Also called a feature vector or attribute vector
  • The output of the function: a class label or a real number
  • The input of the learning algorithm: a set of training examples
  • Its output: a hypothesis (model) of the function
  • It can return the (hypothesized) output value for any input vector
  • Set of training examples: a set of pairs of a feature vector and the corresponding class label
  • Example: does the patient have influenza? [Table in the slide: training instances as rows, feature vector columns, and a class label column (Y/N)]
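The slide's influenza table can be written down directly as arrays: each row of X is one training instance's feature vector, and y holds the corresponding class label. The feature names and values below are invented for illustration:

```python
import numpy as np

# Hypothetical influenza screening data (features and values are made up):
# columns = [fever in degrees Celsius, joint pain (0/1), cough (0/1)]
X = np.array([
    [39.2, 1, 1],
    [36.8, 0, 0],
    [38.5, 1, 0],
    [37.0, 0, 1],
])
# Class label: does the patient have influenza? (1 = yes, 0 = no)
y = np.array([1, 0, 1, 0])

# A training set is simply this collection of (feature vector, label) pairs;
# the learner's output is a hypothesis h(x) that returns a predicted label
# for any new feature vector x.
```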

  6. The main properties of a learning system
  • We have to think about these aspects when designing a new learning method or when looking for a suitable method for a given task
  • The type of input/output of the function to be learned
  • The representation method of the learned function (hypothesis)
  • The hypothesis space: the set of functions the method selects from
  • Which hypothesis it prefers when several hypotheses fit the data
  • What algorithm it uses to find a/the best hypothesis

  7. The output of the function to be learned
  • Classification: the output value comes from a finite, discrete set
  • Example: character recognition. We have to tell which letter is shown in a 16x16-pixel image. Range of output values = the letters of the alphabet
  • The classification task is the typical machine learning task
  • Concept learning: the function has a binary range
  • Example: we want to teach a robot the notion of "chair". Each object in its environment either belongs to this notion or not.
  • Regression: the range of the function is continuous
  • Example: assessing the value of used cars based on features like brand, age, engine capacity, …

  8. The input of the function to be learned
  • Binary features
  • Discrete features
  • Also called nominal, symbolic or categorical features
  • Continuous features
  • Binary → discrete → continuous conversion is trivial
  • Discrete → binary:
  • Class labels: learning N class labels can always be solved as N concept learning tasks ("one against the rest")
  • Features: N different values can be represented by log2(N) binary values
  • Continuous → discrete:
  • Can be solved by quantization (with some error), e.g. (fever) 39.7 → high
  • Quantization is usual only for features, less so for training targets (the conversions are sketched below)
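A small sketch of the conversions listed above; the bit ordering, the fever thresholds, and the category names are illustrative assumptions, not fixed rules:

```python
import numpy as np

# Discrete -> binary: N distinct values can be coded on ceil(log2 N) bits.
def to_binary_code(value_index, n_values):
    n_bits = int(np.ceil(np.log2(n_values)))
    return [(value_index >> b) & 1 for b in range(n_bits)]

print(to_binary_code(5, 8))   # value 5 out of 8 -> 3 bits: [1, 0, 1]

# N class labels -> N concept-learning ("one against the rest") targets.
labels = np.array([2, 0, 1, 2])
one_vs_rest = [(labels == c).astype(int) for c in range(3)]
print(one_vs_rest)            # one binary target vector per class

# Continuous -> discrete: quantization, e.g. fever 39.7 -> "high".
def quantize_fever(temp_celsius):
    if temp_celsius < 37.0:
        return "normal"
    elif temp_celsius < 38.5:
        return "elevated"
    return "high"

print(quantize_fever(39.7))   # -> "high" (with some quantization error)
```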

  9. Why does the type of input/output matter?
  • Different types of input/output require different types of inner representation
  • Some algorithms work only with a certain type of features/targets
  • Or they might work with other types of features, but not optimally
  • Examples:
  • Concept learning with binary features: we have to learn a Boolean function
  • In the 60s–70s logic formulas were thought to be the best representation of human thinking
  • A lot of research effort was put into learning logic formulas; these algorithms do not work on other types of data
  • The classic SVM algorithm is defined for two classes
  • Several extensions exist for multi-class tasks (one is sketched below)
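For instance, scikit-learn's LinearSVC is at its core a two-class linear SVM; wrapping it in OneVsRestClassifier is one standard way to extend it to several classes. The toy data below is made up:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class problem with 2 continuous features.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0],
              [1.1, 0.9], [0.0, 1.0], [0.1, 1.2]])
y = np.array([0, 0, 1, 1, 2, 2])

# The underlying SVM separates two classes with a hyperplane; the wrapper
# trains one "class c against the rest" binary SVM per class.
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(clf.predict([[0.9, 1.0]]))   # label of the highest-scoring binary SVM
```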

  10. Input/output examples 2
  • The classic decision tree algorithms were defined for discrete features
  • There are several extensions for continuous features, but these are not really efficient
  • The Gaussian mixture model of statistical pattern recognition
  • This assumes continuous features
  • There is not much sense in fitting Gaussian distributions to discrete features; in many cases the algorithm would even crash in practice
  • Classification in general, when we have continuous features
  • The characteristic function of each class is a discontinuous function that is hard to represent
  • There are two general solutions to represent it using continuous functions:
  • The geometric approach
  • The decision-theoretic approach
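As a sketch of the Gaussian mixture point, fitting scikit-learn's GaussianMixture to continuous data is natural, while feeding it a few-valued discrete feature gives the Gaussians almost nothing to model; the data below is synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Continuous features: a mixture of Gaussians is a sensible density model.
X_cont = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
                    rng.normal(4.0, 0.5, (100, 2))])
gmm = GaussianMixture(n_components=2).fit(X_cont)
print(gmm.means_)          # the two fitted component centres

# A feature taking only the values {0, 1, 2}: the fitted covariances can
# collapse onto the few distinct points, so the "density" is meaningless.
X_disc = rng.integers(0, 3, size=(100, 2)).astype(float)
# GaussianMixture(n_components=2).fit(X_disc)   # formally runs, conceptually wrong
```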

  11. The feature space and the decision boundary
  • When we have a feature vector of N components, our training examples can be displayed as points in an N-dimensional space
  • Example:
  • 2 features → 2 axes (x1, x2)
  • Class label: shown by colors
  • Goal: to find the decision boundary between the classes
  • Generally: give an estimate of the (x1, x2) → c function based on the training examples
  • This is the same as specifying the (x1, x2) → {0, 1} characteristic function (or indicator function) of each class ci (a plotting sketch follows below)
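A sketch of how the 2-feature case is usually visualized: training points as colored dots in the (x1, x2) plane, and the learned decision boundary drawn by evaluating the hypothesis on a grid. The classifier choice and the synthetic data here are arbitrary assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# Evaluate the hypothesis on a dense grid to draw the decision boundary.
xx, yy = np.meshgrid(np.linspace(-3, 6, 200), np.linspace(-3, 6, 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)   # regions = estimated indicator functions
plt.scatter(X[:, 0], X[:, 1], c=y)    # class label shown by color
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
```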

  12. Representing the decision boundary
  • Direct (geometric) approach: we directly represent the decision surface
  • Using some simple, continuous function such as lines (planes)
  • Indirect (decision-theoretic) approach:
  • 1. We assign a function to each class that tells, for any point of the space, how likely it is that the point belongs to the given class
  • 2. A given point is assigned the class label for which the discriminant function takes the largest value
  • The boundary between the classes is defined indirectly by the intersection of the discriminant functions
  • This way, the classification task is solved indirectly by learning the discriminant functions
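A minimal sketch of the decision-theoretic route: assign each class a discriminant function g_c(x) and label a point with the class whose discriminant value is largest. The linear discriminants, weights, and biases below are made-up examples:

```python
import numpy as np

# Hypothetical linear discriminant functions, one per class:
# g_c(x) = w_c . x + b_c  (weights and biases invented for illustration).
W = np.array([[ 1.0,  0.5],    # class 0
              [-0.5,  1.0],    # class 1
              [ 0.0, -1.0]])   # class 2
b = np.array([0.0, 0.2, -0.1])

def classify(x):
    scores = W @ x + b             # value of each discriminant function at x
    return int(np.argmax(scores))  # pick the class with the largest value

print(classify(np.array([0.3, 0.8])))
# The decision boundary between two classes is the set of points where their
# discriminant functions take equal values, i.e. where the functions intersect.
```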

  13. Further remarks (input/output)
  • It is important whether the examples have missing feature values
  • There exist methods to estimate the missing values
  • But most algorithms cannot handle them by default
  • This happens in several practical tasks (e.g. medical diagnostics)
  • It is important whether the algorithm can handle contradicting examples (the same feature vector with different class labels)
  • There are solutions to this
  • But some algorithms cannot handle it
  • It is very frequent in practice
  • Due to labelling mistakes, e.g. an ambiguous diagnosis
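As an illustration of estimating missing values, scikit-learn's SimpleImputer is one common preprocessing step; the data is invented, with NaN marking an unrecorded measurement:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Feature matrix with missing values (np.nan), e.g. an unrecorded lab result.
X = np.array([[39.2, 1.0],
              [np.nan, 0.0],
              [38.5, np.nan],
              [37.0, 1.0]])

# Replace each missing value by the column mean (other strategies exist,
# e.g. median or most frequent value).
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```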

  14. Representation of the function to be learned
  • Symbolic vs. numeric representation
  • This is an ancient debate in AI
  • 60s–70s: symbolic representation was preferred
  • E.g. logic formulas, if-then rules
  • Currently: numeric representation is preferred
  • E.g. in neural networks the representation consists of a bunch of real numbers
  • For certain tasks symbolic representation seems more suitable
  • E.g. automatic proving of mathematical theorems
  • For other tasks it makes no sense
  • E.g. image recognition
  • The most important aspect: does the model have to be well-structured and interpretable for human inspection?
  • Sometimes it does not matter, e.g. speech recognition
  • Sometimes human understanding is the goal, e.g. medical data mining

  15. What hypothesis space is used
  • Hypothesis space: the set of functions from which the algorithm selects the best-fitting one
  • Example: parametric methods
  • In the case of a continuous feature space, most methods use some parametric curve to represent the function to be learned
  • Example: regression with 1 variable
  • We fit a polynomial to the training points
  • Restricting the hypothesis space: we specify the degree of the polynomial
  • This restricts the set of possible functions
  • The parameters that influence the size of the hypothesis space are called meta-parameters
  • Training = finding the optimal parameters of the polynomial
  • In the example these are the coefficients of the polynomial
  • These are called the parameters of the model (see the sketch below)
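The 1-variable regression example can be sketched with NumPy: the polynomial degree is the meta-parameter that fixes the hypothesis space, and the fitted coefficients are the parameters of the model. The target function and noise level below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)  # noisy samples

degree = 3                          # meta-parameter: restricts the hypothesis space
coeffs = np.polyfit(x, y, degree)   # training: find the optimal coefficients
print(coeffs)                       # the parameters of the model

y_hat = np.polyval(coeffs, 0.25)    # the hypothesis evaluated at a new input
print(y_hat)
```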

  16. What hypothesis space is used 2
  • Hypothesis space: the set of functions from which the algorithm selects its hypothesis
  • Restricting the hypothesis space is technically necessary
  • Continuous feature space: it is impossible to represent all possible functions
  • Discrete space: the number of possible functions is finite, so theoretically we could represent all of them, but in practice there are usually too many combinations
  • It is also necessary for efficient (meaning well-generalizing) learning
  • Generalization requires that the system can give a reply to previously unseen examples
  • During training, we fit a model (function) from the hypothesis space to the data
  • The shape of this function plays a critical role in how the system replies to previously unseen data ("inductive bias")
  • Usually we work with mathematically simple function families
  • The optimal hypothesis space depends on the actual task!
  • Too restricted a hypothesis space → the model won't be able to learn even the training examples
  • Too wide a hypothesis space → it memorizes ("mugs up") the training examples, but cannot generalize (both cases are sketched below)
  • Similar to human learning (though we adjust the task to the child, and not the other way round…)
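Continuing the polynomial sketch: comparing training and held-out error across degrees illustrates the two failure modes named above. The degrees, sample sizes, and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(100)

for degree in (0, 3, 10):
    c = np.polyfit(x_train, y_train, degree)
    err_train = np.mean((np.polyval(c, x_train) - y_train) ** 2)
    err_test = np.mean((np.polyval(c, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {err_train:.3f}, test MSE {err_test:.3f}")

# degree 0: too restricted a hypothesis space -> cannot fit even the training points
# degree 10: too wide -> memorizes the training points but generalizes poorly
```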

  17. Which one it selects from among the possible hypotheses
  • Consistent hypothesis: one that gives the correct return value for the training examples
  • If more than one hypothesis is consistent, we have to choose among them
  • The training examples cannot help with this!
  • We need some heuristic for this
  • The principle of Occam's razor: "when there is more than one possible explanation, usually the simplest one turns out to be right"
  • Of course, we have to define the notion of "simplest" mathematically
  • E.g.: minimum description length

  18. What algorithm is used to find the best hypothesis
  • In the previous step we defined the criterion of the optimal hypothesis
  • In practice we frequently define it as a target function
  • Defining it is not enough; we also have to find it somehow
  • In the case of numerical models, optimizing the target function usually leads to a multivariate global optimization problem
  • Theoretically, we may use general-purpose global optimization algorithms for this
  • In most cases, however, we will have a training algorithm specially adapted to the needs of the actual machine learning model (a generic sketch follows below)
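A sketch of the numerical case: here the target function is the mean squared error of a linear model, minimized with plain gradient descent as a stand-in for a general-purpose optimizer. The data, learning rate, and step count are arbitrary assumptions; real learning algorithms are usually more specialized, as the slide notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=100)

# Target function: mean squared error of a linear model with parameters w.
def loss(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)
lr = 0.1
for step in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the target function
    w -= lr * grad                          # move downhill in parameter space

print(w, loss(w))   # w should approach true_w, and the loss should be small
```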
