A tutorial about SVM. Omer Boehm (omerb@il.ibm.com)
Outline • Introduction • Classification • Perceptron • SVM for linearly separable data • SVM for almost linearly separable data • SVM for non-linearly separable data
Introduction • Machine learning is a branch of artificial intelligence: a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data • An important task of machine learning is classification • Classification is also referred to as pattern recognition
Example • [Diagram: objects are fed into a learning machine, which assigns them to classes]
Types of learning problems • Supervised learning (n classes, n > 1) • Classification • Regression • Unsupervised learning (no labeled classes) • Clustering (building equivalence classes) • Density estimation
Supervised learning • Regression • Learn a continuous function from input samples • Stock prediction • Input – a future date • Output – the stock price • Training – information on the stock price over the last period • Classification • Learn a separation function from discrete inputs to classes • Optical Character Recognition (OCR) • Input – images of digits • Output – a label 0-9 • Training – labeled images of digits • In fact, both are approximation problems
What makes learning difficult? • Given the following examples, how should we draw the line?
What makes learning difficult? • Which one is the most appropriate?
What makes learning difficult? • The hidden test points
What is Learning (mathematically)? • We would like to ensure that small changes in an input point, relative to a learning (training) point, will not result in a jump to a different classification • Such an approximation is called a stable approximation • As a rule of thumb, small derivatives ensure a stable approximation
Stable vs. Unstable approximation • Lagrange approximation (unstable): given points $(x_i, y_i),\ i = 0, \dots, n$, we find the unique polynomial $p_n$ of degree at most $n$ that passes through the given points • Spline approximation (stable): given points $(x_i, y_i),\ i = 0, \dots, n$, we find a piecewise approximation by third degree polynomials such that they pass through the given points and have common tangents at the division points, and in addition have continuous second derivatives there
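A quick way to see the difference is to perturb one data point slightly and compare how much a full-degree interpolating polynomial moves versus a cubic spline. The sketch below uses NumPy/SciPy on a toy data set of my own choosing (the slides do not specify one):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sample points: a smooth function observed at a few x-values,
# with one slightly perturbed y-value to mimic a small input change.
x = np.linspace(-1, 1, 9)
y = 1.0 / (1.0 + 25 * x**2)           # Runge's function, a classic example
y_noisy = y.copy()
y_noisy[4] += 0.02                     # small change in one data point

# Lagrange-style interpolation: the unique degree-8 polynomial through all points.
poly = np.polyfit(x, y, deg=len(x) - 1)
poly_noisy = np.polyfit(x, y_noisy, deg=len(x) - 1)

# Cubic-spline interpolation: piecewise third-degree polynomials.
spline = CubicSpline(x, y)
spline_noisy = CubicSpline(x, y_noisy)

# Compare how far each approximation moves when a single point changes slightly.
t = np.linspace(-1, 1, 201)
print("max polynomial change:", np.max(np.abs(np.polyval(poly, t) - np.polyval(poly_noisy, t))))
print("max spline change    :", np.max(np.abs(spline(t) - spline_noisy(t))))
```

The polynomial's change is typically much larger than the spline's, which is the sense in which the spline approximation is stable.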
What would be the best choice? • The “simplest” solution • A solution where the distance from each example is as small as possible and where the derivative is as small as possible
Dot product • The dot product of two vectors $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ is defined as: $a \cdot b = \sum_{i=1}^{n} a_i b_i$ • An example: $(1, 2, 3) \cdot (4, -5, 6) = 4 - 10 + 18 = 12$
Dot product • Geometrically, $a \cdot b = |a|\,|b|\cos\theta$, where $|a|$ denotes the length (magnitude) of $a$ • Unit vector: $\hat{a} = a / |a|$, a vector of length 1 in the direction of $a$
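As a small illustration (the concrete numbers are mine, not from the slides), the dot product, magnitude, and unit vector can be computed with NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

dot = np.dot(a, b)             # sum of element-wise products: 4 - 10 + 18 = 12
length_a = np.linalg.norm(a)   # magnitude |a| = sqrt(1 + 4 + 9)
unit_a = a / length_a          # unit vector in the direction of a

# Angle between a and b, recovered from a.b = |a||b|cos(theta)
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, length_a, unit_a, np.degrees(np.arccos(cos_theta)))
```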
Plane/Hyperplane • A hyperplane can be defined by: • Three points • Two spanning vectors and a point • A normal vector and a point
Plane/Hyperplane • Let $w$ be a vector perpendicular to the hyperplane H • Let $x_0$ be the position vector of some known point in the plane. A point P with position vector $x$ is in the plane iff the vector drawn from $x_0$ to $x$ is perpendicular to $w$ • Two vectors are perpendicular iff their dot product is zero • The hyperplane H can be expressed as $w \cdot (x - x_0) = 0$, i.e. $w \cdot x + b = 0$ with $b = -w \cdot x_0$
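A minimal sketch of the point-in-plane test $w \cdot (x - x_0) = 0$; the values for $w$ and $x_0$ are illustrative choices of my own:

```python
import numpy as np

# Hyperplane through x0 with normal w:  w . (x - x0) = 0,
# equivalently  w . x + b = 0  with  b = -w . x0
w = np.array([1.0, 2.0, -1.0])    # normal vector (illustrative values)
x0 = np.array([0.0, 0.0, 3.0])    # a known point on the plane
b = -np.dot(w, x0)

def on_plane(x, tol=1e-9):
    return abs(np.dot(w, x) + b) < tol

print(on_plane(x0))                                # True: x0 lies on the plane
print(on_plane(x0 + np.array([2.0, -1.0, 0.0])))   # True: (2,-1,0) is orthogonal to w
print(on_plane(x0 + w))                            # False: moving along the normal leaves the plane
```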
Solving approximation problems • First we define the family of approximating functions F • Next we define the cost function $C(f)$. This function tells how well $f \in F$ performs the required approximation • Having done this, the approximation/classification consists of solving the minimization problem $\min_{f \in F} C(f)$ • A first necessary condition (after Fermat) is that the derivative of C vanishes • As we know, it is always possible to apply Newton-Raphson to this condition and obtain a sequence of approximations
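As a toy illustration of the Fermat condition combined with Newton-Raphson (the one-dimensional cost function below is my own choice, not from the slides):

```python
# Newton-Raphson on the derivative of a toy cost C(x) = (x - 2)^2 + 0.5*x:
# we look for C'(x) = 0, iterating x <- x - C'(x) / C''(x).
def C_prime(x):
    return 2.0 * (x - 2.0) + 0.5

def C_double_prime(x):
    return 2.0

x = 0.0                      # initial guess
for _ in range(10):
    x = x - C_prime(x) / C_double_prime(x)

print(x)                     # converges to 1.75, the minimizer of C
```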
Classification • A classifier is a function or an algorithm that maps every possible input (from a legal set of inputs) to a finite set of categories • X is the input space; $x \in X$ is a data point from the input space • A typical input space is high-dimensional, for example $X \subseteq \mathbb{R}^d$ with large $d$; the data point $x$ is also called a feature vector • $\Omega$ is a finite set of categories to which the input data points belong: $\Omega = \{1, 2, \dots, C\}$ • $\omega_i \in \Omega$ are called labels
Classification • Y is a finite set of decisions – the output set of the classifier • The classifier is a function $f: X \to Y$
Perceptron - Frank Rosenblatt (1957) • Linear separation of the input space
Perceptron algorithm • Start: The weight vector $w_0$ is generated randomly, set $t = 0$ • Test: A vector $x$ is selected randomly; if $x$ belongs to class $+$ and $w_t \cdot x > 0$, go to test; if $x$ belongs to class $+$ and $w_t \cdot x \le 0$, go to add; if $x$ belongs to class $-$ and $w_t \cdot x < 0$, go to test; if $x$ belongs to class $-$ and $w_t \cdot x \ge 0$, go to subtract • Add: $w_{t+1} = w_t + x$, $t = t + 1$, go to test • Subtract: $w_{t+1} = w_t - x$, $t = t + 1$, go to test
Perceptron algorithm • Shorter version, with labels $y_i \in \{-1, +1\}$ • Update rule for the $k{+}1$ iteration (one iteration per data point): if $y_i (w_k \cdot x_i) \le 0$ then $w_{k+1} = w_k + \eta\, y_i x_i$, otherwise $w_{k+1} = w_k$ • A sketch of this rule in code follows below
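A minimal sketch of this update rule in Python, assuming labels in {-1, +1} and a bias term absorbed into the weight vector (the tiny data set is an illustrative one of my own):

```python
import numpy as np

def perceptron(X, y, epochs=100, eta=1.0):
    """Train a perceptron on labels y in {-1, +1}.

    X has shape (n_samples, n_features); the bias is handled by
    appending a constant 1 feature to every sample.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # absorb the bias into w
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:              # misclassified (or on the boundary)
                w = w + eta * yi * xi                # update rule: w <- w + eta * y * x
                errors += 1
        if errors == 0:                              # converged: data is linearly separable
            break
    return w

# Tiny linearly separable example (AND-like data with labels -1/+1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # recovers y on the training set
```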
Perceptron - analysis • The solution is a linear combination of training points • Only uses informative points (mistake driven) • The coefficient of a point reflects its 'difficulty' • The perceptron learning algorithm does not terminate if the learning set is not linearly separable (e.g. XOR)
Advantages of SVM (Vladimir Vapnik, 1979, 1998) • Exhibit good generalization • Can implement confidence measures, etc. • Hypothesis has an explicit dependence on the data (via the support vectors) • Learning involves optimization of a convex function (no false minima, unlike NN) • Few parameters required for tuning the learning machine (unlike NN, where the architecture and various parameters must be found)
Advantages of SVM • From the perspective of statistical learning theory the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error. • These generalization bounds have two important features:
Advantages of SVM • The upper bound on the generalization error does not depend on the dimensionality of the space. • The bound is minimized by maximizing the margin, i.e. the minimal distance between the hyperplane separating the two classes and the closest data-points of each class.
In an arbitrary-dimensional space, a separating hyperplane can be written as: $w \cdot x + b = 0$ • Where $w$ is the normal and $b$ the bias • The decision function would be: $f(x) = \mathrm{sign}(w \cdot x + b)$
Note the decision function $f(x) = \mathrm{sign}(w \cdot x + b)$ is invariant under a rescaling of the form $w \to \lambda w$, $b \to \lambda b$ with $\lambda > 0$ • Implicitly the scale can be fixed by defining $|w \cdot x_i + b| = 1$ for the support vectors (canonical hyperplanes)
The task is to select $w$ and $b$ so that the training data can be described as: $w \cdot x_i + b \ge +1$ for $y_i = +1$, and $w \cdot x_i + b \le -1$ for $y_i = -1$ • These can be combined into: $y_i (w \cdot x_i + b) \ge 1$ for all $i$
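A small numerical check of the combined constraint and the decision function, using illustrative values for $w$, $b$ and the training points (not taken from the slides):

```python
import numpy as np

# Toy 2-D training set and a candidate canonical hyperplane (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([0.25, 0.25])
b = 0.0

margins = y * (X @ w + b)          # y_i (w . x_i + b) for every training point
print(margins)                     # canonical constraints hold iff all values >= 1
print(np.all(margins >= 1))

decision = np.sign(X @ w + b)      # decision function f(x) = sign(w . x + b)
print(decision)                    # recovers the labels y
```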
The margin will be given by the projection of the vector $(x_1 - x_2)$ onto the normal vector to the hyperplane, i.e. onto $w / |w|$, where $x_1$ and $x_2$ are the closest points on either side • So the (Euclidean) distance can be formed as $(x_1 - x_2) \cdot \frac{w}{|w|}$
Note that $x_1$ lies on $w \cdot x + b = +1$, i.e. $w \cdot x_1 + b = 1$ • Similarly for $x_2$: $w \cdot x_2 + b = -1$ • Subtracting the two results in $w \cdot (x_1 - x_2) = 2$
The margin can be put as $\frac{w}{|w|} \cdot (x_1 - x_2) = \frac{2}{|w|}$ • Can convert the problem to: minimize $J(w) = \frac{1}{2}|w|^2$ subject to the constraints: $y_i (w \cdot x_i + b) \ge 1$ for all $i$ • J(w) is a quadratic function, thus there is a single global minimum
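For a concrete (hedged) sketch of solving this quadratic program in practice, one option is scikit-learn's linear SVC with a large C, which approximates the hard-margin formulation; the toy data below is my own:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-2.0, -2.0], [-3.0, -1.0], [-1.0, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A large C approximates the hard-margin problem:
# minimize (1/2)|w|^2 subject to y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/|w| =", 2.0 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```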
Lagrange multipliers • Problem definition: Maximize $f(x, y)$ subject to $g(x, y) = c$ • A new variable $\lambda$ is used, called a 'Lagrange multiplier', to define the Lagrangian $\Lambda(x, y, \lambda) = f(x, y) + \lambda\,(g(x, y) - c)$
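A worked toy example of the multiplier method (my own choice of f and g, solved symbolically with SymPy):

```python
from sympy import symbols, solve

x, y, lam = symbols("x y lambda", real=True)

# Maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - 1 = 0
f = x + y
g = x**2 + y**2 - 1

# Lagrangian: L = f + lam * g; stationary points satisfy grad L = 0 and g = 0
L = f + lam * g
stationary = solve([L.diff(x), L.diff(y), g], [x, y, lam], dict=True)
print(stationary)
# Among the solutions, x = y = sqrt(2)/2 gives the maximum f = sqrt(2)
```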