A rough overview of Support Vector Machines
Presenter: Suhan Yu
References:
• Support vector machines and machine learning on documents. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Introduction to Information Retrieval, 2008.
• Support Vector Machines: Training and Applications. E. Osuna, et al. MIT A.I. Lab, 1997.
• An Improved Training Algorithm for Support Vector Machines. E. Osuna, et al. IEEE NNSP’97.
• A Tutorial on Support Vector Machines for Pattern Recognition. C. J. C. Burges. Data Mining and Knowledge Discovery, 1998.
• A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. T. Joachims. NIPS, 1997.
• Text Categorization with Support Vector Machines: Learning with Many Relevant Features. T. Joachims. 1997.
• http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
• http://www-csli.stanford.edu/~hinrich/newslides.html
• http://en.wikipedia.org/wiki/Quadratic_programming
• http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM3.pdf
The main idea of SVM • An SVM is a kind of large-margin classifier • Its goal is to find a decision boundary between two classes • The subject was started in the late seventies by Vapnik (1979) • Vladimir Naumovich Vapnik: Russian; Master's in mathematics, Ph.D. in statistics
Applications of SVM • Isolated handwritten digit recognition • Object recognition • Speaker identification • Face detection • Text categorization (Joachims, 1997)
Text classification • Earlier • TFIDF classifier • k-NN
Text classification • Earlier • Naïve Bayes Classifier • Rocchio • … • Today • SVM
Why should SVMs work well for text categorization? • High-dimensional input space • Learning text classifiers has to deal with more than 10,000 features • Few irrelevant features • Features are strongly related to one another • Document vectors are sparse
The main idea of SVM [Figure: the margin around the separating hyperplane]
Support Vector Machine (SVM) • SVMs maximize the margin around the separating hyperplane. • A.k.a. large margin classifiers • The decision function is fully specified by a subset of training samples, the support vectors. • Quadratic programming problem [Figure: support vectors on the margin; maximize margin]
Maximum Margin: Formalization • $w$: decision hyperplane normal • $x_i$: data point $i$ • $y_i$: class of data point $i$ (+1 or -1) NB: Not 1/0 • Classifier is: $f(x_i) = \mathrm{sign}(w^T x_i + b)$ • Functional margin of $x_i$ is: $y_i (w^T x_i + b)$
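As a minimal sketch of the two definitions on this slide, the classifier and the functional margin can be written in a few lines of plain Python. The hyperplane `w`, `b` and the three labelled points are hypothetical values chosen only for illustration, and mapping a score of exactly 0 to +1 is an arbitrary convention.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def classify(w, b, x):
    """f(x) = sign(w^T x + b), returned as +1 or -1 (a score of 0 maps to +1)."""
    return 1 if dot(w, x) + b >= 0 else -1


def functional_margin(w, b, x, y):
    """y * (w^T x + b); positive when x lies on the correct side of the hyperplane."""
    return y * (dot(w, x) + b)


# Hypothetical hyperplane and labelled points, chosen only for illustration.
w, b = [1.0, -1.0], 0.5
points = [([2.0, 0.0], +1), ([0.0, 3.0], -1), ([0.0, 1.0], +1)]

for x, y in points:
    # Prints the point, its true class, the predicted class and the functional margin.
    print(x, y, classify(w, b, x), functional_margin(w, b, x, y))
```

The third point has a negative functional margin, which signals that it is misclassified by this particular hyperplane.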
The planar decision surface in data space for the simple linear discriminant function $w^T x + b$ [Figure]
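Linear Support Vector Machine (SVM) • Hyperplane: $w^T x + b = 0$ • Extra scale constraint: $\min_{i=1,\dots,n} |w^T x_i + b| = 1$ • This implies: $w^T (x_a - x_b) = 2$, so $\rho = \|x_a - x_b\|_2 = 2/\|w\|_2$ [Figure: the hyperplane $w^T x + b = 0$ lies midway between $w^T x_a + b = 1$ and $w^T x_b + b = -1$, which are a distance $\rho$ apart]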
Linear SVM Mathematically • Assume that all data have functional margin at least 1 from the hyperplane; then the following two constraints follow for a training set $\{(x_i, y_i)\}$: $w^T x_i + b \ge 1$ if $y_i = 1$, and $w^T x_i + b \le -1$ if $y_i = -1$ • For support vectors, the inequality becomes an equality • Then, since each example’s distance from the hyperplane is $r = y_i (w^T x_i + b) / \|w\|$ • The margin is: $\rho = 2/\|w\|$
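Geometric Margin • Distance from an example $x$ to the separator is $r = y (w^T x + b) / \|w\|$ • Examples closest to the hyperplane are support vectors. • The margin $\rho$ of the separator is the width of separation between the support vectors of the two classes. [Figure: a point $x$ at distance $r$ from its projection $x'$ on the separator; margin $\rho$]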
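The relation between the geometric distance r and the margin ρ = 2/||w|| can be checked numerically. The sketch below uses a small toy data set and a hyperplane scaled so that the closest points satisfy |wᵀx + b| = 1; both the data and the values of `w` and `b` are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(w):
    """Euclidean length of w."""
    return math.sqrt(dot(w, w))


def geometric_distance(w, b, x, y):
    """r = y * (w^T x + b) / ||w||: signed distance from x to the separator."""
    return y * (dot(w, x) + b) / norm(w)


# Toy data (an assumption for illustration): two negatives and one positive.
data = [([1.0, 1.0], -1), ([2.0, 0.0], -1), ([2.0, 3.0], +1)]
# A hyperplane scaled so that the closest points satisfy |w^T x + b| = 1.
w, b = [0.4, 0.8], -2.2

distances = [geometric_distance(w, b, x, y) for x, y in data]
rho = 2.0 / norm(w)        # margin width from the formula
closest = min(distances)   # distance of the nearest examples (the support vectors)

print(distances)
print(rho, 2 * closest)    # the two values agree up to rounding
```

Here the points (1, 1) and (2, 3) are the closest examples, and twice their distance to the separator matches 2/||w||.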
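Linear SVM Mathematically • To summarize: find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w$ is minimized and, for all $\{(x_i, y_i)\}$, $y_i (w^T x_i + b) \ge 1$ • Quadratic function • A quadratic function $f$ is a function of the form $f(x) = \frac{1}{2} x^T Q x + c^T x$ • A necessary condition for a point $x$ to be a global minimizer is for it to satisfy the Karush-Kuhn-Tucker (KKT) conditions. The KKT conditions are also sufficient when $f(x)$ is convex. [Figure: a convex function]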
Linear SVM Mathematically • Lagrange Multiplier • Differentiating:
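As a sketch of the standard derivation (not necessarily in the exact form shown on the slide), the hard-margin problem of minimizing ½ wᵀw subject to yᵢ(wᵀxᵢ + b) ≥ 1 can be handled with one multiplier αᵢ ≥ 0 per constraint. Setting the derivatives with respect to w and b to zero and substituting back gives the dual objective Q(α):

```latex
\begin{aligned}
L(w, b, \alpha) &= \tfrac{1}{2} w^T w - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
Q(\alpha) &= \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j
\end{aligned}
```

The last line follows because, after substitution, the term in b vanishes and the two terms in w combine to −½ wᵀw.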
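Non-linear SVMs • Datasets that are linearly separable (with some noise) work out great: [Figure: separable 1-D data on the x axis] • But what are we going to do if the dataset is just too hard? [Figure: 1-D data on the x axis that no threshold separates] • How about … mapping data to a higher-dimensional space: [Figure: the same data plotted with x² against x]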
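A minimal sketch of the mapping idea, assuming a hypothetical 1-D data set in which the positive class sits between the negatives: after mapping x to (x, x²), a straight line separates the classes. The separating line used here is picked by hand, not learned.

```python
def phi(x):
    """Map a 1-D point to 2-D: x -> (x, x^2)."""
    return (x, x * x)


# Hypothetical 1-D data: the positive class sits between the negatives,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, +1, +1, +1, -1, -1]

# In the mapped space the line x^2 = 2.5 separates them:
# w = (0, -1), b = 2.5, so the score is 2.5 - x^2.
w, b = (0.0, -1.0), 2.5


def predict(x):
    """Classify a 1-D point with a linear rule in the mapped 2-D space."""
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1


predictions = [predict(x) for x in xs]
print(predictions == ys)
```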
Nonlinear SVMs • Project the linearly inseparable data to a high-dimensional space where it is linearly separable, and then we can use a linear SVM [Figure: + and - points on a line at -1, 0, +1, and the same points after projection, at (0,0), (1,0), (0,1)]
Need to transform the coordinates: polar coordinates, kernel transformation into higher dimensional space (support vector machines). [Figure: left panel, not linearly separable data; right panel, linearly separable data in polar coordinates, with axes "Distance from center (radius)" and "Angular degree (phase)"]
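The polar-coordinate transformation can be sketched as follows. The two concentric rings are hypothetical data, and the radius threshold is picked by hand, not learned.

```python
import math


def to_polar(x, y):
    """Return (radius, angle in radians) for a Cartesian point."""
    return math.hypot(x, y), math.atan2(y, x)


def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


# Hypothetical data: an inner ring (class -1) surrounded by an outer ring (+1).
# No straight line in the (x, y) plane separates the two rings.
inner = ring(1.0, 8)
outer = ring(4.0, 8)

# In polar coordinates a single threshold on the radius separates them.
threshold = 2.5
inner_ok = all(to_polar(x, y)[0] < threshold for x, y in inner)
outer_ok = all(to_polar(x, y)[0] > threshold for x, y in outer)
print(inner_ok, outer_ok)
```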
Non-linear SVMs: Feature spaces Φ: x→φ(x)
Non-linear SVMs: Feature spaces (cont’d) • Kernel functions and the kernel trick are used to transform data into a different, linearly separable feature space [Figure: the map φ(·) takes each point from the input space to the feature space]
Soft Margin Classification • If the training set is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples. • Allow some errors • Let some points be moved to where they belong, at a cost • Still, try to minimize training set errors, and to place the hyperplane “far” from each class (large margin) [Figure: slack variables $\xi_i$, $\xi_j$ for points on the wrong side of the margin]
Soft Margin Classification Mathematically • The old formulation: find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w$ is minimized and, for all $\{(x_i, y_i)\}$, $y_i (w^T x_i + b) \ge 1$ • The new formulation incorporating slack variables: find $w$ and $b$ such that $\Phi(w) = \frac{1}{2} w^T w + C \sum_i \xi_i$ is minimized and, for all $\{(x_i, y_i)\}$, $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$ • Parameter $C$ can be viewed as a way to control overfitting, acting as a regularization term
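To make the role of the slack variables and of C concrete, here is a small sketch that evaluates the soft-margin objective for a fixed hyperplane. For given w and b, the smallest feasible slack is ξᵢ = max(0, 1 − yᵢ(wᵀxᵢ + b)). The data, `w`, `b` and the two values of `C` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w^T x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, data, C):
    """Phi(w) = 1/2 w^T w + C * sum of slacks, with each slack at its minimum."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in data)


# Hypothetical data: the last point lies on the wrong side of the separator.
data = [([2.0, 0.0], +1), ([-2.0, 0.0], -1), ([-0.5, 0.0], +1)]
w, b = [1.0, 0.0], 0.0

slacks = [slack(w, b, x, y) for x, y in data]
print(slacks)
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=10.0))
```

Only the misclassified point needs a non-zero slack, and a larger C makes that single violation weigh more heavily in the objective.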
Soft Margin Classification – Solution • The dual problem for soft margin classification: find $\alpha_1 \dots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$ (2) $0 \le \alpha_i \le C$ for all $\alpha_i$ • Neither the slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem! • Again, $x_i$ with non-zero $\alpha_i$ will be support vectors. • Solution to the dual problem is: $w = \sum_i \alpha_i y_i x_i$ and $b = y_k (1 - \xi_k) - w^T x_k$, where $k = \arg\max_{k'} \alpha_{k'}$ • But $w$ is not needed explicitly for classification: $f(x) = \sum_i \alpha_i y_i x_i^T x + b$
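The point that w is not needed explicitly can be illustrated with a short sketch. The toy training set and the dual coefficients `alpha` below are assumptions made for illustration. The coefficients satisfy Σ αᵢyᵢ = 0, and the data are separable, so ξₖ = 0 in the formula for b.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Toy training set (an assumption for illustration) with dual coefficients
# that satisfy sum(alpha_i * y_i) = 0. Only support vectors have alpha > 0.
X = [[1.0, 1.0], [2.0, 0.0], [2.0, 3.0]]
y = [-1, -1, +1]
alpha = [0.4, 0.0, 0.4]

# w = sum_i alpha_i y_i x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# b = y_k (1 - xi_k) - w^T x_k with k = argmax alpha_k; xi_k = 0 here.
k = max(range(len(alpha)), key=alpha.__getitem__)
b = y[k] - dot(w, X[k])


def f_dual(x):
    """f(x) = sum_i alpha_i y_i x_i^T x + b, computed without forming w."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


def f_primal(x):
    """The same score computed as w^T x + b."""
    return dot(w, x) + b


x_new = [3.0, 3.0]
print(w, b)
print(f_dual(x_new), f_primal(x_new))  # the two scores agree up to rounding
```

Because the dual form touches the training points only through inner products, the same function works unchanged when the inner product is replaced by a kernel.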
Classification with SVMs • Given a new point $(x_1, x_2)$, we can score its projection onto the hyperplane normal: • In 2 dims: score $= w_1 x_1 + w_2 x_2 + b$ • I.e., compute score: $w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$ • Set confidence threshold $t$. Score $> t$: yes. Score $< -t$: no. Else: don’t know
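The three-way decision rule on this slide can be sketched directly. The weights, bias, threshold and test points are hypothetical values for illustration.

```python
def decide(score, t):
    """Three-way decision from an SVM score and a confidence threshold t >= 0."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"


def score_2d(w, b, x):
    """score = w1*x1 + w2*x2 + b for a 2-D point x = (x1, x2)."""
    return w[0] * x[0] + w[1] * x[1] + b


# Hypothetical weights, bias and threshold for illustration.
w, b, t = (1.0, 1.0), -4.0, 1.0

for point in [(4.0, 3.0), (0.0, 1.0), (2.0, 2.5)]:
    s = score_2d(w, b, point)
    print(point, s, decide(s, t))
```

Points whose score falls inside the band [-t, t] are left undecided, so raising t trades coverage for confidence.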
Kernels • Why use kernels? • Make a non-separable problem separable. • Map data into a better representational space • Common kernels • Linear • Polynomial: $K(x, z) = (1 + x^T z)^d$ • Radial basis function (infinite-dimensional space)
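The common kernels can be sketched in plain Python, together with a check of the kernel trick for the polynomial kernel with d = 2 in two dimensions. The Gaussian form used for the radial basis function and its width parameter `gamma` are assumptions, since the slide gives no formula for it. The explicit map `phi_quadratic` is included only to verify that the kernel equals an inner product in a higher-dimensional space.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(x, z):
    """K(x, z) = x^T z"""
    return dot(x, z)


def polynomial_kernel(x, z, d=2):
    """K(x, z) = (1 + x^T z)^d"""
    return (1.0 + dot(x, z)) ** d


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma is an assumed width parameter."""
    diff = [xi - zi for xi, zi in zip(x, z)]
    return math.exp(-gamma * dot(diff, diff))


def phi_quadratic(x):
    """Explicit 6-D feature map whose inner product equals (1 + x^T z)^2 in 2-D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]
via_kernel = polynomial_kernel(x, z, d=2)
via_map = dot(phi_quadratic(x), phi_quadratic(z))
print(via_kernel, via_map)  # equal up to rounding
```

The kernel evaluates one inner product in the 2-D input space, while the explicit route needs a 6-D feature vector for each point. That saving is what makes very high-dimensional (or, for the radial basis function, infinite-dimensional) feature spaces usable.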
The problem with SVMs • Training an SVM on large data sets (5,000 samples) is a very difficult problem to approach without some kind of data or problem decomposition [Osuna, 1997]
Features for text • Good feature engineering can often markedly improve the performance of a text classifier • Use terms as features • Document zones • Upweighting document zones • Separate feature spaces for document zones • Connections to text summarization • Relevance signals • Cosine score • Title match • Query term proximity: often very indicative of a document being on topic, especially with longer documents and on the web
Result ranking by machine learning • Classification problem vs. regression problem • Classification problem: a categorical variable is predicted • Regression problem: a real number is predicted • Ordinal regression • A ranking is predicted • The goal is to rank a set of documents for a query • Ranking SVM
Ranking SVM • Construct a vector of features for each document/query pair • For two documents, form the vector of feature differences • Other ranking methods • RankNet: uses a neural network for ranking • FRank: differs in its cost function
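The first two steps of the Ranking SVM slide can be sketched as a small helper that turns per-document feature vectors into labelled difference vectors. The function name `pairwise_differences`, the two features (a cosine score and a title-match flag) and the relevance grades are hypothetical. A binary linear classifier trained on the resulting pairs would then supply the ranking weights.

```python
def pairwise_differences(features, relevance):
    """Build (difference vector, label) pairs for the documents of one query.

    features[i]  : feature vector for document i paired with the query
    relevance[i] : graded relevance of document i (higher is better)
    The label is +1 if document i should rank above document j, else -1.
    """
    pairs = []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            if relevance[i] == relevance[j]:
                continue  # ties carry no ordering information
            diff = [a - b for a, b in zip(features[i], features[j])]
            label = 1 if relevance[i] > relevance[j] else -1
            pairs.append((diff, label))
    return pairs


# Hypothetical features per document: [cosine score, title match].
features = [[0.75, 1.0], [0.25, 0.0], [0.5, 1.0]]
relevance = [2, 0, 1]

pairs = pairwise_differences(features, relevance)
for diff, label in pairs:
    print(diff, label)
```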