Text Classification using SVM-light
DSSI 2008
Jing Jiang
Text Classification
• Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories
• Examples:
  • To classify news articles into “business” and “sports”
  • To classify Web pages into personal home pages and others
  • To classify product reviews into positive reviews and negative reviews
• Approach: supervised machine learning
  • For each predefined category, we need a set of training documents known to belong to that category.
  • From the training documents, we train a classifier.
Overview
• Step 1: text pre-processing
  • pre-process the text and represent each document as a feature vector
• Step 2: training
  • train a classifier using a classification tool (e.g., SNoW, SVM-light)
• Step 3: classification
  • apply the classifier to new documents
Pre-processing: tokenization
• Goal: to separate text into individual words
• Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
• Tool: Word Splitter
  http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=WS
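As a rough sketch (not the Word Splitter tool itself, whose rules are more complete), a toy C++ tokenizer that lowercases the input and splits off apostrophe clitics reproduces the example above:

  #include <cctype>
  #include <iostream>
  #include <string>
  #include <vector>

  // Toy tokenizer: lowercases, splits on whitespace/punctuation,
  // and keeps apostrophe clitics ("We're" -> "we", "'re").
  std::vector<std::string> tokenize(const std::string& text) {
      std::vector<std::string> tokens;
      std::string cur;
      for (char c : text) {
          if (std::isalpha(static_cast<unsigned char>(c))) {
              cur += std::tolower(static_cast<unsigned char>(c));
          } else if (c == '\'') {
              if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
              cur += '\'';  // start a clitic token such as 're
          } else {          // whitespace or punctuation ends the current token
              if (!cur.empty() && cur != "'") tokens.push_back(cur);
              cur.clear();
          }
      }
      if (!cur.empty() && cur != "'") tokens.push_back(cur);
      return tokens;
  }

  int main() {
      for (const std::string& t : tokenize("We're attending a tutorial now."))
          std::cout << t << '\n';  // prints: we 're attending a tutorial now
  }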
Pre-processing: stop word removal (optional)
• Goal: to remove common words that are usually not useful for text classification
• Example: remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
• Stop word list:
  http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
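Once the list is loaded into a hash set, filtering is a single pass over the tokens; a minimal C++ sketch, assuming a plain-text file with one stop word per line (as in the list above):

  #include <fstream>
  #include <string>
  #include <unordered_set>
  #include <vector>

  // Load one stop word per line into a hash set for O(1) lookups.
  std::unordered_set<std::string> loadStopWords(const std::string& path) {
      std::unordered_set<std::string> stop;
      std::ifstream in(path);
      std::string word;
      while (in >> word) stop.insert(word);
      return stop;
  }

  // Keep only the tokens that are not stop words.
  std::vector<std::string> removeStopWords(const std::vector<std::string>& tokens,
                                           const std::unordered_set<std::string>& stop) {
      std::vector<std::string> kept;
      for (const std::string& t : tokens)
          if (stop.count(t) == 0) kept.push_back(t);
      return kept;
  }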
Pre-processing: stemming (optional)
• Goal: to normalize words derived from the same root
• Examples:
  • attending → attend
  • teacher → teach
• Tool: Porter stemmer
  http://tartarus.org/~martin/PorterStemmer/
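The Porter algorithm is a cascade of context-sensitive rewrite rules; as a toy illustration of the idea only (not the Porter stemmer, whose reference implementations are at the URL above), a few hard-coded suffix rules cover the two examples:

  #include <string>

  // Crude suffix stripper, for illustration only.
  std::string crudeStem(std::string w) {
      auto endsWith = [&](const std::string& s) {
          return w.size() > s.size() &&
                 w.compare(w.size() - s.size(), s.size(), s) == 0;
      };
      if (endsWith("ing"))      w.erase(w.size() - 3);  // attending -> attend
      else if (endsWith("er"))  w.erase(w.size() - 2);  // teacher -> teach
      else if (endsWith("s") && !endsWith("ss"))
                                w.erase(w.size() - 1);  // tutorials -> tutorial
      return w;
  }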
Pre-processing: feature extraction
• Unigram features: use each word as a feature
  • Use TF (term frequency) as the feature value
  • Or use TF*IDF as the feature value, where IDF is the inverse document frequency:
    IDF = log(total-number-of-documents / number-of-documents-containing-t)
• Bigram features: use two consecutive words as a feature
• Tools:
  • Write your own program/script
  • Lemur API
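A minimal C++ sketch of these two weighting schemes (the corpus statistics totalDocs and docFreq are assumed to be counted elsewhere):

  #include <cmath>
  #include <map>
  #include <string>
  #include <vector>

  // Term frequency: raw count of each token in one document.
  std::map<std::string, int> termFreq(const std::vector<std::string>& tokens) {
      std::map<std::string, int> tf;
      for (const std::string& t : tokens) ++tf[t];
      return tf;
  }

  // TF*IDF weight for one term, given corpus-level counts:
  // docFreq = number of documents containing the term.
  double tfIdf(int tf, int docFreq, int totalDocs) {
      return tf * std::log(static_cast<double>(totalDocs) / docFreq);
  }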
Using Lemur to Extract Unigram Features

  // Open an existing Lemur index (built beforehand with Lemur's indexing tools).
  Index *ind = IndexManager::openIndex("index-file.key");
  int d1 = 1;  // internal ID of the document to inspect
  // Iterate over all terms occurring in document d1, with their frequencies.
  TermInfoList *tList = ind->termInfoList(d1);
  tList->startIteration();
  while (tList->hasMore()) {
      TermInfo *entry = tList->nextEntry();
      cout << "entry term id: " << entry->termID() << endl;
      cout << "entry term count: " << entry->termCount() << endl;
  }
  delete tList;
  delete ind;
SVM (Support Vector Machines)
• A learning algorithm for classification
• General: works for any classification problem (text classification is one example)
• Binary classification
• Maximizes the margin between the two classes
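Concretely, for linearly separable training pairs (x_i, y_i) with y_i in {+1, -1}, the standard hard-margin formulation finds the separating hyperplane w·x + b = 0 by solving

  \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
  \quad\text{subject to}\quad
  y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\quad i = 1,\dots,n

The margin between the two classes is 2/‖w‖, so minimizing ‖w‖ maximizes it. The soft-margin variant adds slack variables weighted by a tradeoff constant C, which is the -c option mentioned later.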
[Figure: illustration of the maximum-margin hyperplane; picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf]
SVM-light
• SVM-light: a command-line C program that implements the SVM learning algorithm
• Supports classification, regression, and ranking
• Download at http://svmlight.joachims.org/ (documentation on the same page)
• Two programs:
  • svm_learn for training
  • svm_classify for classification
SVM-light Examples
• Input format: each line is a label followed by sparse feature:value pairs, with feature IDs in increasing order
  1 1:0.5 3:1 5:0.4
  -1 2:0.9 3:0.1 4:2
• To train a classifier from train.data:
  svm_learn train.data train.model
• To classify new documents in test.data:
  svm_classify test.data train.model test.result
• Output format:
  • Positive score → positive class
  • Negative score → negative class
  • The absolute value of the score indicates confidence
• Command line options:
  • -c: tradeoff parameter between training error and margin (use cross-validation to tune)
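Generating this format from the extracted features is straightforward; a small C++ sketch (function and variable names are illustrative), using std::map so that feature IDs come out in the increasing order SVM-light expects:

  #include <fstream>
  #include <map>

  // Write one example in SVM-light format: "<label> <id>:<value> ...".
  void writeExample(std::ofstream& out, int label,
                    const std::map<int, double>& features) {
      out << label;
      for (const auto& f : features)
          out << ' ' << f.first << ':' << f.second;
      out << '\n';
  }

  // Usage: writeExample(out, 1, {{1, 0.5}, {3, 1.0}, {5, 0.4}});
  // produces the first example line shown above.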
More on SVM-light
• Kernels
  • Use the -t option to choose the kernel type
  • Polynomial kernel
  • User-defined kernel
• Semi-supervised learning (transductive SVM)
  • Use 0 as the label for unlabeled examples
  • Very slow
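For example (per the SVM-light documentation, -t 1 selects the polynomial kernel and -d sets its degree; svm_learn switches to transductive mode automatically when the training file contains 0-labeled examples):

  svm_learn -t 1 -d 2 -c 1.0 train.data train.model
    (trains with a degree-2 polynomial kernel)
  svm_learn -c 1.0 mixed.data tsvm.model
    (trains transductively if mixed.data mixes labeled and 0-labeled examples)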