Text Classification using SVM-light
DSSI 2008
Jing Jiang
Text Classification
• Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories
• Examples:
  • To classify news articles into “business” and “sports”
  • To classify Web pages into personal home pages and others
  • To classify product reviews into positive reviews and negative reviews
• Approach: supervised machine learning
  • For each predefined category, we need a set of training documents known to belong to that category.
  • From the training documents, we train a classifier.
Overview
• Step 1: text pre-processing
  • pre-process the text and represent each document as a feature vector
• Step 2: training
  • train a classifier using a classification tool (e.g., SNoW, SVM-light)
• Step 3: classification
  • apply the classifier to new documents
Pre-processing: tokenization
• Goal: to separate text into individual words
• Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
• Tool: Word Splitter
  http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=WS
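As a rough sketch (not the Word Splitter tool itself, whose rules are more complete), a toy C++ tokenizer that lowercases the input and splits off apostrophe clitics reproduces the example above:

  #include <cctype>
  #include <iostream>
  #include <string>
  #include <vector>

  // Toy tokenizer: lowercases, splits on whitespace/punctuation,
  // and keeps apostrophe clitics ("We're" -> "we", "'re").
  std::vector<std::string> tokenize(const std::string& text) {
      std::vector<std::string> tokens;
      std::string cur;
      for (char c : text) {
          if (std::isalpha(static_cast<unsigned char>(c))) {
              cur += std::tolower(static_cast<unsigned char>(c));
          } else if (c == '\'') {
              if (!cur.empty()) { tokens.push_back(cur); cur.clear(); }
              cur += '\'';  // start a clitic token such as 're
          } else {          // whitespace or punctuation ends the current token
              if (!cur.empty() && cur != "'") tokens.push_back(cur);
              cur.clear();
          }
      }
      if (!cur.empty() && cur != "'") tokens.push_back(cur);
      return tokens;
  }

  int main() {
      for (const std::string& t : tokenize("We're attending a tutorial now."))
          std::cout << t << '\n';  // prints: we 're attending a tutorial now
  }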
Pre-processing: stop word removal (optional)
• Goal: to remove common words that are usually not useful for text classification
• Example: remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
• Stop word list:
  http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
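Once the list is loaded into a hash set, filtering is a single pass over the tokens; a minimal C++ sketch, assuming a plain-text file with one stop word per line (as in the list above):

  #include <fstream>
  #include <string>
  #include <unordered_set>
  #include <vector>

  // Load one stop word per line into a hash set for O(1) lookups.
  std::unordered_set<std::string> loadStopWords(const std::string& path) {
      std::unordered_set<std::string> stop;
      std::ifstream in(path);
      std::string word;
      while (in >> word) stop.insert(word);
      return stop;
  }

  // Keep only the tokens that are not stop words.
  std::vector<std::string> removeStopWords(const std::vector<std::string>& tokens,
                                           const std::unordered_set<std::string>& stop) {
      std::vector<std::string> kept;
      for (const std::string& t : tokens)
          if (stop.count(t) == 0) kept.push_back(t);
      return kept;
  }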
Pre-processing: stemming (optional)
• Goal: to normalize words derived from the same root
• Examples:
  • attending → attend
  • teacher → teach
• Tool: Porter stemmer
  http://tartarus.org/~martin/PorterStemmer/
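The Porter algorithm is a cascade of context-sensitive rewrite rules; as a toy illustration of the idea only (not the Porter stemmer, whose reference implementations are at the URL above), a few hard-coded suffix rules cover the two examples:

  #include <string>

  // Crude suffix stripper, for illustration only.
  std::string crudeStem(std::string w) {
      auto endsWith = [&](const std::string& s) {
          return w.size() > s.size() &&
                 w.compare(w.size() - s.size(), s.size(), s) == 0;
      };
      if (endsWith("ing"))      w.erase(w.size() - 3);  // attending -> attend
      else if (endsWith("er"))  w.erase(w.size() - 2);  // teacher -> teach
      else if (endsWith("s") && !endsWith("ss"))
                                w.erase(w.size() - 1);  // tutorials -> tutorial
      return w;
  }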
Pre-processing: feature extraction
• Unigram features: use each word as a feature
  • Use TF (term frequency) as the feature value
  • Or use TF*IDF as the feature value, where IDF is the inverse document frequency:
    IDF = log(total-number-of-documents / number-of-documents-containing-t)
• Bigram features: use two consecutive words as a feature
• Tools:
  • Write your own program/script
  • Lemur API
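A minimal C++ sketch of these two weighting schemes (the corpus statistics totalDocs and docFreq are assumed to be counted elsewhere):

  #include <cmath>
  #include <map>
  #include <string>
  #include <vector>

  // Term frequency: raw count of each token in one document.
  std::map<std::string, int> termFreq(const std::vector<std::string>& tokens) {
      std::map<std::string, int> tf;
      for (const std::string& t : tokens) ++tf[t];
      return tf;
  }

  // TF*IDF weight for one term, given corpus-level counts:
  // docFreq = number of documents containing the term.
  double tfIdf(int tf, int docFreq, int totalDocs) {
      return tf * std::log(static_cast<double>(totalDocs) / docFreq);
  }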
Using Lemur to Extract Unigram Features

  // Open an existing Lemur index (built beforehand with Lemur's indexing tools).
  Index *ind = IndexManager::openIndex("index-file.key");
  int d1 = 1;  // internal ID of the document to inspect
  // Iterate over all terms occurring in document d1, with their frequencies.
  TermInfoList *tList = ind->termInfoList(d1);
  tList->startIteration();
  while (tList->hasMore()) {
      TermInfo *entry = tList->nextEntry();
      cout << "entry term id: " << entry->termID() << endl;
      cout << "entry term count: " << entry->termCount() << endl;
  }
  delete tList;
  delete ind;
SVM (Support Vector Machines)
• A learning algorithm for classification
• General: works for any classification problem (text classification is one example)
• Binary classification
• Maximizes the margin between the two classes
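Concretely, for linearly separable training pairs (x_i, y_i) with y_i in {+1, -1}, the standard hard-margin formulation finds the separating hyperplane w·x + b = 0 by solving

  \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
  \quad\text{subject to}\quad
  y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\quad i = 1,\dots,n

The margin between the two classes is 2/‖w‖, so minimizing ‖w‖ maximizes it. The soft-margin variant adds slack variables weighted by a tradeoff constant C, which is the -c option mentioned later.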
[Figure: illustration of the maximum-margin hyperplane; picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf]
SVM-light
• SVM-light: a command-line C program that implements the SVM learning algorithm
• Supports classification, regression, and ranking
• Download at http://svmlight.joachims.org/ (documentation on the same page)
• Two programs:
  • svm_learn for training
  • svm_classify for classification
SVM-light Examples
• Input format: each line is a label followed by sparse feature:value pairs, with feature IDs in increasing order
  1 1:0.5 3:1 5:0.4
  -1 2:0.9 3:0.1 4:2
• To train a classifier from train.data:
  svm_learn train.data train.model
• To classify new documents in test.data:
  svm_classify test.data train.model test.result
• Output format:
  • Positive score → positive class
  • Negative score → negative class
  • The absolute value of the score indicates confidence
• Command line options:
  • -c: tradeoff parameter between training error and margin (use cross-validation to tune)
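Generating this format from the extracted features is straightforward; a small C++ sketch (function and variable names are illustrative), using std::map so that feature IDs come out in the increasing order SVM-light expects:

  #include <fstream>
  #include <map>

  // Write one example in SVM-light format: "<label> <id>:<value> ...".
  void writeExample(std::ofstream& out, int label,
                    const std::map<int, double>& features) {
      out << label;
      for (const auto& f : features)
          out << ' ' << f.first << ':' << f.second;
      out << '\n';
  }

  // Usage: writeExample(out, 1, {{1, 0.5}, {3, 1.0}, {5, 0.4}});
  // produces the first example line shown above.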
More on SVM-light
• Kernels
  • Use the -t option to choose the kernel type
  • Polynomial kernel
  • User-defined kernel
• Semi-supervised learning (transductive SVM)
  • Use 0 as the label for unlabeled examples
  • Very slow
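For example (per the SVM-light documentation, -t 1 selects the polynomial kernel and -d sets its degree; svm_learn switches to transductive mode automatically when the training file contains 0-labeled examples):

  svm_learn -t 1 -d 2 -c 1.0 train.data train.model
    (trains with a degree-2 polynomial kernel)
  svm_learn -c 1.0 mixed.data tsvm.model
    (trains transductively if mixed.data mixes labeled and 0-labeled examples)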