Text Bundling: Statistics-Based Data Reduction by L. Shih, J.D.M. Rennie, Y. Chang and D.R. Karger. Presented by Steve Vincent, March 4, 2004
Text Classification • Domingos discussed the tradeoff between speed and accuracy in the context of very large databases • The best text classification algorithms are "super-linear" – each additional training point takes more time to train on than the previous one.
Text Classification • The most accurate text classifiers take a disproportionately long time to handle large numbers of training examples • Classifiers become impractical when faced with large data sets such as the OHSUMED data set • The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database • It consists of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991)
Data Reduction • Subsampling • Bagging • Feature selection
Subsampling • Retains a random subset of the original training data • Rather than preserving statistics of all the data, subsampling preserves all statistics of a random subset of the data • Subsampling is fast and easy to implement • Reducing to a single point per class via subsampling yields a single sample document, which gives almost no information about the nature of the class
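A minimal sketch of subsampling, assuming the training data is a list of (feature vector, label) pairs; the function name, keep fraction and seed are illustrative choices, not from the paper.

```python
import random

def subsample(training_data, keep_fraction=0.1, seed=0):
    """Retain a random subset of the original training data.

    training_data: list of (feature_vector, label) pairs.
    keep_fraction: fraction of points to keep (illustrative value).
    """
    rng = random.Random(seed)
    k = max(1, int(len(training_data) * keep_fraction))
    return rng.sample(training_data, k)
```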
Bagging • Partitions the original data set and learns a classifier on each partition; a test document is then labeled by majority vote of the classifiers • Training is fast because each classifier trains on only a subset of the original data • Testing is slow because multiple classifiers must be evaluated for each test example
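A rough sketch of this bagging scheme, assuming a user-supplied train_classifier function that returns an object with a predict method (a hypothetical interface, not an API from the paper).

```python
import random
from collections import Counter

def bag_train(training_data, n_partitions, train_classifier, seed=0):
    """Partition the data randomly and learn one classifier per partition."""
    rng = random.Random(seed)
    data = list(training_data)
    rng.shuffle(data)
    return [train_classifier(data[i::n_partitions]) for i in range(n_partitions)]

def bag_predict(classifiers, x):
    """Label a test example by majority vote of the partition classifiers."""
    votes = Counter(clf.predict(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```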
Feature Selection • Retains only the best features of a data set • All classifiers use some type of feature selection • If a classifier sees a feature as irrelevant, it simply ignores that feature • One type of feature selection ranks features according to |p(f_i|+) – p(f_i|-)|, where p(f_i|c) is the empirical frequency of f_i in class c of the training documents • There is little empirical evidence comparing the time savings of feature selection with the resulting loss in accuracy
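A sketch of this ranking criterion, assuming bag-of-words documents stored as word-count dicts and binary +1/-1 labels; here p(f_i|c) is read as the fraction of class-c documents containing f_i, which is one reasonable interpretation of "empirical frequency".

```python
def select_features(docs, labels, n_keep):
    """Rank features by |p(f_i|+) - p(f_i|-)| and keep the top n_keep."""
    pos = [d for d, y in zip(docs, labels) if y > 0]
    neg = [d for d, y in zip(docs, labels) if y <= 0]
    features = set().union(*docs)

    def doc_freq(f, class_docs):
        # Fraction of documents in the class that contain feature f.
        return sum(1 for d in class_docs if f in d) / max(1, len(class_docs))

    scores = {f: abs(doc_freq(f, pos) - doc_freq(f, neg)) for f in features}
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]
```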
Text Bundling • Text bundling generates a new, smaller training set by averaging together small groups of points • This preserves certain statistics over all the data instead of just a subset of the data • This application uses only one statistic (the mean), but it is possible to use multiple statistics
Bundling Algorithm • There is a tradeoff between speed and accuracy: the less raw information is retained, the faster the classifier runs and the less accurate the results are • Each data reduction technique operates by retaining some information and removing other information • By carefully selecting the statistics for a domain, we can optimize the information we retain
Bundling Algorithm • Bundling preserves a set of k user-chosen statistics, s = (s_1, …, s_k), where each s_i is a function that maps a set of data to a single value.
Global Constraint • The global constraint is that the reduced data set must preserve the chosen statistics of the full training set • There are many possible reduced data sets that satisfy this constraint • But we do not only want to preserve the global statistics, we also want to preserve additional information about the distribution • To get a reduced data set that satisfies the global constraint, we could generate several random points and then choose the remaining points so that the statistics are preserved • This would not retain any information about our data except for the chosen statistics
Local Constraint • We can retain some information besides the statistics by grouping together sets of points and preserving the statistics locally
Local Constraint • The bundling algorithm's local constraint is to maintain the same statistics between corresponding subsets of the training data • Because the focus is on statistics, the bundled data will generally not have any examples in common with the original data • The constraint ensures that certain global statistics are maintained, while also maintaining a relationship between certain partitions of the data in the original and bundled training sets
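A minimal sketch of this grouping step, assuming the data for one class is a NumPy array with one row per document and that the mean is the single preserved statistic; the bundle size is an illustrative parameter.

```python
import numpy as np

def bundle(X, bundle_size):
    """Replace each consecutive group of bundle_size rows with its mean,
    so the mean statistic is preserved within every group (the local
    constraint) rather than only over the data set as a whole."""
    groups = [X[i:i + bundle_size] for i in range(0, len(X), bundle_size)]
    return np.vstack([g.mean(axis=0) for g in groups])
```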
Text Bundling • The first step in bundling is to select a statistic or statistics to preserve • For text, the mean statistic of each feature is chosen • Rocchio and Multinomial Naïve Bayes perform classification using only the mean statistics of the data
Rocchio Algorithm • The Rocchio classification algorithm selects a decision boundary (a plane) perpendicular to the vector connecting the two class centroids • Let {x_11, …, x_1l1} and {x_21, …, x_2l2} be the sets of training vectors for the positive and negative classes • Let c_1 = (1/l_1) Σ_i x_1i and c_2 = (1/l_2) Σ_i x_2i be the class centroids • RocchioScore(x) = x · (c_1 – c_2) • With a threshold boundary b, an example is labeled according to the sign of the score minus the threshold: l_Rocchio(x) = sign(RocchioScore(x) – b)
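A sketch of the Rocchio classifier defined above, using NumPy arrays for the training vectors; the centroid and score formulas are from the slide, while the function names are illustrative.

```python
import numpy as np

def rocchio_centroids(pos_examples, neg_examples):
    """c_1 and c_2: the means of the positive and negative training vectors."""
    return np.mean(pos_examples, axis=0), np.mean(neg_examples, axis=0)

def rocchio_score(x, c1, c2):
    """RocchioScore(x) = x . (c_1 - c_2)."""
    return np.dot(x, c1 - c2)

def rocchio_label(x, c1, c2, b=0.0):
    """Label by the sign of the score minus the threshold b."""
    return 1 if rocchio_score(x, c1, c2) - b > 0 else -1
```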
Naïve Bayes • A basic naïve Bayes classifier labels a document d with the most probable class: l_NB(d) = argmax_{c_j} P(c_j) Π_k P(w_k | c_j), where the w_k are the features (words) appearing in document d and c_j ranges over the classes
Multinomial Naïve Bayes • Multinomial naïve Bayes has shown improvements over other naïve Bayes variants • The formula is: l_MNB(d) = argmax_{c_j} P(c_j) Π_k P(w_k | c_j)^n(w_k, d), where n(w_k, d) is the count of word w_k in document d
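A sketch of multinomial naïve Bayes under the bag-of-words assumption, with documents as word-count dicts; the Laplace smoothing constant alpha is an added assumption, not part of the formula above.

```python
import math
from collections import defaultdict

def mnb_train(docs, labels, alpha=1.0):
    """Estimate P(c_j) and smoothed P(w_k|c_j) from word counts."""
    vocab = set().union(*docs)
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)

    priors, word_probs = {}, {}
    for c, cdocs in by_class.items():
        priors[c] = len(cdocs) / len(docs)
        counts = defaultdict(float)
        for d in cdocs:
            for w, n in d.items():
                counts[w] += n
        total = sum(counts.values()) + alpha * len(vocab)
        word_probs[c] = {w: (counts[w] + alpha) / total for w in vocab}
    return priors, word_probs

def mnb_label(doc, priors, word_probs):
    """argmax over classes of log P(c_j) + sum_k n(w_k, d) log P(w_k|c_j)."""
    def log_posterior(c):
        score = math.log(priors[c])
        for w, n in doc.items():
            if w in word_probs[c]:          # ignore out-of-vocabulary words
                score += n * math.log(word_probs[c][w])
        return score
    return max(priors, key=log_posterior)
```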
Text Bundling • Assume that there is a set of training documents for each class • Apply bundling separately to each class • Let D = (d_1, …, d_n) be a set of documents • Use the "bag of words" representation, where each word is a feature and each document is represented as a vector of word counts
Text Bundling • di = {di1,…, diV} where the second subscript indexes the words and V is the size of the vocabulary. • Use the mean statistics for each feature as our text statistics. • Define the jth statistic as
Maximal Bundling • Reduce to a single point with the mean statistics • The jth component of the single point is s_j(D) • Using a linear classifier on this "maximal bundling" results in a decision boundary equivalent to Rocchio's decision boundary
Bundling Algorithms • Randomized bundling • Partitions points randomly • Needs only one pass over the training points • Poorly preserves data point locations in feature space • Rocchio bundling • Projects points onto a vector and partitions them so that points near one another in the projected space fall into the same bundle • Uses RocchioScore to sort documents by their score, then bundles consecutive sorted documents • Pre-processing time for Rocchio bundling is O(n log(m))
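A sketch of Rocchio bundling under the same NumPy representation used above: documents are ordered by their Rocchio score before consecutive ones are averaged, and the bundle size is again an illustrative parameter.

```python
import numpy as np

def rocchio_bundle(X, c1, c2, bundle_size):
    """Sort documents by their Rocchio score (projection onto c_1 - c_2),
    then average consecutive sorted documents into bundles, so documents
    that are near one another in the projected space share a bundle."""
    order = np.argsort(X @ (c1 - c2))
    X_sorted = X[order]
    groups = [X_sorted[i:i + bundle_size]
              for i in range(0, len(X_sorted), bundle_size)]
    return np.vstack([g.mean(axis=0) for g in groups])
```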
Data Sets • 20 Newsgroups (20 News) • A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups • Industry Sector (IS) • A collection of corporate web pages organized into categories based on what a company produces or does • There are 9,619 non-empty documents and 105 categories • Reuters 21578 (Reuters) • Consists of a set of 10,000 news stories
Experiment • Used a Support Vector Machine for classification • Used the SvmFu implementation with the penalty for misclassification of training points set to 10 • Coded pre-processing in C++ • Used Rainbow to pre-process the raw documents into feature vectors • Limited runs to 8 hours per run • Compared Bagging, Feature Selection, Subsampling, Random Bundling and Rocchio Bundling • Also ran experiments on the OHSUMED data set, but did not get results for all tests
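The slides do not describe the SvmFu configuration further; as a rough modern stand-in (my assumption, not the authors' setup), an equivalent linear SVM with the misclassification penalty set to 10 could be configured with scikit-learn:

```python
from sklearn.svm import LinearSVC

# C is the penalty for misclassifying training points; 10 matches the
# value quoted in the slide, but LinearSVC itself is a stand-in for SvmFu.
classifier = LinearSVC(C=10)
# classifier.fit(bundled_vectors, bundled_labels)   # hypothetical variable names
```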
Future Work • Extend bundling in both a theoretical and empirical sense. • May be possible to analyze or provide bounds on the loss in accuracy due to bundling • Would like to construct general methods for bundling sets of statistics • Would also like to extend bundling to other machine learning domains
References • P. Domingos, "When and how to subsample: Report on the KDD-2001 panel", SIGKDD Explorations • NIST Test Collections (http://trec.nist.gov/data.html) • D. Mladenic, "Feature subset selection in text-learning", Proceedings of the Tenth European Conference on Machine Learning • Xiaodan Zhu, "Junk Email Filtering with Large Margin Perceptrons", University of Toronto, Department of Computer Science (www.cs.toronto.edu/pub/xzhu/reports_nips2.doc)