Text Bundling: Statistics-Based Data Reduction
by L. Shih, J.D.M. Rennie, Y. Chang and D.R. Karger
Presented by Steve Vincent, March 4, 2004
Text Classification
• Domingos discussed the tradeoff of speed and accuracy in the context of very large databases
• The best text classification algorithms are "super-linear": each additional training point takes more time to train on than the previous point
Text Classification
• Most highly accurate text classifiers take a disproportionately large time to handle a large number of training examples
• Classifiers become impractical when faced with large data sets such as the OHSUMED data set
• The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database
• Consists of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991)
Data Reduction
• Subsampling
• Bagging
• Feature selection
Subsampling
• Retains a random subset of the original training data
• Subsampling does not preserve the entire set of data; rather, it preserves all statistics on a random subset of the data
• Subsampling is fast and easy to implement
• Reducing to a single point per class via subsampling yields a single sample document, which gives almost no information about the nature of the class
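A minimal Python sketch of the subsampling step described above; the function name and the fixed seed are illustrative assumptions, not part of the paper:

```python
import random

def subsample(docs, k, seed=0):
    """Keep a random subset of k training documents (illustrative sketch)."""
    rng = random.Random(seed)
    return rng.sample(docs, min(k, len(docs)))
```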
Bagging
• Partitions the original data set and learns a classifier on each partition; a test document is then labeled by majority vote of the classifiers
• Training is fast because each classifier is trained on only a subset of the original data
• Testing is slow since multiple classifiers must be evaluated for each test example
Feature Selection
• Retains only the best features of a data set
• All classifiers use some type of feature selection: if a classifier sees a feature as irrelevant, it simply ignores that feature
• One type of feature selection ranks features according to |p(f_i|+) - p(f_i|-)|, where p(f_i|c) is the empirical frequency of feature f_i in class c of the training documents
• There is little empirical evidence comparing the time savings of feature selection with the resulting loss in accuracy
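A short NumPy sketch of the ranking criterion on this slide. The array-based interface is assumed for illustration, and p(f|c) is interpreted here as the fraction of class-c documents containing feature f:

```python
import numpy as np

def rank_features(pos_counts, neg_counts):
    """Rank features by |p(f|+) - p(f|-)|.

    pos_counts, neg_counts: (n_docs, vocab_size) arrays of word counts."""
    p_pos = (pos_counts > 0).mean(axis=0)   # empirical frequency in the positive class
    p_neg = (neg_counts > 0).mean(axis=0)   # empirical frequency in the negative class
    return np.argsort(np.abs(p_pos - p_neg))[::-1]   # feature indices, best first
```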
Text Bundling
• Text bundling generates a new, smaller training set by averaging together small groups of points
• This preserves certain statistics of all the data instead of just a subset of the data
• This application uses only one statistic (the mean), but it is possible to use multiple statistics
Bundling Algorithm
• There is a tradeoff between speed and accuracy: the less raw information retained, the faster the classifier runs and the less accurate the results
• Each data reduction technique operates by retaining some information and removing other information
• By carefully selecting the statistics for a domain, we can optimize the information we retain
Bundling Algorithm
• Bundling preserves a set of k user-chosen statistics, s = (s_1, ..., s_k), where each s_i is a function that maps a set of data to a single value
Global Constraint
• There are many possible reduced data sets that can satisfy this constraint
• But we do not only want to preserve the global statistics; we also want to preserve additional information about the distribution
• To get a reduced data set that satisfies the global constraint, we could generate several random points and then choose the remaining points to preserve the statistics
• This does not retain any information about our data except for the chosen statistics
Local Constraint
• We can retain some information besides the statistics by grouping together sets of points and preserving the statistics locally
Local Constraint
• The bundling algorithm's local constraint is to maintain the same statistics between subsets of the training data
• Focusing on these statistics means that the bundled data will not have any examples in common with the original data
• This ensures that certain global statistics are maintained, while also maintaining a relationship between certain partitions of the data in the original and bundled training sets
Text Bundling
• The first step in bundling is to select a statistic or statistics to preserve
• For text, the mean statistic of each feature is chosen
• Rocchio and Multinomial Naïve Bayes perform classification using only the mean statistics of the data
Rocchio Algorithm
• The Rocchio classification algorithm selects a decision boundary (plane) that is perpendicular to the vector connecting the two class centroids
• Let {x_11, ..., x_1l1} and {x_21, ..., x_2l2} be the sets of training data for the positive and negative classes
• Let c_1 = (1/l_1) Σ_i x_1i and c_2 = (1/l_2) Σ_i x_2i be the centroids of the classes
• RocchioScore(x) = x · (c_1 - c_2)
• With a threshold boundary b, an example is labeled according to the sign of the score minus the threshold value: l_Rocchio(x) = sign(RocchioScore(x) - b)
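A minimal NumPy sketch of the Rocchio rule as stated on this slide; the function name and the array-based inputs are assumptions made only for illustration:

```python
import numpy as np

def rocchio_label(x, pos_docs, neg_docs, b=0.0):
    """Label a document vector x with the Rocchio rule."""
    c1 = pos_docs.mean(axis=0)        # centroid of the positive class
    c2 = neg_docs.mean(axis=0)        # centroid of the negative class
    score = x @ (c1 - c2)             # RocchioScore(x) = x . (c1 - c2)
    return np.sign(score - b)         # +1 -> positive class, -1 -> negative class
```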
Naïve Bayes
• A basic naïve Bayes classifier labels a document d with the class c_j that maximizes P(c_j) Π_k P(w_k | c_j), where the product runs over the features (words) w_k used in d
Multinomial Naïve Bayes
• Multinomial Naïve Bayes has shown improvements over other Naïve Bayes variants
• The classification rule is l_MNB(d) = argmax_j [ log P(c_j) + Σ_k n(w_k, d) log P(w_k | c_j) ], where n(w_k, d) is the count of word w_k in document d
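A small sketch of the multinomial rule above in NumPy. The interface and the Laplace smoothing constant alpha are assumptions for illustration, not details given on the slide:

```python
import numpy as np

def mnb_label(x, class_word_counts, class_priors, alpha=1.0):
    """Pick argmax_j [ log P(c_j) + sum_k n(w_k, d) log P(w_k | c_j) ].

    x: (vocab_size,) word-count vector for the test document.
    class_word_counts: (n_classes, vocab_size) summed word counts per class.
    class_priors: (n_classes,) empirical class priors.
    alpha: Laplace smoothing constant (assumed)."""
    smoothed = class_word_counts + alpha
    log_p_w = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return int(np.argmax(np.log(class_priors) + log_p_w @ x))
```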
Text Bundling
• Assume that there is a set of training documents for each class
• Apply bundling separately to each class
• Let D = (d_1, ..., d_n) be a set of documents
• Use the "bag of words" representation, where each word is a feature and each document is represented as a vector of word counts
Text Bundling • di = {di1,…, diV} where the second subscript indexes the words and V is the size of the vocabulary. • Use the mean statistics for each feature as our text statistics. • Define the jth statistic as
Maximal Bundling
• Reduce to a single point with the mean statistics
• The jth component of the single point is s_j(D)
• Using a linear classifier on this "maximal bundling" results in a decision boundary equivalent to Rocchio's decision boundary
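A one-line sketch of maximal bundling for a single class, assuming the documents are stacked as rows of a NumPy array:

```python
import numpy as np

def maximal_bundle(docs):
    """Collapse one class's documents to a single point whose jth component
    is the mean statistic s_j(D) = (1/n) * sum_i d_ij."""
    return docs.mean(axis=0)
```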
Bundling Algorithms
• Randomized bundling
  • Partitions points randomly
  • Needs only one pass over the training points
  • Poorly preserves data point locations in feature space
• Rocchio bundling
  • Projects points onto a vector and partitions points that are near one another in the projected space
  • Uses RocchioScore to sort documents by their score, then bundles consecutive sorted documents
  • Pre-processing time for Rocchio bundling is O(n log(m))
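A hedged NumPy sketch of the two bundling variants just described: groups of documents within one class are averaged into bundled points, with either a random order or the Rocchio-score order. Function names, the other_centroid parameter, and the array interface are assumptions for illustration:

```python
import numpy as np

def bundle(docs, n_bundles, order=None):
    """Average groups of one class's documents into bundled points.

    With order=None the partition is random (randomized bundling); with an
    order sorted by RocchioScore it becomes the Rocchio bundling variant."""
    idx = np.random.permutation(len(docs)) if order is None else np.asarray(order)
    groups = np.array_split(idx, n_bundles)
    return np.vstack([docs[g].mean(axis=0) for g in groups])

def rocchio_bundle(docs, other_centroid, n_bundles):
    """Sort this class's documents by score against (c_this - c_other),
    then bundle consecutive documents in the sorted order."""
    scores = docs @ (docs.mean(axis=0) - other_centroid)
    return bundle(docs, n_bundles, order=np.argsort(scores))
```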
Data Sets
• 20 Newsgroups (20 News)
  • A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups
• Industry Sector (IS)
  • A collection of corporate web pages organized into categories based on what a company produces or does
  • There are 9,619 non-empty documents and 105 categories
• Reuters 21578 (Reuters)
  • Consists of a set of 10,000 news stories
Experiment
• Used a Support Vector Machine for classification
• Used the SvmFu implementation with the penalty for misclassification of training points set to 10
• Coded pre-processing in C++
• Used Rainbow to pre-process the raw documents into feature vectors
• Limited runs to 8 hours per run
• Compared Bagging, Feature Selection, Subsampling, Random Bundling and Rocchio Bundling
• Also ran the experiment on OHSUMED, but did not get results for all tests
Future Work
• Extend bundling in both a theoretical and an empirical sense
• It may be possible to analyze or provide bounds on the loss in accuracy due to bundling
• Would like to construct general methods for bundling sets of statistics
• Would also like to extend bundling to other machine learning domains
References
• P. Domingos, "When and how to subsample: Report on the KDD-2001 panel", SIGKDD Explorations.
• NIST Test Collections (http://trec.nist.gov/data.html).
• D. Mladenic, "Feature subset selection in text-learning", Proceedings of the Tenth European Conference on Machine Learning.
• Xiaodan Zhu, "Junk Email Filtering with Large Margin Perceptrons", University of Toronto, Department of Computer Science (www.cs.toronto.edu/pub/xzhu/reports_nips2.doc).