Active Learning: An example
From Xu et al., “Training SpamAssassin with Active Semi-Supervised Learning”
Semi-Supervised and Active Learning
• Semi-supervised learning: using a combination of labeled and unlabeled examples, or using partially labeled examples
• Active learning: having the learning system decide which examples to ask an oracle to label
SpamAssassin
• SpamAssassin:
  • Asks users to label e-mail, but they often don’t.
  • Also, they may not label the “most informative” examples.
• SpamAssassin “self-training”:
  • Train a classifier on a small number of labeled examples.
  • Run the classifier on the unlabeled examples. Add the ones classified with high confidence to the original training set. (Problem: the ones classified with high confidence are not necessarily the most informative ones.)
  • Retrain the classifier with the new, larger training set.
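The self-training loop above can be sketched in a few lines of Python. This is a minimal illustration, not SpamAssassin's actual implementation: the tiny Naive Bayes model, the function names (`train_nb`, `predict_proba`, `self_train`), and the `threshold` and `max_rounds` parameters are all assumptions made for the example.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a tiny multinomial Naive Bayes model on tokenized documents."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, doc_counts, vocab

def predict_proba(model, tokens):
    """Return {class: posterior probability}, using Laplace smoothing."""
    classes, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for c in classes:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.9, max_rounds=5):
    """Move confidently classified examples into the training set, then retrain."""
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    model = train_nb(docs, labs)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for tokens in pool:
            probs = predict_proba(model, tokens)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:  # high confidence: trust the predicted label
                docs.append(tokens)
                labs.append(best)
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:  # nothing new was learned, so stop
            break
        model = train_nb(docs, labs)  # retrain on the larger training set
    return model, docs, labs, pool
```

Note that an e-mail sharing no words with the training data gets equal posteriors and stays in the pool. That is the weakness the slide points out: the examples the classifier is least sure about, which are often the most informative, never receive a label.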
Xu et al. paper: Method
• Supervised learning: Train a Naive Bayes classifier on a small subset of (labeled) e-mails.
• Semi-supervised learning: Then run SpamAssassin’s self-training method, selecting a large number of new examples to add to the training set. Retrain the classifier.
• Active learning:
  • Cluster the remaining unlabeled e-mails using k-means (on term-frequency feature vectors) with Euclidean distance.
  • Select q representative unlabeled e-mails, first from “pure” clusters, then from “impure” clusters, making sure that many clusters are sampled from. The e-mails selected from each cluster are the ones closest to the cluster centroid.
  • Ask the user to label these q examples.
  • For each of these q examples, if the corresponding cluster is “pure”, propagate this label to a fraction p of that cluster.
  • Add the newly labeled examples to the training set, and retrain the classifier.
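The active-learning step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code. The slide does not define “pure”, so the sketch assumes a cluster is pure when the current classifier's guessed labels for all its members agree. It also queries only one e-mail per cluster, and propagates a label to the fraction p of members nearest the centroid. The names `kmeans`, `select_queries`, `propagate`, `guessed`, and `oracle` are hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means with Euclidean distance; seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

def select_queries(points, assign, centroids, guessed, q):
    """Pick up to q e-mails, one per cluster, nearest the centroid; pure clusters first."""
    clusters = {}
    for i, a in enumerate(assign):
        clusters.setdefault(a, []).append(i)

    def is_pure(members):
        # Assumption: "pure" means all guessed labels in the cluster agree.
        return len({guessed[i] for i in members}) == 1

    ordered = sorted(clusters, key=lambda c: (not is_pure(clusters[c]), c))
    queries = []
    for c in ordered[:q]:
        nearest = min(clusters[c], key=lambda i: math.dist(points[i], centroids[c]))
        queries.append((nearest, c, is_pure(clusters[c])))
    return queries, clusters

def propagate(points, queries, clusters, centroids, oracle, p):
    """Ask the oracle about each query; spread the answer over a fraction p of pure clusters."""
    new_labels = {}
    for idx, c, pure in queries:
        label = oracle(idx)  # in practice, the user labels this e-mail
        new_labels[idx] = label
        if pure:
            ranked = sorted(clusters[c], key=lambda i: math.dist(points[i], centroids[c]))
            for i in ranked[:int(p * len(ranked))]:
                new_labels[i] = label
    return new_labels
```

The returned dictionary maps e-mail indices to labels. Those examples would then be added to the training set before retraining the classifier.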
Xu et al. paper: Results • Ran on a large corpus (75K) of e-mails.