
Semi-Supervised and Active Learning


mills

Presentation Transcript


  1. Active Learning: An Example
     From Xu et al., “Training SpamAssassin with Active Semi-Supervised Learning”

  2. Semi-Supervised and Active Learning
     • Semi-supervised learning: Using a combination of labeled and unlabeled examples, or using partially labeled examples.
     • Active learning: Having the learning system decide which examples to ask an oracle to label.

  3. SpamAssassin
     • SpamAssassin:
        • Asks users to label e-mail, but they often don't.
        • Also, the e-mails they do label may not be the “most informative” examples.
     • SpamAssassin “self-training”:
        • Train a classifier on a small number of labeled examples.
        • Run the classifier on the unlabeled examples. Add the ones classified with high confidence to the original training set. (Problem: the ones classified with high confidence are not necessarily the most informative ones.)
        • Retrain the classifier with the new, larger training set.
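The self-training loop on this slide can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's actual code: the tiny multinomial Naive Bayes model, the function names (`train_nb`, `predict_proba`, `self_train`), the token-list e-mail representation, and the confidence `threshold` are all assumptions made for the example.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (Laplace smoothing applied at prediction time)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    return classes, vocab, counts, priors

def predict_proba(model, doc):
    """Return {class: posterior probability} for a tokenized e-mail."""
    classes, vocab, counts, priors = model
    log_scores = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        score = math.log(priors[c])
        for w in doc:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[c][w] + 1) / total)
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.9):
    """One round of self-training: add confident predictions, then retrain."""
    model = train_nb(labeled_docs, labels)
    docs, ys = list(labeled_docs), list(labels)
    remaining = []
    for d in unlabeled_docs:
        probs = predict_proba(model, d)
        best = max(probs, key=probs.get)
        if probs[best] >= threshold:  # hypothetical confidence cutoff
            docs.append(d)
            ys.append(best)
        else:
            remaining.append(d)
    return train_nb(docs, ys), docs, ys, remaining
```

Calling `self_train(labeled, labels, unlabeled, threshold=0.75)` returns the retrained model, the enlarged training set, and the e-mails that were not confident enough to add. The sketch also makes the slide's objection visible: only examples the classifier already finds easy get added, so the new training data may carry little new information.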

  4. Xu et al. paper: Method
     • Supervised learning: Train a Naive Bayes classifier on a small subset of (labeled) e-mails.
     • Semi-supervised learning: Then run SpamAssassin's self-training method, selecting a large number of new examples to add to the training set. Retrain the classifier.
     • Active learning:
        • Cluster the remaining unlabeled e-mails using k-means (on term-frequency feature vectors) with Euclidean distance.
        • Select q representative unlabeled e-mails, first from “pure” clusters, then from “impure” clusters, making sure that many clusters are sampled from. The e-mails selected from each cluster are the ones closest to the cluster centroid.
        • Ask the user to label these q examples.
        • For each of these q examples, if the corresponding cluster is “pure”, propagate its label to a fraction p of that cluster.
        • Add the newly labeled examples to the training set, and retrain the classifier.
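The active-learning step can be sketched as follows. This is a rough illustration under several assumptions the slide does not state: a cluster counts as “pure” when the current classifier predicts the same label for all of its members, at most one representative is queried per cluster, propagation goes to the members nearest the centroid, and k-means starts from the first k points. The names `kmeans`, `select_and_propagate`, and `oracle` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; initial centroids are the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

def select_and_propagate(points, predicted, k, q, p, oracle):
    """Query up to q cluster representatives; return {e-mail index: label}."""
    assign, centroids = kmeans(points, k)
    clusters = {j: [i for i, a in enumerate(assign) if a == j] for j in range(k)}
    clusters = {j: m for j, m in clusters.items() if m}
    # Assumption: "pure" means the classifier predicts one label for the whole cluster.
    pure = {j: len({predicted[i] for i in m}) == 1 for j, m in clusters.items()}
    # Visit pure clusters first, then impure ones, one representative per cluster.
    order = sorted(clusters, key=lambda j: (not pure[j], j))
    new_labels = {}
    for j in order[:q]:
        ranked = sorted(clusters[j], key=lambda i: dist(points[i], centroids[j]))
        rep = ranked[0]                      # e-mail closest to the centroid
        label = oracle(rep)                  # ask the user for the true label
        new_labels[rep] = label
        if pure[j]:
            n_prop = int(p * len(ranked))    # fraction p of the cluster
            for i in ranked[1:1 + n_prop]:   # assumption: nearest members first
                new_labels[i] = label
    return new_labels
```

Here `points` holds the term-frequency vectors of the unlabeled e-mails, `predicted` holds the current classifier's label for each one, and `oracle(i)` stands in for asking the user to label e-mail `i`. The returned labels would then be added to the training set before retraining.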

  5. Xu et al. paper: Results • Ran on a large corpus (75K) of e-mails.
