Active Learning for Imbalanced Sentiment Classification
Shoushan Li, Shengfeng Ju, Guodong Zhou, and Xiaojun Li. 2012.
Presented by: Veronica Perez
Introduction
• Sentiment analysis is the task of identifying the sentiment polarity of natural language text toward a given topic
• Sentiment analysis usually relies on data-driven methods, which depend on a large amount of labeled data
• Data annotation is expensive and time-consuming
• Semi-supervised methods have been proposed to make effective use of a small amount of labeled data together with a large amount of unlabeled data
What about imbalanced data?
• Existing supervised methods assume a balance between positive and negative samples
• Imbalanced data is a common problem in sentiment classification
• In this setting there are many more samples of one class (the majority class, MA) than of the other (the minority class, MI)
• Which class is more informative?
• Can we benefit from an active learning approach?
Imbalanced sentiment classification
Main problem
• How to select the most informative MI samples?
  • Uncertainty?
  • Certainty?
• Is it possible to balance between uncertainty sampling and certainty sampling in imbalanced sentiment classification?
Uncertainty vs. certainty measurements for the sentiment analysis problem
• Posterior probability of the document d belonging to the class
• This measurement is required to guarantee the selection of MI samples
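
The formulas on this slide did not survive extraction. As a minimal sketch, assuming a binary task and that both measurements are simple functions of the posterior P(pos | d), certainty can be read as "nearest to 1" and uncertainty as "nearest to 0.5". The function names and the exact min/max form are assumptions for illustration, not necessarily the definitions used in the paper.

```python
def certainty(p_pos: float) -> float:
    """Confidence in the predicted class: the larger of the two posteriors."""
    return max(p_pos, 1.0 - p_pos)


def uncertainty(p_pos: float) -> float:
    """Closeness to the 0.5 boundary: the smaller of the two posteriors."""
    return min(p_pos, 1.0 - p_pos)


# Hypothetical posteriors P(pos | d) for three documents.
posteriors = {"d1": 0.97, "d2": 0.55, "d3": 0.10}

most_certain = max(posteriors, key=lambda d: certainty(posteriors[d]))
most_uncertain = max(posteriors, key=lambda d: uncertainty(posteriors[d]))
```

Here `d1` is the most certain document and `d2`, whose posterior sits closest to 0.5, is the most uncertain one.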
Co-selecting with feature subspace classifiers
Loop N iterations:
(1) Randomly select a feature subset SF of size r (with the proportion θ = r/m) from F
(2) Generate a feature subspace from SF and train a corresponding feature subspace classifier Ccer with L
(3) Generate another feature subspace from the complement set of SF, i.e., F - SF, and train a corresponding feature subspace classifier Cuncer with L
(4) Use Ccer to select the top k certain positive and k certain negative samples, denoted as the sample set CER1
(5) Use Cuncer to select the most uncertain positive sample and the most uncertain negative sample from CER1
(6) Manually annotate the two selected samples
(7) If the annotated labels of the two selected samples differ from each other, add the two newly annotated samples to L
[Diagram: F is split into Fs and F - Fs; Ccer and Cuncer are trained on L in the two subspaces. Ccer selects the top k positives and k negatives from U (CER1); Cuncer picks one positive and one negative from CER1 for manual annotation; a pos/neg pair is added to L, otherwise the samples are discarded.]
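
One iteration of the steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: documents are sets of unigram features, `train` is any callable that returns a scorer for P(pos | d), and `toy_train`, `co_select`, `update`, and the toy data are hypothetical names introduced here. The paper trains maximum entropy classifiers instead of the toy scorer.

```python
import random
from typing import Callable, List, Set, Tuple

Doc = Set[str]                              # document = set of unigram features
Labeled = List[Tuple[Doc, str]]             # (document, "pos" or "neg")
Scorer = Callable[[Doc], float]             # returns P(pos | document)


def split_features(features: Set[str], theta: float, rng: random.Random):
    """Steps (1)-(3): random subset with proportion theta, plus its complement."""
    ordered = sorted(features)
    r = max(1, round(theta * len(ordered)))
    subset = set(rng.sample(ordered, r))
    return subset, set(ordered) - subset


def toy_train(labeled: Labeled, subspace: Set[str]) -> Scorer:
    """Hypothetical stand-in classifier: smoothed vote of features seen per class."""
    pos = {f for d, y in labeled if y == "pos" for f in d & subspace}
    neg = {f for d, y in labeled if y == "neg" for f in d & subspace}

    def score(doc: Doc) -> float:
        p, n = len(doc & pos) + 1, len(doc & neg) + 1
        return p / (p + n)

    return score


def co_select(labeled, unlabeled, features, train, theta, k, rng):
    """Steps (1)-(5): returns indices of one positive and one negative pick."""
    subset, complement = split_features(features, theta, rng)
    c_cer = train(labeled, subset)
    c_uncer = train(labeled, complement)
    # Step (4): rank by C_cer; assumes 2 * k <= len(unlabeled) so the sets differ.
    ranked = sorted(range(len(unlabeled)), key=lambda i: c_cer(unlabeled[i]))
    top_neg, top_pos = ranked[:k], ranked[-k:]

    # Step (5): among the certain samples, take the one C_uncer is least sure of.
    def distance_to_half(i: int) -> float:
        return abs(c_uncer(unlabeled[i]) - 0.5)

    return min(top_pos, key=distance_to_half), min(top_neg, key=distance_to_half)


def update(labeled, unlabeled, picks, oracle) -> bool:
    """Steps (6)-(7): annotate both picks; keep them only if the labels differ."""
    i, j = picks
    yi, yj = oracle(unlabeled[i]), oracle(unlabeled[j])
    if yi != yj:
        labeled.extend([(unlabeled[i], yi), (unlabeled[j], yj)])
        return True
    return False


features = {"good", "great", "bad", "awful", "book", "plot", "fine", "poor"}
labeled = [({"good", "great"}, "pos"), ({"bad", "awful"}, "neg")]
unlabeled = [{"good", "book"}, {"bad", "plot"}, {"great", "fine"}, {"awful", "poor"}]

picks = co_select(labeled, unlabeled, features, toy_train,
                  theta=0.5, k=2, rng=random.Random(0))
added = update(labeled, unlabeled, picks,
               oracle=lambda d: "pos" if d & {"good", "great"} else "neg")
```

Passing the trainer and the oracle as callables keeps the selection logic independent of the classifier and of how manual annotation is obtained.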
Co-selecting with Selected MA Samples Automatically Labeled
Loop N iterations:
(1) Randomly select a proportion of features (with the proportion θ) from F to get a feature subset Fs
(2) Generate a feature subspace from Fs and train a corresponding subspace classifier Ccer with L
(3) Generate another feature subspace from the complement set of Fs, i.e., F - Fs, and train a corresponding subspace classifier Cuncer with L
(4) Use Ccer to select the top k certain positive and k certain negative samples, denoted as the sample set CER1
(5) Use Cuncer to select the most uncertain positive sample and the most uncertain negative sample from CER1
(6) Manually annotate the sample that Ccer predicts to be an MI sample, and automatically annotate the sample that is predicted to be the majority class
(7) If the annotated labels of the two selected samples differ from each other, add the two newly annotated samples to L
[Diagram: F is split into Fs and F - Fs; Ccer and Cuncer are trained on L in the two subspaces. Ccer selects the top k positives and k negatives from U (CER1); Cuncer picks one positive and one negative from CER1; the MI sample is annotated manually and the MA sample automatically; a pos/neg pair is added to L, otherwise the samples are discarded.]
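
Steps (6) and (7) of this variant can be sketched as below. This is a hedged illustration: `annotate_plus`, the document identifiers, and the `oracle` callable (a stand-in for the human annotator) are hypothetical names, and the sketch assumes that the automatic label is simply the class predicted by Ccer.

```python
from typing import Callable, List, Tuple


def annotate_plus(pos_pick: str, neg_pick: str, minority: str,
                  oracle: Callable[[str], str]) -> Tuple[List[Tuple[str, str]], int]:
    """Label the two picks; return (pairs to add to L, number of manual labels).

    pos_pick / neg_pick are the samples predicted positive / negative by C_cer.
    """
    pairs = []
    manual = 0
    for doc, predicted in ((pos_pick, "pos"), (neg_pick, "neg")):
        if predicted == minority:
            label = oracle(doc)      # step (6): manual annotation for the MI pick
            manual += 1
        else:
            label = predicted        # step (6): automatic label for the MA pick
        pairs.append((doc, label))
    # Step (7): keep the pair only if the two labels differ, so L stays balanced.
    if pairs[0][1] != pairs[1][1]:
        return pairs, manual
    return [], manual


# Negative is the minority class in the experimental setup of these slides.
kept, cost = annotate_plus("doc_a", "doc_b", "neg", oracle=lambda d: "neg")
dropped, cost2 = annotate_plus("doc_a", "doc_b", "neg", oracle=lambda d: "pos")
```

In both calls only one manual label is spent per iteration. In the second call the annotator disagrees with Ccer, the two labels coincide, and the pair is discarded.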
Experiments
• Dataset
  • Consists of four domains: Book, DVD, Electronic, and Kitchen (Blitzer et al., 2007)
• Experimental setup
  • Initial labeled, balanced data: 50 positive and 50 negative examples
  • Unlabeled data: 2000 negative samples and 14580/12160/7140/7560 positive samples from the four domains, respectively
  • Test data: 800 negative and 800 positive samples, randomly extracted
• Classification algorithm
  • Maximum entropy (ME), implemented with the Mallet toolkit
  • SVM implementation from SVM-light (for margin-based active learning)
• Features: unigram words with Boolean weights
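
As a small illustration of "unigram words with Boolean weights": each feature records only whether a word occurs in the document, not how often. The whitespace tokenization, lowercasing, and the tiny vocabulary below are assumptions made for this sketch. The slides do not describe the preprocessing that was used.

```python
from typing import List


def boolean_vector(text: str, vocabulary: List[str]) -> List[int]:
    """1 if the word occurs in the text at least once, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


vocabulary = ["great", "boring", "plot"]          # hypothetical vocabulary
vector = boolean_vector("Great great plot", vocabulary)
```

The repeated word "great" still contributes a single 1, which is what distinguishes Boolean weights from term-frequency weights.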
Performance comparison
• Evaluation metric: geometric mean
• θ = 1/16, k = 50
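
The geometric mean is commonly defined for imbalanced classification as the square root of the product of the per-class recalls. The sketch below assumes that standard definition, since the slide does not spell out the formula. The function name and label strings are illustrative.

```python
import math
from typing import Sequence


def geometric_mean(y_true: Sequence[str], y_pred: Sequence[str],
                   pos: str = "pos", neg: str = "neg") -> float:
    """sqrt(recall on positives * recall on negatives)."""
    pos_total = sum(1 for t in y_true if t == pos)
    neg_total = sum(1 for t in y_true if t == neg)
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must occur in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == pos and p == pos)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == neg and p == neg)
    return math.sqrt((tp / pos_total) * (tn / neg_total))


y_true = ["pos"] * 4 + ["neg"] * 4
always_majority = geometric_mean(y_true, ["pos"] * 8)
mixed = geometric_mean(y_true, ["pos", "pos", "pos", "neg"] + ["neg"] * 4)
```

A classifier that always predicts one class reaches 50% accuracy on a balanced test set but a geometric mean of 0, which is why the metric suits imbalanced training scenarios.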
Questions
• "The most informative MA samples are automatically labeled using the predicted labels provided by the first classifier." How is the automatic labeling done?
• Two feature subspaces are selected, Fs and F - Fs. Two subspace classifiers are then employed: one for choosing the certain samples, and the other for choosing the uncertain samples among those certain samples for manual annotation.
  • Does this mean that both classifiers are applied once on Fs and once on F - Fs?
  • Since the classifier is applied on L and is certain about it, why is it necessary to apply another classifier to determine the uncertainty?
Questions
• The authors assume an imbalanced classification problem here. Does that mean that this is not a generic approach?
• Also, how can we know whether the distribution is imbalanced without knowing the labels? Do we judge it from the labels obtained?
• The authors mention that co-training would generate errors across iterations for imbalanced data. Is that true?
• During co-selecting, the labels of MA samples are always trusted to be correct. Isn't that a questionable practice?
• Both co-selecting-basic and co-selecting-plus select relatively more samples for annotation than other active learning methods. How can we conclude that co-selecting is better than the others? (Is that judged by accuracy alone? If so, did the authors also compare the accuracy of the other active learning methods?)
Questions
• This approach seems appropriate when there are only two sentiment classes (positive and negative here).
  • What if we need to classify emotions (sad, angry, happy, neutral, etc.)?
  • Would this approach still be applicable with minor modifications?
• In the co-selecting section, why do they use the complement set of the feature subspace to train the uncertainty classifier? That is, why is that strategy adopted?
Questions
• Is it contradictory to (step 1) select the most certain samples to form a set, and then (step 2) select the most uncertain samples from that set? In my view, since the samples are already the most certain ones, they will not differ much from each other even if the uncertainty method is applied.
• What is the performance if we remove step 2, compared to their solution with step 2?
Questions
1. What are some reasons why the performance may vary among the domains, for example Book vs. Kitchen?
2. It doesn't appear that the authors took into consideration the computation time for determining the best instance to annotate. Is this a concern?
Questions
• Keeping class balance
  • What happens when a sample is manually annotated as MA? Do we discard it? Keep it for use as a future "automatically labeled" instance?
• Selecting the most uncertain from the most certain
  • The algorithm first selects the k most certain samples (nearest to 1), and then selects the most uncertain sample (nearest to 0.5) from that subset.
  • This is exactly the same as selecting the kth most certain sample, correct?
  • If all of the top k certain samples are very close to P = 1, what is the benefit of ignoring the more certain samples? This question appears to be examined in Figure 7, but are the results statistically significant (some differ by hardly a percentage point)?
  • If we automatically label the instances instead of labeling them manually, is that the same as the Certainty approach under the Experimental Results? If so, that is quite a big performance jump between the Certainty and the Co-selecting approaches!
• Co-selecting-basic vs. co-selecting-plus
  • Why do you think there is such a big performance jump from automatically labeling the majority class (Figure 4)? Wouldn't intuition suggest that having both instances manually labeled would result in better performance?
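
A small numeric sketch may help with the "kth most certain" question. It assumes, as the algorithm slides describe, that certainty and uncertainty are scored by two different classifiers trained on complementary feature subspaces. The posterior values below are invented purely for illustration.

```python
# Hypothetical posteriors P(pos | d) from the two subspace classifiers.
p_cer = {"a": 0.99, "b": 0.97, "c": 0.95, "d": 0.60}
p_uncer = {"a": 0.80, "b": 0.52, "c": 0.90, "d": 0.55}
k = 3

# Step 1: the k samples that the certainty classifier is most sure are positive.
top_k = sorted(p_cer, key=p_cer.get, reverse=True)[:k]
kth_most_certain = top_k[-1]

# Step 2 with a single classifier: this always returns the kth most certain sample.
single_classifier_pick = min(top_k, key=lambda d: abs(p_cer[d] - 0.5))

# Step 2 with the second classifier: the ranking inside top_k can differ.
two_classifier_pick = min(top_k, key=lambda d: abs(p_uncer[d] - 0.5))
```

With one classifier the two steps do collapse to picking the kth most certain sample. With two classifiers they need not: here the second classifier picks `b`, not `c`.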
Questions
• What is a binary classification problem?
• What other imbalanced classification tasks might there be where the co-selecting approach outperforms other active learning methods?
Questions
• What is the objective of automatically labeling an MA sample? Why not label it manually?
• To keep the dataset balanced, they add only 2 newly annotated samples (1 negative and 1 positive) in each iteration. If they get 2 samples from the same class, I understand that they do not keep them. But both samples were already annotated (at least one of them manually), so time was spent labeling samples that may never be used.
• How much do you think this issue affects the time needed to grow the labeled set?