Semi-Supervised Time Series Classification
Li Wei, Eamonn Keogh
University of California, Riverside
{wli, eamonn}@cs.ucr.edu
Indexing of Handwritten Documents
There has been a recent explosion of interest in indexing handwritten documents. Simply treating the words as “time series” (see Figure 1) is an extremely competitive approach for classifying (and thus indexing) handwritten documents. Handwriting classifiers must be trained on each individual’s particular handwriting. However, obtaining labeled data for each word, for every individual, is very expensive as measured in human time. A semi-supervised approach, in which a user annotates just a few training examples, would have great utility.
Figure 1: A) A sample of text written by George Washington. B) The word “Alexandria” after having its slant removed. C) A time series created by tracing the upper profile of the word. (Image courtesy of Raghavan Manmatha, used with permission.)
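One way to picture the "upper profile" step in Figure 1C is the sketch below. It is only an illustration under stated assumptions: the word is assumed to be available as a binary image (nonzero = ink), the function name `upper_profile` is hypothetical, and columns without ink are arbitrarily mapped to 0. The slides do not specify how the profile was actually traced.

```python
import numpy as np

def upper_profile(image):
    """Return, for each column, the height of the topmost ink pixel.

    Assumption: `image` is a 2-D array where nonzero means ink and
    row 0 is the top of the image. Columns with no ink map to 0.
    """
    image = np.asarray(image).astype(bool)
    n_rows = image.shape[0]
    has_ink = image.any(axis=0)
    # argmax on a boolean column gives the index of the first True row.
    first_ink_row = image.argmax(axis=0)
    return np.where(has_ink, n_rows - first_ink_row, 0)

# A tiny 3x4 "word": the ink rises from left to right, last column is blank.
word = [[0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]
profile = upper_profile(word)
```

The resulting one-dimensional array can then be compared with other words using any time series distance measure.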
Value of Unlabeled Data
Unlabeled data do contain information that can help classification. For example, in Figure 2 we need to classify the instance marked with “?”, which clearly belongs to the F (female) class. However, this particular image happens to show the actor in a pose that is very similar to one of the M (male) instances, M1, and the instance is thus misclassified. Note that F1 is a very close match to the unlabeled instance U4, so we could simply relabel U4 as F2 and add it to our dataset of labeled instances. In fact, the basic tenet of semi-supervised learning is that we can do this repeatedly, and thus end up with the situation shown in Figure 3.
Figure 2: A simple example to motivate semi-supervised classification. The labeled training instances are F1, M1, and M2; the unlabeled instances are U1 to U8. The instance to be classified (marked with “?”) is actually an F (female) but happens to be closer to an M (male) in this small dataset of labeled instances.
Figure 3: The small dataset of labeled instances shown in Figure 2 has been augmented by incorporating the previously unlabeled examples (U1 as F5, U2 as F4, U3 as F3, U4 as F2, U5 as M6, U6 as M5, U7 as M4, U8 as M3). Now the instance to be classified (marked with “?”) is closest to F5, and is correctly classified.
Training the Classifier
We let the classifier teach itself using its own predictions. For example, in Figure 4 we have a two-class dataset in which initially only one example is known to be positive (the solid square in subplot A). In subplot B, we can see the chaining effect of semi-supervised learning: a positive example is labeled, which helps label other positive examples, and so on. Eventually all positive examples are correctly classified. In contrast, if we simply assign the seventeen nearest neighbors of the single labeled example to the positive class, we get very poor accuracy (see subplot C).
Figure 4: Semi-supervised training on a simple two-class dataset yields much higher accuracy than a naive k-nearest-neighbor classifier. A) The dataset with a single positively labeled example. B) The instances added in the first through seventeenth iterations of semi-supervised training. C) The result of the naive nearest-neighbor approach.
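The chaining effect can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, a positive-only labeled set, and exactly one instance added per iteration; the function name `self_train` and the toy data are hypothetical.

```python
import numpy as np

def self_train(data, seed_idx, n_iterations):
    """Grow the labeled positive set by repeatedly adding the unlabeled
    instance that is closest to any already-labeled instance."""
    data = np.asarray(data, dtype=float)
    labeled = list(seed_idx)
    seeds = set(seed_idx)
    unlabeled = [i for i in range(len(data)) if i not in seeds]
    for _ in range(n_iterations):
        if not unlabeled:
            break
        # Pairwise distances: rows are unlabeled, columns are labeled.
        diff = data[unlabeled][:, None, :] - data[labeled][None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        # Position (within `unlabeled`) of the instance nearest the labeled set.
        best = int(dist.min(axis=1).argmin())
        labeled.append(unlabeled.pop(best))
    return labeled

# Toy data: indices 0-4 form a chain of positives, indices 5-6 are negatives
# that lie closer to the seed (index 0) than the far end of the chain does.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 2.5), (1, 2.5)]
grown = self_train(points, seed_idx=[0], n_iterations=4)
```

In this toy dataset the four nearest neighbors of the seed include both negatives, whereas four rounds of self-training follow the chain and pick up only the positives.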
Stopping Heuristic I
Because we do not know the ground truth of the data, it is very hard (if not impossible) to know the true performance of the classifier. Fortunately, the distance statistics give us some hint about how well the classifier is doing. In Figure 5, we can see that the minimal nearest neighbor distance decreases dramatically in the first few iterations, stabilizes for a relatively long time, and then drops again. Interestingly, the precision-recall breakeven point achieved by the classifier shows a corresponding trend of increasing, stabilizing, and decreasing.
Figure 5: Change of the minimal nearest neighbor distance of the labeled set on the ECG dataset. Top panel: precision-recall breakeven point. Bottom panel: distance between the closest pair in P. Horizontal axis: number of iterations. The three phases are annotated as “adding positive examples”, “find the closest pair”, and “moving into negative space”.
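For readers unfamiliar with the metric, the precision-recall breakeven point is the value at which precision equals recall. With a ranked list, this occurs when the cutoff equals the number of true positives. The sketch below assumes instances are identified by integer ids and ranked from most to least likely positive; the function name `breakeven_point` is hypothetical.

```python
def breakeven_point(ranking, positives):
    """Precision (= recall) when the top-k instances are retrieved,
    where k is the number of true positives."""
    positives = set(positives)
    k = len(positives)
    if k == 0:
        raise ValueError("at least one true positive is required")
    hits = sum(1 for instance in ranking[:k] if instance in positives)
    return hits / k

# Three true positives (ids 1, 2, 3); two of them appear in the top three.
score = breakeven_point(ranking=[3, 1, 4, 0, 2], positives={1, 2, 3})
```

Note that computing this value requires ground-truth labels, which is why it can be used for evaluation but not as a stopping signal during training.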
Stopping Heuristic II
In hindsight, the phenomenon in Figure 5 is not surprising. In the first few iterations, the labeled positive set is relatively small. As more positive examples are added, the space gets denser and, as a result, the minimal nearest neighbor distance decreases. At some point, the closest pair of positive examples is incorporated into the labeled set, and from then on the minimal nearest neighbor distance is the distance between them. However, if a negative example is labeled as positive, chances are high that we will keep adding negative examples, because the negative space is much denser than the positive space. Thus we will see a drop in the minimal nearest neighbor distance of the positive set. Figure 6 illustrates the process on a small sample dataset.
Figure 6: A sample dataset shown in two-dimensional space. A) Initially the two solid (red) squares are labeled as positive. B) At some point the closest pair in the positive set is added to the labeled positive set. C) A negative instance is added to the labeled positive set. D) The closest pair in the labeled positive set changes to two negative instances.
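The reasoning above suggests monitoring the closest-pair distance of the labeled positive set and reacting to the drop that follows the stable phase. The sketch below is one possible reading, not the rule used by the authors: the plateau length `plateau_len`, the tolerance `tol`, and both function names are assumptions made for illustration.

```python
import numpy as np

def min_nn_distance(points):
    """Distance between the closest pair of points (Euclidean)."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    return float(dist.min())

def stop_iteration(history, plateau_len=3, tol=1e-9):
    """Return the first iteration at which the distance drops after having
    been stable for at least `plateau_len` iterations, or None."""
    run = 1  # length of the current run of (nearly) equal values
    for t in range(1, len(history)):
        if abs(history[t] - history[t - 1]) <= tol:
            run += 1
        elif history[t] < history[t - 1] - tol and run >= plateau_len:
            return t
        else:
            run = 1
    return None

# Hypothetical distance trace: early decrease, plateau, then a second drop.
trace = [2.0, 1.5, 1.0, 1.0, 1.0, 1.0, 0.4, 0.3]
stop_at = stop_iteration(trace)
```

Under this reading, the labeled set as it stood just before iteration `stop_at` would be kept, since the drop signals that negative instances have started to enter the positive set.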
ECG Dataset
Figure 7: Classification performance on ECG Dataset (precision-recall breakeven point versus number of iterations)
Word Spotting Dataset
Figure 8: Classification performance on Word Spotting Dataset (precision-recall breakeven point versus number of iterations)
Figure 9: Ranking changes of two instances (Image 19 and Image 595) in Word Spotting dataset during semi-supervised training. The vertical axis runs from “more likely to be classified as ‘the’” to “less likely to be classified as ‘the’”; the horizontal axis is the number of iterations.
Gun Dataset
Figure 10: Classification performance on Gun Dataset (precision-recall breakeven point versus number of iterations)
Wafer Dataset
Figure 11: Classification performance on Wafer Dataset (precision-recall breakeven point versus number of iterations)
Yoga Dataset
Figure 12: Shapes can be converted to time series. The distance from every point on the profile to the center is measured and treated as the Y-axis of a time series.
Figure 13: Classification performance on Yoga Dataset (precision-recall breakeven point versus number of iterations)
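The shape-to-time-series conversion described in Figure 12 can be sketched as follows. Two assumptions are made here that the slides do not state: the profile is given as an ordered list of (x, y) points, and "the center" is taken to be the centroid of those points. The function name `contour_to_series` is hypothetical.

```python
import numpy as np

def contour_to_series(contour):
    """Convert an ordered shape profile into a time series of
    point-to-center distances (center assumed to be the centroid)."""
    contour = np.asarray(contour, dtype=float)
    center = contour.mean(axis=0)
    return np.linalg.norm(contour - center, axis=1)

# The four corners of a square are all equally far from its centroid.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
series = contour_to_series(square)
```

A real silhouette would yield a longer series with one value per sampled profile point, in the order the profile is traversed.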