
Indexing and Mining Time Series Data Dr Eamonn Keogh




  1. Indexing and Mining Time Series Data. Dr Eamonn Keogh, Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521. eamonn@cs.ucr.edu

  2. What are Time Series? [Figure: a time series plotted over 500 time points, with a column of its raw values (25.1750, 25.2250, 25.2500, …, 24.6750, 24.7500) shown alongside.] A time series is a collection of observations made sequentially in time. Note that virtually all similarity measurements, indexing and dimensionality reduction techniques discussed in this tutorial can be used with other data types.

  3. Time Series are Ubiquitous! I • People measure things... • The president's popularity rating. • Their blood pressure. • The annual rainfall in Brazil. • The value of their Yahoo stock. • The number of web hits per second. • … and things change over time. Thus time series occur in virtually every medical, scientific and business domain.

  4. Time Series are Ubiquitous! II A random sample of 4,000 graphics from 15 of the world’s newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series (Tufte, 1983).

  5. Time Series are Everywhere… Bioinformatics: Aach, J. and Church, G. (2001). Aligning gene expression time series with time warping algorithms. Bioinformatics. Volume 17, pp 495-508. Robotics: Schmill, M., Oates, T. & Cohen, P. (1999). Learned models for continuous planning. In 7th International Workshop on Artificial Intelligence and Statistics. Chemistry: Gollmer, K., & Posten, C. (1995). Detection of distorted pattern using dynamic time warping algorithm and application for supervision of bioprocesses. IFAC CHEMFAS-4. Medicine: Caiani, E.G., et al. (1998). Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Computers in Cardiology. Gesture Recognition: Gavrila, D. M. & Davis, L. S. (1995). Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In IEEE IWAFGR. Meteorology / Tracking / Biometrics / Astronomy / Finance / Manufacturing …

  6. Why is Working With Time Series so Difficult? Part I Answer: How do we work with very large databases? • 1 hour of EKG data: 1 gigabyte. • Typical weblog: 5 gigabytes per week. • Space Shuttle database: 158 gigabytes and growing. • Macho database: 2 terabytes, updated with 3 gigabytes a day. Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate.

  7. Why is Working With Time Series so Difficult? Part II Answer: We are dealing with subjectivity. The definition of similarity depends on the user, the domain and the task at hand. We need to be able to handle this subjectivity.

  8. Why is Working With Time Series so Difficult? Part III Answer: Miscellaneous data handling problems. • Differing data formats. • Differing sampling rates. • Noise, missing values, etc. We will not focus on these issues in this tutorial.

  9. Two Motivating Datasets: Cylinder-Bell-Funnel and Electrocardiograms (ECGs). The three Cylinder-Bell-Funnel classes are defined by:
c(t) = (6 + η) · χ[a,b](t) + ε(t)
b(t) = (6 + η) · χ[a,b](t) · (t − a)/(b − a) + ε(t)
f(t) = (6 + η) · χ[a,b](t) · (b − t)/(b − a) + ε(t)
χ[a,b](t) = { 1 if a ≤ t ≤ b, else 0 }
where η and ε(t) are drawn from a standard normal distribution N(0,1), a is an integer drawn uniformly from the range [16, 32] and (b − a) is an integer drawn uniformly from the range [32, 96]. Kadous 1999; Manganaris 1997; Saito 1994; Rodriguez 2000; Geurts 2001.
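For concreteness, here is a minimal Python sketch of a CBF generator following the definitions above (the series length of 128 is the conventional choice for this dataset, but the slide does not fix it):

```python
import numpy as np

def cbf_instance(kind, n=128, rng=None):
    """Generate one Cylinder-Bell-Funnel series of the given kind."""
    rng = rng or np.random.default_rng()
    a = int(rng.integers(16, 33))        # a ~ uniform integers in [16, 32]
    b = a + int(rng.integers(32, 97))    # (b - a) ~ uniform integers in [32, 96]
    eta = rng.standard_normal()          # eta ~ N(0, 1)
    eps = rng.standard_normal(n)         # eps(t) ~ N(0, 1), i.i.d. noise
    t = np.arange(n)
    chi = ((t >= a) & (t <= b)).astype(float)   # indicator of [a, b]
    if kind == "cylinder":
        shape = chi
    elif kind == "bell":
        shape = chi * (t - a) / (b - a)
    elif kind == "funnel":
        shape = chi * (b - t) / (b - a)
    else:
        raise ValueError(f"unknown class: {kind}")
    return (6 + eta) * shape + eps
```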

  10. Here is a simple motivation for the first half of the tutorial. You go to the doctor because of chest pains. Your ECG looks strange… Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition... Two questions: • How do we define similar? • How do we search quickly?

  11. Why do Time Series Similarity Matching? Defining the similarity between two time series is at the heart of most time series data mining applications/tasks: Clustering, Classification, Rule Discovery, Query by Content, Anomaly Detection, Motif Discovery. [Figure: each task illustrated with example time series A, B, C; the rule discovery example shows a rule with s = 0.5 and c = 0.3.]

  12. The similarity matching problem can come in two flavors I. 1: Whole Matching. Given a query Q, a reference database C and a distance measure, find the Ci that best matches Q. [Figure: a query Q (template) compared against a database of ten sequences C1–C10; C6 is the best match.]

  13. The similarity matching problem can come in two flavors II. 2: Subsequence Matching. Given a query Q, a reference database C and a distance measure, find the location that best matches Q. [Figure: a query Q (template) slid along a long sequence in database C; the best matching subsection is highlighted.] Note that we can always convert subsequence matching to whole matching by sliding a window across the long sequence and copying the window contents, as the sketch below illustrates.
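A minimal sketch of the window-copying idea (the function names are my own):

```python
import numpy as np

def sliding_windows(long_series, m):
    """Turn subsequence matching into whole matching: extract every
    length-m window of the long sequence as its own short sequence."""
    n = len(long_series)
    return np.array([long_series[i:i + m] for i in range(n - m + 1)])

def best_subsequence_match(query, long_series):
    """Offset and distance of the window closest to the query
    under Euclidean distance."""
    windows = sliding_windows(long_series, len(query))
    dists = np.sqrt(((windows - np.asarray(query)) ** 2).sum(axis=1))
    i = int(np.argmin(dists))
    return i, float(dists[i])
```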

  14. The Minkowski Metrics. D(Q,C) = ( Σ_{i=1..n} |qi − ci|^p )^{1/p} • p = 1: Manhattan (Rectilinear, City Block) • p = 2: Euclidean • p = ∞: Max (Supremum, "sup")

  15. Euclidean Distance Metric. Given two time series Q = q1…qn and C = c1…cn, their Euclidean distance is defined as: D(Q,C) = √( Σ_{i=1..n} (qi − ci)² ). Ninety percent of all work on time series uses the Euclidean distance measure.
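Both of these metrics are one-liners in numpy; a sketch covering slides 14 and 15:

```python
import numpy as np

def minkowski(q, c, p):
    """Minkowski distance between two equal-length series."""
    q, c = np.asarray(q, float), np.asarray(c, float)
    if np.isinf(p):
        return float(np.abs(q - c).max())          # p = infinity: max norm
    return float((np.abs(q - c) ** p).sum() ** (1.0 / p))

def euclidean(q, c):
    return minkowski(q, c, 2)                      # the p = 2 special case
```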

  16. Optimizing the Euclidean Distance Calculation. Instead of using the Euclidean distance we can use the Squared Euclidean distance. Euclidean distance and Squared Euclidean distance are equivalent in the sense that they return the same rankings, clusterings and classifications. This optimization helps with CPU time, but most problems are I/O bound.
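Since the square root is monotonic, sorting by squared distance gives exactly the same order as sorting by true distance; a quick sketch demonstrating the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal(64)                 # a query
db = rng.standard_normal((100, 64))         # 100 candidate sequences

sq = ((db - q) ** 2).sum(axis=1)            # squared Euclidean: no sqrt
true = np.sqrt(sq)                          # true Euclidean

# sqrt is monotonic, so the nearest-neighbor rankings are identical
assert (np.argsort(sq) == np.argsort(true)).all()
```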

  17. Preprocessing the data before distance calculations • If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results. • This is because Euclidean distance is very sensitive to some distortions in the data. For most problems these distortions are not meaningful, and thus we can and should remove them. • In the next 4 slides I will discuss the 4 most common distortions, and how to remove them. • Offset Translation • Amplitude Scaling • Linear Trend • Noise

  18. Transformation I: Offset Translation. Q = Q − mean(Q); C = C − mean(C). [Figure: two series before and after subtracting their means; D(Q,C) shrinks once the vertical offset is removed.]

  19. Transformation II: Amplitude Scaling. Q = (Q − mean(Q)) / std(Q); C = (C − mean(C)) / std(C). [Figure: two series before and after amplitude scaling.]

  20. Transformation III: Linear Trend. The intuition behind removing linear trend is this: fit the best fitting straight line to the time series, then subtract that line from the time series. [Figure: a series with offset translation and amplitude scaling removed, before and after removing the linear trend.]

  21. Transformation IIII: Noise. Q = smooth(Q); C = smooth(C). The intuition behind removing noise is this: average each datapoint's value with its neighbors. [Figure: two noisy series before and after smoothing.]
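Putting slides 18 through 21 together, a minimal numpy sketch of the four clean-up steps (the moving-average window of 5 is an arbitrary choice; the slides do not specify the smoother):

```python
import numpy as np

def remove_offset(x):
    return x - x.mean()                        # Transformation I

def rescale_amplitude(x):
    return (x - x.mean()) / x.std()            # Transformation II

def remove_linear_trend(x):
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)     # best-fitting straight line
    return x - (slope * t + intercept)         # Transformation III

def smooth(x, w=5):
    kernel = np.ones(w) / w                    # Transformation IIII:
    return np.convolve(x, kernel, mode="same") # moving-average smoothing
```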

  22. A Quick Experiment to Demonstrate the Utility of Preprocessing the Data. [Figure: two dendrograms of nine instances from Cylinder-Bell-Funnel with small, random amounts of trend, offset and scaling added. One is clustered using Euclidean distance on the raw data; the other is clustered using Euclidean distance after removing noise, linear trend, offset translation and amplitude scaling.]

  23. Summary of Preprocessing. The "raw" time series may have distortions which we should remove before clustering, classification etc. Of course, sometimes the distortions are the most interesting thing about the data; the above is only a general rule. We should keep these problems in mind as we consider the high-level representations of time series which we will encounter later (Fourier transforms, Wavelets etc.), since those representations often allow us to handle distortions in elegant ways.

  24. Weighted Distance Measures I Intuition: For some queries different parts of the sequence are more important. Weighting features is a well known technique in the machine learning community to improve classification and the quality of clustering.

  25. Weighted Distance Measures II. D(Q,C) becomes D(Q,C,W), where W is a weight vector. [Figure: a histogram drawn over the query; the height of the histogram indicates the relative importance of that part of the query.]
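One common way to realize D(Q,C,W), assuming the weights multiply the pointwise squared errors (the slides do not spell out the formula):

```python
import numpy as np

def weighted_euclidean(q, c, w):
    """Euclidean distance with a per-point weight vector w;
    a larger w[i] makes a mismatch at position i cost more."""
    q, c, w = (np.asarray(a, float) for a in (q, c, w))
    return float(np.sqrt((w * (q - c) ** 2).sum()))
```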

  26. Weighted Distance Measures III: How do we set the weights? One possibility: relevance feedback. Definition: Relevance feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query. For example, a text query with term vector [Jordan, Cow, Bull, River] might start with uniform term weights [1, 1, 1, 1] and, after feedback, end up with term weights [1.1, 1.7, 0.3, 0.9]. The loop: Search → Display Results → Gather Feedback → Update Weights.

  27. Relevance Feedback for Time Series. [Figure: the original query and the weight vector; initially, all weights are the same.] Note: in this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.

  28. The initial query is executed, and the five best matches are shown (in the dendrogram). One by one, the 5 best matching sequences will appear, and the user will rank each of them from very bad (-3) to very good (+3).

  29. Based on the user feedback, both the shape and the weight vector of the query are changed. The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.
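A minimal sketch of one plausible update rule, purely as illustration; the actual scheme used in the relevance feedback work may differ. Well-rated results pull the query shape toward them, and weights grow where well-rated results agree with the query:

```python
import numpy as np

def update_query(query, weights, results, ratings, lr=0.5):
    """results: (k, n) array of the k returned sequences;
    ratings: k user scores in [-3, +3]. Hypothetical update rule."""
    query, weights = query.astype(float), weights.astype(float)
    for seq, r in zip(results, ratings):
        strength = lr * (r / 3.0)                    # scale rating to [-lr, lr]
        query += strength * (seq - query)            # reshape the query
        agreement = 1.0 / (1.0 + (seq - query) ** 2) # near 1 where they match
        weights *= 1.0 + strength * agreement        # sharpen/dampen weights
    weights *= len(weights) / weights.sum()          # renormalize
    return query, weights
```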

  30. Other Distance Measures for Time Series

  31. Other Distance Measures for Time Series. In the past decade, there have been dozens of alternative distance measures for time series introduced into the literature. They are all a complete waste of time!

  32. Subjective Evaluation of Similarity Measures. [Figure: two dendrograms of the same eight time series, one built with a novel distance measure introduced into the literature, the other with Euclidean distance.] I believe that one of the best (subjective) ways to evaluate a proposed similarity measure is to use it to create a dendrogram of several time series from the domain of interest.

  33. Results: Classification Error Rates

  34. Dynamic Time Warping. Fixed time axis: sequences are aligned "one to one". "Warped" time axis: nonlinear alignments are possible. Note: We will first see the utility of DTW, then see how it is calculated.

  35. Utility of Dynamic Time Warping: Example I, Machine Learning. The Cylinder-Bell-Funnel dataset has been studied in a machine learning context by many researchers: Kadous 1999; Manganaris 1997; Saito 1994; Rodriguez 2000; Geurts 2001. Recall that by definition (slide 9), the instances of Cylinder-Bell-Funnel are warped in the time axis: η and ε(t) are drawn from a standard normal distribution N(0,1), a is an integer drawn uniformly from the range [16, 32] and (b − a) is an integer drawn uniformly from the range [32, 96].

  36. Classification experiment on the Cylinder-Bell-Funnel dataset. • Training data consists of 10 exemplars from each class. • (One) Nearest Neighbor algorithm. • Leave-one-out evaluation, averaged over 100 runs. • Error rate using Euclidean Distance: 26.10%. • Error rate using Dynamic Time Warping: 2.87%. • Time to classify one instance using Euclidean Distance: 1 sec. • Time to classify one instance using Dynamic Time Warping: 4,320 sec. Dynamic time warping can reduce the error rate by an order of magnitude! Its classification accuracy is competitive with sophisticated approaches like Decision Trees, Boosting, Neural Networks, and Bayesian Techniques. But it is slow... The evaluation protocol is sketched below.
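The evaluation protocol as a short sketch; pass in the euclidean() sketch from slide 15's example, or any DTW function, as dist:

```python
import numpy as np

def one_nn_loo_error(X, y, dist):
    """Leave-one-out error rate of the 1-nearest-neighbor classifier
    under an arbitrary distance function dist(a, b)."""
    errors = 0
    for i in range(len(X)):
        dists = [dist(X[i], X[j]) if j != i else np.inf
                 for j in range(len(X))]
        errors += int(y[int(np.argmin(dists))] != y[i])
    return errors / len(X)
```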

  37. Utility of Dynamic Time Warping: Example II, Data Mining. Power-Demand Time Series: each sequence corresponds to a week's demand for power in a Dutch research facility in 1997 [van Selow 1999]. [Figure: one week of demand, with the days Monday through Sunday labeled; Wednesday was a national holiday.]

  38. [Figure: the power-demand weeks clustered two ways, with Euclidean distance and with Dynamic Time Warping.] For both dendrograms the two 5-day weeks are correctly grouped. Note, however, that for Euclidean distance the three 4-day weeks are not clustered together, and the two 3-day weeks are also not clustered together. In contrast, Dynamic Time Warping clusters the three 4-day weeks together, and the two 3-day weeks together.

  39. Time taken to create the hierarchical clustering of the power-demand time series. • Time to create the dendrogram using Euclidean Distance: 1.2 seconds. • Time to create the dendrogram using Dynamic Time Warping: 3.40 hours.

  40. Quick Note: In my examples I have assumed that the time series are of the same length. However, with DTW (unlike Euclidean distance) the two time series can be of different lengths (|Q| = n, |C| = p). This can be an important advantage in some domains; for example, suppose you want to compare electrocardiograms which were recorded at different rates.

  41. How is DTW Calculated? I. We create a matrix of size |Q| by |C|, then fill it in with the distance between every pair of points in our two time series.

  42. How is DTW Calculated? II. Every possible warping between two time series is a path through the matrix. We want the best one: the warping path w.

  43. How is DTW Calculated? III. This recursive function gives us the minimum-cost path: γ(i,j) = d(qi,cj) + min{ γ(i−1,j−1), γ(i−1,j), γ(i,j−1) }.

  44. How is DTW Calculated? IIII. γ(i,j) = d(qi,cj) + min{ γ(i−1,j−1), γ(i−1,j), γ(i,j−1) }. That is, the cumulative distance γ(i,j) at each cell is the local distance d(qi,cj) plus the minimum of the three neighboring cells already computed.
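Filling the matrix bottom-up turns the recursion into an O(|Q|·|C|) dynamic program. A minimal sketch; the squared local distance and the final square root are my assumptions, since the slides leave d(qi,cj) unspecified:

```python
import numpy as np

def dtw(q, c):
    """Dynamic time warping distance via the cumulative-cost recurrence
    gamma(i,j) = d(q_i,c_j) + min(gamma(i-1,j-1), gamma(i-1,j), gamma(i,j-1)).
    Handles series of different lengths."""
    n, p = len(q), len(c)
    gamma = np.full((n + 1, p + 1), np.inf)
    gamma[0, 0] = 0.0                         # empty prefixes align for free
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            d = (q[i - 1] - c[j - 1]) ** 2    # local (squared) distance
            gamma[i, j] = d + min(gamma[i - 1, j - 1],   # diagonal step
                                  gamma[i - 1, j],       # advance in q only
                                  gamma[i, j - 1])       # advance in c only
    return float(np.sqrt(gamma[n, p]))
```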

  45. Let us visualize the cumulative matrix on a real world problem I This example shows 2 one-week periods from the power demand time series. Note that although they both describe 4-day work weeks, the blue sequence had Monday as a holiday, and the red sequence had Wednesday as a holiday.

  46. Let us visualize the cumulative matrix on a real world problem II

  47. What we have seen so far… • Dynamic time warping gives much better results than Euclidean distance on virtually all problems (recall the classification example, and the clustering example) • Dynamic time warping is very very slow to calculate! Is there anything we can do to speed up similarity search under DTW?

  48. Fast Approximations to Dynamic Time Warp Distance I. Simple idea: approximate the time series with some compressed or downsampled representation, and do DTW on the new representation. How well does this work? [Figure: Q and C at full resolution and in downsampled form.]
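A sketch of the idea, reusing the dtw() sketch from slide 44's example; segment-mean downsampling is my choice of compressed representation, and the slides do not commit to one:

```python
import numpy as np

def downsample(x, k):
    """Replace each of k equal-width segments by its mean
    (assumes len(x) is divisible by k)."""
    return np.asarray(x, float).reshape(k, -1).mean(axis=1)

def approx_dtw(q, c, k=32):
    """Run DTW on the short versions: with k much smaller than the
    series length, the O(k^2) cost is a small fraction of exact DTW."""
    return dtw(downsample(q, k), downsample(c, k))
```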

  49. Fast Approximations to Dynamic Time Warp Distance II. [Figure: two clusterings of the same data, one taking 22.7 sec and one taking 1.3 sec to compute.] There is strong visual evidence to suggest the approximation works well, and good experimental evidence for the utility of the approach on clustering, classification and query-by-content problems has also been demonstrated.

  50. Lower Bounding. We can speed up similarity search under DTW by using a lower bounding function. Intuition: try to use a cheap lower bounding calculation as often as possible. Only do the expensive, full calculations when it is absolutely necessary.

Algorithm Lower_Bounding_Sequential_Scan(Q)
1.  best_so_far = infinity;
2.  for all sequences in database
3.      LB_dist = lower_bound_distance(Ci, Q);
4.      if LB_dist < best_so_far
5.          true_dist = DTW(Ci, Q);
6.          if true_dist < best_so_far
7.              best_so_far = true_dist;
8.              index_of_best_match = i;
9.          endif
10.     endif
11. endfor
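The same scan as runnable Python. The slides do not specify the lower bounding function, so this sketch uses an LB_Kim-style bound (the maximum difference over the first, last, smallest and largest points, each of which any warping path must account for) purely as a placeholder:

```python
import numpy as np

def lb_kim(q, c):
    """Placeholder lower bound in the style of LB_Kim: the first points
    must align, the last points must align, and each extreme value must
    align with some point of the other series."""
    feats = lambda x: np.array([x[0], x[-1], x.min(), x.max()])
    return float(np.abs(feats(np.asarray(q)) - feats(np.asarray(c))).max())

def lower_bounding_sequential_scan(Q, database, dtw):
    """Cheap test first; expensive DTW only when the bound cannot prune."""
    best_so_far, index_of_best_match = np.inf, None
    for i, C in enumerate(database):
        if lb_kim(Q, C) < best_so_far:
            true_dist = dtw(Q, C)
            if true_dist < best_so_far:
                best_so_far, index_of_best_match = true_dist, i
    return index_of_best_match, best_so_far
```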
