Time-Series Data Management. Yonsei University, 2nd Semester, 2014. Sanghyun Park. * The slides were extracted from the material presented at ICDM’01 by Eamonn Keogh. Contents: Introduction, motivation; Utility of similarity measurements; Indexing time series; Summary, conclusions.
Time-Series Data Management Yonsei University 2nd Semester, 2014 Sanghyun Park * The slides were extracted from the material presented at ICDM’01 by Eamonn Keogh
Contents • Introduction, motivation • Utility of similarity measurements • Indexing time series • Summary, conclusions
What Are Time Series? • A time series is a collection of observations made sequentially in time • Example values: 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 .. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500 [Figure: the series plotted over about 500 time points, with values between 23 and 29]
Time Series Are Ubiquitous (1/2) • People measure things … • The president’s approval rating • Their blood pressure • The annual rainfall in Riverside • The value of their Yahoo stock • The number of web hits per second • … and things change over time. Thus time series occur in virtually every medical, scientific and business domain
Time Series Are Ubiquitous (2/2) • A random sample of 4,000 graphics from 15 of the world’s newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series
Time Series Similarity • Defining the similarity between two time series is at the heart of most time series data mining applications/tasks • Thus time series similarity will be the primary focus of this lecture
Utility Of Similarity Search (1/2) • Classification • Clustering
Utility Of Similarity Search (2/2) • Rule Discovery (example rule annotated with s = 0.5, c = 0.3) • Query by Content (Query Q used as a template)
Challenges Of Research On Time Series (1/3) • How do we work with very large databases? • 1 hour of ECG data: 1 gigabyte • Typical web log: 5 gigabytes per week • Space shuttle database: 158 gigabytes and growing • Macho database: 2 terabytes, updated with 3 gigabytes per day • Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate
Challenges Of Research On Time Series (2/3) • We are dealing with subjective notions of similarity • The definition of similarity depends on the user, the domain, and the task at hand. We need to handle this subjectivity
Challenges Of Research On Time Series (3/3) • Miscellaneous data handling problems • Differing data formats • Differing sampling rates • Noise, missing values, etc
Whole Matching vs. Subsequence Matching (1/2) • Whole matching: Given a query Q, a reference database C, and a distance measure, find the Ci that best matches Q [Figure: Query Q (template) compared against Database C, which holds sequences 1 to 10; C6 is the best match]
Whole Matching vs. Subsequence Matching (2/2) • Subsequence matching: Given a query Q, a reference database C, and a distance measure, find the location that best matches Q [Figure: Query Q (template) slid along Database C, with the best matching subsection highlighted]
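The two matching modes can be sketched as brute-force searches. This is only an illustration: the function names are hypothetical, Euclidean distance is assumed as the distance measure, and every candidate is examined with no index.

```python
import math

def euclidean(q, c):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def whole_match(query, database):
    """Index of the database sequence closest to the query."""
    return min(range(len(database)),
               key=lambda i: euclidean(query, database[i]))

def subsequence_match(query, long_series):
    """Start offset of the window of long_series closest to the query."""
    m = len(query)
    offsets = range(len(long_series) - m + 1)
    return min(offsets,
               key=lambda s: euclidean(query, long_series[s:s + m]))

# Whole matching: every database entry has the same length as the query.
database = [[0, 0, 0, 0], [1, 2, 3, 4], [4, 3, 2, 1]]
best_index = whole_match([1, 2, 3, 5], database)

# Subsequence matching: slide the query along one long sequence.
best_offset = subsequence_match([5, 9, 5], [0, 1, 0, 5, 9, 5, 0, 1])
```

Whole matching compares the query with complete sequences, while subsequence matching compares it with every window of the same length inside a longer sequence.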
Motivation Of Similarity Search • You go to the doctor because of chest pains. Your ECG looks strange … • Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition … • Two questions • How do we define similar? • How do we search quickly?
Defining Distance Measures • Definition: Let O1 and O2 be two objects from the universe of possible objects. Their distance is denoted as D(O1,O2) • What properties should a distance measure have? • D(A,B) = D(B,A) (Symmetry) • D(A,A) = 0 (Constancy of self-similarity) • D(A,B) = 0 iff A = B (Positivity) • D(A,B) ≤ D(A,C) + D(B,C) (Triangular inequality)
The Minkowski Metrics • $D(Q,C) = \left( \sum_{i=1}^{n} |q_i - c_i|^p \right)^{1/p}$ • p = 1: Manhattan (Rectilinear, City Block) • p = 2: Euclidean • p = ∞: Max (Supremum, “sup”)
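Euclidean Distance Metric • Given two time series Q = q1…qn and C = c1…cn, their Euclidean distance is defined as: $D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$ [Figure: Q and C plotted together, with D(Q,C) accumulated from the point-by-point differences]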
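The Minkowski metric of order p is the p-th root of the sum of |q_i - c_i|^p, and the Euclidean distance is the special case p = 2. A minimal sketch follows; the function name and the use of `math.inf` to select the max metric are choices made for this example.

```python
import math

def minkowski(q, c, p):
    """Minkowski distance of order p; p = math.inf gives the max metric."""
    if len(q) != len(c):
        raise ValueError("sequences must have the same length")
    diffs = [abs(a - b) for a, b in zip(q, c)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

q = [0.0, 0.0, 0.0]
c = [3.0, 4.0, 0.0]
manhattan = minkowski(q, c, 1)         # sum of absolute differences
euclid = minkowski(q, c, 2)            # square root of summed squares
supremum = minkowski(q, c, math.inf)   # largest single difference
```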
Processing The Data Before Distance Calculation • If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results • This is because Euclidean distance is very sensitive to some distortions in the data • For most problems these distortions are not meaningful, and thus we can and should remove them • Four most common distortions • Offset translation • Amplitude scaling • Linear trend • Noise
Offset Translation • Q = Q - mean(Q) • C = C - mean(C) • Then compute D(Q,C) [Figure: Q and C compared before and after subtracting their means]
Amplitude Scaling • Q = (Q - mean(Q)) / std(Q) • C = (C - mean(C)) / std(C) • Then compute D(Q,C) [Figure: Q and C compared before and after rescaling]
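Removing offset and amplitude together amounts to subtracting the mean and dividing by the standard deviation. The sketch below assumes the population standard deviation (dividing by n) and leaves a constant series unscaled; the slides specify neither detail.

```python
import math

def z_normalize(series):
    """Subtract the mean, then divide by the standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        # A constant series has no amplitude to rescale; remove the offset only.
        return [x - mean for x in series]
    return [(x - mean) / std for x in series]

def euclidean(q, c):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

q = [1.0, 2.0, 3.0, 2.0, 1.0]
c = [10.0 + 5.0 * x for x in q]   # same shape, shifted and scaled

raw = euclidean(q, c)                                    # large
normalized = euclidean(z_normalize(q), z_normalize(c))   # close to zero
```

The two series have the same shape, so their distance collapses to roughly zero once both distortions are removed.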
Linear Trend • The intuition behind removing linear trend is this: fit the best-fitting straight line to the time series, then subtract that line from the time series [Figure: three panels showing the pair of series after removing offset translation, after removing amplitude scaling, and after removing linear trend]
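One way to realize "fit a line, then subtract it" is an ordinary least-squares fit against the sample index. This is a sketch under that assumption; the slides do not say which fitting method was used.

```python
def remove_linear_trend(series):
    """Subtract the least-squares line fitted against the index 0..n-1."""
    n = len(series)
    if n < 2:
        return [0.0 for _ in series]
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (slope * t + intercept) for t, y in enumerate(series)]

wiggle = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
trended = [w + 0.5 * t + 2.0 for t, w in enumerate(wiggle)]

# Adding a straight line to a series does not change its detrended form.
detrended = remove_linear_trend(trended)
reference = remove_linear_trend(wiggle)
```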
Noise • The intuition behind removing noise is this: average each datapoint value with its neighbors • Q = smooth(Q) • C = smooth(C) • Then compute D(Q,C) [Figure: Q and C compared before and after smoothing]
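A centered moving average is one simple reading of "average each datapoint value with its neighbors". In this sketch the `radius` parameter and the shorter windows used at the two ends are assumptions; the slides do not define `smooth`.

```python
def smooth(series, radius=1):
    """Replace each point by the mean of itself and up to `radius` neighbors per side."""
    n = len(series)
    out = []
    for i in range(n):
        # Windows are truncated at the ends of the series.
        window = series[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

noisy = [0.0, 3.0, 0.0, 3.0, 0.0]
smoothed = smooth(noisy)
```

A larger `radius` removes more noise but also flattens genuine short features.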
Dynamic Time Warping • Fixed time axis: sequences are aligned “one to one” • “Warped” time axis: nonlinear alignments are possible • We will first see the utility of DTW, then see how it is calculated
Utility of DTW: Example I, Machine Learning • Cylinder-Bell-Funnel dataset, with three classes: Cylinder, Bell, Funnel • This dataset has been studied in a machine learning context by many researchers • Recall that, by definition, the instances of Cylinder-Bell-Funnel are warped in the time axis
Classification Experiment on C-B-F Dataset (1/2) • Experimental settings • Training data consists of 10 exemplars from each class • (One) Nearest neighbor algorithm • “Leave-one-out” evaluation, averaged over 100 runs • Results • Error rate using Euclidean Distance: 26.10% • Error rate using Dynamic Time Warping: 2.87% • Time to classify one instance using Euclidean Distance: 1 sec • Time to classify one instance using Dynamic Time Warping: 4,320 sec
Classification Experiment on C-B-F Dataset (2/2) • Dynamic time warping can reduce the error rate by an order of magnitude • Its classification accuracy is competitive with sophisticated approaches like decision trees, boosting, neural networks, and Bayesian techniques • But it is slow …
Utility of DTW: Example II, Data Mining • Power-demand time series: each sequence corresponds to a week’s demand for power in a Dutch research facility in 1997 [Figure: one week of power demand, Monday through Sunday, annotated “Wednesday was a national holiday”]
Hierarchical Clustering with Euclidean Distance • The two 5-day weeks are correctly grouped • Note, however, that the three 4-day weeks are not clustered together • The two 3-day weeks are not clustered together either [Figure: dendrogram with leaves in the order 4 5 3 6 7 2 1]
Hierarchical Clustering with Dynamic Time Warping • The two 5-day weeks are correctly grouped • The three 4-day weeks are clustered together • The two 3-day weeks are also clustered together [Figure: dendrogram with leaves in the order 6 4 7 5 3 2 1]
Time Taken to Create Hierarchical Clustering of Power-Demand Time Series • Time to create dendrogram using Euclidean Distance: 1.2 seconds • Time to create dendrogram using Dynamic Time Warping: 3.40 hours
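Computing the Dynamic Time Warp Distance (1/2) • Note that the input sequences can be of different lengths [Figure: Q and C, of lengths n and p, placed along the sides of a search matrix with cells (i, j); a warping path w1 … wk runs through the matrix]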
Computing the Dynamic Time Warp Distance (2/2) • Every possible mapping from Q to C can be represented as a warping path in the search matrix • We simply want to find the cheapest one … • Although there are exponentially many such paths, we can find the cheapest one in only quadratic time using dynamic programming • $\gamma(i,j) = d(q_i, c_j) + \min\{\gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1)\}$, where γ(i,j) is the cumulative cost of the cheapest path ending at cell (i,j)
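The recurrence can be filled in row by row with a table of cumulative costs. This is a minimal sketch, not the implementation behind the reported timings; it assumes the squared difference as the local cost d(qi, cj), which the slides leave unspecified, and returns the cumulative cost without taking a final square root.

```python
def dtw_distance(q, c):
    """Cumulative cost of the cheapest warping path between q and c."""
    n, p = len(q), len(c)
    inf = float("inf")
    # gamma[i][j] = cost of the best path aligning q[:i] with c[:j]
    gamma = [[inf] * (p + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            d = (q[i - 1] - c[j - 1]) ** 2   # assumed local cost
            gamma[i][j] = d + min(gamma[i - 1][j - 1],   # diagonal step
                                  gamma[i - 1][j],       # advance in q only
                                  gamma[i][j - 1])       # advance in c only
    return gamma[n][p]

q = [0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]   # same shape, stretched at the start
cost = dtw_distance(q, c)
```

The table has (n + 1) × (p + 1) cells and each cell takes constant work, which is where the quadratic running time comes from. The sequences may have different lengths, as in the example.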
Fast Approximation to Dynamic Time Warping Distance (1/2) • Simple idea: approximate the time series with some compressed or downsampled representation, and do DTW on the new representation • How well does this work … [Figure: warping path w1 … wk through the coarser search matrix for Q and C]
Fast Approximation to Dynamic Time Warping Distance (2/2) • … Strong visual evidence to suggest it works well [Figure: two alignments of the same pair of series, labeled 22.7 sec and 1.3 sec]
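The idea can be sketched by averaging consecutive blocks of points and running DTW on the shorter sequences. Block averaging is only one possible downsampled representation and is an assumption here; the function names, the block size of 5 and the squared-difference local cost are also illustrative.

```python
def downsample(series, factor):
    """Average consecutive blocks of `factor` points (the last block may be shorter)."""
    blocks = [series[i:i + factor] for i in range(0, len(series), factor)]
    return [sum(b) / len(b) for b in blocks]

def dtw_distance(q, c):
    """Plain dynamic-programming DTW with squared-difference local cost."""
    inf = float("inf")
    gamma = [[inf] * (len(c) + 1) for _ in range(len(q) + 1)]
    gamma[0][0] = 0.0
    for i in range(1, len(q) + 1):
        for j in range(1, len(c) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return gamma[len(q)][len(c)]

q = [float(i % 10) for i in range(200)]
c = [float((i + 3) % 10) for i in range(200)]

coarse_q, coarse_c = downsample(q, 5), downsample(c, 5)
exact_cells = len(q) * len(c)                  # matrix cells for full DTW
approx_cells = len(coarse_q) * len(coarse_c)   # cells after downsampling
approx_cost = dtw_distance(coarse_q, coarse_c)
```

Shrinking both sequences by a factor k shrinks the search matrix by k², which accounts for the speed-up. The result is an approximation of the DTW distance on the original data, not the exact value.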
Weighted Distance Measures (1/3) • Intuition: for some queries different parts of the sequence are more important
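Weighted Distance Measures (2/3) • D(Q,C) becomes D(Q,C,W), where W is a weight vector • The height of the histogram W indicates the relative importance of that part of the query [Figure: query Q with the weight histogram W drawn beneath it]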
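One common form of D(Q,C,W) multiplies each squared difference by its weight before summing. The slides do not give the formula, so the weighted Euclidean form below is an assumption, and the function name is hypothetical.

```python
import math

def weighted_euclidean(q, c, w):
    """Euclidean distance with a per-position weight on each squared difference."""
    if not (len(q) == len(c) == len(w)):
        raise ValueError("q, c and w must have the same length")
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(q, c, w)))

q = [0.0, 0.0, 0.0, 0.0]
early_mismatch = [1.0, 0.0, 0.0, 0.0]
late_mismatch = [0.0, 0.0, 0.0, 1.0]

# The start of the query matters most here; the end is ignored entirely.
w = [4.0, 1.0, 1.0, 0.0]
d_early = weighted_euclidean(q, early_mismatch, w)
d_late = weighted_euclidean(q, late_mismatch, w)
```

With uniform weights both candidates would be at distance 1 from the query. The weights make the early mismatch count heavily and the late one not at all.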
Weighted Distance Measures (3/3) • How do we set the weights? • One possibility: relevance feedback, which is the reformulation of a query in response to feedback provided by the user on the results of a previous query [Figure: feedback loop Search → Display Results → Gather Feedback → Update Weights; the term vector [Jordan, Cow, Bull, River] starts with term weights [1, 1, 1, 1] and ends with [1.1, 1.7, 0.3, 0.9]]
Indexing Time Series (1/6) • We have seen techniques for assessing the similarity of two time series • However, we have not addressed the problem of finding the best match to a query in a large database … • The obvious solution, to retrieve and examine every item (sequential scanning), simply does not scale to large datasets • We need some way to index the data
Indexing Time Series (2/6) • We can project a time series of length n into n-dimensional space • The first value in C is the X-axis coordinate, the second value in C is the Y-axis coordinate, etc. • One advantage of doing this is that we have abstracted away the details of “time series”; now all query processing can be imagined as finding points in space …
Indexing Time Series (3/6) • We can project the query time series Q into the same n-dimensional space and simply look for the nearest points • The problem is that we have to look at every point to find the nearest neighbor
Indexing Time Series (4/6) • The Minkowski metrics have simple geometric interpretations [Figure: shapes of the query region under the Euclidean, Weighted Euclidean, Manhattan, and Max metrics]
Indexing Time Series (5/6) • We can group clusters of datapoints with “boxes” called Minimum Bounding Rectangles (MBR) • We can further recursively group MBRs into larger MBRs [Figure: datapoints grouped into MBRs R1 to R9]
Indexing Time Series (6/6) • These nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree). Examples include R-tree, Hybrid-tree, etc. [Figure: tree whose root holds R10, R11, R12; these point to R1 through R9, which are data nodes containing points]
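An MBR helps a search because the distance from a query to the rectangle can never exceed the distance to any point inside it. If the rectangle is already too far away, the whole node can be skipped. The sketch below shows the two building blocks with hypothetical function names; it is not an R-tree implementation.

```python
import math

def bounding_rectangle(points):
    """Return (lows, highs): the per-dimension minimum and maximum of the points."""
    lows = [min(coords) for coords in zip(*points)]
    highs = [max(coords) for coords in zip(*points)]
    return lows, highs

def min_distance_to_rectangle(query, rect):
    """Smallest Euclidean distance from the query to any point of the rectangle."""
    lows, highs = rect
    total = 0.0
    for x, lo, hi in zip(query, lows, highs):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
        # A coordinate inside [lo, hi] contributes nothing.
    return math.sqrt(total)

cluster = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
rect = bounding_rectangle(cluster)

inside = min_distance_to_rectangle((2.0, 2.0), rect)
outside = min_distance_to_rectangle((6.0, 7.0), rect)
```

A nearest-neighbor search that has already found a point closer than `outside` can discard this rectangle without reading its points.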
Dimensionality Curse (1/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • For the one dimensional space, the answer is clearly 2
Dimensionality Curse (2/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • For the two dimensional case, the answer is 8
Dimensionality Curse (3/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • For the three dimensional case, the answer is 26
Dimensionality Curse (4/4) • If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match? • More generally, in n-dimensional space we must examine $3^n - 1$ MBRs; n = 21 → 10,460,353,202 MBRs • This is known as the curse of dimensionality
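The counts 2, 8 and 26 follow the pattern 3^n - 1. Assuming the MBRs tile space like a regular grid, each dimension offers three positions relative to the query's cell (before, same, after), and the query's own cell is excluded. A quick check with a hypothetical helper:

```python
def neighboring_cells(n):
    """Number of grid cells adjacent to a given cell in n dimensions."""
    return 3 ** n - 1

counts = {n: neighboring_cells(n) for n in (1, 2, 3, 21)}
```

For n = 21 this is 3^21 - 1 = 10,460,353,202.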
Spatial Access Methods • We can use Spatial Access Methods like the R-tree to index our data, but … • The performance of R-trees degrades exponentially with the number of dimensions. Somewhere above 6-20 dimensions the R-tree degrades to linear scanning • Often we want to index time series with hundreds, perhaps even thousands of features
GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (1/8) • Establish a distance metric from a domain expert • Produce a dimensionality reduction technique that reduces the dimensionality of the data from n to N, where N can be efficiently handled by your favorite SAM • Produce a distance measure defined on the N-dimensional representation of the data, and prove that it obeys $D_{indexspace}(A,B) \le D_{true}(A,B)$ (lower bounding lemma) • Plug into an off-the-shelf SAM
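The lower bounding lemma makes a filter-and-refine search safe: an object rejected in the reduced space cannot be a true answer, so there are no false dismissals. The sketch below uses the crudest possible reduction, keeping only the first N coordinates, purely for illustration. Truncation trivially lower-bounds Euclidean distance, but it is an assumption of this example and not the reduction technique proposed in the slides. A linear scan stands in for the SAM.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reduce_dims(obj, keep):
    """Hypothetical reduction: keep only the first `keep` coordinates."""
    return obj[:keep]

def range_query(query, database, radius, keep):
    """Filter in the reduced space, then verify survivors in the original space."""
    q_small = reduce_dims(query, keep)
    candidates = [obj for obj in database
                  if euclidean(q_small, reduce_dims(obj, keep)) <= radius]
    hits = [obj for obj in candidates if euclidean(query, obj) <= radius]
    return candidates, hits

database = [(0.0, 0.0, 0.0), (0.5, 0.0, 3.0), (2.0, 2.0, 2.0)]
candidates, hits = range_query((0.0, 0.0, 0.5), database, radius=1.0, keep=2)
```

The second object survives the filter but fails the exact check. Such false alarms cost extra work but never produce wrong answers, whereas a reduced distance that could exceed the true distance might discard a real match.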
GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (2/8) • We have 6 objects in 3-D space. We issue a query to find all objects within 1 unit of the point (-3, 0, -2) [Figure: 3-D scatter plot of the six objects, labeled A to F]