The Landmark Model: An Instance Selection Method for Time Series Data

C.-S. Perng, S. R. Zhang, and D. S. Parker • Instance Selection and Construction for Data Mining, Chapter 7, pp. 113-130 • Cho, Dong-Yeon

Introduction • Complexity • Patterns: continuous time series segments with particular features • The reflection of events in time series is better represented by patterns. • The complexity of processing patterns • The number of all possible segments for a time series of length N is N(N+1)/2. • A simple inspection of each of these segments takes O(N^3) time in total, since each segment has up to N points. • Good instance selection algorithms are especially helpful here, since they can greatly reduce complexity by reducing the volume of data.
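The segment count above can be checked by brute force. This is a minimal sketch, assuming a segment is any contiguous, non-empty run of points; the helper name `all_segments` is illustrative only.

```python
def all_segments(series):
    """Yield every contiguous, non-empty segment of the series."""
    n = len(series)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield series[start:end]


series = [1, 2, 3, 4, 3]
n = len(series)
segments = list(all_segments(series))

# The number of segments matches N(N+1)/2.
segment_count = len(segments)

# Touching every point of every segment visits N(N+1)(N+2)/6 points,
# which grows as O(N^3).
total_points = sum(len(s) for s in segments)
```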

Similarity Model • Euclidean distance does not match human intuition. • Example: 1,2,3,4,3 and 3,4,5,6,5 differ only by a vertical shift of 2, yet their Euclidean distance is large. • Previous works • None of these proposed techniques supports a similarity model that can both capture the similarity and support efficient pattern querying of time series.
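The two example sequences can be compared directly. The sketch below is only an illustration: the function name `euclidean` and the choice to undo the offset by adding a constant are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


s1 = [1, 2, 3, 4, 3]
s2 = [3, 4, 5, 6, 5]

# Every point differs by 2, so the raw distance is sqrt(5 * 4).
d_raw = euclidean(s1, s2)

# After shifting s1 up by 2 the two sequences coincide.
s1_shifted = [y + 2 for y in s1]
d_shifted = euclidean(s1_shifted, s2)
```

The raw distance is nonzero even though the shapes are identical, which is the mismatch with intuition that the slide points out.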

Pattern Representation • Two formats for temporal association rules to verify the cause-effect relation • Forward association: C1,…,Cn ⇒ E1,…,Em • Backward association: C1,…,Cn ⇐ E1,…,Em • Association rules can be either formulated as hypotheses and verified with data, or be discovered by a data mining process. • It is still not clear what kind of segments can represent events. • What is the basic vocabulary for spelling association rules?

Noise Removal and Data Smoothing • Commonly-used smoothing techniques, such as moving averages, often lag or miss the most significant peaks and bottoms. • These peaks and bottoms can be very meaningful, and smoothing or removing them can lose a great deal of information. • Little previous work takes smoothing as an integral part of the process of pattern definition, index construction, and query processing.

The Landmark Data Model and Similarity Model • The Landmark Concept • Episodic memory: humans and animals depend on landmarks in organizing their spatial memory • Landmarks: (times, events) • Using landmarks instead of the raw data for processing • A point is an N-th order landmark of a curve if the N-th order derivative is 0 at that point. • Local maxima, local minima, and inflection points • Tradeoff • The more different types of landmarks in use, the more accurately a time series will be represented. • Using fewer landmarks will result in storage savings and smaller index trees.
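For sampled data, first-order landmarks (local maxima and minima) can be found where the slope between neighbouring points changes sign. This is a minimal sketch under stated assumptions: the function name is hypothetical, both endpoints are kept as landmarks, and flat plateaus are ignored. None of these choices is taken from the source.

```python
def first_order_landmarks(ys):
    """Return (index, value) pairs for local maxima and minima.

    Assumptions of this sketch: endpoints are always kept, and a point
    counts as an extremum only if the slope strictly changes sign.
    """
    n = len(ys)
    if n == 0:
        return []
    landmarks = [(0, ys[0])]
    for i in range(1, n - 1):
        left = ys[i] - ys[i - 1]    # slope coming into point i
        right = ys[i + 1] - ys[i]   # slope leaving point i
        if left * right < 0:        # sign change: local max or min
            landmarks.append((i, ys[i]))
    if n > 1:
        landmarks.append((n - 1, ys[n - 1]))
    return landmarks


series = [1, 2, 3, 4, 3, 5, 6, 2]
landmarks = first_order_landmarks(series)
```

Here eight raw points reduce to five landmarks, which illustrates the storage side of the tradeoff.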

Stock market data • Almost half of the record • The normalized error is reasonably small when the curve is reconstructed from the landmarks. • The more volatile the time series, the less significant the higher-order landmarks.

Smoothing • Minimal Distance/Percentage Principle (MDPP) • A minimal distance D and a minimal percentage P • Remove landmarks (xi, yi) and (xi+1, yi+1) if
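A minimal sketch of MDPP in Python follows. It assumes the removal condition usually given for this method, namely x_{i+1} - x_i < D and |y_{i+1} - y_i| / ((|y_i| + |y_{i+1}|) / 2) < P. That condition, the single left-to-right pass, and the function name `mdpp` are all assumptions of this sketch rather than details confirmed by the slide.

```python
def mdpp(landmarks, D, P):
    """Drop adjacent landmark pairs that are close in time and amplitude.

    landmarks: list of (x, y) pairs sorted by x.
    Assumed condition: x1 - x0 < D and |y1 - y0| < P * (|y0| + |y1|) / 2.
    """
    kept = []
    i = 0
    n = len(landmarks)
    while i < n:
        if i + 1 < n:
            (x0, y0), (x1, y1) = landmarks[i], landmarks[i + 1]
            mean_amp = (abs(y0) + abs(y1)) / 2
            close_in_time = (x1 - x0) < D
            # Written as a product to avoid dividing by zero when
            # both amplitudes are 0 (such a pair is then kept).
            small_change = abs(y1 - y0) < P * mean_amp
            if close_in_time and small_change:
                i += 2  # remove both landmarks of the pair
                continue
        kept.append(landmarks[i])
        i += 1
    return kept


raw = [(0, 10.0), (5, 20.0), (6, 19.5), (7, 20.0), (15, 8.0)]
smoothed = mdpp(raw, D=3, P=0.05)
```

In this example the small dip between x = 5 and x = 6 is removed as a pair, while the large moves at the start and end of the series survive.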

The effect of the MDPP

Normalized error generated by the MDPP and DFT

Transformations • Six kinds of transformations • Shifting: SHk(f) such that SHk(f(t))=f(t)+k where k is a constant. • Uniform Amplitude Scaling: UASk(f) such that UASk(f(t))=kf(t) where k is a constant. • Uniform Time Scaling: UTSk(f) such that UTSk(f(t))=f(kt) where k is a positive constant. • Uniform Bi-scaling: UBSk(f) such that UBSk(f(t))=kf(t/k) where k is a positive constant. • Time Warping: TWg(f) such that TWg(f(t))=f(g(t)) where g is positive and monotonically increasing. • Non-uniform Amplitude Scaling: NASg(f) such that NASg(f(t))=g(t) where for every t, g′(t)=0 if and only if f′(t)=0.
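Five of these transformations can be written as small higher-order functions that take a function f and return the transformed function. This is an illustrative sketch: the Python names are hypothetical, and non-uniform amplitude scaling is left out because it is defined by a condition on g rather than by a formula in f.

```python
def shift(f, k):
    """SH_k: add a constant k to every value."""
    return lambda t: f(t) + k


def uniform_amplitude_scale(f, k):
    """UAS_k: multiply every value by a constant k."""
    return lambda t: k * f(t)


def uniform_time_scale(f, k):
    """UTS_k: rescale the time axis by a positive constant k."""
    return lambda t: f(k * t)


def uniform_bi_scale(f, k):
    """UBS_k: scale amplitude by k and stretch time by k (k positive)."""
    return lambda t: k * f(t / k)


def time_warp(f, g):
    """TW_g: re-time f through a positive, monotonically increasing g."""
    return lambda t: f(g(t))


f = lambda t: t * t

# Transformations compose like ordinary functions.
scaled_then_shifted = shift(uniform_amplitude_scale(f, 3), 2)
```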

The more transformations a similarity model includes, the more powerful it is.

These transformations can be composed to form new transformations. • The composition order is flexible: • The composition is idempotent: • Two time series are defined to be similar if they differ only by a transform.

Landmark Similarity • Dissimilarity measure • Given two sequences of landmarks L = L1,…,Ln and L′ = L′1,…,L′n where Li = (xi, yi) and L′i = (x′i, y′i), the distance between the k-th landmarks is defined by where • The distance between the two sequences is • We define

A landmark similarity measure is a binary relation on time series segments defined by a 5-tuple LSM = ⟨D, P, T, ε_time, ε_amp⟩. • Given two time series sequences s1 and s2, let L1 and L2 be the landmark sequences after MDPP(D, P) smoothing. • (s1, s2) ∈ LSM if and only if |L1| = |L2| and there exist two parameterized transformations T1 and T2 of T whose dissimilarity satisfies δ_time(T1(L1), T2(L2)) < ε_time and δ_amp(T1(L1), T2(L2)) < ε_amp.

Data Representation • Family of Time Series Segments • Equivalent under the six transformations • Replacing naïve landmark coordinates with various features of landmarks that are invariant under these transformations • F = {y, h, v, hr, vr, vhr, pv} • h_i = x_i - x_{i-1} • v_i = y_i - y_{i-1} • hr_i = h_{i+1} / h_i • vr_i = v_{i+1} / v_i • vhr_i = v_i / h_i • pv_i = v_i / y_i • Invariant features under transformations
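The feature definitions translate directly into code. The sketch below is a hedged illustration: the function name and the dictionary layout are hypothetical, and it assumes consecutive landmarks differ in both x and y and that no y value is 0, so no division by zero occurs.

```python
def landmark_features(landmarks):
    """Compute h, v, hr, vr, vhr, pv from a list of (x, y) landmarks.

    h[j] and v[j] correspond to landmark i = j + 1, because the first
    landmark has no predecessor.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    n = len(landmarks)
    h = [xs[i] - xs[i - 1] for i in range(1, n)]
    v = [ys[i] - ys[i - 1] for i in range(1, n)]
    hr = [h[j + 1] / h[j] for j in range(len(h) - 1)]
    vr = [v[j + 1] / v[j] for j in range(len(v) - 1)]
    vhr = [v[j] / h[j] for j in range(len(h))]
    pv = [v[j] / ys[j + 1] for j in range(len(v))]
    return {"y": ys, "h": h, "v": v, "hr": hr, "vr": vr, "vhr": vhr, "pv": pv}


landmarks = [(0, 2.0), (2, 6.0), (3, 4.0), (7, 8.0)]
base = landmark_features(landmarks)

# Shifting every y by a constant leaves v and vr unchanged.
shifted = landmark_features([(x, y + 10) for x, y in landmarks])

# Scaling every x by a positive constant leaves hr unchanged.
time_scaled = landmark_features([(3 * x, y) for x, y in landmarks])
```

The two checks at the end only show the simplest cases: differences cancel an added constant, and ratios cancel a common scale factor.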

Conclusion • Landmark Model • An instance selection system for time series • This integrates similarity measures, data representation and smoothing techniques in a single framework. • Minimal Distance/Percentage Principle (MDPP): The smoothing method for the Landmark Model • This also supports a generalized similarity model which can ignore differences corresponding to six transformations. • Intuitive to humans
