Similarity Search on Uncertain Time Series

Similarity Search on Uncertain Time Series
Mi-Yen Yeh, Institute of Information Science, Academia Sinica (中央研究院資訊科學研究所 葉彌妍)

Outline
• The field study
• Our two works
  • PROUD: A PRObabilistic Approach to Processing Similarity Queries over Uncertain Data Streams [EDBT'09]
  • Random Error Reduction in Similarity Search on Time Series: A Statistical Approach [ICDE'12]
• Conclusions
M.-Y. Yeh, Similarity Search on Uncertain Time Series, @NCKU CSIE 2013/05/24

Time Series Data
• A sequence of data at consecutive time instants, for example:
  • hourly sensor readings of many sensors,
  • daily stock trading data in the financial market.
[Figure: a time series, value plotted against time]

Time Series Mining and Analysis
• Pattern Matching
• Classification
• Clustering, and so on.
[Figure: similarity-based classification of an unlabeled series ("What class is this?") among Class A, Class B, and Class C]

Where Does Uncertainty Come From?
• To protect privacy, people deliberately introduce disturbance into confidential data before further processing.
• In a sensor network, sensor readings are disturbed by noise generated by the equipment itself or by other exterior influences.

We Study the Uncertain Time Series
[Figure: an uncertain time series S̃_u, data value plotted against time]

Uncertain Distance Computation: How?
[Figure: two uncertain time series S̃_u and S̃_v over the time window from ts to te]

A Related Work
Pattern matching over cloaked time series [Lian et al., ICDE'08]
• Query: Pr{ dist(Q, Ti) ≤ r } ≥ p, where Q is the query pattern, Ti is a time series, r is the query radius, and p is a user-given threshold.
• Extracts statistics such as the mean and variance from the cloaked time series, and further speeds up the matching process by taking advantage of R-tree indexing.
• Focuses on the efficiency of pattern matching.

Our Concern
We take into consideration not only the speed of answering queries but also the quality of the answers. This motivates our work PROUD [Yeh et al. 2009].

Problem Statement: Deterministic vs. Probabilistic Similarity Queries
• Deterministic threshold queries:
• Probabilistic …
Euclidean distance
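As a minimal sketch of the deterministic case, assuming the usual definition (return every series whose Euclidean distance to the reference is at most r). The function names and the dictionary layout of the candidates are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def deterministic_threshold_query(reference, candidates, r):
    """Names of all candidate series within distance r of the reference.

    `candidates` maps a (hypothetical) series name to its list of values.
    """
    return [
        name
        for name, series in candidates.items()
        if euclidean_distance(reference, series) <= r
    ]
```

For example, `deterministic_threshold_query([0, 0], {"a": [3, 4], "b": [0, 1]}, r=2)` keeps only `"b"`. With uncertain values this yes/no answer becomes a random event, which is what the probabilistic variant addresses.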

Trade-off between False Alarm and False Dismissal
• By adjusting τ, we can control the trade-off between false alarm and false dismissal.
• In sensor applications (temperature reading, vehicle speed detection),
  • A false negative (not discovering speeding or equipment over-heating) is LESS DESIRED.
• In mobile network applications (where location privacy is an important issue),
  • A false negative (i.e., overprotecting) tends to be more ACCEPTABLE.

The Model of Uncertain Time Series
• A general model: the value of an uncertain series S̃_u at timestamp t is a random variable with mean μ_ut and deviation σ_ut.
[Figure: the uncertain series S̃_u with its mean μ_ut marked, data value plotted against time]

Uncertain Distance Computation
What is the distribution of this random variable?
[Figure: the difference Dt between S̃_u and S̃_v at each timestamp, over the time window from ts to te]

Central Limit Theorem
• The normal form of the variable : has a limiting cumulative distribution function that approaches a normal distribution.

Expectation of the Uncertain Distance
Var(X) = E(X²) − (E(X))²
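A hedged reconstruction of how this identity is typically applied to the uncertain distance. It assumes that D_t is the difference between the two uncertain values at timestamp t, that the two series are independent, that the second series has mean μ_vt and deviation σ_vt (names chosen by analogy with μ_ut and σ_ut), and that the distance of interest is the squared Euclidean distance over the window from ts to te.

```latex
\begin{aligned}
E(D_t) &= \mu_{ut} - \mu_{vt}, \qquad
\operatorname{Var}(D_t) = \sigma_{ut}^2 + \sigma_{vt}^2, \\
E(D_t^2) &= \operatorname{Var}(D_t) + \bigl(E(D_t)\bigr)^2
          = \sigma_{ut}^2 + \sigma_{vt}^2 + (\mu_{ut} - \mu_{vt})^2, \\
E\Bigl(\sum_{t=t_s}^{t_e} D_t^2\Bigr)
       &= \sum_{t=t_s}^{t_e} \Bigl[\sigma_{ut}^2 + \sigma_{vt}^2 + (\mu_{ut} - \mu_{vt})^2\Bigr].
\end{aligned}
```

The second line is the identity Var(X) = E(X²) − (E(X))² rearranged with X = D_t; the last line follows from linearity of expectation.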

Variance of the Uncertain Distance (1/3)
To compute , what is the value of ?
Delta method … Taylor expansion

Variance of the Uncertain Distance (2/3)
Therefore,

Variance of the Uncertain Distance (3/3)
Now we can compute , which is:

Candidate Selection
[Figure: cumulative distribution function F(x) of the normal distribution, x from −4 to 4. The threshold τ on the F(x) axis corresponds to r-limit on the x axis: F(x) < τ where rnorm < r-limit, and F(x) > τ where rnorm > r-limit.]
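The selection rule in the figure can be sketched as follows. This is not the talk's own derivation (which uses the delta method): the sketch assumes independent Gaussian differences D_t, for which E(D_t²) = m² + v and Var(D_t²) = 4m²v + 2v² hold exactly, with m the mean and v the variance of D_t. The function names, and the definition of rnorm as the standardized squared radius, are assumptions.

```python
import math
from statistics import NormalDist


def squared_distance_moments(mu_u, mu_v, sigma_u, sigma_v):
    """Mean and variance of sum_t D_t^2, assuming independent Gaussian D_t."""
    mean_total = 0.0
    var_total = 0.0
    for mu_ut, mu_vt, s_ut, s_vt in zip(mu_u, mu_v, sigma_u, sigma_v):
        m = mu_ut - mu_vt            # E(D_t)
        v = s_ut ** 2 + s_vt ** 2    # Var(D_t) for independent series
        mean_total += m * m + v
        var_total += 4.0 * m * m * v + 2.0 * v * v
    return mean_total, var_total


def is_candidate(mu_u, mu_v, sigma_u, sigma_v, r, tau):
    """True if Pr(distance <= r) >= tau under the normal approximation.

    tau must lie strictly between 0 and 1.
    """
    mean_total, var_total = squared_distance_moments(mu_u, mu_v, sigma_u, sigma_v)
    if var_total == 0.0:
        # No uncertainty at all: fall back to the deterministic test.
        return mean_total <= r * r
    r_norm = (r * r - mean_total) / math.sqrt(var_total)
    r_limit = NormalDist().inv_cdf(tau)
    return r_norm >= r_limit
```

Comparing rnorm with r-limit avoids evaluating the normal CDF per candidate: r-limit depends only on τ and can be computed once.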

Progressive Pruning for Time-Invariant Uncertain Variances (1/2)
• When the variance of the uncertainty is time invariant for each stream, i.e., then
• Progressive pruning for the general case is left as future work.

Progressive Pruning for Time-Invariant Uncertain Variances (2/2)
We can guarantee that rnorm is non-increasing during the updates of E(.) and Var(.).
[Figure: uncertain series S̃_ref and S̃_u with their difference Dt over the time window from ts to te, next to the normal CDF F(x) with τ, rnorm, and r-limit marked]
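A rough sketch of how such pruning could look. It updates the running mean and variance one timestamp at a time and stops as soon as rnorm falls below r-limit. The early exit takes the non-increasing guarantee stated on the slide as given and does not verify it; under the Gaussian moment formulas assumed here (which may differ from the talk's), the exit is only straightforward to justify for τ ≥ 0.5. All names are hypothetical.

```python
import math
from statistics import NormalDist


def progressive_match(mu_u, mu_ref, sigma_u, sigma_ref, r, tau):
    """Return (is_candidate, timestamps_processed).

    sigma_u and sigma_ref are single numbers (time-invariant deviations).
    Per-timestamp moments assume independent Gaussian differences.
    """
    r_limit = NormalDist().inv_cdf(tau)
    v = sigma_u ** 2 + sigma_ref ** 2   # Var(D_t), the same at every t
    mean_total = 0.0
    var_total = 0.0
    steps = 0
    for a, b in zip(mu_u, mu_ref):
        steps += 1
        m = a - b
        mean_total += m * m + v
        var_total += 4.0 * m * m * v + 2.0 * v * v
        if var_total > 0.0:
            r_norm = (r * r - mean_total) / math.sqrt(var_total)
        else:
            # No uncertainty so far: the comparison is deterministic.
            r_norm = math.inf if mean_total <= r * r else -math.inf
        if r_norm < r_limit:
            # Assumed safe only because rnorm is taken to be non-increasing.
            return False, steps
    return True, steps
```

A candidate far from the reference is typically rejected after a few timestamps, which is where the computational saving would come from.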

Performance Study
• We compare PROUD with a deterministic method, referred to as Det.
• Settings:
  • Given σ_u, at each timestamp t, we randomly draw a number from either a uniform or a normal distribution with mean = Su[t] and variance = σ_u as an uncertain value .
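A minimal sketch of this kind of data generation, using only the Python standard library. It treats the spread parameter as a standard deviation, which is an assumption (the slide calls it a variance), and the function name `perturb` is hypothetical. For the uniform case, an interval of half-width σ√3 around the mean has standard deviation σ.

```python
import math
import random


def perturb(series, sigma, distribution="normal", rng=None):
    """Draw one uncertain value per timestamp, centred on the true value.

    sigma is treated as a standard deviation (an assumption).
    """
    rng = rng or random.Random()
    if distribution == "normal":
        return [rng.gauss(x, sigma) for x in series]
    if distribution == "uniform":
        # Uniform on [x - h, x + h] has variance h^2 / 3, so h = sigma * sqrt(3).
        half_width = sigma * math.sqrt(3.0)
        return [rng.uniform(x - half_width, x + half_width) for x in series]
    raise ValueError("distribution must be 'normal' or 'uniform'")
```

Passing a seeded `random.Random` instance makes an experiment repeatable.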

False Alarm Ratio vs. Miss Ratio
[Figure: results for τ = 0.001, τ = 0.01, τ = 0.1, τ = 0.5, and τ = 0.9]

Miss Ratio at Different τ
[Figure: miss ratio plotted against the τ value]

Computation Cost at Different τ

Summary of PROUD
• Building on results from probability theory, PROUD can deal with similarity queries over uncertain (streaming) time series.
• We showed how candidates can be pruned progressively.
• The results show that PROUD provides a flexible trade-off between false alarms and misses by controlling a threshold, while maintaining a similar computation cost.

What Came After?
• [Aßfalg'09 SSDBM]
  • Probabilistic queries
  • Multiple observations at one timestamp
• [Sarangi'10 KDD]
  • Designs a new distance measure that converges to the Euclidean/DTW distance when the magnitude of the errors (uncertainty) is small.
  • The distribution of the error must be known in advance.

Our New Perspective
• Instead of modeling the uncertainty and the distance distribution, can we try to remove the uncertainty?
• This motivates our second work: MISQ [Wu et al., 2012].
  • Deterministic mean distance query
  • Does not rely on any knowledge of the distribution of the error
  • Only one observation at a time is required
  • Type I error is controlled

Errors in Measurement
• Can be categorized into two types:
  • Systematic errors: predictable, and can be removed by calibrating the measurement equipment.
  • Random errors: inherently unpredictable, have zero expected value, and are always present in a measurement reading.

Random Errors in Time Series
• Readings in time series may contain inherent random errors due to causes like dynamic error, drift, noise, hysteresis, digitalization error, and limited sampling frequency.
• Random errors may affect the quality of time series analysis substantially.
• Taking similarity search as an example, we develop MISQ, a statistical approach for random error reduction in time series analysis.

Shall We Reduce Random Errors?
• The 1NN classification error rates on 20 real data sets in [1], for MISQ and for Euclidean distance without considering random errors.
• 9 wins, 7 ties, 4 losses: reducing random errors is beneficial in many cases!
[1] Keogh et al., the UCR Time Series Classification/Clustering homepage, http://www.cs.ucr.edu/~eamonn/time_series_data/

Modeling and Reducing Random Errors in Similarity Search on Time Series: The Main Concept

Time Series with Random Errors
Assumptions:
• μ_u(t) is unobserved, smooth (i.e., ), uniformly bounded.
• σ_u is an unknown constant.
• ε_u(t) is i.i.d. with mean 0 and variance 1.
• We only have one observation value, Su(t), at each time t.

Mean Distance as Similarity Measurement
• We want to compute the mean distance, which should be the distance excluding the effect of random errors.
• All the time series are independent of each other.

Two Similarity Queries
Given a reference series and a set of time series T, we retrieve those time series such that
• for the exact match query
• for the threshold similarity query
where r > 0 is a user-given distance threshold.

Intuition for the Query Processing
• μQ and μu are unobserved values in practice. Thus, we cannot compute D(μQ, μu) directly.
• Our intuition:
  • Use only the observation values to estimate it,
  • Apply statistical hypothesis testing to determine whether a candidate time series qualifies for a query at a certain confidence level.

The Hypothesis Testing Procedure
The null hypothesis of
• Exact match query
• Threshold similarity query
• We use ≤ instead of = for the convenience of the testing.
The query retrieves those series for which the null hypothesis is not rejected.

The Statistical Errors
• Type I error (we think this is more important in similarity search!):
  • Rejects a true H0
  • A low type I error rate implies a high recall.
• Type II error:
  • Fails to reject a false H0
  • A low type II error rate implies a high precision.
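The link between the two error types and recall/precision can be made concrete with a small counting sketch. It assumes that "H0 is true" means the series truly matches the query and that a series is retrieved exactly when H0 is not rejected; the function name and the returned dictionary keys are hypothetical.

```python
def error_rates(h0_true, retrieved):
    """Type I/II error rates, recall and precision from per-series outcomes.

    h0_true[i]   : the series truly satisfies the query (H0 holds).
    retrieved[i] : the test did not reject H0, so the series is returned.
    Ratios with an empty denominator are reported as None.
    """
    pairs = list(zip(h0_true, retrieved))
    tp = sum(1 for t, g in pairs if t and g)
    fn = sum(1 for t, g in pairs if t and not g)      # type I errors
    fp = sum(1 for t, g in pairs if not t and g)      # type II errors
    tn = sum(1 for t, g in pairs if not t and not g)

    def ratio(num, den):
        return num / den if den else None

    return {
        "type1": ratio(fn, tp + fn),
        "type2": ratio(fp, fp + tn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
    }
```

By construction recall equals one minus the type I error rate, whereas precision only improves as type II errors become rarer; it also depends on how many true matches exist.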

Query Processing with Type I Error Controlled
Given a reference series, a set of time series T, and a user-specified confidence level (1 − α) ∈ [0,1], we retrieve all time series such that
• for exact match query:
• for threshold similarity query:
where LCI(.) is the lower bound of the confidence interval of the estimated mean distance.
We can then control the type I error rate to be no greater than α.

Mean Distance Estimator, and Its Reliability

Mean Distance Estimator and Its Variance
Parameter-free, difference-based estimator:
where l is the length of a time series.
where
Only the observation values are used!
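To illustrate what a parameter-free, difference-based estimator can look like, here is a sketch built on the classical first-difference (Rice) noise-variance estimator. It is one common construction and not necessarily the estimator used in MISQ. It assumes each observed series is a smooth mean curve plus independent zero-mean noise of constant variance, and that the mean distance is the average squared difference between the two mean curves; the function names are hypothetical.

```python
def rice_variance(series):
    """First-difference estimate of the noise variance of one series.

    For a smooth mean curve, consecutive differences are dominated by noise,
    and each squared difference has expectation of roughly 2 * sigma^2.
    """
    l = len(series)
    if l < 2:
        raise ValueError("need at least two observations")
    total = sum((series[t + 1] - series[t]) ** 2 for t in range(l - 1))
    return total / (2.0 * (l - 1))


def mean_distance_estimate(s_q, s_u):
    """Estimate the mean squared distance between the two mean curves.

    The observed mean squared difference is inflated by both noise variances,
    so their estimates are subtracted. The result can be slightly negative.
    """
    if len(s_q) != len(s_u):
        raise ValueError("series must have the same length")
    observed = sum((a - b) ** 2 for a, b in zip(s_q, s_u)) / len(s_q)
    return observed - rice_variance(s_q) - rice_variance(s_u)
```

Nothing here needs a bandwidth or a noise model beyond zero mean and constant variance, which is the sense in which such estimators are called parameter-free.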

Asymptotic Distribution of the Mean Distance Estimator
The asymptotic distribution of the mean distance estimator can be well approximated by a normal distribution.

Query Processing by Statistical Testing

Compute LCI(D(μQ, μu))
Suppose the time series length l is large enough. Given a reference series, a set of time series T, and a user-specified confidence level (1 − α) ∈ [0,1], we retrieve all time series such that
• for exact match query:
• for threshold similarity query:
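A hedged sketch of the decision step, assuming the estimator is approximately normal, that a one-sided bound at level 1 − α is intended, and that a series is retrieved when the lower bound does not exceed the threshold (0 for an exact match). The estimate and its standard error are taken as given inputs; the function names are hypothetical, and r must be on the same scale as the estimate.

```python
from statistics import NormalDist


def lower_confidence_bound(estimate, std_error, alpha):
    """One-sided lower bound of the estimate at confidence level 1 - alpha.

    alpha must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return estimate - z * std_error


def retrieve(estimate, std_error, alpha, r=0.0):
    """Keep the series unless the data reject H0: mean distance <= r."""
    return lower_confidence_bound(estimate, std_error, alpha) <= r
```

With `alpha=0.05`, an estimate of 0.1 with standard error 0.1 is still retrieved by an exact match query, because its lower bound is below zero, while an estimate of 1.0 with the same standard error is not.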

Type I Error of an Exact Match Query is Controlled
Theorem 3 in our paper.
• Exact match query: D(μQ, μu) = 0 (or ≤ 0)
• So the type I error rate is

Type I Error of a Threshold Similarity Query is Controlled
Theorem 4 in our paper.
• Threshold similarity query: D(μQ, μu) ≤ r
• So the type I error rate is

Experiment Results

Settings
• Methods compared:
  • Confidence band (for exact match only): statistical testing on (μu − μQ) = 0.
    • Done by testing for no effect in nonparametric regression via kernel smoothing. Retrieve those series with p-value > α.
  • Moving average + the error control method of MISQ
    • Movavg_5: with a bandwidth of 5.
    • Movavg_cv: with a bandwidth determined by cross validation, minimizing the leave-one-out residual sum of squares.
• Datasets:
  • 20 real data sets in the UCR Time Series Classification/Clustering data repository. Keogh et al., http://www.cs.ucr.edu/~eamonn/time_series_data/
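For reference, a moving-average smoother of the kind used as a baseline can be sketched as below. Reading "bandwidth = 5" as a centred window of five points, and shrinking the window at the two ends of the series, are both assumptions; the function name is hypothetical.

```python
def moving_average(series, bandwidth=5):
    """Centred moving average; the window is truncated at the series ends."""
    if bandwidth < 1 or bandwidth % 2 == 0:
        raise ValueError("bandwidth must be a positive odd integer")
    half = bandwidth // 2
    smoothed = []
    for t in range(len(series)):
        lo = max(0, t - half)
        hi = min(len(series), t + half + 1)
        window = series[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A cross-validated variant would repeat this for several bandwidths and keep the one with the smallest leave-one-out residual sum of squares, which is not shown here.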

How We Do Exact Match Query
To test type I error rate:
[Figure: the original time series Su is blurred with noise of uncertainty ratio = r to make 100 blurred time series]
To test type II error rate:

Results of Exact Match Queries (α = 0.05)
• MISQ controlled the type I error rate better than the confidence band method.
• Movavg_5 and Movavg_cv hardly controlled the type I error rates, although they had very small type II error rates.
• MISQ sacrificed the type II error rate to control the type I error rate, since a low type I error rate is more important in many applications.
