Similarity Search on Uncertain Time Series
Mi-Yen Yeh (葉彌妍)
Institute of Information Science, Academia Sinica (中央研究院資訊科學研究所)
Outline
• The field study
• Our two works
  • PROUD: A PRObabilistic Approach to Processing Similarity Queries over Uncertain Data Streams [EDBT’09]
  • Random Error Reduction in Similarity Search on Time Series: A Statistical Approach [ICDE’12]
• Conclusions
M.-Y. Yeh, Similarity Search on Uncertain Time Series, @NCKU CSIE 2013/05/24
Time Series Data
• A sequence of data values at consecutive time instants, for example:
  • hourly readings from many sensors,
  • daily stock trading data in the financial market.
[Figure: a time series plotted as value against time]
Time Series Mining and Analysis
• Pattern matching
• Classification
• Clustering, and so on.
[Figure: similarity between series; an unlabeled series ("What class is this?") shown alongside Class A, Class B, and Class C]
Where Does Uncertainty Come From?
• To protect privacy, people deliberately add disturbance to confidential data before further processing.
• In a sensor network, sensor readings are corrupted by noise generated by the equipment itself or by external influences.
We Study Uncertain Time Series
[Figure: an uncertain time series S̃u, data value against time]
Uncertain Distance Computation: How?
[Figure: two uncertain time series S̃u and S̃v over the time interval from ts to te]
A Related Work
Pattern matching over cloaked time series [Lian et al., ICDE’08]
• Query: Pr{ dist(Q, Ti) ≤ r } ≥ p, where Q is the query pattern, Ti is a time series, r is the query radius, and p is a user-given threshold.
• Extracts statistics such as the mean and variance from the cloaked time series, and further speeds up the matching process with R-tree indexing.
• Focuses on the efficiency of pattern matching.
Our Concern
We consider not only the speed of answering queries but also the quality of the answers. This motivates our work PROUD [Yeh et al. 2009].
Problem Statement: Deterministic vs. Probabilistic Similarity Queries
• Deterministic threshold queries:
• Probabilistic …
Distance measure: Euclidean distance
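As a sketch of how the two query types can be written: the slide's own formulas are not preserved, so the symbols below are assumptions, with r as the distance threshold, τ as the probability threshold, and dist as the Euclidean distance.

```latex
\text{Deterministic:}\quad \mathrm{dist}(S_u, S_v) \le r
\qquad\qquad
\text{Probabilistic:}\quad \Pr\big(\mathrm{dist}(\tilde{S}_u, \tilde{S}_v) \le r\big) \ge \tau
```

Under this form, raising τ makes the probabilistic query stricter: fewer series qualify, so false alarms decrease and misses increase.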
Trade-off between False Alarm and False Dismissal
• By adjusting the probability threshold τ, we can control the trade-off between false alarms and false dismissals.
• In sensor applications (temperature reading, vehicle speed detection), a false negative (not discovering speeding or equipment overheating) is LESS DESIRED.
• In mobile network applications (where location privacy is an important issue), a false negative (i.e., overprotecting) tends to be more ACCEPTABLE.
The Model of Uncertain Time Series
• A general model: the value of an uncertain series S̃u at time stamp t is a random variable with mean μu,t and deviation σu,t.
[Figure: S̃u plotted as data value against time, with the mean μu,t marked]
Uncertain Distance Computation
[Figure: S̃u and S̃v over the time interval from ts to te, with the gap Dt between them marked]
• What is the distribution of this random variable?
Central Limit Theorem
• The normalized form of the variable has a limiting cumulative distribution function that approaches that of a normal distribution.
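One way to write the normalized variable, as a sketch: it assumes the squared distance is a sum of per-timestamp terms Dt² that are independent across time, which is an assumption made here and not a formula recovered from the slide.

```latex
Z = \frac{\sum_{t=t_s}^{t_e} D_t^2 \;-\; E\!\left(\sum_{t=t_s}^{t_e} D_t^2\right)}
         {\sqrt{Var\!\left(\sum_{t=t_s}^{t_e} D_t^2\right)}},
\qquad
\Pr(Z \le z) \to \Phi(z) \ \text{ as the number of time stamps grows}
```

Here Φ is the standard normal cumulative distribution function, so only the mean and variance of the summed squared differences are needed to approximate probabilities about the distance.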
Expectation of the Uncertain Distance
• Var(X) = E(X²) − (E(X))²
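Applying this identity to a single time stamp gives the second moment of the difference. The worked step below is a sketch that assumes Dt is the difference of the two uncertain values at time t, that the two streams are independent, and that each value has mean μ and deviation σ as in the general model.

```latex
E(D_t^2) = Var(D_t) + \big(E(D_t)\big)^2
         = \sigma_{u,t}^2 + \sigma_{v,t}^2 + (\mu_{u,t} - \mu_{v,t})^2
```

Summing this over the time stamps from ts to te gives the expectation of the squared distance under the same assumptions.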
Variance of the Uncertain Distance (1/3)
• To compute […], what is the value of […]?
• Delta method … Taylor expansion
Variance of the Uncertain Distance (2/3)
• Therefore, […]
Variance of the Uncertain Distance (3/3)
• Now we can compute […], which is: […]
Candidate Selection
[Figure: cumulative distribution function F(x) of the normal distribution, x from -4 to 4. The threshold τ on the F(x) axis corresponds to r-limit on the x axis: F(rnorm) < τ when rnorm < r-limit, and F(rnorm) > τ when rnorm > r-limit.]
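The selection rule in the figure can be sketched in Python. This is a minimal illustration, not the PROUD implementation: the function names are hypothetical, the errors are assumed to be independent and normally distributed with time-invariant deviations, and the variance formula used is the exact one for a normal Dt rather than the delta-method derivation on the slides.

```python
from math import sqrt
from statistics import NormalDist


def squared_distance_moments(mu_u, mu_v, sigma_u, sigma_v):
    """Mean and variance of sum_t D_t^2, where D_t = S_u[t] - S_v[t].

    Assumption of this sketch: independent normal errors with
    time-invariant standard deviations sigma_u and sigma_v.
    """
    s2 = sigma_u ** 2 + sigma_v ** 2                 # Var(D_t)
    mean = 0.0
    var = 0.0
    for a, b in zip(mu_u, mu_v):
        d = a - b                                    # E(D_t)
        mean += d * d + s2                           # E(D_t^2)
        var += 2.0 * s2 * s2 + 4.0 * d * d * s2      # Var(D_t^2) for normal D_t
    return mean, var


def is_candidate(mu_u, mu_v, sigma_u, sigma_v, r, tau):
    """True if Pr(dist <= r) >= tau under the normal approximation.

    tau must lie strictly between 0 and 1.
    """
    mean, var = squared_distance_moments(mu_u, mu_v, sigma_u, sigma_v)
    if var == 0.0:
        # No uncertainty at all: fall back to the deterministic test.
        return mean <= r * r
    r_norm = (r * r - mean) / sqrt(var)              # normalized threshold
    r_limit = NormalDist().inv_cdf(tau)              # x with F(x) = tau
    return r_norm >= r_limit
```

For example, two series of length 100 with identical means and unit deviations have an expected squared distance of 200, so `is_candidate(..., r=20, tau=0.9)` accepts the pair while `is_candidate(..., r=10, tau=0.5)` rejects it.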
Progressive Pruning for Time-Invariant Uncertain Variances (1/2)
• When the variance of the uncertainty is time-invariant for each stream, i.e., […], then […].
• Progressive pruning for the general case is left as future work.
Progressive Pruning for Time-Invariant Uncertain Variances (2/2)
[Figure: S̃ref and S̃u over the time interval from ts to te, with the difference Dt marked; below it, the normal CDF F(x) with the threshold τ, r-limit, and rnorm marked]
• We can guarantee that rnorm is non-increasing during the updates of E(.) and Var(.).
Performance Study
• We compare PROUD with a deterministic method, referred to as Det.
• Settings: given σu, at each timestamp t we randomly draw a number from either a uniform or a normal distribution with mean = Su[t] and variance = σu, and use it as the uncertain value.
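The data generation step can be sketched as follows. The helper name `perturb` is hypothetical, and the sketch treats the given parameter as a standard deviation, which is an assumption about how the setting was parameterized.

```python
import random
from math import sqrt


def perturb(series, sigma, kind="normal", rng=None):
    """Return an uncertain copy of `series` with one noisy draw per timestamp.

    Each draw has mean series[t] and standard deviation `sigma`
    (assumption of this sketch).
    """
    rng = rng or random.Random()
    if kind == "normal":
        return [rng.gauss(x, sigma) for x in series]
    if kind == "uniform":
        # A uniform distribution on [x - h, x + h] has standard deviation h / sqrt(3).
        half_width = sqrt(3.0) * sigma
        return [rng.uniform(x - half_width, x + half_width) for x in series]
    raise ValueError("kind must be 'normal' or 'uniform'")
```

Passing a seeded generator, for example `perturb(series, 0.5, "uniform", random.Random(0))`, makes an experiment repeatable.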
False Alarm Ratio vs. Miss Ratio
[Figure: false alarm ratio against miss ratio for thresholds τ = 0.001, 0.01, 0.1, 0.5, and 0.9]
Miss Ratio at Different τ
[Figure: miss ratio against the value of τ]
Computation Cost at Different τ
Summary of PROUD
• Building on probability theory, PROUD handles similarity queries over uncertain (streaming) time series.
• We showed how candidates can be pruned progressively.
• The results show that PROUD provides a flexible trade-off between false alarms and misses through a single threshold, while maintaining a similar computation cost.
What Came After?
• [Aßfalg’09ssdbm]
  • Probabilistic queries
  • Multiple observations at one timestamp
• [Sarangi’10kdd]
  • Designs a new distance measure that converges to the Euclidean/DTW distance when the magnitude of the errors (uncertainty) is small.
  • The distribution of the error must be known in advance.
Our New Perspective
• Instead of modeling the uncertainty and the distance distribution, can we try to remove the uncertainty?
• This motivates our second work: MISQ [Wu et al., 2012].
  • Deterministic mean distance query
  • Does not rely on any knowledge of the distribution of the error
  • Requires only one observation at each time
  • Type I error is controlled
Errors in Measurement
• Measurement errors fall into two types:
  • Systematic errors: predictable; they can be removed by calibrating the measurement equipment.
  • Random errors: inherently unpredictable; they have zero expected value and are always present in a measurement reading.
Random Errors in Time Series
• Readings in time series may contain inherent random errors due to causes such as dynamic error, drift, noise, hysteresis, digitization error, and limited sampling frequency.
• Random errors may substantially affect the quality of time series analysis.
• Taking similarity search as an example, we develop MISQ, a statistical approach for random error reduction in time series analysis.
Should We Reduce Random Errors?
• We compare the 1NN classification error rates on the 20 real data sets in [1] for MISQ and for Euclidean distance without considering random errors.
• 9 wins, 7 ties, 4 losses: reducing random errors is beneficial in many cases!
[1] Keogh et al., the UCR Time Series Classification/Clustering homepage, http://www.cs.ucr.edu/~eamonn/time_series_data/
Modeling and Reducing Random Errors in Similarity Search on Time Series: The Main Concept
Time Series with Random Errors
Assumptions:
• The mean function μu is unobserved, smooth, and uniformly bounded.
• The scale of the random error is an unknown constant.
• The random error is i.i.d. with mean 0 and variance 1.
• We only have one observation value, Su(t), at each time t.
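These assumptions suggest an additive error model. One way to write it, as a sketch: the symbol names for the unknown constant and for the i.i.d. error term are assumptions here, since the slide's own formula is not preserved.

```latex
S_u(t) = \mu_u(t) + \sigma_u \, \varepsilon_u(t),
\qquad E\big(\varepsilon_u(t)\big) = 0, \quad Var\big(\varepsilon_u(t)\big) = 1
```

Under this form, each observation equals the smooth mean curve plus a zero-mean error whose variance is the square of the unknown constant.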
Mean Distance as Similarity Measure
• We want to compute the mean distance, that is, the distance with the effect of random errors excluded.
• All the time series are independent of each other.
Two Similarity Queries
Given a reference series SQ and a set of time series T, we retrieve those time series Su in T such that
• D(μQ, μu) = 0 for an exact match query
• D(μQ, μu) ≤ r for a threshold similarity query, where r > 0 is a user-given distance threshold.
Intuition for the Query Processing
• μQ and μu are unobserved in practice, so we cannot compute D(μQ, μu) directly.
• Our intuition:
  • Use only the observed values to estimate it.
  • Apply statistical hypothesis testing to determine whether a candidate time series qualifies for a query at a given confidence level.
The Hypothesis Testing Procedure
The null hypothesis of
• an exact match query: H0: D(μQ, μu) ≤ 0
• a threshold similarity query: H0: D(μQ, μu) ≤ r
• We use ≤ instead of = for the convenience of the testing.
The query retrieves those series for which the null hypothesis is not rejected.
The Statistical Errors
• Type I error: rejecting a true H0.
  • A low type I error rate implies a high recall.
  • We think this is the more important error in similarity search!
• Type II error: failing to reject a false H0.
  • A low type II error rate implies a high precision.
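The link between the two error types and recall and precision can be made concrete with a small sketch. The function `error_rates` and its arguments are hypothetical: a series is "truly similar" when H0 holds for it, and "retrieved" when the test does not reject H0.

```python
def error_rates(truly_similar, retrieved, all_ids):
    """Type I / type II error rates plus recall and precision for one query."""
    truly_similar = set(truly_similar)
    retrieved = set(retrieved)
    dissimilar = set(all_ids) - truly_similar

    missed = truly_similar - retrieved        # true H0 rejected (type I errors)
    false_hits = retrieved & dissimilar       # false H0 not rejected (type II errors)
    hits = retrieved & truly_similar

    type1 = len(missed) / len(truly_similar) if truly_similar else 0.0
    type2 = len(false_hits) / len(dissimilar) if dissimilar else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return {
        "type1": type1,
        "type2": type2,
        "recall": 1.0 - type1,                # recall is exactly 1 - type I rate
        "precision": precision,
    }
```

Recall is exactly one minus the type I error rate, whereas precision also depends on how many dissimilar series exist, so a low type II error rate supports high precision without fixing its value.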
Query Processing with Type I Error Controlled
Given a reference series SQ, a set of time series T, and a user-specified confidence level (1 − α) ∈ [0, 1], we retrieve all time series Su in T such that
• for an exact match query: LCI(D(μQ, μu)) ≤ 0
• for a threshold similarity query: LCI(D(μQ, μu)) ≤ r
where LCI(.) is the lower bound of the confidence interval of the estimated mean distance.
We can then keep the type I error rate no greater than α.
Mean Distance Estimator and Its Reliability
Mean Distance Estimator and Its Variance
• Parameter-free, difference-based estimator: […], where l is the length of a time series.
• Only the observation values are used!
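To illustrate what a parameter-free, difference-based estimator can look like, here is a minimal sketch. It is a generic construction and not necessarily the estimator defined in the paper: the function names are hypothetical, the distance is taken to be the squared Euclidean distance between the mean curves, and the error variance is estimated from successive differences, which relies on the mean curve being smooth.

```python
def noise_variance(series):
    """Difference-based estimate of the random-error variance.

    For a smooth mean curve, successive differences are dominated by
    noise, and each squared difference has expectation about 2 * variance.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    total = sum((b - a) ** 2 for a, b in zip(series, series[1:]))
    return total / (2.0 * (len(series) - 1))


def mean_distance_estimate(s_q, s_u):
    """Estimate the squared distance between the unobserved mean curves.

    The raw squared distance between observations is inflated by
    l * (var_q + var_u) in expectation, so that amount is subtracted.
    Uses only the observed values; the result may be slightly negative.
    """
    if len(s_q) != len(s_u):
        raise ValueError("series must have equal length")
    l = len(s_q)
    raw = sum((a - b) ** 2 for a, b in zip(s_q, s_u))
    return raw - l * (noise_variance(s_q) + noise_variance(s_u))
```

For two constant series at 1.0 and 3.0 of length 5, both variance estimates are zero and the estimate equals the plain squared distance, 20.0.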
Asymptotic Distribution of the Mean Distance Estimator
• The distribution of the mean distance estimator is asymptotically well approximated by a normal distribution.
Query Processing by Statistical Testing
Compute LCI(D(μQ, μu))
Suppose the time series length l is large enough. Given a reference series SQ, a set of time series T, and a user-specified confidence level (1 − α) ∈ [0, 1], we retrieve all time series Su in T such that
• for an exact match query: […]
• for a threshold similarity query: […]
Type I Error of an Exact Match Query Is Controlled
Theorem 3 in our paper.
• Exact match query: D(μQ, μu) = 0 (or ≤ 0)
• So the type I error rate is […]
Type I Error of a Threshold Similarity Query Is Controlled
Theorem 4 in our paper.
• Threshold similarity query: D(μQ, μu) ≤ r
• So the type I error rate is […]
Experiment Results
Settings
• Methods compared:
  • Confidence band (for exact match only): statistical testing on (μu − μQ) = 0.
    • Done by testing for no effect in nonparametric regression via kernel smoothing. Retrieve those series with p-value > α.
  • Moving average + the error control method of MISQ
    • Movavg_5: with a bandwidth of 5.
    • Movavg_cv: with a bandwidth determined by cross-validation, which minimizes the leave-one-out residual sum of squares.
• Datasets:
  • 20 real data sets in the UCR Time Series Classification/Clustering data repository. Keogh et al., http://www.cs.ucr.edu/~eamonn/time_series_data/
How We Do the Exact Match Query
• To test the type I error rate: take an original time series Su, blur it with noise of uncertainty ratio = r, and make 100 blurred time series.
• To test the type II error rate: […]
Results of Exact Match Queries
α = 0.05
• MISQ controlled the type I error rate better than the confidence band method.
• Movavg_5 and Movavg_cv barely controlled the type I error rate, although they had very small type II error rates.
• MISQ sacrificed the type II error rate to control the type I error rate, since a low type I error rate is more important in many applications.