Deterministic Error Guarantees for Compressed Time Series Queries

Deterministic Error Guarantees for Queries on Compressed Time Series • Chunbin Lin • Joint with Etienne Boursier, Jacque Brito, Korhan Demirkaya, Joshua Lapacik, Yannis Papakonstantinou

Motivation • Fast analytic query processing over historical time series is necessary • Future prediction • Anomaly detection • Similarity matching • Example: compute the correlation of the foreign-exchange rates CAD/JPY and AUD/JPY

Challenge • Historical time series are big • 1 billion data points for each forex pair *1 • 8 TB of operational data per day for each oil drilling rig *2 • *1 https://pepperstone.com/en/client-resources/historical-tick-data • *2 https://wasabi.com/storage-solutions/internet-of-things/

Solutions • Distributed query processing on many machines • Approximate query processing on a single machine • Sampling methods: probabilistic error guarantees, e.g., the actual answer is within a given interval with 95% confidence • Our goal: deterministic error guarantees, e.g., the actual answer is within a given interval with 100% confidence

Data • Time series: a sequence of (timestamp, value) pairs • Assume queries involve time series with the same resolution • Omit timestamps • Example (Apple stock price): [(20170103931, 115.80), (20170103932, 115.90), (20170103933, 116.25), (20170103934, 116.30), (20170103935, 116.11), (20170103936, 116.15), (20170103937, 116.16), (20170103938, 116.06), (20170103939, 115.72), ...] is stored as [115.80, 115.90, 116.25, 116.30, 116.11, 116.15, 116.16, 116.06, 115.72, ...]

Query • Time subseries operators • Arithmetic operators (+, −, ×, ÷, √) • E.g., 100+20, 100-20, 100*20, 100/20, ...

Query • Statistical queries • Covariance, Correlation, Cross-correlation, ... • The inputs are base time series or time series produced by time series operators

Offline precomputation phase – building indexes

Segment list index • Index: a list of compressed time series segments • Example: Forex CAD/JPY (the Canadian Dollar against the Japanese Yen), with each segment approximated by f(x) = a x + b • For each segment, we store: • Estimation function (minimizes Euclidean distance), e.g., the coefficients (a, b) • Error measures • L2-norm of errors • Reconstruction error • L2-norm of estimated values
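The per-segment summary described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a linear estimation function fitted by ordinary least squares over positions 0..n-1, and the names `fit_linear` and `summarize_segment` are hypothetical. The reconstruction error is left out because the slide does not define it.

```python
import math

def fit_linear(values):
    """Least-squares fit of f(x) = a*x + b over positions x = 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    a = sxy / sxx if sxx else 0.0  # a single point has no slope
    b = mean_y - a * mean_x
    return a, b

def summarize_segment(values):
    """Return the coefficients and two of the error measures named on the slide."""
    a, b = fit_linear(values)
    estimated = [a * x + b for x in range(len(values))]
    errors = [y - e for y, e in zip(values, estimated)]
    return {
        "a": a,
        "b": b,
        "err_l2": math.sqrt(sum(e * e for e in errors)),     # L2-norm of errors
        "est_l2": math.sqrt(sum(e * e for e in estimated)),  # L2-norm of estimated values
    }

# First five values of the stock price example from the Data slide.
summary = summarize_segment([115.80, 115.90, 116.25, 116.30, 116.11])
```

Only the summary, not the raw values, would be kept in the index, which is where the compression comes from.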

Segment list index • Estimation function families: no limitation on estimation functions • Polynomial function family • Exponential function family • Logarithmic function family • Logistic function family • Gaussian function family • Sin/cos function family

Segment list index • Error guarantees • L2-norm of errors • Reconstruction error • L2-norm of estimated values: depends on the data values • (Figure: example segment fitted by f(x) = 1.2x + 2.0, with plotted values 5.4, 4.8, 3.0)

Segment list index • Existing index building algorithms • Fixed-length segmentation (FL): controls the segment size • Sliding-window segmentation (SW): controls the reconstruction error • ... • (Figure: FL and SW segmentations of CAD/JPY and AUD/JPY) • E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289–296, 2001.
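The two segmentation styles named above can be sketched as follows. This is a hedged illustration, not the cited algorithm verbatim. It assumes linear estimation functions and uses the L2-norm of the fit residuals as the error that the sliding window controls; the function names are hypothetical. Segments are half-open index ranges (start, end).

```python
import math

def linear_fit_error(values):
    """L2-norm of the residuals of a least-squares line over x = 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    a = sxy / sxx if sxx else 0.0
    b = mean_y - a * mean_x
    return math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in enumerate(values)))

def fixed_length_segments(values, size):
    """FL: cut every `size` points, so the segment size is controlled."""
    n = len(values)
    return [(s, min(s + size, n)) for s in range(0, n, size)]

def sliding_window_segments(values, max_err):
    """SW: grow each segment until adding one more point exceeds max_err."""
    segments, start, n = [], 0, len(values)
    while start < n:
        end = start + 1
        while end < n and linear_fit_error(values[start:end + 1]) <= max_err:
            end += 1
        segments.append((start, end))
        start = end
    return segments

series = [0.0, 1.0, 2.0, 3.0, 10.0, 9.0, 8.0, 7.0]
fl = fixed_length_segments(series, 3)
sw = sliding_window_segments(series, 1e-9)
```

On this toy series SW places its cut where the trend changes, while FL cuts at fixed positions regardless of the data. This is why two series segmented by SW generally end up with misaligned segments.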

Offline precomputation phase – building indexes • We build a segment list index for each time series • We store an estimation function and error measures for each segment

Online query processing – providing deterministic error guarantees

Error guarantees • Actual error: the absolute difference between the true answer $R$ and the estimated answer $\hat{R}$, i.e., $\varepsilon = |R - \hat{R}|$ • Error guarantee: an upper bound $\hat{\varepsilon}$ of the actual error, i.e., $\varepsilon \le \hat{\varepsilon}$

Error guarantees • Providing the error guarantee for each Sum(T) is the key, where T is a base time series or a time series produced by time series operators • If we can provide an error guarantee for each Sum(Ti), then we can give the error guarantee for general queries
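To see why Sum(T) is the key building block, note that a statistic such as the Pearson correlation can be written purely in terms of sums over base series and over series produced by the multiplication operator. The sketch below uses the standard textbook identity; the function names are hypothetical and the slide's own decomposition may differ in form.

```python
import math

def correlation_from_sums(n, s1, s2, s11, s22, s12):
    """Pearson correlation from Sum(T1), Sum(T2), Sum(T1 x T1), Sum(T2 x T2), Sum(T1 x T2)."""
    cov = s12 / n - (s1 / n) * (s2 / n)
    var1 = s11 / n - (s1 / n) ** 2
    var2 = s22 / n - (s2 / n) ** 2
    return cov / math.sqrt(var1 * var2)

def exact_sums(t1, t2):
    """Exact sums over raw data, used here only to check the identity."""
    return (
        len(t1),
        sum(t1),
        sum(t2),
        sum(x * x for x in t1),
        sum(y * y for y in t2),
        sum(x * y for x, y in zip(t1, t2)),
    )

r = correlation_from_sums(*exact_sums([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

In the approximate setting each of the five sums would be replaced by its estimate from the compressed segments, and the guarantees on those sums would be propagated through the arithmetic.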

Query over single segment • Error guarantees for time series operators applied to T1 and T2

Query over single segment • Error guarantee of Sum(T1 x T2): the cross terms are 0 if the estimation function family forms a vector space (VS) • Vector space: a set that is closed under finite vector addition and scalar multiplication • The polynomial function family is a vector space

Orthogonal projection property in VS

Query over single segment • Orthogonal projection property in VS: two terms of the error of Sum(T1 x T2) are each 0

Query over single segment • Error guarantee of Sum(T1 x T2), two cases: • Estimation function family is not VS • Estimation function family is VS
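The two cases can be illustrated numerically. Writing each series as T = f + e (estimated values plus errors), the error of Sum(T1 x T2) estimated by Sum(f1 x f2) is Sum(f1 x e2) + Sum(f2 x e1) + Sum(e1 x e2). By the Cauchy-Schwarz inequality each term is bounded by a product of L2-norms, and when both fits are least-squares projections onto the same vector space the first two terms vanish. The bound formulas in the sketch below follow from that argument and are an assumption about what the slide shows; the names are hypothetical.

```python
import math

def fitted(values):
    """Estimated values of a least-squares line over x = 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    a = sxy / sxx if sxx else 0.0
    b = mean_y - a * mean_x
    return [a * x + b for x in range(n)]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def product_sum_guarantee(est1, err1, est2, err2, vector_space):
    """Bound on |Sum(T1 x T2) - Sum(f1 x f2)| from the stored L2-norms."""
    bound = err1 * err2
    if not vector_space:  # cross terms cannot be dropped
        bound += est1 * err2 + est2 * err1
    return bound

t1 = [1.0, 2.5, 2.0, 4.5, 5.0]
t2 = [3.0, 1.0, 4.0, 1.0, 5.0]
f1, f2 = fitted(t1), fitted(t2)
e1 = [y - f for y, f in zip(t1, f1)]
e2 = [y - f for y, f in zip(t2, f2)]

cross_terms = (dot(f1, e2), dot(f2, e1))  # both vanish up to rounding
actual = abs(dot(t1, t2) - dot(f1, f2))
g_vs = product_sum_guarantee(norm(f1), norm(e1), norm(f2), norm(e2), True)
g_any = product_sum_guarantee(norm(f1), norm(e1), norm(f2), norm(e2), False)
```

The VS bound depends only on the error norms, while the general bound also involves the norms of the estimated values and is therefore much looser on data with large amplitudes.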

Query over aligned segments • Aligned segments: all the segments are perfectly aligned • Error guarantee: the sum of the error guarantees of each segment pair • (Figure: aligned segments of CAD/JPY and AUD/JPY)

Query over aligned segments • (Figure: CAD/JPY and AUD/JPY)

Query over misaligned segments • Misaligned segments: one segment overlaps with more than one segment of the other series • (Figure: misaligned segments of CAD/JPY and AUD/JPY)

Query over misaligned segments • Sum(T1 x T2) • Segment combination selection becomes an optimization problem • Minimize the resulting error guarantee

Query over misaligned segments • Segment combination selection • Intersection Strategy (IS): maximal number of segments • Optimal Strategy (OS): minimal error combination
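One plausible reading of the Intersection Strategy is to cut both segment lists at every boundary that appears in either list, which yields the maximal number of (now aligned) subsegments. The sketch below shows only that boundary computation. It is an interpretation of the slide, the function name is hypothetical, and it assumes both lists cover the same half-open index range.

```python
def intersect_segments(segments1, segments2):
    """Split at every boundary of either list; segments are (start, end) half-open."""
    cuts = set()
    for start, end in list(segments1) + list(segments2):
        cuts.add(start)
        cuts.add(end)
    ordered = sorted(cuts)
    return list(zip(ordered, ordered[1:]))

cad_jpy = [(0, 4), (4, 10)]
aud_jpy = [(0, 6), (6, 10)]
pairs = intersect_segments(cad_jpy, aud_jpy)  # [(0, 4), (4, 6), (6, 10)]
```

The Optimal Strategy would instead search over the ways of grouping these pieces for the combination whose summed guarantee is smallest; that search is not sketched here.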

Query over misaligned segments • Orthogonal projection property: cannot be applied because the segments are not aligned • The estimation function for a subsegment may not be in the family • Linear scalable family (LSF): the restriction of any function in LSF to a smaller domain is still a function in LSF • LSF is a superset of the polynomial function family (PF) • (Figure: diagram of the function families PF, LSF, VS, and ANY)

Query over misaligned segments • Sum(T1 x T2) • Case: the estimation functions are in LSF

Error guarantee properties • Tightness: given the same error measures, no other valid error guarantee is smaller for queries on all the data • Amplitude-independence (AI): the error guarantee does not use the amplitudes of the data, e.g., changing from Celsius to Kelvin will not change the error guarantee
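The Celsius-to-Kelvin example can be checked directly. In the sketch below, which assumes a least-squares linear fit and uses hypothetical names, shifting every value by 273.15 leaves the L2-norm of the errors unchanged but inflates the L2-norm of the estimated values. A guarantee built only from error norms is therefore amplitude-independent, while one that also uses the norm of the estimated values is not.

```python
import math

def error_measures(values):
    """Return (L2-norm of errors, L2-norm of estimated values) for a linear fit."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    a = sxy / sxx if sxx else 0.0
    b = mean_y - a * mean_x
    estimated = [a * x + b for x in range(n)]
    err_l2 = math.sqrt(sum((y - f) ** 2 for y, f in zip(values, estimated)))
    est_l2 = math.sqrt(sum(f * f for f in estimated))
    return err_l2, est_l2

celsius = [20.0, 21.5, 21.0, 23.0, 22.5]
kelvin = [c + 273.15 for c in celsius]

err_c, est_c = error_measures(celsius)
err_k, est_k = error_measures(kelvin)
```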

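Error guarantee properties • Table: for each query and function family, whether the error guarantee is AI and tight • Columns: AI and Tight for queries on aligned segments; AI and Tight for queries on misaligned segments • Rows: Sum(T1 x T2) with function families ANY\VS, VS\LSF, and LSF; Sum(T1 + T2) and Sum(T1 - T2) with function family ANY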

Error guarantee properties • Dichotomies of function families • Queries on aligned segments: VS is AI, ANY\VS is non-AI • Queries on misaligned segments: LSF is AI, ANY\LSF is non-AI

Experiments • Dataset

Experiments • Estimation functions [1] [2] [3] • [1] E. Keogh. Fast similarity search in the presence of longitudinal scaling in time series databases. In ICTAI, pages 578–584, 1997. • [2] M. Tobita. Combined logarithmic and exponential function model for fitting postseismic GNSS time series after 2011 Tohoku-oki earthquake. Earth, Planets and Space, 68(1):41, 2016. • [3] Z. Pan, Y. Hu, and B. Cao. Construction of smooth daily remote sensing time series data: a higher spatiotemporal resolution perspective. Open Geospatial Data, Software and Standards, 2(1):25, 2017.

Experiments • Segment list building algorithms • Fixed-length segmentation (FL) • Sliding-window segmentation (SW) • Queries • Correlation query • Cross-correlation query

Experiments • Error guarantees for queries on aligned time series • 20 correlation queries • Segment lists built with FL • Power of the orthogonal projection property: VS uses less space than ANY • VS uses 0.035% while ANY uses 0.06%

Experiments • Error guarantees for queries on misaligned time series • 20 correlation queries • Segment lists built with SW • (1) Effect of LSF (~100x) • (2) Effect of optimal segment combination selection (~10x)

Experiments • Aligned vs. misaligned • With the space fixed, compare the error guarantees for aligned and misaligned segments • With K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K • (1) Misaligned produces smaller true errors • (2) Misaligned produces smaller error guarantees (~3x for ANY)

Experiments • Aligned vs. misaligned • With the space fixed, compare the error guarantees for aligned and misaligned segments • With K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K • (1) Misaligned produces smaller error guarantees (~8.2x for LSF)

Experiments • Index building time • Query processing time

Experiments • Comparison with a sampling method • Uniform random sampling scheme with a global seed • (Figure: sample size needed to provide the same error guarantees as VS and as ANY, plotted against confidence)

Conclusion • Provide deterministic error guarantees for statistical queries over aligned segments and misaligned segments • Provide optimizations to reduce the error guarantees in both scenarios • Study the properties (AI and tightness) of the proposed error guarantees • Conduct experiments to evaluate the error guarantees

Future work • Deterministic error guarantees for interactive analytic queries over compressed time series

Architecture • Build a segment tree index for each time series (offline) • A node refers to a compressed segment • For each segment, we store the estimation function and error measures • The tree may not be balanced • Navigate the trees to access the minimal number of nodes needed to get answers with error guarantees below a given threshold (online)

Segment tree index • One tree structure for each time series • A node refers to a compressed segment • For each segment, we store the estimation function and error measures • The tree may not be balanced • Segment tree building algorithms*: • Top-down algorithm • Bottom-up method • Sliding-window approach • *Fu, Tak-chung. "A review on time series data mining." Engineering Applications of Artificial Intelligence 24.1 (2011): 164-181.

Query processing algorithm • Given a query and an error budget, access the minimal number of nodes needed to get approximate answers with error guarantees below the error budget • Consider query = (Agg(Times(T1, T2)), 10%) • (Figure: segment trees of time series T1 and time series T2)
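The navigation idea can be sketched for a single tree as a greedy refinement: start from the root and repeatedly replace the frontier node with the largest guarantee by its children until the summed guarantee fits the budget. This is an illustrative heuristic under stated assumptions, not the authors' algorithm. It does not promise the minimal number of nodes, it uses an absolute budget rather than the 10% relative budget of the example, and `Node` and `select_nodes` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    guarantee: float                      # error guarantee of this compressed segment
    children: list = field(default_factory=list)

def select_nodes(root, budget):
    """Return (frontier nodes, summed guarantee) after greedy refinement."""
    tie = count()                         # tie-breaker so nodes are never compared
    heap = [(-root.guarantee, next(tie), root)]
    leaves = []                           # popped nodes that cannot be refined
    total = root.guarantee
    while total > budget and heap:
        neg_g, _, node = heapq.heappop(heap)
        if not node.children:
            leaves.append(node)           # stays in the answer as-is
            continue
        total += neg_g                    # drop the parent's guarantee
        for child in node.children:
            total += child.guarantee
            heapq.heappush(heap, (-child.guarantee, next(tie), child))
    frontier = leaves + [n for _, _, n in heap]
    return frontier, total

tree = Node(10.0, [Node(4.0), Node(5.0, [Node(1.0), Node(1.0)])])
frontier, total = select_nodes(tree, budget=7.0)
```

A query over two series, such as Agg(Times(T1, T2)), would navigate both trees together and would also have to handle misaligned frontier segments.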

Query processing algorithm • Performance-wise optimization: an incremental-update segmentation algorithm that gives a ratio compared with the optimal one • Space-wise optimization: avoid storing the estimation functions for the right nodes; the estimation function can be deduced from the parent node and the left sibling node via an inverse basis matrix • (Figure: only red nodes store estimation functions)

Thank you Q&A
