Providing error guarantees for analytic query processing over large historical time series data with deterministic accuracy to support future predictions, abnormality detection, and correlation analysis in a distributed environment.
Deterministic Error Guarantees for Queries on Compressed Time Series • Chunbin Lin • Joint with Etienne Boursier, Jacque Brito, Korhan Demirkaya, Joshua Lapacik, Yannis Papakonstantinou
Motivation • Fast analytic query processing over historical time series is necessary • Future prediction • Abnormality detection • Similarity matching • Example: compute the correlation of the foreign-exchange rates CAD/JPY and AUD/JPY
Challenge • Historical time series are big • 1 billion data points for each forex (*1) • 8 TB operational data per day for each oil drilling rig (*2) • *1 https://pepperstone.com/en/client-resources/historical-tick-data • *2 https://wasabi.com/storage-solutions/internet-of-things/
Solutions • Distributed query processing on many machines • Approximate query processing on a single machine • Sampling methods: probabilistic error guarantees, e.g., the actual answer is within a given bound with 95% confidence • Our goal: deterministic error guarantees, e.g., the actual answer is within a given bound with 100% confidence
Data • Time series: a sequence of (timestamp, value) pairs • Assume queries involve time series with the same resolution • Omit timestamps • Example (Apple stock price) with timestamps: [(20170103931, 115.80), (20170103932, 115.90), (20170103933, 116.25), (20170103934, 116.30), (20170103935, 116.11), (20170103936, 116.15), (20170103937, 116.16), (20170103938, 116.06), (20170103939, 115.72), ...] • The same series with timestamps omitted: [115.80, 115.90, 116.25, 116.30, 116.11, 116.15, 116.16, 116.06, 115.72, ...]
Query • Time subseries operators • Arithmetic operators (+, −, ×, ÷, √) • E.g., 100+20, 100-20, 100*20, 100/20, …
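The operators above can be illustrated on timestamp-free series. This is a minimal sketch, assuming series are plain Python lists of equal resolution; the names `subseries`, `plus`, and `times` and the 1-based inclusive indexing are hypothetical and not taken from the slides.

```python
from typing import List

Series = List[float]

def subseries(t: Series, start: int, end: int) -> Series:
    """Return positions start..end (1-based, inclusive; an assumed convention)."""
    return t[start - 1:end]

def _check_same_length(t1: Series, t2: Series) -> None:
    # Arithmetic operators are applied position by position,
    # so both series must cover the same number of points.
    if len(t1) != len(t2):
        raise ValueError("series must have the same length")

def plus(t1: Series, t2: Series) -> Series:
    """Pointwise sum of two series."""
    _check_same_length(t1, t2)
    return [a + b for a, b in zip(t1, t2)]

def times(t1: Series, t2: Series) -> Series:
    """Pointwise product of two series."""
    _check_same_length(t1, t2)
    return [a * b for a, b in zip(t1, t2)]

prices = [115.80, 115.90, 116.25, 116.30]
window = subseries(prices, 2, 3)           # second and third points
summed = plus([1.0, 2.0], [3.0, 4.0])
product = times([1.0, 2.0], [3.0, 4.0])
```

Each operator returns a new series, so operator outputs can be fed into further operators or into an aggregate such as a sum.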
Query • Statistical queries • Covariance, Correlation, Cross-correlation, …… • Inputs are base time series or time series produced by time series operators
Segment list index • Index: a list of compressed time series segments • Example: Forex CAD/JPY (the Canadian Dollar and the Japanese Yen), each segment estimated by f(x) = a x + b • For each segment, we store: • Estimation function (minimizes Euclidean distance), e.g., the coefficients (a, b) • Error measures: • L2-norm of errors • Reconstruction error • L2-norm of estimated values
Segment list index • Estimation function families: no limitation on estimation functions • Polynomial function family • Exponential function family • Logarithmic function family • Logistic function family • Gaussian function family • Sin/cos function family
Segment list index • Error guarantees • L2-norm of errors • Reconstruction error • L2-norm of estimated values (depends on the data values) • (Figure: example segment with data values 3.0, 4.8, 5.4 and estimation function f(x) = 1.2x + 2.0)
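The three error measures can be computed for the slide's small example. This is a sketch under stated assumptions: the figure is read as three data values 3.0, 4.8, 5.4 at x = 1, 2, 3 (for which f(x) = 1.2x + 2.0 is the least-squares line), and the reconstruction error is taken to be the maximum absolute deviation. The exact definitions on the original slide were lost, and the function name `error_measures` is hypothetical.

```python
import math
from typing import Callable, Dict, List

def error_measures(values: List[float], f: Callable[[float], float]) -> Dict[str, float]:
    """Compute per-segment error measures for data values at x = 1..n."""
    estimated = [f(x) for x in range(1, len(values) + 1)]
    errors = [d - e for d, e in zip(values, estimated)]
    return {
        # Euclidean norm of the estimation errors
        "l2_errors": math.sqrt(sum(e * e for e in errors)),
        # Assumed definition: largest absolute deviation in the segment
        "reconstruction_error": max(abs(e) for e in errors),
        # Euclidean norm of the estimated values; grows with the data amplitude
        "l2_estimated": math.sqrt(sum(e * e for e in estimated)),
    }

measures = error_measures([3.0, 4.8, 5.4], lambda x: 1.2 * x + 2.0)
```

Only the last measure depends on the magnitude of the data values themselves; the first two depend on how well the function fits.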
Segment list index • Existing index building algorithms • Fixed-length segmentation (FL): controls segment size • Sliding-window segmentation (SW): controls reconstruction error • …… • Reference: E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In ICDM, pages 289–296, 2001.
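The two segmentation styles can be sketched as follows. This is a generic illustration, not necessarily the exact algorithm of Keogh et al.: it assumes linear estimation functions, takes the reconstruction error to be the maximum absolute deviation, and uses hypothetical names (`fixed_length`, `sliding_window`, `max_error`).

```python
from typing import List, Tuple

def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares line f(x) = a*x + b over x = 1..n; returns (a, b)."""
    n = len(values)
    if n == 1:
        return 0.0, values[0]
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(1, n + 1))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values, start=1))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def max_error(values: List[float]) -> float:
    """Maximum absolute deviation of the values from their fitted line."""
    a, b = fit_line(values)
    return max(abs(y - (a * x + b)) for x, y in enumerate(values, start=1))

def fixed_length(values: List[float], size: int) -> List[List[float]]:
    """FL: cut the series into consecutive segments of at most `size` points."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def sliding_window(values: List[float], max_err: float) -> List[List[float]]:
    """SW: grow a segment until its fit error would exceed `max_err`."""
    if not values:
        return []
    segments = []
    start = 0
    for end in range(1, len(values) + 1):
        # values[start:end] is the candidate segment; close the previous
        # prefix as soon as adding the newest point breaks the threshold.
        if end - start > 1 and max_error(values[start:end]) > max_err:
            segments.append(values[start:end - 1])
            start = end - 1
    segments.append(values[start:])
    return segments

series = [1.0, 2.0, 3.0, 4.0, 10.0, 9.0, 8.0, 7.0]
fl_segments = fixed_length(series, 3)
sw_segments = sliding_window(series, 0.1)
```

FL yields predictable, aligned boundaries across series, while SW places boundaries where the data changes shape, so two series segmented with SW generally end up misaligned.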
Offline precomputation phase – building indexes • We build a segment list index for each time series • We store an estimation function and error measures for each segment
Online query processing – providing deterministic error guarantees
Error guarantees • Actual error: the absolute difference between the true answer $R$ and the estimated answer $\hat{R}$, i.e., $\varepsilon = |R - \hat{R}|$ • Error guarantee: an upper bound $\hat{\varepsilon}$ of the actual error, i.e., $\varepsilon \le \hat{\varepsilon}$
Error guarantees • Providing the error guarantee for each Sum(T) is the key, where T is a base time series or a time series produced by time series operators • If we can provide an error guarantee for each Sum(Ti), then we are able to give the error guarantee for general queries
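One way to see why Sum(T) is the key is that a statistic such as correlation can be written purely in terms of sums over base series and operator-produced series. The sketch below uses the standard textbook identity for the Pearson correlation; the decomposition actually shown on the original slides is not recoverable, and the function names are hypothetical.

```python
import math
from typing import List

def times(t1: List[float], t2: List[float]) -> List[float]:
    """Pointwise product, i.e. a series produced by a time series operator."""
    return [a * b for a, b in zip(t1, t2)]

def correlation(t1: List[float], t2: List[float]) -> float:
    """Pearson correlation expressed only through Sum(...) terms."""
    n = len(t1)
    s1, s2 = sum(t1), sum(t2)        # Sum(T1), Sum(T2)
    s12 = sum(times(t1, t2))         # Sum(T1 x T2)
    s11 = sum(times(t1, t1))         # Sum(T1 x T1)
    s22 = sum(times(t2, t2))         # Sum(T2 x T2)
    cov = s12 / n - (s1 / n) * (s2 / n)
    var1 = s11 / n - (s1 / n) ** 2
    var2 = s22 / n - (s2 / n) ** 2
    return cov / math.sqrt(var1 * var2)

perfect = correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
inverse = correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Since every ingredient is a Sum over some series, bounding the error of each Sum is enough to bound the error of the whole expression.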
Query over single segment • Error guarantees for time series operators T1 T2
Query over single segment • Error guarantee of Sum(T1 x T2) • The cross terms between estimated values and estimation errors are 0 if the estimation function family forms a vector space (VS) • Vector space: a set that is closed under finite vector addition and scalar multiplication • The polynomial function family is a vector space
Query over single segment • Orthogonal projection property in VS: two terms of the error are equal to 0
Query over single segment • Error guarantee of Sum(T1 x T2) • Case 1: estimation function family is not VS • Case 2: estimation function family is VS
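The two cases can be checked numerically on one pair of aligned segments. The bounds below follow from writing each series as estimate plus error and applying the Cauchy-Schwarz inequality; treating them as the slide's formulas is an assumption, since the original expressions were lost. Linear least-squares fits over the same domain are used as a vector-space family, and all names are hypothetical.

```python
import math
from typing import List, Tuple

def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares line f(x) = a*x + b over x = 1..n (n >= 2)."""
    n = len(values)
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(1, n + 1))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values, start=1))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def decompose(values: List[float]) -> Tuple[List[float], List[float]]:
    """Split a segment into estimated values and estimation errors."""
    a, b = fit_line(values)
    est = [a * x + b for x in range(1, len(values) + 1)]
    err = [d - e for d, e in zip(values, est)]
    return est, err

def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))

def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

t1 = [3.0, 4.8, 5.4, 7.1]
t2 = [2.0, 2.9, 4.2, 4.8]
f1, e1 = decompose(t1)
f2, e2 = decompose(t2)

true_answer = dot(t1, t2)    # Sum(T1 x T2) on the raw data
estimate = dot(f1, f2)       # answer computed from estimation functions only
actual_error = abs(true_answer - estimate)

# General family: all three error terms must be bounded.
bound_any = norm(f1) * norm(e2) + norm(f2) * norm(e1) + norm(e1) * norm(e2)
# Vector-space family: the two cross terms vanish by orthogonal projection.
bound_vs = norm(e1) * norm(e2)
cross_terms = (dot(f1, e2), dot(f2, e1))
```

In the vector-space case the bound no longer involves the norms of the estimated values, which is why it is both smaller and independent of the data amplitude.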
Query over aligned segments • Aligned segments • All the segments are perfectly aligned • Error guarantees • Sum of the error guarantees of each segment pair
Query over aligned segments • (Figure: aligned segment lists of CAD/JPY and AUD/JPY)
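For aligned segment lists, the overall guarantee is the sum of per-pair guarantees. The sketch below assumes each stored segment carries the L2-norm of its errors and the L2-norm of its estimated values, and it reuses Cauchy-Schwarz style per-pair bounds for Sum(T1 x T2); the class `Segment`, the function `aligned_guarantee`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    l2_errors: float      # L2-norm of the estimation errors
    l2_estimated: float   # L2-norm of the estimated values

def pair_guarantee(s1: Segment, s2: Segment, vector_space: bool) -> float:
    """Bound on the error of Sum(T1 x T2) restricted to one segment pair."""
    if vector_space:
        # Cross terms vanish, only the error-error term remains.
        return s1.l2_errors * s2.l2_errors
    return (s1.l2_estimated * s2.l2_errors
            + s2.l2_estimated * s1.l2_errors
            + s1.l2_errors * s2.l2_errors)

def aligned_guarantee(segs1: List[Segment], segs2: List[Segment],
                      vector_space: bool) -> float:
    """Sum the per-pair guarantees over perfectly aligned segment lists."""
    if len(segs1) != len(segs2):
        raise ValueError("aligned lists must have the same number of segments")
    return sum(pair_guarantee(a, b, vector_space) for a, b in zip(segs1, segs2))

cad_jpy = [Segment(0.5, 10.0), Segment(0.2, 8.0)]
aud_jpy = [Segment(0.4, 6.0), Segment(0.1, 5.0)]
guarantee_vs = aligned_guarantee(cad_jpy, aud_jpy, vector_space=True)
guarantee_any = aligned_guarantee(cad_jpy, aud_jpy, vector_space=False)
```

Note that the guarantee is computed from the stored measures alone, without touching the raw data points.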
Query over misaligned segments • Misaligned segments • One segment overlaps with more than one segment of the other time series
Query over misaligned segments • Sum(T1 x T2) • Segment combination selection becomes an optimization problem • Objective: minimize the error guarantee
Query over misaligned segments • Segment combination selection • Intersection Strategy (IS): maximal number of segments • Optimal Strategy (OS): minimal error combination
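The Intersection Strategy can be pictured as cutting both segment lists at every boundary of either list, which yields the largest number of pieces. This is a sketch under the assumption that segments are given as inclusive (start, end) position ranges covering the same query range; `intersect_segments` is a hypothetical name. The Optimal Strategy is not sketched because its objective function did not survive in the slides.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, end) positions

def intersect_segments(segs1: List[Range], segs2: List[Range]) -> List[Range]:
    """Return the overlapping pieces of two sorted, non-overlapping segment lists."""
    pieces = []
    i = j = 0
    while i < len(segs1) and j < len(segs2):
        start = max(segs1[i][0], segs2[j][0])
        end = min(segs1[i][1], segs2[j][1])
        if start <= end:
            pieces.append((start, end))
        # Advance whichever segment finishes first; advance both on a tie.
        if segs1[i][1] < segs2[j][1]:
            i += 1
        elif segs2[j][1] < segs1[i][1]:
            j += 1
        else:
            i += 1
            j += 1
    return pieces

cad_jpy_segments = [(1, 4), (5, 10)]
aud_jpy_segments = [(1, 6), (7, 10)]
pieces = intersect_segments(cad_jpy_segments, aud_jpy_segments)
```

After this step every piece lies inside exactly one segment of each series, so the query can be evaluated piece by piece.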
Query over misaligned segments • Orthogonal projection property • Cannot be applied, because the segments are not aligned • Estimation function for a subsegment may not be in the family • Linear scalable family (LSF): the restriction of any function in LSF to a smaller domain is still a function in LSF • LSF is a superset of the polynomial function family (PF) • (Diagram: relationship among the families PF, LSF, VS, and ANY)
Query over misaligned segments • Sum(T1 x T2) • Case: estimation functions are in LSF
Error guarantee properties • Tightness • Given the same error measures, no smaller error guarantee holds for queries on all possible data • Amplitude-independence (AI) • The error guarantee does not use the amplitudes of the data values • E.g., changing from Celsius to Kelvin will not change the error guarantees
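The Celsius-to-Kelvin example can be reproduced with a linear fit. The sketch assumes linear least-squares estimation functions and made-up temperature readings; it shows that shifting the data leaves the L2-norm of the errors unchanged while the L2-norm of the estimated values grows, so a guarantee built only from error norms is amplitude-independent. The helper names are hypothetical.

```python
import math
from typing import List, Tuple

def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares line f(x) = a*x + b over x = 1..n (n >= 2)."""
    n = len(values)
    mean_x = (n + 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(1, n + 1))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values, start=1))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def norms(values: List[float]) -> Tuple[float, float]:
    """Return (L2-norm of estimated values, L2-norm of errors) for one segment."""
    a, b = fit_line(values)
    est = [a * x + b for x in range(1, len(values) + 1)]
    err = [d - e for d, e in zip(values, est)]
    l2 = lambda v: math.sqrt(sum(x * x for x in v))
    return l2(est), l2(err)

celsius = [20.1, 20.4, 21.0, 20.8, 21.5]   # illustrative readings
kelvin = [c + 273.15 for c in celsius]     # same readings, shifted unit

est_c, err_c = norms(celsius)
est_k, err_k = norms(kelvin)
```

A bound that multiplies error norms by norms of estimated values would change by more than an order of magnitude here, even though the fit quality is identical.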
Error guarantee properties • Table columns: function family; AI and Tight for queries on aligned segments; AI and Tight for queries on misaligned segments • Rows: Sum(T1 x T2) for the families ANY\VS, VS\LSF, and LSF; Sum(T1 + T2) and Sum(T1 - T2) for ANY
Error guarantee properties • Dichotomies of function families • Queries on aligned segments: VS is AI, ANY\VS is non-AI • Queries on misaligned segments: LSF is AI, ANY\LSF is non-AI
Experiments • Dataset
Experiments • Estimation functions [1] [2] [3] • [1] E. Keogh. Fast similarity search in the presence of longitudinal scaling in time series databases. In ICTAI, pages 578–584, 1997. • [2] M. Tobita. Combined logarithmic and exponential function model for fitting postseismic GNSS time series after 2011 Tohoku-oki earthquake. Earth, Planets and Space, 68(1):41, 2016. • [3] Z. Pan, Y. Hu, and B. Cao. Construction of smooth daily remote sensing time series data: a higher spatiotemporal resolution perspective. Open Geospatial Data, Software and Standards, 2(1):25, 2017.
Experiments • Segment list building algorithms • Fixed-length segmentation (FL) • Sliding-window segmentation (SW) • Queries • Correlation query • Cross-correlation query
Experiments • Error guarantees for queries on aligned time series • 20 correlation queries • Segment lists built with FL • Power of the orthogonal property • VS uses less space than ANY • VS uses 0.035% while ANY uses 0.06%
Experiments • Error guarantees for queries on misaligned time series • 20 correlation queries • Segment lists built with SW • (1) Effect of LSF (~100x) • (2) Effect of optimal segment combination selection (~10x)
Experiments • Aligned vs. misaligned • Fix the space, compare error guarantees for aligned and misaligned • With K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K • (1) Misaligned produces smaller true errors • (2) Misaligned produces smaller error guarantees (~3x for ANY)
Experiments • Aligned vs. misaligned • Fix the space, compare error guarantees for aligned and misaligned • With K segments in the misaligned case and N data points involved in the query, set the segment size in FL to N/K • (1) Misaligned produces smaller error guarantees (~8.2x for LSF)
Experiments • Index building time • Query processing time
Experiments • Comparison with a sampling method • Uniform random sampling scheme with a global seed • Sample size needed to provide the same error guarantees as VS, by confidence level • Sample size needed to provide the same error guarantees as ANY, by confidence level
Conclusion • Provide deterministic error guarantees for statistical queries over aligned segments and misaligned segments • Provide optimizations to reduce the error guarantees in both scenarios • Study the properties (AI and tightness) of the proposed error guarantees • Conduct experiments to evaluate the error guarantees
Future work Deterministic error guarantees for interactive analytic queries over compressed time series
Architecture • Build a segment tree index for each time series (offline) • A node refers to a compressed segment • For each segment, we store an estimation function and error measures • The tree may not be balanced • Navigate the trees to access the minimal number of nodes needed to get answers with error guarantees below a given threshold (online)
Segment tree index • One tree structure for each time series • A node refers to a compressed segment • For each segment, we store an estimation function and error measures • The tree may not be balanced • Segment tree building algorithms (*): • Top-down algorithm • Bottom-up method • Sliding-window approach • * Fu, Tak-chung. "A review on time series data mining." Engineering Applications of Artificial Intelligence 24.1 (2011): 164-181.
Query processing algorithm • Given a query and an error budget, access the minimal number of nodes needed to get an approximate answer whose error guarantee is below the error budget • Consider query = (Agg(Times(T1, T2)), 10%) • (Figure: segment trees of time series T1 and time series T2)
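The navigation idea can be illustrated on a single tree with a greedy refinement loop. This is only a sketch: it assumes each node carries a precomputed guarantee for its own range and that guarantees of disjoint ranges add up, it refines the node with the largest guarantee first, and it makes no claim of accessing the minimal number of nodes. The slides' actual algorithm and its handling of two trees are not recoverable here, and `Node` and `select_nodes` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    start: int
    end: int
    guarantee: float                      # assumed per-node error guarantee
    children: List["Node"] = field(default_factory=list)

def select_nodes(root: Node, budget: float) -> Tuple[List[Node], float]:
    """Refine the tree top-down until the summed guarantee fits the budget."""
    # Max-heap on guarantee (negated); the counter breaks ties so that
    # Node objects themselves are never compared.
    frontier = [(-root.guarantee, 0, root)]
    finished = []                         # leaves that cannot be refined further
    total = root.guarantee
    counter = 1
    while total > budget and frontier:
        neg_g, _, node = heapq.heappop(frontier)
        if not node.children:
            finished.append(node)
            continue
        total += neg_g                    # drop the parent's contribution
        for child in node.children:
            total += child.guarantee
            heapq.heappush(frontier, (-child.guarantee, counter, child))
            counter += 1
    chosen = finished + [n for _, _, n in frontier]
    return sorted(chosen, key=lambda n: n.start), total

tree = Node(1, 8, 4.0, [
    Node(1, 4, 0.5, [Node(1, 2, 0.1), Node(3, 4, 0.1)]),
    Node(5, 8, 2.0, [Node(5, 6, 0.3), Node(7, 8, 0.2)]),
])
nodes, total_guarantee = select_nodes(tree, budget=1.2)
ranges = [(n.start, n.end) for n in nodes]
```

In this toy tree the coarse left node is kept while only the right subtree is refined, which is the kind of unbalanced access pattern an error budget allows. If the budget cannot be met even at the leaves, the function returns the leaves and a total above the budget.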
Query processing algorithm • Performance-wise optimization: an incremental update segmentation algorithm that gives a guaranteed ratio compared with the optimal one • Space-wise optimization: avoid storing the estimation functions for the right nodes; the estimation function can be deduced from the parent node and the left sibling node via an inverted basis matrix • (Figure: only red nodes store estimation functions)
Thank you Q&A