Online Event-driven Subsequence Matching over Financial Data Streams Huanmei Wu, Betty Salzberg, Donghui Zhang SIGMOD 2004
Outline
• Introduction
• Motivation
• Data Stream Processing
• Subsequence Matching
• Trend Prediction
• Performance
• Conclusion
Introduction
• Subsequence matching tries to find subsequences from the large data sequences in the database that are similar to a given query sequence
• It is important in data mining
  • Trend prediction
  • Pattern recognition
  • Dynamic clustering of multiple data streams
  • Rule discovery
Motivation
• Existing time-series subsequence matching techniques focus on finding similarities between an online query subsequence and the sequences in a traditional database
Motivation (cont.)
[Figure: two subsequences S1 and S2 plotted as price over time, with end points labeled 1 to 5 and 1' to 5']
• Subsequence similarity over financial data streams has unique properties
  • Zigzag shape of the piecewise linear representation (PLR)
  • Relative position of end points is important
  • Price change (amplitude) is more important than time interval
Data Stream Processing (1) Aggregation
• Piecewise Linear Representation requires a unique value for each time interval
• Aggregation of the raw data
  • filters out the noise before further data processing
[Figure: aggregated data stream]
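The aggregation step can be illustrated with a minimal sketch. The slide does not say which aggregate function is used, so the mean per fixed-width interval below is an assumption, and the names `aggregate`, `ticks`, and `interval` are hypothetical.

```python
from typing import Dict, List, Tuple


def aggregate(ticks: List[Tuple[int, float]], interval: int) -> List[Tuple[int, float]]:
    """Collapse raw (time, price) ticks into one value per time interval.

    Assumption: the value of an interval is the mean of its ticks.
    Returns (interval_start_time, mean_price) pairs in time order.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    buckets: Dict[int, List[float]] = {}
    for t, price in ticks:
        # All ticks with the same t // interval fall into the same bucket.
        buckets.setdefault(t // interval, []).append(price)
    return [(key * interval, sum(prices) / len(prices))
            for key, prices in sorted(buckets.items())]


raw_ticks = [(0, 10.0), (1, 12.0), (5, 20.0), (6, 22.0), (7, 24.0)]
aggregated = aggregate(raw_ticks, interval=5)
```

Each interval then contributes exactly one point, which is the property the piecewise linear representation needs.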
Data Stream Processing (2) Smoothing
• Moving average
  • widely used in the financial market
  • X(i) is the value for period i = 1, 2, ..., n, where n is the number of periods
  • MAp(i) is the p-interval moving average time series, which assigns equal weight to every point in the averaging interval
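A minimal sketch of an equal-weight p-interval moving average, as described above. The exact formula on the slide is not recoverable here, so this assumes the common trailing definition (the mean of the most recent p values); the function name `moving_average` is hypothetical.

```python
from typing import List, Optional


def moving_average(x: List[float], p: int) -> List[Optional[float]]:
    """Trailing p-interval moving average with equal weights.

    Positions with fewer than p values available are reported as None.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    out: List[Optional[float]] = []
    running = 0.0
    for i, value in enumerate(x):
        running += value
        if i >= p:
            running -= x[i - p]  # drop the value that just left the window
        out.append(running / p if i >= p - 1 else None)
    return out


ma3 = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], p=3)
```

The running sum keeps the update cost constant per new point, which suits a stream where values arrive one at a time.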
Data Stream Processing (3) Smoothing
• Bollinger Band Percent (%b)
Data Stream Processing (4) Smoothing
• Bollinger Band Percent (%b)
  • %b is a normalized value of the real price between -1 and 2
[Figure: %b data stream]
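The %b formula itself did not survive on these slides. The sketch below therefore assumes the commonly used definition, %b = (price - lower band) / (upper band - lower band), with the bands placed k standard deviations around the p-interval moving average. The function name `percent_b`, the default k = 2, and the handling of a flat window are all assumptions, not the authors' stated method.

```python
import statistics
from typing import List, Optional


def percent_b(x: List[float], p: int, k: float = 2.0) -> List[Optional[float]]:
    """Bollinger Band Percent under the common definition (an assumption here).

    upper = MA + k * sd, lower = MA - k * sd over the trailing p values.
    Positions with fewer than p values available are reported as None.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    out: List[Optional[float]] = []
    for i in range(len(x)):
        if i < p - 1:
            out.append(None)
            continue
        window = x[i - p + 1:i + 1]
        mid = statistics.mean(window)
        sd = statistics.pstdev(window)
        if sd == 0:
            # Flat window: the bands coincide, so place the price on the
            # middle band (a convention chosen for this sketch).
            out.append(0.5)
            continue
        lower = mid - k * sd
        upper = mid + k * sd
        out.append((x[i] - lower) / (upper - lower))
    return out


pb = percent_b([1.0, 2.0, 3.0], p=3)
```

Under this definition a price on the lower band maps to 0 and a price on the upper band maps to 1; values outside the bands fall outside that interval and are not clamped.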
Data Stream Processing (5) Smoothing
• Segmentation over %b is more suitable than segmentation directly over the raw price data stream
  • %b has a smoothed moving trend similar to the price movement
  • %b is a normalized value of the real price between -1 and 2
  • Uniform segmentation criteria
  • %b is very sensitive to price changes and reflects them accurately without any delay
Data Stream Processing (6) Segmentation
• PLR may not be in a zigzag shape
• Segmentation finds the end points of the PLR, which are the points at which the trend changes dramatically
• All other points are considered noise and should be eliminated
Data Stream Processing (7) Segmentation over %b
[Figure: points 1 to 13 plotted as price (x) over time t inside a sliding window, with Pi and Pj marked]
• In the current sliding window, where Pj(Xj, tj) is the current point, Pi(Xi, ti) is an upper end point if
  • Xi = max(X values of the current sliding window)
  • Xi > Xj + the given error threshold
  • Pi(Xi, ti) is the last point satisfying the above two conditions
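The three conditions for an upper end point can be sketched as follows. The window is assumed to be a list of (X, t) pairs whose last element is the current point Pj; the names `find_upper_end_point`, `window`, and `threshold` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (X, t)


def find_upper_end_point(window: List[Point], threshold: float) -> Optional[int]:
    """Return the index of the upper end point in the window, or None.

    A point Pi qualifies if Xi is the window maximum and Xi exceeds the
    current value Xj by more than the threshold; among qualifying points
    the last one is chosen.
    """
    if not window:
        return None
    x_current = window[-1][0]
    x_max = max(x for x, _ in window)
    if x_max <= x_current + threshold:
        return None  # the drop from the maximum is not large enough yet
    # Scan backwards so that ties on the maximum resolve to the last point.
    for i in range(len(window) - 1, -1, -1):
        if window[i][0] == x_max:
            return i
    return None


window = [(0.2, 1), (0.9, 2), (0.9, 3), (0.5, 4)]
found = find_upper_end_point(window, threshold=0.3)
not_found = find_upper_end_point(window, threshold=0.5)
```

A lower end point would presumably be detected symmetrically with the minimum, but the slide only states the upper case.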
Data Stream Processing (8) Segmentation over %b
• Delay time
  • the difference between the actual time of an end point and the time when it is identified as an end point
• A smaller threshold reduces the delay time but results in a larger number of short line segments
  • some of which may still be noise
• A larger threshold decreases the number of line segments but lengthens the delay
  • some useful information will be filtered out
Data Stream Processing (9) Pruning
• The process of removing noise-like line segments
• Segmentation finds potential end points using a smaller threshold s
  • shorter delay time
• Noise introduced by the small s will be filtered out
Data Stream Processing (10) Online segmentation and pruning
• s: a smaller threshold for segmentation over %b
• pb: a larger threshold for pruning over %b
• pd: a threshold for pruning over raw stream data
Subsequence Similarity (1) Subsequence Permutation
• S = {(X1, t1), (X2, t2), …, (Xn, tn)}
• Separate upper and lower points
  S' = {[(X1, t1), (X3, t3), …, (Xn-1, tn-1)], [(X2, t2), (X4, t4), …, (Xn, tn)]}
• Sort separately based on X values
  S'' = {[(Xi1, ti1), (Xi3, ti3), …, (Xi(n-1), ti(n-1))], [(Xi2, ti2), (Xi4, ti4), …, (Xin, tin)]}
• Get the subsequence permutation
  {i1, i3, …, i(n-1), i2, i4, …, in}
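A minimal sketch of the permutation steps above. Following the slide's notation, odd positions (1, 3, ...) are treated as one group and even positions (2, 4, ...) as the other. The sort direction is not stated, so ascending order is an assumption, and the name `subsequence_permutation` is hypothetical.

```python
from typing import List


def subsequence_permutation(xs: List[float]) -> List[int]:
    """Return the permutation of 1-based end-point positions.

    Odd and even positions are sorted separately by X value (ascending,
    an assumption), and the odd group is listed first.
    """
    odd_positions = range(1, len(xs) + 1, 2)
    even_positions = range(2, len(xs) + 1, 2)

    def by_value(position: int) -> float:
        return xs[position - 1]

    return sorted(odd_positions, key=by_value) + sorted(even_positions, key=by_value)


perm = subsequence_permutation([5.0, 1.0, 7.0, 3.0, 6.0, 2.0])
scaled = subsequence_permutation([50.0, 10.0, 70.0, 30.0, 60.0, 20.0])
```

Two subsequences whose end points have the same relative ordering yield the same permutation regardless of absolute price level, which is why the scaled example matches the first.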
Subsequence Similarity (2) New similarity measure
• S1 = {(X1, t1), (X2, t2), …, (Xn, tn)}
• S2 = {(X1', t1'), (X2', t2'), …, (Xn', tn')}
• S1 and S2 are similar if they satisfy the following two conditions:
  • S1 and S2 have the same permutation
  • $d(S_1, S_2) < \epsilon$, where
    $d(S_1, S_2) = \sum_{i=1}^{n-1} \left( \alpha \, \bigl| |X_{i+1} - X_i| - |X'_{i+1} - X'_i| \bigr| + \beta \, \bigl| (t_{i+1} - t_i) - (t'_{i+1} - t'_i) \bigr| \right)$
    and $\alpha, \beta, \epsilon \ge 0$ are user-defined parameters
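The two similarity conditions can be sketched together. This assumes the distance sums the weighted amplitude and time-interval differences over consecutive end-point pairs, and that permutations are built by sorting odd and even positions separately in ascending order. The names `w_amp`, `w_time`, and `max_distance` stand in for the three user-defined parameters and are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (X, t)


def permutation(points: List[Point]) -> List[int]:
    """1-based positions, odd and even groups sorted separately by X."""
    def by_value(position: int) -> float:
        return points[position - 1][0]
    n = len(points)
    return (sorted(range(1, n + 1, 2), key=by_value)
            + sorted(range(2, n + 1, 2), key=by_value))


def distance(s1: List[Point], s2: List[Point], w_amp: float, w_time: float) -> float:
    """Weighted sum of amplitude and time-interval differences per segment."""
    if len(s1) != len(s2):
        raise ValueError("subsequences must have the same number of end points")
    total = 0.0
    for i in range(len(s1) - 1):
        amp1 = abs(s1[i + 1][0] - s1[i][0])
        amp2 = abs(s2[i + 1][0] - s2[i][0])
        dt1 = s1[i + 1][1] - s1[i][1]
        dt2 = s2[i + 1][1] - s2[i][1]
        total += w_amp * abs(amp1 - amp2) + w_time * abs(dt1 - dt2)
    return total


def is_similar(s1: List[Point], s2: List[Point],
               w_amp: float, w_time: float, max_distance: float) -> bool:
    """Both conditions: same permutation and distance under the threshold."""
    if len(s1) != len(s2) or permutation(s1) != permutation(s2):
        return False
    return distance(s1, s2, w_amp, w_time) < max_distance


s1 = [(5, 0), (1, 2), (7, 5), (3, 6)]
s2 = [(6, 0), (1, 3), (8, 5), (3, 7)]
s3 = [(5, 0), (1, 2), (4, 5), (3, 6)]  # different ordering of the odd points
d12 = distance(s1, s2, w_amp=1.0, w_time=0.5)
```

Setting `w_amp` larger than `w_time` reflects the earlier point that price change matters more than time interval.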
Subsequence Similarity (3) Special cases
• If a query subsequence has any pair of upper points (or lower points) whose distance is under a predefined threshold, the query subsequence is considered to have two permutations
• Subsequences of both possible permutations are searched
Subsequence Similarity (4) Event-driven subsequence matching
• An event means that a new potential end point has been identified and no pruning is needed
• The query subsequence consists of the most recent n fixed and potential end points
• The search algorithm finds subsequences in the historical data that are similar to the query subsequence
[Figure: price over time t (t5 to t40) with end points 1 to 4 marked]
Trend prediction (1) Definition of trend
Trend prediction (2) Prediction
• Subsequence similarity search returns a list of end points, each of which is the last end point of one retrieved subsequence
Performance (1) Similarity measure
[Figure: correctness (%), on an axis from 30 to 70, for the similarity measures Perm+Amp, Perm+Euc, Euc Only, Amp Only, and Perm Only]
Performance (2) Event-driven vs. fixed time periods
[Figure: relative CPU cost (0% to 100%) and correctness (%, on an axis from 30 to 70) for event-driven matching and for fixed time periods FT1, FT5, FT10, FT15, FT20, FT25, and FT30]
Conclusion
• Finding the trend of E by computing the distance between E and Ek may lose important information