BRAID: Stream Mining through Group Lag Correlations • Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos • SIGMOD 2005
Introduction • Lag correlations, for example: higher amounts of fluoride in water → fewer dental cavities some years later • Goal: monitor multiple numerical streams and determine which pairs are correlated with a lag, and the value of that lag
Introduction • Given k numerical sequences X1, …, Xk, report all pairs Xi and Xj such that Xi follows Xj with lag l
Introduction • This paper proposes BRAID, which handles data streams of semi-infinite length • Any-time processing, and fast • Nimble • Accurate • Small resource consumption
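Proposed method • Data stream X: {x1, …, xt, …, xn}, where xn is the most recent value • R(0): correlation of X and Y, which have the same length n and zero lag • Correlation coefficient ρ: $\rho = \frac{\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})}{\sqrt{\sum_{t=1}^{n}(x_t-\bar{x})^2}\,\sqrt{\sum_{t=1}^{n}(y_t-\bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of X and Y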
Proposed method • For lag l, consider only the common part of X and the shifted Y, which spans n−l time ticks
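Proposed method • R(l): correlation coefficient when X is delayed by l: $R(l) = \frac{\sum_{t=l+1}^{n}(x_t-\bar{x})(y_{t-l}-\bar{y})}{\sqrt{\sum_{t=l+1}^{n}(x_t-\bar{x})^2}\,\sqrt{\sum_{t=1}^{n-l}(y_t-\bar{y})^2}}$, where $\bar{x}$ is the mean of $x_{l+1},\dots,x_n$ and $\bar{y}$ is the mean of $y_1,\dots,y_{n-l}$ • Score at lag l: $\mathrm{score}(l) = |R(l)|$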
Proposed method • For large lags l ≈ n, the original and shifted time sequences have too few overlapping values to compute R(l) • Restrict the maximum lag m to n/2
Proposed method • Naive solution: • At time n, access all values of X and Y and compute R(l) for every lag l (= 0, 1, …) • Choose the earliest maximum score above the threshold r, or report no lag • The proposed solution is based on three major steps
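The naive solution above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `lag_correlation` and `naive_lag` are hypothetical, "earliest max score" is read here as the smallest lag that attains the global maximum score, and a constant segment is assumed to have correlation 0.

```python
import math

def lag_correlation(x, y, lag):
    """Pearson correlation of x[lag:] with y[:n-lag], i.e. x delayed by `lag`."""
    n = len(x)
    xs = x[lag:]        # x_{l+1}, ..., x_n
    ys = y[:n - lag]    # y_1, ..., y_{n-l}
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0      # assumption: a constant segment counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def naive_lag(x, y, threshold):
    """Smallest lag with the maximum score above `threshold`, or None."""
    n = len(x)
    best_lag, best_score = None, threshold
    for lag in range(n // 2 + 1):          # maximum lag restricted to n/2
        score = abs(lag_correlation(x, y, lag))
        if score > best_score:             # strict '>' keeps the earliest maximum
            best_lag, best_score = lag, score
    return best_lag
```

Every call rescans both sequences for every lag, which is the cost the rest of the method tries to avoid.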
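Proposed method • Sufficient statistics are needed so that R can be computed easily • Sx(l,n) = $\sum_{t=l+1}^{n} x_t$: sum of the values of X from tick l+1 to n • Sxx(l,n) = $\sum_{t=l+1}^{n} x_t^2$: sum of the squares of X from tick l+1 to n • Sxy(l) = $\sum_{t=l+1}^{n} x_t\,y_{t-l}$: sum of the products of X and the lagged Y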
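Proposed method • R(l) is obtained: $R(l) = \frac{S_{xy}(l) - \frac{S_x(l,n)\,S_y(0,n-l)}{n-l}}{\sqrt{\left(S_{xx}(l,n) - \frac{S_x(l,n)^2}{n-l}\right)\left(S_{yy}(0,n-l) - \frac{S_y(0,n-l)^2}{n-l}\right)}}$, where Sy(0,n−l) and Syy(0,n−l) are the corresponding sums for Y over ticks 1 to n−l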
Proposed method • R(l) can be estimated at any point in time by keeping track of only five sufficient statistics • It still takes linear time to compute the cross-correlation function between two sequences
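As a rough illustration of the five sufficient statistics, the sketch below maintains them incrementally for one fixed lag. The class name `LagStats` and its methods are hypothetical, and the buffer of the last `lag` values of Y is an implementation choice made for this example.

```python
import math
from collections import deque

class LagStats:
    """Running sums for the correlation of x_t with y_{t-lag} (one fixed lag)."""

    def __init__(self, lag):
        self.lag = lag
        self.pending_y = deque()   # last `lag` values of y, not yet paired with x
        self.count = 0             # number of overlapping pairs, n - lag
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x_t, y_t):
        """Consume one new tick from both streams in O(1) time."""
        self.pending_y.append(y_t)
        if len(self.pending_y) <= self.lag:
            return                          # x_t has no partner y_{t-lag} yet
        y_old = self.pending_y.popleft()    # this is y_{t-lag}
        self.count += 1
        self.sx += x_t
        self.sy += y_old
        self.sxx += x_t * x_t
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        """R(lag) computed from the five sums only."""
        m = self.count
        if m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx ** 2 / m
        var_y = self.syy - self.sy ** 2 / m
        if var_x <= 0 or var_y <= 0:
            return 0.0   # assumption: treat a constant segment as uncorrelated
        return cov / math.sqrt(var_x * var_y)
```

The buffer of `lag` pending values is what makes the space grow with the lag, which motivates the smoothing described later.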
Proposed method • Keep track of only a geometric progression of lag values: l = 0, 1, 2, …, 2^i, … • Only O(log n) values to keep track of, instead of the O(n) that the naive solution requires • The space required still grows linearly with the length n
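The geometric progression of lags can be enumerated as below. This is a minimal sketch; the function name `geometric_lags` is hypothetical, and the cap at n/2 follows the maximum-lag restriction stated earlier in the slides.

```python
def geometric_lags(n):
    """Lags 0, 1, 2, 4, ..., 2^i that do not exceed the maximum lag n // 2."""
    max_lag = n // 2
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2          # geometric progression with ratio 2
    return lags
```

For a sequence of length 64 this yields seven lags (0, 1, 2, 4, 8, 16, 32) instead of the 33 lags the naive solution would examine.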
Proposed method • To compute R(l) at any time, a sliding window of size l must be kept; with m = n/2 this needs O(n) space • Instead of operating only on the original time sequences, also compute smoothed versions of them by averaging over non-overlapping windows
Proposed method • Window size: a power of g = 2 • X: original time sequence • Ax^h: smoothed version with windows of length 2^h • Ax^0: the original sequence; Ax^1: consists of n/2 ticks; etc. • The sufficient statistics of Ax^h need to be computed only every 2^h time ticks • At time n, O(log n) levels are needed; sufficient statistics are computed for each level
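The smoothing step can be illustrated offline as follows. This sketch assumes the whole sequence is in memory, whereas a streaming version would emit one average every 2^h ticks; the function name `smooth` is hypothetical, and dropping an incomplete trailing window is an assumption made for this example.

```python
def smooth(x, h):
    """Averages of consecutive non-overlapping windows of length 2**h."""
    w = 2 ** h
    # An incomplete trailing window is dropped (assumption for this sketch).
    return [sum(x[i:i + w]) / w for i in range(0, len(x) - w + 1, w)]
```

With x = [1, 2, 3, 4, 5, 6, 7, 8], level 1 has four averages, level 2 has two, and level 3 has one, so each level halves the number of ticks.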
Proposed method • In contrast with small lags, the larger lags are probed sparsely • Use a cubic spline to interpolate the missing correlation coefficients
Proposed method • Ax^h(t): window average at time tick t for level h • Ax^0(t) ≡ xt
Proposed method • Sufficient statistics:
Enhanced BRAID • For two sequences of size ≈ 2^20, BRAID requires about 5 · log2(2^20) = 5 · 20 = 100 floating-point numbers, about 800 bytes • When more memory is available, probe more lags while still using O(log n) space • Use a mix of arithmetic and geometric probing
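The memory estimate on this slide can be reproduced with a short calculation. The function name `braid_numbers` is hypothetical, and 8 bytes per number (double precision) is an assumption that matches the 800-byte figure.

```python
def braid_numbers(n, stats_per_level=5):
    """Count of stored numbers: five sufficient statistics per smoothing level."""
    levels = (n - 1).bit_length()   # equals ceil(log2(n)) for n >= 1
    return stats_per_level * levels

numbers = braid_numbers(2 ** 20)    # 5 * 20
memory_bytes = numbers * 8          # assumption: 8-byte floating-point values
```

Doubling the sequence length adds only one level, that is, five more numbers.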
Enhanced BRAID • Basic BRAID uses only one window at each smoothing level • Proposal: use b > 1 windows per level, for example b = 4 • The algorithm is the same as before (b = 1), with the exception that the bottom level has 2b coefficients • When computing R(l), use a mixture of geometric and arithmetic progressions:
Enhanced BRAID • Example of enhanced BRAID with b = 4 • With b = 1, the enhanced algorithm is identical to the basic algorithm described earlier
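One possible reading of the mixed arithmetic and geometric probing is sketched below. The exact lag set is not recoverable from these slides, so the layout here is an assumption: the bottom level probes the 2b lags 0, …, 2b−1, and each level h ≥ 1 probes b lags spaced 2^h apart, starting at b·2^h. The function name `enhanced_lags` is hypothetical. The layout was chosen because it has 2b coefficients at the bottom level and reduces to 0, 1, 2, 4, 8, … when b = 1, as the slides state.

```python
def enhanced_lags(max_lag, b):
    """Assumed probe set: arithmetic steps within a level, geometric across levels."""
    lags = list(range(min(2 * b, max_lag + 1)))   # bottom level: 0 .. 2b-1
    h = 1
    while b * 2 ** h <= max_lag:
        step = 2 ** h                             # spacing doubles at each level
        for j in range(b, 2 * b):                 # b probes per level
            if j * step <= max_lag:
                lags.append(j * step)
        h += 1
    return lags
```

Under this assumption, b = 4 with a maximum lag of 32 probes 0 to 7, then 8, 10, 12, 14, then 16, 20, 24, 28, then 32.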
Conclusion • Proposed BRAID to detect lag correlations on streaming data • At any time • Low resource consumption • High accuracy