Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects. Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, and Alberto Lerner {shasha,yunyue, xiaojian, zhihua, lerner}@cs.nyu.edu Courant Institute, New York University.
Goal of this work • Time series are important in so many applications – biology, medicine, finance, music, physics, … • A few fundamental operations occur all the time: burst detection, correlation, pattern matching. • Do them fast to make data exploration faster, real time, and more fun.
Sample Needs • Pairs Trading in Finance: find two stocks that track one another closely. When they go out of correlation, buy one and sell the other. • Match a person’s humming against a database of songs to help him/her buy a song. • Find bursts of activity even when you don’t know the window size over which to measure. • Query and manipulate ordered data.
Why Speed Is Important • Person on the street: “As processors speed up, algorithmic efficiency no longer matters.” • True if problem sizes stay the same. • They don’t. As processors speed up, sensors improve, e.g. satellites spew out a terabyte a day, magnetic resonance imagers give higher-resolution images, etc. • Users also want real-time response to queries.
Surprise, surprise • More data, real-time response, increasing importance of correlation IMPLIES Efficient algorithms and data management more important than ever!
Corollary • Important area, lots of new problems. • Small advertisement: High Performance Discovery in Time Series (Springer 2004). At this conference.
Outline • Correlation across thousands of time series • Query by humming: correlation + shifting • Burst detection: when you don’t know window size • AQuery: a query language for time series.
Real-time Correlation Across Thousands (and scaling) of Time Series
Scalable Methods for Correlation • Compress streaming data into moving synopses. • Update the synopses in constant time. • Compare synopses in near linear time with respect to number of time series. • Use transforms + simple data structures. (Avoid curse of dimensionality.)
GEMINI framework* * Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In proceedings of the ACM SIGMOD Int'l Conference on Management of Data. Minneapolis, MN, May 25-27. pp 419-429.
StatStream (VLDB 2002): Example • Stock price streams • The New York Stock Exchange (NYSE) • 50,000 securities (streams); 100,000 ticks (trade and quote) • Pairs Trading, a.k.a. Correlation Trading • Query: “Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?” • “XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …”
Online Detection of High Correlation • Given tens of thousands of high-speed time series data streams, detect high-value correlation, both synchronized and time-lagged, over sliding windows in real time. • Real time • high update frequency of the data stream • fixed response time, online
StatStream: Naïve Approach • Goal: find most highly correlated stream pairs over sliding windows • N : number of streams • w : size of sliding window • space O(N) and time O(N^2 w) . • Suppose that the streams are updated every second. • With a Pentium 4 PC, the exact computing method can monitor only 700 streams, where each result is produced with a separation of two minutes. • Note: “Punctuated result model” – not continuous, but online.
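To make the cost of the exact method concrete, here is a minimal sketch of the all-pairs computation: every pair of streams is compared, and each comparison scans the whole window. The function names, the sample windows, and the 0.9 threshold default are illustrative assumptions, not taken from StatStream.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length windows (0.0 if either is constant)."""
    w = len(x)
    mx, my = sum(x) / w, sum(y) / w
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def naive_correlated_pairs(windows, threshold=0.9):
    """Check every pair of streams: O(N^2) pairs with O(w) work per pair."""
    return [(i, j) for i, j in combinations(range(len(windows)), 2)
            if pearson(windows[i], windows[j]) > threshold]

# Hypothetical sliding windows for three streams.
windows = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 4.0, 6.2, 8.1, 9.9],  # tracks stream 0 closely
    [5.0, 3.0, 4.0, 1.0, 2.0],  # moves against stream 0
]
print(naive_correlated_pairs(windows))
```

Doubling the number of streams quadruples the number of pairs, which is why the exact method stops scaling well before tens of thousands of streams.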
StatStream: Our Approach • Use the Discrete Fourier Transform to approximate correlation, as in the GEMINI approach. • Every two minutes (“basic window size”), update the DFT for each time series over the last hour (“window size”) • Use a grid structure to filter out unlikely pairs • Our approach can report highly correlated pairs among 10,000 streams for the last hour with a delay of 2 minutes. So, at 2:02 PM, find highly correlated pairs between 1:00 and 2:00 PM. At 2:04 PM, find highly correlated pairs between 1:02 and 2:02 PM, etc.
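The link between DFT coefficients and correlation can be sketched as follows. For two windows normalized to zero mean and unit norm, correlation equals 1 - d^2/2, where d is their Euclidean distance, and a unitary DFT preserves that distance. Keeping only the first few coefficients can only underestimate d, so the resulting correlation estimate is an upper bound that never misses a truly correlated pair. This is an illustration of the idea under those assumptions, not the StatStream code; the function names are hypothetical.

```python
import cmath
import math

def normalize(x):
    """Shift to zero mean and scale to unit Euclidean norm (x must not be constant)."""
    m = sum(x) / len(x)
    centered = [v - m for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def dft(x):
    """Unitary DFT (scaled by 1/sqrt(w)), so Euclidean distances are preserved."""
    w = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / w) for t in range(w)) / math.sqrt(w)
            for k in range(w)]

def correlation_upper_bound(x, y, n):
    """Upper bound on corr(x, y) using DFT coefficients 1..n only (requires n < w/2)."""
    X, Y = dft(normalize(x)), dft(normalize(y))
    # Coefficient 0 vanishes after centering. For real input, coefficient k has a
    # conjugate mirror at w - k with the same magnitude, hence the factor 2.
    partial_d2 = 2 * sum(abs(X[k] - Y[k]) ** 2 for k in range(1, n + 1))
    return 1 - partial_d2 / 2

x = [math.sin(0.4 * t) for t in range(16)]
y = [math.sin(0.4 * t + 0.3) + 0.1 * ((t * 7) % 5) for t in range(16)]
true_corr = sum(a * b for a, b in zip(normalize(x), normalize(y)))
print(true_corr, correlation_upper_bound(x, y, 3))
```

A pair whose upper bound falls below the threshold can be discarded without ever touching the raw data; the surviving pairs still need an exact check.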
StatStream: Stream synoptic data structure • Three-level time interval hierarchy • Time point, Basic window, Sliding window • Basic window (the key to our technique) • The computation for basic window i must finish by the end of basic window i+1 • The basic window time is the system response time. • Digests • Basic window digests: sum, DFT coefs • Sliding window digests: sum, DFT coefs • [Figure: a stream divided into time points, basic windows, and a sliding window, with digests attached to each basic window and to the sliding window]
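The basic-window idea can be illustrated with the simplest digest, the sum. The sketch below keeps one sum per completed basic window and updates the sliding-window total only when a basic window closes, which gives the punctuated (not continuous) result model described earlier. The class name and interface are assumptions made for this example; a DFT digest would be maintained along the same lines.

```python
from collections import deque

class SlidingSum:
    """Sliding-window sum maintained from per-basic-window digests."""

    def __init__(self, basic_size, num_basic):
        self.basic_size = basic_size
        self.digests = deque(maxlen=num_basic)  # one sum per completed basic window
        self.current = []                        # points of the basic window in progress
        self.total = 0.0                         # sum over the digests currently held

    def add(self, value):
        """Buffer a point; return the sliding sum when a basic window completes, else None."""
        self.current.append(value)
        if len(self.current) < self.basic_size:
            return None
        digest = sum(self.current)
        self.current = []
        if len(self.digests) == self.digests.maxlen:
            self.total -= self.digests[0]        # expire the oldest basic window
        self.digests.append(digest)              # deque drops the oldest entry itself
        self.total += digest
        return self.total

s = SlidingSum(basic_size=2, num_basic=3)
results = [r for r in (s.add(v) for v in range(1, 11)) if r is not None]
print(results)
```

Each update touches only the newest and the oldest digest, so the cost per basic window does not depend on the sliding window length.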
How general technique is applied • Compress streaming data into moving synopses: Discrete Fourier Transform. • Update the synopses in time proportional to number of coefficients: basic window idea. • Compare synopses in real time: compare DFTs. • Use transforms + simple data structures (grid structure).
Synchronized Correlation Uses Basic Windows • Inner product of aligned basic windows • The inner product within a sliding window is the sum of the inner products over all the basic windows in the sliding window. • [Figure: streams x and y, each divided into basic windows within a sliding window]
Approximate Synchronized Correlation • Approximate with an orthogonal function family (e.g. DFT) • Inner product of the time series ≈ inner product of the digests • The time and space complexity is reduced from O(b) to O(n). • b : size of basic window • n : size of the digests (n << b) • e.g. 120 time points reduce to 4 digests • [Figure: time series x1 … x8 and y1 … y8 approximated with basis functions f1, f2, f3]
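As a rough illustration of replacing an inner product over b points with one over n digest entries, the sketch below uses the first n coefficients of a unitary DFT as the digest. By Parseval's relation the full set of coefficients reproduces the inner product exactly, so a truncated set approximates it when most of the energy sits in low frequencies. The function names and the test signals are assumptions for this example.

```python
import cmath
import math

def digest(x, n):
    """First n coefficients (k = 0 .. n-1) of the unitary DFT of a real window x."""
    b = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / b) for t in range(b)) / math.sqrt(b)
            for k in range(n)]

def approx_inner_product(dx, dy):
    """Approximate <x, y> from two digests of equal length n, assuming n <= b/2."""
    # k = 0 appears once; every k >= 1 has a conjugate mirror at b - k, hence the factor 2.
    total = (dx[0] * dy[0].conjugate()).real
    total += 2 * sum((a * c.conjugate()).real for a, c in zip(dx[1:], dy[1:]))
    return total

b = 32
x = [2 + math.sin(2 * math.pi * t / b) for t in range(b)]
y = [1 + math.cos(2 * math.pi * t / b) for t in range(b)]
exact = sum(a * c for a, c in zip(x, y))
approx = approx_inner_product(digest(x, 2), digest(y, 2))
print(exact, approx)
```

These two signals contain only the constant and the lowest frequency, so two coefficients already recover the inner product; noisier data would leave a residual error that shrinks as n grows.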
Approximate Lagged Correlation • Inner product with unaligned windows • The time complexity is reduced from O(b) to O(n^2), as opposed to O(n) for synchronized correlation. Reason: terms for different frequencies are non-zero in the lagged case. • [Figure: two sliding windows offset by a lag]
Grid Structure (to avoid checking all pairs) • The DFT coefficients yield a vector. • High correlation => closeness in the vector space • We can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs.
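A minimal sketch of the grid filter, assuming low-dimensional feature vectors (for example a few DFT coefficients) and a cell side equal to the distance threshold: any two vectors within that distance must land in the same or an adjacent cell, so scanning neighboring cells yields a superset of the close pairs. The function name and the sample vectors are hypothetical.

```python
from collections import defaultdict
from itertools import product

def grid_candidates(vectors, cell):
    """Return index pairs (i, j), i < j, whose vectors share a grid cell or sit in adjacent cells."""
    grid = defaultdict(list)
    for idx, v in enumerate(vectors):
        grid[tuple(int(c // cell) for c in v)].append(idx)
    pairs = set()
    for key, members in grid.items():
        # Visit the cell itself and its 3^d - 1 neighbors.
        for offset in product((-1, 0, 1), repeat=len(key)):
            neighbor = tuple(k + o for k, o in zip(key, offset))
            for i in members:
                for j in grid.get(neighbor, ()):
                    if i < j:
                        pairs.add((i, j))
    return pairs

vectors = [(0.10, 0.10), (0.15, 0.12), (0.90, 0.90), (0.45, 0.10)]
print(grid_candidates(vectors, cell=0.2))
```

The number of neighboring cells grows as 3^d, which is one reason to keep the number of coefficients used for the grid small. Candidates still have to be verified, since adjacent cells can hold vectors up to two cell widths apart per dimension.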
Empirical Study : Speed Our algorithm is parallelizable.
Empirical Study: Accuracy • Approximation errors • Larger size of digests, larger size of sliding window and smaller size of basic window give better approximation • The approximation errors (mistake in correlation coef) are small.
Sketches : Random Projection* • Correlation between time series of stock returns • Since most stock price time series are close to random walks, their return time series are close to white noise. • DFT/DWT can’t capture approximate white noise series because the energy is distributed across many frequency components. • Solution : Sketches (a form of random landmark) • Sketch pool: list of random vectors drawn from a stable distribution • Sketch : the list of inner products of a data vector with the vectors in the sketch pool • The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee. * W. B. Johnson and J. Lindenstrauss. “Extensions of Lipschitz mappings into a Hilbert space”. Contemp. Math., 26:189-206, 1984. * D. Achlioptas. “Database-friendly random projections”. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press, 2001.
Sketches : Intuition • You are walking in a sparse forest and you are lost. • You have an old-time cell phone without GPS. • You want to know whether you are close to your friend. • You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak etc. • If your friend is at similar distances from several of these landmarks, you might be close to one another. • The sketch is just the set of distances.
Sketches : Random Projection • [Figure: the inner product of the raw time series with each random vector produces the sketches]
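A sketch pool and the sketches built from it can be illustrated in a few lines. The example assumes Gaussian entries (one choice of stable distribution) and a 1/sqrt(k) scaling so that sketch distances are directly comparable to the original distances; both choices, along with all names and the random test data, are assumptions for this illustration.

```python
import math
import random

def make_sketch_pool(window, k, seed=0):
    """k random vectors of length `window` with independent N(0, 1) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(window)] for _ in range(k)]

def sketch(x, pool):
    """Inner product of x with every pool vector, scaled by 1/sqrt(k)."""
    scale = 1 / math.sqrt(len(pool))
    return [scale * sum(a * b for a, b in zip(x, r)) for r in pool]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two white-noise-like windows, the case where DFT digests do poorly.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(256)]
y = [rng.gauss(0.0, 1.0) for _ in range(256)]
pool = make_sketch_pool(window=256, k=80)
ratio = distance(sketch(x, pool), sketch(y, pool)) / distance(x, y)
print(ratio)
```

The ratio of sketch distance to real distance should be close to 1, and it concentrates more tightly as the sketch size k grows. Note that the same pool must be used for every series being compared.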
Sketches approximate distance well (real distance / sketch distance) • Sliding window size = 256 and sketch size = 80
Empirical Study: Sketch on Price and Return Data • DFT and DWT work well for prices (today’s price is a good predictor of tomorrow’s) • But badly for returns: (todayprice – yesterdayprice)/todayprice. • Data length = 256. The first 14 DFT coefficients are used in the distance computation; the db2 wavelet is used with coefficient size = 16; the sketch size is 64.
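Sketch Guarantees • Note: Sketches do not provide approximations of an individual time series window but help make comparisons. • Johnson-Lindenstrauss Lemma: For any $0 < \epsilon < 1$ and any integer $n$, let $k$ be a positive integer such that $k \ge 4\left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \ln n$. Then for any set $V$ of $n$ points in $\mathbb{R}^d$, there is a map $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in V$, $(1-\epsilon)\,\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\epsilon)\,\|u-v\|^2$. • Further, this map can be found in randomized polynomial time.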
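To get a feel for the sketch sizes the lemma asks for, the helper below evaluates one commonly cited form of its dimension bound, k >= 4 ln(n) / (eps^2/2 - eps^3/3). The function name is hypothetical, and the bound is a worst-case guarantee; the empirical slides use much smaller sketches (64 or 80 entries).

```python
import math

def jl_min_dim(n, eps):
    """Smallest integer k with k >= 4 * ln(n) / (eps^2/2 - eps^3/3)."""
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(4 * math.log(n) / (eps ** 2 / 2 - eps ** 3 / 3))

for n in (1_000, 10_000, 100_000):
    print(n, jl_min_dim(n, 0.5))
```

The bound depends on the number of points n only through ln n and not at all on the original dimension, which is why the sketch size can stay fixed as the sliding window grows.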
Overcoming curse of dimensionality* • May need many random projections. • Can partition sketches into disjoint pairs or triplets and perform comparisons on those. • Each such small group is placed into an index. • Algorithm must adapt to give the best results. *Idea from P.Indyk,N.Koudas, and S.Muthukrishnan. “Identifying representative trends in massive time series data sets using sketches”. VLDB 2000.
[Figure: time series X, Y, Z and their inner products with random vectors r1, r2, r3, r4, r5, r6]
Further Performance Improvements • Suppose we have R random projections of window size WS. • It might seem that we have to do R*WS work for each timepoint for each time series. • In ongoing work with colleague Richard Cole, we show that we can cut this down by using convolution and an oxymoronic notion of “structured random vectors”*. *Idea from Dimitris Achlioptas, “Database-friendly Random Projections”, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
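The convolution half of this idea can be sketched independently of the structured random vectors, which are not described in these slides. The inner products of one random vector with every window of a series form a cross-correlation, so all of them can be computed with a single FFT-based convolution instead of WS multiplications per time point. The code below is a plain-Python illustration with a textbook radix-2 FFT; all names are assumptions.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is left unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def sliding_dot_products(x, r):
    """Dot product of r with every length-len(r) window of x, via one convolution."""
    size = 1
    while size < len(x) + len(r) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fr = fft([complex(v) for v in reversed(r)] + [0j] * (size - len(r)))
    conv = fft([a * b for a, b in zip(fx, fr)], invert=True)
    ws = len(r)
    # Convolving x with reversed r puts the window starting at i in entry i + ws - 1.
    return [conv[i + ws - 1].real / size for i in range(len(x) - ws + 1)]

print(sliding_dot_products([1, 2, 3, 4, 5], [1, 0, -1]))
```

For a series of length L this costs on the order of L log L per random vector rather than L * WS, which matters for windows in the thousands of points.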
Empirical Study: Speed • Sketch/DFT+Grid structure • Sliding Window Size=3616, basic window size=32 • Correlation>0.9
Query By Humming • You have a song in your head. • You want to get it but don’t know its title. • If you’re not too shy, you hum it to your friends or to a salesperson and you find it. • They may grimace, but you get your CD
With a Little Help From My Warped Correlation • Karen’s humming Match: • Dennis’s humming Match: • “What would you do if I sang out of tune?" • Yunyue’s humming Match:
Related Work in Query by Humming • Traditional method: String Matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99] • Music represented by a string of pitch directions: U, D, S (degenerated interval) • Hum query is segmented into discrete notes, then converted to a string of pitch directions • Edit distance between hum query and music score • Problem • Very hard to segment the hum query • Partial solution: users are asked to hum articulately • New method: matching directly from audio [Mazzoni and Dannenberg 00] • We use both.
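The traditional string-matching method can be sketched as follows, assuming the hum has already been segmented into notes (the hard part noted above). Each melody becomes a string over U (up), D (down), S (same), and melodies are compared by edit distance. The MIDI-style pitch numbers and function names are illustrative assumptions.

```python
def pitch_directions(notes, tolerance=0):
    """Encode a note sequence as U (up), D (down), S (same) between consecutive notes."""
    out = []
    for prev, cur in zip(notes, notes[1:]):
        if cur - prev > tolerance:
            out.append("U")
        elif prev - cur > tolerance:
            out.append("D")
        else:
            out.append("S")
    return "".join(out)

def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

score = pitch_directions([60, 60, 67, 67, 69, 69, 67])  # melody as written
hum = pitch_directions([61, 61, 68, 67, 70, 70, 68])    # transposed, one note off
print(score, hum, edit_distance(score, hum))
```

The encoding is insensitive to transposition, since only the direction of each step is kept, but it also discards interval sizes and rhythm, so many different melodies map to the same string.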