
Continuous Similarity-Based Queries on Streaming Time Series By Like Gao and Xiaoyang Sean Wang Narrated by Craig White

What is the purpose of Gao and Wang’s paper? A) Reduce the number of clock cycles used to run a query? B) Save the managers money by getting more out of their computers? C) Give you your data now? Answer: C, give you your data now!

Why? Because waiting one nanosecond is too long to get what is yours.

When There Were Dinosaurs… Old methods search for data on a first-come, first-served sequential basis. Slow because you have to wait for everyone in front of you. Who wants to wait in line?

Rivers Of Data In Gao and Wang’s scenarios, data requests are constantly streaming in. While the database is searching, the requests keep coming in. Sequential searches cannot keep up with the requests.

Mammals Are Faster One way to improve performance is to batch database requests together. How? Divide constant streams into discrete segments of the same time period. Batch similar requests together. The database will optimize batch search requests.

How Fast Can A Hippo Run? Making the batches takes time. Still have to query and wait for the database. How can we make things go faster?

Age Of Man If we could predict the requests, the CPU could use waiting cycles to form the batches. Batched requests could be sent in advance. Data would be ready when the user submits his query. Don’t have to wait a nanosecond. Remember: Waiting is BAD!

Hallucinations?!? Streaming data requests can form patterns. Need to find the patterns and predict the next N steps. If predictions can be grouped together, use them as the basis of the database requests. By sending the predicted batch requests far enough in advance, the data is back by the time the users make their requests. This technique is called Continuous Query with Prediction (CQP) and uses a Fast Fourier Transform (FFT).

Rocket Science
$$D(x, y) = \sqrt{\frac{\sum_{s=0}^{N} (x[s] - y[s])^2}{l + 1}}$$
D = Euclidean distance, σ (sigma) in normal stats. x, y = two finite patterns, same length. l = length of x or y. Measures how similar or different x and y are. Defines the acceptable error in finding results.
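The distance on this slide can be tried out with a minimal Python sketch. It assumes the divisor l + 1 is simply the number of points in each series (indices 0 through l); the function name `normalized_distance` is made up for illustration and does not come from the paper.

```python
import math


def normalized_distance(x, y):
    """Square root of the mean squared difference of two equal-length series.

    Assumption: the divisor (l + 1) on the slide equals the number of points.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not x:
        raise ValueError("series must not be empty")
    squared_sum = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(squared_sum / len(x))


if __name__ == "__main__":
    # Two flat series that differ by 2 at every point are at distance 2.
    print(normalized_distance([0, 0, 0, 0], [2, 2, 2, 2]))
```

A query's acceptable error is then just a threshold on this value: a pattern matches when its distance to the stream is no larger than the threshold.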

Building The Rocket Using Fast Fourier Transformation we have:
$$\mathrm{CirCCorr}_{x,y} \;\overset{\mathrm{DFT}}{\Longleftrightarrow}\; \langle X[0]\,\overline{Y[0]}, \ldots, X[N]\,\overline{Y[N]} \rangle$$
$\mathrm{CirCCorr}_{x,y}$ = circular correlation. X, Y = finite series of length N + 1. $\overline{Y}$ = complex conjugate of Y. This is how we match patterns together.
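The correlation identity on this slide can be checked numerically. The sketch below is illustrative only: it uses a naive O(n²) DFT from the standard library instead of a real FFT (an FFT would return the same values faster), it assumes the unnormalized DFT convention, and it assumes circular correlation is defined as c[k] = Σ x[(s + k) mod n] · y[s] for real-valued series. All function names are hypothetical.

```python
import cmath


def dft(series):
    """Naive unnormalized discrete Fourier transform."""
    n = len(series)
    return [
        sum(series[s] * cmath.exp(-2j * cmath.pi * f * s / n) for s in range(n))
        for f in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse of dft(), including the 1/n scaling."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n
        for k in range(n)
    ]


def circular_correlation_direct(x, y):
    """c[k] = sum over s of x[(s + k) mod n] * y[s], computed term by term."""
    n = len(x)
    return [sum(x[(s + k) % n] * y[s] for s in range(n)) for k in range(n)]


def circular_correlation_dft(x, y):
    """Same result via the frequency domain: multiply X[f] by conj(Y[f])."""
    product = [a * b.conjugate() for a, b in zip(dft(x), dft(y))]
    return [value.real for value in inverse_dft(product)]


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0]
    print(circular_correlation_direct(x, y))
    print([round(v, 6) for v in circular_correlation_dft(x, y)])
```

The reason this helps with matching: for every circular shift k, the sum of squared differences between the shifted x and y equals Σx² + Σy² − 2·c[k], so one transform pass yields the distances for all shifts at once.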

Why Dinosaurs Can Not Be Spacemen Launch Batch Processing at ps and ps+N

Dinosaurs Cannot Do This Step / Action: 1. From the next position ps, generate n predicted values, and form the predicted series PS. 2. Use the batch process on PS with all pattern series Fi to generate predicted distances for positions ps, … , ps + n – 1. 3. For each time position p, within the range from ps to ps + n – 1, when the actual value arrives do:

Most Mammals Cannot Do This Either 3.1 Use the prediction error, i.e., the distance between the predicted values and actual values, and the predicted distances to partition the patterns Fi into three categories: Category (1): Those that satisfy the query. Category (2): Those that cannot satisfy the query. Category (3): Those that are in neither category (1) nor (2). 3.2 Verify among the candidate patterns to find (further) answers to the query.

Only Humans 4. Change ps to be ps + n and perform steps 1-4 repeatedly. Continuous Query with Prediction (CQP) works best with: a) long-duration stream sources b) data patterns with length < batch periods c) small (but not too small) linear acceptable error ranges.
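Step 3.1 can be sketched in Python as follows. The slides do not spell out the exact test, so this is an assumption: the query is taken to be "distance to the pattern is at most a threshold", and the bounds come from the triangle inequality (the actual distance lies between |predicted − error| and predicted + error). The function and parameter names are hypothetical.

```python
def partition_patterns(predicted_distances, prediction_error, threshold):
    """Split patterns into (satisfied, rejected, candidates).

    predicted_distances: dict mapping pattern name -> predicted distance.
    prediction_error: distance between the predicted and actual stream values.
    threshold: acceptable distance for the query (assumed range query).
    """
    satisfied, rejected, candidates = [], [], []
    for name, predicted in predicted_distances.items():
        lower = abs(predicted - prediction_error)  # actual distance is at least this
        upper = predicted + prediction_error       # and at most this
        if upper <= threshold:
            satisfied.append(name)    # category (1): must satisfy the query
        elif lower > threshold:
            rejected.append(name)     # category (2): cannot satisfy the query
        else:
            candidates.append(name)   # category (3): needs verification (step 3.2)
    return satisfied, rejected, candidates


if __name__ == "__main__":
    predicted = {"F1": 1.0, "F2": 10.0, "F3": 5.0}
    print(partition_patterns(predicted, prediction_error=1.0, threshold=5.0))
```

Only the patterns in the third list need their true distance computed, which is where the savings come from when predictions are good and the error is small.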

Chimps Are Humans’ Closest Cousins Candidates fall in the range from the lower bound to just below the upper bound.

Can We Teach Chimps To Count? Steps to find the candidate patterns for the nearest neighbor. Step 1. QuickSort(TM) the predicted list based on value. Step 2. Find the acceptable error range from the predicted values. Step 3. Locate the candidates that fall inside the acceptable error range.
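The three steps above can be sketched in Python. The bounds used here are an assumption, not taken from the slides: the lower bound is the smallest predicted distance, and the upper bound is that value plus twice the prediction error, since any pattern predicted farther away than that must be farther than the best one even in the worst case. The sketch keeps the upper bound inclusive so the best-predicted pattern is always a candidate, and Python's built-in sort stands in for QuickSort. Names are hypothetical.

```python
from bisect import bisect_right


def nearest_neighbor_candidates(predicted_distances, prediction_error):
    """Return pattern names that could still be the nearest neighbor."""
    # Step 1: sort the predicted list by value.
    ranked = sorted(predicted_distances.items(), key=lambda item: item[1])
    if not ranked:
        return []
    values = [distance for _, distance in ranked]
    # Step 2: derive the acceptable range from the predicted values (assumed bounds).
    lower_bound = values[0]
    upper_bound = lower_bound + 2 * prediction_error
    # Step 3: locate the candidates inside the range with a binary search.
    cutoff = bisect_right(values, upper_bound)
    return [name for name, _ in ranked[:cutoff]]


if __name__ == "__main__":
    predicted = {"F1": 5.0, "F2": 3.0, "F3": 9.0, "F4": 3.5}
    print(nearest_neighbor_candidates(predicted, prediction_error=1.0))
```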

Let’s Test Rockets Using Chimps Experiments were done on test data. * Program written in C++ * Dell Dimension 4100 * 256 MB main memory * PIII 766 CPU * Synthetic stream (IS) at point i: IS[i] = 100 (sin(0.1 * RndWalk[i]) + 1 + i / 20,000), i = 0, …, 19,999, where RndWalk[0:19,999] is a random-walk series.
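A stream of this shape can be generated with a few lines of Python. The slide does not say how the random walk is built, so the sketch assumes it starts at 0 and moves by +1 or −1 at each step; the seed and function name are likewise illustrative (the original experiments were written in C++).

```python
import math
import random


def synthetic_stream(length=20_000, seed=0):
    """IS[i] = 100 * (sin(0.1 * RndWalk[i]) + 1 + i / 20,000)."""
    rng = random.Random(seed)
    walk = 0.0  # assumption: RndWalk[0] = 0
    stream = []
    for i in range(length):
        stream.append(100 * (math.sin(0.1 * walk) + 1 + i / 20_000))
        walk += rng.choice((-1.0, 1.0))  # assumption: unit steps up or down
    return stream


if __name__ == "__main__":
    stream = synthetic_stream()
    print(len(stream), stream[0], min(stream), max(stream))
```

The sine term makes the stream wander between recurring levels, while the i / 20,000 term adds a slow upward drift over the run.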

Selecting Chimps Q1: Find the nearest neighbor of an incoming stream series at each time position. Used to determine cost in terms of CPU cycles. Q2: Find the 30.0 near neighbors of an incoming stream series at each time position. Used to determine relative response time.

Launching Chimps Queries without Continuous Query with Prediction (CQP) averaged 65 seconds. Queries with CQP took only 29 seconds. Average response time per query was about 1.5 ms with CQP vs. 3.3 ms without CQP. Factor of 2x+ with Continuous Query with Prediction!

Even Chimps Can Fly Even on short data streams, CQP can reduce response time. With short duration streams, the CPU actually has to work more with CQP. We still get our data faster because it has been pre-fetched.

Launching Humans Next experiment used real-world data. PAF Prediction Challenge Database: electrocardiogram (ECG) data set from PhysioNet and Computers in Cardiology Organization. Contains 50 streams from known Paroxysmal Atrial Fibrillation (PAF) patients and 50 unknown streams. Sample rates are 128 per second. Values discretized with 0.2 millivolt accuracy. Stream lengths are 30 minutes each.

No Dinosaurs Allowed Here Using Continuous Query with Prediction we cut the number of CPU cycles to between 25% and 40%. Continuous Query with Prediction reduced response time to between 5% and 20% compared to sequential searches. Data is streaming in at 1 request per 7.8125 milliseconds. Sequential searches take 10.5 milliseconds. Searches with CQP take only three to four milliseconds. Sequential searches cannot keep up with the requests, but CQP can easily handle the load.
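The arithmetic behind "cannot keep up" is short enough to check in Python. The sketch assumes one query per incoming sample and a constant per-query cost, and takes the slower end of the CQP range (4 ms); the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 128   # samples per second, from the ECG data set
STREAM_MINUTES = 30    # length of each stream
SEQUENTIAL_MS = 10.5   # per-query time, sequential search
CQP_MS = 4.0           # per-query time with CQP (upper end of 3 to 4 ms)

arrival_interval_ms = 1000 / SAMPLE_RATE_HZ       # time between requests
samples = SAMPLE_RATE_HZ * 60 * STREAM_MINUTES    # requests in one stream
stream_duration_s = 60 * STREAM_MINUTES           # wall-clock time available
sequential_total_s = samples * SEQUENTIAL_MS / 1000
cqp_total_s = samples * CQP_MS / 1000

print(f"a request arrives every {arrival_interval_ms} ms")
print(f"{samples} requests in {stream_duration_s} s of streaming")
print(f"sequential search needs {sequential_total_s:.1f} s")
print(f"CQP needs {cqp_total_s:.1f} s")
```

Under these assumptions sequential search needs roughly 2,419 seconds of work for 1,800 seconds of data, so its backlog keeps growing, while CQP finishes with time to spare.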

Why Dinosaurs Went Extinct Which method of searching do you want your doctor to use during your heart operation?
