Building Efficient Time Series Similarity Search Operator
Mijung Kim
Summer Internship 2013 at HP Labs
Overview
• The internship project is part of a larger project that:
  • builds a scalable analytics framework, and
  • constructs a set of analytic operators within the framework
    • Trade off performance against available resources
    • Multiple implementations with different trade-offs for each operator
    • A mechanism to choose an implementation given constraints
• My goal is to build a time series similarity search operator
  • Parallel data processing
  • Alternative implementations for the time series similarity search
What is a Time Series?
• A time series is a sequence of data points measured repeatedly over time
Example: [Image from Wikipedia: http://en.wikipedia.org/wiki/Time_series]
Time Series Similarity Search
Given a time series database (T) and a query pattern (P), find the k nearest neighbors of the query in the database.
[Figure: the query pattern (P), of query length m, is compared by distance against each time series segment (T_i(j), …, T_i(j+m)) in the time series database (T)]
• Use cases: targeted marketing, anomaly detection, and many more…
• Cost: O(N_t * n * m), where N_t: number of time series, n: time series length, m: query length
• Linear in the query length, so inefficient for large query lengths!
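The naïve search described on this slide can be sketched as follows. This is a minimal illustration, not the operator built during the internship: the slide does not name the distance measure, so Euclidean distance is an assumption here, and the function name `naive_knn_search` is hypothetical.

```python
import heapq
import numpy as np

def naive_knn_search(database, query, k):
    """Return the k closest segments as (distance, series_index, offset) tuples.

    Assumption: Euclidean distance (the slides do not specify the measure).
    """
    query = np.asarray(query, dtype=float)
    m = len(query)
    candidates = []
    for i, series in enumerate(database):
        series = np.asarray(series, dtype=float)
        # Slide a window of length m over the series: n - m + 1 segments.
        for j in range(len(series) - m + 1):
            segment = series[j:j + m]
            # The point-by-point comparison costs O(m) per segment,
            # which gives the O(N_t * n * m) total on the slide.
            dist = float(np.sqrt(np.sum((segment - query) ** 2)))
            candidates.append((dist, i, j))
    return heapq.nsmallest(k, candidates)

database = [[0, 1, 2, 3, 4], [5, 5, 1, 2, 3]]
print(naive_knn_search(database, [1, 2, 3], k=2))
```

Both series contain the query exactly, so the two best matches have distance 0.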
FFT (Fast Fourier Transform) based Search
• Time series data in the time domain can be transformed to the frequency domain
• We can then compute the distance without a point-by-point comparison of each time series segment in the time domain
• The FFT of each time series can be precomputed and reused for every time series segment!
[Image from Wikipedia: http://en.wikipedia.org/wiki/Convolution]
• Cost: O(N_t * n * log n), where N_t: number of time series, n: time series length
• Independent of the query length
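One way to realize the idea on this slide is sketched below. It assumes squared Euclidean distance (the slides do not name the measure) and uses the identity ||x - q||^2 = ||x||^2 - 2 x·q + ||q||^2, where the dot products for all segments come from a single FFT-based convolution. The function name `sliding_sq_distances_fft` and the fixed padding to 2n are illustrative choices, not details from the talk.

```python
import numpy as np

def sliding_sq_distances_fft(series, query):
    """Squared Euclidean distance between query and every length-m segment.

    Assumes 1 <= len(query) <= len(series).
    """
    series = np.asarray(series, dtype=float)
    query = np.asarray(query, dtype=float)
    n, m = len(series), len(query)
    size = 2 * n  # fixed padding, so the series FFT does not depend on m

    # This transform depends only on the series and could be cached.
    series_fft = np.fft.rfft(series, size)
    # Convolving with the reversed query yields the sliding dot products.
    query_fft = np.fft.rfft(query[::-1], size)
    conv = np.fft.irfft(series_fft * query_fft, size)
    dots = conv[m - 1:n]  # dots[j] = sum_t series[j + t] * query[t]

    # Energy of each segment from a cumulative sum of squares, O(n).
    csum = np.concatenate(([0.0], np.cumsum(series ** 2)))
    energy = csum[m:] - csum[:-m]

    sq = energy - 2.0 * dots + np.sum(query ** 2)
    return np.maximum(sq, 0.0)  # clip tiny negatives from round-off

dists = sliding_sq_distances_fft([0, 1, 2, 3, 4], [1, 2, 3])
print(int(np.argmin(dists)))  # offset of the best-matching segment
```

The per-series work is dominated by the transforms, which is where the O(n log n) term on the slide comes from.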
Time Series Search with MapReduce
[Figure: MapReduce data flow]
• The time series database is horizontally partitioned into Time Series Partition_1, Partition_2, …, Partition_n
• Each mapper (Map_1, Map_2, …, Map_n) receives the query pattern and computes the distance between each time series segment in its partition and the query, emitting its local Top-K
• The Top-K Reducer merges the local Top-K lists into the final Top-K query result
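The partition-then-merge flow on this slide can be imitated on a single machine as below. This is only a sketch of the data flow, not Hadoop code: `map_top_k` and `reduce_top_k` are hypothetical names, and Euclidean distance is an assumption.

```python
import heapq
import numpy as np

def map_top_k(partition, query, k):
    """Map task: local top-k (distance, series_id, offset) for one partition."""
    query = np.asarray(query, dtype=float)
    m = len(query)
    local = []
    for series_id, series in partition:
        series = np.asarray(series, dtype=float)
        for j in range(len(series) - m + 1):
            dist = float(np.linalg.norm(series[j:j + m] - query))
            local.append((dist, series_id, j))
    return heapq.nsmallest(k, local)

def reduce_top_k(local_results, k):
    """Reduce task: merge the per-partition lists into the global top-k."""
    merged = [item for local in local_results for item in local]
    return heapq.nsmallest(k, merged)

# Two partitions of (series_id, values) pairs.
partitions = [
    [("a", [0, 1, 2, 3])],
    [("b", [9, 1, 5, 9]), ("c", [1, 2, 2, 2])],
]
query = [1, 2]
local_results = [map_top_k(p, query, k=2) for p in partitions]
print(reduce_top_k(local_results, k=2))
```

Because every mapper keeps k candidates, the reducer sees at most k items per partition and the global top-k is still exact.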
FFT-based vs. Naïve Search
• Single machine vs. cluster (e.g., >15X gain in cluster mode)
• FFT-based search cost is independent of the query length: it is efficient for larger query lengths, but naïve search is better for smaller query lengths
• We can develop query plans based on the query length!
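A query planner along these lines might compare the two cost estimates from the earlier slides, n * m for the naïve search and n * log n for the FFT-based search. The sketch below is purely illustrative: `choose_plan` and the constant `fft_constant` are hypothetical, and in practice the crossover point would have to be measured on the target system.

```python
import math

def choose_plan(series_length, query_length, fft_constant=4.0):
    """Pick the search implementation with the lower estimated cost.

    fft_constant is a made-up tuning factor that stands in for the
    extra per-element overhead of the FFT; it is not from the slides.
    """
    naive_cost = series_length * query_length
    fft_cost = fft_constant * series_length * math.log2(series_length)
    return "naive" if naive_cost <= fft_cost else "fft"

print(choose_plan(series_length=1024, query_length=8))    # short query
print(choose_plan(series_length=1024, query_length=512))  # long query
```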
Lessons so far
• FFT is proven to be efficient for the time series similarity search operation, but
  • there are other, (theoretically) more efficient techniques for the time series similarity search operator, e.g., LSH (locality-sensitive hashing)
• Parallel data processing with MapReduce in a cluster environment helps, but
  • it lacks the rich data analytic algorithms commonly supported by statistical software such as MATLAB and R
• We investigate frameworks that support R with MapReduce as a general analytic operation framework
Why R + MapReduce?
- R is free software and a widely used programming language and environment for statistical computing, data analysis, and graphics
- R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible
- In-memory computation in R is impractical for large-scale data analysis!
[Figure: MapReduce contributes Parallel Processing on Cluster Environment; R contributes Rich Data Analytics Algorithms and Graphics]
Parallel R (Split-apply-combine)
[Figure: the input is split into partitions; R functions are applied to each partition independently; an aggregate function combines the per-partition results]
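The split-apply-combine pattern in this diagram is language-independent. The sketch below shows it in Python rather than R, with a thread pool standing in for cluster workers; the helper names `split` and `split_apply_combine` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, num_partitions):
    """Split a list into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def split_apply_combine(data, apply_fn, combine_fn, num_partitions=4):
    partitions = split(data, num_partitions)
    # Apply: each partition is processed independently of the others.
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(apply_fn, partitions))
    # Combine: an aggregate function merges the per-partition results.
    return combine_fn(partial_results)

# Sum of 0..9 computed as a sum of per-partition sums.
total = split_apply_combine(list(range(10)), sum, sum, num_partitions=3)
print(total)
```

Any function that works on one partition at a time can be passed as `apply_fn`, which is what makes the pattern a natural fit for MapReduce.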
Examples (R + MapReduce)
[Figure: two parallel pipelines, each repeated over many inputs]
• input → R function (ARIMA) → Arima model
• input → R instance (forecast) → Measure error
Figure captions: Movie ratings of each customer; Arima (Autoregressive Integrated Moving Average) model of each customer; Different training periods
[IBM Ricardo, Das et al. SIGMOD ‘10] [Google parallelism, Stokely et al. JSM ‘11]
Time Series Search on RHIPE
RHIPE (www.rhipe.org)
- Open-source R package
- Provides an abstraction layer that allows users to formulate MapReduce jobs in R scripts
[Figure: data exchange between R and Java; labels: FFT R function, R array, R code, Protocol buffer, rJava (R <-> Java), Java code, Java BytesWritable]
[Figure: the query pattern and the time series database partitions (Time Series Partition_1 … Partition_n) feed Map_1 … Map_n; each map emits its Top-K, and the Top-K Reducer produces the query result]
Summary
• Built a time series similarity search operator for a scalable data analytic framework
• Worked with mentors: Jun Li (System) and Krishnamurthy Viswanathan (Data scientist)
• Acted as a bridge between the parallel system and data analysis:
  • designing parallel processing for data analytic algorithms and implementing the algorithms in a cluster environment
[Figure: My Role sits between Parallel Processing on Cluster Environment (Hadoop) and Data Analysis (R, Matlab, C/C++)]
Conclusion (What I gained…)
- Parallel data processing
- Relational database
- Java, MATLAB, C/C++, R, …
- Machine learning algorithms
[Figure labels: Internship work; Research work (+ industry experience)]
- Time series data analysis
- Mathematical techniques (FFT/LSH)
- Hadoop, JNI, …
What’s more…
- An invention disclosure regarding the time series similarity search, filed at HP
- A network of leading researchers in my research area