A fast time series data server

Bob Weigel, George Mason University
Status: In development

Motivation
• Want to do fast, large-scale analysis on time series data.
• Data volume and data processing speed often do matter!
• Speed enables many services.
• If you want users to contribute data, provide
  • $,
  • free storage,
  • better organization and search than their OS + local file system provides, or
  • services on their data that are better than what they can do on their local machine.

Demo

The problems
• Heliophysics “data bases”
• The “granule” paradigm. The fundamental unit is the granule (file), which contains many parameters.
• The “small-box” paradigm. Given a user request, return a list of granule URLs that match. The user needs to do the rest. This leads to slow-downs in response time to queries by up to a factor of 100!
• The fundamental unit exposed to the scientist should be the data set. This requires “aggregation”, which can be client-side or server-side.
• Well-known and widely available relational database systems (RDBS) don’t work well for time series (“column-based” versus “row-based”).

Approaches for Large Scale Analysis
0. Let the user do the “aggregation”
• Service: The “run-on-demand” paradigm - A reader (or “accessor”) is developed for each data provider. It downloads data to the user's computer, extracts the relevant parts, and puts the data in a uniform form in an array or structure in the user's analysis software.
  • Disadvantages: Requires high server reliability (servers are typically run by scientists …). Higher server load and higher data transfer volume.
  • Advantages: No additional disk space. Always up-to-date.
• Service: The “pre-caching” approach - The data are stored in a uniform manner on an intermediate server. The user makes a request to a single server.
  • Disadvantages: More disk space. The cache may be out-of-date.
  • Advantages: 5-100x speed-ups in response. Reliability (errors and server problems are caught ahead of time). Many new services will be enabled.

Ideal Approach
Note that pre-caching requires “run-on-demand” solutions, but takes the data a step further.
Note that a “run-on-demand” approach will eventually develop a caching approach anyway; it is better to develop caching as a separate component.
=> Use “pre-caching” for reasonably sized data sets. Use “run-on-demand” for large data sets and for filling the cache. A significant portion of heliophysics data could be pre-cached.
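The combination described above (serve from the cache when possible, fall back to run-on-demand, and use the run-on-demand path to fill the cache) can be sketched as follows. This is only an illustration, not the server's actual code: `CachedSeries`, `fetch_remote`, and `max_cached_values` are hypothetical names, and the size threshold is an assumed stand-in for "reasonably sized".

```python
class CachedSeries:
    """Serve time series from a local cache, falling back to run-on-demand.

    fetch_remote is a hypothetical callable (name -> list of values) that
    plays the role of the per-provider reader ("accessor").
    """

    def __init__(self, fetch_remote, max_cached_values):
        self._fetch = fetch_remote
        self._max = max_cached_values  # assumed cut-off for "reasonably sized"
        self._cache = {}

    def get(self, name):
        # Pre-cached path: answer from the intermediate store.
        if name in self._cache:
            return self._cache[name]
        # Run-on-demand path: ask the original provider.
        data = self._fetch(name)
        # Fill the cache only for data sets below the size cut-off.
        if len(data) <= self._max:
            self._cache[name] = data
        return data
```

A small data set is fetched from the provider once and then served from the cache; a data set above the cut-off goes to the provider on every request.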

Question
Why hasn’t this been done before?
• It looks like data centralization.
• Without an improved data base, improvements using existing infrastructure are incremental.

Only one data type
• Focus on only one data type: time series.
• Defined as
  • Scalar: x(t), x(t+1), …
  • Vector: Bx(t), By(t), Bx(t+1), By(t+1), …
  • Spectrogram: A1(t), A2(t), …, AN(t), A1(t+1), A2(t+1), …, AN(t+1), …
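The ordering in the definitions above (all components at time t, then all components at t+1) is an interleaved layout. A minimal sketch, with hypothetical helper names `interleave` and `flat_index`:

```python
def interleave(components):
    """Flatten per-component lists into the time-ordered layout above.

    components[k][t] is component k at time step t, so
    [[Bx0, Bx1], [By0, By1]] becomes [Bx0, By0, Bx1, By1].
    """
    n_steps = len(components[0])
    if any(len(c) != n_steps for c in components):
        raise ValueError("all components need the same number of time steps")
    return [c[t] for t in range(n_steps) for c in components]


def flat_index(t, k, n_components):
    """Position of component k at time step t in the flattened sequence."""
    return t * n_components + k
```

A scalar is the case n_components = 1, and a spectrogram with N channels is the case n_components = N.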

Development history
• Developed as a part of ViRBO
• Built on OPeNDAP

Codebase
• Java
• OPeNDAP
  • Have written an “I/O Service Provider” for data files.
  • Added ability to pass time constraint expressions.
  • Added ability to output data as an ASCII table.
  • Added basic filters.

Technical details
• Each time series is stored as a single flat binary file with IEEE 754 floating point values.
• Time series that are close to being on a uniform grid are re-gridded with fill values.
• All time series use a single fill value of NaN.
• Files are stored on a compressed file system.
  • Random access to compressed files is still fast: access is about 6x slower, but the compression ratio is usually 8.
• Files are stored on a versioning file system. Only differences are stored.
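The first three points above can be sketched in a few lines. This is an illustration under stated assumptions, not the server's code: the slides only say "IEEE 754", so the 8-byte little-endian encoding, the nearest-slot rule with tolerance `tol`, and the function names are all assumptions.

```python
import math
import struct


def regrid(times, values, t0, dt, n, tol):
    """Place samples on the uniform grid t0 + i*dt for i in 0..n-1.

    A sample is assigned to its nearest grid slot if it lies within tol
    of that slot; slots that receive no sample keep the NaN fill value.
    """
    grid = [math.nan] * n
    for t, v in zip(times, values):
        i = round((t - t0) / dt)
        if 0 <= i < n and abs(t - (t0 + i * dt)) <= tol:
            grid[i] = v
    return grid


def to_bytes(grid):
    """Encode values as a flat run of little-endian IEEE 754 doubles (assumed width)."""
    return struct.pack(f"<{len(grid)}d", *grid)


def from_bytes(buf):
    """Decode a flat run of little-endian IEEE 754 doubles."""
    return list(struct.unpack(f"<{len(buf) // 8}d", buf))
```

Because the grid is uniform, no time column has to be stored: the time of value i follows from t0 and dt, which is what makes direct offset calculations into the file possible.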

API – lowest level
• HTTP byte-range request
http://timeseries.org/data/TimeSeries.ncml (contains data structure information and a URL to the science metadata)
http://timeseries.org/data/TimeSeries.bin (just a time-ordered set of values Bx(t),By(t),Bx(t+1),By(t+1))
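With a time-ordered flat file, a span of time steps maps to a contiguous byte range, so a client can ask for exactly the bytes it needs. The sketch below builds such a request without sending it. It assumes 8-byte values and that the caller already knows the time-step indices and number of components (for example from the .ncml description); `byte_range` and `range_request` are hypothetical helper names.

```python
import urllib.request


def byte_range(t_start, t_stop, n_components, bytes_per_value=8):
    """Inclusive (first, last) byte offsets for time steps t_start..t_stop-1."""
    if not 0 <= t_start < t_stop:
        raise ValueError("need 0 <= t_start < t_stop")
    record = n_components * bytes_per_value  # bytes per time step
    return t_start * record, t_stop * record - 1


def range_request(url, t_start, t_stop, n_components, bytes_per_value=8):
    """Build (but do not send) an HTTP request carrying a Range header."""
    first, last = byte_range(t_start, t_stop, n_components, bytes_per_value)
    return urllib.request.Request(url, headers={"Range": f"bytes={first}-{last}"})


# Time steps 10..19 of a two-component (Bx, By) series.
req = range_request("http://timeseries.org/data/TimeSeries.bin", 10, 20, 2)
```

Passing `req` to `urllib.request.urlopen` would perform the transfer; a server that honors the header answers with status 206 and only the requested bytes.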

API – highest level
• DAP protocol (builds on HTTP)
http://timeseries.org/data/TimeSeries.{ascii,bin,dods,dat,etc.}?time<1999:01:01
http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10
http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10&filter=5minboxcar
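The request URLs above follow a simple pattern: base name, output format as the extension, then constraints and an optional filter joined with "&". A minimal sketch of a builder for that pattern follows; `dap_url` is a hypothetical helper, and it leaves "<" and ">" unescaped exactly as written on the slide (a real client may need to percent-encode them).

```python
def dap_url(base, fmt, constraints=(), filter_name=None):
    """Assemble a request URL in the pattern shown on the slide.

    base        -- URL without extension, e.g. ".../data/TimeSeries"
    fmt         -- output format used as the extension, e.g. "ascii"
    constraints -- expressions such as "time<1999:01:01" or "value>10"
    filter_name -- optional server-side filter, e.g. "5minboxcar"
    """
    parts = list(constraints)
    if filter_name is not None:
        parts.append(f"filter={filter_name}")
    url = f"{base}.{fmt}"
    if parts:
        url += "?" + "&".join(parts)
    return url


url = dap_url(
    "http://timeseries.org/data/TimeSeries",
    "ascii",
    ["time<1999:01:01", "value>10"],
    filter_name="5minboxcar",
)
```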

Future
• Add submission API
• Implement versioning file system
• Implement suite of filters
• Add ability to scale
• Implement suite of applications
• Connect to Universal Reader Library
• Connect to QData set
