A fast time series data server
Bob Weigel, George Mason University
Status: In development

Motivation
• Want to do fast, large-scale analysis on time series data.
• Data volume and data processing speed often do matter!
• Speed enables many services.
• If you want users to contribute data, provide
  • $,
  • free storage,
  • better organization and search than their OS + local file system provides, or
  • services on their data that are better than what they can do on their local machine.

The problems
• Heliophysics “data bases”
  • The “granule” paradigm. The fundamental unit is the granule (a file), which contains many parameters.
  • The “small-box” paradigm. Given a user request, return a list of granule URLs that match. The user needs to do the rest. Leads to slow-downs in query response time by up to a factor of 100!
• The fundamental unit exposed to the scientist should be the data set. This requires “aggregation”, which can be client-side or server-side.
• Well-known and widely available relational database systems don’t work well for time series (“column-based” versus “row-based”).

Approaches for Large Scale Analysis
0. Let the user do the “aggregation”
• Service: The “run-on-demand” paradigm. A reader (or “accessor”) is developed for each data provider. It downloads data to the user's computer, extracts the relevant parts, and puts the data in a uniform form in an array or structure in the user's analysis software.
  • Disadvantages: Requires high server reliability (servers are typically run by scientists …). Higher server load, higher data transfer volume.
  • Advantages: No additional disk space. Always up-to-date.
• Service: The “pre-caching” approach. The data are stored in a uniform manner on an intermediate server. The user makes a request to a single server.
  • Disadvantages: More disk space. Cache may be out-of-date.
  • Advantages: 5-100x speed-ups in response. Reliability (errors and server problems are caught ahead of time). Many new services will be enabled.

Ideal Approach
Note that pre-caching requires “run-on-demand” solutions, but takes the data a step further.
Note that a “run-on-demand” approach will eventually develop a caching approach anyway; it is better to develop caching as a separate component.
=> Use “pre-caching” for reasonably sized data sets. Use “run-on-demand” for large data sets and for filling the cache.
A significant portion of heliophysics data could be pre-cached.
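
The split between the two approaches can be sketched as follows. This is only an illustration, not the server's implementation: the class name `HybridSource`, the `reader` callable (standing in for a run-on-demand accessor) and the `max_cached_values` threshold are all hypothetical.

```python
class HybridSource:
    """Serve small data sets from a cache; read large ones on demand.

    Hypothetical sketch: `reader` stands in for a run-on-demand accessor
    that takes a data set id and returns a list of values.
    """

    def __init__(self, reader, max_cached_values):
        self._reader = reader
        self._max = max_cached_values
        self._cache = {}

    def get(self, dataset_id):
        # Cache hit: no request to the data provider is needed.
        if dataset_id in self._cache:
            return self._cache[dataset_id]
        # Cache miss: fall back to run-on-demand, which also fills the cache.
        data = self._reader(dataset_id)
        if len(data) <= self._max:
            self._cache[dataset_id] = data
        return data


# Usage with a fake reader that records how often the provider is contacted.
calls = []

def fake_reader(dataset_id):
    calls.append(dataset_id)
    return [0.0] * (10 if dataset_id == "small" else 1000)

source = HybridSource(fake_reader, max_cached_values=100)
for name in ("small", "small", "large", "large"):
    source.get(name)
```

In this sketch the small data set reaches the provider once, while the large one is fetched on every request.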

Question
Why hasn’t this been done before?
• Looks like data centralization.
• Without an improved database, improvements using existing infrastructure are incremental.

Only one data type
• Focus on only one data type: time series.
• Defined as
  • Scalar: x(t), x(t+1), …
  • Vector: Bx(t), By(t), Bx(t+1), By(t+1), …
  • Spectrogram: A1(t), A2(t), …, AN(t), A1(t+1), A2(t+1), …, AN(t+1), …
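
All three cases share one time-ordered, interleaved layout with N values per time step (N = 1 for a scalar, 2 for the Bx/By vector, N channels for a spectrogram). A minimal sketch of separating the components, assuming the values are already in a flat Python list (the function name `split_components` is hypothetical):

```python
def split_components(flat, n_components):
    """Split a time-ordered, interleaved sequence into one list per component.

    `flat` is [c1(t), c2(t), ..., cN(t), c1(t+1), ...].
    """
    if n_components < 1:
        raise ValueError("n_components must be >= 1")
    if len(flat) % n_components != 0:
        raise ValueError("length is not a multiple of n_components")
    # Component k occupies positions k, k + N, k + 2N, ...
    return [list(flat[k::n_components]) for k in range(n_components)]


# Vector example: Bx(t), By(t), Bx(t+1), By(t+1), ...
bx, by = split_components([1.0, 10.0, 2.0, 20.0, 3.0, 30.0], 2)
```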

Development history
• Developed as a part of ViRBO
• Built on OPeNDAP

Codebase
• Java
• OPeNDAP
• Have written an “I/O Service Provider” for data files.
• Added ability to pass time constraint expressions
• Added ability to output data as an ASCII table
• Added basic filters

Technical details
• Each time series is stored as a single flat binary file with IEEE 754 floating point values.
• Time series that are close to being on a uniform grid are re-gridded with fill values.
• All time series use a single fill value of NaN.
• Files are stored on a compressed file system.
  • Fast random access to compressed files: access is about 6x slower than for uncompressed files, but the compression ratio is usually about 8.
• Files are stored on a versioning file system. Only differences are stored.
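
A minimal sketch of the re-gridding and flat binary storage described above. The tolerance rule, the 64-bit width and the little-endian byte order are assumptions made for illustration (the slides say only “IEEE 754 floating point”), and the function names are hypothetical.

```python
import math
import struct


def regrid(times, values, t0, dt, n, tol=0.1):
    """Place samples on the uniform grid t0 + i*dt (i = 0..n-1).

    Grid points with no sample within tol*dt are left as NaN, the single
    fill value. The tolerance rule is an assumption of this sketch.
    """
    grid = [math.nan] * n
    for t, v in zip(times, values):
        i = round((t - t0) / dt)  # index of the nearest grid point
        if 0 <= i < n and abs(t - (t0 + i * dt)) <= tol * dt:
            grid[i] = v
    return grid


def to_flat_binary(grid):
    # Assumption: little-endian 64-bit IEEE 754 doubles, no header.
    return struct.pack("<%dd" % len(grid), *grid)


def from_flat_binary(raw):
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


# Nearly uniform 1-second samples with the sample near t = 2 missing.
grid = regrid([0.0, 1.02, 2.98, 4.0], [5.0, 6.0, 7.0, 8.0], t0=0.0, dt=1.0, n=5)
raw = to_flat_binary(grid)
restored = from_flat_binary(raw)
```

Because the grid is uniform and every value has a fixed width, the byte position of any time step follows directly from its index, which is what makes random access into the file cheap.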

API – lowest level
• HTTP byte-range request
  http://timeseries.org/data/TimeSeries.ncml (contains data structure information and a URL to the science metadata)
  http://timeseries.org/data/TimeSeries.bin (just a time-ordered set of values Bx(t), By(t), Bx(t+1), By(t+1), …)
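
A sketch of how a client might turn a range of time steps into an HTTP byte-range request against the `.bin` file. It assumes 8-byte values and a known number of components per time step; in practice both would come from the `.ncml` description. The helper names are hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request

BYTES_PER_VALUE = 8  # assumption: 64-bit IEEE 754 doubles


def byte_range(first_step, last_step, n_components):
    """Inclusive byte range covering time steps first_step..last_step."""
    record = n_components * BYTES_PER_VALUE  # bytes per time step
    start = first_step * record
    end = (last_step + 1) * record - 1       # HTTP ranges are inclusive
    return start, end


def range_request(url, first_step, last_step, n_components):
    """Build (but do not send) a request for the given time steps."""
    start, end = byte_range(first_step, last_step, n_components)
    return urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (start, end)}
    )


# Time steps 100..199 of a two-component (Bx, By) series.
req = range_request("http://timeseries.org/data/TimeSeries.bin", 100, 199, 2)
range_header = req.get_header("Range")
```

A server that honors the `Range` header answers with status 206 and only the requested bytes.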

API – highest level
• DAP protocol (builds on HTTP)
  http://timeseries.org/data/TimeSeries.{ascii,bin,dods,dat,etc.}?time<1999:01:01
  http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10
  http://timeseries.org/data/TimeSeries.ascii?time<1999:01:01&value>10&filter=5minboxcar
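
The three example URLs follow one pattern: a base name, an output extension, and `&`-separated constraints, optionally ending with a `filter=` term. A small sketch that assembles such URLs exactly as written on the slide (the helper `dap_url` is hypothetical, and a real client may need to percent-encode `<` and `>`):

```python
def dap_url(base, ext, constraints=(), filter_name=None):
    """Assemble base.ext?c1&c2&filter=name in the style of the slide's examples."""
    parts = list(constraints)
    if filter_name is not None:
        parts.append("filter=" + filter_name)
    url = "%s.%s" % (base, ext)
    if parts:
        url += "?" + "&".join(parts)
    return url


url = dap_url(
    "http://timeseries.org/data/TimeSeries",
    "ascii",
    ["time<1999:01:01", "value>10"],
    "5minboxcar",
)
```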

Future
• Add submission API
• Implement versioning file system
• Implement suite of filters
• Add ability to scale
• Implement suite of applications
• Connect to Universal Reader Library
• Connect to QData set