Cleaning and Processing Physical Time-Series Data
Stephen Dawson-Haggerty
Computer Science Division, University of California, Berkeley
stevedh@eecs.berkeley.edu
Introduction
Lots of sMAP frontends/apps growing up
• Enabled by architectural decoupling
Local Summer Retreat 2012
Services Communicating with sMAP
Many deployments share common infrastructure
[Diagram labels: 6LoWPAN networks, RS-485 bus, BACnet/IP, sMAP (three instances), Archiver, RDBMS, TSDB; applications: control, web, models, mgmt; "lines of decoupling"]
slicr: Tags Generate Multiple Views
[
  { tag: "Metadata/SourceName",
    restrict: "has Metadata/Extra/EndUse" },
  { tag: "Metadata/Extra/EndUse" },
  { tag: "Metadata/Extra/Category",
    defaultSubStream: "Properties/UnitofMeasure = 'mW'",
    seriesLabel: ["Metadata/Location/Room", "Metadata/Extra/Load"] },
  { tag: "Metadata/Extra/ProductType",
    defaultSubStream: "Properties/UnitofMeasure = 'mW'",
    seriesLabel: ["Metadata/Location/Room", "Metadata/Extra/Load"] },
  { tag: "Metadata/Instrument/PartNumber",
    defaultSubStream: "Properties/UnitofMeasure = 'mW'",
    seriesLabel: ["Metadata/Instrument/PartNumber", "Metadata/Location/Room", "Metadata/Extra/Load"] },
  "Properties/UnitofMeasure"
]
[Architecture diagram labels: query, aggregate, resample, streaming pipeline, insert; Time-series Interface; readingdb: Bucketing, Compression, Storage mapper, RPC; Key-Value Store; SQL; MySQL: Page Cache, Lock Manager, Storage Alloc.]
Motivation
• Very common operations
• Resample/subsample
• Aggregate
• Filter/smooth
• Exploratory interactive analysis
• Visualization
• Recalibration/post-calibration
• Get data into MATLAB
• What are the right data access primitives with these facts in mind?
Larry Ellison’s query
• Extract data from 100 streams
• Interpolate onto a 5-minute time basis
• Combine the streams into a matrix
• Filter missing data
• Load into MATLAB/R/numpy
apply missing < paste < window(count, field="minute", width=15)
  to data in ("4/20/2012", "4/21/2012")
  where Metadata/Extra/System = 'datacenter'
    and Properties/UnitofMeasure = 'kW'
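The steps on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not sMAP code: the helper names `interpolate` and `to_matrix` are hypothetical, streams are assumed to be sorted lists of (timestamp, value) pairs with timestamps in seconds, and linear interpolation is one possible reading of "interpolate".

```python
from bisect import bisect_left

def interpolate(series, grid):
    """Linearly interpolate sorted (timestamp, value) pairs onto grid.

    Grid points outside the range of the data are reported as None.
    (Hypothetical helper, not part of sMAP.)
    """
    times = [t for t, _ in series]
    out = []
    for g in grid:
        i = bisect_left(times, g)
        if i < len(times) and times[i] == g:
            out.append(series[i][1])          # exact hit, no interpolation needed
        elif i == 0 or i == len(times):
            out.append(None)                  # before first or after last reading
        else:
            (t0, v0), (t1, v1) = series[i - 1], series[i]
            out.append(v0 + (v1 - v0) * (g - t0) / (t1 - t0))
    return out

def to_matrix(streams, start, end, step=300):
    """Put every stream on a common 5-minute grid and drop incomplete rows."""
    grid = list(range(start, end + 1, step))
    columns = [interpolate(s, grid) for s in streams]
    rows = [[t] + [c[k] for c in columns] for k, t in enumerate(grid)]
    return [r for r in rows if None not in r]   # the "filter missing data" step

streams = [
    [(0, 10.0), (600, 30.0)],    # stream with a gap at t=300
    [(300, 1.0), (600, 2.0)],    # stream that starts late
]
matrix = to_matrix(streams, 0, 600)
```

Each row of `matrix` is a timestamp followed by one value per stream, which is the shape that MATLAB, R, or numpy would load directly.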
Design goals
• Replace error-prone and poorly-performing application code
• Windowed filters, merging
• Optimize common data cleaning operations
• i.e., materialize subsampled time-series for low latency
• Enable actions on streaming, processed data
• Work seamlessly on historical + streaming
Approaches
• SQL:2003 windowing functions
  SELECT OVER (ORDER BY time ROWS 10 PRECEDING)
• Language toolkit approaches
  • Python/pandas
  • R
  • Stata
  • MATLAB
• Distributed frameworks
  • Pig, Hive
• Database approaches
  • SciDB
Processing model
pipes of operators
• Unix philosophy
• process metadata alongside data
each stage defines a new set of distillate streams
What is an operator
• An operator reads a set of input streams
• And produces a set of distillate streams
• May mutate any of the dimensions
• Each output stream is uniquely named
Example: unit
Read a set of input streams and apply a common set of unit conversions
unit: (S, T, W) → (S, T, W)
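A minimal sketch of what a `unit`-style operator might look like, assuming each stream is a (metadata, data) pair, that metadata carries a `unit` key, and that the only conversion needed is Fahrenheit to Celsius. The function below is hypothetical and is not the sMAP implementation; it only illustrates that the shape (S, T, W) is unchanged while values and metadata are rewritten.

```python
def unit(streams):
    """Convert every 'Deg F' stream to 'C'; leave other streams untouched.

    streams: list of (metadata dict, list of [timestamp, value]) pairs.
    Returns the same number of streams, rows, and columns: (S, T, W) -> (S, T, W).
    """
    out = []
    for meta, data in streams:
        if meta["unit"] == "Deg F":
            data = [[t, (v - 32.0) * 5.0 / 9.0] for t, v in data]
            meta = dict(meta, unit="C")   # metadata is rewritten alongside the data
        out.append((meta, data))
    return out

inputs = [
    ({"id": 1, "unit": "C"}, [[0, 20.0]]),
    ({"id": 2, "unit": "Deg F"}, [[0, 212.0]]),
]
converted = unit(inputs)
```

The slide also says each output stream is uniquely named; this sketch leaves identifiers alone for brevity.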
Processing model
[Diagram: an operator (op) applied to a set of streams; axes labeled "streams" and "time"]
Dimensionality:
- S: Streams
- T: Time
- W: “Width” (= 2)
Example streams:
- Type: OAT, Unit: C, ID: 1
- Type: OAT, Unit: Deg F, ID: 2
- Type: OAT, Unit: C, ID: 3
Operator construction
• Specialize: bind to arguments
  op = add(10)
• Instantiate: bind to stream metadata; generate new metadata
  op([{id: 1, unit: "C"}, {id: 2, unit: "Deg F"}]) → [{id: 47, unit: "C"}]
• Process
  op([[[1337628952, 23]], [[1337628950, 70]]])
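The three phases above could be modeled roughly as follows. This is a sketch under assumptions: the class name `Add`, the method names `instantiate` and `process`, and the id allocator are all hypothetical, and the sketch emits one output stream per input stream, which may differ from how the real operator assigns output streams.

```python
import itertools

_next_id = itertools.count(100)   # hypothetical allocator for new stream ids

class Add:
    def __init__(self, amount):
        # Specialize: bind the operator to its arguments.
        self.amount = amount
        self.outputs = None

    def instantiate(self, inputs):
        # Instantiate: bind to input metadata and generate output metadata.
        self.outputs = [dict(meta, id=next(_next_id)) for meta in inputs]
        return self.outputs

    def process(self, data):
        # Process: data is a list of streams, each a list of [timestamp, value].
        if self.outputs is None:
            raise RuntimeError("instantiate() must be called before process()")
        return [[[t, v + self.amount] for t, v in stream] for stream in data]

op = Add(10)
out_meta = op.instantiate([{"id": 1, "unit": "C"}, {"id": 2, "unit": "Deg F"}])
out_data = op.process([[[1337628952, 23]], [[1337628950, 70]]])
```

Separating the phases means metadata can be computed once per query, while `process` can be called repeatedly as batches of readings arrive.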
Operators are first class
• Pass operators as arguments
• window(mean, field="minute", width=15)
• Apply the mean operator to windows in time 15 minutes long
• This produces time vectors with a common timebase
apply sum(axis=1) < paste < window(count, field="minute", width=15)
  to data in ("4/20/2012", "4/21/2012")
  where Metadata/Extra/System = 'datacenter'
    and Properties/UnitofMeasure = 'kW'
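Passing an operator as an argument can be illustrated with a higher-order function. The sketch below assumes timestamps in seconds and fixed windows aligned to multiples of the window width; `window` and `mean` here are hypothetical Python stand-ins, not the sMAP operators themselves.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def window(op, width_seconds):
    """Return an operator that applies op to each fixed-width window in time."""
    def windowed(series):
        buckets = defaultdict(list)
        for t, v in series:
            # Align each reading to the start of its window.
            buckets[t - t % width_seconds].append(v)
        return [[start, op(vals)] for start, vals in sorted(buckets.items())]
    return windowed

fifteen_minute_mean = window(mean, 15 * 60)
fifteen_minute_count = window(len, 15 * 60)

series = [[0, 1.0], [60, 3.0], [900, 10.0], [1700, 20.0]]
means = fifteen_minute_mean(series)
counts = fifteen_minute_count(series)
```

Because every output timestamp is a multiple of the window width, streams processed this way share a timebase and can be lined up row by row.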
Other operators work on streams
• For instance, merging on timestamps
• paste: (S, T, W) → (1, |unique(T)|, sum(W))
apply sum(axis=1) < paste < window(count, field="minute", width=15)
  to data in ("4/20/2012", "4/21/2012")
  where Metadata/Extra/System = 'datacenter'
    and Properties/UnitofMeasure = 'kW'
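Merging on timestamps amounts to an outer join across streams. The following is a sketch, assuming each stream is a list of [timestamp, value] rows and that missing readings are represented as `None`; it keeps a single shared timestamp column, which is one reading of the output width and is an assumption rather than something the slide states.

```python
def paste(streams):
    """Merge S streams into one table with a row per unique timestamp."""
    timestamps = sorted({t for s in streams for t, _ in s})
    lookups = [{t: v for t, v in s} for s in streams]
    # One row per unique timestamp: [t, value from stream 0, value from stream 1, ...]
    return [[t] + [lk.get(t) for lk in lookups] for t in timestamps]

a = [[0, 1.0], [900, 2.0]]
b = [[900, 5.0], [1800, 6.0]]
merged = paste([a, b])
```

Rows containing `None` are the ones a later `missing` stage would filter out.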
Specialization
• Operators specialize their type based on args
• sum(axis=0): (S, T, W) → (S, 1, W)
• sum(axis=1): (S, T, W) → (S, T, 2)
apply sum(axis=1) < paste < window(count, field="minute", width=15)
  to data in ("4/20/2012", "4/21/2012")
  where Metadata/Extra/System = 'datacenter'
    and Properties/UnitofMeasure = 'kW'
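The two shapes listed above can be checked with a small sketch. Assumptions: a stream is a list of rows `[timestamp, v1, v2, ...]` with no missing values, and when collapsing the time axis the first timestamp is kept for the single output row (the slide does not say which timestamp is used). `make_sum` is a hypothetical name.

```python
def make_sum(axis):
    """Return a sum operator whose output shape depends on the axis argument."""
    if axis == 0:
        def sum_over_time(stream):
            # (T, W) -> (1, W): collapse all rows into one.
            if not stream:
                return []
            width = len(stream[0])
            totals = [sum(row[c] for row in stream) for c in range(1, width)]
            return [[stream[0][0]] + totals]
        return sum_over_time
    if axis == 1:
        def sum_over_columns(stream):
            # (T, W) -> (T, 2): collapse value columns, keep the timestamp.
            return [[row[0], sum(row[1:])] for row in stream]
        return sum_over_columns
    raise ValueError("axis must be 0 or 1")

stream = [[0, 1.0, 2.0], [900, 3.0, 4.0]]
by_time = make_sum(0)(stream)
by_column = make_sum(1)(stream)
```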
Operators can be pipelined
• Since the outputs are just more streams
  sum(axis=1) < paste < window(count, field="minute", width=15)
• Each operator mutates the output metadata
• By default, you get the intersection of the shared metadata
apply sum(axis=1) < paste < window(count, field="minute", width=15)
  to data in ("4/20/2012", "4/21/2012")
  where Metadata/Extra/System = 'datacenter'
    and Properties/UnitofMeasure = 'kW'
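Pipelining and the default metadata rule can be sketched as two small helpers. This assumes that `a < b < c` means data flows from right to left (so `c` runs first), and that "intersection" means keeping only the key/value pairs identical across all input streams; both `pipeline` and `shared_metadata` are hypothetical names used for illustration.

```python
def pipeline(*ops):
    """Compose operators so the rightmost one runs first, as in 'a < b < c'."""
    def run(data):
        for op in reversed(ops):
            data = op(data)
        return data
    return run

def shared_metadata(metas):
    """Keep only the key/value pairs that every input stream agrees on."""
    first, rest = metas[0], metas[1:]
    return {
        k: v for k, v in first.items()
        if all(k in m and m[k] == v for m in rest)
    }

# Toy operators standing in for real stages.
double = lambda x: x * 2
increment = lambda x: x + 1
result = pipeline(increment, double)(3)   # double runs first, then increment

common = shared_metadata([
    {"System": "datacenter", "Room": "1"},
    {"System": "datacenter", "Room": "2"},
])
```

Under this rule, a merged stream keeps tags such as the system name while dropping tags that differ between its inputs, such as the room.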
Implementation
• Integrated into query language for locating/selecting streams
• Data exploration and selection (Winter '12)
• Integrated with backend time-series database (readingdb)
• Range queries of time-series data at line rates (Summer '11)
• Streaming output to clients
Insights and key questions
• Pushing tuples through operators is a good fit
• But need to allow batching for efficiency
• Heavy use of C libraries
• Maintain provenance as metadata
• Working with the time axis is different from the other axes
• When do you produce a record?
• Many optimizations are possible
• Materialize subsampled view of data for interactive exploration
Status
Part of sMAP 2.0.300rc1!
http://code.google.com/p/smap-data