
Cleaning and Processing Physical Time-Series Data

Cleaning and Processing Physical Time-Series Data. Stephen Dawson-Haggerty, Computer Science Division, University of California, Berkeley, stevedh@eecs.berkeley.edu. Introduction: lots of sMAP frontends/apps growing up, enabled by architectural decoupling. Services communicating with sMAP.


Presentation Transcript


  1. Cleaning and Processing Physical Time-Series Data
     Stephen Dawson-Haggerty
     Computer Science Division, University of California, Berkeley
     stevedh@eecs.berkeley.edu

  2. Introduction
     Lots of sMAP frontends/apps growing up
     • Enabled by architectural decoupling
     Local Summer Retreat 2012

  3. Services Communicating with sMAP
     Many deployments share common infrastructure
     [Architecture diagram. Labels: 6LoWPAN networks, RS-485 bus, BACnet/IP, sMAP (three instances), Archiver, RDBMS, TSDB, control, web, models, mgmt, lines of decoupling]

  4. slicr: Tags Generate Multiple Views
     [
       { tag: "Metadata/SourceName", restrict: "has Metadata/Extra/EndUse" },
       { tag: "Metadata/Extra/EndUse" },
       { tag: "Metadata/Extra/Category",
         defaultSubStream: "Properties/UnitofMeasure = 'mW'",
         seriesLabel: ["Metadata/Location/Room", "Metadata/Extra/Load"] },
       { tag: "Metadata/Extra/ProductType",
         defaultSubStream: "Properties/UnitofMeasure = 'mW'",
         seriesLabel: ["Metadata/Location/Room", "Metadata/Extra/Load"] },
       { tag: "Metadata/Instrument/PartNumber",
         defaultSubStream: "Properties/UnitofMeasure = 'mW'",
         seriesLabel: ["Metadata/Instrument/PartNumber", "Metadata/Location/Room", "Metadata/Extra/Load"] },
       "Properties/UnitofMeasure"
     ]

  5. [Architecture diagram. Labels: query, aggregate, resample, streaming pipeline, insert, Time-series Interface, Bucketing, Compression, Storage mapper, RPC, readingdb, Key-Value Store, SQL, Page Cache, Lock Manager, Storage Alloc., MySQL]

  6. Motivation
     • Very common operations
     • Resample/subsample
     • Aggregate
     • Filter/smooth
     • Exploratory interactive analysis
     • Visualization
     • Recalibration/post-calibration
     • Get data into MATLAB
     • What are the right data access primitives with these facts in mind?

  7. Larry Ellison's query
     • Extract data from 100 streams
     • Interpolate onto a 5-minute time basis
     • Combine the streams into a matrix
     • Filter missing data
     • Load into MATLAB/R/numpy
     apply missing < paste < window(count, field="minute", width=15) to data in ("4/20/2012", "4/21/2012") where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
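The steps in this query can be sketched in plain Python. This is only an illustration under assumptions, not the sMAP implementation: streams are taken to be sorted lists of (timestamp, value) pairs, linear interpolation stands in for whatever the real operators do, and the names `interpolate` and `build_matrix` are hypothetical.

```python
from bisect import bisect_left

def interpolate(stream, times):
    """Linearly interpolate a sorted (ts, value) stream at each requested time.

    Times outside the stream's range yield None (treated as missing data).
    """
    ts = [t for t, _ in stream]
    out = []
    for t in times:
        i = bisect_left(ts, t)
        if i < len(ts) and ts[i] == t:
            out.append(stream[i][1])          # exact reading available
        elif 0 < i < len(ts):
            (t0, v0), (t1, v1) = stream[i - 1], stream[i]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        else:
            out.append(None)                  # before first or after last reading
    return out

def build_matrix(streams, start, end, step=300):
    """Put all streams on a common timebase and drop rows with missing data."""
    times = list(range(start, end, step))
    columns = [interpolate(s, times) for s in streams]
    rows = [[t, *vals] for t, vals in zip(times, zip(*columns))]
    return [row for row in rows if None not in row]

a = [(0, 0.0), (600, 60.0)]
b = [(300, 1.0), (900, 3.0)]
matrix = build_matrix([a, b], 0, 900)
# matrix == [[300, 30.0, 1.0], [600, 60.0, 2.0]]
```

The resulting list of rows can be handed directly to numpy or written out for MATLAB or R.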

  8. Design goals
     • Replace error-prone and poorly-performing application code
     • Windowed filters, merging
     • Optimize common data cleaning operations
     • e.g., materialize subsampled time-series for low latency
     • Enable actions on streaming, processed data
     • Work seamlessly on historical + streaming

  9. Approaches
     • SQL:2003 windowing functions
       SELECT OVER (ORDER BY time ROWS 10 PRECEDING)
     • Language toolkit approaches: Python/pandas, R, Stata, MATLAB
     • Distributed frameworks: Pig, Hive
     • Database approaches: SciDB

  10. Processing model
      Pipes of operators
      • Unix philosophy
      • Process metadata alongside data
      Each stage defines a new set of distillate streams

  11. What is an operator
      • An operator reads a set of input streams
      • And produces a set of distillate streams
      • May mutate any of the dimensions
      • Each output stream is uniquely named
      Example: unit. Reads a set of input streams and applies a common set of unit conversions.
      unit: (S, T, W) → (S, T, W)
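A minimal sketch of a unit-conversion operator, assuming each stream is a (metadata, readings) pair. The conversion table, the `source` tag, and the use of UUIDs for output names are illustrative assumptions rather than sMAP's actual behaviour.

```python
import uuid

# Hypothetical conversion table: source unit -> (target unit, conversion function).
CONVERSIONS = {"Deg F": ("C", lambda f: (f - 32.0) * 5.0 / 9.0)}

def unit(streams):
    """Return distillate streams with readings converted to common units.

    The shape is preserved: same number of streams, timestamps, and columns.
    """
    out = []
    for meta, data in streams:
        # Streams whose unit has no conversion pass through unchanged.
        target, fn = CONVERSIONS.get(meta["unit"], (meta["unit"], lambda v: v))
        # Each output gets a fresh unique id and remembers where it came from.
        new_meta = dict(meta, unit=target, id=str(uuid.uuid4()), source=meta["id"])
        out.append((new_meta, [(ts, fn(v)) for ts, v in data]))
    return out

result = unit([
    ({"id": 1, "unit": "C"}, [(0, 20.0)]),
    ({"id": 2, "unit": "Deg F"}, [(0, 212.0)]),
])
```

Recording the input id on the output is one simple way to carry provenance along with the data.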

  12. Processing model
      [Diagram: a set of streams passing through an operator (op) along a time axis]
      Dimensionality:
      - S: Streams
      - T: Time
      - W: "Width" (= 2)
      Example streams:
      - Type: OAT, Unit: C, ID: 1
      - Type: OAT, Unit: Deg F, ID: 2
      - Type: OAT, Unit: C, ID: 3

  13. Operator construction
      • Specialize: bind to arguments
        op = add(10)
      • Instantiate: bind to stream metadata; generate new metadata
        op([{id: 1, unit: "C"}, {id: 2, unit: "Deg F"}]) → [{id: 47, unit: "C"}]
      • Process
        op([[[1337628952, 23]], [[1337628950, 70]]])
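The three construction phases might look like this in Python. The class name `Add`, its method names, and the one-output-per-input id scheme are hypothetical; the slide's own example maps two inputs to a single output, which this sketch does not attempt to reproduce.

```python
class Add:
    """Hypothetical operator that adds a constant to every value."""

    def __init__(self, amount):
        # Specialize: bind the operator to its arguments.
        self.amount = amount
        self.outputs = None

    def instantiate(self, inputs, next_id=47):
        # Instantiate: bind to input metadata and generate output metadata.
        # Assumption: one output stream per input, each with a new id.
        self.outputs = [dict(meta, id=next_id + i) for i, meta in enumerate(inputs)]
        return self.outputs

    def process(self, data):
        # Process: `data` holds one list of [timestamp, value] pairs per stream.
        return [[[ts, v + self.amount] for ts, v in stream] for stream in data]

op = Add(10)
meta = op.instantiate([{"id": 1, "unit": "C"}])
out = op.process([[[1337628952, 23]]])
# out == [[[1337628952, 33]]]
```

Separating instantiation from processing lets the output metadata be computed once, before any readings arrive.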

  14. Operators are first class
      • Pass operators as arguments
      • window(mean, field="minute", width=15)
      • Apply the mean operator to windows in time 15 minutes long
      • This produces time vectors with a common timebase
      apply sum(axis=1) < paste < window(count, field="minute", width=15) to data in ("4/20/2012", "4/21/2012") where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
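Passing an operator as an argument can be sketched with a higher-order function. This is an assumption-laden illustration: `window` here takes a width in seconds instead of the `field`/`width` pair used in the query language, and windows are aligned to multiples of that width.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of values."""
    return sum(values) / len(values)

def window(op, width_seconds):
    """Return an operator that applies `op` to each fixed-width time window."""
    def windowed(stream):
        buckets = {}
        for ts, value in stream:
            # Align every reading to the start of its window.
            buckets.setdefault(ts - ts % width_seconds, []).append(value)
        return [(start, op(vals)) for start, vals in sorted(buckets.items())]
    return windowed

fifteen_minute_mean = window(mean, 15 * 60)
fifteen_minute_count = window(len, 15 * 60)
```

Because every output timestamp is a multiple of the window width, streams processed this way share a common timebase and can be merged row by row.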

  15. Other operators work on streams
      • For instance, merging on timestamps
      • paste: (S, T, W) → (1, |unique(T)|, sum(W))
      apply sum(axis=1) < paste < window(count, field="minute", width=15) to data in ("4/20/2012", "4/21/2012") where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
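A sketch of merging on timestamps, assuming each stream is a list of (timestamp, value) pairs. Filling gaps with None is an assumption of this example, not something the slide specifies.

```python
def paste(streams):
    """Join streams on timestamp into rows [time, v1, v2, ...].

    There is one row per unique timestamp; a stream with no reading
    at that time contributes None.
    """
    lookups = [dict(stream) for stream in streams]
    times = sorted(set().union(*lookups))
    return [[t] + [lookup.get(t) for lookup in lookups] for t in times]

merged = paste([
    [(0, 1.0), (900, 2.0)],
    [(900, 5.0), (1800, 6.0)],
])
# merged == [[0, 1.0, None], [900, 2.0, 5.0], [1800, None, 6.0]]
```

Many input streams collapse into a single wide one, which is why the stream dimension of the result is 1.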

  16. Specialization
      • Operators specialize their type based on args
      • sum(axis=0): (S, T, W) → (S, 1, W)
      • sum(axis=1): (S, T, W) → (S, T, 2)
      apply sum(axis=1) < paste < window(count, field="minute", width=15) to data in ("4/20/2012", "4/21/2012") where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
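The two specializations can be illustrated on a single stream stored as rows of [time, v1, v2, ...]. The factory name `make_sum` is hypothetical, and keeping the last timestamp for the axis=0 result is an assumption made only so the output row has a time column.

```python
def make_sum(axis):
    """Specialize a sum operator for one stream of rows [time, v1, v2, ...]."""

    def sum_over_time(stream):
        # axis=0: collapse T to 1 while keeping the width.
        # Assumption: the single output row carries the last timestamp.
        columns = list(zip(*stream))
        return [[columns[0][-1]] + [sum(col) for col in columns[1:]]]

    def sum_across_columns(stream):
        # axis=1: keep T, collapse the value columns so the width becomes 2.
        return [[row[0], sum(row[1:])] for row in stream]

    return sum_over_time if axis == 0 else sum_across_columns

rows = [[0, 1.0, 2.0], [900, 3.0, 4.0]]
over_time = make_sum(0)(rows)        # one row, same width
across_columns = make_sum(1)(rows)   # same number of rows, width 2
```

Applying the chosen function to each stream in turn leaves the stream dimension S unchanged in both cases.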

  17. Operators can be pipelined
      • Since the outputs are just more streams
        sum(axis=1) < paste < window(count, field="minute", width=15)
      • Each operator mutates the output metadata
      • By default, you get the intersection of the shared metadata
      apply sum(axis=1) < paste < window(count, field="minute", width=15) to data in ("4/20/2012", "4/21/2012") where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
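Pipelining and the default metadata rule can be sketched as follows. It assumes, based on reading the query, that `a < b < c` means data flows from right to left (c runs first); the helper names `pipeline` and `shared_metadata` are hypothetical, and metadata values are assumed to be hashable (for example strings).

```python
from functools import reduce

def pipeline(*ops):
    """Compose operators right to left, mirroring `a < b < c` in a query."""
    return lambda data: reduce(lambda acc, op: op(acc), reversed(ops), data)

def shared_metadata(metas):
    """Keep only the tag/value pairs on which every input stream agrees."""
    if not metas:
        return {}
    common = set(metas[0].items())
    for meta in metas[1:]:
        common &= set(meta.items())
    return dict(common)

double = lambda xs: [x * 2 for x in xs]
increment = lambda xs: [x + 1 for x in xs]
result = pipeline(double, increment)([1, 2])   # increment runs first
tags = shared_metadata([
    {"unit": "kW", "room": "A"},
    {"unit": "kW", "room": "B"},
])
```

Here `tags` keeps the unit, which both streams share, and drops the room, on which they differ.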

  18. Implementation
      • Integrated into query language for locating/selecting streams
      • Data exploration and selection (Winter '12)
      • Integrated with backend time-series database (readingdb)
      • Range queries of time-series data at line rates (Summer '11)
      • Streaming output to clients

  19. Insights and key questions
      • Pushing tuples through operators is a good fit
      • But need to allow batching for efficiency
      • Heavy use of C libraries
      • Maintain provenance as metadata
      • Working with the time axis is different from the other axes
      • When do you produce a record?
      • Many optimizations are possible
      • Materialize subsampled view of data for interactive exploration
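The batching point can be sketched with a generator that groups a stream of readings into fixed-size chunks, so each operator call handles many tuples at once. The function name and the fixed batch size are illustrative assumptions.

```python
from itertools import islice

def batches(readings, size):
    """Yield successive lists of up to `size` readings from any iterable."""
    it = iter(readings)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

chunks = list(batches(range(5), 2))
# chunks == [[0, 1], [2, 3], [4]]
```

Because the generator pulls lazily from its input, the same code serves both historical range queries and live streaming data.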

  20. Status
      Part of sMAP 2.0.300rc1!
      http://code.google.com/p/smap-data
