
HiFi: Network-centric Query Processing in the Physical World

This presentation explores the challenges of processing large-scale data volumes generated by network-centric systems in the physical world. It discusses the hierarchical aggregation, filtering, and cleaning of data, as well as the integration across enterprises. The presentation also introduces the HiFi data management infrastructure, which provides a uniform and declarative framework for managing distributed receptor data.



Presentation Transcript


  1. HiFi: Network-centric Query Processing in the Physical World Mike Franklin UC Berkeley SAP Research Forum February 2005

  2. Introduction • Receptors everywhere! • Wireless sensor networks, RFID technologies, digital homes, network monitors, ... • Large-scale deployments will be High Fan-In Systems

  3. High Fan-in Systems: the “Bowtie” • Large numbers of receptors = large data volumes • Hierarchical, successive aggregation

  4. High Fan-in Example (SCM) • Levels from center to edge: Headquarters, Regional Centers, Warehouses/Stores, Dock doors/Shelves, Receptors

  5. Properties • High Fan-In, globally-distributed architecture. • Large data volumes generated at edges. • Filtering and cleaning must be done there. • Successive aggregation as you move inwards. • Summaries/anomalies continually, details later. • Strong temporal focus. • Strong spatial/geographic focus. • Streaming data and stored data. • Integration within and across enterprises.

  6. Design Space: Time • Time scale runs from seconds to years • Filtering, cleaning, alerts: on-the-fly stream processing at the seconds end • Monitoring, time-series, data mining: recent history • Archiving (provenance and schema evolution): disk-based processing over years

  7. Design Space: Geography • Geographic scope runs from local to global • Filtering, cleaning, alerts: several readers (local) • Monitoring, time-series, data mining: regional centers • Archiving (provenance and schema evolution): central office (global)

  8. Design Space: Resources • Individual resources run from tiny to huge • Filtering, cleaning, alerts: devices • Monitoring, time-series, data mining: stargates/desktops • Archiving (provenance and schema evolution): clusters/grids

  9. Design Space: Data • Aggregate data volume falls as degree of detail falls • Filtering, cleaning, alerts, dup elim: history of hours • Monitoring, interesting events, data mining: history of days • Trends/archive (provenance and schema evolution): history of years

  10. State of the Art • Current approaches: hand-coded, script-based • Expensive, one-off, brittle, hard to deploy and keep running • Piecemeal/stovepipe systems • Each type of receptor (RFID, sensors, etc.) handled separately • Standards efforts are not addressing this: • Protocol-design bent • Different “data models” at each level • Reinventing “query languages” at each level ⇒ No end-to-end, integrated middleware for managing distributed receptor data

  11. HiFi • A data management infrastructure for high fan-in environments • Uniform, Declarative Framework • Every node is a data stream processor that speaks SQL-ese ⇒ stream-oriented queries at all levels • Hierarchical, stream-based views as an organizing principle

  12. Why Declarative? (database dogma) • Independence: data, location, platform • Allows the system to adapt over time • Many optimization opportunities • In a complex system, automatic optimization is key. • Also, optimization across multiple applications. • Simplifies Programming • ???

  13. Building HiFi

  14. Integrating RFID & Sensors (the “loudmouth” query)

  15. A Tale of Two Systems • TinyDB • Declarative query processing for wireless sensor networks • In-network aggregation • Released as part of TinyOS Open Source Distribution • TelegraphCQ • Data stream processor • Continuous, adaptive query processing with aggressive sharing • Built by modifying PostgreSQL • Open source “beta” release out now; new release soon

  16. TinyDB • The Network is the Database: • Basic idea: treat the sensor net as a “virtual table”. • System hides details/complexities of devices, changing topologies, failures, … • System is responsible for efficient execution. • Developed on TinyOS/Motes • Example (the app poses the query; query, trigger, and data flow through the TinyDB sensor network):
      SELECT MAX(mag)
      FROM sensors
      WHERE mag > thresh
      SAMPLE PERIOD 64ms
      http://telegraph.cs.berkeley.edu/tinydb
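A hedged aside, not from the talk: slide 15 mentions in-network aggregation, so a grouped aggregate is a natural companion example. The attributes light and roomNo are illustrative placeholders.

      -- average light per room, one result set per 512ms epoch;
      -- partial aggregates can be combined inside the network as readings flow up
      SELECT roomNo, AVG(light)
      FROM sensors
      GROUP BY roomNo
      SAMPLE PERIOD 512ms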

  17. TelegraphCQ: Data Stream Monitoring • Streaming Data • Network monitors • Sensor Networks, RFID • News feeds, Stock tickers, … • B2B and Enterprise apps • Trade Reconciliation, Order Processing etc. • (Quasi) real-time flow of events and data • Manage these flows to drive business processes. • Can mine flows to create and adjust business rules. • Can also “tap into” flows for on-line analysis. http://telegraph.cs.berkeley.edu

  18. Data Stream Processing • Traditional database: data is loaded and stored; queries arrive, run over the stored data, and return result tuples. • Data stream processor: queries are registered once; data streams flow past them and result tuples are emitted continuously. • Data streams are unending • Continuous, long running queries • Real-time processing

  19. Windowed Queries • A typical streaming query with a window clause:
      SELECT S.city, AVG(temp)
      FROM SOME_STREAM S [range by '5 seconds' slide by '5 seconds']
      WHERE S.state = 'California'
      GROUP BY S.city
      • range: “I want to look at 5 seconds worth of data” • slide: “I want a result tuple every 5 seconds” • The window advances along the data stream, emitting result tuple(s) at each slide
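A hedged aside, not from the talk: when the range is larger than the slide, consecutive windows overlap and each reading contributes to several results. The stream and columns reuse the slide's example.

      -- 30 seconds of data, re-evaluated every 5 seconds (overlapping windows)
      SELECT S.city, AVG(temp)
      FROM SOME_STREAM S [range by '30 seconds' slide by '5 seconds']
      WHERE S.state = 'California'
      GROUP BY S.city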

  20. TelegraphCQ Architecture • TelegraphCQ Front End: listener, parser, planner, catalog, mini-executor, proxy • TelegraphCQ Back End: CQEddy with scan, split, and other modules • TelegraphCQ Wrapper ClearingHouse: wrappers for external data sources • Shared memory: buffer pool, query plan queue, eddy control queue, and query result queues connecting the pieces • Disk for stored data

  21. The HiFi System • TelegraphCQ, TinyDB, and RFID wrappers running across PCs, stargates, sensor networks & RFID readers

  22. Basic HiFi Architecture • Hierarchical federation of nodes • Each node: • Data Stream Query Processor (DSQP) • HiFi Glue: DSQP management, query planning, archiving, internode coordination and communication • Views drive system functionality • Metadata Repository (MDR)

  23. HiFi Processing Pipelines: the CSAVA Framework • Clean (single tuple) • Smooth (window) • Arbitrate (multiple receptors) • Validate (join w/ stored data) • Analyze (on-line data mining) • CSAVA Generalization

  24. CSAVA Processing: Clean
      CREATE VIEW cleaned_rfid_stream AS
       (SELECT receptor_id, tag_id
        FROM rfid_stream rs
        WHERE read_strength >= strength_T)

  25. CSAVA Processing: Smooth
      CREATE VIEW smoothed_rfid_stream AS
       (SELECT receptor_id, tag_id
        FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
        GROUP BY receptor_id, tag_id
        HAVING count(*) >= count_T)

  26. CSAVA Processing: Arbitrate (assign each tag to the receptor that read it most often in the window)
      CREATE VIEW arbitrated_rfid_stream AS
       (SELECT receptor_id, tag_id
        FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
        GROUP BY receptor_id, tag_id
        HAVING count(*) >= ALL
          (SELECT count(*)
           FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
           WHERE tag_id = rs.tag_id
           GROUP BY receptor_id))

  27. CSAVA Processing: Validate
      CREATE VIEW validated_tags AS
       (SELECT tag_name
        FROM arbitrated_rfid_stream rs [range by '5 sec', slide by '5 sec'],
             known_tag_list tl
        WHERE tl.tag_id = rs.tag_id)

  28. CSAVA Processing: Analyze
      CREATE VIEW tag_count AS
       (SELECT tag_name, count(*)
        FROM validated_tags vt [range by '5 min', slide by '1 min']
        GROUP BY tag_name)
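A hedged aside showing how the pipeline output can drive a simple continuous alert, written in the same style as the views above; alert_T is a hypothetical threshold, following the strength_T and count_T convention from the earlier slides.

      -- flag tags read unusually often over the last 5 minutes
      SELECT tag_name, count(*)
      FROM validated_tags vt [range by '5 min', slide by '1 min']
      GROUP BY tag_name
      HAVING count(*) >= alert_T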

  29. Ongoing Work • Bridging the physical-digital divide • VICE – A “Virtual Device” Interface • Hierarchical query processing • Automatic Query planning & dissemination • Complex event processing • Unifying event and data processing

  30. “Metaphysical* Data Independence” Virtual Device (VICE) Layer *The branch of philosophy that deals with the ultimate nature of reality and existence. (name due to Shawn Jeffery)

  31. The Virtues of VICE • A simple RFID Experiment • 2 Adjacent Shelves, 8 ft each • 10 EPC-tagged items each, plus 5 moved between them. • RFID antenna on each shelf.

  32. Ground Truth

  33. Raw RFID Readings

  34. After VICE Processing Under the covers (in this case): Cleaning, Smoothing, and Arbitration

  35. Other VICE Uses • Once you have the right abstractions: • “Soft Sensors” • Quality and lineage streams • Pushdown of external validation information • Power management and other optimizations • Data Archiving • Model-based sensing • “Non-declarative” code • …

  36. Hierarchical Query Processing • Continuous and Streaming • Automatic placement and optimization • Hierarchical • Temporal granularity vs. geographic scope • Sharing of lower-level streams • From center to edge: “I provide national monthly values for the US” / “I provide avg weekly values for California” / “I provide avg daily values for Berkeley” / “I provide raw readings for Soda Hall” (a sketch of such a view hierarchy follows)
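A hedged sketch of how such a hierarchy could be written as stream-based views in the slides' dialect; the stream and column names (soda_hall_readings, site, temp) are illustrative, not from the talk.

      -- edge node: avg daily values for Berkeley, over raw Soda Hall readings
      CREATE VIEW berkeley_daily AS
       (SELECT site, avg(temp) AS avg_temp
        FROM soda_hall_readings [range by '1 day', slide by '1 day']
        GROUP BY site)

      -- regional node: avg weekly values for California, built on the daily views
      -- (a real deployment would combine daily views from many sites, not just one)
      CREATE VIEW california_weekly AS
       (SELECT site, avg(avg_temp) AS avg_temp
        FROM berkeley_daily [range by '7 days', slide by '7 days']
        GROUP BY site)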

  37. Complex Event Processing • Needed for monitoring and actuation • Key to prioritization (e.g., of detail data) • Exploit duality of data and events • Shared Processing • “Semantic Windows” • Challenge: a single system that simultaneously handles events spanning seconds to years.
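A hedged aside on the event/data duality: one reading is that a composite event can be phrased as a windowed join over the underlying data streams. The streams dock_reads and shelf_reads below are hypothetical, not from the talk.

      -- event: the same tag seen at a dock door and on a shelf within a 5-minute window
      SELECT d.tag_id
      FROM dock_reads d [range by '5 min', slide by '1 min'],
           shelf_reads s [range by '5 min', slide by '1 min']
      WHERE d.tag_id = s.tag_id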

  38. Next Steps • Archiving and Detail Data • Dealing with transient overloads • Rate matching between stored and streaming data • Scheduling large archive transfers • System design & deployment • Tools for provisioning and evaluating receptor networks • System monitoring & management • Leverage monitoring infrastructure for introspection

  39. Conclusions • Receptors everywhere ⇒ High Fan-In Systems • Current middleware solutions are complex & brittle • Uniform declarative framework is the key • The HiFi project is exploring this approach • Our initial prototype • Leveraged TelegraphCQ and TinyDB • Demonstrated RFID/multiple sensor integration • Validated the HiFi approach • We have an ambitious on-going research agenda • See http://hifi.cs.berkeley.edu for more info.

  40. Acknowledgements • Team HiFi: Shawn Jeffery, Sailesh Krishnamurthy, Frederick Reiss, Shariq Rizvi, Eugene Wu, Nathan Burkhart, Owen Cooper, Anil Edakkunni • Experts in VICE: Gustavo Alonso, Wei Hong, Jennifer Widom • Funding and/or Reduced-Price Gizmos from NSF, Intel, UC MICRO program, and Alien Technologies
