Chapter 10: Stream-based Data Management • Title: Retrospective on Aurora • Authors: Hari Balakrishnan, et al.
Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core • Problem • Problem statement • Why is this problem important? • Why is this problem hard? • Approaches • Approach description, key concepts • Contributions (novelty, improvements) • Assumptions
Problem Statement • Given • Stream data • Experience from the development of five stream-based applications using the Aurora stream-processing engine • Find: • Key requirements of streaming applications • Objectives • Reflect on the design of Aurora based on this experience • Address the limitations and new challenges in a follow-on project, Borealis • Constraints • Data streams arrive in no particular order. • Data streams arrive without any temporal regularity.
Why is this problem important? • Stream-processing applications • Financial Services – stock ticker • Transportation – congestion pricing, dynamic tolls • Sensor Networks – Environment monitoring • Defense – Battalion monitoring
Why is this problem hard? • High update rates • Time series • Streaming applications entail time series. • Time-series operations are not well supported by current DBMSs. • Real-time constraints • Outbound processing, where data are stored before being processed, cannot deliver real-time latency. • SPEs must adopt inbound processing, where query processing is performed directly on incoming messages. • Spikes in message load • Incoming traffic is bursty. • Quality of Service (QoS) requirements
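To make the inbound-vs-outbound distinction concrete, here is a minimal sketch (not Aurora code; the function and field names such as process_inbound and alert are illustrative assumptions) contrasting store-then-query processing with evaluating the query directly on arriving messages.

```python
# Minimal sketch (not Aurora code): contrasting outbound and inbound processing.
# Names (process_outbound, process_inbound, alert) are illustrative assumptions.

import time

def alert(msg, value):
    print(f"ALERT: {msg} value={value}")

def process_outbound(messages, table):
    """Outbound: store first, query later -- latency includes the storage step."""
    table.extend(messages)                      # write to storage
    for row in table:                           # later, a query scans the table
        if row["value"] > 100:
            alert("threshold exceeded (after store)", row["value"])

def process_inbound(messages):
    """Inbound: evaluate the query directly on each arriving message."""
    for msg in messages:
        if msg["value"] > 100:                  # query logic runs on arrival
            alert("threshold exceeded (on arrival)", msg["value"])

if __name__ == "__main__":
    stream = [{"ts": time.time(), "value": v} for v in (42, 150, 7)]
    process_inbound(stream)
    process_outbound(stream, table=[])
```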
Novel Contributions • Comparison with SQL-centric related work: • Data Flow Network (DFN) centric • Developer – composes the DFN using a graphical user interface • Optimizer – rearranges the DFN, e.g., swaps boxes • Compiler – translates the DFN to an intermediate representation • Run-time – schedules tasks based on QoS requirements • Other contributions – lessons learnt • Identify characteristics of streaming applications from five case studies • Identify core performance-tuning ideas
Aurora Architecture • Aurora is based on a dataflow-style ‘boxes and arrows’ paradigm, unlike systems built around an SQL-style query interface (where shuttling queries and results back and forth adds system overhead and latency). • Can be spread across any number of machines for scalability and availability. • [Figure: Aurora operators composed between input and output streams in the Aurora GUI]
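The following is a minimal sketch, not Aurora's actual API, of the ‘boxes and arrows’ idea: each box transforms the tuple stream on its input arc, and arrows are simply the composition of boxes. The names filter_box and map_box are assumptions for illustration.

```python
# Minimal sketch (assumption, not Aurora's API): a 'boxes and arrows' pipeline
# where each box is a generator transforming the tuple stream on its input arc.

def filter_box(stream, predicate):
    """Box that passes through only tuples satisfying the predicate."""
    for tup in stream:
        if predicate(tup):
            yield tup

def map_box(stream, fn):
    """Box that applies a user-defined function to each tuple."""
    for tup in stream:
        yield fn(tup)

if __name__ == "__main__":
    quotes = [{"symbol": "IBM", "price": 101.5}, {"symbol": "XYZ", "price": 3.2}]
    # Arrows are just the composition of boxes: input -> filter -> map -> output.
    pipeline = map_box(filter_box(iter(quotes), lambda t: t["price"] > 10),
                       lambda t: {**t, "price_cents": int(t["price"] * 100)})
    for out in pipeline:
        print(out)
```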
Aurora Case Study 1: Financial Services • An application detects feed problems and triggers a switch between feeds in real time. • Hierarchical alarms • A low alarm is triggered when an update is delayed beyond a threshold (e.g., 5 sec). • A high alarm is triggered when low alarms accumulate beyond a threshold (e.g., 100 times). • Filter and merge boxes (circled in red in the query diagram) separate the alarms from the Reuters and Comstock feeds into NYSE alarms and NASDAQ alarms. • This case study illustrates the ability to detect stream imperfections and to extend functionality using user-defined Map functions.
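A rough sketch of the hierarchical alarm logic described above; the FeedMonitor class, its method names, and the demo thresholds are illustrative assumptions, not code from the paper.

```python
# Minimal sketch (illustrative assumptions: class name, thresholds, and demo
# values are not from the paper): hierarchical low/high alarms on a late feed.

class FeedMonitor:
    def __init__(self, delay_threshold=5.0, low_alarm_limit=100):
        self.delay_threshold = delay_threshold    # seconds before a low alarm
        self.low_alarm_limit = low_alarm_limit    # low alarms before a high alarm
        self.last_update = {}                     # symbol -> last update time
        self.low_alarms = 0

    def on_update(self, symbol, now):
        last = self.last_update.get(symbol)
        self.last_update[symbol] = now
        if last is not None and now - last > self.delay_threshold:
            self.low_alarms += 1                  # low alarm: one late update
            if self.low_alarms >= self.low_alarm_limit:
                self.low_alarms = 0
                return "HIGH_ALARM"               # high alarm: too many low alarms
            return "LOW_ALARM"
        return None

if __name__ == "__main__":
    mon = FeedMonitor(delay_threshold=5.0, low_alarm_limit=3)
    times = [0.0, 1.0, 8.0, 15.0, 22.0]           # 7-second gaps trigger low alarms
    for t in times:
        result = mon.on_update("IBM", now=t)
        if result:
            print(t, result)
```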
Aurora Case Study 2: Linear Road Benchmark • Linear Road is a benchmark for stream-processing engines. • It simulates an urban highway system that uses ‘variable tolling’ (i.e., congestion-based pricing). • Linear Road requires support for • Two continuous queries • Calculate a segment toll every time a vehicle enters a segment. • Detect and report accidents and adjust tolls accordingly. • Three historical queries • Request an account balance • Report the day’s total expenditure for a given vehicle • Predict travel time between two segments using historical data • Each of these queries must be answered with a specified accuracy and within a specified response time.
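A minimal sketch of the continuous toll query, assuming an illustrative congestion-based toll formula and field names (the benchmark's exact specification differs).

```python
# Minimal sketch of the continuous toll query (assumptions: the toll formula and
# field names are illustrative, not the benchmark's exact specification).

from collections import defaultdict

def toll_stream(position_reports, base_toll=2.0):
    """Emit a toll each time a vehicle enters a new segment, scaled by how many
    vehicles are currently reported in that segment (congestion-based pricing)."""
    current_segment = {}                          # vehicle -> segment
    vehicles_in_segment = defaultdict(set)        # segment -> set of vehicles
    for report in position_reports:
        vid, seg = report["vehicle"], report["segment"]
        if current_segment.get(vid) != seg:       # vehicle entered a new segment
            old = current_segment.get(vid)
            if old is not None:
                vehicles_in_segment[old].discard(vid)
            current_segment[vid] = seg
            vehicles_in_segment[seg].add(vid)
            congestion = len(vehicles_in_segment[seg])
            yield {"vehicle": vid, "segment": seg, "toll": base_toll * congestion}

if __name__ == "__main__":
    reports = [{"vehicle": 1, "segment": 10}, {"vehicle": 2, "segment": 10},
               {"vehicle": 1, "segment": 11}]
    for toll in toll_stream(reports):
        print(toll)
```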
Aurora Case Study 3: Battalion Monitoring • Aircraft gather data and send them to monitoring stations. • Enemy units crossing a given line signal an attack. • The limited resource is the bandwidth between aircraft and ground; when an attack is initiated, selective dropping of data is allowed so that important classes of data are still served. • The authors used this application to test their load-shedding techniques: • Insert random drop boxes that discard a fraction of their input tuples. • Insert semantic, predicate-based drop filters. • Observations • The semantic load-shedding technique achieves the least utility loss. • As load increases, the two techniques show similar performance. • At high loads, all algorithms converge to the same loss levels.
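The two load-shedding strategies can be sketched as follows; the function names, drop rate, and priority field are assumptions used only for illustration.

```python
# Minimal sketch of the two load-shedding strategies (drop rate, priority field,
# and function names are illustrative assumptions).

import random

def random_drop(stream, drop_fraction):
    """Random drop box: discard a fixed fraction of input tuples at random."""
    for tup in stream:
        if random.random() >= drop_fraction:
            yield tup

def semantic_drop(stream, predicate):
    """Semantic drop filter: keep only tuples the predicate marks as important."""
    for tup in stream:
        if predicate(tup):
            yield tup

if __name__ == "__main__":
    readings = [{"unit": u, "priority": p} for u, p in
                [("alpha", 1), ("bravo", 3), ("charlie", 2), ("delta", 3)]]
    print(list(random_drop(iter(readings), drop_fraction=0.5)))
    print(list(semantic_drop(iter(readings), predicate=lambda t: t["priority"] >= 3)))
```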
Aurora Case Study 4: Environmental Monitoring • Monitors toxins in water. • Stream data consist of fish behavior (e.g., breathing rate) and water quality (e.g., temperature). • When the fish behave abnormally, an alarm is sounded. • The water-quality computations use 1-, 2-, and 4-hour sliding windows. • Ease of developing stream applications • Aurora proved very convenient for the sliding-window calculations. • Aurora’s GUI proved invaluable.
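A minimal sketch of the sliding-window calculation: the 1-, 2-, and 4-hour window sizes follow the slide, while the data structure and function names are assumptions.

```python
# Minimal sketch of the sliding-window calculation (window sizes follow the
# slide; the data structure and function names are illustrative assumptions).

from collections import deque

def sliding_averages(readings, window_hours=(1, 2, 4)):
    """For each incoming (timestamp_hours, value) reading, report the average
    over the trailing 1-, 2-, and 4-hour windows."""
    windows = {h: deque() for h in window_hours}         # hours -> (ts, value)
    for ts, value in readings:
        result = {"ts": ts}
        for hours, window in windows.items():
            window.append((ts, value))
            while window and ts - window[0][0] > hours:  # evict expired readings
                window.popleft()
            result[f"avg_{hours}h"] = sum(v for _, v in window) / len(window)
        yield result

if __name__ == "__main__":
    temps = [(0.0, 20.0), (0.5, 21.0), (1.5, 25.0), (3.0, 19.0)]
    for row in sliding_averages(temps):
        print(row)
```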
Aurora Case Study 5: Medusa • Medusa is a distributed stream-processing system built on Aurora. • It takes Aurora queries and distributes them across multiple nodes. • It offers several benefits: • Incremental scalability over multiple nodes. • High availability through mutual monitoring between nodes. • Composition of stream feeds from different participants. • Handling of load spikes through federation.
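A rough sketch, under stated assumptions, of the distribution idea: a chain of query boxes is cut at an arc and each node runs its share of the chain. This is not Medusa's actual mechanism, only an illustration of partitioning a query graph.

```python
# Minimal sketch (assumption, not Medusa's actual mechanism): partitioning a
# chain of query boxes across nodes by cutting the chain at a chosen arc.

def make_filter(predicate):
    return lambda stream: (t for t in stream if predicate(t))

def make_map(fn):
    return lambda stream: (fn(t) for t in stream)

def run_partition(boxes, stream):
    """Run one node's share of the query chain over its incoming stream."""
    for box in boxes:
        stream = box(stream)
    return stream

if __name__ == "__main__":
    # Full query: filter -> map, cut so that each node runs one box.
    boxes = [make_filter(lambda t: t["value"] > 10),
             make_map(lambda t: {**t, "scaled": t["value"] * 2})]
    node_a, node_b = boxes[:1], boxes[1:]
    source = iter([{"value": 5}, {"value": 42}])
    intermediate = run_partition(node_a, source)   # would be shipped to node B
    print(list(run_partition(node_b, intermediate)))
```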
Lessons Learnt: Application Characteristics • Common queries • Historical data using open windows • The last 10 weeks’ worth of toll data for each driver • Aggregate – how much has a driver spent on tolls over the past 10 weeks? • Tables of historical data with arbitrary update patterns • Synchronization • Stream applications rely on shared data and computation. • WaitFor (P: Predicate, T: Timeout) • Unpredictable stream behavior • The financial services application detects the arrival rate of a stream. • The military application adjusts resources during times of stress.
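One plausible reading of the WaitFor(P: Predicate, T: Timeout) primitive, sketched below with an assumed buffering policy and names (not Aurora's specification): each tuple is held until the predicate becomes true against shared state, or released as timed out.

```python
# Minimal sketch of one plausible reading of WaitFor(P: Predicate, T: Timeout):
# hold each incoming request until the predicate holds, or release it as timed
# out. (The buffering policy and names are assumptions, not Aurora's spec.)

def wait_for(requests, predicate, timeout):
    """requests: iterable of (time, tuple). Release a tuple once predicate(tuple)
    holds; otherwise emit it as timed out after `timeout` time units."""
    pending = []
    for now, tup in requests:
        pending.append((now, tup))
        remaining = []
        for arrived, waiting in pending:
            if predicate(waiting):
                yield ("released", waiting)
            elif now - arrived >= timeout:
                yield ("timed_out", waiting)
            else:
                remaining.append((arrived, waiting))
        pending = remaining

if __name__ == "__main__":
    balances_posted = {"car42"}                    # shared state another query updates
    requests = [(0, {"vehicle": "car42"}), (1, {"vehicle": "car99"}),
                (20, {"vehicle": "car7"})]
    for event in wait_for(iter(requests),
                          predicate=lambda t: t["vehicle"] in balances_posted,
                          timeout=10):
        print(event)
```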
Lessons Learnt: Performance Tuning • Requirements • Main-memory implementation • Data movement across DFN elements • Scheduling of DFN elements • Performance decisions • Memory copying – careful use of memcpy() implementations • Scheduler – reduce scheduler overhead through aggressive profiling • Tight loops – keep unnecessary housekeeping out of tight loops • Data structures – optimize the data structures used to implement DFN elements
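In the spirit of the scheduling lesson, the sketch below illustrates batching tuples into ‘trains’ so that per-invocation overhead is amortized across many tuples; the function names and timing harness are assumptions for illustration.

```python
# Minimal sketch of the scheduling lesson: batching tuples into 'trains' to
# amortize per-invocation overhead (names and timing harness are assumptions).

import time

def per_tuple_schedule(tuples, box):
    """Invoke the box once per tuple: scheduling overhead paid on every tuple."""
    out = []
    for tup in tuples:
        out.extend(box([tup]))
    return out

def train_schedule(tuples, box, train_size=1000):
    """Invoke the box once per batch ('train'): overhead amortized across tuples."""
    out = []
    for i in range(0, len(tuples), train_size):
        out.extend(box(tuples[i:i + train_size]))
    return out

if __name__ == "__main__":
    data = list(range(100_000))

    def box(batch):
        return [x * 2 for x in batch]

    for name, fn in (("per-tuple", per_tuple_schedule), ("train", train_schedule)):
        start = time.perf_counter()
        fn(data, box)
        print(f"{name}: {time.perf_counter() - start:.3f} s")
```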
Future Plans: Borealis • Dynamic revision of query results • Intelligently corrects query results that have already been emitted when corrected data arrive later. • Dynamic query modification • E.g., traders wish to be alerted of interesting events, where the definition of ‘interesting’ varies over time. • Distributed optimization • Server-heavy and sensor-heavy optimization problems are emerging. • More flexible optimization to handle a very large number of devices • Implementation plans
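A minimal sketch of the dynamic-revision idea: a revision tuple overwrites an earlier value and the downstream aggregate is re-emitted with the corrected total. The tuple format and naming are assumptions, not Borealis's design.

```python
# Minimal sketch of dynamic result revision (tuple format and naming are
# assumptions): a revision tuple replaces an earlier value and the downstream
# aggregate is re-emitted with the corrected total.

def running_sum_with_revisions(stream):
    """Stream of ('insert', key, value) or ('revise', key, new_value) tuples;
    emit the corrected running total after each event."""
    values = {}
    for kind, key, value in stream:
        values[key] = value                       # a revision overwrites the insert
        yield (kind, key, sum(values.values()))

if __name__ == "__main__":
    events = [("insert", "t1", 10), ("insert", "t2", 5),
              ("revise", "t1", 12)]               # late correction to t1
    for out in running_sum_with_revisions(events):
        print(out)
```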
Summary • Paper’s focus • Identify the requirements of stream applications based on experience from the design and implementation of the Aurora stream-processing engine • Ideas • Describe five applications and their implementation in detail. • Reflect on the design of Aurora based on this experience. • Discuss future ideas for the follow-on project, Borealis. • Contributions • Identify key requirements of streaming applications • Validation • Case studies
Assumptions, Rewrite Today • Assumptions • Archiving is not necessary! • Performance is more important than a declarative query language • Rewrite today • Compare performance with the competition, e.g., STREAM • Allow archiving along with stream processing • Consider other applications • RFID, cell-phone applications • Include the current status of the Borealis implementation.