Streaming Queries over Streaming Data Sirish Chandrasekaran (UC Berkeley) Michael J. Franklin (UC Berkeley) Presented by Andy Williamson
About Me • 3rd Year ISYE major • Minor in Computer Science • From Austin, TX • Have visited every state but Alaska • Intern at Deloitte Consulting focusing on SAP implementation
Agenda • Background/Motivation • PSoup • Introduction • System Overview • Query Processing Techniques • Implementation • Performance • Aggregation Queries • Conclusions • Critique
Background/Motivation • Continuous Query (CQ) Systems • Treat queries as fixed entities and stream data over them • Previous systems allowed streaming of either data or queries, but not both • Previous systems continuously deliver results as they are computed, which can be infeasible or inefficient • Motivating applications: • Data Recharging • Monitoring
PSoup: Introduction • Query processor based on Telegraph query processing framework • Allows both data and queries to be streamed • Partially stores results to support disconnected operation and improve data throughput and response time
PSoup: System Overview • User initially registers a query specification with the system • System returns a handle that can later be used to invoke the query and retrieve its results • Example Query: SELECT * FROM Data_Stream D_s WHERE (D_s.a < x AND D_s.b > y) BEGIN(NOW - 10) END(NOW); • BEGIN-END clause allows: • Snapshot (constant beginning and ending time) • Landmark (constant beginning and variable ending time) • Sliding window (variable beginning and ending time) • Window size is limited by the size of memory
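The three window types can be told apart by whether each bound is fixed or relative to NOW. The following is a minimal sketch of that idea, not PSoup's implementation: the `Bound` class, `window_kind`, and `apply_window` are hypothetical names, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """One end of a BEGIN-END clause (hypothetical representation)."""
    offset: int             # absolute time, or offset added to NOW
    relative: bool = False  # True means the bound is NOW + offset

    def resolve(self, now: int) -> int:
        return now + self.offset if self.relative else self.offset


def window_kind(begin: Bound, end: Bound) -> str:
    """Classify a window using the three cases listed on the slide."""
    if not begin.relative and not end.relative:
        return "snapshot"
    if not begin.relative and end.relative:
        return "landmark"
    if begin.relative and end.relative:
        return "sliding"
    return "unsupported"  # variable begin with constant end is not listed


def apply_window(timestamps, begin: Bound, end: Bound, now: int):
    """Keep timestamps inside the window (inclusive bounds are an assumption)."""
    lo, hi = begin.resolve(now), end.resolve(now)
    return [t for t in timestamps if lo <= t <= hi]
```

With this representation, the example query's clause BEGIN(NOW - 10) END(NOW) becomes `Bound(-10, True)` and `Bound(0, True)`, which `window_kind` classifies as a sliding window.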
PSoup: System Overview • PSoup treats execution of query streams as a join of query and data streams • Maintains State Modules (SteMs) for queries and data • One query SteM for all queries in the system, and one data SteM for each data stream
PSoup: Query Processing Techniques • Overview • PSoup assigns unique queryID that it returns to the user • Client can disconnect, reconnect and execute query to obtain updated results • PSoup continuously matches data to query predicates in background and stores the results in its Results Structure • When a query is invoked, PSoup applies the appropriate input window to the Results Structure to return the current results
PSoup: Query Processing Techniques • Entry of new Query specs • New queries split into two parts: • Standing Query Clause (SQC): consists of the SELECT-FROM-WHERE clauses • BEGIN-END clause, stored in separate WindowsTable structure • SQC inserted into Query SteM • Used to probe Data SteMs corresponding to tables in FROM clause • Resulting tuples stored in Results Structure
PSoup: Query Processing Techniques • Entry of new data • New tuples assigned globally unique tupleID and physical timestamp (physicalID) based on system clock • Inserted into appropriate Data SteM • Then used to probe Query SteM to determine which SQCs it satisfies • TupleIDs and physicalIDs stored in Results Structure
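The symmetric treatment of new queries and new data can be sketched as follows. This is an illustration under simplifying assumptions, not PSoup's code: the class and method names are hypothetical, predicates are plain Python callables, both SteMs are unindexed dictionaries, and the physicalID is passed in by the caller rather than read from the system clock.

```python
import itertools


class SymmetricJoinSketch:
    """Toy model: queries probe stored data, and data probes stored queries."""

    def __init__(self):
        self.query_stem = {}  # queryID -> predicate (stands in for the SQC)
        self.data_stem = {}   # tupleID -> (physicalID, row)
        self.results = {}     # queryID -> list of (tupleID, physicalID)
        self._query_ids = itertools.count(1)
        self._tuple_ids = itertools.count(1)

    def register_query(self, predicate):
        """Insert the SQC into the Query SteM, then probe the Data SteM."""
        query_id = next(self._query_ids)
        self.query_stem[query_id] = predicate
        self.results[query_id] = [
            (tuple_id, ts)
            for tuple_id, (ts, row) in self.data_stem.items()
            if predicate(row)
        ]
        return query_id

    def insert_tuple(self, row, physical_id):
        """Insert the tuple into the Data SteM, then probe the Query SteM."""
        tuple_id = next(self._tuple_ids)
        self.data_stem[tuple_id] = (physical_id, row)
        for query_id, predicate in self.query_stem.items():
            if predicate(row):
                self.results[query_id].append((tuple_id, physical_id))
        return tuple_id
```

A query registered after some data has arrived still sees the earlier matching tuples, and later tuples are added to its results as they arrive.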
PSoup: Query Processing Techniques • Selection Queries over a single stream
PSoup: Query Processing Techniques • Join Queries Over Multiple Streams
PSoup: Query Processing Techniques • Query Invocation and Result Construction • Results Structure maintains info about which tuples in Data SteM(s) satisfy which SQCs in Query SteM • For each result tuple of each query, it stores tupleID and physicalID of all constituent base tuples of result tuple • Results of a query can be accessed by its queryID • Ordered by timestamp (physicalID)
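Because results are ordered by physicalID, applying a window at invocation time can be a pair of binary searches rather than a full scan. The sketch below assumes a hypothetical layout in which each query maps to a list of (physicalID, tupleID) pairs sorted by physicalID; the `invoke` function and inclusive bounds are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right


def invoke(results, query_id, begin, end):
    """Return tupleIDs whose physicalID lies in [begin, end].

    results: queryID -> list of (physicalID, tupleID), sorted by physicalID
    (a hypothetical layout chosen for this sketch).
    """
    matches = results[query_id]
    # (begin,) sorts before every (begin, tupleID), so this finds the first
    # entry whose physicalID is >= begin.
    lo = bisect_left(matches, (begin,))
    # (end, inf) sorts after every (end, tupleID), so entries at end are kept.
    hi = bisect_right(matches, (end, float("inf")))
    return [tuple_id for _, tuple_id in matches[lo:hi]]
```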
PSoup: Implementation • Eddy • Each tuple has a predicate attribute and an Interest List dictating where it is to be routed • Provides Stream Prefix Consistency by storing new and temporary tuples separately in the New Tuple Pool (NTP) and Temporary Tuple Pool (TTP) • Begins by selecting a tuple from the NTP, then processes everything in the TTP before picking another tuple from the NTP
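The NTP/TTP scheduling rule can be sketched as a loop that drains all temporary tuples before admitting the next new tuple. This is a simplified illustration: `run_eddy` and the `process` callback (which returns any temporary tuples derived from its input) are hypothetical, and routing via Interest Lists is omitted.

```python
from collections import deque


def run_eddy(new_tuples, process):
    """Process tuples so each new tuple's derived work finishes first.

    process(t) is a hypothetical callback returning the temporary tuples
    produced by handling t. Returns the order in which tuples were handled.
    """
    ntp = deque(new_tuples)  # New Tuple Pool
    ttp = deque()            # Temporary Tuple Pool
    trace = []
    while ntp:
        current = ntp.popleft()
        trace.append(current)
        ttp.extend(process(current))
        # Drain the TTP completely before taking another tuple from the NTP.
        while ttp:
            temp = ttp.popleft()
            trace.append(temp)
            ttp.extend(process(temp))
    return trace
```

In this sketch, all work derived from one new tuple completes before the next new tuple is admitted, which is the ordering the slide describes.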
PSoup: Implementation • Data SteM • Use tree-based index for data to provide efficient access to probing queries • One red-black tree for every attribute • Maintains hash-based index over tupleIDs for fast access
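A Data SteM with one ordered index per attribute plus a hash index over tupleIDs might look like the sketch below. Python's standard library has no red-black tree, so a sorted list maintained with `bisect` is swapped in as a stand-in for the ordered index; `DataStemSketch` and its methods are hypothetical names.

```python
from bisect import bisect_left, insort


class DataStemSketch:
    """Toy Data SteM: per-attribute sorted index plus a hash index by tupleID."""

    def __init__(self, attributes):
        self.by_id = {}  # hash index: tupleID -> row
        # attribute -> sorted list of (value, tupleID); stands in for a tree
        self.index = {attr: [] for attr in attributes}

    def insert(self, tuple_id, row):
        self.by_id[tuple_id] = row
        for attr, entries in self.index.items():
            insort(entries, (row[attr], tuple_id))

    def less_than(self, attr, bound):
        """Return tupleIDs with row[attr] < bound, in ascending value order."""
        entries = self.index[attr]
        cut = bisect_left(entries, (bound,))
        return [tuple_id for _, tuple_id in entries[:cut]]
```

A probing query with a predicate such as `a < x` can then find its matches with one range lookup instead of scanning every stored tuple.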
PSoup: Implementation • Query SteM • Allows sharing of work between queries that have overlapping FROM clauses • Use red-black trees to index single-attribute single-relation boolean factors of a query
PSoup: Implementation • Query SteM • For queries involving joins of multiple attributes, the tree structure doesn't work • Instead, a linked list called the predicateList is used • Query SteM contains an array in which each cell represents a query • At the beginning of a probe by a data tuple, each cell is set to the number of boolean factors in the corresponding query • Every time the tuple satisfies a boolean factor, the cell value is decremented • At the end of the probe, a cell value of 0 means the data tuple satisfies that query
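The counter-array technique can be sketched as follows. For brevity this illustration evaluates every boolean factor directly, whereas the slides describe finding satisfied factors through red-black trees and the predicateList; `matching_queries` and the representation of a query as a list of callables are assumptions.

```python
def matching_queries(queries, row):
    """Return indexes of queries whose boolean factors are all satisfied.

    queries[i] is a list of callables, one per boolean factor of query i
    (a hypothetical representation for this sketch).
    """
    # Each cell starts at the number of boolean factors in its query.
    remaining = [len(factors) for factors in queries]
    for query_index, factors in enumerate(queries):
        for factor in factors:
            if factor(row):
                remaining[query_index] -= 1  # one fewer factor left to satisfy
    # A cell that reached 0 means every factor of that query was satisfied.
    return [i for i, left in enumerate(remaining) if left == 0]
```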
PSoup: Implementation • Results Structure • Stores metadata indicating which tuples satisfy which SQCs • Can be implemented either with a bitmap or by associating with each query a linked list of the data tuples that satisfy it • Ordering by timestamp is simple for single-table queries • For join queries, the oldest timestamp among the constituent tuples is typically used
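The bitmap option can be sketched with one bit array per query, where bit i records whether the tuple at position i satisfies that query's SQC. The layout, the class `BitmapResultsSketch`, and the helper `join_timestamp` are hypothetical; reading "oldest" as the smallest physicalID is an assumption.

```python
class BitmapResultsSketch:
    """Toy bitmap: one Python int per query, bit i set if tuple i matches."""

    def __init__(self):
        self.bits = {}  # queryID -> int used as a growable bit array

    def mark(self, query_id, tuple_pos):
        self.bits[query_id] = self.bits.get(query_id, 0) | (1 << tuple_pos)

    def matches(self, query_id):
        """Return the positions of all set bits, lowest first."""
        row, pos, found = self.bits.get(query_id, 0), 0, []
        while row:
            if row & 1:
                found.append(pos)
            row >>= 1
            pos += 1
        return found


def join_timestamp(constituent_timestamps):
    # Assumption: "oldest" means the smallest physicalID among the base tuples.
    return min(constituent_timestamps)
```

The linked-list alternative trades this fixed per-tuple bit for a per-query list that grows only with the number of matches.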
PSoup: Performance • Implemented in Java with customized versions of Eddy and SteMs • Examined performance of two versions of PSoup against a baseline: • PSoup-Partial (PSoup-P): Maintains results corresponding to SQCs in the Results Structure, and applies BEGIN-END clauses to retrieve current results on query invocation • PSoup-Complete (PSoup-C): Continuously maintains results corresponding to the current input window for each query in linked lists • NoMat (baseline): A system that doesn't materialize results
PSoup: Performance • Storage Requirements • NoMat: Storage cost = space taken to store base data streams within maximum window over which queries are supported, plus size of structures • PSoup-P: Storage cost = storage cost of NoMat + size of Results Structure (either bitarray or linked-list) • PSoup-C: Storage cost >> storage cost of PSoup-P since C always stores current results of standing queries at a given time
PSoup: Performance • Experimental Setup • Varied window sizes (2^7 to 2^16) and number (1 to 8) and type of boolean factors • Measured response time and maximum supportable data arrival rate • Examined both P and C with and without predicate indexes • Tested scheme to remove redundancies arising from joins • Used synthetically generated query (2^7 to 2^12) and data streams
PSoup: Performance • Response Time vs. Window Size
PSoup: Performance • Response Time vs. # Interval Predicates
PSoup: Performance • Data Arrival Rate vs. # SQCs
PSoup: Performance • Summary of Results • Materializing results of queries supports higher query invocation rates • Indexing queries and lazily applying windows improves maximum data throughput • PSoup-C requires more memory • PSoup-C optimizes query invocation rate • PSoup-P optimizes data arrival rate
PSoup: Performance • Removing Redundancy in Join processing • Entry of a query specification or new data • Composite tuples in joins
PSoup: Aggregation Queries • PSoup can support aggregate functions • Only possible to share data structures across queries with identical SELECT-PROJECT-JOIN clause
PSoup: Conclusions • Treats data and query streams analogously • Can support queries that require access to data that arrived before and after the query • Materializes results to cut down on response time and to support disconnected operation • Enables data recharging and monitoring • Future work: • Write data streams to disk and execute queries over them • Transfer queries between disk and memory, allowing query execution to be scheduled • Confront resource constraints when dealing with infinite streams • Query browser for temporal data
Critique • Strengths • Very well written, easy to follow • Clear examples, excellent explanation of performance results • Strong method that reduces processing time as the number of interval predicates increases • Weaknesses • Lacking sufficient data on storage costs • Experiments tested only one multiple-relation boolean factor for joins, which is unrealistic • Didn't address whether the same (or a similar) query could be entered twice and accidentally given two IDs