NiagaraCQ: A Scalable Continuous Query System for Internet Databases

CS561 Presentation, Xiaoning Wang

Presentation Outline • Why NiagaraCQ? • What’s NiagaraCQ? • Details • Experiments

We need Continuous Queries • Allows users to obtain new results from a database without having to issue the same query repeatedly. • Especially useful in an environment like the Internet comprised of large amounts of frequently changing information. • Need to be able to support millions of queries due to the scale of the Internet.

What’s NiagaraCQ? • A distributed database system for querying distributed XML data sets using a query language like XML-QL. • It lets a very large number of users register continuous queries in a high-level query language such as XML-QL.

Anything else? Novelty: query optimization strategy

What’s the Approach? • Group!!! • Assumption

Advantages of Grouping • Group optimization has the following benefits: • Grouped queries can share computations. • Common execution plans of grouped queries can reside in memory, significantly saving on I/O costs compared to executing each query separately. • Grouping makes it possible to test the “firing” conditions of many continuous queries together, avoiding unnecessary invocations.

Can this traditional grouping technique be applied directly to continuous queries? NO!!!

Why not? • Previous group optimization efforts focused on finding an optimal plan for a small number of similar queries. • Not applicable to a continuous query system for the following reasons: • Computationally too expensive to handle a large number of queries. • Not designed for an environment like the web.

Uniqueness of NiagaraCQ Incremental group optimization approach

NiagaraCQ • Supports scalable continuous query processing over multiple, distributed XML files by deploying incremental group optimization.

Incremental group optimization • It’s a novel approach in which queries are grouped according to their signatures. • Groups are created for existing queries according to their signatures, which represent similar structures among the queries. • Groups allow the common parts of two or more queries to be shared. • Each individual query in a query group shares the results from the execution of the group plan.

Cont. When a new query arrives, existing groups are considered as possible optimization choices instead of re-grouping all the queries in the system. • New query is merged into existing groups whose signatures match that of the query. • Existing queries are not re-grouped

Query-split scheme • The incremental group optimization scheme employs a query-split scheme. • After the signature of a new query is matched, the sub-plan corresponding to the signature is replaced with a scan of the output file produced by the matching group. • The optimization process then continues with the remainder of the query tree in a bottom-up fashion until the entire query has been analyzed. • If no group “matches” a signature of the new query, a new query group for this signature is created in the system. • Thus, each continuous query is split into several smaller queries such that the inputs of each of these queries are monitored using the same techniques that are used for the inputs of user-defined continuous queries.
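The matching step above can be sketched in a few lines. This is a minimal illustration, not NiagaraCQ's implementation: the class names `Selection`, `Group`, and `GroupOptimizer`, the tuple form of the signature, and the second constant `"MSFT"` are all assumptions made for the example. It covers only a single selection predicate, so the bottom-up continuation over the rest of the query tree is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Selection:
    """A selection predicate of the form 'attribute op constant' on one file."""
    source: str
    attribute: str
    op: str
    constant: str

    def signature(self):
        # Hypothetical signature: keep the structure, drop the constant value.
        return (self.source, self.attribute, self.op)


@dataclass
class Group:
    signature: tuple
    # constant value -> names of the queries waiting for that value
    constant_table: dict = field(default_factory=dict)


class GroupOptimizer:
    def __init__(self):
        self.groups = {}  # signature -> Group

    def add_query(self, name, predicate):
        """Merge a new query into a matching group, or create a new group.

        Existing queries are never re-grouped. Returns True if a group was created.
        """
        sig = predicate.signature()
        group = self.groups.get(sig)
        created = group is None
        if created:
            group = Group(sig)
            self.groups[sig] = group
        group.constant_table.setdefault(predicate.constant, []).append(name)
        return created


optimizer = GroupOptimizer()
first = optimizer.add_query(
    "Q1", Selection("quotes.xml", "Quotes.Quote.Symbol", "=", "INTC"))
second = optimizer.add_query(
    "Q2", Selection("quotes.xml", "Quotes.Quote.Symbol", "=", "MSFT"))
```

Here the first query creates a group and the second one joins it, because both predicates differ only in their constant.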

Cont. • Advantages of the query-split scheme: • The main advantage is that it can be implemented using a general query engine with only minor modifications. • Another advantage is that the approach is very scalable.

Cont. • Since queries are continuously being added to and removed from groups, the quality of a group can deteriorate over time, reducing the overall performance of the system. • One or more groups may require “dynamic regrouping” to re-establish their effectiveness.

Types of Continuous Queries • Change-based continuous queries • Fired as soon as new relevant data becomes available. • E.g., in stock trading, day-traders would probably want to know the desired price information immediately. • Provide better response time, but waste system resources when instantaneous answers are not really required.

Types of Continuous Queries • Timer-based continuous queries • Executed only at time intervals specified by the submitting user. • E.g., in stock trading, long-term investors may be satisfied with being notified every hour. • Can be supported more efficiently, so query systems that support these continuous queries should be more scalable. • However, since users can specify various overlapping time intervals for their continuous queries, grouping timer-based queries is much more difficult than grouping purely change-based queries.

NiagaraCQ (cont.) • Considering only the changed portion of each updated XML file and not the entire file. • Monitor and detect data source changes using both push and pull models on heterogeneous sources. • Caching mechanism is used to obtain good performance with limited amounts of memory.

Using Expression Signature • Represent the same syntax structure, but possibly different constant values, in different queries. • Specific implementation of the signature concept.

Group • Group signature: Quotes.Quote.Symbol in quotes.xml = constant

Group • Group constant table, with columns Constant_value and Destination_buffer

Group plan • Previous query plans, one per query: • File Scan (Quotes.xml) → Select Symbol=“INTC” → Trigger Action I • File Scan (Quotes.xml) → Select Symbol=“INTC” → Trigger Action J

Group plan • New plan: File Scan (Quotes.xml) and File Scan (Constant Table) → Join on Symbol=Constant_value → Split → Trigger Action I, Trigger Action J, ……
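A minimal sketch of the shared group plan, assuming the scanned quotes are already parsed into dictionaries and the constant table is an in-memory mapping from Constant_value to destination buffers. The function name `run_group_plan`, the buffer names, the prices, and the symbols other than INTC are hypothetical.

```python
from collections import defaultdict


def run_group_plan(quotes, constant_table):
    """Join scanned tuples with the constant table on Symbol = Constant_value,
    then split the joined tuples into their destination buffers."""
    buffers = defaultdict(list)
    for quote in quotes:  # stands in for the File Scan of Quotes.xml
        # Join: look the tuple's Symbol up in the constant table.
        for destination in constant_table.get(quote["Symbol"], []):
            # Split: route the tuple to each matching destination buffer.
            buffers[destination].append(quote)
    return dict(buffers)


# Hypothetical constant table: Constant_value -> Destination_buffer(s)
constant_table = {"INTC": ["buffer_I"], "MSFT": ["buffer_J"]}
quotes = [
    {"Symbol": "INTC", "Price": 30.0},
    {"Symbol": "MSFT", "Price": 55.0},
    {"Symbol": "AAPL", "Price": 100.0},  # no registered query, so it is dropped
]
buffers = run_group_plan(quotes, constant_table)
```

The file is scanned once no matter how many queries are registered; adding a query only adds a row to the constant table.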

Buffer design • The destination buffer for the split operator is needed. • Pipelined scheme • Intermediate scheme

Pipelined scheme • Figure: Split feeds the operator buffers of the downstream operators (Operator buffer, Operator buffer, …….).

Disadvantages of the pipelined scheme • It doesn’t work for grouping timer-based CQ’s: it is difficult for a split operator to determine which tuples should be stored and for how long. • The query structure is a directed graph, not a tree, so the plan may be too complicated for a general XML-QL query engine to execute. • The combined plan may be very large and require resources beyond the limits of the system. • A large portion of the query plan may not need to be executed at each query invocation. • One query may block many other queries.

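Materialized Intermediate Files • Query Split with Materialized Intermediate Files • Figure: Split writes the intermediate files (File_i, …., File_j); each file is read by a File Scan that feeds a trigger action (labeled Trig.Act.J in the figure).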

Advantages for this design • Every query is scheduled independently. Only the necessary queries are executed. • Queries after the Split operator will be in standard, tree-structured query format and thus can be scheduled and executed by a general query engine. • Each query in the system is about the size of a common user query, so that it can be executed without consuming an unusual amount of system resources. • This approach handles intermediate files and original data files uniformly. • The potential bottleneck of pipelined approach is avoided.

General Selection Predicates • Format: “Attribute op Constant” • Attribute is a path expression without wildcards in it. • Op is one of “=”, “<”, “>”, because these forms dominate in selection queries.

Problem for range-query groups • One potential problem for range-query groups is that the intermediate files may contain a large number of duplicated tuples, because the range predicates of different queries might overlap.

Solution: Virtual Intermediate Files • Figure: Split writes only one real intermediate file; each File Scan reads a virtual intermediate file (VI File_i, ……) and feeds a trigger action (Trig.Act.J).
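One way to picture the idea is sketched below. The slides only say that there is one real intermediate file and several virtual ones. The rest is an assumption made for illustration: that each virtual file is an open range over a real file kept sorted on the range attribute, plus the class names and the sample values.

```python
import bisect


class RealIntermediateFile:
    """Single stored copy of the Split output, kept sorted on the range attribute."""

    def __init__(self, values):
        self.values = sorted(values)

    def scan(self, low, high):
        """Return the stored values v with low < v < high."""
        start = bisect.bisect_right(self.values, low)
        end = bisect.bisect_left(self.values, high)
        return self.values[start:end]


class VirtualIntermediateFile:
    """A range view over the real file; it stores no tuples of its own."""

    def __init__(self, real_file, low, high):
        self.real_file = real_file
        self.low = low
        self.high = high

    def scan(self):
        return self.real_file.scan(self.low, self.high)


real = RealIntermediateFile([12.0, 48.5, 30.0, 75.0])
vfile_i = VirtualIntermediateFile(real, 20.0, 60.0)   # query i: 20 < v < 60
vfile_j = VirtualIntermediateFile(real, 40.0, 100.0)  # query j: 40 < v < 100
```

The value 48.5 satisfies both overlapping ranges and is visible through both virtual files, but it is stored only once.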

Join Operators • Join operators are usually expensive, so sharing common join operations can significantly reduce the amount of computation.

Join first? Or selection first? • Each ordering has its own advantage and disadvantage. • Solution: choose the better one based on a cost model.

Timer-based Continuous Queries • Grouped in the same way as change-based queries, except that the time information needs to be recorded at installation time. • Challenges • It is hard to monitor the timer events of those queries. • Sharing the common computation becomes difficult due to the various time intervals. • A timer-based continuous query fires at specific times, but only if the corresponding input files have been modified.
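The firing rule can be sketched as a small scheduler. It assumes that a timer-based query should run at its scheduled time only when its input file has changed since the query last fired. The class `TimerQueryScheduler`, its method names, and the use of plain numbers as timestamps are hypothetical.

```python
import heapq


class TimerQueryScheduler:
    def __init__(self):
        self.events = []      # heap of (fire_time, query_name)
        self.interval = {}    # query_name -> interval (must be > 0)
        self.source = {}      # query_name -> input file name
        self.last_fired = {}  # query_name -> time of last firing (or install)

    def install(self, name, source_file, interval, now):
        # The time information is recorded at installation time.
        self.interval[name] = interval
        self.source[name] = source_file
        self.last_fired[name] = now
        heapq.heappush(self.events, (now + interval, name))

    def tick(self, now, last_modified):
        """Pop the timer events that are due; return the queries that should run.

        last_modified maps a file name to the time of its latest change.
        """
        fired = []
        while self.events and self.events[0][0] <= now:
            fire_time, name = heapq.heappop(self.events)
            changed_at = last_modified.get(self.source[name], float("-inf"))
            if changed_at > self.last_fired[name]:
                fired.append(name)
                self.last_fired[name] = now
            # Reschedule whether or not the query actually ran.
            heapq.heappush(self.events, (fire_time + self.interval[name], name))
        return fired


scheduler = TimerQueryScheduler()
scheduler.install("hourly_quotes", "quotes.xml", interval=3600, now=0)
first_tick = scheduler.tick(3600, {"quotes.xml": 1800})   # file changed at 1800
second_tick = scheduler.tick(7200, {"quotes.xml": 1800})  # no change since 3600
```

The query runs at the first tick because the file changed after installation, and is skipped at the second tick because nothing changed in between.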

Incremental Evaluation • Incremental evaluation allows queries to be invoked only on the changed data. • For each file, on which CQ’s are defined, NiagaraCQ keeps a “delta file” that contains recent changes. • Queries are run over the delta files whenever possible instead of their original files. • A time stamp is added to each tuple in the delta file. NiagaraCQ fetches only tuples that were added to the delta file since the query’s last firing time.
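A minimal sketch of the delta-file idea, assuming an in-memory list stands in for the delta file and plain numbers stand in for time stamps. The class `DeltaFile`, its method names, and the sample tuples are hypothetical.

```python
class DeltaFile:
    """Recent changes to one source file, each tagged with a time stamp."""

    def __init__(self):
        self.entries = []  # list of (timestamp, row)

    def append(self, timestamp, row):
        self.entries.append((timestamp, row))

    def since(self, last_fired):
        """Return only the rows added after the query's last firing time."""
        return [row for timestamp, row in self.entries if timestamp > last_fired]


delta = DeltaFile()
delta.append(10, {"Symbol": "INTC", "Price": 30.0})
delta.append(20, {"Symbol": "INTC", "Price": 31.0})
delta.append(30, {"Symbol": "INTC", "Price": 29.5})

# A query that last fired at time 10 sees only the two later changes.
new_rows = delta.since(10)
```

Each query keeps its own last firing time, so queries with different schedules can read the same delta file without interfering with each other.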

Memory Caching • Caching is used to obtain good performance with a limited amount of memory. • Caches query plans, system data structures, and data files for better performance.

What should be cached? • Grouped query plans. We assume that the number of query groups is relatively small. • Recently accessed files. • The event list for monitoring the timer-based events. Because it can be large, we keep only a “time window” of this list.
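Caching recently accessed files could look like the sketch below. The slides do not name an eviction policy, so least-recently-used eviction is an assumption here, as are the class name `RecentFileCache`, the `load` callback, and the file name `delta.xml`.

```python
from collections import OrderedDict


class RecentFileCache:
    """Keep at most `capacity` files in memory, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # file name -> contents, oldest first

    def get(self, name, load):
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return self.files[name]
        contents = load(name)             # cache miss: read from disk
        self.files[name] = contents
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)  # evict the least recently used file
        return contents


loads = []


def load(name):
    # Stand-in for reading a file from disk; records each actual load.
    loads.append(name)
    return f"<contents of {name}>"


cache = RecentFileCache(capacity=2)
for name in ["quotes.xml", "companies.xml", "quotes.xml", "delta.xml", "companies.xml"]:
    cache.get(name, load)
```

The second access to quotes.xml is served from memory, while companies.xml has to be loaded again because delta.xml pushed it out of the two-file cache.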

Experimental Results • Hardware Settings • Conducted on a Sun Ultra 6000 with 1GB of RAM, running JDK1.2 on Solaris 2.6. • Data Sets • Database of stock information consisting of two XML files, “quotes.xml” and “companies.xml”. • “quotes.xml” contains stock information on about 5000 NASDAQ companies. The size of the file is about 2 MB. • “companies.xml” contains related company information. The size of the file is about 1 MB.

Comparison

System Status and Future Work • A prototype version of NiagaraCQ has been developed, which includes a Group Optimizer, Continuous Query Manager, Event Detector, and Data Manager. • The incremental group optimizer is still at a preliminary stage. However, incremental group optimization has been shown to be a promising way to achieve good performance and scalability. • The authors intend to extend incremental group optimization to queries containing operators other than selection and join (e.g. aggregation).
