Continuous Queries and Publish/Subscribe

Continuous Queries and Publish/Subscribe Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam Speert

Presentation Outline • 1:40 - 1:50 Introduction to Pub/Sub • 1:50 - 2:25 EXPRESSION/EVALUATE • 2:25 - 3:10 TAPESTRY • 3:10 - 3:20 Break • 3:20 - 4:00 Efficient Filtering of XML • 4:00 - 4:30 Discussion

References • Aguilera, M., Strom, R., Sturman, D., Astley, M., and Chandra, T. “Matching Events in a Content-based Subscription System”. 18th ACM Symposium on Principles of Distributed Computing, 1999: 53-61 • Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J. “Filtering Algorithms and Implementation for Very Fast Publish/Subscribe System”. ACM SIGMOD 2001: 115-126. • Yalamanchi, A., Srinivasan, J., Gawlick, D. “Managing Expressions as Data in Relational Database Systems”. Proceedings of the 2003 CIDR Conference.

References (Continued) • Douglas B. Terry, David Goldberg, David Nichols and Brian M. Oki. Continuous Queries over Append-Only Databases. SIGMOD Conference 1992: 321-330.

What is Publish/Subscribe? • Connecting information providers to information consumers • Delivering published events to subscribers who expressed interests in such events.

Group-Based • Publishers categorize events into fixed groups such as subjects, channels, or topics. • Subscribers choose certain a group(s) and then obtain every event in the specified group(s).

Content-Based • No pre-defined groups - subscribers can filter criteria along multiple dimensions. • No administrative overhead for publishers to maintain groups.

Issues • Large number of subscribers • Large number/high rates of events • Minimize delay • Simple and expressive subscription • Changing subscription

How does it relate to us? • Subset of stream database with only select queries • Events in form of streams • Subscribers’ interests in form of continuous queries or possibly as another stream

Managing Expressions as Data in Relational Database Systems Aravind Yalamanchi Jagannathan Srinivasan Dieter Gawlick

First, a note on efficiency… Save: • time • space • computation • computation resources

Motivation • Sophisticated Publish/Subscribe systems do not currently exist • Authors propose approach to Pub/Sub that allows greater expressiveness

Idea: I want a list of movies meeting the following criteria: • G Rating • Cartoon • Shorter than 3 hours Could execute a query, or…

Different Approach … we could store the query (as data)! • Arrival of new data  search DB for queries that are satisfied • Notify users who issued matching queries

Definition: Expressions • Boolean conditions characterizing user’s interests • Syntax-equivalent to SQL-WHERE clause • Stored in a column of database table • Treated as data

More on Expressions • can reference variables and functions • Example: Tivo/Digital Guide ratingReview(movie) = 4 and Day = 6 and Length(movie) < 200; • finds movies that received 4 stars, less than 200 minutes, and are being shown on day 6 (Saturday)

Expression Data Type • Conditional expression stored as text • Associated with Expression Set (set of expressions in db) Metadata (variables and functions common to set)

Expression Metadata • Expressions share a common list of variables • Metadata made up of these variables, their data types and functions • Expressions can use this Metadata in their predicates

Expression Metadata Cont’d • Metadata determines evaluation context of expressions in a column • Insertion or update to expression: expression checked against metadata to ensure its validity

Expression Metadata Variable names and their data type List of user-defined Function

Definition: EVALUATE • An operation comparable to (eval) in LISP/Scheme • Equivalent SQL Query • metadata in FROM clause • expression in WHERE clause SELECT 1 FROM (SELECT :Genre as Genre, :Rating as Rating FROM <any_table_with_one_row>) WHERE Genre = ‘Comedy’ and Rating = ‘PG’

EVALUATE Operator • Evaluates an expression into 0 or 1. • Takes 2 arguments • text representation of expression • Data Item (String or AnyData type) • SELECT * FROM consumer WHERE EVALUATE( consumer.interest, • AnyData.convertObject( • Movie(‘Comedy’,’PG’))= 1 • SELECT * FROM consumer WHERE EVALUATE( • consumer.interest, • ‘Genre => “COMEDY”, • Rating => “PG”’ ) = 1

Use of EVALUTE in SQL • Use of aggregates to allow complex expressions. • using GROUP BY or ORDER BY on the customer’s other attributes • SELECT * FROM consumer • WHERE EVALUATE(consumer.interest, • <movie details> ) = 1 • Order BY consumer.age

Use of EVALUTE (continued) • Can be combined with other predicates to allow “Mutual Filtering” (Publishers can filter data, not only Subscribers) • notifying only customers with age >= 17 about the R-rated movies

Use of EVALUTE (continued) • Perform JOINS on multiple relations • order the movies in the newMovie table by the number of subscribers interested in the type of the movie (using count(*) in SQL) • SELECT DISTINCT (newMovie.id), count(*) as demand FROM consumer, newMovie • WHERE EVALUATE(consumer.interest, • <movie details from newMovie table> ) = 1 • GROUP BY newMove.id • ORDER BY demand DESC

Use of EVALUTE (continued) • Maintain complex (many-to-many) relationship between multiple tables. • joining customer and movie tables to create a listing of all the actual movies from movie table that each customer is interested in.

Expression Indexing • As in previously seen systems, Oracle developers exploit commonalities • Large set of expressions many will tend to share common parts of their predicates • Index defined on group of expressions in column to process more “efficiently” with EVALUATE

Expression Indexing Cont’d • Exploitation of commonalities leads to more “efficient” processing of expressions • Processing costs shared across multiple predicates • Example: • Given 2 predicates: one wants movie rated PG and one wants movie rated R • If one predicate is satisfied, the other is not

Exploitation of Commonalities • Logical grouping of predicates by their LHS • ( length(movie) < 2; group on length(movie)) LHS OP RHS • Example: want movies shorter than 3 hours and movies shorter than 2 hours, • movies shorter than 2 hours satisfy both predicates

Predicate Table • Predicates are grouped by LHS • Contains 1 row for each expression

Index Processing • Data’s attribute compared with the RHS and OP columns • Example: SELECT exp_id from predicate_table WHERE G2_OP = ‘=‘ and G2_RHS = :rhs_val OR G2_OP = ‘>’ and G2_RHS < :rhs_val OR …

Index on Predicate Table • We can have an index on the Predicate Table. • Processing: 1. Indexed predicate group 2. Stored predicate group 3. Sparse predicate group Some expressions are filtered out after each step - Less to check with slower predicate group.

Predicate Handling-Comparisons

Performance Characterization • Oracle says: preliminary experiments  promising results for the EFI Scheme • Performed best when “fine-tuned” for given expression set-- Tunable Characteristics of an index: • list of common predicates • list of common ops for these preds • # of indexed predicates

Future Work • Native support needed for: • Expression data type • EVALUATE operator

Why Does It Matter?(What to take away) • Expression/EVALUATE allows us to list all interested subscribers for a data by just one query. • Makes it easier to scale for large number of subscribers. • Using Relational Database Management system for Pub/Sub.

Discussion Sneak Preview • What if we don’t have all information? Must conform to metadata, so now what? • Have we seen Pub/Sub before? What about myYahoo portals? • So, if RDMS can be extended to handle Pub/Sub, is this the limit?

Continuous Queries over Append-Only Databases Using the Tapestry System

Talk Topics • Continuous Queries • TQL / Tapestry • Query Transformations • Results • Conclusions

Continuous Queries • What does ‘continuous’ mean? (Continuous Semantics) • The results of a continuous query is the set of data that would be returned if the query were executed at every instant in time. • Why an ‘append-only’ database? • Users are being presented with answers, and we don’t want a data item to be an answer at one point, but not an answer later in time.

Periodic Query Example • Show all e-mail messages that have received a reply.

Problems with Periodic Queries • Non-deterministic results • If two people make this query, will they obtain the same results? • Duplicates can be returned • Every time the query is run, the same results might be returned. • Inefficiency of the system • The query is run over all data, not new data, resulting in slow processing time do to un-necessary work being done by the system.

The Tapestry System • A system that keeps a database of all mail and news messages received. • Designed to filter mail and newsgroup messages. • Allows users to query over this database. • Since Tapestry doesn't use triggers (time is used), it can be implemented in any commercial database that supports SQL applications.

TQL • TQL – Tapestry Query Language • Works much like SQL, with the addition of a timestamp. • Every query uses DISTINCT. • No aggregates at this time. • Does not support outer joins. • TQL  monotone  incremental  SQL

TQL Advantages • Allows ad hoc queries. • Can be used in a conventional relational database. • Time-oriented queries are supported. • Flexible scheduling of query running order.

Continuous Query Example • Show all e-mail messages that have received a reply, and that have not been returned by this query before. • Method: • Subtract old answers from new answers and return this new set of data. • This is done because some past answers might be included in the ‘new’ data.

Incremental Query Example • Show all e-mail messages that have received a reply, and that have not been returned by this query before. (Same query as before) • Method: • Run the query on the new data that has arrived, not the entire database or by comparing to past answers.

Solutions • Incremental Queries can eliminate duplicates. • Timestamp • Query Transformations • To run continuous queries and solve the problem of non-determinism, TQL was developed.

Query Transformation • Q(t)  QM(t)  QI(t, τ) • QM(t) – monotone query • QI (t, τ) – incremental query • Monotone query: QM(t1)  QM(t2) when t1 <t2 • Minimum bounding monotone query • Incremental query: QM(t)-QM(τ)  QI(t, τ)  QM(t) • Should return enough, but not too much

Monotone queries • QM(t) = Us<=t Q(s) • Monotone query result = • = regular query + continuous semantics • Not all queries are monotone, but for an append-only database many queries are

Continuous Queries and Publish/Subscribe

Continuous Queries and Publish/Subscribe

Presentation Transcript

Continuous Time Markov Chains

Chapter 6:

Access Chapter 2

Analyzing Data For Effective Decision Making

Relational Algebra (end) SQL Queries

Adaptive Query Processing with Eddies

Job’s Method of Continuous Variation

Relational Algebra (end) SQL Queries

Continuous Delivery / Continuous Integration

Answering Queries Using Views: A Survey

Spatial Queries

An Incremental Sampling-based Algorithm for Stochastic Optimal Control

Extreme delivery : The real measure of progress is…

Using Action Queries

Using the Same Docker Container for Development and in the Cloud

Distributed Publish/Subscribe

Continuous Random Variables and Probability Distributions

Continuous Random Variables

Continuous Deployment with Gerrit and Jenkins