1.08k likes | 1.2k Views
Continuous Queries and Publish/Subscribe. Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam Speert. Presentation Outline. 1:40 - 1:50 Introduction to Pub/Sub 1:50 - 2:25 EXPRESSION/EVALUATE 2:25 - 3:10 TAPESTRY 3:10 - 3:20 Break
E N D
Continuous Queries and Publish/Subscribe Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam Speert
Presentation Outline • 1:40 - 1:50 Introduction to Pub/Sub • 1:50 - 2:25 EXPRESSION/EVALUATE • 2:25 - 3:10 TAPESTRY • 3:10 - 3:20 Break • 3:20 - 4:00 Efficient Filtering of XML • 4:00 - 4:30 Discussion
References • Aguilera, M., Strom, R., Sturman, D., Astley, M., and Chandra, T. “Matching Events in a Content-based Subscription System”. 18th ACM Symposium on Principles of Distributed Computing, 1999: 53-61 • Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J. “Filtering Algorithms and Implementation for Very Fast Publish/Subscribe System”. ACM SIGMOD 2001: 115-126. • Yalamanchi, A., Srinivasan, J., Gawlick, D. “Managing Expressions as Data in Relational Database Systems”. Proceedings of the 2003 CIDR Conference.
References (Continued) • Douglas B. Terry, David Goldberg, David Nichols and Brian M. Oki. Continuous Queries over Append-Only Databases. SIGMOD Conference 1992: 321-330.
What is Publish/Subscribe? • Connecting information providers to information consumers • Delivering published events to subscribers who expressed interests in such events.
Group-Based • Publishers categorize events into fixed groups such as subjects, channels, or topics. • Subscribers choose certain a group(s) and then obtain every event in the specified group(s).
Content-Based • No pre-defined groups - subscribers can filter criteria along multiple dimensions. • No administrative overhead for publishers to maintain groups.
Issues • Large number of subscribers • Large number/high rates of events • Minimize delay • Simple and expressive subscription • Changing subscription
How does it relate to us? • Subset of stream database with only select queries • Events in form of streams • Subscribers’ interests in form of continuous queries or possibly as another stream
Managing Expressions as Data in Relational Database Systems Aravind Yalamanchi Jagannathan Srinivasan Dieter Gawlick
First, a note on efficiency… Save: • time • space • computation • computation resources
Motivation • Sophisticated Publish/Subscribe systems do not currently exist • Authors propose approach to Pub/Sub that allows greater expressiveness
Idea: I want a list of movies meeting the following criteria: • G Rating • Cartoon • Shorter than 3 hours Could execute a query, or…
Different Approach … we could store the query (as data)! • Arrival of new data search DB for queries that are satisfied • Notify users who issued matching queries
Definition: Expressions • Boolean conditions characterizing user’s interests • Syntax-equivalent to SQL-WHERE clause • Stored in a column of database table • Treated as data
More on Expressions • can reference variables and functions • Example: Tivo/Digital Guide ratingReview(movie) = 4 and Day = 6 and Length(movie) < 200; • finds movies that received 4 stars, less than 200 minutes, and are being shown on day 6 (Saturday)
Expression Data Type • Conditional expression stored as text • Associated with Expression Set (set of expressions in db) Metadata (variables and functions common to set)
Expression Metadata • Expressions share a common list of variables • Metadata made up of these variables, their data types and functions • Expressions can use this Metadata in their predicates
Expression Metadata Cont’d • Metadata determines evaluation context of expressions in a column • Insertion or update to expression: expression checked against metadata to ensure its validity
Expression Metadata Variable names and their data type List of user-defined Function
Definition: EVALUATE • An operation comparable to (eval) in LISP/Scheme • Equivalent SQL Query • metadata in FROM clause • expression in WHERE clause SELECT 1 FROM (SELECT :Genre as Genre, :Rating as Rating FROM <any_table_with_one_row>) WHERE Genre = ‘Comedy’ and Rating = ‘PG’
EVALUATE Operator • Evaluates an expression into 0 or 1. • Takes 2 arguments • text representation of expression • Data Item (String or AnyData type) • SELECT * FROM consumer WHERE EVALUATE( consumer.interest, • AnyData.convertObject( • Movie(‘Comedy’,’PG’))= 1 • SELECT * FROM consumer WHERE EVALUATE( • consumer.interest, • ‘Genre => “COMEDY”, • Rating => “PG”’ ) = 1
Use of EVALUTE in SQL • Use of aggregates to allow complex expressions. • using GROUP BY or ORDER BY on the customer’s other attributes • SELECT * FROM consumer • WHERE EVALUATE(consumer.interest, • <movie details> ) = 1 • Order BY consumer.age
Use of EVALUTE (continued) • Can be combined with other predicates to allow “Mutual Filtering” (Publishers can filter data, not only Subscribers) • notifying only customers with age >= 17 about the R-rated movies
Use of EVALUTE (continued) • Perform JOINS on multiple relations • order the movies in the newMovie table by the number of subscribers interested in the type of the movie (using count(*) in SQL) • SELECT DISTINCT (newMovie.id), count(*) as demand FROM consumer, newMovie • WHERE EVALUATE(consumer.interest, • <movie details from newMovie table> ) = 1 • GROUP BY newMove.id • ORDER BY demand DESC
Use of EVALUTE (continued) • Maintain complex (many-to-many) relationship between multiple tables. • joining customer and movie tables to create a listing of all the actual movies from movie table that each customer is interested in.
Expression Indexing • As in previously seen systems, Oracle developers exploit commonalities • Large set of expressions many will tend to share common parts of their predicates • Index defined on group of expressions in column to process more “efficiently” with EVALUATE
Expression Indexing Cont’d • Exploitation of commonalities leads to more “efficient” processing of expressions • Processing costs shared across multiple predicates • Example: • Given 2 predicates: one wants movie rated PG and one wants movie rated R • If one predicate is satisfied, the other is not
Exploitation of Commonalities • Logical grouping of predicates by their LHS • ( length(movie) < 2; group on length(movie)) LHS OP RHS • Example: want movies shorter than 3 hours and movies shorter than 2 hours, • movies shorter than 2 hours satisfy both predicates
Predicate Table • Predicates are grouped by LHS • Contains 1 row for each expression
Index Processing • Data’s attribute compared with the RHS and OP columns • Example: SELECT exp_id from predicate_table WHERE G2_OP = ‘=‘ and G2_RHS = :rhs_val OR G2_OP = ‘>’ and G2_RHS < :rhs_val OR …
Index on Predicate Table • We can have an index on the Predicate Table. • Processing: 1. Indexed predicate group 2. Stored predicate group 3. Sparse predicate group Some expressions are filtered out after each step - Less to check with slower predicate group.
Performance Characterization • Oracle says: preliminary experiments promising results for the EFI Scheme • Performed best when “fine-tuned” for given expression set-- Tunable Characteristics of an index: • list of common predicates • list of common ops for these preds • # of indexed predicates
Future Work • Native support needed for: • Expression data type • EVALUATE operator
Why Does It Matter?(What to take away) • Expression/EVALUATE allows us to list all interested subscribers for a data by just one query. • Makes it easier to scale for large number of subscribers. • Using Relational Database Management system for Pub/Sub.
Discussion Sneak Preview • What if we don’t have all information? Must conform to metadata, so now what? • Have we seen Pub/Sub before? What about myYahoo portals? • So, if RDMS can be extended to handle Pub/Sub, is this the limit?
Continuous Queries over Append-Only Databases Using the Tapestry System
Talk Topics • Continuous Queries • TQL / Tapestry • Query Transformations • Results • Conclusions
Continuous Queries • What does ‘continuous’ mean? (Continuous Semantics) • The results of a continuous query is the set of data that would be returned if the query were executed at every instant in time. • Why an ‘append-only’ database? • Users are being presented with answers, and we don’t want a data item to be an answer at one point, but not an answer later in time.
Periodic Query Example • Show all e-mail messages that have received a reply.
Problems with Periodic Queries • Non-deterministic results • If two people make this query, will they obtain the same results? • Duplicates can be returned • Every time the query is run, the same results might be returned. • Inefficiency of the system • The query is run over all data, not new data, resulting in slow processing time do to un-necessary work being done by the system.
The Tapestry System • A system that keeps a database of all mail and news messages received. • Designed to filter mail and newsgroup messages. • Allows users to query over this database. • Since Tapestry doesn't use triggers (time is used), it can be implemented in any commercial database that supports SQL applications.
TQL • TQL – Tapestry Query Language • Works much like SQL, with the addition of a timestamp. • Every query uses DISTINCT. • No aggregates at this time. • Does not support outer joins. • TQL monotone incremental SQL
TQL Advantages • Allows ad hoc queries. • Can be used in a conventional relational database. • Time-oriented queries are supported. • Flexible scheduling of query running order.
Continuous Query Example • Show all e-mail messages that have received a reply, and that have not been returned by this query before. • Method: • Subtract old answers from new answers and return this new set of data. • This is done because some past answers might be included in the ‘new’ data.
Incremental Query Example • Show all e-mail messages that have received a reply, and that have not been returned by this query before. (Same query as before) • Method: • Run the query on the new data that has arrived, not the entire database or by comparing to past answers.
Solutions • Incremental Queries can eliminate duplicates. • Timestamp • Query Transformations • To run continuous queries and solve the problem of non-determinism, TQL was developed.
Query Transformation • Q(t) QM(t) QI(t, τ) • QM(t) – monotone query • QI (t, τ) – incremental query • Monotone query: QM(t1) QM(t2) when t1 <t2 • Minimum bounding monotone query • Incremental query: QM(t)-QM(τ) QI(t, τ) QM(t) • Should return enough, but not too much
Monotone queries • QM(t) = Us<=t Q(s) • Monotone query result = • = regular query + continuous semantics • Not all queries are monotone, but for an append-only database many queries are