430 likes | 634 Views
KapQuilt: Semantic Caching for Quilt Queries --- A New Quilt Query Answerable By Cached Ones?. Li Chen. Outline. Background Motivation Goals: semantic cache for quilt queries Overall task list Immediate task Approaches for containment and rewriting Case studies Module design
E N D
KapQuilt: Semantic Caching for Quilt Queries --- A New Quilt Query Answerable By Cached Ones? Li Chen
Outline • Background • Motivation • Goals: semantic cache for quilt queries • Overall task list • Immediate task • Approaches for containment and rewriting • Case studies • Module design • Timetable
Background query efficiency query quality query performance dynamically decide mat views control concurrency of queries & updates Argos! :-)) ECA, query optimization semantic cache sweep mat view maintenance answer queries using views data mining database design containment theory & algo web site integration data warehousing web site management views independence of physical & logical data
Dimensions of Semantic Caching languages SQL - simple select-project-join SPJ - group, aggregation, query blocks - datalogs OQL TSL Quilt rewritten query query plan containment relationships fully contained max-contained outcomes
Motivations • What’s new about semantic caching? • Web proxies just cache web page hits, not real computed queries • Web information integration needs expressive XML queries • Semantic caching for XML queries is new • Quilt is a full capability XML query language, promising for the integration of web info, and kweelt is a quilt query engine implemented!
Goals • Goal! build a SC system for quilt queries • to better answer populate queries • quicker, less expansive and more up-to-date • KapQuilt comes to rescue limitation we start from the core subset of quilt queries while ignoring nesting queries and regular expression queris for now
q1 q2 q3 Cached Views XML Source 2 XML Source n XML Source 1 KapQuilt System Architecture KSP Client Kweelt Engine Kweelt API remote query requests Parser KapQuilt Query Decomposer DTDM Query Matcher CIS Query Rewriter Cost Estimator PQ Query plans Evaluator RQ DOM … DOM Other Node Factories XML Parser Parser Wrapper …... Doc RDB
Task List • Answer whether a query is computable by cached ones • If answerable, compute PQ (probe query) and RQ (remainder query) • If many PQ candidates, pick the one benefits most • Decide whether a query is worth to cache, when to cache • In case of cache space limitation, apply replace policy • Decompose and coalesce the query segments in cache • Concurrency control of queries and updates • Analyze costs in various web query archs • ?Keep cached view always fresh • integrate Argos with KapQuilt
Immediate Task!-- MQP project goal as well • Design and impl core functions of KapQuilt • input • a set of cached queries S={s1,s2...} • a new query q • output • a probe query (PQ) • might be null if not answerable at all • if not null, PQ Ac (s1 s2 … sn) • a reminder query (RQ) • might be null if q fully contained in S • if not null, RQ go down to query against data sources
Approaches • Analyze quilt query process and its variable binding mechanism • Set up cache index structure (CIS) to represent elements of a quilt query • Warm up cache by initializing CIS with decomposed queries • Implement the query containment and rewriting algorithm for quilt • Conduct experimental studies for cost analysis • Integrate with Argos system for cached view maintenance
A Taxonomy for XML Query SQL XQL OQL XPointer XSL Patterns XML-QL XQL-99 XPath Quilt
Briefs on Quilt • Quilt is a functional language • A query is an expression, composed of • FLWR Expressions FOR ... LET ... WHERE ... RETURN • Filters • XPath expressions document("bids.xml")//bid[itemno="47"]/bid_amount • Operators and functions • Element Constructors <bid> <userid> $u </userid> , <bid_amount> $a </bid_amount> </bid>
Data Flow in a FLWR Expression XML FOR/LET List of tuples of bound variables ($x = value, $y = value, $z = value), ($x = value, $y = value, $z = value), ($x = value, $y = value, $z = value) WHERE List of tuples of bound variables RETURN XML
Quilt Compared to XQL • A superset of XQL • Overcome shortcomings of XQL • no variable bindings, joins, transformations, ordering, aggregate functions, etc • no data integration from multiple XML sources • do semi-join, but in pretty non-intuitive syntax • Cover queries on structured document (including SGML), relational data even! book[author=//book[title='Moby Dick']/author]
Quilt Query Process • Variable binding is an important means • a query can define multiple variables, in order • dependency relationships exist among variables • a tuple list is bound to each variable, condition evaluation and return invocation are tuple-based • tuple lists are handles to data tree components, of which answer tree is composed
Cache Index Structure (CIS) • A structure to capture the essential elements of a quilt query • What’s essential elements of a quilt query? • variable bindings, conditions and returning nodes all refer to some element nodes in dtd. • a query can be identified by variable nodes V, return nodes T, condition nodes F and their dependency relationships • each element node in a dtd tree can be assigned a unique number (with unique absolute xpath)
Example DTD <?xml version="1.0"?> <!DOCTYPE bib [ <!ELEMENT bib (book* )> <!ELEMENT book (title, (author+ | editor+ ), publisher, price )> <!ATTLIST book year CDATA #REQUIRED > <!ELEMENT author (last, first )> <!ELEMENT editor (last, first, affiliation )> <!ELEMENT title (#PCDATA )> <!ELEMENT last (#PCDATA )> <!ELEMENT first (#PCDATA )> <!ELEMENT affiliation (#PCDATA )> <!ELEMENT publisher (#PCDATA )> <!ELEMENT price (#PCDATA )> ]> bib 1 book 2 year title author editor publisher price 5 7 12 19 21 3 last first last first affiliation 4 6 8 10 13 15 17 20 22 PCDATA PCDATA PCDATA CDATA 9 11 14 16 18 PCDATA PCDATA PCDATA PCDATA PCDATA
//book Quilt Query Sampler I v1 v1/author Q1 <bib> FOR $book IN document("bib.xml")//book[@year.>=.1991 AND publisher="Addison-Wesley"] RETURN <book year=$book/@year>$book/title</book> </bib> v2 b1 b2 b3 variable nodes v1 /bib/book 2 2 y1 p1 y2 p2 y3 p3 v1 f1 () f2 () condition nodes 2 3 19 f1 v1 @ year.>.=1991 3 3 b1 y1[1993] 1 p1[…] 1 f2 19 v1 / publisher=“Addison-Wesley” b2 y2[1995] 1 p2[…] 0 19 b3 y3[1990] 0 p3[…] 1 return nodes t1 v1 @ year 3 3 t2 5 v1 / title t1 t1 t2 5 3 5 y1[1993] t1[…] /bib/book /book 2 r1 publisher year 19 5 3 3 y1 t1
Quilt Query Sampler II Q2 FOR $author IN DISTINCT document("bib.xml")//author, $book IN document("bib.xml")//book[author = $author] RETURN <result> $book/title, $author</result> b1 b2 b3 a1 a2 a1 a3 a2 variable nodes v2 [author= v1] v1 2 7 v1 /bib/book/author a1 b1 7 7 v2 a2 b2 2 /bib/book[author= v1] 2 a3 b1 return nodes b3 t1 v2 / title b2 5 5 t2 t2 t1 7 v1 7 5 7 a1 t1 a2 t2 a3 t1 t3 /bib/book t2 /result 2 author r1 r2 r3 r4 r5 7 5 7 v1 v2 / title /bib/book/author t1 a1 t2 a1 t1 a2 t3 a2 t2 a3 /bib/book[author= v1]/title
Quilt Query Sampler III Q3 <results> FOR $author IN DISTINCT document("bib.xml")//author RETURN <result> $author, document("bib.xml")//book[author = $author]/title </result> </results> b1 b2 b3 a1 a2 a1 a3 a2 v1 7 a1 variable nodes a2 v1 /bib/book/author a3 7 7 t1 t2 7 5 return nodes a1 t1 t1 7 7 v1 a2 t2 5 t2 a3 5 /bib/book[author= v1]/title t1 t3 t2 /result /bib/book/author r1 r2 r3 7 7 5 a1 t1 t2 a2 t1 t3 a3 t2
Quilt Query Sampler IV Q4 <books-with-prices> FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"], $b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title] RETURN <book-with-prices> $b_book/title, <price-amazon>$a_book/price/text()</price-amazon>, <price-bn>$b_book/price/text()</price-bn> </book-with-prices> </books-with-prices>
Quilt Query Sampler IV b1 b2 b3 b1’ b2’ b3’ variable nodes v1 2 bib/book[source = "www.amazon.com"] 2 v2 t1 t2 t3 t1’ t2’ t3’ 2’ bib/book[source = "www.bn.com"][title = v1 /title] 2’ return nodes t1 5’ v2 / title v1 v2 [title = v1 /title] 5’ 2 2’ t2 22 22 v1 /price/text() b1 b1’ 22’ t3 b2 22’ v2 /price/text() b2’ b3 b3’ t2 t1 t3 22 5 22’ $12.5 t1 $21 $23 / book-with-prices /bib/book t2 $22 @source = "www.amazon.com" $54 t3 $47 price-bn 2 price-amazon 5’ /bib/book @source = "www.bn.com" 2’ 22 22’ PCDATA PCDATA t1 t2 t3 $12.5 $21 $23 $22 $54 $47
More Quilt Query Sampler variable nodes Q5 <bib> FOR $book IN document("bib.xml")//book[price.<=.$50] RETURN <book year=$book/@year><editors>$book/editor</editors></book> </bib> v1 /bib/book 2 2 condition nodes f1 v1 / price.<.=$50 21 21 return nodes t1 v1 @ year 3 3 t2 12 v1 / editor variable nodes 12 v1 /bib/book 2 2 v2 7 /bib[book= v1]//author 7 condition nodes f1 v1 / price.<.=$50 21 21 Q6 <bib> FOR $book IN document("bib.xml")//book[price.<=.$50], $author IN /bib[book=$book]//author[last=“Abiteboul”] RETURN <book>$book/title, $book/price, $author</book> </bib> return nodes t1 v1 / title 5 5 t2 v1 / price 21 21 t3 7 v2 7
s s s q F T F T F T s s s s s s T F T F T F q q q q q q q q Query Containment for Relational Queries
Our Containment Theorem Given a set of cached queries S={s1,s2...}, and a new query q, q can be fully answerable by S if 1 2 3 4 5 6
Explanations for every condition node f of q, it must either also be one condition node, with loose predicates, of some si in the cache, or be one return node of some sj for every condition node fi of q, if it is not one of any return node of sj, then it must be one condition node, with loose predicates, of some si in the cache, and any other condition node fk of si should be one condition node fk of q or there is a subset of S, whose condition nodes is a subset of those of q, but whose condition nodes and return nodes are a superset of the condition nodes of q.
Explanations for every return node t of q, it must also be one return node of some si in the cache. for every pair of return nodes ti and tj of q, if their counterparts are in different segments si and sj, then there must be a common return node in si and sj. for every return node t of q, if it is derived from a variable node v, then its counterpart in the cache should be also derived from the same variable node, and all the condition nodes derived from this v should also have their counterparts derived from v in q.
Query Rewriting Rules If a query is judged to be computable by cached views, the following rules can be followed to figure out the rewritten q’ • 1. Decide which filters to keep(not evaluated by any cached query yet), • which filters to remove (evaluated by some cached query) and • remember those cached queries S with established F mappings. • keep all those f that has t’ matches, and those f with a looser f ’ matches, • they would be still appearing as condition nodes in the probe query • remove those f with exact f ’ matches • for each non-exact f ’ match, remember its s so to know which s to associated • with those left over filters
Query Rewriting Rules (cont.) • 2. We need to figure out the semantic meanings of newly constructed nodes • in the returning structure of each cached queries, they are associated • with new xpaths as the replacement of their old ones • …. • A newly constructed node can be seen as the renaming of some old • element node. Return nodes usually appear under each newly constructed • node, hence a mapping of this new node to the old one can be inferred • from those return nodes • replace in the new query q those old xpaths, with the new xpath to a • newly constructed node in cached views • 3. In case of a query rewriting using joins of more than one s with common • t pair, be sure to add such joins as new conditions • if there is no variable binding in the new q, a new binding should be produced • for one of the common t pair so that there is a way to join with its pair
Query Containment I Suppose that we have queries of q1,q2,q3,q4,q5,q6 cached in C={s1,s2,s3,s4, s5,s6}, a new query q comes in, • caseq of • <bib> • FOR $book IN document("bib.xml")//book[editor/affiliation=“WPI”] • RETURN <book year=$book/@year>$book/title</book> • </bib> • it does not even satisfy the first condition. s.t. not answerable • f1 refers to the element node of 17, which has no match in s1 to s6 • <bib> • FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"] • RETURN <book year=$book/@year>$book/title</book> • </bib> • it satisfies the first condition, but not the second one. s.t. not answerable • f1 < -- > f1’ in s1, but another condition node f2’ of s1 is not any condition node of q f1 f1
Query Rewriting I • caseq of • <bib> • FOR $book IN …/bib/book [@year=1997 AND title like “JAVA*” AND publisher="Addison-Wesley"] • RETURN <book year=$book/@year>$book/title</book> • </bib> • it satisfies all those conditions, s.t. is answerable • 1) f1 < -- > f1’ in s1, f2< -- > t2’ of s1, f3 < -- > f2’ in s1, 2) there is no other f ’ in s1, • 3) t1 < -- > t1’ in s1, t2< -- >t2’ in s1, 4) t1’ and t2’ are both from the same s1, • 5) t1 and t2 are derived from v1, so do t1’ and t2’ from v1’, v1’--> f1’, f2’, and v1--> f1, f2 f1 f2 f3 s1 $book IN …/bib/book [@year=1997 AND title like “JAVA*” AND publisher="Addison-Wesley"] /book v1 /bib/book rewritten as t1 v1 @ year t2 v1 / title 5 3 /book [source = "s1”] Rewrite the query as <bib> FOR $book IN /book [source = "s1"][@year=1997 AND title like “JAVA*”] RETURN <book year=$book/@year>$book/title</book> </bib> left over filters
f1 f2 f3 Query Rewriting II • caseq of • <bib> • FOR $book IN … /bib/book [@year.>=.1991 AND publisher="Addison-Wesley” AND price.<=.$50] • RETURN <book>$book/title,<editors>$book/editor</editors></book> • </bib> • it satisfies all the conditions, s.t. is answerable • 1) f1 < -- > f1’ in s1, f2 < -- > f2’ in s1, f3< -- > f1’ of s5, 2) there is no other f ’ in s1 and s5 • 3) t1< -- >t2 in s1, t2 < -- > t2 in s5, 4) t1 in s1 = t1 in s5, • 5) t1 and t2 are derived from v1, so do t1’ and t2’ from v1’, v1’--> f1’, f2’, and v1--> f1, f2, f3 $book IN …/bib/book [@year .>=.1991 AND publisher="Addison-Wesley” AND price.<=.$50] s1 /book v1 /bib/book rewritten as t1 v1 @ year t2 v1 / title 5 3 /book1[source = "s1”] and /book2[source = "s5”] s5 Rewrite the query as <bib> FOR $book1 IN /book [source = "s1"], $book2 IN /book [source = "s5"][title =$book1/title] RETURN <book>$book1/title,<editors>$book2/editor</editors></book> </bib> /book v1 /bib/book t1 v1 @ year t2 5 3 v1 / editor
Query Rewriting III Suppose that we have q2 cached in S, but q3 is not cached, instead, it is a new query, caseq of <bib> FOR $book IN document("bib.xml")//book[publisher="Addison-Wesley"] RETURN <book year=$book/@year>$book/title</book> </bib> <bib> FOR $book IN … //book [@year=1997 AND title like “JAVA*” AND publisher="Addison-Wesley"] RETURN <book year=$book/@year>$book/title</book> </bib>
Query Rewriting IV Suppose that we have q2 cached in S, but q3 is not cached in, instead, it is a new query, Q4 <books-with-prices> FOR $a_book IN document("prices.xml")//book[source = "www.amazon.com"], $b_book IN document("prices.xml")//book[source = "www.bn.com"][title = $a_book/title] RETURN <book-with-prices> $b_book/title, <price-amazon>$a_book/price/text()</price-amazon>, <price-bn>$b_book/price/text()</price-bn> </book-with-prices> </books-with-prices>
Input: Query q, Semantic Cache C Output: Result of q AnsweringQuery Procedure: answerable, fullAns <--- False; T <--- current timestamp; C={s1,s2,..} <--- set up CIS for si segment; s <--- set up CIS for q; M <--- matched node set, set as null at the beginning ; R = RENs <--- not matched node set; RC <--- CENs of s; si <--- look for the first q related si in C; S <--- put si into a candidate set; While (si can be found) { answerable <--- True; STs <--- T; (MS, RM) <--- query_trimming(si, s); MS <--- matching nodes of s and si; RM <--- remaining nodes of s not covered by si; R = R-RM; M = M+MS; RC = RC+RemainingCENS; If (R=null) { fullyAns <--- True; break; } si <--- next q related segment in C; } PQ <--- query_rewriting(S, JoinCISs, M, RC); MatV <--- materialized view sets referred by S; ResPQ, Result <--- process PQ against MatV; If (fullyAns = False) { RQ <--- query_rewriting(S, JoinCISs, M, RC); ResRQ <--- process RQ at the server; Result <--- coalesce(ResPQ, ResRQ); } create a new segment Snew contains the result of q; SnewTs <--- T If there isn’t enough space, do cache replacement Cache Snew ; return(Result).
Input: CIS structure for q s and a segment si; current candidate set S and JoinCISs; Output: judge whether si is related to q; if yes, add into S, matching nodes MS and remaining nodes RM; Query_trimming Procedure: IsRealted <--- False; MS <--- null; RM <--- boundRENs of s; RS <--- RENs in si; RNS <--- RENs in all sof S; RemainingCENS <--- boundCENs of s; While (RM=\null) { matchingRENS <-- match nodes in RM and RSby applying theorems; if (matchingRENS =\null) { commonSet <--- RNS ^ RS; if (commonSet =\null) { IsRelated <--- True; <si, sj, commonSet> <--- sj is the segment in S has commonSet with si; JoinCISs <--- add <si, sj, commonSet> into JoinCISs; (MS, RM) <--- (matchingRENS, RM-M); RemainingCENS <--- left-over ones except exact-match CENS; } } } return (S, JoinCISs, MS, RM, RemainingCENS).
DTD DTDWalker DTDTree (ElementNode) ENIndex to-be-cached queries set up QueryDecomposer QueryIndex VENIndex CENIndex RENIndex (VariableEleNode ReturnEleNode ConditionEleNode) CacheIndexStructs MatView (ViewDTDTree) new query QueryDecomposer AnsweringQuery NewQuery QueryTrimmer NewQuery QueryRewritter contained? (MatchingCENPair ProbeQuery MatchingRENPair MatchingENPair) Y Result fully contained? remainingRENs N QueryRewritter RemainderQuery ResultCoalescer
ENIndex dtdElements:Vector 1 ElementNode enId:int dtdRef:DTDTree enName:String absXpath:String parentEN: ElementNode childrenEN:Vector DTDTree dtdName:String dtdLoc:String rootEN: ElementNode elementNodes:Vector equalTo(EN):boolean relativeXpathForm(EN):String VariableEleNode ConditionEleNode ReturnEleNode ViewDTDTree venName:String cisRef:CIS parentVEN:VEN childrenVEN:Vector childrenCEN:Vector childrenREN:Vector cenName:String cisRef:CIS condition:String parentVEN:VEN renName:String cisRef:CIS parentVEN:VEN matchingENPairs:Vector stricterThan(CEN):boolean
CENIndex QueryIndex RENIndex VENIndex cachedQueries:Hashtable cachedRENinCISs:Hashtable cachedCENinCISs:Hashtable cachedVENinCISs:Hashtable 1 ProbeQuery boundVENs:Vector boundCENs:Vector boundRENs:Vector newDTD:ViewDTDTree 1 NewQuery 1 CacheIndexStructs candidateCISSet:Vector matchingCENPairs:Vector remainingCENs:Vector matchingRENPairs:Vector remainingRENs:Vector initialize cisId:String qstring:String viewDTD:ViewDTDTree matRef: MatView boundVENs:Vector boundCENs:Vector boundRENs:Vector 1 hasMoreCENsThan(CIS):boolean hasOverlapTENs(MatchingRENPair, MatchingRENPair):boolean MatchingRENPair MatView MatchingCENPair newTEN: ReturnEleNode oldTEN: ReturnEleNode otherOldTENs:Vector matId:String cisRef:CacheIndexStructs newCEN: ConditionEleNode oldCEN: ConditionEleNode oldREN: ReturnEleNode remainingCond:string diffCISFrom(MatchingRENPair):boolean overlapTENWith(MatchingRENPair):boolean sameParentVEN():boolean newWithPVENhasMoreCENs():boolean MatchingENPair newConstructEN:ElementNode originalDtdEN:ElementNode source: MatView stricterThan(CEN):boolean
Timetable • By 11/15: design due, implement starts • By 11/30: half finish coding • By 12/10: fully finish coding • By 12/20: finish integration • By 1/15: test designed cases • By 1/30: design and do experiments • By 2/15: collect experiment results • By 2/28: document code, writing • By 3/15: summarize
Task Assignment • Lily: • design classes, containment, rewriting and candidate picking algorithms, design experiments • ideas for query decomposition, result combination, cache decomposing /coalesce, replacement policy, data updates handling • Jake: • implement containment algo, ... • Ian: • implement classes of EN, VEN, CEN, TEN, ... • Amar: • module of rewriting algo
Implementation Toolsuites • JDK1.2, Servlet • XML Parser, DTD Parser • Quilt Parser, Kweelt(Quilt) Query Engine