Representing and Querying XML with Incomplete Information
Serge Abiteboul (INRIA), Luc Segoufin (INRIA), Victor Vianu (UCSD)
Organization
• Motivations
• Simplifying assumptions
• Model of incompleteness
• Answering queries
• Results
• Discussion
• Conclusion
Abiteboul-Segoufin-Vianu
The Web is a world of incompleteness
• Information you get from the Web is seldom complete:
  • Queries return some, but not all, of the data
  • Limited storage capability
  • Documents change on the Web: expiration
  • Sites are unavailable…
• Context: Xyleme, a warehouse of XML documents from the Web
This work
• A simple, practically appealing approach to managing incomplete information
• Sequence of queries to the Web: (q1,A1)+(q2,A2)+…
• Answers are cached
• Process a new query without access to the Web
  • Give an incomplete answer
  • Explain the incompleteness to the user
  • Seek additional information, i.e., find a minimal set of queries that fully answers the query
Related work
• Semantic caching
• Answering queries using views
  • Keep (Qi,Ai)
  • Try to rewrite query Q into Q’(A1,...,An)
  • Reject if you cannot
• Incomplete databases
  • (Qi,Ai) is some incomplete knowledge of the DB
  • Related to querying incomplete information, e.g. Lipski-Imielinski
Challenge: balance expressiveness and tractability
• Choice of data model
• Choice of the query language
• Choice of a representation of incompleteness
• Results
  • Simple, practical solution
  • Extra features lead to serious problems
Data is XML: trees
<dealer>
  <UsedCars>
    <ad> <model>Honda</model> <year>96</year> </ad>
  </UsedCars>
  <NewCars>
    <ad> <model>Acura</model> </ad>
  </NewCars>
</dealer>
[Figure: the same document drawn as a tree. dealer has children UsedCars and NewCars; the ad under UsedCars has model (Honda) and year (96); the ad under NewCars has model (Acura)]
Simplified XML
• Unordered trees
• Labelling function
• Value function
[Figure: catalog tree with two product nodes. One has name = nik, price = 234, category = electronic with subcategory = camera; the other has name = can, price = 444, cat = electronique with subcategory = camera, and picture = c.jpg]
Simple XML types
• 1 : exactly 1 child (default)
• * : 0 or more
• + : 1 or more
• ? : 0 or 1
[Figure: type tree rooted at catalog, with nodes product, name, price, cat, picture and subcategory; two edges carry the multiplicity *]
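The tree model and the multiplicity symbols above can be made concrete with a small sketch. Everything here is illustrative: the `Node` class, the `CATALOG_TYPE` dictionary and the `conforms` function are hypothetical names, and the type used (product* under catalog, picture* under product, one subcategory under cat) is only one reading of the flattened figure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: str                      # labelling function: node -> element name
    value: Optional[str] = None     # value function: node -> data value
    children: list = field(default_factory=list)  # unordered children


# Multiplicity symbols from the slide, as (min, max) occurrence bounds.
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

# Assumed reading of the type figure: label -> {child label: multiplicity}.
CATALOG_TYPE = {
    "catalog": {"product": "*"},
    "product": {"name": "1", "price": "1", "cat": "1", "picture": "*"},
    "cat": {"subcategory": "1"},
}


def conforms(node, tree_type):
    """Return True if every node has only allowed children, in allowed numbers."""
    allowed = tree_type.get(node.label, {})
    counts = {}
    for child in node.children:
        if child.label not in allowed:
            return False
        counts[child.label] = counts.get(child.label, 0) + 1
    for label, mult in allowed.items():
        low, high = BOUNDS[mult]
        n = counts.get(label, 0)
        if n < low or (high is not None and n > high):
            return False
    return all(conforms(child, tree_type) for child in node.children)


def product(name, price, cat, subcat, pictures=()):
    """Build one product subtree (helper for the example only)."""
    kids = [Node("name", name), Node("price", price),
            Node("cat", cat, [Node("subcategory", subcat)])]
    kids += [Node("picture", p) for p in pictures]
    return Node("product", children=kids)


catalog = Node("catalog", children=[
    product("nik", "234", "electronic", "camera"),
    product("can", "444", "electronique", "camera", ["c.jpg"]),
])
# A product without price and cat violates the default multiplicity 1.
broken = Node("catalog", children=[Node("product", children=[Node("name", "nik")])])
```

Because children are kept in a plain list and only counted per label, the check ignores sibling order, which matches the unordered-tree assumption.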
Prefix Selection Queries (ps-queries)
[Figure: two query trees. Query1: catalog, product, with children name, price (<200) and cat (=elec) with child subcategory. Query2: catalog, product, with children name and picture]
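One way to read a ps-query is as a tree pattern whose answer is the prefix of the data tree reachable through matching nodes. The sketch below follows that reading; `QNode`, `answer` and the sample catalog are hypothetical, and the akai price of 250 is invented purely so that one product fails the price condition.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:                      # data tree node
    label: str
    value: Optional[object] = None
    children: list = field(default_factory=list)


@dataclass
class QNode:                     # query tree node with an optional value condition
    label: str
    cond: Optional[Callable[[object], bool]] = None
    children: list = field(default_factory=list)


def answer(q, d):
    """Return the prefix of d selected by q, or None if d does not match q."""
    if q.label != d.label:
        return None
    if q.cond is not None and (d.value is None or not q.cond(d.value)):
        return None
    kept = []
    for qc in q.children:
        matches = [m for m in (answer(qc, dc) for dc in d.children) if m is not None]
        if not matches:          # every query child needs at least one match
            return None
        kept.extend(matches)
    return Node(d.label, d.value, kept)


def product(name, price, cat, subcat):
    return Node("product", children=[
        Node("name", name), Node("price", price),
        Node("cat", cat, [Node("subcategory", subcat)])])


catalog = Node("catalog", children=[
    product("canon", 120, "elec", "camera"),
    product("nikon", 199, "elec", "camera"),
    product("sony", 175, "elec", "cdplayer"),
    product("akai", 250, "elec", "camera"),   # hypothetical price, filtered out
])

# Query1: name, price and subcategory of electronic products under 200.
query1 = QNode("catalog", children=[QNode("product", children=[
    QNode("name"),
    QNode("price", lambda v: v < 200),
    QNode("cat", lambda v: v == "elec", [QNode("subcategory")]),
])])

a1 = answer(query1, catalog)
names = sorted(c.value for p in a1.children for c in p.children if c.label == "name")
```

Since no two sibling query nodes share a label (the "no repeated child" restriction), each data child can match at most one query child, so `kept` never contains duplicates.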
Simplifications
• Data
  • No order
  • No distinction attribute/element
  • No recursion
  • No links
• Query
  • No complex path expressions
  • No join
  • No repeated child
[Figure: a query on product with children name, cat=elec and cat=toy, marked NO because the child cat is repeated]
Crucial assumption: XID
• URLs
• ID/IDrefs
[Figure: two partial trees for the same node prod &245 are combined (+) into a single tree for prod &245 containing canon, 120, elec, camera and c.jpg]
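Persistent node identifiers are what make it possible to glue cached answers together. A minimal sketch of that merge, assuming every node carries an identifier: the `xid` field, the `merge` function and the child identifiers &246 to &250 are hypothetical (only &245 appears on the slide).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    xid: str                         # persistent node identifier (XID)
    label: str
    value: Optional[str] = None
    children: dict = field(default_factory=dict)   # xid -> Node


def merge(a, b):
    """Union of two partial views of the same node, matched by identifier."""
    if a.xid != b.xid or a.label != b.label:
        raise ValueError("cannot merge different nodes")
    merged = Node(a.xid, a.label, a.value if a.value is not None else b.value)
    for xid in list(a.children) + [x for x in b.children if x not in a.children]:
        ca, cb = a.children.get(xid), b.children.get(xid)
        if ca is not None and cb is not None:
            merged.children[xid] = merge(ca, cb)   # seen in both answers
        else:
            merged.children[xid] = ca if ca is not None else cb
    return merged


# First answer: name, price and category of prod &245.
first = Node("&245", "prod", children={
    "&246": Node("&246", "name", "canon"),
    "&247": Node("&247", "price", "120"),
    "&248": Node("&248", "cat", "elec",
                 {"&249": Node("&249", "subcategory", "camera")}),
})
# Second answer: name and picture of the same node.
second = Node("&245", "prod", children={
    "&246": Node("&246", "name", "canon"),
    "&250": Node("&250", "picture", "c.jpg"),
})

merged = merge(first, second)
labels = sorted(child.label for child in merged.children.values())
```

Without identifiers the two `name` nodes could not be recognised as the same node, which is the difficulty the later slide on node Ids refers to.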
Document Type Definitions (DTDs) are used to represent incompleteness
• Set of rules e → r, where e is an element name and r is a regular expression
• Set of trees satisfying a DTD d: tree(d)
• Shortcoming of DTDs: an element has a single definition, independently of the context
• Here the type of ad depends on the context
[Figure: dealer tree with children usedcar and newcar; the ad under usedcar has model and year, the ad under newcar has only model]
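Solution: specialization (decoupled tags)
• Specialized names adused and adnew
• h(adused) = h(adnew) = ad
[Figure: on the left, the dealer tree typed with the specialized names adused (model, year) under usedcar and adnew (model) under newcar; the mapping h sends it to the tree on the right, where both nodes are labelled ad]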
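The specialization idea can be sketched as a bottom-up check: compute, for each node, the set of specialized names it may carry. This is an illustrative sketch only. `RULES`, `H` and `possible_types` are hypothetical names, the rule bodies are a guess at the dealer example, and the check treats children as an ordered sequence because it reuses ordinary regular expressions.

```python
import re
from itertools import product

# Rules over specialized names. Each child name is followed by one space so
# that a regex over the string matches whole names only.
RULES = {
    "dealer": "usedcar newcar ",
    "usedcar": "(adused )*",
    "newcar": "(adnew )*",
    "adused": "model year ",
    "adnew": "model ",
    "model": "",
    "year": "",
}
# h maps specialized names to document tags; other names map to themselves.
H = {"adused": "ad", "adnew": "ad"}


def possible_types(tree):
    """Return the specialized names that a (tag, children) tree can take."""
    tag, children = tree
    child_types = [possible_types(child) for child in children]
    result = set()
    for name, regex in RULES.items():
        if H.get(name, name) != tag:
            continue
        # Try every combination of specialized names for the children.
        for choice in product(*child_types):
            word = "".join(t + " " for t in choice)
            if re.fullmatch(regex, word):
                result.add(name)
                break
    return result


good = ("dealer", [
    ("usedcar", [("ad", [("model", []), ("year", [])])]),
    ("newcar", [("ad", [("model", [])])]),
])
# A used-car ad without a year can only be typed adnew, which usedcar rejects.
bad = ("dealer", [
    ("usedcar", [("ad", [("model", [])])]),
    ("newcar", []),
])
```

Enumerating all combinations is exponential in the worst case; it keeps the sketch short and is not meant as an efficient validation procedure.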
DTDs + Specialization
• The sets of trees that can be specified: the regular unranked tree languages [Brüggemann-Klein, Murata, Wood]
• Same closure properties: intersection, union, complement
• Same complexity
Example
• Q1: name, subcat, price of electronic products with price less than $200
• Q2: name, pictures of cameras that have at least one picture
----------------------------
• Q3: name, price, pictures of cameras costing less than $150 and having at least one picture: can be completely answered using A1, A2
• Q4: list all cameras: can be partially answered using A1, A2
Q1: name, subcat, price of electronic products with price less than 200
[Figure: catalog tree after Q1. Known products: canon (120, elec, camera), nikon (199, elec, camera), sony (175, elec, cdplayer). Missing products are summarised by the types product1 and product2, each marked *]
Missing data after Q1
[Figure: two types, product1 and product2, each with children name, price, cat (with subcategory) and picture (*). One type describes products with cat = elec and price > 200, the other products with cat != elec]
Q2: name, pictures of cameras that have at least one picture
[Figure: catalog tree after Q1 and Q2. Known products: canon (120, elec, camera, c.jpg), nikon (199, elec, camera), sony (175, elec, cdplayer), akai (elec, camera, a.jpg). Missing products are summarised by the types product1, product2a, product2b and product2c]
Incomplete information
• Known information
  • Prefix of the real data tree
• Missing information
  • Extended tree type
  • Conditions on data values
  • Specializations, disjunctions
Known data and missing data
[Figure: the incomplete tree after Q1 and Q2, split into known data and missing data. The missing-data types product1, product2a, product2b, product2c and product3 carry conditions such as cat = elec, cat != elec, price > 200, subcategory != camera and "no picture"]
Complete answer to Q3
• Q3: name, price, pictures of cameras costing less than $150 and having at least one picture
• Can be fully answered using available information
• Need to check whether the answer is complete
[Figure: answer tree catalog, prod, with canon, 120, c.jpg]
Incomplete answer to Q4
• Provide known cameras
• Explain incompleteness
[Figure: answer tree listing the names akai, canon, nikon, sony, plus possibly more products, with price > 200 and no picture]
Completing answer to Q4
• It suffices to ask:
[Figure: query on product with children name, price (>200), cat (=elec, sub = camera) and picture marked 0]
Revisit the types
• DTD
• Conditions
• Specialization: same element name may have several types
• Not sufficient
• Need to extend the types again: disjunctions
[Figure: type product2b (*) with children name, price (>200), cat (=elec) with subcategory != camera, and picture]
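Disjunction
[Figure: two queries, Query1’ (vehicle, engine, data, with condition data=“….”) and Query2’ (vehicle, sail, description, with condition description=“….”), evaluated on a vehicle node &322 whose type has optional (?) children among engine, sail, data and description; one answer is marked Empty!]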
Disjunction continued
• Type of &322: vehicle1 + vehicle2
• The type of &322 cannot be described independently of that of the data below it
[Figure: types vehicle1 and vehicle2 with children among engine, sail, data and description]
Representation system (Imielinski and Lipski)
[Figure: commutative diagram. The representation of information T is mapped by rep to the set of possible worlds rep(T). The query q maps T to the representation of the result q(T), and maps rep(T) to the set of possible answers. The two paths agree: q(rep(T)) = rep(q(T))]
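Spelled out, the condition in the diagram reads as follows, assuming the usual convention that a query applied to a set of possible worlds is applied to each world in turn (that convention is not stated on the slide):

```latex
\[
  q(\mathrm{rep}(T)) \;=\; \{\, q(t) \mid t \in \mathrm{rep}(T) \,\},
  \qquad
  \mathrm{rep}(q(T)) \;=\; q(\mathrm{rep}(T)).
\]
```

In words: querying the representation and then expanding it into possible answers gives the same set as expanding into possible worlds first and querying each one.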
Representation System for PS-queries
• Incomplete tree T to represent $q_1^{-1}(A_1) \cap \dots \cap q_k^{-1}(A_k)$
• PS-query q
• q(T) can be computed in ptime (a representation of the answer can be computed in ptime)
Querying Incomplete Trees
• Given T and a query q, one can
  • Give in ptime the sure answers with respect to our current knowledge
  • Check in ptime whether query q can be fully answered
  • Generate in ptime queries to complete the answer
Comparison with IL
IL                            | This work
Relational model              | XML tree model
Relational calculus/algebra   | Weaker language (no join)
Conditional table             | Weaker system (no variable)
Closed or open world          | + Closed and open world
Representation system         | Representation system
Drawback: exponential blowup
• Incomplete information may become exponential w.r.t. the sequence of query/answer pairs q1/A1; q2/A2; …
[Figure: the type is database with children a and b, each with multiplicity 1; the query qi asks for database with a = i and b = i; all answers are empty]
Dealing with exponential blowup
• Make the representation more complex using disjunctions of types
  • Size of representation stays polynomial
  • Manipulations much more complex
• Restrict tree types and PS-queries
  • Already very (too?) simple
• Accept losing some information
• Ask extra queries to simplify the representation
Discussion: extend language
• Some results in the paper
• Extensions often lead to intractability
• E.g., k-pebble transducers [Milo, Suciu, Vianu], which somehow subsume XML-QL and XSL
  • No (known) representation system
  • Testing whether rep(T) is empty is non-elementary
Discussion: node Ids
• Without node Ids
  • Much less information to integrate results
  • More complex
  • Tedious case analysis
Discussion: ordering
• Ordering in XML, DTD, queries
• Problem is totally different and very complex
• Example:
  • Q1/A1: list of males; Q2/A2: list of females; Q3: list all
  • Depending on the type of the input:
    • (Male)*(Female)*: A3 = A1 || A2
    • (Male Female)*: A3 = shuffle(A1,A2)
    • (Male + Female)*: A3 cannot be determined
• Regular expression processing
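The three cases of the male/female example can be sketched as a small function. The name `combine` is hypothetical, and reading shuffle(A1,A2) as strict alternation for the type (Male Female)* is an assumption about what the slide intends.

```python
def combine(input_type, males, females):
    """Rebuild the ordered answer A3 from A1 (males) and A2 (females), if possible."""
    if input_type == "(Male)*(Female)*":
        # All males precede all females: plain concatenation.
        return males + females
    if input_type == "(Male Female)*":
        # Strict alternation, so both lists must have the same length.
        if len(males) != len(females):
            raise ValueError("answers are inconsistent with the input type")
        return [x for pair in zip(males, females) for x in pair]
    # (Male + Female)*: any interleaving is possible, the order is unknown.
    return None


a3_concat = combine("(Male)*(Female)*", ["m1", "m2"], ["f1"])
a3_shuffle = combine("(Male Female)*", ["m1", "m2"], ["f1", "f2"])
a3_unknown = combine("(Male + Female)*", ["m1"], ["f1"])
```

Returning `None` in the last case reflects that the cached answers fix the two lists but say nothing about how they interleave.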
Conclusion
• Framework for acquiring, maintaining, querying incomplete XML data
• Limitations:
  • Simple queries
  • No order and Id assumption
  • Small extensions lead to problems
• Possible to represent the incompleteness
• Possible to answer with incompleteness
• Possible to obtain queries to provide full answer