560 likes | 716 Views
Semistructured data: from practice to theory Serge Abiteboul INRIA & Xyleme SA Serge.Abiteboul@inria.fr Serge.Abiteboul@xyleme.com http://www-rocq.inria.fr/verso http://www.xyleme.com . Organization. Motivations XML Typing XML Querying XML XML and the Web Illustrations: 2 problems
E N D
Semistructured data -- June 2001 Semistructured data:from practice to theorySerge AbiteboulINRIA & Xyleme SASerge.Abiteboul@inria.fr Serge.Abiteboul@xyleme.comhttp://www-rocq.inria.fr/verso http://www.xyleme.com
Organization • Motivations • XML • Typing XML • Querying XML • XML and the Web • Illustrations: 2 problems • Incomplete information • Xyleme • Conclusion
Semistructured data -- June 2001 Motivations
Motivation: Complex data • Structure is irregular (missing/extra data…) • Schema does not exist or is unknown • Schema is rapidly evolving • Relational and ODB models are too rigid • Example: BibTex, HTML, SGML, XML, ASN.1, STEP/Express…
Ontology meta-data Mediator wrapper wrapper wrapper wrapper wrapper wrapper Source Source Source Source Source Source Complex data: mediation User Many data sources coming and going
Motivations: The Web today • Terabytes of data • Private web: not publicly available pages • Deep web: data hidden behind forms • A lot of public pages • Standard is a document/hypertext language HTML
Browsing Search engines in: list of words out: sorted list of URLs Applis: hand-made wrappers Expensive Incomplete Short-lived, not adapted to the Web constant changes The Web today [Raghavan ’00]
A new standard XML • HTML is not appropriate for data exchange on the Web • Standard database models are too constraining for the Web • The solution: a semistructured data model XML • Reminder: a data model consists of a type definition language, a query/update language + more
Semistructured data -- June 2001 The most successful semistructured data model: XML
The origin of XML • Parents • SGML • Relational and OO databases • SGML: markup language for documents • HTML and the Web: billions of pages • Not appropriate for data exchange • XML eXtensible Mark-up Language • W3C and most industrial companies [B2B] • Main idea: separate content and presentation • Use tags to represent structure and semantics
HTML XML comes from SGML – also hypertext language – semistructured data fixed number of tags – not fixed content and presentation – not mixed are mixed very difficult to extract data – much easier from a page old standard for the Web – new standard XML: documents + databases
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. Ref Name Price X23 Camera 359.99 R2D2 Robot 19350.00 Z25 PC 1299.99 Information System HTML HTML = Hypertext Language hard Text + presentation Where is the data ?
Ref Name Price X23 Camera 359.99 R2D2 Robot 19350.00 Z25 PC 1299.99 ... Information System XML = Semistructured Data <product-table> < product reference=”X23"> <designation> camera </designation> <price unit=Dollars> 359.99 </price> <description> … </description> </product> < product reference=”R2D2"> <designation> Robot </designation> <price unit=Dollars> 19350 </price> <description> … </description> ... </product-table> easy Data + Structure Semistructured: more flexible XML
XML: example <dealer> <UsedCars> <ad> <model>Honda</model> <year>96</year></ad> </UsedCars> <NewCars> <ad> <model>Acura</model> </ad> </NewCars> <NewCars> <ad> <model>R406</model> </ad> </NewCars> </dealer> dealer UsedCars NewCars NewCars ad ad ad model year model model Honda 96 Acura R406 It is just an unranked tagged ordered tree
XML • Tree or graph • Data and structure/semantics are mixed • Tags contain typing information • Core constructor is list of tag/value pairs • Details • Each node may have an arbitrary number of children with distinct or not tags • Nodes also have attributes that are unordered and unique per node • Standard means to represent cyclic data: Id Idrefs
XML Very active/noisy field - standards • types (DTD/XML schema), style-sheet (XSL), resource description (RDF...) • DOM, SAX… • WML (wap), MathML, SMIL (multimedia), RSS (news), RDF (metadata)... • How fast will XML conquer the web? • so far rather slow (about 1% now of the visible web; much more in intranets); accelerates (e.g., with Explorer 5.5)
Semistructured data -- June 2001 Typing XML
Typing XML • This is heresy for the freedom of the Web • Essential for data management: query optimization, user interfaces, applications • Differences with standard database typing • Collections are sequences instead of sets • Types may be very large (e.g., from integration) • Data is more irregular so types should be more permissive • New issues sometimes: you have the data, extract its type, an approximate type
Intuition : the type is a tree dealer • Semantics and structure are in paths • dealer/UsedCars/ad • dealer/UsedCars/ad/model UsedCars NewCars ad ad model year model text text text
DTD: a grammar Catalog Product* Product Name Price? Cat (Part Quantity)* Part BasicPart + ComposedPart BasicPart Pame ComposedPart Name (Part Quantity)* • Nice and simple • Shortcoming: type of an element is independent of its context
dealer dealer UsedCars NewCars UsedCars NewCars adused ad adnew ad model year model year model model More complex: specialization • Type of ad depends on its context • One way to view it: homomorphism
Regular tree automata • Set of accepted trees: regular tree languages • Definable in monadic second-order logic dealer q0 Acceptance: there is a computation such that all leaves are labeled qf Used New p q ad ad ad ad r r s s m y m y m m qf qf qf qf qf qf • variants: top/down bottom/up, nondeterminism, unranked trees
DTDs+specialization Result: DTDs+specialization = regular tree languages • Closure (intersection, union, complement) • Tests for validation, inclusion • Static analysis
Situation today • Many people are using DTDs • Nice and simple in spite of ugly syntax • New proposal: xml-schema • More powerful but too complicated? • Other proposals: Relax, Trex • Usually based on some kind of regular tree automata • From experience: one will win and not necessarily the best
Semistructured data -- June 2001 Query languages for XML
Query Languages for XML • Extensions of SQL • first-order-logic • Information retrieval keyword search • Navigation via regular expression + pattern matching Lorel, XML-QL, XMAS… • Structural recursion UnQL, XSLT… • No official winner – leader is Xquery
Tree with variables and constraints Pattern matching between the query and the data Each match provides a valuation for X,Y,Z Pattern matching catalog product X Y name price cat=elec <200 Z subcategory
Example in Lorel select <offer> Z/name, P/name, P’/price </offer> from P in catalog/product, Z in discount_stores/store, Z/storecatalog/product P’ where P/category=“camera” and P/make=“canon” and P’/id = P/id • Joins like in relational databases • Construction of complex results • Regular expressions for paths (e.g., W/*/name = “Gates”)
What is new in XML queries • A bit new: limited recursion (like in deductive databases) • A bit new but no big deal: constructed answers (like in OODB) • Very new: ordered data • Bothering • Theoretical base is a bit messy: FO, tree automata, bisimulation • No yardstick like relational calculus/algebra
Proposal : k-pebble transducers stack [milo,suciu,vianu]
root a c a a b b b a k-pebble transducers: result
Semistructured data -- June 2001 XML and the Web
Why it is the same old story • Massive amounts of data • Providers export data, users access data • Query languages, indexing, optimization • Database paradigm: still effective on the Web
Why it is not the same old story • Databases • rigid structure • transactions, concurrency control • data independence • controlled (e.g., known cost model) • coherent system, very • polished artifact • The Web • flexible, no schema • flexible protocols • fuzzy separation • perfect mess (and that’s why people like it?) • closer to a natural ecosystem!
The principles of the Web • The uncertainty principle: you can never be sure of anything or that the data is consistent • The incompleteness principle: they do not give you all the data you want (but some you don’t want :-) • The chaos principle: you can rarely assume the existence of some global schema • The instability principle: everything keeps changing Every piece of data you got is probably wrong, incomplete, does not conform to its expected type and is probably already stale
What can be reused? • Some technology? indexes, B-trees, distributed query processing (concurrency control and transactions not yet) • Database theory? little • Algebra and rewrite rules for optimization • Dependency theory • First order and other logics • Seems that because of the ordering, it opens the gates for many more tools such as regular/tree languages
Metaphor [AV]: the Web is infinite • What are the pages pointing to my homepage? • Google solution: milliseconds – stale data • Freeze the Web: weeks to get exact answer • Exact answer: no means to get it • Leads to reconsider the notion of computation
Computability • Finitely computable: give the answer in finite time • All pages reached from my HP in less than 3 links • Eventually computable: each solution is given in finite time; computation may be infinite • All pages reached from my HP • Not computable • Can my HP be reached starting from my HP? • Also: approximate, partial, stale, pipelined answers
Tough life: the Web is huge • Relational calculus/algebra: logspace data complexity (also AC0) • What is the data complexity of an Xquery of the Web? • Complexity of computing on the Web • Logspace in the Web? • Need to trade quality for performance
The Web keeps changing • Classical: versions, temporal queries • Less classical: monitoring of the Web [Xyleme] • Smart crawling of the Web: flow of docs • Query subscription: query on this flow • Continuous queries • What is the underlying theory?
Semistructured data -- June 2001 Illustration: incomplete information Work with Victor Vianu
Example Access to an electronic catalog Q1: name, subcat, price of electronic products with price less than $200 Q2: name, pictures of cameras at least pictured once
* product product product * missing product2 product1 canon 120 elec nikon 199 elec sony 175 elec camera camera cdplayer catalog Q1: name, subcat, price of electronic products with price less than 200
Missing data after Q1 product1 product2 * * name price cat picture name price cat picture =elec >200 !=elec subcategory subcategory
3 3 c.jpg akai a.jpg elec camera * product1 catalog product2 * product2b * product2c missing product product product product2a canon 120 elec nikon 199 elec sony 175 elec camera camera cdplayer Q2: name, pictures of cameras at least pictured once
product + product2a Missing data name pricecat picture =elec product1 >200 * subcategory no picture name price cat picture product3 !=elec no picture subcategory name price cat product2c elec product2b subcategory * namepricecat !=camera =elec >200 namepricecatpicture =elec >200 Known data subcategory subcategory !=camera
After two queries • Known information: • Prefix of the real data tree • Missing information • Complex type • Q3: name, price, pictures of cameras costing less than $100 and at least pictured once • can be completely answered using A1, A2 • Q4: list all cameras • can be partially answered using A1, A2
Semistructured data -- June 2001 Illustration: Xyleme
A dynamic warehouse of Web data • Warehouse • Xyleme stores huge quantities of data (teraB) • Xyleme is not a search engine (only index) or a mediator(only virtual data) • XML • Xyleme is focused on XML • Dynamic • Xyleme is interested in data evolution/changes
Technical Challenges 1. Data Acquisition and Maintenance discover data of interest and maintain it up to date • Repository store and index this data 3. Efficient query Processing 4. Semantic Integration provide a simple view of each semantic domain 5. Change Control Monitor the web and offer services such as Query Subscription