New Bases for New Data. Omar Benjelloun, Stanford University. January 27th, 2006.
Relational databases are great
• A simple, understandable model for data
• High-level, declarative language for queries and updates: SQL
• Efficient optimization techniques
• Relational databases are the cornerstone of the management of homogeneous, regular, exact, centralized information
… but data has changed
• Data is distributed, behind applications, dynamically changing
• Data is heterogeneous
• Data may be uncertain
• Today
  • Data is stored in relational databases (or XML)
  • Techniques for data integration, data exchange
  • … Lots of code
• Traditional Database Management Systems (DBMS’s) are too rigid
• New characteristics should be represented in the data
• New bases are needed
  • Foundations (models and languages)
  • Processing and optimization techniques
Applications
• Information integration
  • Data is distributed over multiple heterogeneous, independent sources
  • Conflicting information from the sources: inconsistency, uncertainty
  • Varying and evolving reliability of sources
  • Where data came from can be critical information
• Scientific data management
• Receptor (e.g., sensor) data management
• Data cleaning (entity resolution)
• And many others…
Agenda
• Distributed and dynamic data: Active XML
  • A “glue” language to connect data and programs
  • XML documents with embedded calls to Web services
  • Distributed interactions through the exchange of AXML data
  • Techniques to query and control the exchange of AXML data
• Uncertain data: ULDB’s
  • An extension of the relational model with uncertainty and lineage
  • Efficient query evaluation
  • Computing probabilities
• Conclusion
Distributed data management: information is everywhere
[Figure: Web services, XML data and services, data warehouses, databases, Web sites, and devices (PC, PDA, cell phones, home appliances, cars…) connected through the Internet]
The golden triangle of distributed data management
• XML
  • A standard for data representation & exchange
  • Extensible Markup Language
  • Labeled ordered trees
  • Rich types: XML Schema
• Query languages
  • XPath, XQuery
• Web services
  • Standards for distributed computing
[Figure: triangle labeled XML, XQuery/XPath, and SOAP/WSDL]
What is Active XML (AXML)?
• AXML is a declarative language for distributed information management
• and an infrastructure to support this language, in a peer-to-peer framework.
Active XML documents
• XML documents with embedded calls to Web services
• Intensional
  • Some of the data is given explicitly
  • Some is given intensionally (i.e., the means to acquire the data when needed are given)
• Dynamic
  • If the external sources change, the same document will provide different information
  • Reaction to world changes
Not a new idea in databases, nor on the Web
• Mixing calls and data is an old idea
  • Procedural attributes in relational systems
  • Basis of object-oriented databases
• In Web programming
  • Sun’s JSP, PHP+MySQL
• Calls to Web services inside documents
  • Macromedia FLEX, Apache Jelly, Microsoft XAML
• What is new is the exploitation of the idea…
Web services in brief
• A number of standards
  • XML
  • SOAP: Exchange of messages between applications
  • WSDL: Description of service interfaces (e.g., input/output types)
  • UDDI: Advertisement and discovery of services
  • … other proposed standards (choreography, security, etc.)
• For us: a means to provide, invoke and describe remote functions with XML input/output.
• They make AXML documents universally understandable.
A sample AXML document
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: tree view of the document: newspaper with children title (“Le Monde”), date (“06/10/2003”), GetTemp (city “Paris”), and GetEvents (“Exhibits”)]
• AXML documents may contain calls:
  • to any existing Web services (e-bay.net, google.com…)
  • to any AXML Web services (to be defined)
Materialization
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: the SOAP call to Yahoo.GetTemp returns <temp>16°C</temp>, which is added to the newspaper tree]
• Replacing the call by its result is not the only option
• Calls are not necessarily RPC-style synchronous invocations
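The simplest form of materialization, replacing a call by its result, can be sketched as follows. This is only an illustration: the `materialize` function, the `SERVICES` table and the stubbed `get_temp` are hypothetical names, and the stub stands in for a real SOAP invocation. Calls whose service is not available locally are left in the document untouched.

```python
import xml.etree.ElementTree as ET

AXML = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <call svc="TimeOut.GetEvents">exhibits</call>
</newspaper>"""

# Stand-in for a remote Web service (hypothetical; a real peer would send a SOAP request).
def get_temp(call):
    temp = ET.Element("temp")
    temp.text = "16°C"
    return [temp]

SERVICES = {"Yahoo.GetTemp": get_temp}

def materialize(root, services):
    """Replace each <call> whose service is available by the elements it returns."""
    for parent in list(root.iter()):
        # Walk children right to left so earlier indexes stay valid after edits.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag == "call" and child.get("svc") in services:
                results = services[child.get("svc")](child)
                parent.remove(child)
                for offset, element in enumerate(results):
                    parent.insert(index + offset, element)
    return root

doc = materialize(ET.fromstring(AXML), SERVICES)
print(ET.tostring(doc, encoding="unicode"))
```

Here the GetTemp call is replaced by a temp element, while the GetEvents call stays intensional because no implementation is registered for it.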
AXML Web services
• Parameters: AXML data
• Result: AXML data
• Distribute computations: by sending as parameters data containing service calls, one can delegate some work to other peers.
• Partial computations: by returning data containing service calls, one can give the receiver control of these calls.
• Great flexibility
Distributed interactions
To call or not to call?
[Figure: the newspaper document exchanged between a sender and a receiver, shown with its GetTemp call and with the materialized result <temp>16°C</temp>]
• Materialization can be performed
  • by the sender, before sending a document…
  • or by the receiver, after receiving it.
Why control the materialization of calls?
• For added functionality, e.g.
  • Intensional data lets the receiver get up-to-date information.
• For security reasons or capabilities, e.g.
  • I don’t trust this Web service/domain,
  • I don’t have the right credentials to invoke it,
  • It costs money,
  • Maybe the receiver doesn’t know Active XML!
• For performance reasons, e.g.
  • A proxy can invoke all the services on behalf of a PDA.
• … and many more reasons you can think of!
How to control it? Using types
[Figure: a sender and a receiver, each with its own capabilities, ACL, cost, …, agree on a data exchange schema]
• We extend XML Schema with intensional types: XML Schema_int
• Static analysis algorithms use signatures of services: WSDL_int
The extended schema language
To simplify, we use here a DTD-like syntax
• Data:
  • newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
  • title = data
  • date = data
  • temp = data
  • city = data
  • exhibit = title.(GetDate|date)
• Functions:
  • GetTemp(city) -> temp
  • GetEvents(data) -> (exhibit|performance)*
  • GetDate(title) -> date
• Rewriting: replace call(s) by an arbitrary output of the service.
Rewritings
• The goal: given
  • an AXML document d
  • a schema s,
  can we rewrite d so that it matches s?
• Safe rewriting: one that for sure leads to s (we know without making any call)
• Possible rewriting: one that may lead to s (depending on the answers of services)
Difficulties
• Infinite search space
  • Vertical
  • Horizontal
• Main problem
  • The result of a Web service call is unknown
  • We just know a signature (input/output types)
• We want a very efficient solution
• Foundations of the problem
  • String & tree automata,
  • with existential and universal transitions.
Results
• The general problem is undecidable [MSS03]
• Restrictions on the considered rewritings
  • Left-to-right: No “going back and forth”
  • K-depth: bound on the nesting of function calls (search space still infinite but finitely representable)
• Under these restrictions
  • We have algorithms to find safe/possible rewritings.
  • They are PTIME (for deterministic schemas).
  • We can also do it between schemas.
• Implementation
  • Demo at VLDB 2003 (customizable news syndication)
Safe rewriting algorithm (flavor)
• Build an FSA that accepts all k-depth rewritings of the initial word.
• Build an FSA that recognizes the complement of the target type.
[Figure: automaton q0…q7 for the k-depth rewritings of the word title date GetTemp GetEvents, and automaton p0…p6 for the complement of the target type]
Safe rewriting algorithm
• Compute the intersection of these languages:
  • A smart marking determines whether a safe rewriting exists.
  • Then run the word on the marked automaton to find an actual rewriting.
• Optimizations:
  • lazy construction of the automata
  • parallel evaluation of calls
[Figure: product automaton over state pairs (qi, pj), from (q0, p0) to pairs such as (q4, p4) and (q7, p6)]
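To make the difference between safe and possible rewritings concrete, here is a small sketch for the left-to-right, depth-1 case. It is a simplified game-style check, not the complement-and-intersection construction described in the slides: types are hand-written DFAs, and the helper names (`dfa`, `reachable`, `can_rewrite`) and the two target types are assumptions made for illustration. A call can be kept or invoked; a safe rewriting must succeed for every output allowed by the signature, a possible rewriting for at least one.

```python
def dfa(start, accept, edges):
    return {"start": start, "accept": set(accept), "delta": dict(edges)}

def reachable(target, out, q):
    """Target states reachable from q by reading any word the output type accepts."""
    seen, stack, result = {(q, out["start"])}, [(q, out["start"])], set()
    while stack:
        tq, oq = stack.pop()
        if oq in out["accept"]:
            result.add(tq)
        for (state, symbol), nxt in out["delta"].items():
            if state == oq:
                pair = (target["delta"].get((tq, symbol)), nxt)  # None = dead state
                if pair not in seen:
                    seen.add(pair)
                    stack.append(pair)
    return result

def can_rewrite(word, target, signatures, safe):
    quantifier = all if safe else any  # every output vs. some output
    def go(q, i):
        if q is None:
            return False
        if i == len(word):
            return q in target["accept"]
        symbol = word[i]
        if go(target["delta"].get((q, symbol)), i + 1):  # keep the symbol as is
            return True
        if symbol in signatures:  # invoke the call and read its output
            states = reachable(target, signatures[symbol], q)
            return quantifier(go(q2, i + 1) for q2 in states)
        return False
    return go(target["start"], 0)

# Signatures from the slides: GetTemp -> temp, GetEvents -> (exhibit|performance)*
signatures = {
    "GetTemp": dfa("s0", {"s1"}, {("s0", "temp"): "s1"}),
    "GetEvents": dfa("s0", {"s0"}, {("s0", "exhibit"): "s0", ("s0", "performance"): "s0"}),
}
prefix = {("p0", "title"): "p1", ("p1", "date"): "p2", ("p2", "temp"): "p3"}
# Assumed target types: title.date.temp.(GetEvents|exhibit*) and title.date.temp.exhibit*
lenient = dfa("p0", {"p3", "p4", "p5"}, {**prefix, ("p3", "GetEvents"): "p4",
              ("p3", "exhibit"): "p5", ("p5", "exhibit"): "p5"})
strict = dfa("p0", {"p3"}, {**prefix, ("p3", "exhibit"): "p3"})

word = ["title", "date", "GetTemp", "GetEvents"]
print(can_rewrite(word, lenient, signatures, safe=True))
print(can_rewrite(word, strict, signatures, safe=True))
print(can_rewrite(word, strict, signatures, safe=False))
```

With the lenient target, invoking GetTemp and keeping GetEvents is safe. With the strict target, GetEvents must be invoked and may return a performance, so the rewriting is only possible, not safe.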
Querying AXML Data
[Figure: newspaper document tree with elements title (“Le Monde”), city, temp (“19°C”), exhibits, and calls GetTemp, GetEvents, GetExhibits, getDate]
• Given a (tree pattern) query:
  • /newspaper[temp > 18°C]/exhibits//exhibit[location="Le Louvre"]
• Materialize the document?
• Call only the services that may contribute data to the query answer.
• The problem: lazy evaluation of service calls
  • To call or not to call, this time when evaluating a query
Lazy evaluation
• Difficulties:
  • Calls can be found everywhere in the document
  • May appear dynamically (as a result of previous calls)
  • May become (ir)relevant due to previous invocations
  • Need to take signatures of calls into consideration
• A possible approach: modify the query processor
  • Top-down evaluation
  • Trigger the calls found on the way
• Not so great:
  • Computation is blocked
  • Optimization opportunities are lost
NFQ’s
• Given a query to evaluate:
  • Derive a set of “node-focused” queries (NFQ’s) that find the relevant calls when evaluated on the document.
  • They need to be reevaluated as the document evolves!
[Figure: tree patterns derived from the query, over newspaper, temp (> 18°C), exhibits, exhibit, location (“Le Louvre”), with wildcard (*) nodes, etc.]
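The notion of a relevant call can be illustrated with a small traversal. This is not the NFQ derivation itself, only a sketch of the selection it achieves, under strong assumptions: the query is a linear path without predicates, calls nested in call parameters are ignored, and the document, the `outputs` table of result labels, and the function names are all hypothetical. A call is kept when the path to its parent matches a proper prefix of the query and, if its signature is known, its output could match the next step.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""<newspaper>
  <title>Le Monde</title>
  <call svc="GetTemp"><city>Paris</city></call>
  <exhibits><call svc="GetExhibits">Paris</call></exhibits>
  <sports><call svc="GetScores"/></sports>
</newspaper>""")

# Query /newspaper/exhibits//exhibit as (axis, label) steps; predicates are ignored.
QUERY = [("child", "newspaper"), ("child", "exhibits"), ("desc", "exhibit")]

def advance(positions, label, steps):
    """Query positions still alive after descending into an element named label."""
    alive = set()
    for i in positions:
        if i < len(steps):
            axis, name = steps[i]
            if name == label:
                alive.add(i + 1)
            if axis == "desc":
                alive.add(i)  # '//' may skip any number of levels
    return alive

def relevant_calls(element, steps, outputs, positions=frozenset({0})):
    """List the services whose results may contribute to the query answer."""
    here = advance(positions, element.tag, steps)
    found = []
    for child in element:
        if child.tag == "call":
            produced = outputs.get(child.get("svc"))  # None = unknown signature
            if any(i < len(steps) and (steps[i][0] == "desc" or produced is None
                                       or steps[i][1] in produced) for i in here):
                found.append(child.get("svc"))
        else:
            found.extend(relevant_calls(child, steps, outputs, here))
    return found

print(relevant_calls(DOC, QUERY, {"GetTemp": {"temp"}}))
print(relevant_calls(DOC, QUERY, {}))
```

Knowing that GetTemp only returns temp elements lets the sketch discard it, which mirrors filtering by type analysis; without that signature the call has to be kept conservatively.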
Optimizations
• Service calls sequencing
  • Analysis of the relationship between calls (through the NFQ’s)
  • Layering, and parallelization inside each layer.
• Filtering by type analysis
  • Match output types of services to the data expected by queries
• “Pushing” queries to capable services
• Acceleration:
  • Via relaxation:
    • NFQ approximation
    • Superset of the relevant calls
  • Via a special access structure, similar to a DataGuide:
    • Restricted to paths that lead to service calls
    • Indexes the calls
• Experimental assessment
  • 10x speed-up when combining optimizations
There is more…
• The AXML peer system
  • Manages persistent AXML documents
  • Provides AXML services
  • Open source
• Language extensions to control the activation of calls
• Continuous services
• Theoretical foundations
• …check out http://www.activexml.net
Basic Premise
• Traditional relational DB
  • Every data item’s value must be exact
  • Every data item is in the database or not
  • Where data came from and how it evolves is not important
• ULDB’s relax these constraints by making
  • Data
  • Uncertainty
  • Lineage
  all first-class interrelated concepts
Previous work
• Models for uncertainty
  • Labeled nulls, c-tables, probabilistic models,...
  • Trade-off between
    • Expressiveness
    • Simplicity of representation, complexity of operations
  • We investigated this space in [DBHM06]
• Models for lineage
  • In relational databases, data warehouses
  • Definition of lineage can be tricky for complex queries
• First to consider lineage together with uncertainty
Uncertainty
[Figure: example relation illustrating x-tuples, alternates, and ‘?’ (maybe) annotations]
• Possible worlds: [figure]
• Simple formalism
  • not complete
  • not closed under joins
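The possible-worlds reading of x-tuples can be sketched in a few lines. The relation and its contents are hypothetical, as are the dictionary layout and the `possible_worlds` name: each x-tuple contributes exactly one of its alternates, or nothing at all if it carries a ‘?’.

```python
from itertools import product

# Hypothetical x-relation Saw(witness, car): each x-tuple lists its mutually
# exclusive alternates and a flag telling whether it carries a '?' (maybe).
saw = [
    {"alternates": [("Amy", "Mazda"), ("Amy", "Toyota")], "maybe": True},
    {"alternates": [("Betty", "Honda")], "maybe": False},
]

def possible_worlds(x_relation):
    """Enumerate every possible world as a list of ordinary tuples."""
    choices = []
    for x_tuple in x_relation:
        options = [[alt] for alt in x_tuple["alternates"]]
        if x_tuple["maybe"]:
            options.append([])  # '?' means the x-tuple may be absent
        choices.append(options)
    return [sum(combo, []) for combo in product(*choices)]

worlds = possible_worlds(saw)
for world in worlds:
    print(world)
```

The first x-tuple has three options (two alternates or absence) and the second has one, so this relation stands for three ordinary relations.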
Lineage
[Figure: lineage example over (witness, suspect) data]
ULDB’s
[Figure: example ULDB relations in which several x-tuples are marked ‘?’]
Properties
• ULDB’s are simple
  • x-tuples: set of alternate tuples, with or without ‘?’
  • lineage: associates with each alternate a set of alternates / external symbols
• ULDB’s are expressive
  • Complete: can represent any finite set of possible worlds (with lineage)
  • Simple implementation of monotonic queries, with correct lineages
  • Natural probabilistic extension
• ULDB’s are efficient
  • Query processing can use existing query optimizers
  • Tuple certainty/membership can be tested in polynomial time
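Querying ULDB’s
• Query semantics, defined through possible worlds:
  • A ULDB D represents relational databases (with lineage) D1, D2, …, Dn
  • Applying Q to each world gives Q(D1), Q(D2), …, Q(Dn)
  • Q(Di): add the query result as a new relation, with its lineage, to Di
• Algorithm: compute Q(D) directly on the ULDB D, so that its possible worlds are Q(D1), Q(D2), …, Q(Dn)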
Algorithm
[Figure: worked example of the algorithm; recoverable labels: tuples (Granny, BMW), (Kid, Ford), (Kid, Mike), columns witness and suspect, and ‘?’ annotations]
Properties
• Efficient algorithm
  • Query processing phase can use standard query optimizer
  • Lineages are easy to propagate
  • “Grouping” phase requires a single pass on the result
• Initial prototype
  • represents a ULDB as a relational DB
  • uses simple query rewriting techniques
• Algorithm works for any monotonic query (including SPJU queries)
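The two phases mentioned above, ordinary query processing over alternates followed by a single grouping pass, can be sketched for one join. Everything here is an assumption made for illustration: the relations Saw and Drives, their contents, the alternate identifiers of the form (relation, x-tuple, alternate), and the function names. The sketch leaves out ‘?’ annotations on the result and relies on lineage alone to tie each result alternate to the base alternates it derives from.

```python
from collections import defaultdict

# Hypothetical ULDB relations: a list of x-tuples, each a list of alternates.
saw = [[("Amy", "Mazda"), ("Amy", "Toyota")], [("Betty", "Honda")]]
drives = [[("Jimmy", "Mazda")], [("Billy", "Honda"), ("Frank", "Honda")]]

def flatten(name, x_relation):
    """View an x-relation as ordinary rows keyed by an alternate id (name, i, j)."""
    return [((name, i, j), alt)
            for i, x_tuple in enumerate(x_relation)
            for j, alt in enumerate(x_tuple)]

def join_suspects(saw, drives):
    # Phase 1: ordinary join over alternates, recording lineage for each result row.
    rows = []
    for sid, (witness, car) in flatten("Saw", saw):
        for did, (person, driven) in flatten("Drives", drives):
            if car == driven:
                rows.append(((witness, person), (sid, did)))
    # Phase 2: one pass grouping result rows by the x-tuples they derive from.
    groups = defaultdict(list)
    for value, (sid, did) in rows:
        groups[(sid[:2], did[:2])].append({"value": value, "lineage": {sid, did}})
    return list(groups.values())

suspects = join_suspects(saw, drives)
for x_tuple in suspects:
    print(x_tuple)
```

Phase 1 is a plain relational join, which is why a standard optimizer could handle it; only the grouping step is specific to x-tuples.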
Probabilistic ULDB’s
[Figure: example with alternates labeled with probabilities 0.3, 0.2, 0.5, 0.3, 0.7]
• Semantics: As before, with a probability for each possible world
• Without lineages
  • Alternates of the same x-tuple correspond to disjoint events
  • Alternates of different x-tuples correspond to independent events
• Lineages
  • Capture correlations
  • Help propagate probabilities for query results
Probabilistic query answering
• Compute queries as before
• Compute probabilities on demand
  • Traverse lineages transitively to the leaves
  • Combine probabilities of reached alternates
  • Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’
[Figure: lineage graph over ‘?’ x-tuples whose alternates carry probabilities 0.2, 0.3, 0.4, 0.1, 0.3, 0.5, 1]
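The on-demand computation can be sketched as follows, assuming purely conjunctive lineage (each derived alternate exists exactly when all alternates in its lineage exist). The identifiers, probabilities and relation names are hypothetical, and memoization is applied to the sets of reached base alternates rather than to probabilities. Base alternates from different x-tuples are multiplied as independent events; two alternates of the same x-tuple are disjoint, so their conjunction has probability 0.

```python
from functools import lru_cache
from math import prod

# Hypothetical data. Alternate ids are (relation, x_tuple, alternate).
base_prob = {
    ("Saw", 0, 0): 0.5, ("Saw", 0, 1): 0.3,
    ("Drives", 0, 0): 0.7,
    ("Drives", 1, 0): 0.4,
}
lineage = {
    ("Suspects", 0, 0): frozenset({("Saw", 0, 0), ("Drives", 0, 0)}),
    ("Wanted", 0, 0): frozenset({("Suspects", 0, 0), ("Drives", 1, 0)}),
    ("Odd", 0, 0): frozenset({("Saw", 0, 0), ("Saw", 0, 1)}),
}

@lru_cache(maxsize=None)
def leaves(alt):
    """Base alternates reached by following lineage transitively (memoized)."""
    if alt not in lineage:
        return frozenset({alt})
    return frozenset().union(*(leaves(parent) for parent in lineage[alt]))

def probability(alt):
    base = leaves(alt)
    x_tuples = {(rel, i) for rel, i, _ in base}
    if len(x_tuples) < len(base):
        return 0.0  # two alternates of one x-tuple are disjoint events
    return prod(base_prob[b] for b in base)  # distinct x-tuples are independent

for alt in lineage:
    print(alt, probability(alt))
```

Collecting the leaves as a set before multiplying matters: a base alternate reached along several lineage paths is counted once, which is how lineage captures correlations between results.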
Future work
• Richer queries
  • Duplicate elimination, difference, aggregation
  • Supported through new kinds of lineages (e.g., disjunctive, negative)
  • Querying the uncertainty and the lineage
• More operations
  • Updates (and their lineage), close to versioning
  • “Uncertain operations”, e.g., entity resolution, inconsistency repairs
• More optimization techniques
• More theory
New “Bases” for new data
• The database way
  • Simple models
  • Declarative languages
  • Optimization techniques
• … for new features of data
  • Distribution and decentralization: Active XML
  • Uncertainty and lineage: ULDB’s
• There are more challenges
  • Real-world side effects, semantic reasoning
• and strong requirements
  • security, privacy, personalization
• Big challenge: Doing it all in a coherent way
  • One “big” model?
  • Integration of models?