Calculus and algebra for distributed data management Serge Abiteboul, INRIA-Futurs and Univ. Paris 11 Serge Abiteboul - Stacs 2007
Outline • Introduction • Thesis • Logic for distributed data management • Algebra for distributed data management • Conclusion
Success stories after the Internet bubble • Google: Web index • BearShare, etc.: music • Amazon: book catalogue • YouTube: videos • eBay: product catalogue • Flickr: pictures • Wikipedia: dictionary • del.icio.us: annotations • Mapquest: maps • MySpace: Web pages • In France: Meetic (dating database) and Kelkoo (comparative shopping) • They are all about publishing some database
The trends: peer-to-peer and interactivity • Switch from centralized servers to communities and syndication • Peer-to-peer: A large and varying number of computers cooperate to solve some particular task without any centralized authority • seti@home; kazaa; cabal • Interactivity and Web 2.0 • Motivations: Social, organizational
Content sharing community: the data ring • Joint work with Alkis Polyzotis (UCSC) • Content sharing community: A group of users that share and query information within some domain • Shared information is heterogeneous, distributed, and dynamic • Users are not database savvy • Based on a large body of previous research • Each peer exports data or services • The ring supports declarative queries over the shared resources • Challenge: Enable non-experts to easily create and maintain content sharing communities
The data ring is self-administered • No experts • The users of the system, e.g., scientists, are not experts • No central authority that can be responsible for administration • No centralized servers • Requirements • Ease of deployment (zero-effort) • Ease of administration (zero-effort) • Ease of publication (epsilon-effort) • Ease of exploitation (epsilon-effort) • Participation in community building, notably via annotations • [Image caption: Happy info admin]
What should be made automatic • Self-statistics from the monitoring of the data ring • Logs and statistics on system operation • Models of system performance • Self-tuning based on the self-statistics • Enrichment of physical layer with access structures • Decide to install access structures: indexes, views, etc. • Control replication of data and services • Self-healing • Recovery from peer and network failures • Recovery from unexpected anomalies • Monitoring and surveillance • And automatic file management
What is a peer? • Any connected device or software with some information to share • Examples: a mainframe database, a file system, a Web server, a PC, a PDA, a telephone, a sensor, a home appliance, a car, a manufacturing tool, telecom equipment, a toy, another data ring
Why P2P? • It is easy to get access to lots of processing power • CPU, disk, memory, network • Hardware is cheap • Lots of available hardware that is not used most of the time • What can we do with this processing power? • Simulate life (cell, heart, gene, etc.), climate, etc. • Build new services with all the information available on the net • Advantages of P2P: performance, scalability, availability, cost • Disadvantages: complexity, updates and transactions, quality of service, access rights
Examples • Personal & family data management • PDA, phone, PC, home appliance, car, TV… • Data management in a scientific group • Experiments and simulations generate huge quantities of data • Google search in P2P • Taxonomy • Volume of information • Number & volatility of peers • Quality of service
To do what? Answer queries precisely • Query: what is the email of the prime minister of France? • Yesterday’s Web: a human asks the query, gets a list of pages and browses them to find the answer • Tomorrow’s Web: • To: ? France’s prime minister ? • my Webmail finds • DominiqueDeVillepin@premier.gouv.fr • How: with more semantics • The government’s Web site should specify the meaning of its Web pages and services
The semantic Web • Semantics is essential for the Web • This aspect will be ignored here • We talk here about the “easy” part
Web support for distributed data management • Data exchange format = XML (labeled, unranked, ordered trees) • Distributed computing protocol = Web services • Query languages = XPath and XQuery • Knowledge representation = OWL or RDF/S • [Diagram labels: OWL, RDFS, XML, SOAP, WSDL, XQuery, XPath]
Uniform access to information… …the dream for distributed data management
A standard for XML queries: XQuery • A “logic” for labeled, ordered, unranked trees • A declarative language • Inspired by SQL: standard for relational data • Inspired by OQL: standard for object databases • Functional, like OQL • Not as clean • Mixes structure and content (information retrieval) • Give me the documents where the word XML appears in the title • Some full-text extension is coming • Also an update language • A language for XML in a centralized repository, not for distributed data management
The success of databases • Main impact of mathematical logic in computer science • Slogan: First-order logic on everybody’s desk • A huge industry (Oracle server, IBM DB2, MS Access…) • Crux: specify your needs declaratively, not by some complicated code • Easier to specify • Cleaner code • Optimizable queries • [Diagram labels: First-order logic, Tarski/Codd’s algebraization, Rewrite-based optimization, Relational systems]
We should do similarly for distributed information management! • The success of the relational model, i.e., of 2D-tables on a server: • A logic for defining tables • An algebra for describing query plans over tables • By analogy, we need for trees in a P2P system • A logic for defining distributed tree data and data services • An algebra for optimizing queries over trees/services • XQuery is fine for local XML processing and publishing but not for distributed data management • On-going work: ActiveXML
Guidelines for logic and algebra • Manage trees in a distributed setting • Mention explicitly the topology if desired • Ignore it if preferred • Support for streams • Essential for subscription services • Also necessary to support recursion • Handle both extensions and intensions • Extensional information: e.g., documents and XML pages • Intensional information (views): Web services • Seamless transition between them • Looking in a document (a Web page) • Calling a database (a Web service)
Active XML: a logic for distributed data management • Joint work with Omar Benjelloun (Google), Tova Milo (Tel Aviv) and many others
The basis • AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework • Simple idea: XML documents with embedded service calls • Intensional data • Some of the data is given explicitly whereas for some, its definition (i.e. the means to acquire it when needed) is given • Dynamic data • If the data sources change, the same document will provide different information
Example (omitting syntactic details)
<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <sc> Unisys.com/snow("Aspen") </sc>
    <depth unit="meter">1</depth>
    <hotels ID="AspHotels"> …. Yahoo.com/GetHotels(<city name="Aspen"/>) </hotels>
  </resort>
  …
</resorts>
• May contain calls • to any SOAP Web service: e-bay.net, google.com… • to any AXML Web services • to be defined
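The idea of a document with embedded service calls can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AXML implementation: the `service` attribute on `<sc>`, the `SERVICES` registry, and the local functions `snow` and `get_hotels` (which stand in for remote Web services and return made-up data) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical local stand-ins for remote Web services; a real peer would call them over the network.
def snow(resort):
    e = ET.Element("snow")
    e.text = "powder"            # made-up answer
    return [e]

def get_hotels(city):
    hotels = []
    for name in ("Hotel A", "Hotel B"):   # made-up answers
        h = ET.Element("hotel")
        h.text = name
        hotels.append(h)
    return hotels

SERVICES = {"Unisys.com/snow": snow, "Yahoo.com/GetHotels": get_hotels}

# Assumed encoding: the service name is an attribute, the argument is the text of <sc>.
DOC = """<resorts state="Colorado">
  <resort>
    <name>Aspen</name>
    <sc service="Unisys.com/snow">Aspen</sc>
    <depth unit="meter">1</depth>
    <hotels ID="AspHotels"><sc service="Yahoo.com/GetHotels">Aspen</sc></hotels>
  </resort>
</resorts>"""

def materialize(root):
    """Replace each <sc> element by the elements its service returns (one pass only)."""
    for parent in list(root.iter()):
        # Walk children from last to first so insertions do not shift pending indices.
        for i, child in reversed(list(enumerate(list(parent)))):
            if child.tag == "sc":
                results = SERVICES[child.get("service")]((child.text or "").strip())
                parent.remove(child)
                for offset, r in enumerate(results):
                    parent.insert(i + offset, r)
    return root

root = materialize(ET.fromstring(DOC))
print(ET.tostring(root, encoding="unicode"))
```

Before `materialize` runs, the document is intensional (it says how to get the snow report and the hotels); afterwards it is extensional. Results that themselves contain calls are not expanded by this single pass.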
ActiveXML: XML documents with embedded service calls • [Diagram: peers p1 and p2 holding Music documents (Music@p1, Music@p2, Music@p3) and trees r1@p1 and r2@p2 rooted at r, with entries labeled s, t, and at]
Marketing Philosophy • Active answer = intensional and dynamic and flexible • Embedding calls in data is an old idea in databases • Manon: What’s the capital of Brazil? Dad: Let’s ask Wikipedia.com! • Manon: How do I get a cheap ticket to Galapagos? Dad: Let’s place a subscription on LastMinute.com! • Manon: What are the countries in the EC? Dad: France, Germany, Holland, Belgium, and um… Let’s ask YouLists.com for more!
What is an AXML peer? Any connected device or software with some information to share
A key issue: call activation • When to activate the call? • Explicit pull mode: active databases • Implicit pull mode: deductive databases • Push mode: query subscription • What to do with its result? • How long is the returned data valid? • Mediation and caching • Where to find the arguments? • Under the service call: XML, XPath or a service call
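A minimal Python sketch of two of these activation modes, together with a validity period for cached results. The class `ServiceCall`, its `validity` parameter, and the `snow_depth` function are hypothetical names introduced for illustration; push mode (subscriptions) is not modeled here.

```python
import time

class ServiceCall:
    """An embedded call whose result stays valid for `validity` seconds (hypothetical model)."""

    def __init__(self, service, args, validity, clock=time.monotonic):
        self.service = service
        self.args = args
        self.validity = validity
        self.clock = clock
        self._result = None
        self._fetched_at = None

    def activate(self):
        """Explicit pull: call the service now and cache the answer."""
        self._result = self.service(*self.args)
        self._fetched_at = self.clock()
        return self._result

    def value(self):
        """Implicit pull: activate only when no valid cached answer exists."""
        expired = (self._fetched_at is None
                   or self.clock() - self._fetched_at >= self.validity)
        return self.activate() if expired else self._result


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
activations = []

def snow_depth(resort):
    activations.append(resort)
    return len(activations)      # stands in for a changing remote answer

call = ServiceCall(snow_depth, ("Aspen",), validity=10, clock=lambda: now[0])
first = call.value()     # no cached answer yet: the service is called
second = call.value()    # still valid: served from the cache
now[0] = 11.0
third = call.value()     # validity expired: the service is called again
```

The validity period is one simple answer to "how long is the returned data valid?"; a real system could instead let the service state the validity of each answer.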
Another key issue: what to send? • Send some AXML tree t • As result of a query or as parameter of a call • The tree t contains calls; do we have to evaluate them? • If we do, we may introduce new service calls; do we have to evaluate all these calls before transmitting the data? • Example: Hi John, what is the phone number of the Prime Minister of France? • Find his name at whoswho.com then look in the phone dir • Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr • (33) 01 56 00 01
A nice problem: casting • Given an ActiveXML document d (with Web service calls) • Given a type t, can we cast d to t? • Alternation of states where we pick the next service to call and states where the adversary chooses the answer • Undecidable in general • Very efficient casting based on unambiguous grammars • Related work: Active Context-free Games [MuschollSegoufinSchwentick04]
Active XML: a cool idea & some complex problems • Blasphemous claim: ActiveXML is the proper paradigm for data exchange! • Not XML + not XQuery • Brings to a unique setting distributed db, deductive db, active db, stream data warehousing, mediation • This is unreasonable? Yes! • Plenty of work ahead… to make it work • But first, the algebra
Active XML algebra for distributed data management • Joint work with Ioana Manolescu (INRIA-Saclay)
Motivation • Relational model: centralized tables • optimization: algebraic expression and rewriting • Active XML model: distributed trees • optimization: algebraic expression and rewriting • Distributed query optimization based on algebraic rewriting of Active XML trees • Based on experiences with AXML optimization
Why an algebra? • Specify a query declaratively • Compile it into a distributed query plan • Optimize the query plan in a distributed manner • Exchange query plans between peers • Example: title of songs by Carla Bruni? • ActiveXML algebra
We focus on positive AXML • Set-oriented data • Positive/monotone services • Services = tree-pattern-query-with-join queries • Services produce streams • Optimized by a local query optimizer • Evaluated by a local query processor • Out of our scope • [Diagram: Active XML peers; local query processing (π, join, π) over two input streams, producing an output stream]
The problem • An AXML system • A set of peers • For each peer a set of documents and services • Extensional data is distributed • Intensional data (knowledge) is distributed • Defined using query services (TPQJ queries) • These services are generic: any peer can evaluate a query • A query q to some peer • Evaluate the answer to q with optimal response time
The AXML algebra • Captures distributed XML query processing/optimization • Based on a communication model à la CCS • Algebraic, stream-oriented • Orthogonal to the local XML query optimizer • Orthogonal to the network support (DHT, small world etc.) • What is not yet available? A cost model and heuristics
AXML algebra • (AXML) algebraic expressions • [Diagram of expression forms, with node labels l, s@p, send@p, #n@p’, eval@p, receive@p, and d@p over subexpressions E1, E2, …, En; a label “AXML logic” also appears] • Each such expression lives at some peer • Includes the AXML trees
Annotations on algebraic expressions • Executing service call: • Terminated service call: • Subtlety • q@p(5): definition of intensional data • eval(q@p(5)): request to evaluate it; during query optimization • q@p(5) annotated as executing: query is being evaluated; during query processing • q@p(5) annotated as terminated: query evaluation is complete
Evaluation rules: local rules • [Diagram: rewriting rules for eval@p over a node labeled l with subtrees t1, t2, …, tn and over a local call s@p with arguments t1, t2, …, tn, stated for l ≠ sc and s ≠ send, receive]
Evaluation rules: transfer rules • [Diagram: eval@p over a call s@p’(t1, t2, …) is rewritten into a receive@p at node #x@p together with a new root at p’ (newRoot()@p’) holding eval@p’ over send@p’ to #x@p of s@p’(t1, t2, …)] • Site p asks p’ to do the work and send the result to p
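The delegation expressed by the transfer rule can be simulated with two in-memory peers. This is a rough sketch under stated assumptions rather than the algebra itself: the `Peer` class, its `inbox` queue (standing in for the network), the message tuples, and the `songs` service with its made-up answer are hypothetical.

```python
from collections import deque

class Peer:
    def __init__(self, name, network):
        self.name = name
        self.network = network    # shared dict: peer name -> Peer
        self.services = {}        # service name -> callable
        self.inbox = deque()      # pending messages (stands in for the network)
        self.nodes = {}           # node id -> list of results received so far
        network[name] = self

    def eval_call(self, service, owner, args):
        """eval@p of a call s@owner(args): run it locally or delegate it to the owner."""
        if owner == self.name:
            return self.services[service](*args)
        node_id = f"#x{len(self.nodes)}@{self.name}"
        self.nodes[node_id] = []  # plays the role of receive@p at node #x@p
        # Ask the owner to evaluate the call and send the result back to node_id.
        self.network[owner].inbox.append(("eval_send", node_id, self.name, service, args))
        return node_id

    def step(self):
        """Process one pending message; return False if there was none."""
        if not self.inbox:
            return False
        kind, node_id, dest, *rest = self.inbox.popleft()
        if kind == "eval_send":   # plays the role of eval@p'(send@p'(#x@p, s@p'(...)))
            service, args = rest
            result = self.services[service](*args)
            self.network[dest].inbox.append(("data", node_id, dest, result))
        elif kind == "data":      # the result arrives at the receiving node
            (result,) = rest
            self.nodes[node_id].append(result)
        return True


net = {}
p, p2 = Peer("p", net), Peer("p2", net)
p2.services["songs"] = lambda singer: [f"{singer} song 1"]   # made-up answer
node = p.eval_call("songs", "p2", ("Bruni",))
pending = list(p.nodes[node])     # nothing received yet
p2.step()                         # p2 evaluates and sends
p.step()                          # p receives
```

Because `eval_call` returns immediately with a node identifier, peer p is free to continue other work until the data arrives, which matches the asynchronous reading of the rule.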
Synchronous • [Diagram: exchange between PEER P and PEER P’ using the nodes of the transfer rule: eval@p, receive@p, #x@p, newRoot()@p’, eval@p’, send@p’, s@p’(t1, t2, …)]
Asynchronous • [Diagram: exchange between PEER P and PEER P’ using the nodes of the transfer rule: eval@p, receive@p, #x@p, newRoot()@p’, eval@p’, send@p’, s@p’(t1, t2, …)]
Simulation of asynchronous communications • [Diagram: PEER P, NETWORK, and PEER P’, using the nodes of the transfer rule: eval@p, receive@p, #x@p, newRoot()@p’, eval@p’, send@p’, s@p’(t1, t2, …)]
Evaluation • Reminder: setting • An AXML system • A request to evaluate query q at peer p: eval@p( q ) • Rewrite the trees in peer workspaces until termination of the process • Results • For positive AXML, this process converges… to a possibly infinite state • This process computes the answer to q • May be fairly inefficient: need for optimization!
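The "rewrite until termination" process can be illustrated by a naive fixpoint over set-oriented data with monotone services. This is a simplified sketch, assuming documents are plain Python sets and services are functions that only ever add facts; the names `evaluate`, `docs`, and `calls`, and the reachability example, are hypothetical and do not come from the slides.

```python
def evaluate(documents, calls, max_rounds=1000):
    """Re-run every call until no call produces a new fact.

    documents: dict mapping a document name to a set of facts
    calls: list of (target document, service), where service(documents) returns facts
    """
    for _ in range(max_rounds):
        changed = False
        for target, service in calls:
            new = set(service(documents)) - documents[target]
            if new:
                documents[target] |= new   # monotone: facts are only added
                changed = True
        if not changed:
            return documents
    # The limit may be infinite, so the loop is bounded.
    raise RuntimeError("no fixpoint reached within max_rounds")


# Example: "reach" is defined recursively from "edge" by two positive services.
docs = {"edge": {(1, 2), (2, 3), (3, 4)}, "reach": set()}
calls = [
    ("reach", lambda d: d["edge"]),
    ("reach", lambda d: {(a, c) for (a, b) in d["reach"]
                         for (b2, c) in d["edge"] if b == b2}),
]
result = evaluate(docs, calls)
```

Every round re-evaluates every call from scratch, which is why such a naive strategy may be fairly inefficient and why optimization matters.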
Example query: $q = \pi_t(\sigma_{s=\text{“Bruni”}}(\bowtie_i r_i))$, where $\bowtie$ = outer join • [Diagrams: query plans (a), (b), and (c) for q over r1[t,s], r2[s,at], r3[t,s], r4[t,s], r5[t,s]]
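Why the choice of plan matters can be shown with a toy comparison of two strategies for the "titles of songs by Bruni" query. These two plans are hypothetical illustrations and are not claimed to be plans (a), (b), or (c) of the slide; the peer contents are sample data, and only relations of the form r[t,s] (title, singer) are modeled.

```python
# Sample data: each peer holds a relation r[t, s] of (title, singer) pairs.
PEERS = {
    "p1": [("Quelqu'un m'a dit", "Bruni"), ("La Bohème", "Aznavour")],
    "p3": [("Raphaël", "Bruni")],
    "p4": [("Ne me quitte pas", "Brel")],
}

def plan_centralized(peers, singer):
    """Ship every tuple to the querying peer, then select and project there."""
    shipped = [row for rows in peers.values() for row in rows]
    titles = {t for (t, s) in shipped if s == singer}
    return titles, len(shipped)

def plan_pushdown(peers, singer):
    """Select and project at each peer, then ship only the matching titles."""
    shipped = [t for rows in peers.values() for (t, s) in rows if s == singer]
    return set(shipped), len(shipped)

central_answer, central_cost = plan_centralized(PEERS, "Bruni")
pushed_answer, pushed_cost = plan_pushdown(PEERS, "Bruni")
```

Both plans return the same answer, but the second ships fewer items over the network; counting shipped items is only a crude stand-in for a real cost model.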
Links to deductive databases • Analogies • extensional relations: XML • intensional relations: service calls • Recursion: P calls P’ that calls P • Detection of termination • Query optimization: adaptation of Vieille’s QSQ (same for MagicSet) [AbiteboulAbramsMilo] • Used for distributed network diagnosis (with Haar)
What is available? • Data ring • Paper in Cidr07 • Some on-going work on self-tuning • Logic for distributed data management: ActiveXML • Survey paper to appear in VLDB Journal • Code in open source • Algebra: ActiveXML algebra • Paper in EDBT06 is out of date • New paper available • Implementation started • P2P indexing: KadoP • Code in open source
Lots of related work and related systems • System development is moving very fast • Structured P2P nets: Pastry, Chord • Content delivery net: Coral, Akamai • XML repositories: Xyleme, DBMonet • Multicast systems: Avalanche, Bullet • File sharing systems: BitTorrent, Kazaa • Pub/Sub systems: Scribe, Hyper • Distributed storage systems: OceanStore, GoogleFS • Etc. • Fundamental research is somewhat left behind