WebdamExchange and WebdamLog: some models for web data management. Alban Galland, INRIA Saclay & ENS Cachan. Grenoble, 10/12/2010.
Organization • Introduction • Representing all Web information as logical sentences • Representing all Web data management as logical rules • Some clues about implementation • Conclusion
Context of the work presented here • ERC Grant Webdam on Web Data Management of Serge Abiteboul with two INRIA teams, Leo-Iasi (ex Gemo, INRIA Saclay) and Dahu (LSV, ENS Cachan) • Joint work with many people: Émilien Antoine, Serge Abiteboul, Meghyn Bienvenu, David Gross-Amblard, Amélie Marian, Bruno Marnette, Neoklis Polyzotis, Philippe Rigaux, Marie-Christine Rousset…
Context: Web data management • Scale: lots of users, servers, large volume of data… • Distribution heterogeneity: Cloud (social networks), P2P (DHT, gossiping)… • Security heterogeneity: login, https, crypto, hidden URL… • Terminology heterogeneity: annotation, semantic Web, ontologies… • Incomplete information: inconsistencies, belief, trust… • The heterogeneity keeps increasing as new systems and applications arrive • Consequence 1: data integration and management are difficult • Consequence 2: users cannot keep control over their own data
Thesis: Web data = distributed knowledge • Work plan • Represent all Web information as logical sentences • Represent all Web data management as logical rules • Develop a system to validate these ideas • Motivation for the approach • Facilitate the design/implementation of complex systems • Facilitate the control/surveillance of complex systems • Use reasoning to optimize query evaluation • Use reasoning for semantics/ontologies • Use reasoning to manage access control and protect data • Use reasoning to analyze properties of systems
Motivating example • Alice: get me the pictures of my friends where I am with Bob • What is going on: • Find the friends of Alice (Alice’s iPhone may remember them) • For each answer, say Sue, find where Sue keeps her pictures (she may keep them on Picasa) • Find the means to access Sue’s pictures (Alice may ask a common friend for the private URL) • Find the photos with Bob and Alice (e.g. by querying the metadata)
Motivating example • Alice: get me the pictures of my friends where I am with Bob • Issue: friends are heterogeneous • Heterogeneity of hosting: some keep their pictures on trusted servers such as Picasa, some put them on an untrusted DHT, some have them on their smartphones… • Heterogeneity of access control: some pictures are public, some are behind a login/password, some use a private URL, some use cryptography… • Heterogeneity of data description: friends may use different metadata models (taxonomies, ontologies…)
The information belongs to someone • Each piece of information belongs to a principal • A principal has an identity (URI) which can be authenticated • Two kinds of principals: peers and virtual principals • A peer: alice-laptop, alice-iPhone, picasa, facebook, dht-peer-124, … • A peer has storage and processing capabilities • A peer typically has a URL and can be sent query/update requests • A virtual principal: alice, alice-friends, roc14 • A virtual principal relies on peers for storage and processing
The kind of information we are talking about • Data: pictures, movies, music, emails, ebooks, reports • Localization: bookmarks, knowledge such as “Alice has an account on Facebook” or “Sue puts her pictures on Picasa” • Access: login/password, access rights on servers • Annotations/Ontologies: semantic tags in Picasa, RDFS, OWL • Services: search engines, yellow pages, dictionaries… • Incomplete information: beliefs, probabilistic information… • And more…
Logical statements to represent information • Data: • Document: picture34@alice-iPhone(picture34.jpg,09/12/2009,…) • Collection: pictures@alice(picture34@alice-iPhone) • Localization: where@alice(picture37, picasa/alice) • Access right: isOwner@picasa/alice(alice) • Access secret: ownSecret@picasa/alice(“alice”, “HG-FT23”) • Ontologies: isA@yago.com(“alice”, human-being) • Services: addresse@pagesjaunes.fr($Person, $City, $Y) • Belief: picture34@alice-iPhone(picture34.jpg,09/12/2009,…,75%) • Etc.
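The statements above all share the shape relation@principal(arguments). A minimal Python sketch of that shape follows; the `Fact` class and `parse_fact` helper are illustrative names that do not come from the slides, and the parser assumes arguments contain no commas or parentheses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fact:
    """One logical statement of the form relation@principal(arg1, ..., argn)."""
    relation: str           # e.g. "where"
    principal: str          # e.g. "alice" or "picasa/alice"
    args: Tuple[str, ...]   # e.g. ("picture37", "picasa/alice")

    def __str__(self) -> str:
        return f"{self.relation}@{self.principal}({', '.join(self.args)})"


def parse_fact(text: str) -> Fact:
    """Parse 'rel@principal(a, b)'. Assumes no commas or parentheses inside arguments."""
    head, _, rest = text.partition("(")
    relation, _, principal = head.partition("@")
    args = tuple(a.strip() for a in rest.rstrip(")").split(",") if a.strip())
    return Fact(relation.strip(), principal.strip(), args)


localization = parse_fact("where@alice(picture37, picasa/alice)")
access_right = parse_fact("isOwner@picasa/alice(alice)")
```

Keeping the principal as a separate field is what later lets a rule treat peers as data, for example by binding a variable to `principal`.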
WebdamExchange focus: authenticated knowledge • Base statement: • someone states picture37@alice (….) • It is annotated with a proof that “someone” can write data of alice • In the cryptographic setting, the proof is a signature of the whole statement using the write secret key of alice • Keeping track of provenance: • alice-laptop states picture37@alice (….) requester bob at 12:30, 10/08/2009 • alice-laptop is the performer (the peer that updated the data of alice) • bob is the requester (the peer or the user who requested the update) • The content is possibly encrypted: • alice-laptop states picture37@alice (….) protected for reader@alice requester bob at 12:30, 10/08/2009
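To make the idea of a statement carrying its own proof concrete, here is a small sketch. It is not the WebdamExchange scheme: it swaps in an HMAC-SHA256 tag from the Python standard library as a stand-in for a real digital signature, so the same shared secret both signs and verifies. The field names and the secret value are hypothetical.

```python
import hashlib
import hmac
import json


def canonical(statement: dict) -> bytes:
    """Serialize the whole statement deterministically so the tag covers every field."""
    return json.dumps(statement, sort_keys=True).encode("utf-8")


def sign(statement: dict, write_secret: bytes) -> str:
    """Stand-in for 'signature of the whole statement using the write secret key'."""
    return hmac.new(write_secret, canonical(statement), hashlib.sha256).hexdigest()


def verify(statement: dict, proof: str, write_secret: bytes) -> bool:
    """Check the proof in constant time."""
    return hmac.compare_digest(sign(statement, write_secret), proof)


# Hypothetical statement mirroring the provenance example; "(...)" stays elided.
statement = {
    "performer": "alice-laptop",
    "fact": "picture37@alice(...)",
    "requester": "bob",
    "time": "12:30, 10/08/2009",
}
secret = b"hypothetical-write-secret-of-alice"
proof = sign(statement, secret)
tampered = {**statement, "requester": "mallory"}
```

Because the tag covers the performer, requester and time as well as the fact, changing any provenance field invalidates the proof. A deployment would presumably use public-key signatures so that readers can verify without holding the write secret.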
WebdamExchange focus: authenticated knowledge • Communication: external knowledge is knowledge about other principals: • alice-laptop says (alice-laptop states picture37@alice (….) requester bob at 12:30, 10/08/2009) to sue-iphone at 13:15, 15/10/2009 • alice-laptop is the performer of the communication • sue-iphone is the receiver of the communication • External knowledge is authenticated by the performer and stored by the receiver • External knowledge keeps a trusted trace of the provenance, and communications are piled up: • sue-iphone says (alice-laptop says (alice-laptop states picture37@alice (….) requester bob at 12:30, 10/08/2009) to sue-iphone at 13:15, 15/10/2009) to bob-iphone at 13:10, 15/10/2009 • Each time is the local time of the performer; there is no global clock
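The piled-up communications can be pictured as nested records. The sketch below is one possible encoding, with class and function names (`States`, `Says`, `provenance`, `well_linked`) chosen for illustration only; the linkage check is an assumption about what a receiver might verify, not something the slides specify.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class States:
    performer: str
    fact: str
    requester: str
    time: str   # local time of the performer


@dataclass(frozen=True)
class Says:
    performer: str
    content: Union["States", "Says"]
    receiver: str
    time: str   # local time of the performer; no global clock is assumed


def provenance(knowledge) -> List[str]:
    """Performers from the outermost communication down to the base statement."""
    chain = []
    while isinstance(knowledge, Says):
        chain.append(knowledge.performer)
        knowledge = knowledge.content
    chain.append(knowledge.performer)
    return chain


def well_linked(knowledge) -> bool:
    """Assumed check: whoever forwards a communication must be its previous receiver."""
    while isinstance(knowledge, Says) and isinstance(knowledge.content, Says):
        if knowledge.content.receiver != knowledge.performer:
            return False
        knowledge = knowledge.content
    return True


base = States("alice-laptop", "picture37@alice(...)", "bob", "12:30, 10/08/2009")
first = Says("alice-laptop", base, "sue-iphone", "13:15, 15/10/2009")
second = Says("sue-iphone", first, "bob-iphone", "13:10, 15/10/2009")
```

Note that `second` carries an earlier timestamp than `first`, as in the slide: the two times come from different local clocks, so they are recorded but not compared.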
The model covers a wide range of data • The model does not prescribe any particular architecture for distribution • Gossiping, DHT, centralized server • Combination of these • Based on an abstract notion of localization • The model does not prescribe how access control is enforced, e.g.: • Documents in Web servers with access protected by login/password • Documents protected by cryptographic keys in public sites • Based on an abstract notion of secret and hint
Summary of WebdamExchange • All the information forms a trusted knowledge base • Each peer manages some portion of the knowledge base • Now, we have to use this distributed knowledge base … for the management of the distributed knowledge base!
From WebdamExchange to WebdamLog • The logical part of the WebdamExchange statements can easily be translated into datalog facts. • Most of the reasoning of the system can be done using the logical form and datalog-like rules • It motivates WebdamLog, a rule-based language for web data management
Why datalog? • Datalog: very popular in the 90’s, prehistoric by Web standards • Nicer/more compact syntax; easy to extend • Recursion not really essential • Datalog extensions • Negation and aggregate functions: a large body of work • Updates, time, trees, distribution: much less work • We use a datalog-like language influenced by • Active XML for distribution and intensional data • Hellerstein’s Dedalus for time and performance
WebdamLog • Facts are of the form: m@p(a1,...,an) (sorted) • Rules are of the form: • R@P(U) :- (not) R1@P1(U1), …, (not) Rn@Pn(Un) • R, Ri are message terms • P, Pi are peer terms • U, Ui are tuples of terms • Safety condition • Intuition: if the body holds for some valuation v, the message vR@vP(vU) is sent to the peer vP • Issue: what happens if the body of a rule mentions different peers?
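The intuition of "find a valuation v for the body, then send vR@vP(vU)" can be sketched for the positive case as follows. This is a naive evaluator written for illustration, not the WebdamLog engine: negation is omitted, terms starting with `$` are variables, and the relation names `located` and `notify` are made up for the example.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("$")


def match_atom(atom, fact, valuation):
    """Extend `valuation` so that `atom` equals `fact`, or return None."""
    if len(atom[2]) != len(fact[2]):
        return None
    v = dict(valuation)
    # Relation and peer are matched like any other term, so they may be variables.
    for term, value in zip((atom[0], atom[1], *atom[2]), (fact[0], fact[1], *fact[2])):
        if is_var(term):
            if v.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return v


def substitute(atom, v):
    rel, peer, args = atom
    return (v.get(rel, rel), v.get(peer, peer), tuple(v.get(a, a) for a in args))


def is_safe(head, body):
    """Positive-case safety: every head variable occurs in the body."""
    body_vars = {t for rel, peer, args in body for t in (rel, peer, *args) if is_var(t)}
    return all(t in body_vars for t in (head[0], head[1], *head[2]) if is_var(t))


def fire(head, body, facts):
    """Return the messages vR@vP(vU) for every valuation v satisfying the body."""
    valuations = [{}]
    for atom in body:
        extended = []
        for v in valuations:
            for fact in facts:
                v2 = match_atom(atom, fact, v)
                if v2 is not None:
                    extended.append(v2)
        valuations = extended
    return {substitute(head, v) for v in valuations}


facts = {
    ("friends", "alice-iphone", ("sue",)),
    ("friends", "alice-iphone", ("bob",)),
    ("located", "alice-iphone", ("sue", "picasa")),
}
# notify@$P($X) :- friends@alice-iphone($X), located@alice-iphone($X, $P)
head = ("notify", "$P", ("$X",))
body = [("friends", "alice-iphone", ("$X",)), ("located", "alice-iphone", ("$X", "$P"))]
messages = fire(head, body, facts)
```

The peer of each resulting message is computed by the valuation, which is what makes the destination of a message depend on data.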
WebdamLog System: • A finite set of peers • Each peer p has a local program P(p) and a delegated program D(p), each consisting of a finite set of rules • Each peer p has a database I(p), consisting of a finite set of facts of the form m@p(u) Semantics: • In a state (P,D,I), choose some peer p at random • Evaluate (P(p)UD(p))(I(p)) • This defines the new database I’(p) • This also adds facts to, and updates the delegated rules of, the other peers, defining (D’(q),I’(q)) for each q • The changes to each q are installed synchronously (we will see how to avoid this if desired) • Choose another peer and keep going (in a fair way) [Figure: four peers, Peer1 to Peer4]
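The step semantics above can be approximated by a small simulation. The sketch below makes several simplifying assumptions that are not in the slides: each peer's program is an ordinary Python function from its database to derived facts, rule delegation is left out, facts are only ever added, and peers are visited round-robin as one particular fair schedule instead of a random one.

```python
from typing import Callable, Dict, Set, Tuple

Fact = Tuple[str, str, Tuple[str, ...]]      # (message, peer, args)
Program = Callable[[Set[Fact]], Set[Fact]]   # database -> derived facts


def step(peer: str, programs: Dict[str, Program], dbs: Dict[str, Set[Fact]]) -> bool:
    """Evaluate one peer's program and install each derived fact at its target peer."""
    derived = programs[peer](dbs[peer])
    changed = False
    for fact in derived:
        target = fact[1]
        if target in dbs and fact not in dbs[target]:
            dbs[target].add(fact)   # installed immediately, i.e. synchronously
            changed = True
    return changed


def run(programs: Dict[str, Program], dbs: Dict[str, Set[Fact]], max_rounds: int = 100):
    """Visit peers round-robin until a full round changes nothing."""
    for _ in range(max_rounds):
        results = [step(p, programs, dbs) for p in sorted(dbs)]
        if not any(results):
            return dbs
    raise RuntimeError("no fixpoint reached within max_rounds")


# Hypothetical programs: peer1 greets its friends, every greeted peer acknowledges.
def greet(db: Set[Fact]) -> Set[Fact]:
    return {("hello", args[0], ("peer1",)) for m, p, args in db if m == "friend"}


def acknowledge(db: Set[Fact]) -> Set[Fact]:
    return {("ack", args[0], (p,)) for m, p, args in db if m == "hello"}


programs = {"peer1": greet, "peer2": acknowledge}
dbs = {"peer1": {("friend", "peer1", ("peer2",))}, "peer2": set()}
final = run(programs, dbs)
```

In this positive, add-only setting the final state does not depend on the order in which peers are visited, which is the flavor of convergence result the deck mentions later.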
Features of WebdamLog illustrated • Alice: get me the pictures of my friends where I am with Bob • result@alice-iphone($Photo,$X) :- friends@alice-iphone($X), findPhotos@alice-iphone($X,$R,$P), $R@$P($Photo,$Meta), contains@$P($Meta, “Alice”), contains@$P($Meta, “Bob”) • Peers and messages as data: they are reified • friends@alice-iphone is extensional, in I(alice-iphone) • findPhotos@alice-iphone is intensional, in P(alice-iphone)UD(alice-iphone) • $R@$P is bound to a relation of (possibly) another peer • contains@$P is a service of that peer
Features of WebdamLog illustrated • Delegation of rules • Alice: get me the pictures of my friends where I am with Bob • result@alice-iphone($Photo,$X) :- friends@alice-iphone($X), findPhotos@alice-iphone($X,$R,$P), $R@$P($Photo,$Meta), contains@$P($Meta, “Alice”), contains@$P($Meta, “Bob”) • friends@alice-iphone(Sue); • findPhotos@alice-iphone(Sue,photos,picasa/sue) :- • Then alice-iphone installs the following rule at picasa/sue: • result@alice-iphone($Photo,Sue) :- photos@picasa/sue($Photo,$Meta), contains@picasa/sue($Meta, “Alice”), contains@picasa/sue($Meta, “Bob”) • picasa/sue will send the photos as extensional facts to alice-iphone. When Alice terminates her query, alice-iphone cancels all the delegations.
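The delegation in this example can be reproduced with a short partial-evaluation sketch: evaluate body atoms left to right while they refer to the local peer, and as soon as an atom refers to another peer, ship the remainder of the rule there with the valuation applied. This is an illustrative reading of the slide, not the actual WebdamLog algorithm; negation is ignored and findPhotos is treated as a stored fact for simplicity.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("$")


def subst(atom, v):
    rel, peer, args = atom
    return (v.get(rel, rel), v.get(peer, peer), tuple(v.get(a, a) for a in args))


def match(atom, fact, valuation):
    """Extend `valuation` so that `atom` equals `fact`, or return None."""
    if len(atom[2]) != len(fact[2]):
        return None
    v = dict(valuation)
    for term, value in zip((atom[0], atom[1], *atom[2]), (fact[0], fact[1], *fact[2])):
        if is_var(term):
            if v.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return v


def delegate(head, body, local_peer, local_facts):
    """Return (target_peer, residual_head, residual_body) for each rule to install remotely."""
    delegations = []

    def go(i, v):
        if i == len(body):
            return                      # body fully local: nothing to delegate
        atom = subst(body[i], v)
        peer = atom[1]
        if peer != local_peer:
            if not is_var(peer):        # an unbound peer variable cannot be routed
                residual = tuple(subst(a, v) for a in body[i:])
                delegations.append((peer, subst(head, v), residual))
            return
        for fact in local_facts:
            v2 = match(atom, fact, v)
            if v2 is not None:
                go(i + 1, v2)

    go(0, {})
    return delegations


head = ("result", "alice-iphone", ("$Photo", "$X"))
body = [
    ("friends", "alice-iphone", ("$X",)),
    ("findPhotos", "alice-iphone", ("$X", "$R", "$P")),
    ("$R", "$P", ("$Photo", "$Meta")),
    ("contains", "$P", ("$Meta", "Alice")),
    ("contains", "$P", ("$Meta", "Bob")),
]
local_facts = [
    ("friends", "alice-iphone", ("Sue",)),
    ("findPhotos", "alice-iphone", ("Sue", "photos", "picasa/sue")),
]
installed = delegate(head, body, "alice-iphone", local_facts)
```

With these inputs the single delegation targets picasa/sue and its residual rule matches the one shown on the slide, with `$X`, `$R` and `$P` replaced by Sue, photos and picasa/sue.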
Managing rules at other peers • This is complex • Regarding implementation, one manages instantiations of rules, i.e., rules together with valuations • The content of valuations may be constantly changing • There could be some negations in the rules • This is a security risk • Someone else is installing data (facts) or code (rules) in a peer • This needs to be controlled carefully
Does it mean something? • Some not-so-trivial theorems for the positive case and the stratified-negation case, ensuring • Church-Rosser properties (convergence) • Natural simulation by centralized systems • Some even-less-trivial theorems comparing the expressivity of different variations of WebdamLog: without exchanging rules, without exchanging intensional data, with time-stamps…
More refined asynchronicity • To model a message from peer p to peer q, we may use a “peer” netpq that captures the network • Replace a call m@q(u) at p by m@netpq(u) • netpq should just relay messages: $M@q($U) :- $M@netpq($U) • Problem: all messages from p to q in the net arrive at the same time • Better with time • m@netpq(u,t) where t is the time of the send at p • $M@q($U) :- $M@netpq($U,$T), min($T, $M@netpq($U,$T)), using the min aggregate function
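The effect of the min aggregate, relaying the pending message with the smallest send time first, can be mimicked with a priority queue. The `NetRelay` class below is a hypothetical stand-in for the netpq "peer" and assumes timestamps are numbers taken from the sender's local clock.

```python
import heapq


class NetRelay:
    """Models netpq: holds messages sent by p and relays them to q one at a time."""

    def __init__(self):
        self._queue = []
        self._seq = 0   # tie-breaker so equal timestamps keep their send order

    def send(self, message, args, t):
        """Record m@netpq(u, t), where t is the send time at p."""
        heapq.heappush(self._queue, (t, self._seq, message, args))
        self._seq += 1

    def deliver_one(self):
        """Relay the pending message with the minimal timestamp, or None if empty."""
        if not self._queue:
            return None
        _, _, message, args = heapq.heappop(self._queue)
        return (message, args)


net = NetRelay()
net.send("m", ("second",), 5)
net.send("m", ("first",), 2)
delivered = [net.deliver_one(), net.deliver_one(), net.deliver_one()]
```

Delivering one message per call, instead of flushing the whole queue, is what removes the "everything arrives at the same time" problem of the untimed relay rule.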
Summary of WebdamLog • Peers run their datalog programs asynchronously • They exchange facts and delegations of rules
Implementation • We are implementing two kinds of peers • WEP (Webdam Exchange Peer): all functionalities • IWEP (iPad Webdam Exchange Peer): limited functionalities; relies on proxies • We are implementing a social network on top of the system
Some cool results and still a lot of work • The WebdamExchange and WebdamLog models capture some nice problems of web data management: distribution, access control… • Their clean semantics allows us to prove theorems! • We are implementing the corresponding system! • Many issues are still open • Concurrency, optimization, implementation • Defining and verifying protocols (access control is not violated, one gets all the information one has access to) • Looking for a killer application