620 likes | 735 Views
Data Ring: Let us turn the net into a database! Serge Abiteboul , INRIA-Futurs Neoklis Polyzotis, UC Santa-Cruz. Introduction. Success stories after the Internet bubble. Google: management of Web pages Mapquest: management of maps Amazone: book catalogue eBay: product catalogue
E N D
Data Ring: Let us turn the net into a database!Serge Abiteboul, INRIA-Futurs Neoklis Polyzotis, UC Santa-Cruz ADBIS06, Data ring
Success stories after the Internet bubble • Google: management of Web pages • Mapquest: management of maps • Amazone: book catalogue • eBay: product catalogue • Napster (emule, bearshare, etc.): music database • Flickr: picture database • Wikipedia: dictionary They are all about publishing some database ADBIS06, Data ring
A trend is towards peer-to-peer • P2P: • A large and varying number of computers cooperate to solve some particular task without any centralized authority • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network • Examples • seti@home: search for extraterrestrial intelligence • kazaa: obtain free music/video over the net • cabal: decryption of 512 bits RSA code • grub: P2P Web search ADBIS06, Data ring
A trend is towards peer-to-peer • The Web is switching from centralized servers to communities and syndication • Buzzwords such as Web 2.0, infoware, annotations, interactivity • Some motivations • Social, organizational: no need to have a leader, more flexible • Technical: performance, availability, no bottleneck • Analogy: • Software development by very structured and controlled groups of programmers, e.g., Windows • Open-source software produced by large communities of autonomous developers, e.g., Linux ADBIS06, Data ring
Data ring • Information management in a P2P network • Information is heterogeneous, distributed, replicated and dynamic • Data + knowledge + services • Peers are heterogeneous, autonomous and possibly mobile • Heterogeneous storage, bandwidth, computing power • From sensors to PDA to mainframe • Variety of requirements in terms of number of peers, QoS, performance, security, etc. • We will see motivations and examples ADBIS06, Data ring
Acknowledgement – people & projects • Xyleme: Sophie Cluet, Guy Ferran & many others • Scalable XML repository • A start-up in 2000 • ActiveXML within EC DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo & many others • Framework for 2P data management • KadoP within EC Edos project: Ioana Manolescu, Nicoleta Preda & many others • P2P scalable XML repository ADBIS06, Data ring
Outline • Introduction • Data ring • XML query processing • P2P XML query processing • Logic and algebra for distributed data management • Automatic and distributed management • Other issues in brief • Conclusion ADBIS06, Data ring
Functionalities:DBMS, content management and more • Storage, persistence, replication • Indexing, caching, query, update, optimization • Schema management, access control • Fault tolerance, self tuning, monitoring • Semantic enrichment, information retrieval, uncertain data, result ranking, multi-lingual… • Resource discovery, history, provenance • Each functionality may be achieved by a peer or by the network • This may depend on the nature of the application ADBIS06, Data ring
The information in a data ring • Data: tuples, collections, documents, relations, objects… • Services: data sources, possibly some computation (e.g. genome) • Knowledge • Meta-data about resources: attribute/values pairs, annotations • Ontologies to explain data and metadata • Mapping between ontologies (for data integration) • Views (possibly recursive) • Physical data • Indices • Materialized views • Statistics for query optimization ADBIS06, Data ring
A mainframe database A file system Web server A PC A PDA A telephone A sensor A home appliance A car A manufacturing tool A telecom equipment A toy Another data ring And now, what is a peer? ? Any connected device or software with information to share + a net address + names for its resources d@p, s@p’ ADBIS06, Data ring
Example: Edos distribution system • A system for the management of Linux distribution • Joint work with Mandriva Software and U. Tel Aviv • Community of open-source developers: thousands • System releases: about 10 000 software packages + meta data • Package name, author; date, signature, documentation • Collections of packages, dependencies between packages • Functionalities • Query the metadata (also query subscription) • Retrieve packages • Update an existing release • Publish a new release (based on BitTorrent) ADBIS06, Data ring
Main parameters of such applications • Number of peers • Number of documents • How volatile the peers are • The query workload • The update workload • The “network distance” between the peers • Edos: peers and documents in thousands, mostly append for updates, peers not too volatile • An extreme: Google search engine in P2P for billions of documents using millions of volatile peers ADBIS06, Data ring
Lots of related work and related systems • Structured P2P nets: Pastry, Chord • Content delivery net: Coral, Akamai • XML repositories: Xyleme, DBMonet • Multicast systems: Avalanche, Bullet • File sharing systems: BitTorrent, Kazaa • Pub/Sub systems: Scribe, Hyper • Distributed storage systems: OceanStore, GoogleFS • Data spaces: Franklin, Halevy ADBIS06, Data ring
Standards: The golden triangle of distributed content management on the Web • Standard for data exchange • XML, XML Schema… • Extensible Markup Language • Labeled ordered trees • Foundations: tree automata • Query languages • XPATH, XQuery… • Standards for distributed computing: Web services • SOAP, WSDL, UDDI, BPEL… • Simple Object Access Protocols • Corba but simpler and on the Web XML SOAP WSDL Xquery Xpath ADBIS06, Data ring
Information used to live in islands but it is changing • Different formats: relational, metadata, documents, text, DXF • A Web standard for data exchange, XML, is fixing it • XML captures all kinds of information over a wide spectrum • XML comes with a family of emerging standards: XML schema, XSL/T, Xquery, domain specific schemas… • Different computers, platforms, languages, applications • A standard for Web services, SOAP, is fixing it • SOAP allows ubiquitous computing on the Internet • SOAP comes with a family of emerging standards: WSDL, UDDI • This provides a uniform access to information… • …the dream for distributed data management ADBIS06, Data ring
Why XML? Because of its information spectrum Minimal structure Structured Data Hierarchy + Meta data Books Contracts Catalogs Bank accounts Emails Financial Reports Insurance Policies Economical Analysis Derivatives Inventory Political analysis Insurance Claims Financial News Sports News Resumes ADBIS06, Data ring
A standard for information: XML • Labeled ordered trees where leaves are text • Marriage of document and database worlds • Marriage of full text indexing and structure indexing • Is it the ultimate data model? No • Purely syntax – more semantics needed • Is it OK for now? Definitely yes (because it is a standard) ADBIS06, Data ring
The main asset of XML (vs. HTML) = typing product-table product reference price designation description • Applications need typing • XML typing: DTD and XML schema • Trees • Semantics and structure are in tags and paths • product-table/product/reference • product-table/product/price ADBIS06, Data ring
A standard for distributed computing: Web services • Possibility to activate a method on some remote Web server • Exchange information in XML: input and result are in XML • Ubiquitous XML distributed computing infrastructure • 2 main applications • E-commerce • Access to remote data • With XML and Web services, it is possible • To get information from virtually anywhere • To provide information to virtually anywhere ADBIS06, Data ring
A standard for XML queries: Xquery (soon a recommendation of W3C) • A “logic” for labeled tree – a declarative language • Inspired by SQL: standard for relation data • Inspired by OQL: standard for object databases • Functional as OQL • Not as clean as OQL • Mixes structure and content • Give me the documents where the word XML appears in title • Some full-text extension is coming • Also an update language and full-text capabilities ADBIS06, Data ring
XML and Xquery • XML focuses on lists • Represent a very set of objects as a list • In our context, overhead in terms of index maintenance and query processing • Here we are mostly concerned with a “set-oriented” view of XML • Xquery is very complicated and powerful • OK for large documents on a server • Not appropriate for huge volumes of small documents on the Web • Here we are mostly concerned with tree pattern queries (a subset of XQuery) and full-text search ADBIS06, Data ring
So to be honest • We take only what we want in the standard • Ignore order • Ignore complex features of XQuery • And even that is not what we need… ADBIS06, Data ring
Thesis • This is fine for efficient query processing of local XML • This is not sufficient for distributed data management and flexible information integration • What made the success of the relational model, i.e., of tables on a server : • A logic for defining tables • An algebra for describing query plans over tables • By analogy, we need for trees in a P2P system • A logic for defining distributed data and data services • An algebra for describing query plans over these • Proposal: ActiveXML and query optimization around it ADBIS06, Data ring
What we do next? • XML processing on a server • XML processing in a P2P system • ActiveXML • Then other data ring functions ADBIS06, Data ring
T XML query processing
Xyleme – in short • 1999: Xyleme research project at INRIA • 2000: Creation of a spin-off • 2006: About 25 people • Technology: a content warehouse built around a very efficient and scalable XML repository • Example: all articles of Le Monde in XML, financial reports in JP-Morgan ADBIS06, Data ring
Xyleme Functionalities View and Semantic stemming, integration, classification… Exploiting GUI, Web services, reporting… Query Processing Store & Index Feeding Web ADBIS06, Data ring
Xyleme Architecture Loader| Local | Query Loader| Local | Query Loader| Local | Query XML store XML store XML store Index Index Index Client side Applications IE/Java/C++/.Net Or Any Platform HTTP | Web Service API Server side or Application Server Tomcat|Soap Name Server User Manager Url Manager Notification Mgr Global Query Manager Java/C++ API Global Query Manager Corba ... ADBIS06, Data ring
Structural identifiersand indexing hash(C) LAN Put(C;[d,p,6,6,1]) hash(“John) Put(“John”;[d,p,3,1,2]) 1 A 7 0 6 2 B C 4 6 1 1 X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y) 3 4 7 D E 1 F 3 5 2 2 2 5 G “John” X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1 2 3 Structural IDs = Prefix-Postfix-Level ADBIS06, Data ring
Query evaluation based on Holistic twig joins (d1, 201, 400) A C D (d1, 224, 201) (d1, 228, 237) “John” (d1, 228, 237) ADBIS06, Data ring
Moving this to a P2P network – KadoP • Advantages: scaling Disadvantages: complexity • Performance • Optimization of parallelism • Avoid bottleneck • Replication • Availability • Replication • Cost • Avoid the cost of server • Share operational cost • Dynamicity • add/remove new data sources • Performance • Cost for complex queries • Communication cost • Availability • Peers can leave • Consistency maintenance • Difficult to support transaction • Quality • Difficult to guarantee quality ADBIS06, Data ring
Typically on a WAN Peers come and go Standard interface Locate(k):: log(n) messages to fing the peer in charge of key k Put(k,v) Get(k): retrieves all the values for k We tried Pastry, Chord and JXTA We now use Pastry Not satisfactory for the system we want to develop example: locate, put Distributed hash tables hash(k) DHT put(k;v3) v1,v2,v3 put(k;v1) put(k;v2) get(k) ADBIS06, Data ring
Use structured ID as in Xyleme Publish them in a DHT Use Holistic twig join Main issue: communications WAN vs. LAN Long posting lists Optimization techniques Use only docID [wisconsin] Ship smallest list Semi-join techniques Intensional indexing Indexing in KadoP hash(C) DHT hash(“John) put(C;[d,p,6,6,1]) put(“John”;[d,p,3,1,2]) hash(C) DHT ADBIS06, Data ring
Such P2P systems scale… sort of… • If we consider • The Internet • Millions of peers • A “large” number of documents per peers • A “reasonable” query load per peer • This is placing too much load on the network • Lots of queries will result in intersecting posting lists on peers that are remote, e.g. US and Europe • Besides parallelism, we should use replication intensively • To intersect two long posting lists, I should be most likely to find two replicas that are nearby eachother • Very interesting distributed data management problem ADBIS06, Data ring
Logic and algebra for distributed data management: ActiveXML
ActiveXML • Recall the standards of distributed data management: XML, Web services, XQuery • ActiveXML = XML documents with embedded Web service calls where service calls are typically in Xquery • Web service provides: • Intensional data: not explicitly here • Dynamic: a new call may provide new data XML Xquery Xpath SOAP WSDL ADBIS06, Data ring
ActiveXML = XML + embedded service calls(omitting syntactic details) <resorts state=‘Colorado’> <resort> <name> Aspen </name> <scond> Unisys.com/snow(“Aspen”) </scond> <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name=“Aspen”/>) </hotels> </resort> … </resorts> <depth unit=“meter”>1</depth> May contain calls to any SOAP web service to any ActiveXML web services ADBIS06, Data ring
Not a new idea in databasesNot a new idea on the Web • Mixing calls to data is an old idea • Procedural attributes in relational systems • Basis of Object Databases • In HTML world • Sun’s JSP, PHP+MySQL • Call to Web services inside documents • Macromedia MX, Apache Jelly ADBIS06, Data ring
What exactly to exchange • A parameter of a call contains some service calls • The result of a call contains some service calls • Do we evaluate these calls before transmitting the data or not • Hi Pierre, what is the phone number of Number 10 in the French soccer team? • Find his name in wc2006.com/France/team.xml then look in fifa.PhoneBook.org • Look for Zidane in fifa.PhoneBook.org • 33 4 13 00 00 01 ADBIS06, Data ring
When to activate the call • Explicit pull mode • Frequency: Daily, weekly, etc. • After some event: e.g., when another service call completed • This aspect of the problem is related to active databases • Implicit pull mode : Lazy • When the data is requested • Difficulty : detect that the result of a particular request may be affected by a particular call • This is related to deductive databases • Push mode • E.g., based on a query subscription; the web server pushes information to the client • E.g., synchronization with an external source • This is related to stream and subscription queries ADBIS06, Data ring
Peer-to-peer architecture Each ActiveXML peer Repository: manages ActiveXML data with embedded web service calls Web client: uses Web services Web server: provides (parameterized) queries/updates over the repository as web services Open source system SUN’s Java SDK 1.4 XML parser XPath processor, XSLT engine Apache Tomcat 4.0 servlet engine Apache Axis SOAP toolkit 1.0 X-OQL query processor persistent DOM repository JSP-based user interface JSTL 1.0 standard tag library see http://activexml.net ActiveXML peer ActiveXML peer soap ADBIS06, Data ring
ActiveXML is a logic for distributed tree management… and also an algebra • ActiveXML extended • Generic services: s@any • Send and receive (work on multicasting) • Stream and EoS • ActiveXML is restricted: recall we ignore node ordering • Novelties • Stream of trees vs. set of relations for the relational model • Distribution: exchange of query plans • Optimization & evaluation are interleaved ADBIS06, Data ring
A simple example //paper[//author contains “Ullman”][//title contains “magic set”] Q@any Q@any Q@any paper author title Ring indexQ@local * d21@p2 d12@p6 “magic set” “Ullman” Query plan: evaluate: q locally or ship it to p2,p6 ADBIS06, Data ring
A simple example - continued get get • Optimization of the Index computation (indexQ) send send get get get get get Author get Title send send send send send send Author Title “magic set” “Ullman” “magic set” “Ullman” ADBIS06, Data ring
Distributed query optimization • This provides an algebra for distributed query processing • Allows interleaving query processing and query optimization • Can interact with local optimizers • Based on XML flow and local XML flow processing • Most techniques that I know of can be captured • Horizontal decomposition, semi-join, replication, magic set… • Main issue to guide optimization is the gathering of statistics ADBIS06, Data ring
Context • No centralized server • No information administrator (no info manager) • Most users are non-experts • E.g., scientists • Requirements • Ease of deployment (zero-effort) • Ease of administration (zero-effort) • Ease of publication (epsilon-effort) • Ease of exploitation (epsilon-effort) • Participation in community building notably via annotations ADBIS06, Data ring