KadoP: a P2P content sharing system Serge Abiteboul INRIA-Futurs (Orsay) and University Paris Sud. Context. MDP2P – Project “ Masse de Données en P2P ” KadoP: Joint work with Ioana Manolescu and Nicoleta Preda, INRIA-Futurs (Orsay) and University Paris Sud (thesis of Nicoleta)
KadoP: a P2P content sharing system • Serge Abiteboul, INRIA-Futurs (Orsay) and University Paris Sud
Context • MDP2P: project "Masse de Données en P2P" (massive data in P2P) • KadoP: joint work with Ioana Manolescu and Nicoleta Preda, INRIA-Futurs (Orsay) and University Paris Sud (thesis of Nicoleta) • Article in EDBT and demo in Data Engineering
Organization • Introduction • The basis: XML, DHT, ActiveXML • KadoP • Query processing • The implementation • Conclusion
Peer-to-peer • A large and varying number of computers cooperate to solve some particular task without any centralized authority • Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network • Examples • seti@home: search for extraterrestrial intelligence • kazaa: obtain free music/video over the net • cabal: decryption of a 512-bit RSA code • grub: P2P Web search
Data management in P2P • Publication of resources (XML and knowledge) • Storage of resources • Access to resources • Acquisition/Enrichment/Exploitation • Focus here on query processing: precise answers taking into account the text, the structure and the semantics of XML documents
[Figure: peers connected through the Internet]
Standards of distributed data management • Standard for data exchange: XML (Extensible Markup Language), labeled ordered trees • Standard for query languages: XPath, XQuery • Standards for distributed computing: Web services (SOAP, WSDL) • ActiveXML = XML documents with embedded Web service calls: intensional, dynamic XML
ActiveXML = XML + embedded service calls (omitting syntactic details)
<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <scond> Unisys.com/snow("Aspen") </scond>
    <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name="Aspen"/>) </hotels>
  </resort>
  …
</resorts>
Example of a call result shown on the slide: <depth unit="meter">1</depth>
• May contain calls to any SOAP Web service and to any ActiveXML Web service
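To make the idea of embedded calls concrete, here is a minimal sketch of replacing a call by its result. It is not the ActiveXML engine: the `<call>` element, its `service` and `arg` attributes, the `SERVICES` registry and the `materialize` function are all hypothetical names invented for illustration, and the service is a local Python function rather than a SOAP endpoint. The returned `<depth>` element mirrors the result shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <call> syntax is an assumption, since the
# slide omits the real ActiveXML syntax.
DOC = (
    '<resort><name>Aspen</name>'
    '<scond><call service="snow" arg="Aspen"/></scond></resort>'
)

# Hypothetical local registry standing in for remote SOAP services.
SERVICES = {
    "snow": lambda arg: ET.fromstring('<depth unit="meter">1</depth>'),
}


def materialize(root, services):
    """Replace every <call> element by the XML returned by its service."""
    for parent in list(root.iter()):          # snapshot before mutating
        for index, child in enumerate(list(parent)):
            if child.tag == "call":
                result = services[child.get("service")](child.get("arg"))
                result.tail = child.tail      # keep surrounding text
                parent[index] = result        # in-place replacement
    return root


if __name__ == "__main__":
    tree = materialize(ET.fromstring(DOC), SERVICES)
    print(ET.tostring(tree, encoding="unicode"))
```

A real peer would decide when to evaluate a call (eagerly, lazily, or periodically); the sketch evaluates every call once.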
ActiveXML peer • Each ActiveXML peer has a repository, a Web client and a Web server • Open-source in ObjectWeb, see http://ActiveXML.net • Based on standard libraries • SUN's Java SDK 1.4 (XML parser, XPath processor, XSLT engine) • Apache Tomcat 4.0 servlet engine, Apache Axis SOAP toolkit 1.0 • X-OQL query processor (soon to be replaced by the eXist XML database?)
[Figure: two ActiveXML peers communicating through SOAP]
Distributed hash tables • locate(k) • put(k,v1): hash(k) determines the peer Ph(k) where (k,v1) is kept • get(k) retrieves v1, v2, … from Ph(k) • delete(k,v) • Management of the overlay network is complex because peers come and go • We use Pastry • We have tried others: Chord, Jxta
[Figure: a DHT receiving put(k,v1), put(k,v2) and get(k), and storing k: v1, v2]
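The four operations above can be sketched with an in-process stand-in for a DHT. This is only an illustration under simplifying assumptions: `ToyDHT` is a hypothetical class, peers are plain dictionaries, and the peer for a key is chosen by a hash modulo the number of peers, whereas a real overlay such as Pastry routes messages over the network and copes with peers joining and leaving.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """Single-process imitation of locate/put/get/delete (not Pastry)."""

    def __init__(self, peer_names):
        self.ring = sorted(peer_names)
        # One multi-valued store per peer: key -> list of values.
        self.peers = {name: defaultdict(list) for name in self.ring}

    def locate(self, key):
        """Return the name of the peer in charge of key (the slide's Ph(k))."""
        digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
        return self.ring[digest % len(self.ring)]

    def put(self, key, value):
        """Add value to the values kept for key on the responsible peer."""
        self.peers[self.locate(key)][key].append(value)

    def get(self, key):
        """Return all values v1, v2, ... stored for key."""
        return list(self.peers[self.locate(key)].get(key, []))

    def delete(self, key, value):
        """Remove one (key, value) pair if it is present."""
        store = self.peers[self.locate(key)]
        if value in store.get(key, []):
            store[key].remove(value)
            if not store[key]:
                del store[key]


if __name__ == "__main__":
    dht = ToyDHT(["peer1", "peer2", "peer3"])
    dht.put("k", "v1")
    dht.put("k", "v2")
    print(dht.locate("k"), dht.get("k"))
```

Modulo placement is chosen here only for brevity; it reassigns most keys whenever the number of peers changes, which is one reason real DHTs use other placement schemes.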
KadoP data items • XML documents and Web services • XML sub-trees, views and collections of documents • Labels, words and stems of these words • Types: DTD and XSD for documents, WSDL for services • Also ActiveXML documents and ActiveXML services • Ontologies: concepts, isa, etc.
Goal • Find relevant information to answer a query • May require some extensional information • May require calling some Web services • May require some elaborate query plan including service composition • Simple examples • Find me Emacs packages • Find me Emacs packages that were modified last week • Find me the packages depending on Emacs in my Linux system
KadoP architecture
[Figure: a KadoP peer organized in three layers (External, Logical, Physical). Components shown: Web interface (publish & query), semantic layer, ActiveXML engine, KadoP engine (indexing, query processing), DHT (locate, put, get & delete), index.]
Architecture: EDOS distribution system • Java/JSP application on each peer • KadoP: distributed index • ActiveXML: data/metadata storage • IDiP: dissemination management • BitTorrent: efficient download
Efficient evaluation of tree-pattern-queries • Many optimization techniques • We are interested here in distributed query evaluation/optimization • 1) We consider XML indexing • 2) Holistic twig join that is based on indexing • 3) P2P indexing • 4) P2P query processing • 5) Optimizing P2P indexing
XML indexing: structural identifiers • Structural IDs = (prefix, postfix, level), e.g. (1,8,0) for the root A • X ancestor of Y <=> pre(X) < pre(Y) and post(X) ≥ post(Y) • X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
[Figure: example tree with nodes A, B, C, D, E, F, G and the text "John", each labeled with its structural ID]
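The two tests above translate directly into code. The sketch below assigns identifiers to a small tree and checks the ancestor and parent conditions. One assumption is made loudly: `post` is taken to be the largest `pre` found in the node's subtree (so a leaf gets `pre == post`), which is consistent with the ≥ in the ancestor test; the numbering actually used in KadoP may differ. The names `SID`, `assign_ids`, `is_ancestor` and `is_parent` are hypothetical.

```python
from typing import NamedTuple


class SID(NamedTuple):
    """Structural identifier: (pre, post, level)."""
    pre: int
    post: int
    level: int


def assign_ids(tree):
    """tree is (label, [children]); return [(label, SID)] in document order.

    Assumption: post = largest pre in the node's subtree.
    """
    ids = []

    def visit(node, level, pre):
        label, children = node
        slot = len(ids)
        ids.append(None)              # reserve the document-order position
        last = pre
        for child in children:
            last = visit(child, level + 1, last + 1)
        ids[slot] = (label, SID(pre, last, level))
        return last                   # largest pre used in this subtree

    visit(tree, 0, 1)
    return ids


def is_ancestor(x, y):
    """X ancestor of Y <=> pre(X) < pre(Y) and post(X) >= post(Y)."""
    return x.pre < y.pre and x.post >= y.post


def is_parent(x, y):
    """X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x.level == y.level - 1


if __name__ == "__main__":
    doc = ("A", [("B", [("D", []), ("E", [])]), ("C", [("F", [])])])
    for label, sid in assign_ids(doc):
        print(label, tuple(sid))
```

The point of such identifiers is that structural relationships are decided by comparing integers, without accessing the document itself.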
Holistic Twig Join • Input: a document and a tree pattern query • Find the bindings of the query in the document • Holistic: the whole and not just the parts • Twig: a small branch • Join: the (database) join • Sounds like Harry Potter?
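Query evaluation over a document • Each query node has a stream of IDs: IDs for A ((1,8,0), …), IDs for C, IDs for D, IDs for "John" • IDs are sorted in lexicographical order • Goal is to find the "matching IDs"
[Figure: tree pattern query with nodes A, C, D and "John", each attached to its stream of IDs]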
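Matching over sorted ID streams can be illustrated with a building block that is deliberately simpler than the holistic twig join: a binary, stack-based ancestor-descendant join over two sorted lists. This is a swapped-in technique for illustration, not the algorithm of the following slides. It assumes IDs are `(pre, post)` pairs in which `post` is the largest `pre` in the node's subtree; `structural_join` is a hypothetical name.

```python
def structural_join(ancestors, descendants):
    """Return all (a, d) pairs such that a is an ancestor of d.

    Both inputs are lists of (pre, post) pairs sorted by pre.
    Assumption: post is the largest pre in the node's subtree.
    Each input is read once, in order, as a stream would be.
    """
    out = []
    stack = []        # candidate ancestors, each nested inside the one below
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()           # that subtree ended before a starts
            stack.append(a)
            i += 1
        # Drop candidates whose subtree ended before d.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Every remaining stack entry contains d.
        out.extend((a, d) for a in stack)
    return out


if __name__ == "__main__":
    a_ids = [(2, 8), (3, 5), (15, 17)]
    c_ids = [(4, 5), (7, 8), (17, 17), (21, 21)]
    for pair in structural_join(a_ids, c_ids):
        print(pair)
```

A holistic twig join generalizes this idea to a whole tree pattern at once, with one stack per query node, so that intermediate results that cannot contribute to a full match are avoided.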
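The Holistic Twig Join Algorithm
[Figure: example document tree, levels 0 to 4, with a tree pattern query over a, b and c. Nodes with their (pre, post) identifiers, in document order: r (1,25); a (2,8); a (3,5); c (4,5); a (5,5); c (6,8); c (7,8); a (8,8); c (9,14); b (10,11); c (11,11); b (12,14); b (13,14); c (14,14); a (15,17); a (16,17); c (17,17); a (18,25); b (19,22); b (20,21); c (21,21); c (22,22); c (23,25); b (24,25); c (25,25).]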
The Holistic Twig Join Algorithm
[Figure: snapshot of a run. Input streams a1 … a7, b1 … b6 and c1 … c11, one per query node a, b, c; stacks Sa, Sb and Sc; matches output: (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11). Legend: end marker; head of the stream; node whose query sub-tree is currently being matched; ID also present in the stack.]
Also: Intensional data • Example: include and references • Example: function calls in ActiveXML • Find me the packages depending on Emacs in my Linux system • package (name, author, size, signature, dependsOn(self)…) • the depending packages are intensional • Naïve: return empty answer • Brutal: return all documents with a function call • What we do: use indexing (and typing)
Some technical issues • Common belief: this cannot work because of transfer delays • Indeed, first experiments were a disaster • DHT did not scale – not designed for so many entries • Transfers of long posting lists were killing the system • Our target : make it work in some modest setting • with millions of documents • with thousands of peers • with not too volatile peers • (not Kazaa or GoogleSearch but industrial application)
Let's make it work • Some of the early observations of MDP2P and solutions • Replace the DHT's file-system-based index storage with storage in a database (Berkeley DB) • Extend the API of the DHT to have Append and not only Read/Write • Extend the API of the DHT to have a streaming exchange of postings (for long postings) • Useful because the XML algebra works better with streams • Now KadoP scales but can be optimized • We will see here one optimization technique: DPP
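The two API extensions can be pictured as follows. This is a sketch under stated assumptions, not the KadoP or Pastry API: `PostingStore`, `append`, `stream` and `flatten` are hypothetical names, the store lives in memory on one machine, and postings are assumed to be appended in sorted order. `append` adds postings without first reading the existing list, and `stream` hands a posting list out in chunks so that a consumer can start working before the whole list has arrived.

```python
import heapq
from collections import defaultdict
from itertools import islice


class PostingStore:
    """In-memory stand-in for one peer's index storage."""

    def __init__(self):
        self._lists = defaultdict(list)

    def append(self, key, postings):
        """Add postings to key without a read-modify-write of the list."""
        self._lists[key].extend(postings)

    def stream(self, key, chunk_size=2):
        """Yield the posting list for key as successive chunks."""
        it = iter(self._lists.get(key, []))
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                return
            yield chunk


def flatten(chunks):
    """Turn a stream of chunks into a stream of single postings."""
    for chunk in chunks:
        yield from chunk


if __name__ == "__main__":
    store = PostingStore()
    store.append("name", [(1, 3), (1, 7)])
    store.append("name", [(2, 4)])
    store.append("author", [(1, 5), (2, 9)])
    # Lazily merge two sorted streams, as a stream-based operator would.
    merged = heapq.merge(flatten(store.stream("name")),
                         flatten(store.stream("author")))
    print(list(merged))
```

With a plain read/write interface, adding one posting to a long list means transferring the whole list twice; an append operation avoids that cost.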
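Distributed Posting Partitioning • Long posting = bad response time • With partitioning there is no long posting: get h(Name), then fetch the partitions in parallel • Possibility to optimize further: if the partition at f covers docId55..docId75 and the query cannot match in that range, there is no need to call f
[Figure: a long posting stored under h(Name), replaced by a distributed B-tree whose root at h(Name) points to partitions f, g, h, i]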
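One way to read this slide is as a two-level structure: a small directory stored under the original key, and blocks of postings stored elsewhere, each described by the range of document IDs it covers. The sketch below follows that reading only; the actual DPP structure in KadoP may differ. `partition` and `blocks_to_fetch` are hypothetical names, postings are reduced to plain document numbers, and the blocks stay in one process instead of being spread over peers.

```python
def partition(postings, block_size):
    """Split a sorted, non-empty posting list into blocks.

    Returns (directory, blocks); each directory entry is
    (first_id, last_id, block_number).
    """
    blocks = [postings[i:i + block_size]
              for i in range(0, len(postings), block_size)]
    directory = [(block[0], block[-1], n) for n, block in enumerate(blocks)]
    return directory, blocks


def blocks_to_fetch(directory, lo, hi):
    """Return the numbers of the blocks whose range overlaps [lo, hi]."""
    return [n for first, last, n in directory if first <= hi and last >= lo]


if __name__ == "__main__":
    doc_ids = list(range(1, 101))            # postings for one long key
    directory, blocks = partition(doc_ids, block_size=25)
    print(directory)
    # Only blocks overlapping the requested range need to be fetched;
    # in a distributed setting the fetches could run in parallel.
    print(blocks_to_fetch(directory, 55, 75))
```

This mirrors the docId55..docId75 remark on the slide: a block whose range cannot contain a match is never requested.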
Main issues • Scaling: optimize query processing • Adapting Bloom filters and other known techniques • Ongoing in Gemo • Scaling: the main tool is replication • Issues are consistency and overhead • Ongoing work in MDP2P/Atlas • Dynamicity: better manage peers entering/leaving the system
A KadoP application: Data management in Edos • The distribution of a large software system to the peers developing it • Mandriva Linux distribution: 10 000 packages + metadata shared among up to 1 000 peers • Thousands of packages (about 9000 in Mandriva) • Package metadata in XML • And why not: bug reports, annotations, emails, etc. • Goal: distribute & query & monitor & getPackage • Techno: ActiveXML + KadoP + Idip + BitTorrent
Conclusion • V1 of KadoP and EdosDistribution are running • Open-source • Management of XML resources in P2P • Management of semantic and web services as well • Based on active data (ActiveXML) and DHT (FreePastry) • Novelties in KadoP • Management of data and knowledge in P2P • Use of intensional information • Original optimization techniques • Future work • ANR Platform for content management: webContent
Thank you