A Physical Query Algebra for DHT-based P2P Systems
Kai-Uwe Sattler (1), Philipp Rösch (1), Erik Buchmann (2), Klemens Böhm (2)
(1) Department of Computer Science and Automation, TU Ilmenau
(2) Department of Computer Science, University of Magdeburg
Distributed Hash Tables
• Examples: CAN, CHORD, PASTRY, etc.
• Advantages of P2P systems, e.g.,
  • No single point of failure (SPOF), shared infrastructure costs, censorship resistance
• Manage huge sets of (key, value) pairs
• Cope with large numbers of parallel transactions
• Efficient query processing:
  • Greedy forward routing
  • But only simple exact-match queries on unstructured data sets
Extended Queries in DHT
• Some extensions:
  • Trigrams: text retrieval, e.g., beethoven: bee eet eth tho hov ove ven
  • Bloom filters: hash-based AND
  • Feature vectors: multimedia documents
• But:
  • Extensions are application-specific
  • No universal query algebra
• Idea:
  • Relational data sets, SQL-like queries
• Applications: management of genome data, semantic web, distributed indexes
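The trigram extension can be illustrated with a short Python sketch. It only shows how a word is split into overlapping three-character keys, as in the beethoven example; the function name `trigrams` is hypothetical, and how the resulting keys are stored in the DHT is not covered here.

```python
def trigrams(word: str) -> list[str]:
    """Return all overlapping 3-character substrings of a word."""
    # A word with fewer than 3 characters yields an empty list.
    return [word[i:i + 3] for i in range(len(word) - 2)]


if __name__ == "__main__":
    print(trigrams("beethoven"))
```

Each trigram could then serve as a separate exact-match key, which is why this extension fits a plain put/get interface.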
Relational Data in DHT?
• Storing relational data in DHT
  • Fragmentation scheme?
  • Accessing secondary keys?
• Support for SQL-like query processing
  • Distribution scheme for complex queries?
  • Join operations?
  • Full-table scan without flooding?
• Exploiting the P2P nature
  • No central instance, no global knowledge
  • Parallel processing
  • Problems with availability and failures
Outline of Our Approach
• Use Content-Addressable Networks (CAN)
• Locality-aware hash function
  • Preserving neighborhood of similar tuples
  • Space-filling curve
• API extension
  • Multicast
  • Temporary re-hashing
• Distributed query plan operators (POP)
  • Selection, join, grouping/aggregation
  • POP distribution scheme
Content-Addressable Networks
• Proposed by S. Ratnasamy et al. (2001)
• Keys: d-dimensional points
• Key space is a torus in d dimensions
• Example: d = 2
Zones and Neighbors in CAN
• Each peer is responsible for one zone, i.e., it stores all (key, value) pairs of the zone
• Each peer knows the neighbors of its zone
• Random assignment of peers to zones at startup
• Overloading of zones, multiple realities, ...
Greedy Forward Routing in CAN
• get(k):
  • Forward the request to the neighbor whose zone is closest to k
  • Repeat until the peer responsible for k is reached
[Figure: a get(k) request is forwarded from zone to zone until it reaches the peer storing (k, v)]
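The routing rule above can be sketched in Python under strong simplifying assumptions that are not from the slides: d = 2, the unit torus is split into a fixed N x N grid of equal zones, each zone has exactly four neighbors, and "closest" is measured from the zone center. All names (`N`, `zone_of`, `route`, ...) are hypothetical.

```python
N = 4  # assumption: the unit torus is split into N x N equal zones


def zone_of(point):
    """Zone (column, row) that contains a point of [0, 1) x [0, 1)."""
    return (int(point[0] * N), int(point[1] * N))


def center(zone):
    """Center point of a zone."""
    return ((zone[0] + 0.5) / N, (zone[1] + 0.5) / N)


def torus_dist(p, q):
    """Euclidean distance on the unit torus (coordinates wrap around)."""
    total = 0.0
    for a, b in zip(p, q):
        d = abs(a - b)
        total += min(d, 1.0 - d) ** 2
    return total ** 0.5


def neighbors(zone):
    """The four zones sharing an edge with this zone, with wrap-around."""
    x, y = zone
    return [((x + 1) % N, y), ((x - 1) % N, y), (x, (y + 1) % N), (x, (y - 1) % N)]


def route(start_zone, key_point):
    """Greedy forwarding: hop to the neighbor closest to the key until its owner is reached."""
    target = zone_of(key_point)
    path = [start_zone]
    while path[-1] != target:
        # A peer knows its neighbors' zones, so a neighbor that owns the key
        # is preferred; otherwise the neighbor with the closest center wins.
        nxt = min(
            neighbors(path[-1]),
            key=lambda z: (z != target, torus_dist(center(z), key_point)),
        )
        path.append(nxt)
    return path


if __name__ == "__main__":
    print(route((0, 0), (0.9, 0.1)))  # uses the wrap-around of the torus
    print(route((0, 0), (0.6, 0.6)))
```

A real CAN has zones of different sizes and a varying number of neighbors per peer; the sketch only illustrates the forwarding loop.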
Managing Relational Data: Simple Approach
• Relation r ∈ R, tuple t ∈ r, t = {a_k, a_1, ..., a_n}, key k' = h(a_k)
• Problems:
  • Tuples are irregularly disseminated over the key space, i.e., only exact-match queries are supported, e.g., σ_{5<a_k<10}(r)?
  • No search for attributes other than the primary key, e.g., σ_{a_b=20}(r)?
[Figure: tuples scattered irregularly over the key space]
Fragmentation Scheme
• Reverse bit interleaving (z-curve)
• Tuple t ∈ r, t = {a_k, a_1, ..., a_n}
• Two hash functions: key k' = h_r(r) ∘ h_k(a_k) (instead of k' = h(a_k))
[Figure: the bit strings h_r(RelationID) and h_k(Key) are concatenated and their bits are distributed over Dimension #1 and Dimension #2; example point (1,2)]
Two Hash Functions
• Key k' = h_r(r) ∘ h_k(a_k)
• h_r(r): the RelationID determines the placement of the space-filling curve
• h_k(a_k): the primary key determines the position on the curve (locality-awareness)
[Figure: z-curves of relations ra, rb, rc placed in the key space; positions a_k = 0, 1, 2, 3, 4, ... along a curve]
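A minimal Python sketch of the two-part key and the reverse bit interleaving, for d = 2. The bit widths, the use of SHA-1 for h_r, and the identity mapping for h_k are assumptions made for illustration; the slides do not specify the concrete hash functions.

```python
import hashlib

R_BITS = 8  # assumption: width of the relation-ID part
K_BITS = 8  # assumption: width of the primary-key part


def h_r(relation: str) -> int:
    """Hash a relation name to R_BITS bits (assumed: first byte of SHA-1)."""
    return hashlib.sha1(relation.encode("utf-8")).digest()[0]


def h_k(a_k: int) -> int:
    """Order-preserving key hash (assumed: identity on a small integer domain)."""
    if not 0 <= a_k < (1 << K_BITS):
        raise ValueError("primary key outside the assumed domain")
    return a_k


def z_value(relation: str, a_k: int) -> int:
    """Concatenate both parts: h_r selects the curve, h_k the position on it."""
    return (h_r(relation) << K_BITS) | h_k(a_k)


def deinterleave(z: int, bits: int = R_BITS + K_BITS) -> tuple[int, int]:
    """Distribute the bits of z alternately over two dimensions."""
    x = y = 0
    for i in range(bits):
        bit = (z >> i) & 1
        if i % 2 == 0:
            x |= bit << (i // 2)  # even bit positions go to dimension 1
        else:
            y |= bit << (i // 2)  # odd bit positions go to dimension 2
    return x, y


if __name__ == "__main__":
    for a_k in range(4):
        print(a_k, deinterleave(z_value("r", a_k)))
```

Because h_k preserves order, tuples with neighboring primary keys get neighboring z-values and therefore tend to land in the same or adjacent zones.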
Additional API Primitives
• Standard operations: put(k, v), v = get(k)
• Only two additional operations are needed for our query algebra: put_temp(), multicast()
put_temp(k, v, t)
• Re-hashing of a given relation
• Temporary put operation
• Allows indexed access to attributes other than the primary key
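The idea behind put_temp can be sketched in Python as follows. The slides do not say what the third parameter t means; the sketch assumes it is a lifetime after which the temporary entry expires. The class `TempStore`, the helper `rehash`, the explicit `now` argument, and the use of the attribute value itself as the key are all hypothetical simplifications.

```python
class TempStore:
    """In-memory stand-in for the temporary entries held by the DHT."""

    def __init__(self):
        self._entries = {}  # key -> list of (value, expires_at)

    def put_temp(self, k, v, t, now):
        """Store v under k until time now + t (assumed meaning of t)."""
        self._entries.setdefault(k, []).append((v, now + t))

    def get_temp(self, k, now):
        """Return all values under k that have not expired at time now."""
        live = [(v, exp) for v, exp in self._entries.get(k, []) if exp > now]
        if live:
            self._entries[k] = live
        else:
            self._entries.pop(k, None)
        return [v for v, _ in live]


def rehash(store, relation, attr, t, now):
    """Re-insert every tuple under the value of a non-key attribute."""
    for tup in relation:
        store.put_temp(tup[attr], tup, t, now)


if __name__ == "__main__":
    r = [{"ak": 1, "ab": 20}, {"ak": 2, "ab": 30}, {"ak": 3, "ab": 20}]
    store = TempStore()
    rehash(store, r, "ab", t=10, now=0)
    print(store.get_temp(20, now=5))
```

After re-hashing, a lookup on the non-key attribute becomes an exact-match access again, at the cost of temporarily duplicating the relation.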
Additional API Primitives (Cont.)
multicast(zmin, zmax, POP)
• Sends a message to a group of peers
• Peers are identified by an interval of the z-curve
Example: σ_{3<a_k<6}(r)
  multicast(3, 6, POP)
  send(σ_{a_k=3}), send(σ_{4<a_k<6})
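A simplified Python sketch of the multicast primitive. It assumes that every peer owns one contiguous, inclusive interval of z-values and that the operator is a plain callable receiving the part of the requested interval that falls into the peer's zone; neither assumption comes from the slides, and the peer names are made up.

```python
def multicast(peers, zmin, zmax, pop):
    """Deliver pop to every peer whose z-interval overlaps [zmin, zmax].

    peers maps a peer name to its inclusive z-interval (lo, hi).
    Each addressed peer runs pop on the overlap of both intervals.
    """
    results = {}
    for name, (lo, hi) in peers.items():
        if lo <= zmax and hi >= zmin:  # the intervals overlap
            results[name] = pop(max(lo, zmin), min(hi, zmax))
    return results


if __name__ == "__main__":
    peers = {"P0": (0, 3), "P1": (4, 7), "P2": (8, 15)}
    # The operator here simply reports the sub-interval it was asked to handle.
    print(multicast(peers, 3, 6, lambda lo, hi: (lo, hi)))
```

Only the peers inside the interval are contacted, which is what distinguishes this primitive from flooding the whole network.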
Query Plan Operators (POP)
• Hash-based implementation for selection, join, grouping, aggregation
• Distributed query processing
• Operator trees
[Figure: operator tree over the relations R, S, and T]
Selection
• Selection POP
• On the primary key:
  • Example: σ_{3<a_k<6}(r)
  • Determine the interval on the z-curve
  • Send the selection operator via multicast
• On other attributes:
  • Example: σ_{3<a_5<6}(r)
  • Perform a full-table scan, e.g., multicast(min(a_5), max(a_5), POP)
Join
• Nested Loop Join POP, Symmetric Hash Join POP
• On the primary key:
  • Perform the join immediately
• On other attributes:
  • Re-hash the relation using put_temp first
  • Perform the join as above
Example: Symmetric Hash Join
[Figure: tuples of R (fragments R1, R2) are re-hashed with put_temp(h(tR), tR, x) and tuples of S (fragments S1, S2) with put_temp(h(tS), tS, x); shjoin(R,S) operators produce the partial results RS1 and RS2]
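For reference, the local part of a symmetric hash join can be sketched in Python as below. This is the textbook algorithm (one hash table per input, each arriving tuple is inserted into its own table and probed against the other), not the authors' POP implementation; the distribution step via put_temp is left out, and the function name and stream format are hypothetical.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_r, key_s):
    """Join tuples of R and S as they arrive, in any interleaving.

    stream yields ("R", tuple) or ("S", tuple) pairs.
    key_r and key_s extract the join attribute from a tuple.
    Yields (r_tuple, s_tuple) pairs as soon as both sides are known.
    """
    table_r = defaultdict(list)
    table_s = defaultdict(list)
    for side, tup in stream:
        if side == "R":
            k = key_r(tup)
            table_r[k].append(tup)            # insert into own table
            for match in table_s.get(k, []):  # probe the other table
                yield tup, match
        else:
            k = key_s(tup)
            table_s[k].append(tup)
            for match in table_r.get(k, []):
                yield match, tup


if __name__ == "__main__":
    arrivals = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")),
                ("R", (2, "b")), ("R", (1, "c"))]
    first = lambda t: t[0]
    for pair in symmetric_hash_join(arrivals, first, first):
        print(pair)
```

Because neither input has to be complete before results appear, the operator is non-blocking, which suits a setting where tuples arrive from many peers in no fixed order.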
Sorting/Aggregation
• Central grouping POP:
  • One peer iterates over the z-curve and performs central sorting/aggregation
• Hash group POP:
  • Re-distribute the relation using a hash function on the attribute to be sorted/aggregated
  • "Aggregation peers" are responsible for sorting/aggregation of incoming attribute values
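The hash group idea can be sketched in Python for one concrete aggregate. The sketch assumes a SUM aggregate, a fixed number of aggregation peers, and Python's built-in `hash` as the distribution function; all of these are illustrative choices, not details from the slides.

```python
from collections import defaultdict


def hash_group_sum(tuples, group_attr, value_attr, num_peers):
    """Route each tuple to an aggregation peer by hashing its group value.

    Returns one dict per peer, mapping group value -> running sum.
    """
    peers = [defaultdict(int) for _ in range(num_peers)]
    for t in tuples:
        group = t[group_attr]
        # All tuples of one group reach the same peer, so each peer
        # can aggregate its groups without talking to the others.
        peers[hash(group) % num_peers][group] += t[value_attr]
    return peers


if __name__ == "__main__":
    data = [{"g": "a", "v": 1}, {"g": "b", "v": 2}, {"g": "a", "v": 3}]
    for i, partial in enumerate(hash_group_sum(data, "g", "v", 3)):
        print(i, dict(partial))
```

Compared with the central grouping POP, the work is spread over several peers, at the price of re-distributing the relation first.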
Query Evaluation
• Input
  • Left-handed POP trees
• Design principles
  • Stateless evaluation
  • Blocking operations: delivery of intermediate data (early aggregation)
[Figure: operator tree over the relations R, S, and T]
Query Evaluation: Example
[Figure: query evaluation example; labels: relations r1, r2, r2a, r2b, rr, rra, rrb; peers P0 to P5]
Conclusion
• Current state:
  • Prototype is fully implemented
  • Execution of plans like (shjoin a1=a2 (scan a3>42 REL1) (scan REL2))
  • First experiments in a small CAN (100 peers) are promising
Conclusion (cont.)
• Future topics:
  • Experiments with large data sets and many nodes (100,000 nodes, 10 million queries, test data from the TPC-H benchmark)
  • Optimization of the different POP implementations
  • Efficient range queries
  • Dynamic query operations