AmbientDB: Relational Query Processing in a P2P Network. Peter Boncz and Caspar Treijtel. LEE BYUNGIL, PL Lab., Hongik University, 2004.11.14
Outline 1. Introduction 1.1 Goal 1.2 Assumptions 1.3 Example: Collaborative Filtering in a P2P Database 1.4 Overview 2. AmbientDB Architecture 2.1 Data Model 2.2 Query Execution in AmbientDB 2.3 Dataflow Execution 2.4 Executing the Collaborative Filtering Query 3. DHTs in AmbientDB 3.1 Example: Approximated Collaborative Filtering 4. Conclusion
1. Introduction (1) • AmbientDB • A new peer-to-peer (P2P) DBMS prototype • Developed at CWI (Centrum voor Wiskunde en Informatica) • Distributed over an ad-hoc P2P network • Global query algebra • Multi-wave stream processing plans • Ambient Intelligence (AmI) • Digital environments in which multimedia services are sensitive to people’s needs
Music Playlist Scenario • amP2P player • Log: meta-information • Homogeneous • Content: AmbientDB instance or external sources • Heterogeneous • AmbientDB • Its collection • Only meta-information
1.1 Goal • Full relational database functionality • Cooperate in an ad-hoc way with other AmbientDB devices • Propose • A general architecture for AmbientDB • Complex query processing in an ad-hoc P2P network
1.2 Assumptions (1) • Upscaling (flexibility) • The number of cooperating devices is potentially large • Home environment and ad-hoc P2P network • Downscaling • Devices often have few resources (CPU, memory, network, battery) • Schema integration • All devices operate under a common global schema • Data placement • Data placement is determined by the user • Network failure • Resilience of Chord • While a query runs, the routing tree stays intact
1.2 Assumptions (2) • Distributed database • A priori • Not in AmbientDB • Federated database • Static, heterogeneous schema integration • Mobile database • Centralized database server and client (mobile node) • P2P file sharing system • Non-centralized and ad-hoc topologies • Simple keyword text search
Example Music Schema • The global schema “AMP2P” in AmbientDB • distributed table • On the global level • The union of all horizontal fragments of these tables
1.3 Example: Collaborative Filtering in a P2P Database (1) • amP2P player • Access to a local content repository (digital music collection) • AmbientDB instance • Share all music content in the “home zone” • Only share the meta-information in the huge P2P network
1.3 Example: Collaborative Filtering in a P2P Database (2) • Memory-based implicit voting scheme • Predicted vote of the active user a for item j • v_{i,j} = the vote of user i on item j • w(a,i) = weight function defined on the active user a and user i • v̄_i = average vote of user i • k = normalizing factor • weight(user_a, user_i) • Number of times the example song has been fully played by user i • Refined form • Negative information: songs that were skipped
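The prediction formula itself did not survive on this slide. The sketch below assumes the usual memory-based form, p(a,j) = v̄_a + k · Σ_i w(a,i) · (v_{i,j} − v̄_i), which matches the symbols listed above. The function name `predict_vote`, the dictionary layout, and the choice k = 1 / Σ|w(a,i)| are assumptions made for illustration, not taken from the slides.

```python
def predict_vote(votes, weights, active, item):
    """Predict the active user's vote for `item`.

    votes:   {user: {item: vote}}   (hypothetical layout)
    weights: {user: w(active, user)}, e.g. how often that user fully
             played the example song
    """
    def mean_vote(user):
        values = list(votes[user].values())
        return sum(values) / len(values)

    # Only users with a known weight and a vote on `item` contribute.
    peers = [u for u in weights
             if u != active and item in votes.get(u, {})]
    total_weight = sum(abs(weights[u]) for u in peers)
    if total_weight == 0:
        # No usable neighbours: fall back to the active user's average.
        return mean_vote(active)

    k = 1.0 / total_weight  # assumed normalizing factor
    deviation = sum(weights[u] * (votes[u][item] - mean_vote(u))
                    for u in peers)
    return mean_vote(active) + k * deviation
```

With votes `{"a": {"x": 2, "y": 4}, "b": {"x": 1, "j": 3}, "c": {"x": 5, "j": 1}}` and equal weights for `b` and `c`, the deviations +1 and −2 average to −0.5, so the prediction for user `a` on item `j` is 2.5.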
1.4 Overview • General architecture • Include Data model • Query execution • Three-level query execution process • DHT (Distributed Hash Table) • Global table indices • Optimize the query • Related work & future work • Conclusion
2. AmbientDB Architecture • Distributed query processor • Executes queries on all ad-hoc connected devices • P2P protocol • Chord • Scalable lookup and routing scheme • P2P IP overlay networks made out of unreliable connections • Query node = root • A small number of connections per node • Simultaneous bi-directional communication and query processing • DHTs: global table indices • Local DB component • Local table • Embedded database • External data source: wrapper component (distributed database system) • Schema integration engine • Meta-data translation • Using view-based schema mappings
2.1 Data Model (1) • Standard relational data model & algebra as query language • Queries are formulated against global tables • Local node, a limited set of nodes, or all reachable nodes • Converging answer • Query locally • Re-issue iteratively over more nodes
2.1 Data Model (2) • Abstract Table • LT (Local Table) • Each node has a private schema • Global schema: global table T • All participating nodes Ni carry a table instance Ti • In the query node • Ti may be accessed as an LT • DT (Distributed Table) • Q: set of nodes that participate in some global query • The union of local table instances
2.1 Data Model (3) • PT (Partitioned Table) • Specialization of the DT • All participating tuples in each Ti are disjoint between all nodes • Advantage over DT • Exact query answers can often be computed in an efficient distributed fashion • By broadcasting a query and letting each node compute a local result without need for communication • Attaching a bitmap index Ti.Q to each local table Ti • “Virtual” column • #NODEID • Makes queries aware of the node on which tuples are located • Stored in a DT/PT • Location-specific query restrictions
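As a rough illustration of the data model above, a DT can be pictured as the union of the local table instances, with #NODEID filled in while scanning. This is a single-process simulation under assumed names (`scan_distributed`, `node_filter`); it is not AmbientDB's API.

```python
def scan_distributed(local_tables, node_filter=None):
    """Yield the rows of a DT as the union of all local tables Ti.

    local_tables: {node_id: [row_dict, ...]}  (hypothetical layout)
    node_filter:  optional set of node ids, standing in for a
                  location-specific restriction on #NODEID
    """
    for node_id, rows in local_tables.items():
        if node_filter is not None and node_id not in node_filter:
            continue
        for row in rows:
            # The virtual column is not stored; it is resolved on scan.
            yield {**row, "#NODEID": node_id}
```

Because the union keeps every local copy, the same song can appear once per node in a DT. A PT is the special case in which no tuple participates at more than one node.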
2.2 Query Execution in AmbientDB (1) • Three level translation • Abstract level • User query • Selection, join, aggregation, sort • Lists • (List<Type>) • List instances • <a,b,c> • Concrete level • Table parameters, return value • Partition, union • Execution level • Wave-plans
2.2 Query Execution in AmbientDB (2) • Starting at the leaves • Abstract query plan -> concrete • Concrete operators have a concrete result type • The process continues to the root of the query graph • Local result table, hence LT • Local concrete variant of all abstract operators • All tables -> LT • Concrete union • (T1) -> LT • More efficient alternative query plans
2.2 Query Execution in AmbientDB (3) • select, aggr, order support distributed execution (dist) • Execute in all nodes on their local partition (LT) of a PT or a DT • Produce again a distributed result (PT or DT) • Broadcast the query through the routing tree • The result is again dispersed over all nodes as a PT or DT • aggr_merge = aggr_local(union_merge(DT)) : LT • Reduces the fragments to be collected in the query node • Saves considerable bandwidth
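The bandwidth saving of the merge variant can be sketched by aggregating partial results on the way up the routing tree, so each link carries one partial aggregate instead of raw tuples. The example uses a COUNT per key; the tree layout and the name `aggregate_up` are assumptions for illustration.

```python
from collections import Counter


def aggregate_up(tree, local_rows, node):
    """Return the COUNT-per-key aggregate of the subtree rooted at `node`.

    tree:       {node: [child, ...]}  routing tree, root = query node
    local_rows: {node: [key, ...]}    grouping keys stored at each node
    """
    # Local aggregation over this node's own fragment.
    partial = Counter(local_rows.get(node, []))
    # Merge the already-aggregated partials coming up from the children.
    for child in tree.get(node, []):
        partial.update(aggregate_up(tree, local_rows, child))
    return partial
```

This works because COUNT (like SUM, MIN and MAX) can be merged from partial results; an aggregate such as a median would not decompose this way.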
2.2 Query Execution in AmbientDB (4) • join variants • Broadcast join (LT, T1) -> T1 • Foreign-key join (T1, DT) -> T1 • Referential integrity to minimize communication • Split join (LT1, T1) -> T1 • Reduces bandwidth consumption • O(T*N) -> O(T*log(N)) • partition • A special operator that performs duplicate elimination • Creates a PT from a DT by creating a tuple participation bitmap at all nodes • To be able to use the dist operators • We should convert a DT to a PT
2.3 Dataflow Execution (1) • Query processing paradigm • A routing tree using TCP connections is used to pass bi-directional tuple streams • Multiple such waves run simultaneously (upward and downward) • Third translation phase • Concrete query plan -> wave-plans • Concrete operator • One or more waves (local dataflow algebra operators)
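A minimal sketch of what the partition operator achieves, assuming a simple "first node in sorted order wins" rule: every distinct tuple ends up participating at exactly one node. The real operator runs distributed over the routing tree; this centralized loop and the name `partition` as a Python function are only illustrative.

```python
def partition(local_tables):
    """Build one participation bitmap per node, turning a DT into a PT.

    local_tables: {node_id: [tuple, ...]}  rows must be hashable
    Returns {node_id: [bool, ...]}; True means the tuple participates.
    """
    seen = set()
    bitmaps = {}
    for node_id in sorted(local_tables):   # assumed tie-breaking order
        bits = []
        for row in local_tables[node_id]:
            bits.append(row not in seen)    # first occurrence wins
            seen.add(row)
        bitmaps[node_id] = bits
    return bitmaps
```

No tuple is moved or deleted; duplicates are only masked out by the bitmap, after which the dist operators can compute exact answers locally.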
2.3 Dataflow Execution (2) • dist plans for select, aggr, order and foreign-key join • buffer-to-buffer local operator in each node, without further communication • broadcast join • Propagates a tuple wave through the network • split • Split(<true,true>,<c1,c1>) • Ordered -> effectively forming a DT/PT • scan-select, quick-sort, merge-join, heap-based top-N, ordered aggregation • All stream-based • Require little memory
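To illustrate why a stream-based operator needs little memory, here is a sketch of a heap-based top-N over a tuple stream using Python's standard `heapq` module. It keeps at most N entries regardless of stream length. The function name `top_n` and its `key` parameter are assumptions for this example.

```python
import heapq


def top_n(stream, n, key=lambda item: item):
    """Return the n items with the largest key, best first."""
    if n <= 0:
        return []
    heap = []  # min-heap of (key, arrival number, item), size <= n
    for seq, item in enumerate(stream):
        entry = (key(item), seq, item)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # Evict the current smallest of the kept items.
            heapq.heapreplace(heap, entry)
    # Largest key first; earlier arrival breaks ties.
    ordered = sorted(heap, key=lambda e: (-e[0], e[1]))
    return [item for _, _, item in ordered]
```

The arrival number in each entry keeps the heap from ever comparing the items themselves, so rows of any type can be streamed through.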
2.4 Executing the Collaborative Filtering Query (3) • Problems • Query 1 • Large list of all users that have ever listened to the example song • Hogs resources from all nodes in the network • Query 2 • Basically sends all log records to the query node for aggregation • Can be done more efficiently in an AmbientDB enriched with DHTs
3. DHTs in AmbientDB (1) • Useful lookup structures for large-scale P2P applications • Reduce the number of nodes involved in answering a query • Involving many nodes • Decreases query performance • Creates an overload in the average query frequency • Gnutella (does not use a DHT or global indices) • Easy to locate popular music • Difficult to locate less well-known songs
3. DHTs in AmbientDB (2) • To enable the query optimizer to automatically accelerate selection queries using such DHTs • DHT indices can be exploited by a query optimizer to accelerate lookup queries • Special form of a PT, as the partitions are disjoint • select_chord(DHT) : LT • Dataflow level • Route a message to the Chord finger to which the selection key value hashes • Retrieve all corresponding tuples as an LT via a direct TCP/IP transfer • Non-complete index
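The lookup idea can be sketched with a toy hash ring: keys and nodes hash onto the same identifier space, and each key is stored at the first node at or after its hash. This is a simplification with assumed names (`ToyDHT`, `ring_hash`); it finds the owner directly and leaves out Chord's finger tables and multi-hop routing.

```python
import bisect
import hashlib

RING_BITS = 16  # assumed small identifier space for the example


def ring_hash(value):
    """Map a node name or key onto the ring with SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class ToyDHT:
    def __init__(self, node_names):
        self.ring = sorted((ring_hash(n), n) for n in node_names)
        self.store = {n: {} for n in node_names}

    def owner(self, key):
        """First node clockwise from the key's position on the ring."""
        points = [point for point, _ in self.ring]
        idx = bisect.bisect_left(points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def insert(self, key, row):
        self.store[self.owner(key)].setdefault(key, []).append(row)

    def select(self, key):
        """Selection on the indexed key touches one node only."""
        return list(self.store[self.owner(key)].get(key, []))
```

A selection such as "all log rows for this song" then contacts a single node instead of being broadcast to every node in the network.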
3.1 Example: Approximated Collaborative Filtering (1) • HISTO • Static histogram of fully-listened-to songs per user • Reduce the histogram computation cost of query
4. Conclusion • Full query processing architecture • Executing queries in a declarative, optimizable language, over an ad-hoc P2P network • DHT • Efficient global indices