Presentation overview PIER • Core functionality and design principles • Distributed join example • CAN high level functions • Application, PIER, DHT/CAN in detail • Distributed join / operations in detail • Simulation results • Conclusion
What is PIER? • Peer-to-Peer Information Exchange and Retrieval • A distributed query engine designed for widely distributed environments • A "general data retrieval system" that can use any data source • PIER internally uses a relational data model • Read-only system
PIER overview (architecture diagram): Tier 1 – Applications (network monitoring, other user apps); Tier 2 – PIER (query optimizer, core relational execution engine, catalog manager, wrapper); Tier 3 – DHT (storage manager, overlay routing), based on the CAN overlay network
Relaxed Consistency • Brewer states that a distributed data system can have only two of the following three properties: • (C)onsistency • (A)vailability • Tolerance of network (P)artitions • PIER prioritizes A and P and sacrifices C, i.e. it delivers best-effort results • Detailed in the distributed join part
Scalability • Scalability – the amount of work scales with the number of nodes • The network can grow easily • Robust • PIER does not require a priori allocation of resources
Data sources (diagram: source → wrapper → tuple) • Data remains in its original source • The source could be anything: a file system, a live feed from a process, etc. • Wrappers or gateways have to be provided
Standard schemas • Design goal for the application layer: adopt the schemas of popular software (e.g. tcpdump) as the relations • Pro: bypasses a standardization process • Con: limited by what current applications provide
PIER is independent of the DHT; currently it uses CAN. • Currently multicast is used; other strategies are possible (further explained in the DHT part)
Presentation overview PIER • Core functionality and design principles • Distributed join example • CAN high level functions • Application, PIER, DHT/CAN in detail • Distributed join / operations in detail • Simulation results • Conclusion
DHT based distributed join: example • Distributed execution of relational database query operations, e.g. join, is the core functionality of the PIER system. • A distributed join is a relational database join performed, to some degree in parallel, by a number of processors (machines) that hold different parts of the relations being joined. • Perspective: generic, intelligent keyword-based search built on distributed database query operations (e.g. Google-like). • The following example illustrates what PIER can do, and thus its main purpose, by means of a distributed join based on a DHT. Details of how, and of which layer does what, are provided later.
DHT based distributed join example: relational database join proper (1/2)
DHT based distributed join example: relational database join proper (2/2)
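The slides above present the join only as figures. As a minimal, hypothetical Python sketch (not the PIER implementation), the idea can be made concrete as follows: tuples of both relations are rehashed into the DHT on the join attribute, so matching tuples of R and S land on the same node, which can then join them locally. The dht object and its put() signature are assumed to follow the provider API described later; the attribute names (cpr, name, address) come from the example query used later in the slides.

```python
# Hypothetical sketch of a DHT-based rehash join -- not the actual PIER code.
# Assumes a dht object with put(namespace, resourceID, item, lifetime),
# roughly as in the provider API described later.

def publish_for_join(dht, namespace, relation_name, tuples, join_attr, lifetime=60):
    """Rehash every tuple into the DHT under its join-attribute value, so that
    R and S tuples with the same cpr end up at the same node."""
    for t in tuples:
        dht.put(namespace, t[join_attr], (relation_name, t), lifetime)

def local_join(items):
    """Executed at each node: join the R and S tuples that share a resourceID."""
    r_tuples = [t for (rel, t) in items if rel == "R"]
    s_tuples = [t for (rel, t) in items if rel == "S"]
    return [{"cpr": r["cpr"], "name": r["name"], "address": s["address"]}
            for r in r_tuples for s in s_tuples if r["cpr"] == s["cpr"]]
```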
Presentation overview PIER • Core functionality and design principles • Distributed join example • CAN high level functions • Application, PIER, DHT/CAN in detail • Distributed join / operations in detail • Simulation results • Conclusion
CAN • CAN is a DHT • Basic operations on CAN are insertion, lookup and deletion of (key,value) pairs • Each CAN node stores a chunk (called zone) of the entire hash table
Every node also holds information about a small number of "adjacent" zones in the table • Requests (insert, lookup, delete) for a particular key are routed by intermediate CAN nodes towards the CAN node that contains the key
Design centers around a virtual d-dimensional Cartesian coordinate space on a d-torus • The coordinate space is completely virtual and has no relation to any physical coordinate system • The key space is probably SHA-1 • Key k1 is mapped onto a point p1 using a uniform hash function; (k1,v1) is stored at the node Nx that owns the zone containing p1
Retrieving a value for a given key • Apply the deterministic hash function to map the key onto a point P, then retrieve the corresponding value from the node owning P • One hash function per dimension • get(key) -> value: lookup(key) -> ip address; ip.retrieve(key) -> value
Storing (key,value) • The key is deterministically mapped onto a point P in the coordinate space using a uniform hash function • The corresponding (key,value) pair is then stored at the node that owns the zone within which the point P lies • put(key,value): lookup(key) -> ip address; ip.store(key,value)
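As a minimal illustration (not from the slides), put/get can be layered on the lookup primitive like this; the per-dimension hash, the can.lookup call and the node-side store/retrieve methods are assumptions for this sketch.

```python
import hashlib

def key_to_point(key, dims=2, side=16):
    """Map a key onto a point in the d-dimensional coordinate space,
    using one deterministic hash function per dimension."""
    return tuple(
        int(hashlib.sha1(f"{d}:{key}".encode()).hexdigest(), 16) % side
        for d in range(dims)
    )

def put(can, key, value):
    point = key_to_point(key)
    node = can.lookup(point)   # route to the node owning the zone containing point
    node.store(key, value)     # hypothetical node-side call

def get(can, key):
    point = key_to_point(key)
    node = can.lookup(point)
    return node.retrieve(key)  # hypothetical node-side call
```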
Routing • A CAN node maintains a routing table that holds the IP address and virtual coordinate zone of each of its immediate neighbors in the coordinate space • Using its neighbor coordinate set, a node routes a message towards its destination by simple greedy forwarding to the neighbor whose coordinates are closest to the destination coordinates
Routing example (figure): 2-d coordinate space from (0,0) to (16,16); data with key hashed to point (15,14) • Each node maintains a routing table with its neighbors • The message follows the straight-line path through the Cartesian space
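A toy sketch of the greedy forwarding rule just described; all attributes on the node objects (zone_contains, neighbours, centre, deliver) are hypothetical.

```python
def route(message, dest_point, node):
    """Greedy CAN forwarding: repeatedly hand the message to the neighbour
    whose zone centre is closest to the destination point."""
    while not node.zone_contains(dest_point):           # hypothetical predicate
        node = min(node.neighbours,                      # entries of the routing table
                   key=lambda n: sum((a - b) ** 2        # squared Euclidean distance
                                     for a, b in zip(n.centre, dest_point)))
    node.deliver(message)                                # destination zone reached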
Node Joining • New node must find a node already in the CAN • Next, using the CAN routing mechanisms, it must find a node whose zone will be split • Neighbors of the split zone must be notified so that routing can include the new node
CAN: construction (figure sequence) • 1) The new node discovers some node "I" already in the CAN • 2) The new node picks a random point (x,y) in the space • 3) I routes to (x,y) and discovers node J, the owner of the zone containing (x,y) • 4) J's zone is split in half; the new node owns one half (a minimal sketch of these steps follows below)
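A compact sketch of the four construction steps above; every method on the node objects (lookup, split_in_half, hand_over_keys, take_keys, notify_neighbours) is hypothetical and only meant to make the sequence concrete.

```python
import random

def can_join(new_node, bootstrap, side=16, dims=2):
    """Sketch of CAN node join following steps 1-4 above."""
    # 1) new_node already knows `bootstrap`, a node that is in the CAN
    # 2) pick a random point in the coordinate space
    p = tuple(random.uniform(0, side) for _ in range(dims))
    # 3) route to p and discover J, the owner of the zone containing p
    j = bootstrap.lookup(p)
    # 4) split J's zone in half; the new node takes over one half
    new_zone, kept_zone = j.zone.split_in_half()
    new_node.zone, j.zone = new_zone, kept_zone
    new_node.take_keys(j.hand_over_keys(new_zone))   # move (key,value) pairs
    j.notify_neighbours(new_node)                    # routing tables now include new_node
```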
Node departure • A node explicitly hands over its zone and the associated (key,value) database to one of its neighbors • In case of network failure this is handled by a take-over algorithm • Problem: the take-over mechanism does not regenerate the lost data • Solution: every node keeps a backup of its neighbours' data
Presentation overview PIER • Core functionality and design principles • Distributed join example • CAN high level functions • Application, PIER, DHT/CAN in detail • Distributed join / operations in detail • Simulation results • Conclusion
Querying the Internet with PIER (PIER = Peer-to-Peer Information Exchange and Retrieval)
What is a DHT? • Take an abstract ID space and partition it among a changing set of computers (nodes) • Given a message with an ID, route the message to the computer currently responsible for that ID • Messages can be stored at the nodes • This is like a "distributed hash table" • Provides a put()/get() API
Figure: routing in a 2-d ID space from (0,0) to (16,16) — a message with key (15,14) is routed to the node currently responsible for that ID
Lots of effort is put into making DHTs better: • Scalable (thousands to millions of nodes) • Resilient to failure • Secure (anonymity, encryption, etc.) • Efficient (fast access with minimal state) • Load balanced • etc.
Declarative Queries • SELECT R.cpr, R.name, S.address FROM R, S WHERE R.cpr = S.cpr • Flow (figure): declarative query → query plan → overlay network → physical network (overlay based on CAN)
Applications • Any distributed relational database application • Feasible applications: network monitoring, intrusion detection, fingerprint queries, CPU load monitoring
DHTs • Implemented with CAN (Content Addressable Network) • A node is identified by a rectangular zone in d-dimensional space • A key is hashed to a point and stored at the corresponding node • A routing table of neighbours is maintained: O(d) entries
DHT Design • Routing layer – mapping for keys (dynamic as nodes join and leave) • Storage manager – DHT-based data • Provider – storage access interface for higher levels
DHT – Routing • The routing layer maps a key onto the IP address of the node currently responsible for that key • It provides exact-match lookups and calls back the higher levels when the set of keys a node is responsible for changes • Routing layer API: lookup(key) -> ipaddr (asynchronous); join(landmarkNode), leave(), locationMapChange() (synchronous, local node)
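A hypothetical interface sketch of the routing layer API listed above; the method names follow the slide, but the signatures are illustrative and not taken from the PIER code.

```python
from abc import ABC, abstractmethod

class RoutingLayer(ABC):
    """Interface sketch following the API names on the slide."""

    @abstractmethod
    def lookup(self, key, callback):
        """Asynchronous: resolve key to the IP address of the responsible node
        and invoke callback(ipaddr) when the answer is available."""

    @abstractmethod
    def join(self, landmark_node):
        """Synchronous, local node: join the overlay via a known landmark node."""

    @abstractmethod
    def leave(self):
        """Synchronous, local node: leave the overlay."""

    @abstractmethod
    def locationMapChange(self, callback):
        """Register a callback fired when the set of keys this node
        is responsible for changes."""
```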
DHT – Storage • The storage manager stores and retrieves records, which consist of key/value pairs • Keys are used to locate items and can be any supported data type or structure • Storage manager API: store(key, item); retrieve(key) -> items (structure/iterator); remove(key)
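A minimal in-memory sketch of the storage manager API above; this toy version is illustrative only, not the PIER implementation.

```python
class StorageManager:
    """Toy multimap from key to stored items, mirroring store/retrieve/remove."""

    def __init__(self):
        self._items = {}                        # key -> list of items

    def store(self, key, item):
        self._items.setdefault(key, []).append(item)

    def retrieve(self, key):
        return iter(self._items.get(key, []))   # items returned as an iterator

    def remove(self, key):
        self._items.pop(key, None)
```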
DHT – Provider (1) • The provider ties the routing and storage manager layers together and provides the interface used by higher levels • Each object in the DHT has a namespace, a resourceID and an instanceID • DHT key = hash1(namespace, resourceID) + ... + hashN(namespace, resourceID), where N depends on the dimension (on CAN the key space is 0..2^160) • namespace – application or group of objects (table or relation) • resourceID – primary key or any attribute of the object • instanceID – an integer used to separate items with the same namespace and resourceID • lifetime – item storage duration, in adherence to the principle of relaxed consistency • CAN's mapping of resourceID/object is equivalent to an index
DHT – Provider (2) • Provider API: get(namespace, resourceID) -> item; put(namespace, resourceID, item, lifetime); renew(namespace, resourceID, instanceID, lifetime) -> bool; multicast(namespace, resourceID, item); lscan(namespace) -> items (structure/iterator); newData(namespace, item)
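To make the key construction and the soft-state lifetime concrete, here is a hypothetical sketch that reuses the StorageManager sketch above; routing the key to the owning node, instanceIDs, renew(), multicast(), lscan() and newData() are omitted for brevity.

```python
import hashlib
import time

def dht_key(namespace, resource_id, dims=2):
    """Illustrative key construction: one hash of (namespace, resourceID)
    per CAN dimension, combined into a point in the coordinate space."""
    return tuple(
        int(hashlib.sha1(f"{d}:{namespace}:{resource_id}".encode()).hexdigest(), 16)
        for d in range(dims)
    )

class Provider:
    """Illustrative provider tying routing and storage together.
    Routing of the key to the owning node is omitted in this sketch."""

    def __init__(self, storage):
        self.storage = storage          # e.g. the StorageManager sketch above

    def put(self, namespace, resource_id, item, lifetime):
        # Soft state: record an expiry time so the item disappears unless it
        # is renewed -- in line with the relaxed-consistency principle.
        key = dht_key(namespace, resource_id)
        self.storage.store(key, (item, time.time() + lifetime))

    def get(self, namespace, resource_id):
        key = dht_key(namespace, resource_id)
        now = time.time()
        return [item for (item, expiry) in self.storage.retrieve(key) if expiry > now]
```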