Searching and Data Sharing in P2P Systems Beng Chin Ooi Department of Computer Science National University of Singapore ooibc@comp.nus.edu.sg www.comp.nus.edu.sg/~ooibc
Acknowledgement • A few ppt slides are borrowed/adapted from Hellerstein’s group and his vldb-04 tutorial slides • Some are screen dumps as examples
What is P2P? [Figure: Client-Server Architecture vs. Peer-to-Peer Architecture]
P2P Systems? • Effective use of the Internet: connected PCs/workstations participate directly in the Internet • Sites are autonomous • Similar functionalities and responsibilities • Each peer consumes and serves • Resources are distributed
Driving Forces • Main driving forces: • Exploiting existing resources • Computational efficiency is not the main goal • Sharing costs among users • Autonomy • Anonymity • Legal protection
P2P Systems “A class of applications that takes advantage of resources like storage, CPU cycles, content and even human presence available at the edges of the Internet” -- Clay Shirky, an investment advisor
P2P Applications • Groove • P2P Messenger • SETI@home • Folding@home • Upriser • Freenet
Properties of P2P Applications? • Dynamic and Self-Organizing • Enduring • Resilient • Collaborative
P2P Future • Aberdeen Group’s prediction: • US$930 million by end 2004 • From US$20.6 at end of 2000 • Standardization • NPI (New Productivity Initiative) • Peer-to-Peer Working Group (P2PWG) • NAT, Taxonomy, Security, File Services, Interoperability
Overlay Networks • P2P applications need to: • Track identities & (IP) addresses of peers • May be many! • May have significant churn • Best not to have n² ID references • Route messages among peers • If you don’t keep track of all peers, this is “multi-hop” • This is an overlay network • Peers are doing both naming and routing • IP becomes “just” the low-level transport • All the IP routing is opaque • Control over naming and routing is powerful • And as we’ll see, brings networks into the database era
Infecting the Network, Peer-to-Peer • The Internet is hard to change. • But Overlay Nets are easy! • P2P is a wonderful “host” for infecting network designs • The “next” Internet is likely to be very different • “Naming” is a key design issue today • Querying and data independence key tomorrow? • Don’t forget: • The Internet was originally an overlay on the telephone network • There is no money to be made in the bit-shipping business • A modest goal for DB research: • Don’t query the Internet.
The Evolution of P2P systems • First generation – centralized P2P systems • E.g. Napster, SETI@home • Second generation –decentralized & unstructured P2P systems • E.g. Gnutella • Third generation—structured P2P systems • DHT systems (CAN/Chord/Pastry/Tapestry) • Skip-list based systems • ….
Unstructured P2P Systems • P2P with Central Servers • P2P with fully Autonomous Peers (pure p2p) • P2P with Superpeers (SuperNodes)
Unstructured Centralized P2P Systems -- Napster • [Figure: peer A asks the Directory Server “Who has X?”; the server answers “B has X”; A sends “Get X” to B; B replies with X] • Searching is efficient, with only a few messages exchanged; • Non-scalable, a central point of failure;
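The Napster-style exchange on this slide (ask the directory, then fetch from the peer it names) can be sketched in a few lines. This is a minimal illustration only: `DirectoryServer`, `Peer`, `publish`, and `who_has` are hypothetical names, and the real Napster protocol is not modeled.

```python
from collections import defaultdict


class DirectoryServer:
    """Hypothetical central index: maps a file name to the peers that hold it."""

    def __init__(self):
        self.index = defaultdict(set)

    def publish(self, peer_id, filename):
        self.index[filename].add(peer_id)

    def who_has(self, filename):
        # One request/response pair, regardless of how many peers exist.
        return sorted(self.index.get(filename, ()))


class Peer:
    def __init__(self, peer_id, files=None):
        self.peer_id = peer_id
        self.files = dict(files or {})

    def get(self, filename):
        return self.files.get(filename)


server = DirectoryServer()
peers = {"A": Peer("A"), "B": Peer("B", {"X": b"contents of X"})}
for peer in peers.values():
    for name in peer.files:
        server.publish(peer.peer_id, name)

holders = server.who_has("X")        # A -> server: "Who has X?"
data = peers[holders[0]].get("X")    # A -> B: "Get X"; B replies with X
```

The search costs a constant number of messages, but every lookup depends on `server`, which is the single point of failure the slide mentions.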
Unstructured Fully Decentralized -- Gnutella • Searching is inherently flooding (unscalable); • Time-to-Live(TTL) is used to partially address this problem;
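A rough sketch of TTL-limited flooding, under simplifying assumptions: the overlay is a static adjacency map, every forwarded copy counts as one message, and a peer that has already seen the query drops duplicates. The function name `flood_search` and the sample topology are illustrative and not taken from the Gnutella specification.

```python
from collections import deque


def flood_search(graph, holders, start, ttl):
    """Breadth-first flood from `start`; returns (peers_with_hit, messages_sent)."""
    seen = {start}
    frontier = deque([(start, ttl)])
    hits, messages = set(), 0
    while frontier:
        node, remaining = frontier.popleft()
        if node in holders:
            hits.add(node)
        if remaining == 0:
            continue                      # TTL exhausted: do not forward further
        for neighbor in graph[node]:
            messages += 1                 # every forwarded copy costs a message
            if neighbor not in seen:      # the receiver drops duplicate queries
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages


graph = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D", "F"], "F": ["E"],
}
short = flood_search(graph, {"F"}, "A", ttl=2)   # F is 4 hops away: no hit
long = flood_search(graph, {"F"}, "A", ttl=4)    # reaches F, at a higher message cost
```

A small TTL bounds the traffic but can miss content that exists; a large TTL finds it at the price of many redundant messages.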
Techniques for improving search in Gnutella-like Network • Expanding Ring; • Random Walks; • Good Peer; • Local indices; • Routing indices;
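Two of the techniques listed, expanding ring and random walks, can be sketched as follows. This assumes a static adjacency map and a single walker; `within_ttl`, `expanding_ring`, and `random_walk` are hypothetical helper names, and published variants differ in details such as the TTL schedule and the number of walkers.

```python
import random


def within_ttl(graph, start, ttl):
    """Peers reachable from `start` in at most `ttl` hops."""
    reached, frontier = {start}, {start}
    for _ in range(ttl):
        frontier = {n for node in frontier for n in graph[node]} - reached
        reached |= frontier
    return reached


def expanding_ring(graph, holders, start, max_ttl):
    """Re-flood with TTL 1, 2, ... and stop at the first ring that has a hit."""
    for ttl in range(1, max_ttl + 1):
        hits = within_ttl(graph, start, ttl) & holders
        if hits:
            return ttl, hits
    return None, set()


def random_walk(graph, holders, start, max_steps, rng):
    """Forward one walker to a random neighbor per step; return steps used, or None."""
    node = start
    for step in range(max_steps + 1):
        if node in holders:
            return step
        node = rng.choice(graph[node])
    return None


graph = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D", "F"], "F": ["E"],
}
ring_ttl, ring_hits = expanding_ring(graph, {"F"}, "A", max_ttl=5)
walk_steps = random_walk(graph, {"F"}, "A", max_steps=200, rng=random.Random(0))
```

Expanding ring avoids flooding far when the content is close; a random walk sends one message per step, trading response time for much lower traffic.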
Worst Case for Freenet • Peer F has the requested file, but the query never finds it because of a poor routing decision made at Peer D, so the query is not matched. In this case, the query is rerouted along an alternate path
Unstructured P2P with Supernodes • Combine the benefits of centralized and decentralized search; • Take advantage of the heterogeneity of peer capabilities;
Morpheus Supernode Layer
What is Grid? “A hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” -- Ian Foster & Carl Kesselman, 1998 “Sharing environment implemented via the deployment of a persistent, standards-based service infrastructure that supports the creation of, and resource sharing within distributed communities” -- Ian Foster & Adriana Iamnitchi, 2003
The evolution of Grid Systems • First generation systems involved proprietary solutions for sharing high performance computing resources; e.g. Condor • Second generation systems introduced middleware to cope with scale and heterogeneity, with a focus on large scale computational power and large volumes of data; e.g. Globus, EU DataGrid • Third generation systems adopt a service-oriented approach and a more holistic view of the e-Science infrastructure, are metadata-enabled, and may exhibit autonomic features. • Open Grid Services Architecture (OGSA)
P2P vs. Grid --similarities • Both P2P and Grid address the same problem, share the same goal • Resource sharing within distributed resources. • Both offer promising paradigms for developing distributed systems and applications
P2P vs. Grid --differences • Resources • Grid– higher-end resources, better connected with high levels of availability • P2P– edge level devices, intermittently connected with highly variable availability
P2P vs. Grid --differences • Services • Dependent on the nature of communities • Eg 1. Resource Discovery • Grid—very well structured and stable network making this less of an issue • P2P—unstable network • Eg 2. Security • Grid—authentication, authorization, accountability • P2P—anonymity, censorship resistance
P2P vs. Grid --differences • Infrastructure • Grid – more emphasis on standardization, interoperability • P2P – little emphasis, no interoperability • Applications • Grid – large range of applications, more computation and data intensive • P2P – more social-based, less computation and data intensive
P2P vs. Grid --differences • Scalability • Grid– Most services, such as resource discovery, are mainly based on centralized or hierarchical models • P2P– Most P2P systems are decentralized
P2P vs. Grid --summary • Grid needs to address decentralization, self-organization, fault tolerance, and scalability more fully; these are strong points of P2P. • P2P should put more effort into standard infrastructure and provide more services. • The P2P model could help to ensure Grid scalability • The two technologies are likely to converge (grid + structured p2p)
Data sharing in P2P systems • Provide only file-level sharing and lack content-based search • coarse granularity of information sharing. • Lack extensibility and flexibility • no easy and rapid means to expand applications • A node’s neighbors are typically statically defined • difficult to utilize network bandwidth and optimize system performance
Relational data sharing in Unstructured P2P vs. Distributed DB
P2P & DB Systems DB P2P Taken from Hellerstein’s group ppt
P2P + DB = ? • P2P Database? No! • ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics • Much too heavyweight a solution for the everyday user • Query Processing on P2P! • Both P2P and DBs do data location and movement • Can be naturally unified (lessons in both directions) • P2P brings scalability & flexibility • DB brings relational model & query facilities Taken from Hellerstein’s group ppt
Many New Challenges • Relative to other parallel/distributed systems • Partial failure • Churn • Few guarantees on transport, storage, etc. • Huge optimization space • Network bottlenecks & other resource constraints • No administrative organizations • Trust issues: security, privacy, incentives • Relative to IP networking • Much higher function, more flexible • Much less controllable/predictable
Some Proposals on Data Sharing… • Database: • Data Mapping (SIGMOD’03) • Piazza (ICDE’03) • PeerDB (ICDE’03) • … • IR: • PlanetP (HPDC’03) • SummaryIndex (TKDE’04 special issue on P2P) • …
The Birth of BestPeer… • Started in 1998 • To steal storage and CPU cycles from staff machines • To provide a virtual and parallelised content-based document retrieval system • To be able to move processes from one PC to another quickly when users need the PC back • Extended to P2P in early 2000 • VCs showed interest in the project • W.S. Ng, B. C. Ooi and K.L. Tan: BestPeer: A self configurable peer-to-peer system. ICDE’2002.
BestPeer Network • BestPeer is a generic P2P system designed to serve as a platform on which P2P applications can be developed easily and efficiently • Integrate mobile agent with P2P technologies • Each participant runs BestPeer software • Provide communication facilities and share resources with other peers • Provide an environment in which agent can reside and perform their tasks
BestPeer Network cont… • Large # of peers, small # of LIGLOs; • Each node holds two types of data: private data and sharable data; • New node registration: • Register with LIGLO • Obtain a unique BPID from LIGLO. • LIGLO sends a list of (BPID, IP) pairs of peers that the node can communicate with directly. • Node is ready to communicate with other peers.
BestPeer Network cont… • Node Rejoins: • Send node’s current IP to LIGLO • For each peer of the node, p, send p’s BPID to its registered LIGLO • p’s registered LIGLO will reply with the IP of p if it is currently connected to the network • Node has rejoined
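The registration and rejoin steps can be sketched as below. This is an illustration under stated assumptions, not the BestPeer implementation: the class `Liglo`, its method names, the `rejoin_node` helper, and the `"<LIGLO name>:<counter>"` BPID format (used here so that a peer’s registered LIGLO can be derived from its BPID) are all hypothetical.

```python
import itertools


class Liglo:
    """Hypothetical LIGLO server: issues BPIDs and tracks each member's current IP."""

    def __init__(self, name):
        self.name = name
        self._counter = itertools.count(1)
        self.current_ip = {}   # BPID -> IP, or None while the member is offline

    def register(self, ip):
        bpid = f"{self.name}:{next(self._counter)}"   # assumed BPID format
        # Hand the newcomer the (BPID, IP) pairs of members that are online now.
        online = [(b, addr) for b, addr in self.current_ip.items() if addr]
        self.current_ip[bpid] = ip
        return bpid, online

    def leave(self, bpid):
        self.current_ip[bpid] = None

    def rejoin(self, bpid, new_ip):
        self.current_ip[bpid] = new_ip

    def lookup(self, bpid):
        return self.current_ip.get(bpid)


def rejoin_node(liglos, bpid, new_ip, peer_bpids):
    """Report the new IP, then resolve each known peer via that peer's own LIGLO."""
    liglos[bpid.split(":")[0]].rejoin(bpid, new_ip)
    resolved = {}
    for p in peer_bpids:
        ip = liglos[p.split(":")[0]].lookup(p)
        if ip is not None:            # only peers currently connected are returned
            resolved[p] = ip
    return resolved


liglos = {"L1": Liglo("L1"), "L2": Liglo("L2")}
a, _ = liglos["L1"].register("10.0.0.1")
b, peers_for_b = liglos["L1"].register("10.0.0.2")
c, _ = liglos["L2"].register("10.0.1.9")
liglos["L1"].leave(a)
liglos["L2"].leave(c)
resolved = rejoin_node(liglos, a, "10.0.0.77", [b, c])
```

Because each LIGLO only stores its own members, a rejoining node contacts one LIGLO per distinct server its peers registered with, rather than a single global directory.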
BestPeer Network cont… • Access Data from other nodes: • Propagation broadcast • Node with matching result will respond to initiating node directly • Two modes to access data: • Phase 1: a node with a matching answer either returns the result directly or only indicates that it has the information • Phase 2: the initiating node then sends a further message to some, if not all, of these nodes to obtain the desired information
Reconfigurable BestPeer Network • A node in the BestPeer network can dynamically reconfigure itself by keeping the peers that benefit it most. • Based on the assumption that peers that benefit a node most for a query are most likely to provide the greatest gain for subsequent queries. • Every node controls the maximum number of direct peers it can have
Reconfigurable BestPeer Network cont… • BestPeer applies an autonomous strategy, where each node tries to keep promising peers as close as possible with no information exchange between peers. • BestPeer provides two default reconfiguration strategies: • MaxCount • Maximizes the number of objects a node can obtain from its directly connected peers. • MinHops • Minimizes the number of hops that a node needs to travel
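One plausible reading of the two strategies is sketched below. The slide does not give the ranking rules, so both are assumptions: `max_count` keeps the k peers that returned the most objects, and `min_hops` keeps the k most distant peers that returned answers, so that later queries reach them in one hop. The `Answer` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    peer: str      # BPID of the responding peer
    objects: int   # number of matching objects it returned
    hops: int      # distance from the querying node when it answered


def max_count(answers, k):
    """Assumed MaxCount rule: keep the k peers that returned the most objects."""
    ranked = sorted(answers, key=lambda a: (-a.objects, a.peer))
    return [a.peer for a in ranked[:k]]


def min_hops(answers, k):
    """Assumed MinHops rule: pull in the farthest peers that had answers."""
    useful = [a for a in answers if a.objects > 0]
    ranked = sorted(useful, key=lambda a: (-a.hops, -a.objects, a.peer))
    return [a.peer for a in ranked[:k]]


answers = [
    Answer("p1", objects=5, hops=1),
    Answer("p2", objects=9, hops=3),
    Answer("p3", objects=0, hops=4),
    Answer("p4", objects=2, hops=2),
]
direct_by_count = max_count(answers, k=2)
direct_by_hops = min_hops(answers, k=2)
```

Both functions use only what the node observed from its own query results, which matches the slide’s point that no information is exchanged between peers; `k` stands for the node’s own limit on direct peers.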
Location-Independent Global Names Lookup Server (LIGLO) • Identifies a single node that may have different IP addresses on different occasions • A LIGLO is a node that has a fixed IP and runs the LIGLO software • LIGLO: • Generates BestPeer Global Identity (BPID) • Maintains peer’s current status • LIGLO applies a distributed approach: each LIGLO only needs to maintain its own members’ names
Features of BestPeer • Combines the power of agent technology and P2P technology in a single system • Supports a finer granularity of data sharing, and sharing of computational power • Facilitates dynamic reconfiguration of BestPeer network • Adopts a distributed approach to minimize bottlenecks of servers acting as LIGLO
Integration of Mobile Agent and P2P Technologies • P2P technologies provide resource sharing capabilities among nodes; Mobile Agent further extends the functionalities • Java-based Agent System • BestPeer Search Agent vs. Traditional Search Agent: • (Trad) Predefined itinerary vs. Auto and transparent • TTL / Hops based lifetime • Result/Cost-based lifespan
PeerDB • PeerDB is built on top of BestPeer • Four components that are integrated and implemented on the application layer. • Data management system • Facilitates storage, manipulation and retrieval of the data • MySQL as the backend for supporting SQL query facility • Local Dictionary • Metadata stored in Local Dictionary • Export Dictionary • Metadata sharable to other nodes • Cache Manager • Caching remote data in secondary storage • Caching/replacement policy • B.C. Ooi, K.L. Tan, A. Zhou, C.H. Goh, Y.G. Li, C.Y. Liau, B. Ling, W.S. Ng, Y. Shu, X.Y. Wang, M. Zhang: PeerDB: Peering into Personal Databases. SIGMOD’2003, Demo. • W.S. Ng, B. C. Ooi, K.L. Tan, A. Zhou: PeerDB: A P2P-based System for Distributed Data Sharing. ICDE’2003