Scalability • Optimizing P2P Networks: Lessons learned from social networking • Social Networks • Lessons Learned • Organizing P2P Networks • Node Topologies • Centralized, Ring, Hierarchical & Decentralized • Hybrid: • Centralized-Ring • Centralized-Centralized • Centralized-Decentralized • Reflector Nodes • Gnutella Case Studies • 3 case studies • DHTs • what are they? • example
Social Networks • Stanley Milgram (Harvard professor): 1967 social networking experiment • How many ‘social hops’ would it take for a message to traverse the US population (200 million)? • Posted 160 letters to randomly recruited people in Omaha, Nebraska • Asked them to try to pass these letters to a stockbroker working in Boston, Massachusetts • Rules: • use intermediaries whom they know on a first-name basis • choose them intelligently • make a note at each hop • 42 letters made it in one version of the experiment • Average of 5.5 hops • Demonstrated the ‘small world effect’ • Suggests that the social network of the United States is indeed connected, with a path length (number of hops) of around 6: the ‘six degrees of separation’ • Does this mean that it takes 6 hops to traverse 200 million people?
Lessons Learned from Milgram’s Experiment • Social circles are highly clustered • A few members have wide-ranging connections • these form a bridge between far-flung social clusters • this bridging plays a critical role in bringing the network closer together • For example • A quarter of all letters passed through a local storekeeper • Half were mediated by just 3 people • Lessons Learned • These people acted as gateways or hubs between the source and the wider world • A small number of bridges dramatically reduces the number of hops
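The effect of a few bridges can be illustrated with a small simulation. This is a minimal sketch, not Milgram's data: it assumes a ring of 200 nodes that each know only their nearest neighbours (a highly clustered network), then adds 10 random long-range links to stand in for well-connected people. The sizes, the seed and all function names are illustrative assumptions.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Each node links to its k nearest neighbours on each side (a clustered network)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def add_bridges(adj, count, rng):
    """Add `count` random long-range links, standing in for well-connected people."""
    nodes = list(adj)
    added = 0
    while added < count:
        a, b = rng.sample(nodes, 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1

def average_hops(adj):
    """Mean shortest-path length over all pairs, using a BFS from every node."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(1)               # fixed seed so the run is repeatable
network = ring_lattice(200, 2)
before = average_hops(network)
add_bridges(network, 10, rng)
after = average_hops(network)
print(f"average hops without bridges: {before:.1f}, with 10 bridges: {after:.1f}")
```

Adding links can only shorten paths, so the second figure is always lower; the point of the sketch is how large the drop is for so few extra links.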
From Social Networks to Computer Networks… • There are a number of similarities to social networks • People = peers • Intermediaries = Hubs, Gateways or Rendezvous Nodes (JXTA speak...) • Number of intermediaries passed through = number of hops • Are P2P Networks Special then? • P2P networks are more like social networks than other types of computer network because they are often: • Self-organizing • Ad-hoc • Clustered, based on prior interactions (like we form relationships) • Decentralized in discovery and communication (like we form neighbourhoods, villages, cities etc.) • What about social networking sites? • huge: “If Facebook were a country, it would be the eighth most populated in the world, just ahead of Japan, Russia and Nigeria.” • But the application overlay network does not reflect the social network • They use centralized data centers.
Peer to Peer: What’s the problem? • Problem: how do we organize peers within ad-hoc, multi-hop pervasive P2P networks? • network of self-organizing peers organized in a decentralized fashion • such networks can rapidly expand from a few hundred peers to several thousand or even millions • P2P Environment Recap: • Unreliable Environments • Peers connecting/disconnecting – network failures to participation • Random failures, e.g. power outages, cable or DSL failure, hackers • Personal machines are much more vulnerable than servers • algorithms have to cope with this continuous restructuring of the network core • P2P systems need to treat failures as normal occurrences, not freak exceptions • they must be designed in a way that promotes redundancy, at the cost of some performance
So, how do we Organize Networks in Order to Get Optimum Performance? • For P2P • This does not mean abstract numerical benchmarks, e.g. how many milliseconds will it take to compute this many millions of FFTs? • Rather, it means asking questions like: • How long will it take to retrieve this particular file? • How much bandwidth will this query consume? • How many hops will it take for my package to get to a peer on the far side of the network? • If I add/remove a peer, will the network still be fault tolerant? • Does the network scale as we add more peers?
Performance Issues in P2P Networks • 3 main factors make P2P networks more sensitive to performance issues: • 1. Communication • Fundamental necessity • Users connected via different connection speeds • Multi-hop • 2. Searching • No central control, so more effort is needed • Each hop adds to total bandwidth • 3. Equal Peers • Free riders: imbalance in the harmony of the network • Degrades performance for others • Need to get this right and adjust accordingly
Peer Topologies • Core • Centralized • Ring • Hierarchical • Decentralized • Hybrid • Centralized-Ring • Centralized-Centralized • Centralized-Decentralized
Centralized • Client/server • Web servers • Databases • Napster search • Instant Messaging
Ring • Failover clusters • Simple load balancing • Assumption • Single owner • co-ordination
Hierarchical • Tree structure • DNS • www.example.com
Decentralized • Gnutella • Freenet • Internet routing
Centralized + Ring • Robust web applications • High availability of servers
Centralized + Centralized • N-tier apps • Database heavy systems • Web services gateways • Google.com uses this topology to deliver their search engine
Centralized + Decentralized • New Wave of P2P • Clip2 Gnutella Reflector (next) • FastTrack • KaZaA • Morpheus • Email • Like social networks, perhaps?
Reflector Nodes • [Figure: a Reflector's index mapping file names (F1.mp3, F2.mp3, F3.mp3) to the IDs of connected peers] • Known as ‘super peers’; in JXTA these are Rendezvous peers • cache the file lists of connected users and maintain an index • When a query is issued, the Reflector does not retransmit it; it answers the query from its own memory • Do they remind you of anything?
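The indexing behaviour described on this slide can be sketched as follows. This is an illustrative assumption of how such an index might look, not the Clip2 Reflector's actual implementation: the class and method names are hypothetical, and the peer IDs and file names are sample values.

```python
from collections import defaultdict

class Reflector:
    """Hypothetical super peer that indexes the files of its connected leaf peers."""

    def __init__(self):
        self.index = defaultdict(set)   # file name -> IDs of peers sharing it

    def register(self, peer_id, file_names):
        """Cache the file list a leaf peer announces when it connects."""
        for name in file_names:
            self.index[name].add(peer_id)

    def unregister(self, peer_id):
        """Drop a disconnected peer from every index entry."""
        for name in list(self.index):
            self.index[name].discard(peer_id)
            if not self.index[name]:
                del self.index[name]

    def query(self, name):
        """Answer from the local index; nothing is forwarded to the leaf peers."""
        return sorted(self.index.get(name, ()))

reflector = Reflector()
reflector.register(0, ["F1.mp3"])
reflector.register(1, ["F1.mp3", "F2.mp3"])
reflector.register(2, ["F3.mp3"])
print(reflector.query("F1.mp3"))   # peers 0 and 1 share this file
reflector.unregister(0)
print(reflector.query("F1.mp3"))   # only peer 1 remains
```

Because the leaf peers never see the query, slow connections at the edge are shielded from search traffic.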
Napster = Gnutella? • [Figure: a Napster network (users around duplicated Napster.com servers) beside a Gnutella network (users around super peers)] • Super peers: 1. Natural?? 2. Reflector (clip2.com)
The Gnutella Network • The figure below is a view of the topology of a Gnutella network as shown on the web site of LimeWire, a popular Gnutella file-sharing client. Notice how it demonstrates the power-law, or centralized-decentralized, structure.
Gnutella Studies 1: Free Riding • E. Adar and B.A. Huberman (2000), “Free Riding on Gnutella,” First Monday 5(10), http://firstmonday.org/issues/issue5_10/adar/index.html • Two types of free riding: • users who download files but never provide any files for others to download • users whose shared content is undesirable • They found that 22,084 of the 33,335 peers in the network (66%) share no files • 24,347 (73%) share ten or fewer files • the top 1 percent (333 hosts) account for 37 percent of the total files shared • the top 20 percent (6,667 hosts) share 98% of the files • This shows that, even without Gnutella Reflector nodes, the Gnutella network naturally converges into a centralized + decentralized topology, with the top 20% of nodes acting as super peers or reflectors
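The percentages on this slide follow directly from the peer counts it quotes. A short check, using only the numbers given above (the variable names are illustrative):

```python
total_peers = 33_335
share_nothing = 22_084
share_ten_or_fewer = 24_347

print(f"share no files:     {share_nothing / total_peers:.0%}")
print(f"share ten or fewer: {share_ten_or_fewer / total_peers:.0%}")
print(f"top 1% of hosts:    {round(total_peers * 0.01)}")
print(f"top 20% of hosts:   {round(total_peers * 0.20)}")
```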
Gnutella Studies 2: Equal Peers • Study on Reflector Nodes [clip], www.clip2.com • Studied Gnutella for one month • Noted an apparent scalability barrier when query rates went above 20 per second. Why? • In a network of roughly 1000 nodes, a servent must handle up to 20 queries per second. • a 56K dial-up link cannot keep up with this amount of traffic • one node connected in the wrong place can grind the whole network to a halt because it becomes a dead end • The network fragments. • This is why P2P networks place slower nodes at the edges
Gnutella Studies 3: Communication • “Peer-to-Peer Architecture Case Study: Gnutella Network,” Matei Ripeanu, on-line at: http://people.cs.uchicago.edu/~matei/PAPERS/P2P2001.pdf • Studied the topology of Gnutella over several months and reported two findings: • 1. The Gnutella network shares the benefits and drawbacks of a power-law structure • networks that organize themselves so that most nodes have a few links and a small number of nodes have many • found to show an unexpected degree of robustness when facing random node failures • vulnerable to attacks: removing a few of the super nodes can have a massive effect on the function of the network as a whole • 2. The Gnutella network topology does not match the underlying Internet topology well, leading to inefficient use of network bandwidth • He gave 2 suggestions: • use an agent to monitor the network and intervene by asking servents to drop/add links to keep the topology optimal • replace the Gnutella flooding mechanism with a smarter routing and group communication mechanism
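To see why flooding is a target for replacement, it helps to count the messages a single query can generate. The sketch below is a worst-case estimate under stated assumptions: every peer has the same number of neighbours, the overlay has no loops, and each peer forwards the query to all neighbours except the one it came from until the time-to-live (TTL) runs out. The degree of 4 and the TTL values are illustrative, not measurements from the study.

```python
def flood_messages(degree, ttl):
    """Upper bound on messages for one flooded query in a loop-free overlay."""
    total = 0
    frontier = degree              # messages sent by the originator at hop 1
    for _ in range(ttl):
        total += frontier
        frontier *= degree - 1     # each receiver forwards to its other neighbours
    return total

for ttl in (1, 3, 5, 7):
    print(f"TTL {ttl}: up to {flood_messages(4, ttl)} messages")
```

With 4 neighbours per peer the count grows from 4 messages at TTL 1 to 4372 at TTL 7, which is why each extra hop is so costly in bandwidth.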
Gnutella Studies • Gnutella shows properties associated with power-law distribution • (e.g., a node with twice the connections is four times less frequent) • Power-law distributions happen all over the place in nature and society: • word frequency distribution • Sizes of meteorites and sand particles • Sizes of cities • the Pareto principle (80 – 20 rule) - 20% of the population own 80% of the wealth • Zipf distribution (and Zipf-Mandelbrot) – Mandelbrot coined the term fractal • And has re-emerged recently as The Long Tail on the Web.
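The "twice the connections, four times less frequent" example corresponds to a power law with exponent 2. As a worked step, assuming the fraction of nodes with k connections follows P(k) proportional to k to the power of minus gamma (the symbols P, k and gamma are notation introduced here, not taken from the slides):

```latex
P(k) \propto k^{-\gamma}
\quad\Longrightarrow\quad
\frac{P(2k)}{P(k)} = \frac{(2k)^{-\gamma}}{k^{-\gamma}} = 2^{-\gamma},
\qquad
\gamma = 2 \;\Rightarrow\; \frac{P(2k)}{P(k)} = \frac{1}{4}.
```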
Scalability Through Structure • Gnutella, Kazaa can be classified as ‘unstructured’ networks • interconnection of nodes is ad-hoc, highly dynamic, defined independently by each node according to individual requirements. • settles into a topology with qualities associated with power-law distribution. • A class of P2P systems that are known as ‘structured’ evolved just after the millennium. • Chord • CAN • Pastry • Tapestry • Generally a form of Distributed Hash Table (DHT)
What are DHTs? • A DHT is a topology that provides similar functionality to a typical hash table. • put(key, value) • get(key) • Peers are buckets in the table • with their own local hash tables • Allows a peer to publish a resource onto a network using a key to determine where the data will be stored (i.e. which peer will receive the data). • Using keys presupposes a logical ‘space’ which the keys map onto. • The key is mapped to the space using a hashing function to ensure equal distribution of resources across the network. • Nodes are responsible for sections of this space.
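The put/get interface and the idea of peers as buckets can be sketched in a few lines. This is a toy model under loud assumptions: the key space size, the use of SHA-1, the equal-sized sections and every class name are illustrative, and the sketch ignores joins, failures and network messages entirely.

```python
import hashlib

KEY_SPACE = 2 ** 16   # assumed size of the logical key space

def hash_key(key):
    """Map a string key onto the key space with a hash function."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % KEY_SPACE

class Peer:
    """A bucket in the table, with its own local hash table."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class ToyDHT:
    """Splits the key space into equal sections, one per peer."""
    def __init__(self, peer_names):
        self.peers = [Peer(name) for name in peer_names]
        self.section = KEY_SPACE // len(self.peers)

    def responsible_peer(self, key):
        """The peer whose section of the key space contains hash(key)."""
        index = min(hash_key(key) // self.section, len(self.peers) - 1)
        return self.peers[index]

    def put(self, key, value):
        self.responsible_peer(key).store[key] = value

    def get(self, key):
        return self.responsible_peer(key).store.get(key)

dht = ToyDHT(["peer-A", "peer-B", "peer-C", "peer-D"])
dht.put("F1.mp3", "metadata for F1")
print(dht.responsible_peer("F1.mp3").name, dht.get("F1.mp3"))
```

Any peer that knows the hash function and the section boundaries can work out where a key lives without asking a central index.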
Why DHTs? • Address the flooding issue without resorting to a centralized/decentralized architecture. • Typically search can be achieved in O(log n) hops, where n is the number of nodes in the network. • only a few neighbors need to be known, typically O(log n) • small neighborhoods and a flat topology make for a robust network in which churn is easy to handle.
Example: Chord Topology • Divides the key space into a circle • keys are m-bit identifiers • ring can contain up to 2^m nodes • keys can range from 0 to 2^m – 1 • Consistent hashing (built on a hash function such as SHA-1 or MD5) is used to distribute keys evenly around the ring. • increases probability of robustness • allows nodes to join and leave without disrupting the network • only an O(1/n) fraction of keys are moved to a different location, where n is the number of nodes • Node IDs are distributed based on the key size and the number of nodes in the network. • With K keys in total, each node should be responsible for roughly K/n keys
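The claim that a join moves only a small fraction of keys can be checked with a small experiment. This is a minimal sketch of consistent hashing on a ring, assuming 16-bit identifiers derived from SHA-1 and the rule that a key belongs to the first node at or after it clockwise; the node and file names are made up for illustration.

```python
import hashlib
from bisect import bisect_left

M = 16  # assumed identifier length in bits

def chord_id(name):
    """Hash a name onto the identifier circle 0 .. 2**M - 1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def successor(node_ids, key_id):
    """First node at or after key_id going clockwise; node_ids must be sorted."""
    i = bisect_left(node_ids, key_id)
    return node_ids[i % len(node_ids)]

def assignment(node_ids, key_ids):
    """Map every key ID to the node responsible for it."""
    return {k: successor(node_ids, k) for k in key_ids}

nodes = sorted({chord_id(f"node-{i}") for i in range(20)})
keys = [chord_id(f"file-{i}.mp3") for i in range(2000)]
before = assignment(nodes, keys)

# A new node joins: only keys in the arc it takes over change owner.
newcomer = chord_id("node-new")
after = assignment(sorted(set(nodes) | {newcomer}), keys)
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved; "
      f"all moved to the new node: {all(after[k] == newcomer for k in moved)}")
```

Every key that changes owner goes to the newcomer, and all other assignments are untouched, which is what lets nodes join and leave without disrupting the rest of the ring.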
Chord Finger Tables • Knowing only your predecessor and successor leads to very bad performance • O(n) hops to find a key (about n/2 expected) • Chord nodes have a routing (finger) table containing approx. O(log n) nodes • The distance to the nodes in the table increases exponentially • having this many nodes in the finger table means O(log n) hops are needed to find the key • For each query for key k there is a choice of O(log n) nodes. Choose the one whose ID most closely precedes k
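Finger-table routing can be sketched on a small ring. This is a simplified, single-process illustration under stated assumptions: 6-bit identifiers, a fixed set of node IDs chosen for the example, and global knowledge used to build the tables (a real Chord node builds and repairs its table through network messages). The function names are hypothetical.

```python
M = 6              # assumed identifier bits, so the ring has 2**M positions
RING = 2 ** M

def in_arc(x, a, b, inclusive_right=False):
    """True if x lies on the clockwise arc from a to b, excluding a."""
    if inclusive_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b          # the arc wraps past zero

def successor(node_ids, ident):
    """First node at or after ident going clockwise; node_ids must be sorted."""
    for n in node_ids:
        if n >= ident:
            return n
    return node_ids[0]

def finger_table(node_ids, n):
    """Entry i points at the node responsible for n + 2**i, so distances double."""
    return [successor(node_ids, (n + 2 ** i) % RING) for i in range(M)]

def lookup(fingers, start, key):
    """Return (responsible node, hops) using finger-table routing."""
    current, hops = start, 0
    while True:
        succ = fingers[current][0]
        if in_arc(key, current, succ, inclusive_right=True):
            return succ, hops + 1
        # Forward to the finger that most closely precedes the key.
        # Entry 0 (the successor) always qualifies here, so the loop finds one.
        for f in reversed(fingers[current]):
            if in_arc(f, current, key):
                current, hops = f, hops + 1
                break

nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
fingers = {n: finger_table(nodes, n) for n in nodes}
print(fingers[8])                # [14, 14, 14, 21, 32, 42]
print(lookup(fingers, 8, 54))    # (56, 3): route 8 -> 42 -> 51 -> 56
```

Each hop at least halves the remaining distance to the key, which is where the O(log n) bound on hops comes from.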
DHT Issues • DHTs are structured. Maintaining the structure has overhead. • presupposes equal capabilities in nodes • NOT a power-law distribution • not always possible to have fuzzy or attribute-based queries. It’s a lookup facility – you need to know the key • Searching on a Gnutella network is open ended. • you may get results, you may not. • DHT algorithms are deterministic and designed for lookup • So data going missing is more problematic • replication needs to be employed to ensure data availability
Closing Remarks • Summary • Centralized + Decentralized – understand from the original Gnutella to the new models • The role of Reflector nodes • Structured topologies (DHTs) – efficient lookup without Centralization