P2P Architecture Case Study: Gnutella Network Matei Rîpeanu The University of Chicago
Why analyze Gnutella network? • Unprecedented scale • up to 100k nodes, 100TB data, 10M files today • Self-organizing network • Staggering growth • more than 50 times during first half of 2001 • Open architecture, simple and flexible protocol • Interesting mix of social and technical issues
Overview • Gnutella protocol • Tools for exploring the network • Network growth • Structural graph analysis • Is Gnutella a power-law network? • Generated (overhead) network traffic • Traffic estimates • Overlay network topology mapping
Gnutella protocol overview • P2P file sharing application on top of an overlay network • Nodes maintain open TCP connections • Messages are broadcast (flooded) or back-propagated • Protocol:
Gnutella search mechanism • Steps: • Node 2 initiates search for file A • Sends message to all neighbors • Neighbors forward message • Nodes that have file A initiate a reply message • Query reply message is back-propagated • File download [Figure: overlay of seven nodes (1-7); the query for A spreads from node 2, replies labeled A:5 and A:7 travel back, then A is downloaded]
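The steps above can be sketched as a small simulation. This is a minimal illustration, not the Gnutella wire protocol: the link layout of the seven nodes is hypothetical (the slide figure's edges are not recoverable), nodes 5 and 7 are assumed to hold file A based on the figure labels, and `flood_query` is a name invented for this sketch.

```python
from collections import deque

def flood_query(graph, holders, source, ttl=7):
    """Flood a query from `source`; return (hits, reply_paths).

    graph:   dict node -> list of neighbor nodes (overlay links)
    holders: set of nodes that have the requested file
    ttl:     maximum number of hops the query may travel
    """
    # came_from[n] is the neighbor from which n first received the query;
    # replies are back-propagated along these reverse links.
    came_from = {source: None}
    queue = deque([(source, ttl)])
    hits = []
    while queue:
        node, remaining = queue.popleft()
        if node in holders and node != source:
            hits.append(node)
        if remaining == 0:
            continue  # TTL exhausted: do not forward any further
        for neighbor in graph[node]:
            if neighbor not in came_from:  # duplicate queries are dropped
                came_from[neighbor] = node
                queue.append((neighbor, remaining - 1))
    # Back-propagate each reply hop by hop toward the query source.
    reply_paths = {}
    for hit in hits:
        path = [hit]
        while came_from[path[-1]] is not None:
            path.append(came_from[path[-1]])
        reply_paths[hit] = path
    return hits, reply_paths

# Hypothetical seven-node overlay (symmetric links).
graph = {
    1: [2, 4], 2: [1, 3, 6], 3: [2, 5], 4: [1, 7],
    5: [3, 6], 6: [2, 5, 7], 7: [4, 6],
}
hits, reply_paths = flood_query(graph, holders={5, 7}, source=2)
print(hits, reply_paths)
```

After the replies arrive, node 2 would download the file directly from one of the responding nodes; that transfer happens outside the overlay and is not modeled here.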
Tools for network exploration • Eavesdropper: inserts modified nodes into the network to eavesdrop on traffic. • Crawler: connects to all active nodes and uses the membership protocol to discover graph topology. • Client-server approach. • Graph analysis tools • High-volume offline computations.
Network growth • High user interest • Users tolerate high latency, low quality results • Better resources • DSL and cable modem nodes grew from 24% to 41% over first 6 months. Today >50%. • Open architecture / open-source environment • Competing implementations • Lower overhead network traffic, improved resource utilization, better structure
Growth invariants (1): avg. node connectivity • 3.4 links per node on average
Growth invariants (2): network diameter • Node-to-node distance maintains similar distribution • Average node-to-node distance grew 25% while the network grew 50 times over 6 months
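One way to compute the node-to-node distance figures on a crawled topology is a breadth-first search from every node. The sketch below assumes an undirected, connected graph stored as an adjacency dict; the function names and the four-node example are illustrative and do not come from the study's tools.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def path_length_stats(graph):
    """Return (average node-to-node distance, diameter) over reachable pairs."""
    total = pairs = diameter = 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter

# Toy example: a chain of four nodes, 1 - 2 - 3 - 4.
path_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
avg_distance, diameter = path_length_stats(path_graph)
print(avg_distance, diameter)
```

This all-pairs approach costs one BFS per node, so for networks of tens of thousands of nodes it is the kind of computation that would run offline.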
Is Gnutella a power-law network? Power-law networks: the number of links per node follows a power-law distribution Examples: • the Internet, • in/out links to/from HTML pages, • citation network, • US power grid, • social networks. [Figure: Gnutella snapshot, November 2000] Implications: High tolerance to random node failure but low reliability when facing an ‘intelligent’ adversary
Is Gnutella a power-law network? • Later, larger networks display a bimodal distribution • Implications: • High tolerance to random node failures preserved • Increased reliability when facing an attack. [Figure: Gnutella snapshot, May 2001]
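A quick check of whether the number of links per node looks like a power law is to plot the degree histogram on log-log axes and fit a straight line. The sketch below does the fit with ordinary least squares; this is a crude estimator chosen for brevity and is not necessarily the method used in the study. The function names and the synthetic histogram are assumptions for illustration.

```python
import math
from collections import Counter

def degree_histogram(graph):
    """Map degree -> number of nodes with that degree."""
    return Counter(len(neighbors) for neighbors in graph.values())

def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree).

    For a power law count ~ degree^(-k) the slope is close to -k.
    Needs at least two distinct positive degrees.
    """
    points = [(math.log(d), math.log(c))
              for d, c in histogram.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Synthetic histogram that follows count = 1024 * degree^(-2) exactly.
synthetic = {1: 1024, 2: 256, 4: 64, 8: 16}
slope = loglog_slope(synthetic)
print(slope)
```

A bimodal distribution like the one reported for the later, larger networks would show up as a poor straight-line fit, so the residuals matter as much as the slope.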
Overview • Gnutella protocol • Network growth • Structural graph analysis • Generated network traffic: • Traffic estimates • Does the Gnutella overlay network topology match the underlying resources?
Traffic analysis • 6-8 kbps per link over all connections • Traffic structure changed over time
Total generated traffic: 1 Gbps (or 330 TB/month)! • Compare to 15,000 TB/month in US Internet backbone (Dec. 2000) • Note that this estimate excludes actual file transfers • Q: Does it matter? Reasoning: • QUERY and PING messages are flooded. They form more than 90% of generated traffic • Predominant TTL=7 • >95% of nodes are less than 7 hops away • Measured traffic at each link about 6 kbps • Network with 50k nodes and 170k links
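The headline figure follows from the per-link measurement and the link count on this slide. The arithmetic below is a sketch under stated assumptions: a 30-day month, decimal units (1 kbps = 1,000 bit/s, 1 TB = 10^12 bytes), and every link carrying the measured 6 kbps.

```python
LINKS = 170_000            # overlay links (from the slide)
NODES = 50_000             # nodes (from the slide)
KBPS_PER_LINK = 6          # measured traffic per link (from the slide)
BACKBONE_TB_PER_MONTH = 15_000  # US backbone, Dec. 2000 (from the slide)

# Assumption: 30-day month, decimal units.
SECONDS_PER_MONTH = 30 * 24 * 3600

total_bps = LINKS * KBPS_PER_LINK * 1000
total_gbps = total_bps / 1e9
tb_per_month = total_bps / 8 * SECONDS_PER_MONTH / 1e12
backbone_share = tb_per_month / BACKBONE_TB_PER_MONTH
links_per_node = LINKS / NODES

print(f"{total_gbps:.2f} Gbps, {tb_per_month:.0f} TB/month, "
      f"{backbone_share:.1%} of backbone, {links_per_node:.1f} links/node")
```

Under these assumptions the overhead traffic alone comes to roughly 2% of the quoted backbone volume, and the 170k links over 50k nodes agree with the 3.4 links per node reported as a growth invariant.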
Topology mismatch The overlay network topology doesn’t match the underlying Internet infrastructure topology! • 40% of all nodes are in the 10 largest Autonomous Systems (AS) • Only 2-4% of all TCP connections link nodes within the same AS • Largely ‘random wiring’ • Entropy experiment gives similar results
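Both AS statistics on this slide can be computed from a crawled link list plus a node-to-AS mapping. The sketch below assumes such a mapping is already available (for example from an IP-to-AS lookup done separately); the node names, AS labels, and function names are hypothetical.

```python
from collections import Counter

def intra_as_fraction(links, node_as):
    """Fraction of overlay links whose two endpoints share an AS."""
    same = sum(1 for a, b in links if node_as[a] == node_as[b])
    return same / len(links)

def top_as_share(node_as, k):
    """Fraction of all nodes that sit in the k largest ASes."""
    counts = Counter(node_as.values())
    return sum(count for _, count in counts.most_common(k)) / len(node_as)

# Hypothetical toy data: four nodes spread over three ASes.
node_as = {"n1": "AS1", "n2": "AS1", "n3": "AS2", "n4": "AS3"}
links = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]

print(intra_as_fraction(links, node_as))  # share of links inside one AS
print(top_as_share(node_as, 1))           # share of nodes in the largest AS
```

If connections were wired with awareness of the underlying network, the intra-AS fraction would be expected to sit well above the 2-4% measured, given how concentrated the nodes are in a few ASes.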
Conclusions • Gnutella: self-organizing, large-scale, P2P application based on overlay network. It works! • Growth hindered by the volume of generated traffic and inefficient resource use. • Discovered growth invariants specific to large-scale systems that: • Help predict resource usage • Give hints for better search and resource organization techniques.
Thank you! Questions?
What’s next? • Organize the overlay network to match the underlying infrastructure topology. • Investigate methods for reducing traffic (query routing/filtering, better information organization). • Is Gnutella network a small-world network? What are the implications?
Statistical laws of large-scale systems • Zipf’s law: the size of the r-th largest occurrence of the event is inversely proportional to its rank: $y \sim r^{-b}$, with $b$ close to unity. • Power-law distributions: the probability distribution of event X is $P[X=x] \sim x^{-k}$ • Pareto distribution: the cumulative probability distribution is $P[X>x] \sim x^{-(k-1)}$ Zipf, Pareto and power-law distributions are basically different ways to express the same phenomenon
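The claim that the three laws describe the same phenomenon can be made concrete with a short derivation. It is a sketch that treats X as a continuous variable with k > 1 and N observations in total; these simplifications are assumptions of the derivation, not statements from the slides.

```latex
\begin{align*}
P[X = x] &\sim x^{-k}, \qquad k > 1 \\
P[X > x] &= \int_x^{\infty} P[X = t]\,dt
          \;\sim\; \int_x^{\infty} t^{-k}\,dt
          = \frac{x^{-(k-1)}}{k-1}
          \;\sim\; x^{-(k-1)} \\
r &\approx N \cdot P[X > y] \;\sim\; y^{-(k-1)}
   \quad\Longrightarrow\quad
   y \sim r^{-1/(k-1)}, \qquad b = \frac{1}{k-1}
\end{align*}
```

The last line uses the fact that the r-th largest of N observations has about r observations at or above it. A Zipf exponent b close to unity therefore corresponds to a power-law exponent k close to 2.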
Overview • Gnutella protocol • Network growth • Statistical properties of large-scale systems • Power-law distributions. • Power-law networks. • Generated (overhead) network traffic.
Power-law distributions Probability distribution of event X is $P[X=x] \sim x^{-k}$ Present all over WWW and Internet space: the number of HTML pages within a site, visits to a site, links to a page, cache document popularity, etc.
Power-law distributions in Gnutella • Number of shared files per node • Query popularity follows a power-law distribution [Kas01] • Implications: • Caching is an effective solution to reduce traffic and query latency • New search and node organizing mechanisms!
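Why skewed query popularity makes caching attractive can be illustrated with a small simulation. This is a sketch under assumptions: popularity is modeled as weight 1/rank over 1,000 distinct queries, the cache is a plain LRU with 50 entries, and none of these numbers come from Gnutella measurements.

```python
import random
from collections import OrderedDict

def lru_hit_rate(requests, capacity):
    """Fraction of requests answered from an LRU cache of `capacity` entries."""
    cache = OrderedDict()
    hits = 0
    for query in requests:
        if query in cache:
            hits += 1
            cache.move_to_end(query)       # mark as most recently used
        else:
            cache[query] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)

rng = random.Random(0)                     # fixed seed for repeatability
queries = list(range(1, 1001))             # query ids double as popularity ranks
zipf_weights = [1.0 / rank for rank in queries]

zipf_requests = rng.choices(queries, weights=zipf_weights, k=20_000)
uniform_requests = rng.choices(queries, k=20_000)

zipf_rate = lru_hit_rate(zipf_requests, capacity=50)
uniform_rate = lru_hit_rate(uniform_requests, capacity=50)
print(f"power-law popularity: {zipf_rate:.2f}, uniform: {uniform_rate:.2f}")
```

With the same cache size, the skewed workload should hit the cache far more often than the uniform one, because a small set of popular queries accounts for a large share of requests.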