Search and Replication in Unstructured Peer-to-Peer Networks

Search and Replication in Unstructured Peer-to-Peer Networks Pei Cao Cisco Systems, Inc. (Joint work with Christine Lv, Edith Cohen, Kai Li and Scott Shenker)

Disclaimer • Results, statements, opinions in this talk do not represent Cisco in anyway • This talk is about technical problems in networking, and does not discuss moral, legal and other issues related to P2P networks and their applications

Outline • Brief survey of P2P architectures • Evaluation methodologies • Search methods • Replication strategies and analysis • Simulation results

Characteristics of Peer-to-Peer Networks • Unregulated overlay network • Current application: file swapping • Dynamic: nodes join or leave frequently • Example systems: • Napster, Gnutella; • Freenet, FreeHaven, MajoNation, Alpine, ... • JXTA, Ohaha, … • Chord, CAN, “Past”, “Tapestry”, Oceanstore

Architecture Comparisons • Napster: centralized • A central website to hold file directory of all participants; Very efficient • Scales • Problem: Single point of failure • Gnutella: decentralized • No central directory; use “flooding w/ TTL” • Very resilient against failure • Problem: Doesn’t scale

Architecture Comparisons • Various research projects such as CAN: decentralized, but “structured” • CAN: distributed hash table • “Structure”: all nodes participate in a precise scheme to maintain certain invariants • Extra work when nodes join and leave • Scales very well, but can be fragile

Architecture Comparisons • FreeNet: decentralized, but semi-structured • Intended for file storage • Files are stored along a route biased by hints • Queries for files follow a route biased by the same hints • Scales very well • Problem: would it really work? • Simulation says yes in most cases, but no proof so far

Our Focus: Gnutella-Style Systems • Advantages of Gnutella: • Support more flexible queries • Typically, precise “name” search is a small portion of all queries • Simplicity, high resilience against node failures • Problems of Gnutella: Scalability • Bottleneck: interrupt rates on individual nodes • Self-limiting network: nodes have to exit to get real work done!

Evaluation Methodologies Simulation based: • Network topology • Distribution of object popularity • Distribution of replication density of objects

Evaluation Methods • Network topologies: • Uniform Random Graph (Random) • Average and median node degree is 4 • Power-Law Random Graph (PLRG) • max node degree: 1746, median: 1, average: 4.46 • Gnutella network snapshot (Gnutella) • Oct 2000 snapshot • max degree: 136, median: 2, average: 5.5 • Two-dimensional grid (Grid)

Modeling Methods • Object popularity distribution pi • Uniform • Zipf-like • Object replication density distribution ri • Uniform • Proportional: ri pi • Square-Root: ri  pi

Evaluation Metrics • Overhead: average # of messages per node per query • Probability of search success: Pr(success) • Delay: # of hops till success

Load on Individual Nodes • Why is a node interrupted: • To process a query • To route the query to other nodes • To process duplicated queries sent to it

Duplication in Flooding-Based Searches • Duplication increases as TTL increases in flooding • Worst case: a node A is interrrupted by N * q * degree(A) messages 1 3 2 4 6 5 7 8 . . . . . . . . . . . .

Duplications in Various Network Topologies

Relationship between TTL and Search Successes

Problems with Simple TTL-Based Flooding • Hard to choose TTL: • For objects that are widely present in the network, small TTLs suffice • For objects that are rare in the network, large TTLs are necessary • Number of query messages grow exponentially as TTL grows

Idea #1: Adaptively Adjust TTL • “Expanding Ring” • Multiple floods: start with TTL=1; increment TTL by 2 each time until search succeeds • Success varies by network topology • For “Random”, 30- to 70- fold reduction in message traffic • For Power-law and Gnutella graphs, only 3- to 9- fold reduction

Limitations of Expanding Ring

Idea #2: Random Walk • Simple random walk • takes too long to find anything! • Multiple-walker random walk • N agents after each walking T steps visits as many nodes as 1 agent walking N*T steps • When to terminate the search: check back with the query originator once every C steps

Search Traffic Comparison

Search Delay Comparison

Lessons Learnt about Search Methods • Adaptive termination • Minimize message duplication • Small expansion in each step

Flexible Replication • In unstructured systems, search success is essentially about coverage: visiting enough nodes to probabilistically find the object => replication density matters • Limited node storage => what’s the optimal replication density distribution? • In Gnutella, only nodes who query an object store it => ri pi • What if we have different replication strategies?

Optimal ri Distribution • Goal: minimize ( pi/ ri ), where  ri =R • Calculation: • introduce Lagrange multiplier , find ri and  that minimize: ( pi/ ri ) +  * ( ri - R) =>  - pi/ ri2 = 0 for all i => ri  pi

Square-Root Distribution • General principle: to minimize ( pi/ ri ) under constraint  ri =R, make ri propotional to square root of pi • Other application examples: • Bandwidth allocation to minimize expected download times • Server load balancing to minimize expected request latency

Achieving Square-Root Distribution • Suggestions from some heuristics • Store an object at a number of nodes that is proportional to the number of node visited in order to find the object • Each node uses random replacement • Two implementations: • Path replication: store the object along the path of a successful “walk” • Random replication: store the object randomly among nodes visited by the agents

Evaluation of Replication Methods • Metrics • Overall message traffic • Search delay • Dynamic simulation • Assume Zipf-like object query probability • 5 query/sec Poisson arrival • Results are during 5000sec-9000sec

Distribution of ri

Total Search Message Comparison • Observation: path replication is slightly inferior to random replication

Search Delay Comparison

Summary • Multi-walker random walk scales much better than flooding • It won’t scale as perfectly as structured network, but current unstructured network can be improved significantly • Square-root replication distribution is desirable and can be achieved via path replication

Search and Replication in Unstructured Peer-to-Peer Networks

Search and Replication in Unstructured Peer-to-Peer Networks

Presentation Transcript

Peer-To-Peer Networks

Replication Strategies in Unstructured Peer-to-Peer Networks

Peer-to-Peer Networks

Self Regulated Search in Unstructured Peer-to-Peer Networks

Efficient Search in Peer to Peer Networks

Peer-to-Peer Networks

Peer-to-peer networks

Replication Strategies in Unstructured Peer-to-Peer Networks

Replication Strategies in Unstructured Peer-to-Peer Networks

Search and Replication in Unstructured Peer-to-Peer Networks

Search and Replication in Unstructured P2P Networks

Improving Search in Peer-to-Peer Networks

Peer-to-Peer Networks

Peer-to-peer networks

A Dynamic Routing Protocol for Keyword Search in Unstructured Peer-to-peer Networks

Peer-to-Peer Networks

Peer-to-Peer Networks

Peer-to-peer networks

Replication Strategies in Unstructured Peer-to-Peer Networks

Efficient Search in Peer to Peer Networks