CS 525 Advanced Topics in Distributed Systems, Spring 08 Indranil Gupta Characteristics of P2P systems
Napster (figure): peers store their own files; napster.com servers store peer pointers for all files. 1. Query 2. All servers search their lists (ternary tree algo.) 3. Response 4. Ping candidates 5. Download from best host
Gnutella (figure): a query (“Who has PennyLane.mp3?”) is flooded out among peers, TTL-restricted, and forwarded only once
Kazaa (figure): ordinary peers attach to supernodes
What are the Characteristics of these Systems in Real-life Settings? • Collect traces • Tabulate them • Papers contain plenty of information on how data was collected, the caveats, ifs and buts of the interpretation, etc. • These are important, but we will ignore them for this lecture and concentrate on the raw data and conclusions • We’ll focus mostly on Gummadi et al and Chu et al, and touch on the PPLive results at the end
Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload Gummadi, Saroiu, et al. Department of Computer Science University of Washington
Three-tiered approach • 2003 paper analyzed a 200-day trace of Kazaa traffic • Considered only traffic going from U. Washington to the outside • Developed a model of multimedia workloads • Analyzed the model and confirmed hypotheses • Explored locality-awareness in Kazaa
Contributions • Obtained some useful characterizations of Kazaa’s traffic • Showed that Kazaa’s workload is not Zipf • Showed that other workloads (multimedia) may not be Zipf either • Presented a model of P2P file-sharing workloads based on their trace results • Validated the model through simulations that yielded results very similar to those from traces • Proved the usefulness of exploiting locality-aware request routing
Measurement Findings • Users are patient • Users slow down as they age • Kazaa is not one workload • Kazaa clients fetch objects at-most-once • Popularity of objects is often short-lived • Kazaa is not Zipf
User characteristics (1) • Users are patient
User characteristics (2) • Users slow down as they age • clients “die” • older clients ask for less each time they use system
User characteristics (3) • Client activity • Tracing used could only detect users when their clients transfer data • Thus, they only report statistics on client activity, which is a lower bound on availability • Avg session lengths are typically small (median: 2.4 mins) • Many transactions fail • Periods of inactivity may occur during a request if client cannot find an available server with the object
Object characteristics (1) • Kazaa is not one workload • (Figure note: these measurements do not account for connection overhead)
Object characteristics (2) • Kazaa object dynamics • Kazaa clients fetch objects at most once • Popularity of objects is often short-lived • Most popular objects tend to be recently-born objects • Most requests are for old objects (> 1 month) • 72% old – 28% new for large objects • 52% old – 48% new for small objects
Object characteristics (3) • Kazaa is not Zipf • Zipf’s law: popularity of the ith-most popular object is proportional to i^(-α) (α: Zipf coefficient) • Web access patterns are Zipf: a small number of objects are extremely popular, but there is a long tail of unpopular requests • A Zipf distribution looks linear on a log-log scale Caveat: what is an “object” in Kazaa?
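A quick way to see the “linear on log-log scale” property is to generate a Zipf popularity distribution and measure its log-log slope; a minimal sketch (object counts and ranks are illustrative):

```python
import math

def zipf_popularity(n_objects, alpha=1.0):
    """Popularity of the i-th most popular object under Zipf's law: p(i) ~ i^-alpha."""
    weights = [i ** -alpha for i in range(1, n_objects + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# On a log-log scale a Zipf distribution is a straight line with slope -alpha:
p = zipf_popularity(1000, alpha=1.0)
slope = (math.log(p[99]) - math.log(p[0])) / (math.log(100) - math.log(1))
# slope is exactly -alpha here, i.e. -1.0
```

A Kazaa-like workload would instead show a flattened head on this plot: the most popular objects are requested less often than Zipf predicts, because clients fetch each object at most once.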
Model of P2P file-sharing workloads Why a model? • On average, a client requests 2 objects/day • P(x): probability that a user requests an object of popularity rank x — Zipf(1), adjusted so that objects are requested at most once • A(x): probability that a newly arrived object is inserted at popularity rank x — Zipf(1) • All objects are assumed to have the same size • Use caching to observe performance changes (effectiveness = hit rate)
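The fetch-at-most-once behavior and its effect on caching can be sketched with a toy simulation. All parameter values, and the unbounded shared cache, are illustrative assumptions, not the paper’s setup:

```python
import random

def zipf_sample(weights, rng):
    """Draw a 0-based object index with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def simulate(n_objects=500, n_clients=50, requests_per_client=40, seed=1):
    """Toy fetch-at-most-once model: clients draw from Zipf(1), resampling
    until they hit an object they have not fetched before."""
    rng = random.Random(seed)
    weights = [1.0 / i for i in range(1, n_objects + 1)]  # Zipf(1) rank popularities
    cache = set()                     # shared, unbounded cache (simplifying assumption)
    fetched = [set() for _ in range(n_clients)]
    hits = requests = 0
    for _ in range(requests_per_client):
        for c in range(n_clients):
            x = zipf_sample(weights, rng)
            while x in fetched[c]:    # at-most-once: resample until new to this client
                x = zipf_sample(weights, rng)
            fetched[c].add(x)
            requests += 1
            if x in cache:
                hits += 1
            else:
                cache.add(x)
    return hits / requests
```

Running this shows the qualitative effect the paper describes: the hit rate is high while popular objects dominate, and each client’s marginal requests drift toward the unpopular tail as it ages.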
Model – Simulation results • File-sharing effectiveness diminishes with client age • System evolves towards one with no locality and objects chosen at random from large space • New object arrivals improve performance • Arrivals replenish supply of popular objects • New clients cannot stabilize performance • Cannot compensate for increasing number of old clients • Overall bandwidth increases in proportion to population size
Model validation • By tweaking the arrival rate of new objects, the authors were able to match trace results (with 5475 new arrivals per year)
Exploring locality-awareness • Currently organizations shape or filter P2P traffic • Alternative strategy: exploit locality in file-sharing workload • Caching; or • Use content available within organization to substantially decrease external bandwidth usage • Result: 86% of externally downloaded bytes could be avoided by using an organizational proxy
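The bandwidth saving of an ideal organizational proxy can be estimated from a download trace in one pass: only the first external fetch of each object crosses the boundary, and every repeat is served internally. A minimal sketch (the toy trace is illustrative, not the paper’s data):

```python
def external_bytes(trace, use_proxy):
    """trace: list of (object_id, size_bytes) download events from inside the org.
    With an ideal unbounded proxy, only the first fetch of each object goes external."""
    seen = set()
    total = 0
    for obj, size in trace:
        if use_proxy and obj in seen:
            continue          # repeat request, served from the organizational proxy
        total += size
        seen.add(obj)
    return total

# Toy trace: object "x" is downloaded three times, "y" once
trace = [("x", 100), ("x", 100), ("y", 50), ("x", 100)]
saved = 1 - external_bytes(trace, True) / external_bytes(trace, False)
# 200 of 350 bytes (the two repeat fetches of "x") are avoided
```

Applying the same one-pass accounting to the real Kazaa trace is what yields the paper’s 86% figure.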
Some Questions for You • Most requests for old objects, while most popular objects are new ones – is there a contradiction? • “Unique object” : When do we say two objects A and B are “different”? • When they have different file names • fogonthetyne.mp3 and fogtyne.mp3 • When they have exactly same content • 2 mp3 copies of same song, one at 64 kbps and the other at 128 kbps • When A (and not B) is returned by a keyword search, and vice versa • …? • Based on this, does “caching” have a limit? Should caching look into file content? Is there a limit to such intelligent caching then? • Should there be separate overlays for small objects and large objects? For new objects and old objects? • Or should there be separate caching strategies?
Availability and Locality Measurements of Peer-to-Peer Systems Chu, Labonte, and Levine Department of Computer Science University of Massachusetts, Amherst
Goals • Measurement study of peer-to-peer (P2P) file sharing applications • Napster (January 2001) • Gnutella (March 2002) • Study collected data to analyze • Locality of files • Distribution of file size and types • Availability
Experiment Methodology • Initially discover a large set of users (nodes) • Ping nodes periodically • If node is available, gather list of shared files • Save timestamp of each list • File-ListNew – File-ListOld = File-ListDownloaded
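The snapshot-difference step above amounts to a set difference between two timestamped file lists; a minimal sketch (the function name and toy lists are illustrative):

```python
def downloaded_files(old_list, new_list):
    """Files present in the new snapshot but absent from the old one are
    inferred to have been downloaded between the two pings."""
    return set(new_list) - set(old_list)

# Two successive BROWSE snapshots of one peer:
old = {"a.mp3", "b.mp3"}
new = {"a.mp3", "b.mp3", "c.mp3"}
assert downloaded_files(old, new) == {"c.mp3"}
```

Note the inference is approximate: a file deleted and re-added between pings, or renamed, would be miscounted, which is one reason filename canonicalization (next slide) matters.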
Problem #1: Filenames • Replicas of the same exact file have different names. Example: • Smashing Pumpkins - Tonight, Tonight. • Smashing Pumpkins - Tonight • Smashing Pumpkins Tonight Tonight • Tonight Tonight Smashing Pumpkins
Solution #1: Filenames • Drop stop words, e.g. “and” and “the” • Take out repeating letters (collins -> colins) • Drop vowels (colins -> clns) • Change non-alphanumeric characters to space • Drop leading white space • Sort the space-delimited name to obtain signature
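The canonicalization heuristic can be sketched in a few lines of Python. The stop-word list and the word-deduplication step are our assumptions (the paper does not give its exact list), added so that names with a repeated word collapse to the same signature:

```python
import re

STOP_WORDS = {"the", "and", "a", "an", "of"}  # illustrative; the paper's exact list is not given

def signature(name):
    """Canonical signature of a filename, per the Chu et al. heuristic."""
    # Change non-alphanumeric characters to spaces (also handles leading whitespace)
    s = re.sub(r"[^a-z0-9]+", " ", name.lower())
    words = []
    for w in s.split():
        if w in STOP_WORDS:                 # drop stop words
            continue
        w = re.sub(r"(.)\1+", r"\1", w)     # collapse repeated letters: collins -> colins
        w = re.sub(r"[aeiou]", "", w)       # drop vowels: colins -> clns
        if w:
            words.append(w)
    # Deduplicate (our assumption) and sort the words to obtain the signature
    return " ".join(sorted(set(words)))

# All four variants of the same song collapse to one signature:
names = [
    "Smashing Pumpkins - Tonight, Tonight.",
    "Smashing Pumpkins - Tonight",
    "Smashing Pumpkins Tonight Tonight",
    "Tonight Tonight Smashing Pumpkins",
]
```

As the next slide notes, the heuristic still misses abbreviations (“Smash Pump”) and respellings (“2nite”), which map to different signatures.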
Solution #1: Limitations • Dropping part of a name? • Smash Pump Tonight Tonight • Changing words • Smashing Pumpkins 2nite 2nite • Adding additional information • Smashing Pumpkins [07] Tonight Tonight
Napster Details • Data collected between December 21, 2000 and February 2, 2001 (before its court case ended) • Custom client based on Napster protocol • Search query based on random dictionary words to gather initial list of users • Another client sent “BROWSE” message to the list
Gnutella Details • Data collected between February 24, 2002 and March 25, 2002 • Custom client based on JTella API • Examined passing QueryHit messages. They contain GUID, IP and port • Get list of shared files from supported clients: Bearshare and SwapNut • To monitor availability rapidly, use nmap
Problem #2: IP Address • As many Internet users utilize DHCP, their IP address changes frequently. • On the other hand, many users might share the same IP address because of NAT
Solution #2: IP Address • Ignore it. • Assign users unique IDs • The [Bhagwan et al] paper shows this choice makes a difference as far as availability measurements are concerned
File Locality (figure): not Zipf, although heavy-tailed; “log-quadratic”
File Transfer Locality (figure): transfers show better locality than storage and are closer to Zipf than the file-storage distribution, but still not Zipf (“log-quadratic”); different from Kazaa
Demographics of Stored Files (figure): one file type dominates the distribution; different from Kazaa. May indicate a need to change the p2p DHT depending on workload.
Node Availability (figure): why is there a difference between the blue and the red curves?
Conclusions • High locality of file storage, higher even in file transfers • Strong diurnal patterns • Constant churn • Short session lengths
Discussion • Does it make sense to delegate no responsibility (overlay links, metadata, etc.) at all to nodes whose availability is below a threshold? • Should there be separate overlays for small and large files? For more popular versus less popular files? • Given limited caching space, should a node choose a popular file (more requests, fewer popular files), a large file (more request bytes), or an old file (greater percentage of requests)? • Or should it store a small chunk of each? • When should a cached item be deleted? • Replication strategies
PPLive Results Largest IPTV (P2P streaming) system in the world today: 500K users at peak, multiple channels and a per-channel overlay; nodes may be recruited as relays for other channels. (Data from 2006) Results differ from file-sharing P2P overlays: • Users are impatient: session times are small and exponentially distributed (think of TV channel flipping!) • Smaller overlays are random (not power-law or clustered) • Availability is correlated across nodes (that appear in the same snapshot)! (Why? Think of Anysee…) • Channel population varies by 9x over a day.
Discussion • Churn rates: different in different p2p systems? • Can you build systems that: • Adapt to time of day? • Adapt to session time (distributions) varying? • Adapt to varying system size (#online nodes)? • Adapt to varying structure of overlay? • Adapt to different workloads? • Does it make sense having multiple overlays, one for each workload?