
Characterizing Files in the Modern Gnutella Network: A Measurement Study

Shanyu Zhao, Daniel Stutzbach, Reza Rejaie. Multimedia & Internetworking Research Group (Mirage), Computer & Information Science Department, University of Oregon. http://mirage.cs.uoregon.edu


Presentation Transcript


  1. Characterizing Files in the Modern Gnutella Network: A Measurement Study Shanyu Zhao, Daniel Stutzbach, Reza Rejaie Multimedia & Internetworking Research Group (Mirage) Computer & Information Science Department University of Oregon http://mirage.cs.uoregon.edu Multimedia Computing & Networking 2006

  2. Introduction
     • P2P applications are very popular over the Internet
       • File-sharing: Gnutella, Kazaa, eDonkey
       • Content distribution: BitTorrent
       • IP telephony: Skype
     • P2P applications remain popular because of
       • Ease of deployment, self-scaling, infrastructure-less
     • Significant impact on the Internet
     • Characterizing P2P applications is essential for
       • Evaluating their performance and improving their designs
       • Conducting meaningful simulations and analytical studies
       • Examining their impact on the network
     • Characteristics of large-scale P2P applications are not well understood!

  3. P2P Systems: An Overview (I)
     • Theme: enabling a group of peers (computers) to share their resources (e.g. files, bandwidth, storage, CPU)
     • As participating peers arbitrarily join & leave, they form an (application-level) overlay topology.
     • Overlay is inherently dynamic
     • No special support from the network (e.g. multicast)
     • Overlay is used for resource discovery, management

  4. P2P Systems – Overview (II)
     • Inherent properties:
       • Scalability: available resources grow organically with the number of peers
       • Churn: peers voluntarily join/leave
       • Heterogeneity: peers have different capabilities
     • Two basic architectures:
       1) Unstructured: peers form a randomly connected overlay
       2) Structured: peers form an overlay with certain properties (ring, tree)

  5. Effect on the Internet
     • 60% of all Internet traffic [CacheLogic Research 2005]
     • Some P2P apps have millions of simultaneous users.
     • Geographically distributed.
     [Figures: Gnutella overlay in 2002; Gnutella population (Oct 04 – Jan 06)]

  6. Research on P2P Networking
     • Active area of research since 2001
     • Mostly focusing on new architectures, new resource discovery/management techniques
     • Evaluation is only feasible through simulation or small-scale experiments with synthetic workloads.
     • Few empirical studies on P2P systems
     • Characteristics of widely deployed P2P systems are not well understood.
       • Peer dynamics: e.g. distribution of peer uptime
       • Overlay properties: e.g. distribution of peer degree
       • Resource properties: e.g. popularity distribution of files

  7. Methodology
     • Characterizing P2P applications requires capturing system “snapshots”.
     • A snapshot is a graph that represents the state of the system at a given point in time (peers = nodes, connections = edges).
       • Individual snapshots reveal instantaneous properties.
       • Consecutive snapshots reveal dynamics.
     • Ideally, a snapshot is captured instantaneously.
     • In practice, a snapshot is iteratively discovered by a P2P crawler.
     • P2P apps should provide support for the crawler, e.g. querying a peer for its list of neighbors and files.
     • It is difficult to characterize proprietary P2P applications.
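The iterative discovery described on slide 7 can be sketched as a breadth-first crawl. This is only an illustration, not Cruiser itself: `crawl_snapshot`, `get_neighbors`, and the toy overlay are hypothetical names, and a real crawler would query live peers in parallel and cope with timeouts and churn.

```python
from collections import deque


def crawl_snapshot(seed, get_neighbors):
    """Iteratively discover an overlay snapshot starting from one seed peer.

    get_neighbors is a hypothetical callback standing in for the
    "query a peer for its list of neighbors" support mentioned on the slide.
    Returns (nodes, edges), with each edge stored as an unordered pair.
    """
    nodes = {seed}
    edges = set()
    queue = deque([seed])
    while queue:
        peer = queue.popleft()
        for neighbor in get_neighbors(peer):
            edges.add(frozenset((peer, neighbor)))
            if neighbor not in nodes:
                nodes.add(neighbor)
                queue.append(neighbor)
    return nodes, edges


# Toy overlay used only for illustration (not measured data).
toy_overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
nodes, edges = crawl_snapshot("A", lambda peer: toy_overlay.get(peer, []))
print(len(nodes), "peers,", len(edges), "connections")
```

Because a real overlay keeps changing while the crawl runs, the faster the crawl completes, the closer the result is to an instantaneous snapshot.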

  8. Cruiser: a Fast P2P Crawler
     • We developed a parallel crawler, called Cruiser.
     • Features:
       • Master-slave architecture: the master coordinates among slaves, and each slave crawls hundreds of peers simultaneously
       • Dynamic adaptation to bandwidth & CPU constraints
       • Generic crawler, accommodates plug-ins
     • Orders of magnitude faster than other P2P crawlers:
       • Captures one million Gnutella nodes in around 7 minutes
       • 140K peers/min (visiting 22K peers/min) >> 2.5 peers/min
     • Lots of important implementation issues:
       • Setting timeouts, number of file descriptors per process, dealing with the local NAT box

  9. Cruiser/ Evaluating Snapshot Accuracy
     • No reference snapshot to compare against
     • Completeness of captured snapshots: edges, nodes
     • Tradeoff between granularity & completeness of snapshots
       • Node distortion > 4%
       • Edge distortion > 15%
     • 30% of peers are unreachable
       • 3% departed peers
       • 17% behind a firewall (NAT)
       • 10% overloaded !!
     [Figure: axis label "Peers discovered (×10,000)"]

  10. Characterizing Files/ Previous Studies
     • Captured a small population of peers
       • Partial snapshot through a short crawl
       • Periodic probe of a fixed group of peers
     • Have not verified whether the captured population is representative
     • Conducted more than 3 years ago (outdated)
       • Population of these apps has grown significantly
       • New features & a two-tier architecture were incorporated

  11. Characterizing Files/ Measurement Methodology
     • Characterizing files requires file snapshots.
     • Obtaining the list of shared files & neighbor info from individual peers
       • a content crawl + a topology crawl
     • Individual snapshots enable static & topological analysis.
     • Consecutive snapshots enable dynamic analysis.
     • Topology crawl is much faster than content crawl (minutes vs. hours)
     • Other challenges: NAT, DHCP, fileID, … (see paper).
     • Minimizing the distortion in file snapshots by
       • Capturing a complete snapshot with a high-speed crawler
       • Decoupling topology crawl from content crawl
     [Figure labels: Ultrapeer, Leaf, Top-level overlay; Topology crawl (15 min), Content crawl (5.5 hours), Topology crawl (15 min)]

  12. Characterizing Files/ Dataset
     • Captured around 50 snapshots
     • Average log size/snapshot: 10 GByte
     • Each snapshot represents
       • 800 Terabytes of content
       • 100 million unique files
       • 0.5 million reachable peers, 20% of identified peers
     • Available content in Gnutella = 4,000 Terabytes
     • Reported results were consistent across multiple snapshots
     • Post-processing
       • e.g. removed duplicate files reported by individual peers (9% of all captured files)
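The dataset figures on slide 12 are mutually consistent under a simple scaling. The sketch below assumes (this is not stated on the slide) that the 4,000 Terabyte figure comes from scaling the captured 800 Terabytes by the fraction of identified peers that were reachable, i.e. that unreachable peers share about as much content on average as reachable ones. The variable names are illustrative.

```python
# Figures taken from the slide.
reachable_peers = 500_000      # 0.5 million reachable peers
reachable_percent = 20         # reachable peers are 20% of identified peers
captured_content_tb = 800      # content seen in one snapshot, in Terabytes

# Assumption: scale linearly by the reachable fraction.
identified_peers = reachable_peers * 100 // reachable_percent
estimated_total_tb = captured_content_tb * 100 // reachable_percent

print(identified_peers)    # 2500000 identified peers
print(estimated_total_tb)  # 4000 Terabytes, matching the slide's figure
```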

  13. Characterizing Files/ Summary of Characterizations
     1) Static analysis: characteristics of files at a given point in time
     2) Topological analysis: correlation between file distribution and overlay topology
     3) Dynamic analysis: changes in file characteristics over time

  14. Characterizing Files/Static Analysis: Free Riding
     • Percentage of free riders reported in previous studies
       • 66% in 2000 [Adar]
       • 25% in 2002 [Saroiu]
     • The percentage of free riders has dropped

     Free riders, June 13, 2005 [rounded numbers]
     Group               Peers   None   Files
     Ultra               159K    12%    352
     Leaf                235K    15%    332
     Long-lived Ultra    125K    12%    349
     Short-lived Ultra    34K    12%    363
     Long-lived Leaf     156K    16%    350
     Short-lived Leaf     79K    14%    297
     Total               394K    14%    340

  15. Characterizing Files/Static Analysis: Resource Sharing
     • How many resources (files, storage) do peers contribute?
     • Distribution of peers contributing:
       • x files follows a power law
       • x MByte follows a power law
     • Most peers contribute little, but a few contribute a lot
     • Shared files vs. storage
       • Not as strong as reported by Saroiu et al. 2002

  16. Characterizing Files/Static Analysis: File Popularity
     • Represents the availability of individual files.
     • Follows a Zipf distribution
     • Popularity distribution remains stable over time
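One common way to check a Zipf claim like the one on slide 16 is to fit a straight line to the rank-versus-count curve on log-log axes; the negated slope estimates the Zipf exponent. The sketch below is a generic least-squares fit on synthetic counts, not the authors' procedure, and `zipf_exponent` is a hypothetical helper name.

```python
import math


def zipf_exponent(replica_counts):
    """Estimate the Zipf exponent from per-file replica counts.

    Files are ranked by popularity (most replicated first) and a line is
    fitted to log(count) versus log(rank) by ordinary least squares.
    All counts must be positive.
    """
    ranked = sorted(replica_counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic counts that follow count = 1000 / rank exactly (exponent 1).
alpha = zipf_exponent([1000 / rank for rank in range(1, 51)])
print(round(alpha, 3))
```

On measured data the fit is usually restricted to a range of ranks, since the head and tail of the curve often deviate from a straight line.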

  17. Characterizing Files/Static Analysis: File Types
     • In 2001, Chu et al. reported
       • Audio: 67% of files, 79% of bytes
       • Video: 2% of files, 19% of bytes
     • mp3 files are very popular!
     • Multimedia (mm) files make up 73% of files and 93% of bytes
     • Non-mm: jpg, gif, htm, exe, txt
     • Video files have become more popular

     Major Audio Types
     Type     File%    Byte%
     mp3      61%      37%
     wma      2.7%     1.3%
     wave     1.9%     0.7%
     m4a      1.4%     0.7%
     total    67%      40%

     Major Video Types
     Type     File%    Byte%
     wmv      2.3%     3.4%
     mpg      2.4%     23.3%
     avi      0.8%     24.5%
     asf      0.14%    0.64%
     total    5.6%     52%

  18. Characterizing Files/ Topological Analysis
     • Is there any correlation between the locations of a file and the overlay topology?
       • i.e. are copies of a file topologically clustered?
     • File locations are affected by two factors:
       1) Scoped search => topological clustering
       2) Churn => random distribution
     • Which factor is dominant?
     • Examining from two angles:
       • Per-file perspective
       • Per-peer perspective

  19. Characterizing Files/ Topological Analysis
     • Simulate flood-based queries from 100 random peers
       • Number of messages to find 5 copies
       • Files with different popularity
       • Random vs. realistic file distribution
     • Average similarity of content between 100 random peers and their one/two/three-hop neighbors.
     • No topological clustering exists
     • Churn is the dominant factor
     • Use a random file distribution for simulation
     • Select random peers to characterize files (non-trivial)
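The per-file experiment on slide 19 counts how many messages a flood-based query needs before a given number of copies is found. A minimal sketch of that measurement follows. It assumes a simplified flood in which every peer forwards the query to all of its neighbors, hop by hop, and the search stops after the hop in which enough copies have been found; the function name and the toy overlay are hypothetical, and the study's simulator may differ.

```python
def flood_messages(overlay, holders, source, copies_wanted):
    """Flood a query hop by hop and count the messages sent.

    overlay maps each peer to its list of neighbors, and holders is the
    set of peers that share the file. Returns (messages, copies_found).
    Simplification: a peer forwards to every neighbor, including the one
    the query came from, so duplicate deliveries are counted as messages.
    """
    visited = {source}
    found = 1 if source in holders else 0
    frontier = [source]
    messages = 0
    while frontier and found < copies_wanted:
        next_frontier = []
        for peer in frontier:
            for neighbor in overlay[peer]:
                messages += 1
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
                    if neighbor in holders:
                        found += 1
        frontier = next_frontier
    return messages, found


# Toy chain overlay 0-1-2-3-4 with copies at peers 2 and 4 (illustrative).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(flood_messages(chain, {2, 4}, source=0, copies_wanted=1))
print(flood_messages(chain, {2, 4}, source=0, copies_wanted=2))
```

Running the same function once with the measured file placement and once with the same number of copies placed on random peers gives the "random vs. realistic" comparison named on the slide.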

  20. Characterizing Files/ Dynamic Analysis
     • How do various characteristics of available files change over different timescales?
       • Peers add/download or remove files
       • Peers join/leave the system
     1) Variations in shared files by individual peers
       • Dynamic IP addresses introduce error
     2) Variations in popularity of individual files
       • Trend in popularity changes

  21. Characterizing Files/Dynamic Analysis: Variations of files at individual peers
     • Ratio of added/removed files to total files (degree of change)
       • 3000 random peers
       • Timescales: 2 hr, 6 hr, 1 day, 1 wk
     • More change over longer timescales seems intuitive
     • Change in popularity of 50K files over a one-day interval
       • More changes for more popular files
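The "degree of change" on slide 21 can be computed from two file lists captured for the same peer. The slide does not say which total the ratio uses, so the sketch below assumes the number of files in the earlier snapshot as the denominator and reports added and removed files separately. `degree_of_change` is a hypothetical helper name.

```python
def degree_of_change(earlier_files, later_files):
    """Return (added_ratio, removed_ratio) for one peer.

    Both ratios are relative to the number of files in the earlier
    snapshot, which is an assumption; the slide only says "total files".
    """
    earlier = set(earlier_files)
    later = set(later_files)
    if not earlier:
        raise ValueError("earlier snapshot has no files to compare against")
    added = len(later - earlier)
    removed = len(earlier - later)
    return added / len(earlier), removed / len(earlier)


# Illustrative file identifiers, not measured data.
before = ["a.mp3", "b.mp3", "c.avi", "d.jpg"]
after = ["b.mp3", "c.avi", "d.jpg", "e.mp3", "f.wmv"]
print(degree_of_change(before, after))
```

As slide 20 notes, dynamic IP addresses can make two different peers look like one, so such a comparison needs a stable way to recognize the same peer across snapshots.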

  22. Characterizing Files/Dynamic Analysis: Change in file popularity
     • Change in popularity
       • For top 100 and 1000 files
       • Over different timescales
     • For any timescale, more popular files exhibit larger changes
     • Changes occur more rapidly
     • Caching references is useful
     • These all seem intuitive, but one needs to quantify the rate of changes
     [Figure panels: Top 100 files; Top 1000 files]

  23. Characterizing Files/Dynamic Analysis: Trends in Popularity Changes
     • Goal: to predict the popularity of a file in the future?
     • No major change in popularity over several days
     • Larger changes over a few months
     • The key is to quantify the rate and pattern of changes.
     • Significantly more snapshots are required to derive any reliable conclusion
