A Comparative Analysis of Web and P2P Traffic
Naimul Basher (University of Calgary)
Aniket Mahanti (University of Calgary)
Anirban Mahanti (IIT, Delhi)
Carey Williamson (University of Calgary)
Martin Arlitt (U. Calgary and HP Labs)
WWW 2008, Beijing

Introduction
• In the recent past, a significant proportion of Internet traffic volume was from Web applications using HTTP.
• Web traffic is typically characterized by small-sized flows, short-lived connections, asymmetric flow volumes, and well-defined TCP port usage (e.g., 80, 8080, 443).
• The advent of Peer-to-Peer (P2P) file sharing applications in the past decade has triggered a major paradigm shift in Internet data exchange.
• P2P usage has grown steadily since its inception, and recent empirical studies report that Web and P2P together dominate today’s Internet traffic.

Web and P2P Characterization
• Question: How are they similar/different?
• We use recent packet traces collected at a large university (30,000 students and employees) to characterize and compare traffic generated by current Web and P2P applications.
• We also analyze and compare two P2P applications, BitTorrent and Gnutella.
• We primarily focus on characterizing these applications at the flow-level and host-level.
• Our work develops flow-level distributional models that may be used to refine Internet traffic models for use in network simulations and emulation experiments.

Preview of Results
Characteristic | Web | P2P
Flow size | Introduces many mice flows but few elephant flows. | Introduces many mice and elephant flows.
Flow IAT | Typically short IAT. | Typically long IAT.
Flow duration | Typically short-lived. | Typically long-lived.
Flow concurrency | Most hosts maintain more than one concurrent flow. | Many hosts maintain only one flow at a time.
Transfer volume | Large transfers are dominated by downstream traffic. | Large transfers happen in either upstream or downstream direction.
Geography | Most external hosts are located in the same geographic region. | External peers are globally distributed.

Trace Collection Methodology
• Full packet traces were collected using lindump from the 100 Mbps full duplex commercial Internet connection of the University of Calgary.
• Since P2P applications frequently use random ports, we used payload signatures to identify applications.
• We used bro, a network intrusion detection system (IDS), to perform payload signature matching and map network flows to traffic types.
• We used non-contiguous 1-hour traces collected each morning and evening on Thursday through Sunday between April 6 and April 30, 2006.
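
Payload-signature matching of the kind described on this slide can be sketched as follows. This is an illustrative sketch only, not the signature set used in the study: the byte patterns are well-known protocol openers (the BitTorrent handshake prefix, the Gnutella handshake strings, and HTTP request/response lines), and the names `SIGNATURES` and `classify_payload` are hypothetical.

```python
import re

# Illustrative signatures; the study's actual rules are not given in the slides.
SIGNATURES = [
    ("BitTorrent", re.compile(rb"^\x13BitTorrent protocol")),
    ("Gnutella", re.compile(rb"^GNUTELLA( CONNECT)?/")),
    ("Web", re.compile(rb"^(?:GET|POST|HEAD) \S+ HTTP/1\.[01]|^HTTP/1\.[01] \d{3}")),
]

def classify_payload(payload: bytes) -> str:
    """Map the first payload bytes of a flow to a traffic type, ignoring ports."""
    for label, pattern in SIGNATURES:
        if pattern.match(payload):
            return label
    return "Unknown"

print(classify_payload(b"\x13BitTorrent protocol" + b"\x00" * 8))
print(classify_payload(b"GET /index.html HTTP/1.1\r\n"))
```

A first-match prefix check like this is deliberately simplistic. For example, Gnutella file transfers use HTTP-style requests, so a production classifier would need additional rules to avoid labelling them as Web.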

Trace Summary
TCP Trace Statistics | Count
Number of Flows | 23 million
Number of Packets | 945 million
Data Volume | 585 GB

Internet Applications | Flows | Bytes
Web | 40% | 35%
P2P | 3% | 33%

P2P Applications | Flows | Bytes
Gnutella | 21% | 78%
BitTorrent | 61% | 17%
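
The trace summary already implies how different the average flow is for the two traffic types. The back-of-the-envelope calculation below assumes that 1 GB = 10^9 bytes and that the Web and P2P percentages are shares of all TCP flows and bytes in the trace; the function name `mean_flow_bytes` is hypothetical.

```python
TOTAL_FLOWS = 23e6    # "23 million" flows
TOTAL_BYTES = 585e9   # "585 GB", assuming 1 GB = 10**9 bytes

def mean_flow_bytes(flow_share: float, byte_share: float) -> float:
    """Average bytes per flow for a traffic type, from its shares of flows and bytes."""
    return (byte_share * TOTAL_BYTES) / (flow_share * TOTAL_FLOWS)

web_mean = mean_flow_bytes(0.40, 0.35)   # Web: 40% of flows, 35% of bytes
p2p_mean = mean_flow_bytes(0.03, 0.33)   # P2P: 3% of flows, 33% of bytes

print(f"Web mean flow size: {web_mean / 1e3:.1f} KB")
print(f"P2P mean flow size: {p2p_mean / 1e3:.1f} KB")
print(f"P2P / Web ratio:    {p2p_mean / web_mean:.1f}x")
```

Under these assumptions the average P2P flow carries roughly 280 KB against roughly 22 KB for the average Web flow, a ratio of about 12.6.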

Characterization Metrics
• Flow-level characterization metrics
  • Flow size: total bytes transferred during a connection. Mice transfer < 10 KB. Elephants transfer > 5 MB. (Others are called Buffalo.)
  • Flow duration: the time between the start and the end of a TCP flow (e.g., SYN and FIN).
  • Flow inter-arrival time (IAT): the time between two consecutive flow arrivals.
• Host-level characterization metrics
  • Flow concurrency: the maximum number of TCP flows a single host uses concurrently to transfer content to/from one or more hosts.
  • Transfer volume: the total bytes transferred to (downstream) and from (upstream) a host.
  • Geographic distribution: the distribution of the distance between hosts and U of C along the surface of the Earth.
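
The three flow-level metrics can be computed directly from per-flow records. The sketch below assumes a hypothetical `Flow` record with start time, end time, and byte count, and assumes decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes), which the slides do not specify.

```python
from dataclasses import dataclass

MICE_MAX_BYTES = 10 * 1000          # "< 10 KB", assuming 1 KB = 1000 bytes
ELEPHANT_MIN_BYTES = 5 * 1000 ** 2  # "> 5 MB", assuming 1 MB = 10**6 bytes

@dataclass
class Flow:
    start: float   # seconds, e.g. time of the SYN
    end: float     # seconds, e.g. time of the FIN
    nbytes: int    # total bytes transferred on the connection

def size_class(flow: Flow) -> str:
    """Label a flow as mouse, elephant, or buffalo by its size."""
    if flow.nbytes < MICE_MAX_BYTES:
        return "mouse"
    if flow.nbytes > ELEPHANT_MIN_BYTES:
        return "elephant"
    return "buffalo"

def duration(flow: Flow) -> float:
    """Flow duration in seconds."""
    return flow.end - flow.start

def inter_arrival_times(flows: list[Flow]) -> list[float]:
    """Gaps between consecutive flow start times, in arrival order."""
    starts = sorted(f.start for f in flows)
    return [later - earlier for earlier, later in zip(starts, starts[1:])]

flows = [Flow(0.0, 0.5, 2_000), Flow(1.5, 100.0, 6_000_000), Flow(1.0, 3.0, 50_000)]
print([size_class(f) for f in flows])
print(inter_arrival_times(flows))
```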

Flow Sizes: Web and P2P
• P2P applications generate many small-sized flows and many very large-sized flows (many more than Web applications generate).
• Small-sized P2P flows arise from signaling, aborted transfers, and connection attempts to unresponsive peers.
• We also find some very large P2P flows, which are much larger than the large Web transfers.
P2P model: Hybrid Pareto and Weibull
Web model: Hybrid Pareto and Weibull

Flow Sizes: Gnutella and BitTorrent
• Gnutella and BitTorrent generate similar percentages of small-sized flows (e.g., control info exchanged between peers).
• Gnutella generates more large-sized flows than BitTorrent.
• Gnutella usually downloads an entire object from a single peer.
• BitTorrent uses file segmentation to split an object into multiple equal-sized pieces (e.g., 256 KB), and downloads the pieces using parallel flows and/or persistent connections.
BitTorrent model: Hybrid Lognormal and Pareto
Gnutella model: Hybrid Lognormal and Pareto

Mice and Elephant Phenomenon
• Web mice flows account for a relatively higher proportion of total Web bytes than P2P mice flows do for total P2P bytes.
• P2P elephant flows are larger than Web elephant flows.
• BitTorrent mice flows, on average, are larger than Gnutella mice flows because of BitTorrent’s signaling activities.
• BitTorrent elephant flows, on average, are larger than Gnutella elephant flows.
• Gnutella users share mostly audio files, while BitTorrent users share more video files. [CacheLogic P2P Study 2005]

Flow Durations: Web and P2P
• Approx. 70% of Web flow durations are < 1 sec, indicating low response times for Web requests (i.e., good Internet connectivity on campus).
• Approx. 30% of P2P flows are shorter than 30 sec. These often are signaling flows, or failed/aborted flows.
• Some P2P mice flows have long durations due to repeated unsuccessful connection attempts.
• Approx. 40% of P2P flow durations are between 20 and 200 sec. These reflect bandwidth-limited connections.
P2P model: Hybrid Weibull and Pareto
Web model: Two-mode Pareto

Flow Durations: Gnutella and BitTorrent
• BitTorrent flows typically last longer than Gnutella flows.
• Longer BitTorrent flows result from its protocol architecture: concurrent flows, a fixed number of permitted uploads/downloads, and persistent connections.
• Gnutella can use a single flow for downloading an object (no need to share bandwidth with concurrent flows).
BitTorrent model: Hybrid Lognormal and Pareto
Gnutella model: Hybrid Lognormal and Pareto

Flow Concurrency: Web and P2P
• Many P2P hosts in our network maintain only a single TCP connection (a surprising result).
• A significant proportion of internal Web hosts maintain more than one concurrent TCP connection.
• Web browsers often initiate multiple concurrent connections to transfer content in parallel.
• A high degree of Web flow concurrency (> 30) is due to Web proxies, browser accelerators, and content distribution nodes.
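
Flow concurrency, defined earlier as the maximum number of TCP flows a host uses at the same time, can be computed with a simple sweep over flow start and end events. This is a minimal sketch assuming each host's flows are available as `(start, end)` time pairs; the function name `max_concurrency` is hypothetical, and treating a flow that ends at the exact instant another begins as non-overlapping is a design choice made here, not something the slides specify.

```python
def max_concurrency(intervals: list[tuple[float, float]]) -> int:
    """Peak number of simultaneously open flows for one host."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # flow opens
        events.append((end, -1))     # flow closes
    # Tuples sort by time first; at equal times, closes (-1) come before opens (+1).
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# Three flows overlap between t=4 and t=5; the last flow starts as the first ends.
print(max_concurrency([(0, 10), (2, 5), (4, 8), (10, 12)]))
```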

Distinct IP Addresses for Concurrent Flows
[Figure panels: Web, P2P]
• Web tends to have multiple concurrent flows to the same host.
• P2P hosts use concurrent flows to connect to many hosts.
• P2P protocols encourage connectivity with multiple hosts to facilitate widespread sharing of data.

Flow Concurrency: Gnutella and BT
• Most Gnutella hosts connect with only one host at a time.
• We observed a few Gnutella hosts with > 10 concurrent TCP connections. These hosts acted as super-peers in Gnutella’s peer hierarchy.
• Most BitTorrent hosts exhibit a high degree of flow concurrency, which is a design feature of BitTorrent.

Transfer Symmetry: P2P Applications
• Transfer symmetry is a major concern for P2P system developers, who want to encourage fair sharing among participating peers.
• We observe pronounced freeloading in Gnutella, and greater fairness in BitTorrent.
• Gnutella host behavior appears to be dominated by extreme upstream and downstream transfers.
• BitTorrent’s tit-for-tat mechanism encourages uploading for the opportunity to download.
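
One way to quantify transfer symmetry per host is the fraction of its bytes that go upstream. The sketch below is an assumption-laden illustration: the 10% and 90% thresholds and the function names are hypothetical, and the slides do not state how the study separated freeloading from fair sharing.

```python
def upstream_fraction(up_bytes: int, down_bytes: int) -> float:
    """Share of a host's total transfer volume that was sent upstream."""
    total = up_bytes + down_bytes
    if total == 0:
        raise ValueError("host transferred no data")
    return up_bytes / total

def host_role(up_bytes: int, down_bytes: int,
              low: float = 0.1, high: float = 0.9) -> str:
    """Label a host by transfer symmetry (thresholds are illustrative only)."""
    frac = upstream_fraction(up_bytes, down_bytes)
    if frac <= low:
        return "freeloader"   # downloads almost exclusively
    if frac >= high:
        return "benefactor"   # uploads almost exclusively
    return "balanced"

print(host_role(0, 10**9))        # pure downloader
print(host_role(10**9, 10**6))    # almost pure uploader
print(host_role(400, 600))        # roughly symmetric
```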

Heavy Hitters: Web and P2P
• Heavy hitters are the few hosts that account for much of the traffic volume transferred.
• Heavy hitters are present in both Web and P2P.
• Top-ranked P2P hosts transfer an order of magnitude more data than top-ranked Web hosts.
• Most P2P heavy hitters are either freeloaders or benefactors.
• The total amount of data transferred by the top 10% of Web and P2P hosts follows a power-law distribution.
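
The concentration of traffic among heavy hitters can be measured as the share of total bytes moved by the top-ranked hosts. This is a minimal sketch with made-up volumes; `top_share` and its `top_percent` parameter are hypothetical names, and the data is not from the study.

```python
def top_share(volumes: list[float], top_percent: int = 10) -> float:
    """Fraction of total volume transferred by the top `top_percent`% of hosts."""
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Integer arithmetic avoids floating-point surprises; always keep at least one host.
    k = max(1, len(ranked) * top_percent // 100)
    return sum(ranked[:k]) / total

# Ten hypothetical hosts (bytes transferred); one host dominates.
volumes = [1000, 50, 40, 30, 20, 10, 10, 10, 10, 20]
print(f"Top 10% of hosts carry {top_share(volumes):.0%} of the bytes")
```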

Geographic Distribution: Web and P2P
• Approx. 75% of external Web hosts are in North America. Europe and Asia account for another 10% each.
• A majority of our campus Web users are English speaking, and thus are likely to visit Web sites located in predominantly English-speaking countries.
• Approx. 40% of P2P hosts are located within North America.
• This indicates that connectivity between P2P hosts does not strongly rely on host locality; rather, it depends on resource availability during the connection establishment phase.

Geographic Distribution: Gnutella and BT
• Approx. 70% of Gnutella hosts are located in North America.
• This suggests either that Gnutella peers prefer to connect with hosts that are in close proximity or that Gnutella clients are widely used in North America for file sharing.
• Approx. 30% of BitTorrent hosts are located in North America and approx. 40% are located in Europe.
• We believe that the list of trackers is created based on host bandwidth availability in a swarm, and we see a bias towards regions with high broadband penetration.
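
The geographic metric is the distance between an external host and the campus along the surface of the Earth. The sketch below uses the haversine formula on a spherical Earth, which is one common way to compute such a distance and not necessarily the method used in the study. The campus coordinates are approximate and are an assumption, as is the premise that each external IP address has already been mapped to a latitude and longitude.

```python
import math

EARTH_RADIUS_KM = 6371.0    # mean Earth radius (spherical approximation)
CAMPUS = (51.08, -114.13)   # approximate U of C latitude/longitude (assumption)

def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp against tiny floating-point overshoot before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def distance_from_campus_km(lat: float, lon: float) -> float:
    """Distance of an external host from the campus vantage point."""
    return great_circle_km(CAMPUS[0], CAMPUS[1], lat, lon)

# Distance to a host geolocated near Beijing (approx. 39.9 N, 116.4 E).
print(f"{distance_from_campus_km(39.9, 116.4):.0f} km")
```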

Effect of Network Traffic Management
• At the University of Calgary, traffic is managed using a commercial packet shaping device.
• At the time of capture the network policy was to group together all identified P2P flows and collectively limit their bandwidth to 56 Kbps.
• We do not observe a strong positive correlation between flow size and duration.
• Some P2P flows are indeed identified and limited by the traffic shaper; however, we do see many other P2P flows that escaped detection by the traffic shaper.
• Our results provide a snapshot of Web and P2P characteristics from a large edge network, and should be representative of other edge networks with similar user population and network management policies.
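
A quick calculation shows what the 56 Kbps limit would imply if it were applied to every P2P flow. The sketch assumes 1 Kbps = 1000 bits/s and 1 MB = 10^6 bytes, and it optimistically gives a single flow the entire collective limit. Flagging flows whose average throughput exceeds the limit is an illustrative heuristic for spotting flows that were probably not shaped, not the analysis performed in the study; the function names are hypothetical.

```python
SHAPER_LIMIT_BPS = 56_000   # "56 Kbps", assuming 1 Kbps = 1000 bits/s

def min_duration_sec(nbytes: int, rate_bps: float = SHAPER_LIMIT_BPS) -> float:
    """Shortest possible transfer time if the flow had the whole limit to itself."""
    return nbytes * 8 / rate_bps

def exceeds_limit(nbytes: int, duration_sec: float,
                  rate_bps: float = SHAPER_LIMIT_BPS) -> bool:
    """True if the flow's average throughput is above the shaper limit."""
    return duration_sec > 0 and nbytes * 8 / duration_sec > rate_bps

# A 5 MB elephant flow (1 MB = 10**6 bytes) would need about 12 minutes at 56 Kbps.
print(f"{min_duration_sec(5_000_000):.0f} s")
# The same flow completing in 100 s must have bypassed the limit.
print(exceeds_limit(5_000_000, 100.0))
```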

Summary of Results
Characteristic | Web | P2P
Flow size | Introduces many mice but few elephant flows. | Introduces many mice and elephant flows.
Flow IAT | Typically short IAT. | Typically long IAT.
Flow duration | Typically short-lived. | Typically long-lived.
Flow concurrency | Most hosts maintain more than one concurrent flow. | Many hosts maintain only one flow at a time.
Transfer volume | Large transfers are dominated by downstream traffic. | Large transfers happen in either upstream or downstream direction.
Geography | Most external hosts are located in the same geographic region. | External peers are globally distributed.

Conclusions and Future Work
• Our work presented an extensive characterization study of Web and P2P traffic using full packet traces collected at a large edge network (U of C campus).
• We observed a number of contrasting features between Web and P2P traffic using flow-level and host-level metrics.
• Flow-level distributional models were developed for Web and P2P traffic. These can be used in network simulation and emulation experiments.
• Traffic from other networks should be studied to facilitate development of general models for Web and P2P traffic.
• Impact of other non-Web applications, such as P2P VoIP and IPTV, can be studied as well.

Flow Models
Characteristic | Web | P2P | Gnutella | BitTorrent
Flow size | Weibull-Pareto | Weibull-Pareto | Lognormal-Pareto | Lognormal-Pareto
Flow IAT | Two-mode Weibull | Weibull-Pareto | Weibull-Pareto | Weibull-Pareto
Flow duration | Two-mode Pareto | Weibull-Pareto | Lognormal-Pareto | Lognormal-Pareto
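
To show how such models could feed a simulator, the sketch below draws flow sizes from a hybrid Weibull-Pareto distribution. It assumes that "hybrid" means a Weibull body below a cutoff and a Pareto tail above it, which the slides do not spell out. Every parameter value is a placeholder chosen for illustration, since the fitted parameters are not given here, and `sample_hybrid` is a hypothetical name.

```python
import math
import random

def sample_hybrid(rng: random.Random, body_prob: float, shape: float,
                  scale: float, cutoff: float, tail_alpha: float) -> float:
    """Draw one value: Weibull body truncated at `cutoff`, Pareto tail above it."""
    if rng.random() < body_prob:
        # Inverse-CDF sampling from a Weibull restricted to [0, cutoff).
        cdf_at_cutoff = 1.0 - math.exp(-((cutoff / scale) ** shape))
        u = rng.random() * cdf_at_cutoff
        return scale * (-math.log(1.0 - u)) ** (1.0 / shape)
    # paretovariate returns values >= 1, so the tail starts at the cutoff.
    return cutoff * rng.paretovariate(tail_alpha)

rng = random.Random(1)
# Placeholder parameters: 90% of flows in the body, tail starting at 1 MB.
sizes = [sample_hybrid(rng, 0.9, 0.5, 2000.0, 1e6, 1.2) for _ in range(5000)]
tail_fraction = sum(s >= 1e6 for s in sizes) / len(sizes)
print(f"fraction of flows in the Pareto tail: {tail_fraction:.3f}")
```

Swapping the body sampler for a truncated lognormal would give the Lognormal-Pareto variant listed for Gnutella and BitTorrent, under the same assumption about what "hybrid" means.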

Inter-Arrival Times: Web and P2P
• Web flow IATs are much shorter than those of P2P flows.
• Web traffic has a higher arrival rate (80 flows/sec) compared to P2P traffic (6 flows/sec).
• Another factor contributing to the lower arrival rate and the longer IAT values for P2P flows is the persistent nature of their TCP connections.
P2P model: Hybrid Weibull and Pareto
Web model: Two-mode Weibull
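
The reported arrival rates translate directly into mean inter-arrival times, since the mean IAT is the reciprocal of the average arrival rate. The helper name `mean_iat_ms` below is hypothetical.

```python
def mean_iat_ms(flows_per_sec: float) -> float:
    """Mean inter-arrival time in milliseconds for a given average arrival rate."""
    return 1000.0 / flows_per_sec

print(f"Web: {mean_iat_ms(80):.1f} ms between flow arrivals")
print(f"P2P: {mean_iat_ms(6):.1f} ms between flow arrivals")
```

At these rates a new Web flow arrives every 12.5 ms on average, against roughly 167 ms for P2P.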

Transfer Volume: Web and P2P
• Approx. 50% of Web and P2P hosts transfer small amounts of data (< 1 MB) and are typically active for < 100 sec.
  • P2P hosts that repeatedly yet unsuccessfully attempt to connect to peers.
  • Web hosts that browse the Web, run widgets that periodically retrieve information from the Web, or download small files.
• Approx. 35% of Web and 15% of P2P hosts transfer < 10 MB of data and are active for < 1000 sec.
  • P2P hosts that share small objects.
  • Web hosts that browse the Web for prolonged periods, download software/multimedia, or use HTTP-based streaming.