Network-Aware Clustering of Web Clients

Explore innovative techniques for clustering IP addresses based on network-aware approaches, overcoming challenges of misidentified clusters using BGP routing data.

ttrevino

Presentation Transcript


  1. Network-Aware Clustering of Web Clients • Advanced IP Topics Seminar, Fall 2000 • Supervisor: Anat Bremler • Speaker: Zotenko Elena

  2. Paper • presentation is based on: • Balachander Krishnamurthy and Jia Wang, “On Network-Aware Clustering Of Web Clients”, Proc. of ACM SIGCOMM 2000; • Balachander Krishnamurthy and Jia Wang, “On Network-Aware Clustering Of Web Clients”, Technical Report 000101-01-TM, AT&T Labs-Research, January 2000

  3. Agenda • problem definition • simple approach for the problem solution • network-aware approach for the problem solution using information from BGP routing tables • applications

  4. Problem Definition • definition of clustering in our case: • a partitioning of a set of IP addresses into non-overlapping groups, such that all IP addresses in a group are topologically close and under common administrative control • example (figure): the addresses 9.3.5.110, 9.3.5.111, 12.2.94.30, 12.2.95.30, 12.2.95.33 are partitioned into: • net A 9.3.5/255.255.255: 9.3.5.110, 9.3.5.111 • net B 12.2.94/255.255.254: 12.2.94.30, 12.2.95.30, 12.2.95.33

  5. Simple Approach • assumes that the 24 most significant bits of each IP address identify the network • groups IP addresses based on the network portion of the IP address • drawback: the assumption is not always correct, due to CIDR: • aggregation; • sub-netting;
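
The simple approach can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: the function name `simple_clusters` is hypothetical, and the sample addresses are the net B addresses from slide 4. It shows how fixing the network portion at 24 bits can split one network across several clusters.

```python
from collections import defaultdict
import ipaddress


def simple_clusters(addresses):
    """Group IPv4 addresses by their first 24 bits (the simple approach)."""
    clusters = defaultdict(list)
    for addr in addresses:
        # strict=False lets us pass a host address and get its /24 network.
        net = ipaddress.ip_network(f"{addr}/24", strict=False)
        clusters[str(net)].append(addr)
    return dict(clusters)


# Addresses that slide 4 places in a single network (net B).
net_b = ["12.2.94.30", "12.2.95.30", "12.2.95.33"]
for prefix, members in simple_clusters(net_b).items():
    print(prefix, members)
```

With these inputs the three addresses end up in two /24 clusters, which is the "one network spans several clusters" failure mode.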

  6. Simple Approach • figure: network prefix distribution for a BGP routing table snapshot from the MAE-West NAP, annotated with three regions: • clusters identified correctly; • misidentified clusters: one network spans several clusters; • misidentified clusters: one cluster contains several networks;

  7. NAA Overview • identifies networks based on: • BGP routing table snapshots • IP dump files • includes validation and adaptation stages

  8. NAA Overview • stages: prefix table, clustering, validation, self-correction and adaptation • prefix table stage: • input – network prefixes from: • BGP routing table snapshots; • IP dump files from ARIN and NLANR; • output – prefix table: • contains all prefixes in one format;
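
Bringing all prefixes into one format could look roughly like the sketch below. The slides do not say which input notations occur or which unified format is used, so both are assumptions here: the hypothetical helper `normalize_prefix` accepts dotted netmasks, abbreviated dotted forms (as written on slide 4) and prefix lengths, and emits `network/length`.

```python
import ipaddress


def _pad(dotted):
    """Pad an abbreviated dotted quad such as '12.2.94' with zero octets."""
    octets = dotted.split(".")
    return ".".join(octets + ["0"] * (4 - len(octets)))


def normalize_prefix(text):
    """Return a prefix as 'network/length' (assumed unified format)."""
    network, sep, mask = text.strip().partition("/")
    if not sep:
        raise ValueError(f"no mask or length given in {text!r}")
    if "." in mask:
        mask = _pad(mask)  # dotted netmask, possibly abbreviated
    # strict=False tolerates entries whose host bits are not all zero.
    net = ipaddress.ip_network(f"{_pad(network)}/{mask}", strict=False)
    return str(net)


for raw in ["172.30.0.0/255.255.0.0", "12.2.94/255.255.254", "172.30.110.0/24"]:
    print(raw, "->", normalize_prefix(raw))
```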

  9. NAA Overview • clustering stage: • input: • prefix table; • IP addresses for clustering; • output – raw clusters: • each network prefix represents a cluster; • each IP address is put into the cluster with the longest matching prefix;

  10. NAA Overview • clustering stage, example: • prefix table contains: • prefix A: 172.30.0.0/255.255.0.0 • prefix B: 172.30.110.0/255.255.255.0 • an address 172.30.110.x matches both prefixes and is assigned to the cluster represented by prefix B (longest match); • an address 172.30.115.x matches only prefix A and is assigned to the cluster represented by prefix A;
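
The longest-match rule from this example can be sketched as follows. This is a minimal linear scan for illustration only, not the method used in the paper; the function name `longest_match` and the host part `.25` of the sample addresses are assumptions chosen for the example.

```python
import ipaddress


def longest_match(address, prefix_table):
    """Return the most specific prefix containing the address, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for prefix in prefix_table:
        net = ipaddress.ip_network(prefix)
        # Keep the matching prefix with the largest prefix length.
        if ip in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return str(best) if best is not None else None


table = [
    "172.30.0.0/255.255.0.0",      # prefix A
    "172.30.110.0/255.255.255.0",  # prefix B
]
print(longest_match("172.30.110.25", table))  # falls under prefix B
print(longest_match("172.30.115.25", table))  # falls under prefix A only
```

A linear scan is enough for a two-entry table; for a full routing table a trie or similar structure would be the usual choice.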

  11. NAA Overview • validation stage: • input: • raw clusters; • estimates the goodness of the raw clusters by cross-checking a small number of them (a sample of 1% of clusters);

  12. NAA Overview • validation stage, goodness: • clusters are too big => a cluster includes IP addresses from different “networks”; • clusters are too small => several clusters include IP addresses from the same “network”; • “network” – a group of IP addresses which are topologically close and under common administrative control;

  13. NAA Overview • self-correction and adaptation stage: • dynamically changes the clustering according to changes in network topology

  14. NAA Building Prefix Table • ARIN (American Registry for Internet Numbers) IP dump file: • contains IP addresses registered with ARIN • on one hand, may contain addresses of non-existent networks, and is thus much larger than any BGP snapshot • on the other hand, may contain address entries that span several networks • only 1% of clients are clustered based on IP dump files • BGP snapshot taken from AADS NAP: • publicly available via www.merit.edu/ipma/routing_table • much smaller than IP dump files; • contains networks that physically exist and are reachable;

  15. NAA Validation • cross-checks the clustering based on host names or, if names are unavailable, based on paths • relies on the assumption that hosts in the same network: • share the same non-trivial suffix in their names • share the same last few hops on the paths toward them

  16. NAA Validation • why names can be unavailable (50% of clients): • the host is behind a firewall • the local network acquires dynamic IP addresses via a DHCP server • the ISP has not registered any names for its customers

  17. NAA Validation • validation procedure: • sample 1% of clusters • for each cluster: • use modified traceroute to resolve the host name or the last few hops toward the host for each IP address in the cluster • if a cluster contains hosts from several networks, declare the cluster as misidentified • if several clusters contain hosts from the same network, declare those clusters as misidentified
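
The name-based cross-check can be sketched as below. The slides do not define "non-trivial suffix", so taking the last two labels of a host name is purely an assumption for illustration, as are the function names and the `example.*` host names. Unresolvable names are passed as `None` and skipped; the path-based fallback is not shown.

```python
from collections import defaultdict


def nontrivial_suffix(hostname, labels=2):
    """Assumed heuristic: the last `labels` dot-separated labels of the name."""
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def is_too_big(hostnames, labels=2):
    """True if one cluster's resolvable names disagree on the suffix."""
    suffixes = {nontrivial_suffix(h, labels) for h in hostnames if h}
    return len(suffixes) > 1


def networks_spanning_clusters(clusters, labels=2):
    """Map each suffix seen in more than one cluster to those cluster ids."""
    seen = defaultdict(set)
    for cluster_id, hostnames in clusters.items():
        for h in hostnames:
            if h:
                seen[nontrivial_suffix(h, labels)].add(cluster_id)
    return {s: ids for s, ids in seen.items() if len(ids) > 1}


sample = {
    "c1": ["a.cs.example.edu", None],
    "c2": ["b.ee.example.edu"],
    "c3": ["x.example.com", "y.example.org"],
}
print([cid for cid, names in sample.items() if is_too_big(names)])
print(networks_spanning_clusters(sample))
```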

  18. NAA Validation • about 10% of clusters are misidentified; • one reason for misidentification is the existence of national gateways (e.g. France, Japan), such that information about networks behind these gateways is unavailable in routing tables; • about half of the sampled clients have resolvable names

  19. BGP Dynamics And NAA • BGP routing tables change dynamically due to changes in network topology • NAA clusters clients based on BGP table snapshots, which may not reflect the current network topology or the network topology at the time when the client IP address was logged/recorded

  20. BGP Dynamics And NAA • trying to find out how BGP dynamics affect NAA clustering: • download BGP snapshots daily over a period of n days • denote by S[i] the set of prefixes downloaded on day i • define the maximum effect as the size of the set of prefixes that change during the entire testing period

  21. BGP Dynamics And NAA • example: testing period of 3 days (figure: overlapping sets S[1], S[2], S[3]) • set of prefixes which are unchanged during the entire testing period of 3 days; • set of prefixes which are changed during the entire testing period of 3 days; the size of this set is the maximum effect;
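
One way to compute the maximum effect from the daily sets S[i] is sketched below. It assumes that "unchanged" means present in every snapshot and "changed" means present in at least one snapshot but not in all of them; the slides do not spell this out, and the function name and toy prefixes are illustrative.

```python
def maximum_effect(snapshots):
    """Count prefixes that appear in some, but not all, daily snapshots."""
    if not snapshots:
        return 0
    sets = [set(s) for s in snapshots]
    seen_at_all = set().union(*sets)      # union of S[1..n]
    unchanged = set.intersection(*sets)   # present on every day
    return len(seen_at_all - unchanged)


# Toy 3-day testing period (prefix values are made up for the example).
S = [
    {"10.0.0.0/8", "172.30.0.0/16", "172.30.110.0/24"},
    {"10.0.0.0/8", "172.30.0.0/16", "192.168.0.0/16"},
    {"10.0.0.0/8", "172.30.110.0/24", "192.168.0.0/16"},
]
print(maximum_effect(S))
```

In this toy input only 10.0.0.0/8 is present on all three days, so the other three prefixes count toward the maximum effect.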

  22. BGP Dynamics And NAA • figure (testing period of 14 days, values marked for the 4th day): • number of prefixes in the AADS BGP snapshot; • maximum effect observed so far, during 4 days; • number of prefixes from the AADS BGP snapshot used to identify clusters from the Apache server log; • maximum effect for the prefixes used to identify Apache server log clusters, observed so far; • maximum effect is about 121/3929 = 3% of all client clusters;

  23. NAA Adaptation • empirical results show that only about 3% of client clusters are affected by changing network topology; an adaptation step can still be employed to improve NAA applicability • run periodic traceroute on sampled clusters • using traceroute results: • merge clusters that span the same network into one cluster • divide clusters that include several networks into several clusters
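
The merge and divide steps can be sketched as a regrouping of clients by the network that the traceroute results suggest. Everything here is hypothetical: `network_of` stands in for whatever label is derived from traceroute output (for example, the shared last hops), and clients without a result simply stay in their original cluster.

```python
from collections import defaultdict


def adapt(clusters, network_of):
    """Regroup clients so that each cluster corresponds to one observed network.

    clusters:   dict mapping cluster id -> list of client addresses
    network_of: dict mapping client address -> network label from traceroute
                (hypothetical input; clients missing here were not probed)
    """
    adapted = defaultdict(list)
    for cluster_id, clients in clusters.items():
        for client in clients:
            # Unprobed clients keep their original cluster id as the key.
            key = network_of.get(client, cluster_id)
            adapted[key].append(client)
    return dict(adapted)


clusters = {"p1": ["a", "b"], "p2": ["c"], "p3": ["d", "e"]}
network_of = {"a": "netX", "b": "netX", "c": "netX", "d": "netY", "e": "netZ"}
print(adapt(clusters, network_of))
```

In the example, p1 and p2 are merged because their clients share netX, and p3 is divided because its clients belong to netY and netZ.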

  24. Applications • position of Web caching proxy • caching proxy: • acts as server for clients; • acts as client for original server; • caches frequently accessed resources;

  25. Applications • filter out hosts with unusual access patterns: • caching proxy • spider

  26. Applications • figure: client request distribution of a client cluster containing a spider • the spider issues 99.79% of all requests in the cluster
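
A cluster dominated by a single client, as in this figure, can be flagged with a simple request-share check. This is a sketch under assumptions: the function name and the 0.9 threshold are illustrative and do not come from the slides, and a high share alone does not tell a spider from a proxy.

```python
from collections import Counter


def dominant_client(requests, threshold=0.9):
    """Return (client, share) if one client issues at least `threshold`
    of the cluster's requests, otherwise None.

    requests: iterable with one client address per logged request
    """
    counts = Counter(requests)
    if not counts:
        return None
    client, n = counts.most_common(1)[0]
    share = n / sum(counts.values())
    return (client, share) if share >= threshold else None


# Toy cluster log: one client issues 99 of 100 requests.
log = ["198.51.100.7"] * 99 + ["198.51.100.8"]
print(dominant_client(log))
```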

  27. Applications • figures: • number of requests over time for the entire server log • number of requests over time for a cluster containing a proxy • number of requests over time for a cluster containing a spider

  28. Applications • identify busy clusters based on metrics such as the number of clients and the number of requests issued • place a caching proxy in front of each busy cluster
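
Identifying busy clusters from a server log can be sketched as a pair of counters per cluster. The names, the `min_requests` cut-off and the toy `/24`-style cluster function below are assumptions for illustration; in NAA the cluster of a client would come from the longest-match lookup in the prefix table.

```python
from collections import defaultdict


def busy_clusters(log, cluster_of, min_requests):
    """Return (cluster, requests, clients) rows for clusters with at least
    `min_requests` requests, busiest first.

    log:        iterable with one client address per logged request
    cluster_of: function mapping a client address to its cluster id
    """
    requests = defaultdict(int)
    clients = defaultdict(set)
    for client_ip in log:
        cluster = cluster_of(client_ip)
        requests[cluster] += 1
        clients[cluster].add(client_ip)
    rows = [
        (c, requests[c], len(clients[c]))
        for c in requests
        if requests[c] >= min_requests
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy stand-in for the cluster lookup: drop the last octet.
toy_cluster_of = lambda ip: ip.rsplit(".", 1)[0]
log = ["1.1.1.1", "1.1.1.2", "1.1.1.1", "2.2.2.2"]
print(busy_clusters(log, toy_cluster_of, min_requests=2))
```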

  29. Applications

  30. THE END
