Explore innovative techniques for clustering IP addresses based on network-aware approaches, overcoming challenges of misidentified clusters using BGP routing data.
Network-Aware Clustering of Web Clients • Advanced IP Topics Seminar, Fall 2000 • Supervisor: Anat Bremler • Speaker: Zotenko Elena
Paper • this presentation is based on: • Balachander Krishnamurthy and Jia Wang, “On Network-Aware Clustering Of Web Clients”, Proc. of ACM SIGCOMM 2000; • Balachander Krishnamurthy and Jia Wang, “On Network-Aware Clustering Of Web Clients”, Technical Report 000101-01-TM, AT&T Labs-Research, January 2000
Agenda • problem definition • simple approach for the problem solution • network-aware approach for the problem solution using information from BGP routing tables • applications
Problem Definition • definition of clustering in our case: • a partitioning of a set of IP addresses into non-overlapping groups, such that all IP addresses in a group are topologically close and under common administrative control • example: the addresses 9.3.5.110, 9.3.5.111, 12.2.94.30, 12.2.95.30 and 12.2.95.33 are partitioned into: • net A 9.3.5.0/255.255.255.0: 9.3.5.110, 9.3.5.111 • net B 12.2.94.0/255.255.254.0: 12.2.94.30, 12.2.95.30, 12.2.95.33
Simple Approach • assumes that the 24 most significant bits (MSB) of each IP address identify the network • groups IP addresses by this network portion of the address • drawback: the assumption is not always correct under CIDR because of: • aggregation; • sub-netting;
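The simple approach can be sketched in a few lines of Python. This is only an illustration under the stated 24-bit assumption; the function name simple_clusters is hypothetical, and the client addresses are the ones from the Problem Definition slide.

```python
from collections import defaultdict
from ipaddress import IPv4Address

def simple_clusters(addresses):
    """Group IPv4 addresses by their 24 most significant bits."""
    clusters = defaultdict(list)
    for text in addresses:
        # Clear the low 8 bits so only the assumed 24-bit network part remains.
        network = IPv4Address(int(IPv4Address(text)) & 0xFFFFFF00)
        clusters[str(network)].append(text)
    return dict(clusters)

clients = ["9.3.5.110", "9.3.5.111", "12.2.94.30", "12.2.95.30", "12.2.95.33"]
result = simple_clusters(clients)
for network, members in result.items():
    print(network, members)
```

With these inputs, 12.2.94.30 lands in a different cluster from 12.2.95.30 and 12.2.95.33, although the Problem Definition slide places all three in net B. This is the sub-netting/aggregation drawback named above.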
Simple Approach • clusters identified correctly • misidentified clusters: • one network spans several clusters; • one cluster contains several networks; • figure: network prefix distribution for a BGP routing table snapshot from the MAE-West NAP
NAA (network-aware approach) Overview • identifies networks based on: • BGP routing table snapshots • IP dump files • includes validation and adaptation stages
NAA Overview • stages: prefix table, clustering, validation, self-correction and adaptation • prefix table stage: • input – network prefixes from: • BGP routing table snapshots; • IP dump files from ARIN and NLANR; • output – prefix table: • contains all prefixes in one format;
NAA Overview • clustering stage: • input: • prefix table; • IP addresses for clustering; • output – raw clusters: • each network prefix represents a cluster; • each IP address is put into the cluster whose prefix is its longest match;
NAA Overview • clustering stage, example: • prefix table contains: • prefix A: 172.30.0.0/255.255.0.0 • prefix B: 172.30.110.0/255.255.255.0 • 172.30.110.256 will be assigned to the cluster represented by prefix B; • 172.30.115.256 will be assigned to the cluster represented by prefix A;
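Longest-prefix assignment can be sketched with Python's standard ipaddress module. The function name assign_cluster is hypothetical, and the linear scan is chosen for brevity; a large prefix table would more likely use a trie. Because 256 is not a valid octet value, the sketch uses a final octet of 25 as a stand-in for the two example addresses.

```python
from ipaddress import ip_address, ip_network

def assign_cluster(address, prefix_table):
    """Return the longest prefix in prefix_table containing address, or None."""
    addr = ip_address(address)
    matches = [net for net in prefix_table if addr in net]
    # The longest match is the matching prefix with the most network bits.
    return max(matches, key=lambda net: net.prefixlen, default=None)

prefix_table = [
    ip_network("172.30.0.0/255.255.0.0"),      # prefix A
    ip_network("172.30.110.0/255.255.255.0"),  # prefix B
]

a = assign_cluster("172.30.110.25", prefix_table)  # matches A and B, B is longer
b = assign_cluster("172.30.115.25", prefix_table)  # matches A only
print(a, b)
```

An address that matches no prefix returns None here. How such clients are handled is not specified on the slide.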
NAA Overview • validation stage: • input: • raw clusters; • estimates the goodness of the raw clusters by cross-checking a small sample (1% of clusters);
NAA Overview • validation stage, goodness: • a cluster is too big => it includes IP addresses from different “networks”; • clusters are too small => several clusters include IP addresses from the same “network”; • “network” – a group of IP addresses that are topologically close and under common administrative control;
NAA Overview • self-correction and adaptation stage: • dynamically changes the clustering according to changes in network topology
NAA Building Prefix Table • ARIN (American Registry for Internet Numbers) IP dump file: • contains IP addresses registered with ARIN • on one hand, may contain addresses of non-existent networks, and is thus much larger than any BGP snapshot • on the other hand, may contain a single address block that covers several networks • only 1% of clients are clustered based on IP dump files • BGP snapshot taken from AADS NAP: • publicly available via www.merit.edu/ipma/routing_table • much smaller than IP dump files; • contains networks that physically exist and are reachable;
NAA Validation • cross-check of the clustering based on host names or, if names are unavailable, on paths • based on the assumption that hosts in the same network: • share the same non-trivial suffix in their names • share the same last few hops on the paths toward them
NAA Validation • why names can be unavailable (50% of clients): • the host is behind a firewall • hosts in the local network acquire dynamic IP addresses from a DHCP server • the ISP has not registered any names for its customers
NAA Validation • validation procedure: • sample 1% of clusters • for each cluster: • use a modified traceroute to resolve the host name, or the last few hops toward the host, for each IP address in the cluster • if a cluster contains hosts from several networks, declare the cluster misidentified • if several clusters contain hosts from the same network, declare those clusters misidentified
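The name-based cross-check can be sketched as follows. This is a simplified illustration, not the procedure from the paper: it assumes host names have already been resolved (None where unavailable), and it treats the last two labels of a name as its "non-trivial suffix", which is an assumption of this sketch. The names name_suffix and validate and the sample data are hypothetical.

```python
from collections import defaultdict

def name_suffix(hostname, labels=2):
    """Return the last `labels` dot-separated components of a host name."""
    return ".".join(hostname.lower().rstrip(".").split(".")[-labels:])

def validate(clusters):
    """clusters maps a cluster id to the host names resolved for its members.
    Returns the set of cluster ids that look misidentified."""
    suspects = set()
    owners = defaultdict(set)  # suffix -> ids of clusters where it appears
    for cluster_id, names in clusters.items():
        suffixes = {name_suffix(n) for n in names if n}
        if len(suffixes) > 1:      # one cluster contains several networks
            suspects.add(cluster_id)
        for suffix in suffixes:
            owners[suffix].add(cluster_id)
    for ids in owners.values():
        if len(ids) > 1:           # one network spans several clusters
            suspects.update(ids)
    return suspects

sample = {
    "A": ["h1.example.edu", "h2.example.edu"],
    "B": ["www.example.org", "mail.example.net"],
    "C": ["a.example.com", None],
    "D": ["b.example.com"],
}
flagged = validate(sample)
print(sorted(flagged))
```

When names are unavailable, the same comparison could be run on the last few traceroute hops instead of name suffixes.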
NAA Validation • about 10% of clusters are misidentified; • one reason for misidentification is the existence of national gateways (e.g. France, Japan), such that information about networks behind these gateways is unavailable in routing tables; • about half of the sampled clients have resolvable names
BGP Dynamics And NAA • BGP routing tables change dynamically due to changes in network topology • NAA clusters clients based on BGP table snapshots, which may not reflect the current network topology or the topology at the time when the client IP address was logged/recorded
BGP Dynamics And NAA • to find out how BGP dynamics affect NAA clustering: • download BGP snapshots daily over a period of n days • denote by S[i] the set of prefixes downloaded on day i • define the maximum effect as the size of the set of prefixes that change during the entire testing period
BGP Dynamics And NAA • example: testing period of 3 days, with snapshots S[1], S[2], S[3] • set of prefixes that are unchanged during the entire 3-day testing period; • set of prefixes that change during the entire 3-day testing period; the size of this set is the maximum effect;
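One way to compute the maximum effect is sketched below. It rests on an interpretation of the slides: "unchanged" prefixes are taken to be those present in every daily snapshot, and "changed" prefixes are all others seen during the period. The function name changed_prefixes and the prefixes in the example are hypothetical.

```python
def changed_prefixes(snapshots):
    """snapshots is a list of sets, one set of prefixes S[i] per day.
    Returns the prefixes that are not present in every snapshot."""
    if not snapshots:
        return set()
    seen = set().union(*snapshots)            # prefixes seen on any day
    stable = set.intersection(*snapshots)     # prefixes seen on every day
    return seen - stable

# Hypothetical 3-day testing period.
s1 = {"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"}
s2 = {"10.0.0.0/8", "172.16.0.0/12"}
s3 = {"10.0.0.0/8", "172.16.0.0/12", "198.51.100.0/24"}

changed = changed_prefixes([s1, s2, s3])
maximum_effect = len(changed)
print(maximum_effect, sorted(changed))
```

Restricting each S[i] to the prefixes actually used to cluster a given server log would give the per-log maximum effect in the same way.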
BGP Dynamics And NAA • results shown: • number of prefixes in the AADS BGP snapshot on the 4th day of a 14-day testing period • maximum effect observed so far, over 4 days • number of prefixes from the AADS BGP snapshot used to identify clusters in the Apache server log • maximum effect observed so far for the prefixes used to identify Apache server log clusters • maximum effect is about 121/3929 = 3% of all client clusters;
NAA Adaptation • although empirical results show that only about 3% of client clusters are affected by changing network topology, an adaptation step can be employed to improve NAA applicability • run periodic traceroute on sampled clusters • using traceroute results: • merge clusters that span the same network into one cluster • divide a cluster that includes several networks into several clusters
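The merge and divide steps can be illustrated by regrouping sampled clients on a path signature. This is a deliberately simplified sketch: it assumes traceroute has already been run and that the last few hops toward each client are available as a tuple, and it treats identical tuples as "the same network". The function name adapt, the router labels and the addresses are hypothetical.

```python
from collections import defaultdict

def adapt(clusters, last_hops):
    """clusters: cluster id -> list of client IPs.
    last_hops: client IP -> tuple with the last few hops toward that client.
    Returns new clusters, one per distinct path signature."""
    regrouped = defaultdict(list)
    for members in clusters.values():
        for ip in members:
            # Clients sharing a signature are merged; a cluster whose
            # members have different signatures is divided.
            regrouped[last_hops[ip]].append(ip)
    return [sorted(group) for group in regrouped.values()]

clusters = {
    "P1": ["192.0.2.1", "192.0.2.2"],
    "P2": ["192.0.2.130"],
    "P3": ["198.51.100.7", "203.0.113.9"],
}
last_hops = {
    "192.0.2.1": ("r1", "r2"),
    "192.0.2.2": ("r1", "r2"),
    "192.0.2.130": ("r1", "r2"),
    "198.51.100.7": ("r5", "r6"),
    "203.0.113.9": ("r8", "r9"),
}
adapted = adapt(clusters, last_hops)
print(adapted)
```

In this example P1 and P2 are merged because their clients share the same last hops, and P3 is divided into two clusters.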
Applications • position of Web caching proxy • caching proxy: • acts as server for clients; • acts as client for the origin server; • caches frequently accessed resources;
Applications • filter out hosts with unusual access patterns: • caching proxy • spider
Applications • figure: client request distribution of a client cluster containing a spider • the spider issues 99.79% of all requests in the cluster
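A cluster dominated by a single client, as in the spider example, can be flagged by checking the largest per-client share of requests. This is a minimal sketch: the function name dominant_client, the 0.9 threshold and the request data are assumptions for illustration, and a share test alone does not distinguish a spider from a proxy.

```python
from collections import Counter

def dominant_client(requests, threshold=0.9):
    """requests: list of client IPs, one entry per request in a cluster.
    Returns (ip, share) if one client issues more than `threshold`
    of the cluster's requests, otherwise None."""
    if not requests:
        return None
    ip, count = Counter(requests).most_common(1)[0]
    share = count / len(requests)
    return (ip, share) if share > threshold else None

# Hypothetical cluster log: one client issues 998 of 1000 requests.
log = ["192.0.2.10"] * 998 + ["192.0.2.11", "192.0.2.12"]
hit = dominant_client(log)
print(hit)
```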
Applications • figures: • number of requests over time for the entire server log • number of requests over time for a cluster containing a proxy • number of requests over time for a cluster containing a spider
Applications • identify busy clusters based on metrics such as the number of clients and the number of requests issued • place a caching proxy in front of each busy cluster
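Selecting busy clusters can be sketched as a filter over per-cluster statistics. The cut-off used here (a minimum number of requests) is an assumption of this sketch, since the slides do not state a threshold; the function name busy_clusters and the statistics are hypothetical.

```python
def busy_clusters(stats, min_requests):
    """stats: cluster prefix -> (number of clients, number of requests).
    Returns (prefix, clients, requests) rows for clusters with at least
    min_requests requests, busiest first."""
    rows = [
        (prefix, clients, requests)
        for prefix, (clients, requests) in stats.items()
        if requests >= min_requests
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical per-cluster statistics taken from a server log.
stats = {
    "172.30.0.0/16": (40, 1200),
    "172.30.110.0/24": (5, 90),
    "12.2.94.0/23": (12, 4000),
}
busy = busy_clusters(stats, min_requests=1000)
for prefix, clients, requests in busy:
    print(prefix, clients, requests)
```

Clusters already flagged as containing a spider or a proxy would normally be removed before this ranking, in line with the filtering step above.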