On Network-Aware Clustering of Web Clients
Balachander Krishnamurthy, bala@research.att.com, AT&T Labs-Research, Florham Park, NJ, USA
Jia Wang, jiawang@cs.cornell.edu, Cornell University, Ithaca, NY, USA
Outline
• Introduction
• Simple approaches to clustering
• Network-aware approach
• Applications of client clustering
• Conclusion and future work
Introduction
• Original goal: identify the groups of clients that are responsible for a significant portion of a Web site’s requests
• Cluster
  • Non-overlapping
  • Topologically close
  • Under common administrative control
• However, identifying clusters requires knowledge that is not available to anyone outside the administrative entities.
• Network-aware approach: BGP-based
Simple approaches
• Two approaches
  • Use the traditional Class A, Class B, and Class C networks
  • Assume the prefix length is always 24 bits
• Both are simple, but neither gives good results (~50% accuracy).
• Counterexample
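The two simple approaches can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the sample addresses are hypothetical, and the class boundaries are the standard classful ones (first octet below 128 for Class A, below 192 for Class B, below 224 for Class C).

```python
import ipaddress


def classful_prefix(ip: str) -> str:
    """Return the classful (Class A/B/C) network that contains ip."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:
        length = 8       # Class A
    elif first_octet < 192:
        length = 16      # Class B
    elif first_octet < 224:
        length = 24      # Class C
    else:
        raise ValueError("Class D/E addresses have no classful network")
    # strict=False masks off the host bits instead of raising an error
    return str(ipaddress.ip_network(f"{ip}/{length}", strict=False))


def fixed_24_prefix(ip: str) -> str:
    """Return the /24 network that contains ip (the 24-bit assumption)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


# Hypothetical client addresses, for illustration only.
print(classful_prefix("12.65.147.10"))    # 12.0.0.0/8
print(fixed_24_prefix("12.65.147.10"))    # 12.65.147.0/24
```

Neither function consults routing information, so a cluster can come out too large (a whole Class A network) or too small (a /24 slice of a larger allocation).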
Network-aware approach
• Use BGP routing and forwarding table snapshots
• Routing table entries identify clusters
• Example snapshot of BGP routing table
Clustering process
• BGP routing tables → prefix extraction, unification, merging → prefix table
• Source of IP addresses → IP address extraction → IP addresses
• Prefix table + IP addresses → client cluster identification (automated process) → raw client clusters
• Raw client clusters → validation (optional), examining impact of network dynamics, self-correction and adaptation → client clusters
Network prefix extraction
• Prefix entry extraction (BGP tables from 14 places via automated scripts): AADS, MAE-EAST, MAE-WEST, PACBELL, PAIX, ARIN, AT&T-Forw, AT&T-BGP, CANET, CERFNET, NLANR, OREGON, SINGAREN, and VBNS
• Prefix format unification and merging
  • Three formats:
    • x1.x2.x3.x4/k1.k2.k3.k4 (dotted netmask)
    • x1.x2.x3.x4/m (prefix length)
    • x1.x2.x3.0 (no explicit mask)
• Assembled a total of 391,497 unique prefix entries (412,109 entries by 7/24/2000)
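Unifying the three formats into a single address/length form might look like the sketch below. It is an illustration under stated assumptions, not the authors' scripts: `unify_prefix` and `merge_tables` are hypothetical names, and an entry with no mask is assumed to take its classful length (8, 16, or 24 bits by first octet).

```python
import ipaddress


def unify_prefix(entry: str) -> str:
    """Convert one prefix entry to the canonical 'a.b.c.d/m' form."""
    entry = entry.strip()
    if "/" in entry:
        # ip_network accepts both a dotted netmask and a prefix length
        # after the slash; strict=False tolerates set host bits.
        return str(ipaddress.ip_network(entry, strict=False))
    # ASSUMPTION: an entry with no mask uses its classful prefix length.
    first_octet = int(entry.split(".")[0])
    if first_octet < 128:
        length = 8
    elif first_octet < 192:
        length = 16
    else:
        length = 24
    return str(ipaddress.ip_network(f"{entry}/{length}", strict=False))


def merge_tables(tables):
    """Merge several prefix tables into one sorted list of unique prefixes."""
    unique = {unify_prefix(entry) for table in tables for entry in table}
    return sorted(unique)


# Hypothetical entries in each of the three formats.
print(unify_prefix("12.0.48.0/255.255.240.0"))   # 12.0.48.0/20
print(unify_prefix("12.0.48.0/20"))              # 12.0.48.0/20
print(unify_prefix("199.1.2.0"))                 # 199.1.2.0/24
```

Once every entry is in the same form, merging tables from several sources reduces to taking a set union.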
Client cluster identification
• Methodology
  • Extract the client IP addresses from the server log
  • Perform longest prefix matching on each client IP address
  • Classify all client IP addresses that share the same longest matched prefix into one client cluster
• Experiments
  • Experiments on a wide range of Web server logs
• Results
  • > 99% of clients can be grouped into clusters
  • ~90% of sampled clusters passed our validation tests
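The methodology above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, prefixes, and client addresses are hypothetical, and the linear scan over the prefix table stands in for the trie or similar structure a table with hundreds of thousands of entries would call for.

```python
import ipaddress
from collections import defaultdict


def longest_match(ip, prefixes):
    """Return the most specific network in prefixes containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net in prefixes:
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best


def cluster_clients(client_ips, prefix_table):
    """Group client IPs by their longest matched prefix.

    Clients that match no prefix are collected under the key None.
    """
    prefixes = [ipaddress.ip_network(p) for p in prefix_table]
    clusters = defaultdict(list)
    for ip in client_ips:
        match = longest_match(ip, prefixes)
        clusters[str(match) if match else None].append(ip)
    return dict(clusters)


# Hypothetical prefix table and client addresses.
table = ["12.0.0.0/8", "12.65.128.0/19", "24.48.0.0/14"]
clients = ["12.65.147.10", "12.65.150.2", "12.1.1.1", "24.48.3.87"]
for prefix, members in cluster_clients(clients, table).items():
    print(prefix, members)
```

The first two clients fall under both 12.0.0.0/8 and 12.65.128.0/19; longest prefix matching places them in the more specific /19 cluster, while 12.1.1.1 stays in the /8.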
Server logs used in our experiments
Example: Nagano server log
Example: Nagano server log (cont.)
Validation of clustering
• Validation is a fundamentally difficult problem
• A client cluster may be mis-identified by being too large or too small
• Two approaches
  • nslookup-based test
  • Optimized traceroute-based test
• Results on a 1% sample of client clusters
  • A client cluster counts as mis-identified if even one client in the cluster does not share the same suffix with the others.
  • Error rate of network-aware approach: ~10%
  • Error rate of simple approach: ~50%
• Possible reasons for mis-clustering: route aggregation, national gateway proxies
• Effect of BGP prefix changes: < 3% (over 2 weeks)
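The suffix rule used in the nslookup-based test can be illustrated with a short sketch. Several things here are assumptions: the hostnames are taken as already resolved (for example by reverse DNS lookup), the "suffix" is taken to be the last two labels of the hostname, and all function names and sample hostnames are hypothetical. The slides do not specify how the suffix is chosen.

```python
def name_suffix(hostname: str, labels: int = 2) -> str:
    """Return the last `labels` dot-separated labels of a hostname.

    ASSUMPTION: two labels (e.g. 'cornell.edu') is an illustrative choice.
    """
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def cluster_passes(hostnames, labels: int = 2) -> bool:
    """A cluster passes only if every resolved client shares one suffix."""
    suffixes = {name_suffix(h, labels) for h in hostnames}
    return len(suffixes) <= 1


def error_rate(sampled_clusters, labels: int = 2) -> float:
    """Fraction of sampled clusters that fail the suffix test."""
    failed = sum(
        1 for names in sampled_clusters.values()
        if not cluster_passes(names, labels)
    )
    return failed / len(sampled_clusters)


# Hypothetical sample: one consistent cluster, one mixed cluster.
sample = {
    "128.84.0.0/16": ["a.cs.cornell.edu", "b.cit.cornell.edu"],
    "192.0.2.0/24": ["host1.example.com", "host2.example.org"],
}
print(error_rate(sample))  # 0.5
```

Because a single mismatching client fails the whole cluster, this test is strict; it can flag a cluster as mis-identified even when most of its members belong together.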
Applications
• Web caching, content distribution, server replication, traffic management and load balancing, Internet map discovery, etc.
• Example: Web caching
  • Client classification: normal client, proxy, and spider
  • Identifying spiders/proxies based on access patterns (figure panels: spider, proxy)
Detecting proxy/spider
Thresholding client clusters
• Metric: number of requests issued from within a client cluster
• Threshold: busy clusters together account for 70% of the total requests in the server log
• Web caching simulation
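One way to read the thresholding step is to rank clusters by request count and keep the busiest ones until they cover 70% of all requests. The sketch below follows that reading; it is an assumption about the procedure, not the authors' code, and the function name and sample counts are hypothetical.

```python
def busy_clusters(request_counts, fraction=0.7):
    """Return the busiest clusters that together cover `fraction` of requests.

    request_counts maps a cluster identifier to its number of requests.
    """
    total = sum(request_counts.values())
    ranked = sorted(request_counts.items(), key=lambda kv: kv[1], reverse=True)
    busy, covered = [], 0
    for cluster, count in ranked:
        if covered >= fraction * total:
            break  # the clusters kept so far already reach the target share
        busy.append(cluster)
        covered += count
    return busy


# Hypothetical per-cluster request counts.
counts = {"12.65.128.0/19": 50, "24.48.0.0/14": 30,
          "151.198.0.0/16": 15, "199.1.2.0/24": 5}
print(busy_clusters(counts))  # ['12.65.128.0/19', '24.48.0.0/14']
```

With these counts the two largest clusters already cover 80 of 100 requests, so the remaining clusters fall below the threshold.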
New dataset
• Altavista server log containing 60,011,458 requests issued by 2,503,974 clients all over the world
  • Number of clusters: 100,091
  • Number of busy clusters: 242
  • Accuracy: 91%
• Clustering works on data from a large, general portal site.
• Thanks to Altavista for sharing data with us. The data included only client IP addresses with no personally identifiable information.
Conclusion and future work
• Network-aware client clustering
  • Based on BGP routing table snapshots
  • Ability to cluster > 99% of clients in the server logs
  • Error rate is ~10% (~50% for the simple approach)
  • Immune to BGP dynamics
  • Variety of applications
• Ongoing work
  • Online algorithm
  • Super/sub clustering
  • Server clustering
  • Server replication application
• Future work
  • Better validation
  • Lower error rate
  • Other applications
Acknowledgement
Thanks to the following people for helping us in this project: Jennifer Rexford, Anja Feldmann, Tim Griffin, Bill Manning, Vern Paxson, Craig Labovitz, Thomas Narten, Steven Bellovin, Emden Gansner, Nick Duffield, S. Keshav, and Walter Willinger.