On Network-Aware Clustering of Web Clients

Balachander Krishnamurthy (bala@research.att.com), AT&T Labs-Research, Florham Park, NJ, USA
Jia Wang (jiawang@cs.cornell.edu), Cornell University, Ithaca, NY, USA

Outline
• Introduction
• Simple approaches to clustering
• Network-aware approach
• Applications of client clustering
• Conclusion and future work

Introduction
• Original goal: identify the groups of clients that are responsible for a significant portion of a Web site’s requests
• Cluster
  • Non-overlapping
  • Topologically close
  • Under common administrative control
• But identifying clusters requires knowledge that is not available to anyone outside the administrative entities.
• Network-aware approach: BGP-based

Simple approaches
• Two approaches
  • Use traditional Class A, Class B, and Class C networks
  • Assume a fixed prefix length of 24 bits
• Both are simple but do not give good results (~50% accuracy).
• Counterexample
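The two simple approaches can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `classful_prefix`, `fixed24_prefix`, and `cluster` are hypothetical, and Class D/E addresses are treated as /24 purely to keep the sketch short.

```python
import ipaddress
from collections import defaultdict

def classful_prefix(ip):
    """Map an IPv4 address to its traditional Class A/B/C network."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:
        bits = 8       # Class A
    elif first_octet < 192:
        bits = 16      # Class B
    else:
        bits = 24      # Class C (Class D/E lumped in here: an assumption)
    return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

def fixed24_prefix(ip):
    """Map an IPv4 address to its /24 network (the fixed 24-bit assumption)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

def cluster(ips, key):
    """Group addresses by whatever prefix the key function assigns."""
    groups = defaultdict(list)
    for ip in ips:
        groups[key(ip)].append(ip)
    return dict(groups)
```

Neither function consults any routing data, so a block that is really routed as, say, a /19 gets split into many /24 clusters or merged into one oversized classful cluster.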

Network-aware approach
• Use BGP routing and forwarding table snapshots
• Routing table entries → clusters
• Example snapshot of BGP routing table

Clustering process (flow diagram)
• Source of IP addresses → IP address extraction → IP addresses
• BGP routing tables → prefix extraction, unification, merging → prefix table
• IP addresses + prefix table → client cluster identification → raw client clusters (automated process)
• Validation (optional)
• Examining impact of network dynamics
• Self-correction and adaptation
• Client clusters

Network prefix extraction
• Prefix entry extraction (BGP tables from 14 places via automated scripts): AADS, MAE-EAST, MAE-WEST, PACBELL, PAIX, ARIN, AT&T-Forw, AT&T-BGP, CANET, CERFNET, NLANR, OREGON, SINGAREN, and VBNS.
• Prefix format unification and merging
  • Three formats:
    • x1.x2.x3.x4/k1.k2.k3.k4
    • x1.x2.x3.x4/m
    • x1.x2.x3.0
• Assembled a total of 391,497 unique prefix entries (412,109 entries by 7/24/2000)
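A minimal sketch of prefix format unification and merging, assuming the three formats are netmask notation, prefix-length notation, and a mask-less entry. The slide does not say how the mask-less form x1.x2.x3.0 is interpreted; the sketch assumes a classful length (8, 16, or 24 bits, chosen by the first octet). The names `unify_prefix` and `merge_tables` are hypothetical.

```python
import ipaddress

def unify_prefix(entry):
    """Convert one table entry to canonical 'network/length' form."""
    entry = entry.strip()
    if "/" in entry:
        # ip_network accepts both '/m' and '/k1.k2.k3.k4' (netmask) forms
        net = ipaddress.ip_network(entry, strict=False)
    else:
        # Assumption: a mask-less entry carries its classful prefix length
        first_octet = int(entry.split(".")[0])
        bits = 8 if first_octet < 128 else 16 if first_octet < 192 else 24
        net = ipaddress.ip_network(f"{entry}/{bits}", strict=False)
    return str(net)

def merge_tables(*tables):
    """Union several prefix tables, dropping duplicates after unification."""
    return {unify_prefix(entry) for table in tables for entry in table}
```

For example, `12.65.128.0/255.255.224.0` and `12.65.128.0/19` unify to the same entry, so they count once in the merged table.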

Client cluster identification
• Methodology
  • Extract the client IP addresses from the server log
  • Perform longest prefix matching on each client IP address
  • Group all client IP addresses that share the same longest matched prefix into one client cluster
• Experiments
  • Experiments on a wide range of Web server logs
• Results
  • > 99% of clients can be grouped into clusters
  • ~90% of sampled clusters passed our validation tests
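The methodology above can be sketched as follows. This is a simple illustration of longest prefix matching over a hash table keyed by prefix length, not the authors' implementation; `build_table`, `longest_match`, and `cluster_clients` are hypothetical names, and the prefixes in the usage note are made up for the example.

```python
import ipaddress
from collections import defaultdict

def build_table(prefixes):
    """Index prefixes as: prefix length -> set of network addresses (ints)."""
    table = defaultdict(set)
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix, strict=False)
        table[net.prefixlen].add(int(net.network_address))
    return table

def longest_match(ip, table):
    """Return the longest prefix in the table that contains ip, or None."""
    addr = int(ipaddress.IPv4Address(ip))
    for length in sorted(table, reverse=True):      # try longest first
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
        if (addr & mask) in table[length]:
            return f"{ipaddress.IPv4Address(addr & mask)}/{length}"
    return None

def cluster_clients(ips, prefixes):
    """Group client IPs by their longest matched prefix."""
    table = build_table(prefixes)
    clusters = defaultdict(set)
    for ip in ips:
        clusters[longest_match(ip, table)].add(ip)  # key None = unmatched
    return dict(clusters)
```

With the table `["12.0.0.0/8", "12.65.128.0/19"]`, the client `12.65.147.94` lands in the `12.65.128.0/19` cluster rather than the broader `/8`, while `12.1.2.3` falls back to `12.0.0.0/8`.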

Server logs used in our experiments

Example: Nagano server log

Example: Nagano server log (cont.)

Validation of clustering
• Validation is a fundamentally difficult problem
• A client cluster may be mis-identified by being too large or too small
• Two approaches
  • nslookup-based test
  • Optimized traceroute-based test
• Results on a 1% sample of client clusters
  • A client cluster counts as mis-identified if even one client in it does not share the same suffix as the others.
  • Error rate of network-aware approach: ~10%
  • Error rate of simple approach: ~50%
• Possible causes of mis-clustering: route aggregation, national gateway proxies
• Effect of BGP prefix changes: < 3% (over 2 weeks)
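A rough sketch of what an nslookup-style check might look like, under several assumptions not stated on the slide: the compared suffix is taken to be the last two labels of the reverse-resolved host name (the `labels` parameter is hypothetical), clients whose names cannot be resolved are skipped, and the resolver is pluggable so the check can run against canned data. The strict rule from the slide is kept: one client with a different suffix fails the whole cluster.

```python
import socket

def name_suffix(hostname, labels=2):
    """Last `labels` dot-separated components of a host name (assumed rule)."""
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])

def reverse_lookup(ip):
    """Reverse-resolve an address; return None when no name is available."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def cluster_passes(ips, resolve=reverse_lookup, labels=2):
    """True if every resolvable client in the cluster shares one suffix."""
    suffixes = set()
    for ip in ips:
        name = resolve(ip)
        if name is None:
            continue  # assumption: unresolvable clients are ignored
        suffixes.add(name_suffix(name, labels))
    return len(suffixes) <= 1
```

Passing a dictionary's `get` method as `resolve` makes it possible to replay previously collected lookup results instead of querying DNS again.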

Applications
• Web caching, content distribution, server replication, traffic management and load balancing, Internet map discovery, etc.
• Example: Web caching
  • Client classification: normal client, proxy, and spider
  • Identifying spiders/proxies based on access patterns (figure panels: spider, proxy)

Detecting proxy/spider

Thresholding client clusters
• Metric: number of requests issued from within a client cluster
• Threshold: 70% of the total requests in the server log
• Web caching simulation
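One plausible reading of the thresholding step, sketched below: rank clusters by request count and keep the busiest ones until together they cover 70% of all requests in the log. The function name `busy_clusters` and the `fraction` parameter are hypothetical, and the slide does not spell out how ties or the exact cutoff are handled.

```python
def busy_clusters(request_counts, fraction=0.7):
    """Pick the busiest clusters that together cover `fraction` of requests.

    request_counts maps a cluster identifier (e.g. its prefix) to the
    number of requests issued from within that cluster.
    """
    target = fraction * sum(request_counts.values())
    busy, covered = [], 0
    ranked = sorted(request_counts.items(), key=lambda kv: kv[1], reverse=True)
    for cluster_id, count in ranked:
        if covered >= target:
            break                 # threshold already reached
        busy.append(cluster_id)
        covered += count
    return busy
```

The returned list is ordered from busiest to least busy, which is convenient when the busy clusters feed a later step such as a caching simulation.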

New dataset
• Altavista server log containing 60,011,458 requests issued by 2,503,974 clients all over the world.
  • # clusters: 100,091
  • # busy clusters: 242
  • Accuracy: 91%
• Clustering works on large, general portal site data.
• Thanks to Altavista for sharing data with us. The data included only client IP addresses with no personally identifiable information.

Conclusion and future work
• Network-aware client clustering
  • Based on BGP routing table snapshots
  • Ability to cluster > 99% of clients in the server logs
  • Error rate is ~10% (~50% for the simple approach)
  • Largely immune to BGP dynamics (effect of prefix changes < 3% over 2 weeks)
  • Variety of applications
• Ongoing work
  • Online algorithm
  • Super/sub clustering
  • Server clustering
  • Server replication application
• Future work
  • Better validation
  • Lower error rate
  • Other applications

Acknowledgement
Thanks to the following people for helping us in this project: Jennifer Rexford, Anja Feldmann, Tim Griffin, Bill Manning, Vern Paxson, Craig Labovitz, Thomas Narten, Steven Bellovin, Emden Gansner, Nick Duffield, S. Keshav, and Walter Willinger.
