Scalable Performance Monitoring for Wide Area Applications (WAA)
Louiqa Raschid, University of Maryland
DIMACS, February 2002
Collaborators: Avigdor Gal (Technion, Israel), Vladimir Zadorozhny (University of Pittsburgh), the CNRI Handle Development Team, Samir Khuller and Bobby Bhattacharjee (UM)
Overview
• WAA Context: Large numbers of clients (10K+), servers (1K+) and items of digital content (10^8).
• Problem: Performance monitoring includes reachability testing and validation of digital content. How long will a download take? Will the request time out?
• Challenges: A dynamic WAN requires constant monitoring. Heterogeneity includes mobile clients and popular servers. To be scalable, we rely on passive information gathering from client requests. We cannot rely on proprietary tools (Appliant, Keynote) since the WAA is an autonomous federation of clients and servers.
• Testbed: CNRI's Handle System, an emerging IETF/IRTF standard for exchanging digital content.
Latency Profiles
• Given a very large set of Handle clients and Handle servers, how do we develop Latency Profiles (to predict end-to-end latencies) for a cluster of clients and servers?
• A Latency Profile will predict performance for the cluster if (see the sketch below):
  • Internet distances between all pairs of clients and servers are comparable. (University of Michigan IDMaps; BGP tables and AS prefix identification)
  • Network topology between all pairs of clients and servers is comparable. (Points of congestion – SPAND, UC Berkeley)
  • Time and Day dependencies on workload (network and server) are comparable across the cluster. (WebPT-based learning and prediction – University of Maryland)
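A minimal sketch of this comparability test, assuming a hypothetical `Pair` record and hand-picked tolerances (the slides do not prescribe an API or thresholds); `id_distance` stands in for an IDMaps/BGP-derived estimate:

```python
# All names (Pair, comparable, tolerances) are hypothetical illustrations.
from dataclasses import dataclass
from typing import List

@dataclass
class Pair:
    client: str
    server: str
    id_distance: float      # estimated Internet distance (e.g., from IDMaps)
    congested_path: bool    # known point of congestion on the path?
    tz_offset_hours: int    # server time-zone offset relative to the client

def comparable(pair: Pair, cluster: List[Pair],
               dist_tolerance: float = 0.2, tz_tolerance: int = 2) -> bool:
    """True if `pair` may share a Latency Profile with the pairs in `cluster`."""
    if pair.congested_path:                         # dissimilar topology (cf. S2 on the next slide)
        return False
    for other in cluster:
        if abs(pair.id_distance - other.id_distance) > dist_tolerance * other.id_distance:
            return False                            # Internet distance not comparable
        if abs(pair.tz_offset_hours - other.tz_offset_hours) > tz_tolerance:
            return False                            # Time/Day workload likely differs (cf. S3)
    return True

c1_s1 = Pair("C1", "S1", 30.0, False, 0)
c1_s3 = Pair("C1", "S3", 32.0, False, 8)
print(comparable(c1_s3, [c1_s1]))   # False: time-zone gap is too large
```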
[Figure: client C1 (EST) with sources S1 (EST), S2 (EST, point of congestion on the path) and S3 (EST+8).]
• (C1, S1), (C1, S2), (C1, S3) have similar Internet distance and may be in the same cluster.
• S2 should be eliminated from the cluster due to dissimilar network topology (point of congestion), e.g., a busy router near S2.
• S3 should be eliminated since it is in EST+8 – geographic location (Time, Day) may impact the workload on S3.
Scalable Performance Monitoring
• Constructing an Access Cost Distribution (ACD) for each (client, server) pair in the cluster (sketch below)
  • Passive information gathering at clients
  • Statistical analysis to extract features of ACDs
  • WebPT-based learning of features (Noise, Significance of Time, Day)
• Clustering Access Cost Distributions (ACDs)
  • Distance measure for the cluster – relevant ACD features
  • Validating clusters based on IDMaps, topology, etc.
  • Eliminating spurious clusters
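A minimal sketch, under assumed data structures, of the first two steps: building an ACD from passively observed latencies, and a crude distance between ACDs that a clustering step could use. The plain histogram and L1 distance are placeholders, not the WebPT features named on the slide.

```python
import numpy as np

def build_acd(latencies_ms, bins=50, range_ms=(0, 5000)):
    """Access Cost Distribution: normalized histogram of passively observed
    end-to-end latencies for one (client, server) pair."""
    hist, _ = np.histogram(latencies_ms, bins=bins, range=range_ms, density=True)
    return hist

def acd_distance(acd_a, acd_b):
    """L1 distance between two ACDs (a stand-in for a feature-based measure)."""
    return float(np.abs(acd_a - acd_b).sum())

# Toy example: pairs with similar latency behaviour end up close together,
# so a standard clustering algorithm over this distance can group them.
rng = np.random.default_rng(0)
acd1 = build_acd(rng.normal(200, 30, 1000))
acd2 = build_acd(rng.normal(210, 35, 1000))
acd3 = build_acd(rng.normal(900, 120, 1000))
assert acd_distance(acd1, acd2) < acd_distance(acd1, acd3)
```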
Scalable Performance Monitoring: Locating Performance Monitors (PMs) – Identifying Clusters
• One or more PMs per cluster (located at a client).
• Bottom-up approach: identify an initial cluster; augment it with clients and servers; assess the impact on the Latency Profile.
• Top-down approach: triangulation in network space; identify and eliminate distant clients and servers that cannot belong to the same cluster (sketch below).
  • IDMaps distances, BGP routing tables / AS prefix identifiers
• Issues
  • Scalability
  • Security (non-invasive)
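A rough sketch of the top-down elimination step, under assumed names and a toy distance table; `est_distance` stands in for an IDMaps/BGP-derived estimate and the diameter bound is arbitrary:

```python
def top_down_filter(candidates, centre, est_distance, max_diameter_ms=50.0):
    """Keep only nodes whose estimated network distance to `centre` is small
    enough that they could plausibly belong to the same cluster."""
    return [node for node in candidates
            if est_distance(node, centre) <= max_diameter_ms]

# Toy distance table standing in for IDMaps / BGP-derived estimates.
toy = {("C1", "PM"): 12.0, ("C2", "PM"): 18.0, ("S_far", "PM"): 140.0}
print(top_down_filter(["C1", "C2", "S_far"], "PM", lambda a, b: toy[(a, b)]))  # ['C1', 'C2']
```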
Scalable Performance Monitoring: Continuous Performance Monitoring
• The PM continuously updates its Profile (sketch below).
• Hierarchy of PMs for a large cluster
  • A cluster PM may have a less accurate and less up-to-date Profile than a subcluster PM.
• Cluster evolution over time: threshold on the variance of Latency Profiles.
• Handle Protocol / client support
  • Cluster identification
  • Neighbor identification
  • Exchange of various data, including profiles
• The PM is also used for probing
  • Probing is conditioned on the frequency / impact of server updates
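A minimal sketch of a per-cluster monitor that folds passive observations into a running profile and falls back to active probing only when an observation drifts far from the profile; the EWMA form and thresholds are assumptions, not the slides' model:

```python
class PerformanceMonitor:
    """Hypothetical per-cluster monitor: passive updates plus conditional probing."""

    def __init__(self, alpha=0.1, drift_threshold=0.5):
        self.alpha = alpha                    # EWMA smoothing factor
        self.drift_threshold = drift_threshold
        self.mean = None                      # current latency estimate (ms)

    def observe(self, latency_ms):
        """Passive update from an observed client request."""
        if self.mean is None:
            self.mean = latency_ms
        else:
            self.mean = (1 - self.alpha) * self.mean + self.alpha * latency_ms

    def needs_probe(self, latency_ms):
        """Trigger an active probe only when an observation deviates strongly
        from the current profile, keeping probe traffic low."""
        return (self.mean is not None
                and abs(latency_ms - self.mean) > self.drift_threshold * self.mean)

pm = PerformanceMonitor()
for sample in (100, 104, 98, 310):
    if pm.needs_probe(sample):
        print(f"probe: {sample} ms deviates from profile ~{pm.mean:.0f} ms")
    pm.observe(sample)
```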
Current Status / Related Work
• Small-scale validation with 6 clients and 15 DOI Handle servers (publishers) in December 2001.
• Large-scale data gathering – Summer 2002.
• Integration with a (stochastic) model of staleness [data obsolescence (Gal – Technion, Israel)] and study of the access latency / data staleness trade-off.
• Implementation on the Squid cache: a simple utility-based decision model, to be extended with Latency Profiles (sketch below).
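A hypothetical sketch of such a utility-based cache decision: serve the cached copy when the expected staleness penalty is lower than the expected refresh latency predicted by the Latency Profile. The exponential obsolescence model and the weights are illustrative assumptions, not the actual Squid integration.

```python
import math

def serve_from_cache(expected_refresh_ms, age_s, obsolescence_rate_per_s,
                     latency_weight=1.0, staleness_weight=500.0):
    """True when the cached copy is the better deal under a simple utility model."""
    p_stale = 1.0 - math.exp(-obsolescence_rate_per_s * age_s)   # stochastic staleness model
    staleness_cost = staleness_weight * p_stale
    refresh_cost = latency_weight * expected_refresh_ms          # from the Latency Profile
    return staleness_cost < refresh_cost

# Fresh-ish object behind a slow server: keep the cached copy.
print(serve_from_cache(expected_refresh_ms=800, age_s=60, obsolescence_rate_per_s=1e-4))  # True
```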