
EtE: Passive End-to-End Internet Service Performance Monitoring


Presentation Transcript


  1. EtE: Passive End-to-End Internet Service Performance Monitoring Yun Fu, Lucy Cherkasova, Wenting Tang, and Amin Vahdat HPLabs and Duke University

  2. HP.com: Service Provider Problems • A lot of research has been done to optimize web server performance in order to improve the client experience, BUT • Do we know what the client experience is? • What are the critical latency components in the end-to-end response time? • Do we know whether improvements on the web server side indeed improve the end-user experience? • Do we know who the clients are and where they are located on the Internet?

  3. End-to-End Web Service Measurement: Why Is It Important? • Two main factors impact the response time perceived by the clients: network latency and server-side processing time • Many web sites use a complex multi-tiered architecture • A set of new technologies, such as servlets and JavaServer Pages, extends the web servers to generate information-rich dynamic web pages and to leverage existing business systems • The combination of these technologies can lead to increased server-side processing time, especially in a distributed environment • A new ad-hoc business metric: a web service is considered “unavailable” if its response time exceeds 6 seconds • Service providers need a quantitative analysis of the major latency components contributing to the response time to achieve given business and QoS objectives: • Invest in a more powerful site infrastructure, or • Choose a CDN service?

  4. Why Is It Difficult? • Web pages are complex objects with multiple embedded images • The HTTP protocol is stateless: different images are requested by the client browser independently: • Some of them are issued concurrently • Some of them use persistent connections • Some of them are obtained from proxies • Some of them are obtained from user browser caches • The response time of a web page observed by the client is the result of the download of all page-related images

  5. What Are the Currently Available Solutions? • Active periodic probing of a particular web page from a fixed number of clients across the Internet • Keynote service • Keynote “clients” are not the real web site clients • Allows monitoring of a particular web page • Always pulls the entire page (with all embedded images) from the server • Page instrumentation techniques based on a JavaScript or Java applet downloaded to the client web browser • HP OpenView “Web Transaction Observer” • The measurement starts after the download of the main html page (a significant portion of the response time is missing) • Does not provide a latency breakdown unless the web server is also instrumented • eBusiness Assurance (eBA, from Candle Corp) • Quality of Service (QoS) Monitor (IBM, Tivoli) • A research paper by Rajamony and Elnozahy from IBM (Austin) uses JavaScript to instrument the links to particular pages. Somewhat more limited: cannot measure directly accessed pages, e.g., “index.html”…

  6. What Do We Propose? • EtE monitor • A passive monitoring tool for end-to-end response time measurement • Non-intrusive: does not require any changes or modifications to the site content, the server-side infrastructure, or client browsers • Can be used for sites with static or dynamically generated content • What does it provide? • End-to-end response measurement for all the pages and all the clients accessing the site • Analysis of the response components: • The server processing time portion • The network transfer time portion • Reports the % of data delivered from the server vs the % of data cached on the client side • Reports the % of aborted page accesses and the related performance reasons • Analysis of the most frequently accessed documents and their response times • Client clustering by ASes (Autonomous Systems) • Request (byte) clustering by ASes and the corresponding response times • And more …

  7. EtE Monitor Architecture • The Network Packet Collector module collects network packets using tcpdump and records them in the Network Trace, enabling offline analysis • In the Request-Response Reconstruction module, EtE monitor reconstructs all TCP connections from the Network Trace and extracts HTTP transactions (a request with its corresponding response) from the payload. EtE monitor stores the HTTP header lines and other related information in the Transaction Log • The Web Page Reconstruction module is responsible for grouping the request-response pairs into logical web page accesses and stores them in the Web Page Session Log • The Performance Analysis and Statistics module summarizes a variety of performance characteristics integrated across all client accesses

  8. Request-Response Reconstruction Module • The TCP connections are rebuilt from the Network Trace using: • The client IP address • The client port number • The request (response) TCP sequence numbers • Within the payload of the rebuilt TCP connections, HTTP transactions are delimited as defined by the HTTP protocol • After reconstructing the HTTP transactions, the monitor records the HTTP header lines and other information of interest in the Transaction Log and discards the transaction body
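The connection rebuild step described above can be sketched as follows; this is a minimal illustration, not the monitor's implementation, and the packet field names (`client_ip`, `client_port`, `seq`) are assumptions for the example:

```python
from collections import defaultdict

def rebuild_connections(packets):
    """Group captured packets into TCP connections keyed by
    (client IP, client port), then order each connection's packets
    by TCP sequence number so HTTP transactions can later be
    delimited within the reassembled payload."""
    conns = defaultdict(list)
    for pkt in packets:
        conns[(pkt["client_ip"], pkt["client_port"])].append(pkt)
    for pkts in conns.values():
        pkts.sort(key=lambda p: p["seq"])  # reassemble in sequence order
    return dict(conns)
```

A real trace would also need retransmission and wrap-around handling, which this sketch omits.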

  9. Request-Response Reconstruction Module (continued) • Each entry in the Transaction Log includes: • The client IP address • A unique flow ID for the TCP connection • The requested URL • The content type • The payload size • The referer field • The via field • Whether the request was aborted • The number of packets resent in the response • The corresponding timestamps
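The fields listed on this slide map naturally onto a record type; the sketch below mirrors that list, but the concrete names and types are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransactionLogEntry:
    """One Transaction Log record, following the field list on slide 9."""
    client_ip: str
    flow_id: int             # unique flow ID for the TCP connection
    url: str                 # requested URL
    content_type: str
    payload_size: int        # response payload size in bytes
    referer: Optional[str]   # referer header, if present
    via: Optional[str]       # via header, if present
    aborted: bool            # whether the request was aborted
    resent_packets: int      # packets resent in the response
    request_ts: float        # corresponding timestamps (seconds)
    response_end_ts: float
```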

  10. Page Reconstruction Module • To measure the client-perceived end-to-end response time for retrieving a web page, we need to group the objects belonging to one web page access • We use a two-pass heuristic method and a statistical filtering mechanism to reconstruct the different client page accesses • First pass: EtE monitor uses the HTTP requests with a referer field to build a Knowledge Base of web pages and their embedded objects • Second pass: • EtE monitor reconstructs the page accesses without a referer field using the Knowledge Base of web pages and some additional heuristics • EtE monitor uses statistical analysis to identify valid access patterns and to filter out the accesses grouped incorrectly

  11. Example • An example of an initial html file request and the following embedded object request with the corresponding referer field:

  12. First Pass: Client Access Table • EtE monitor stores web page access information in a hash table keyed by client IP address: • If the content type is text/html, a new web page entry is created in the Web Page Table • For other types, the requested URL is inserted according to its referer field
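A minimal sketch of this first-pass grouping, assuming transactions carry `client_ip`, `content_type`, `url`, and `referer` fields (illustrative names, not from the paper):

```python
from collections import defaultdict

def first_pass(transactions):
    """Per client, a text/html response opens a new page entry;
    any other content type is appended to the most recent open
    page whose URL matches the object's referer field."""
    table = defaultdict(list)  # client IP -> list of page accesses
    for t in transactions:
        pages = table[t["client_ip"]]
        if t["content_type"] == "text/html":
            pages.append({"page": t["url"], "objects": []})
        else:
            for page in reversed(pages):  # newest matching page first
                if page["page"] == t.get("referer"):
                    page["objects"].append(t["url"])
                    break
    return dict(table)
```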

  13. Building a Knowledge Base of Web Pages • From the Client Access Table, EtE monitor determines the content template of any given web page as a combined set of all objects that appear in all access patterns for this page
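The "combined set" rule above is a union over observed accesses; a small sketch, reusing the access-table layout assumed in the earlier examples:

```python
def build_knowledge_base(client_access_table):
    """Content template of a page = union of all embedded objects
    that appear in any observed access to that page."""
    kb = {}
    for accesses in client_access_table.values():
        for access in accesses:
            kb.setdefault(access["page"], set()).update(access["objects"])
    return kb
```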

  14. Second Pass: Reconstruction of Web Page Accesses • With the help of the Knowledge Base, EtE monitor processes the entire Transaction Log again and creates a new Client Access Table • This time it processes the objects without a referer field: • EtE monitor consults the Knowledge Base while checking all the page entries in the Web Page Table to find the page an object might be embedded in, and appends the object at the end of that page • If none of the web page entries in the Web Page Table contains the object according to the Knowledge Base, then • EtE monitor searches for the page accessed with the same flow ID • Otherwise it appends the object to the most recently accessed page (additionally, it uses a configurable think time threshold to delimit web pages) • If the think time threshold is exceeded, the object is dropped
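The second-pass placement rules can be sketched as a single decision function; this is an assumed simplification (the open-page record layout and the 4-second default are hypothetical), not the monitor's code:

```python
def place_object(obj_url, ts, flow_id, open_pages, kb, think_time=4.0):
    """Attach a referer-less object to the most plausible open page.
    open_pages is a list of dicts, newest last, each with keys
    'page', 'flow_id', 'last_ts', 'objects'. Returns True if placed;
    objects past the think-time threshold are dropped."""
    # 1) Prefer pages whose Knowledge Base template contains the object.
    candidates = [p for p in open_pages if obj_url in kb.get(p["page"], ())]
    # 2) Fall back to a page accessed on the same TCP flow.
    if not candidates:
        candidates = [p for p in open_pages if p["flow_id"] == flow_id]
    # 3) Otherwise consider the most recently accessed page.
    if not candidates and open_pages:
        candidates = [open_pages[-1]]
    for page in reversed(candidates):      # newest candidate first
        if ts - page["last_ts"] <= think_time:
            page["objects"].append(obj_url)
            page["last_ts"] = ts
            return True
    return False                           # think time exceeded: drop
```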

  15. Identifying Valid Accesses Using Statistical Analysis of Access Patterns • Although the above two-pass process is very efficient, some accesses could still be grouped incorrectly • We use statistical analysis to better approximate the actual content of web pages and to filter out the incorrectly constructed accesses

  16. Metrics to Measure Web Service Performance • Response time metrics • End-to-end response time observed by the client for a web page download • Latency breakdown: server-related and network-related portions • Connection set-up time • Metrics evaluating web service caching efficiency • Server file hit ratio • Server byte hit ratio • Aborted pages and QoS • Why are accesses aborted: • Bad performance? • Client browsing patterns?

  17. Example: 1-object page retrieval (basic timestamps)

  18. Latency Breakdown for Multiple Concurrent Connections: Server Processing vs Network
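The timestamp-based breakdown that these two slides illustrate (figures not reproduced here) can be approximated for a single connection as below; this is a rough sketch under assumed timestamp names, and the paper's actual formulas for multiple concurrent connections are more involved:

```python
def latency_breakdown(syn_ts, request_ts, response_start_ts, response_end_ts):
    """Single-connection approximation of the latency components
    from packet timestamps (all inputs in seconds)."""
    setup = request_ts - syn_ts                    # connection set-up time
    server = response_start_ts - request_ts        # server processing portion
    network = response_end_ts - response_start_ts  # network transfer portion
    total = response_end_ts - syn_ts               # end-to-end response time
    return {"setup": setup, "server": server, "network": network, "total": total}
```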

  19. Metrics Evaluating Web Service Caching Efficiency • Original web page url1 (page template): • 10 objects, • 100 Kbytes • Access to url1: Acc1 • 5 objects, • 70 Kbytes • FileHitRatio(Acc1) = 5/10 = 50% • ByteHitRatio(Acc1) = 70/100 = 70% • Access to url1: Acc2 • 7 objects, • 80 Kbytes • FileHitRatio(Acc2) = 7/10 = 70% • ByteHitRatio(Acc2) = 80/100 = 80% • ServerFileHitRatio(url1) = (5/10 + 7/10) / 2 = 60% • ServerByteHitRatio(url1) = (70/100 + 80/100) / 2 = 75% • The smaller, the better!
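The worked numbers above follow directly from averaging the per-access ratios; a small sketch reproducing the slide's example (function name and argument layout are illustrative):

```python
def server_hit_ratios(template_objects, template_bytes, accesses):
    """Average the per-access file and byte hit ratios over all
    accesses to a page. Each access is (objects served, bytes served).
    Lower values mean more of the page was cached closer to clients."""
    file_ratios = [objs / template_objects for objs, _ in accesses]
    byte_ratios = [b / template_bytes for _, b in accesses]
    return (sum(file_ratios) / len(file_ratios),
            sum(byte_ratios) / len(byte_ratios))
```

With the slide's numbers (a 10-object, 100-Kbyte template and accesses of 5 objects / 70 Kbytes and 7 objects / 80 Kbytes), this yields 60% and 75%.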

  20. Case Studies • HPL external site (HPL) • From July 12, 2001 to August 11, 2001 • The site has mostly static content • OpenView Support site (Support) • From October 11, 2001 to October 25, 2001 • The site uses JavaServer Pages technology for dynamic page generation

  21. Sites Statistics At-A-Glance

  22. HPLabs Site Case Study • HPL site during a month (accesses to the index.html page) • The figure shows the EtE time to index.html on an hourly scale during a month • In spite of overall good performance, hourly averages reflect significant variation in the response time observed by the clients • Periods of increased latency correspond to weekends! • What is the problem?

  23. Understanding the Client Population • Resent packets typically reflect network congestion or network-related bottlenecks • Periods of increased resent packets correspond to weekends • The explanation: the client population significantly “changes” during weekends • Most of the clients access the web site from home via low-bandwidth connections • It is extremely important to understand the client population! • The active probing approach using artificial clients (with a typically “good” connection to the Internet) lacks this information

  24. Performance Analysis of Accesses to itanium.html • First figure: • Number of accesses to the itanium.html page • From being the most popular page at the beginning of the study, it falls to 7th place after 10 days • Second figure: • Percentage of accesses above 6 sec to the itanium.html page • Question: why is the latency observed by the clients getting higher?

  25. Caching Efficiency of the Page • When the page becomes less popular, “colder”, the number of objects and bytes retrieved from the original server increases significantly: i.e., fewer network caches store the page-related objects • This translates into increased response time observed by the client • Active probing techniques cannot reflect the caching efficiency of the site • The tools based on instrumentation techniques cannot provide insight into this problem either

  26. Client Clustering by ASes • Clients grouped by ASes show a heavy-tailed distribution • These figures allow us to see large client clusters and their corresponding end-to-end response times • The ability of EtE monitor to measure performance metrics for a certain group of clients is particularly attractive for service providers validating required SLAs
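Per-AS performance reporting of the kind this slide describes can be sketched as a simple grouping step; the IP-to-AS lookup is represented here by a plain dict (a real monitor would derive it from BGP routing data), and all field names are illustrative:

```python
from collections import defaultdict

def cluster_by_as(page_accesses, ip_to_as):
    """Group page accesses by the client's Autonomous System and
    report (request count, mean end-to-end time) per AS."""
    stats = defaultdict(lambda: [0, 0.0])  # AS -> [count, total_time]
    for access in page_accesses:
        asn = ip_to_as.get(access["client_ip"], "unknown")
        stats[asn][0] += 1
        stats[asn][1] += access["ete_time"]
    return {asn: (n, total / n) for asn, (n, total) in stats.items()}
```

Sorting the result by request count would surface the large client clusters the slide mentions.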

  27. Validation Experiments • We performed two groups of experiments: • To validate the accuracy of EtE measurements • To evaluate the page access reconstruction power of EtE • How dependent are the reconstruction results on the existence of referer field information? • The results are encouraging: • EtE provides a very close approximation of the response time • EtE monitor does a good job of page reconstruction even when the requests do not have any referer field! • Moreover, the two-pass heuristic method and statistical filtering mechanism we use to reconstruct page accesses increase the number of reconstructed pages by about 20-30%

  28. Limitations • EtE monitor is not appropriate for sites that encrypt much of their data (e.g., via SSL) • EtE monitor is not appropriate for sites that “outsource” most of their content to CDNs • A similar limitation applies to pages with “mixed” content: if a portion of the page is served from other remote sites, EtE will measure the response time only for the local site content • For clients coming from behind a proxy, EtE monitor measures the response time as observed from the proxy • Since the tool is based on heuristics and statistics to reconstruct the page content, the best results are obtained when the sample size is large enough • Dynamically generated content creates additional challenges for EtE monitor (typical for other analysis tools too): a configuration file provided by a site administrator is needed

  29. Conclusion and Future Work • Understanding the performance characteristics of Internet services is critical to evolving and engineering web services to match: • Changing demand levels • Client populations • Global network characteristics • EtE monitor, based on a novel technique, offers a number of benefits unavailable from other tools or by other means • EtE monitor can be extended to work in “almost real time” to provide timely information about web services and their performance • Extended analysis of client clustering will provide an opportunity to use the information from EtE monitor for intelligent decision-making on service placement and service optimization

  30. Acknowledgements • The tool and the study would not be possible without the generous help of our HP colleagues: • HPLabs team: • Mike Rodriquez, Annabelle Eseo, and Peter Haddad • HPO, Managed Web Services: • Guy Mathews • OpenView team: • Steve Yonkaitis, Bob Husted, Norm Follett, and Don Reab • US support team: • Claude Villermain, Vincent Rabiller, Pierre-Emmanuel Delforge • Their help is highly appreciated!
