Understanding and Optimizing the Performance of Internet-based Systems

Ben Zorn, Performance Monitoring and Analysis Group, Programmer Productivity Research Center (PPRC), Microsoft Research

Who “We” Are • Performance Monitoring and Analysis Group • Part of PPRC (directed by Amitabh Srivastava) • Developers, testers, and researchers • Recently formed with emphasis on .NET systems • Approach • Provide solutions to MS product teams through ideas, technologies, tools, and prototypes • Actively participate in the external research community through papers, grants, and leadership in the professional community

My Dad’s View of an Internet System “There’s a little person inside.” - George Zorn “Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke

Outline • Introduction and motivation • “Powers-of-ten” drill down • A framework to attack the problem • Specific examples from a case study: Optimizing the memory hierarchy

Why Performance is Interesting and Hard • Context • Internet systems are probably the most complicated artifact ever created by man • They are currently immature • Improvements possible in 3 areas • Functionality, correctness, and performance • My focus is on performance • Efficiency is a central theme of the Internet revolution • Easier / cheaper to get to information, give information, and make informed decisions

Distributed System “Powers of Ten” • Inspired by the film “Powers of Ten” by Charles and Ray Eames • They looked at 38 orders of magnitude (local galactic group down to proton in a nucleus) • We’ll drill down into computer abstractions • Consider different logical abstraction layers

Back to My Dad… What really goes on in there?

Network “Cloud” View • Fewer than 7 items to remember • We seem to “get it” • Distinct roles • Differentiated components [Diagram labels: MSN server; Internet; ISP; modem link; Dad client]

Expanded Cloud View [Diagram labels: DNS resolution; MSN server cluster; router; streaming media; ISP; Dad client]

Inside the Web Site [Diagram labels: database servers (back ends); Web servers (front ends); interconnection topology; IP “director”]

Inside a Web Server [Diagram labels: Web server program; Extension API; get static page; server extension; generate HTML; filter, parse request; DB API; network protocol stack; device driver; device driver; DB server; operating system]

Inside the Server Extension [Diagram: flowchart with the steps enter; call checkData; data valid? (branches T and F); call SQL API; return. Alongside it, the body of proc checkData: … load rx, addr 36; use rx; … load rx, addr 110; use rx; … return]

Inside the Memory Hierarchy [Diagram labels: load; CPU; L1 cache; L2 cache; Main Memory; Virtual Memory (Disk)]

Inside the CPU [Diagram courtesy of Artur Klauser]

A Sea of Gates [Photo of a Pentium die; image from the Computer Info Center, http://bwrc.eecs.berkeley.edu/CIC/] • So what does Dad think about all this? • What can he or anybody do about performance?

Outline • Introduction and motivation • A framework to attack the problem • Resource management and optimization • Data collection, analysis, and action • Specific examples from a case study: Optimizing the memory hierarchy

Information is Essential • Optimization is really resource allocation • Allocation requires good decision making • Time / space trade-off • Where should data be stored, cached? • Challenges • What information do we need? • How do we get it? • How do we manage it? • What does it mean?

What Information Can We Get? • Tag “interesting” events (like FedEx tracks packages) • Associate time/resources with events • Accumulate and analyze data [Diagram labels: time,id; Event repository]
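The tag, associate, and accumulate steps above can be sketched in a few lines of Python. This is only an illustration: the `EventRepository` class, its `tag` method, and the event names are hypothetical and do not come from the talk or from any Microsoft tool.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EventRepository:
    """Hypothetical in-memory store of (event id, start time, duration) records."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the sketch can be tested deterministically
        self.records = []

    @contextmanager
    def tag(self, event_id):
        """Tag a region of code as an interesting event and record its duration."""
        start = self._clock()
        try:
            yield
        finally:
            self.records.append((event_id, start, self._clock() - start))

    def total_time_by_event(self):
        """Accumulate the recorded durations per event id."""
        totals = defaultdict(float)
        for event_id, _start, duration in self.records:
            totals[event_id] += duration
        return dict(totals)
```

A caller would wrap each interesting operation, for example `with repo.tag("db_query"): ...`, and later ask `repo.total_time_by_event()` where the time went.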

Information Management is Essential • It is easy to gather too much data • Our capacity to generate data follows Moore’s Law • Data without context is less valuable • How do we relate the data gathered to the problems experienced? • Systems change (new builds daily) • Our abstractions are immature, current approaches are ad hoc • Data mining is a large potential opportunity

Example: Netmon Monitoring Tool • Netmon provides info about network packets • Netmon has a rich, extensible architecture (parsing, reporting) • Netmon provides data, but management, analysis, etc. have to be layered on top of it • Example output:
    00000060  00 01 00 5F 00 00 5C 00 5C 00 52 00 45 00 44 00  ..._..\.\.R.E.D.
    00000070  2D 00 44 00 43 00 2D 00 32 00 37 00 2E 00 52 00  -.D.C.-.2.7...R.
    00000080  45 00 44 00 4D 00 4F 00 4E 00 44 00 2E 00 43 00  E.D.M.O.N.D...C.
    00000090  4F 00 52 00 50 00 2E 00 4D 00 49 00 43 00 52 00  O.R.P...M.I.C.R.
    000000A0  4F 00 53 00 4F 00 46 00 54 00 2E 00 43 00 4F 00  O.S.O.F.T...C.O.
    000000B0  4D 00 5C 00 49 00 50 00 43 00 24 00 00 00 3F 3F  M.\.I.P.C.$...??
    000000C0  3F 3F 3F 00                                      ???.

Conceptual Monitoring Framework [Diagram labels: Tools; Leak Detector; Site Monitor; Intrusion Alerting; Weekly Report; Analysis; Cluster Detection; Management; Store; Store; Filter; Event Bus; Actuators; Reboot System; Sensors; HW Perf Counters; Path Trace; Network Trace]

Outline • Introduction and motivation • A framework to attack the problem • Specific examples from a case study: Optimizing the memory hierarchy • Hardware performance counters • Vulcan: binary transformation infrastructure • Daedalus: data locality optimization

Parts of the Big Picture • Many different groups are working from similar frameworks • Commercial efforts (e.g., Windows WMI) • Many research efforts (e.g., Internet-scale caching) • I will focus on the lowest levels (CPU arch.) • Hardware can generate 100 million events/sec. • Data collection, reduction are significant problems • Concretely illustrates different parts of approach: • Data gathering, data reduction, abstraction

Optimizing the Memory Hierarchy (UOT = Unit of Transfer) • CPU load: UOT = 1 word • L1 cache: 64 Kbytes, UOT = 32 bytes, 1-4 cycles • L2 cache: 1 Mbyte, UOT = 32 bytes, 10-20 cycles • Main Memory: 100 Mbytes, UOT = 32 bytes, 100 cycles • Virtual Memory (Disk): 50,000 Mbytes, UOT = 8192 bytes, 1,000,000 cycles
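To see why the slow levels matter even when they are rarely reached, here is a rough Python sketch of the expected cost of one load. It takes the upper-bound cycle counts from the slide (4, 20, 100, and 1,000,000) and treats each as the total cost of a load served at that level; the hit rates are invented for illustration and are not measurements from the talk.

```python
def expected_load_cycles(levels):
    """Expected cycles per load.

    `levels` lists (latency_cycles, hit_rate) from the nearest level to the
    farthest. `hit_rate` is the chance that a load reaching this level is
    served here; the last level should use 1.0.
    """
    expected = 0.0
    reach = 1.0  # probability that a load gets as far as this level
    for latency_cycles, hit_rate in levels:
        expected += reach * hit_rate * latency_cycles
        reach *= 1.0 - hit_rate
    return expected


# Latencies follow the slide; hit rates are assumptions chosen for the example.
hierarchy = [
    (4, 0.95),         # L1 cache
    (20, 0.80),        # L2 cache
    (100, 0.999),      # main memory
    (1_000_000, 1.0),  # virtual memory (disk)
]
average = expected_load_cycles(hierarchy)  # about 15.6 cycles
```

Under these assumed rates only one load in 100,000 reaches the disk, yet that level contributes about 10 of the roughly 15.6 cycles, which is the kind of imbalance the rest of the case study goes after.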

Finding a Memory Problem • Problem • Some loads take a long time, but which ones? • Solution • Hardware vendors provide performance counters • Counters can be read, and can also interrupt the processor • Causing interrupts at costly operations allows them to be recorded • Example code:
    proc Foo
      …
      load rx, addr 36
      use rx
      …
      load rx, addr 110
      use rx
      …
      return
[Diagram annotations: “This load takes too long”; “This use of rx is what stalls”]
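The idea of interrupting at costly operations can be illustrated with a small simulation. The Python sketch below is not a real performance-counter API: the trace format, the `sample_every` parameter, and the function name are all assumptions. It counts cache misses and, every time the count reaches the sampling period, records which load site caused the miss, the way an overflow interrupt handler might.

```python
from collections import Counter


def sample_costly_loads(trace, sample_every=2):
    """Simulate a miss counter that raises an interrupt every `sample_every` misses.

    `trace` is an iterable of (load_site, missed_cache) pairs. The returned
    Counter maps each load site to the number of interrupts attributed to it.
    """
    samples = Counter()
    misses_since_interrupt = 0
    for load_site, missed_cache in trace:
        if not missed_cache:
            continue  # hits do not advance the miss counter
        misses_since_interrupt += 1
        if misses_since_interrupt == sample_every:
            samples[load_site] += 1  # the "interrupt handler" records the culprit
            misses_since_interrupt = 0
    return samples


# Hypothetical trace: the load at addr 36 always hits, the one at addr 110 always misses.
trace = [("addr 36", False), ("addr 110", True)] * 4
hot_sites = sample_costly_loads(trace)
```

Sampling every Nth miss rather than every miss keeps the overhead low while still pointing at the load sites that miss most often.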

Exposing Performance Information [Diagram: the memory hierarchy (load; CPU; L1 cache; L2 cache; Main Memory; Virtual Memory (Disk)) annotated with performance counters (L1 hits, L2 misses); labels: addr 110; addr 60; addr 36; addr 116; C1 15,304; C2 257; 1,346]

Extracting More Information • New Problem • Why was I calling procedure Foo? • What fraction of the total time did I spend in Foo? • Solution: Binary transformation • Program API to transform binary code • Calls to arbitrary routines can be “spliced” in • PPRC Vulcan infrastructure [Srivastava et al. ’00] • X86, IA64 binaries • Instrumentation can be added on-the-fly

Example Transformation • As code is executing, transform this:
    proc Foo
      …
      load rx, addr 36
      use rx
      …
      load rx, addr 110
      use rx
      …
      return
• To this:
    proc Foo
      call probe_enter_Foo()
      …
      load rx, addr 36
      use rx
      …
      load rx, addr 110
      use rx
      …
      call probe_exit_Foo()
      return

How Vulcan Works • Hierarchical interface to structure of binary: foreach procedure … foreach basic block … foreach instruct… • Calls to arbitrary functions can be inserted anywhere (for example call probe_enter_Foo, call probe_exit_Foo) [Diagram: a Program containing proc Foo and proc Bar, each divided into Block 1 and Block 2, with instructions such as load, add, use, mult, shift, move]
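The procedure, basic block, instruction traversal can be mimicked on a toy program representation. The Python sketch below is not Vulcan's API, which this transcript does not show; `Procedure`, `BasicBlock`, and `insert_entry_exit_probes` are hypothetical names, and instructions are plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    instructions: list = field(default_factory=list)


@dataclass
class Procedure:
    name: str
    blocks: list = field(default_factory=list)


def insert_entry_exit_probes(program):
    """Splice a probe call at each procedure entry and before every return."""
    for proc in program:  # foreach procedure
        if not proc.blocks:
            continue
        proc.blocks[0].instructions.insert(0, f"call probe_enter_{proc.name}")
        for block in proc.blocks:  # foreach basic block
            rewritten = []
            for instr in block.instructions:  # foreach instruction
                if instr == "return":
                    rewritten.append(f"call probe_exit_{proc.name}")
                rewritten.append(instr)
            block.instructions = rewritten
    return program


foo = Procedure("Foo", [
    BasicBlock(["load rx, addr 36", "use rx"]),
    BasicBlock(["load rx, addr 110", "use rx", "return"]),
])
insert_entry_exit_probes([foo])
```

After the call, the first block of `Foo` starts with `call probe_enter_Foo` and the last block ends with `call probe_exit_Foo` followed by `return`, matching the shape of the example transformation.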

Vulcan Tricks • Optimization Example: Instruction Scheduling • If “load rx” takes 100 cycles, find useful work to do between load and use • Other Vulcan uses: • Code obfuscation • Binary matching • Software watermarking • Software testing tools • Coverage • Fault injection • Example code:
    proc Foo
      …
      load rx, addr 36
      (useful work not dependent on rx inserted here, filling the 100 cycle delay)
      use rx
      …
      load rx, addr 110
      use rx
      …
      return

New Abstractions are Central to Success • Problem • How to reorganize data for better locality? • Context • Code reorganization is well understood because code structures are static • But… OO data structures are dynamic • Solution • New abstraction: sequences of “hot” objects • Daedalus Project [Chilimbi PLDI ’01]

Revisiting the Memory Hierarchy • Goal: place “hot” objects closer to CPU • Constraint: assume UOT = 2 objects • Load sequence: A F B C A F E E A F B C… [Diagram: objects A through H and other unlabeled objects scattered across the CPU, L1 cache, L2 cache, Main Memory, and Virtual Memory (Disk)]
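With a unit of transfer of two objects, a natural first question is which pairs of objects are loaded back to back. The Python sketch below counts adjacent pairs in the visible part of the load sequence. It is a simple frequency count for illustration, not the hot data stream analysis that Daedalus performs, and the function name is made up.

```python
from collections import Counter


def adjacent_pair_counts(load_sequence):
    """Count how often each ordered pair of objects is loaded back to back."""
    return Counter(zip(load_sequence, load_sequence[1:]))


# The prefix of the load sequence shown on the slide (the slide elides the rest).
loads = "A F B C A F E E A F B C".split()
pairs = adjacent_pair_counts(loads)
best_pair, best_count = pairs.most_common(1)[0]  # (("A", "F"), 3)
```

Here A followed by F occurs three times, more than any other pair, so placing A and F in the same transfer unit would be a reasonable first guess under the two-object constraint.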

Potential for Performance Improvement

Daedalus Project Goal: Analyze and exploit data locality • Analyze locality • Represent very large streams of references (SEQUITUR algorithm [Nevill-Manning, Witten ’97]) • Define new abstractions (hot data streams) • Exploit locality • Build customized heap allocators (malloc/new) • Insert prefetching instructions (PIII, etc. support) • Data restructuring tools

SEQUITUR (Example) • Input: aaabacaaabacaaabacaaabacaaabadaaabadaaabadaaabadaaabadaa • SEQUITUR grammar:
    S -> BBDDCaa
    A -> aaabac
    B -> AA
    C -> aaabad
    D -> CC
[Diagram: DAG representation of the grammar, with nodes S, B, D, A, C and terminals a, b, c, d]
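One way to check the grammar is to expand it back into a string. The Python sketch below does only that: it is a plain recursive expander, not an implementation of the SEQUITUR algorithm, and the `expand` helper is a name invented for this example. Expanding `S` should reproduce the input string shown on the slide.

```python
def expand(grammar, symbol="S"):
    """Recursively expand a grammar symbol into the string of terminals it derives."""
    if symbol not in grammar:
        return symbol  # terminals (a, b, c, d) stand for themselves
    return "".join(expand(grammar, s) for s in grammar[symbol])


# The rules from the slide, with each right-hand side written as a string of symbols.
grammar = {
    "S": "BBDDCaa",
    "A": "aaabac",
    "B": "AA",
    "C": "aaabad",
    "D": "CC",
}

text = expand(grammar)
grammar_size = sum(len(rhs) for rhs in grammar.values())
```

The five rules hold 23 symbols in total while the expanded string has 56 characters, which gives a small-scale feel for how the grammar compresses repetitive reference streams.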

Locality Analysis [Diagram labels: Program; Execute; data reference trace; SEQUITUR; Whole Program Streams; Pruning; Hot Program Streams; Hot data stream analyses; Hot Data Streams; grammar DAGs over S, A, B, C and terminals a, b, c, d]

Daedalus Highlights • Data reference representations • 100 to 10,000 times smaller than data reference trace • Data restructuring recommendations • Improved execution time of several programs by 8-15% with small header file modifications • Custom heap allocators • Automatically reduced working set size by up to 40% and TLB misses by up to 90% • In progress • Automatic prefetching, smart copying garbage collection, scalability optimizations, dynamic on-line optimizations

What’s This Got to Do with the Internet? • Approach remains the same • Record interesting behavior (e.g. network packets) • Reduce large data volumes • Compression, summarization, presenting differences, etc. • Find interesting patterns that correspond to performance (security, correctness) issues • Display information using visualizations / abstractions that match the problem domain • An easier problem? • It will take time to know for sure…

Summary • What’s my Dad to think? • Internet Systems are usable today…but extremely complex • Ability to understand existing systems is immature • Technology still rapidly changing, following Moore’s Law curve • But… • Microsoft’s .NET initiative sets the stage for our opportunity and challenges • Our approach is pragmatic, effective

More Information • Related Resources • MSR, PPRC, PMA: http://research.microsoft.com/pprc/pma.asp • Vulcan • Srivastava et al., “Binary Transformation in a Distributed Environment”, MSR Technical Report • Srivastava, “Emerging Opportunities for Binary Tools”, Keynote Talk, WBT 2000, October 2000. • Daedalus Project • http://research.microsoft.com/users/trishulc/Daedalus.htm • Chilimbi, “Efficient Representations and Abstractions for Quantifying and Exploiting Data Reference Locality”, PLDI 2001, June 2001. • Contact me: zorn@microsoft.com

Backup Slides

The Process of Optimizing Performance • Suppose performance is poor… • Where’s the bottleneck? • Who is at fault? • How to find out? • What tool to use? • How to use? • How to understand? • Will my effort be worth it? • I’m happy now… but what about next time?

A Framework for Monitoring Systems • Goals • Collect data at all system levels • Approximate continuous monitoring closely • Component Classes • Sensors (gathering data) • Management (communicate, summarize, store) • Analysis (recognizing patterns and relationships) • Tools (human feedback) • Actuators (take action directly)
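The component classes above can be wired together with a very small publish/subscribe bus. The Python sketch below is one possible arrangement, not the design from the talk: the `EventBus` class, the topic names, the 500 ms threshold, and the restart action are all assumptions made for illustration.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical minimal publish/subscribe bus connecting the components."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subscribers[topic]:
            handler(payload)


def run_demo(latencies_ms, threshold_ms=500):
    """Feed sensor readings through management, analysis, and an actuator."""
    bus = EventBus()
    store = []    # management: everything that was recorded
    actions = []  # actuator: log of the actions that were taken

    def management(sample):
        store.append(sample)
        bus.publish("stored", sample)

    def analysis(sample):
        if sample > threshold_ms:  # recognize a simple pattern: a slow response
            bus.publish("alert", sample)

    def actuator(sample):
        actions.append(f"restart requested after {sample} ms response")

    bus.subscribe("sample", management)
    bus.subscribe("stored", analysis)
    bus.subscribe("alert", actuator)

    for latency in latencies_ms:  # sensor: one reading per request
        bus.publish("sample", latency)
    return store, actions
```

Calling `run_demo([120, 900, 80])` stores all three readings and triggers a single action for the 900 ms response. A reporting tool for human feedback would be one more subscriber on the same bus.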

Reference Skew (Code Vs. Data)
