
Automatic Measurement of Instruction Cache Capacity in X-Ray

A study of self-optimizing software, which achieves portable performance by experimentally determining optimal values for platform-dependent parameters, and of the challenges and tools involved in measuring hardware parameters automatically.


Presentation Transcript


  1. Automatic Measurement of Instruction Cache Capacity in X-Ray • Kamen Yotov, kyotov@us.ibm.com, IBM T. J. Watson Research Center • Joint work with: Tyler Steele, Sandra Jackson, Keshav Pingali, Paul Stodghill, Department of Computer Science, Cornell University • QEST'05

  2. Motivation: self-optimizing software • Goal: portable performance • Self-optimizing software • Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler) • Experimentally determines the optimal parameter values • Uses native C compiler to produce library • Examples: ATLAS, FFTW, SPIRAL, …

  3. Example: Register Blocking for MMM • Hardware parameters • Number of FP registers (NR) • I-Cache Capacity (ICC) • A simple model for the register tile size for MMM • Yotov et al. IEEE’05 • MU × NU + MU + NU + Temp ≤ NR • KU (unroll of K loop) • does not depend on NR • depends on ICC • Need to know NR and ICC!
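The register tile model on this slide can be explored with a few lines of C. The sketch below is only an illustration: the helper name `best_reg_tile`, the exhaustive search, and the rule of maximizing MU × NU with the first candidate winning ties are assumptions, not something the slides specify.

```c
#include <assert.h>

/* Register tile (MU x NU) for the MMM micro-kernel. */
typedef struct {
    int mu;
    int nu;
} RegTile;

/*
 * Hypothetical helper: search all tiles satisfying
 *     MU * NU + MU + NU + temp <= nr
 * and keep the one with the largest MU * NU (first found wins ties).
 * Returns {0, 0} when no tile fits.
 */
RegTile best_reg_tile(int nr, int temp)
{
    RegTile best = {0, 0};
    int mu, nu;

    assert(nr > 0 && temp >= 0);
    for (mu = 1; mu <= nr; ++mu) {
        for (nu = 1; nu <= nr; ++nu) {
            if (mu * nu + mu + nu + temp > nr)
                break; /* a larger NU only needs more registers */
            if (mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

With NR = 16 and one temporary, `best_reg_tile(16, 1)` gives a 3 × 3 tile, since 9 + 3 + 3 + 1 = 16.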

  4. Why not consult the manuals? • Self-optimizing systems • Require online manuals • Actual hardware values vs. number available for optimization • For software optimization, hardware values may not be relevant • e.g., the number of hardware registers may not equal the number of registers available for holding program values (register 0 on SPARC) • Incomplete • Parameters like capacity and line size of off-chip caches vary from model to model • Even the same model of computer may be shipped with different cache organizations • Not usually documented in processor manuals • Moving target

  5. Automatic Measurement Tools • lmbench • OS benchmark, some CPU / Memory benchmarks • Larry McVoy, BitMover, Inc. • Carl Staelin, HP • Calibrator • Memory hierarchy benchmark • Stefan Manegold • Centrum voor Wiskunde en Informatica • MOB • Memory hierarchy benchmark • Josep Blanquer, Robert Chalmers • University of California Santa Barbara

  6. X-Ray • Set of micro-benchmarks in ANSI C89 • Download and compile on any architecture (portable) • Deduce hardware parameter values from timing results • Some amount of O/S specific code • High-resolution timing routines • Super-page allocation • Currently support Linux • Windows and Solaris, IRIX, and AIX in the works • Paradox • Compiler optimizations may contaminate timing results • Cannot afford to turn off all optimizations

  7. Example: Latency of Integer ADD (Step by Step) t = gettime(); r1 += r2; return gettime() - t; Problem: hard to measure small time intervals accurately

  8. Step by Step (cont.) t = gettime(); while (R--) /* R is the number of repetitions */ r1 += r2; return gettime() - t; Problem: loop overhead

  9. Step by Step (cont.) t = gettime(); i = R / U; while (i--) /* loop unrolled U times */ { r1 += r2; r1 += r2; ........ r1 += r2; } return gettime() - t; Problem: compiler optimizations

  10. Step by Step (cont.) t = gettime(); i = R / U; switch (v) { case 0: loop: case 1: r1 += r2; case 2: r1 += r2; ................. case U: r1 += r2; if (--i) goto loop; } if (!v) return gettime() - t; else use(r1,r2); Solution: “volatile int v = 0”

  11. Want to measure r1+=r2 • Generate C code from specification <r1+=r2, <r1, r2: int>> • volatile int v = 0; volatile int vr = 0; register int r1 = vr; register int r2 = vr; t = gettime(); i = R / U; switch (v) { case 0: loop: case 1: r1 += r2; case 2: r1 += r2; ................. case U: r1 += r2; if (--i) goto loop; } if (!v) return gettime() - t; else { vr = r1; vr = r2; } • Latency of integer ADD: nano-benchmark C code
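For readers who want to try the idea from slides 7 to 11, here is a compilable sketch of the same pattern. It is not the code X-Ray generates: the function name `add_latency_seconds`, the fixed unroll factor of 8, and the use of the standard `clock()` in place of the slides' `gettime()` are all assumptions, and unsigned operands replace `int` so that repeated adds cannot trigger signed overflow. An aggressive compiler may still simplify the chain of adds, so the result should be treated as an estimate.

```c
#include <assert.h>
#include <time.h>

/* Read at run time, so the compiler cannot assume their values. */
static volatile int v = 0;
static volatile unsigned vr = 1;

/*
 * Hypothetical sketch: estimated seconds per dependent integer add,
 * using `reps` adds unrolled 8 times. clock() stands in for the
 * slides' gettime(); it has much coarser resolution.
 */
double add_latency_seconds(long reps)
{
    unsigned r1 = vr, r2 = vr;
    long iters, i;
    clock_t t0, t1;

    assert(reps >= 8);
    iters = reps / 8;
    i = iters;

    t0 = clock();
    switch (v) {        /* v is 0, but every case is a possible entry point */
    case 0:
loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
    case 5: r1 += r2;
    case 6: r1 += r2;
    case 7: r1 += r2;
    case 8: r1 += r2;
        if (--i) goto loop;
    }
    t1 = clock();

    if (v) {            /* never taken at run time, but keeps r1 and r2 live */
        vr = r1;
        vr = r2;
    }
    return ((double)(t1 - t0) / CLOCKS_PER_SEC) / ((double)iters * 8.0);
}
```

Multiplying the result by the measured clock frequency would convert it to cycles per add; a large `reps` value is needed because `clock()` ticks are coarse.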

  12. X-Ray architecture

  13. Specification Control Engine Instruction Throughput N=3, B=1:

  14. Micro-benchmarks in X-Ray • CPU • Frequency • Instruction Latency • Instruction Throughput • Instruction Existence • FPU on embedded processors • FMA on general purpose processors • SMP and SMT • Memory Hierarchy • Number of Registers of various types (int, float, SSE, …) • Multilevel Caches, TLB • Associativity • Block Size • Capacity • Latency • Instruction Cache Capacity

  15. Previous Approaches for Memory Hierarchy Parameters • Saavedra Benchmark (Hennessy-Patterson) • Accesses elements of an array constant stride apart • Measures average memory access time • Deficiencies • Considers all levels simultaneously • Works only for capacities that are powers-of-2 • Suffers from a number of implementation level deficiencies • Constant stride accesses • Loop overhead problems • Overlapping memory operations • Prone to compiler “optimizations”
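The constant-stride access pattern described on this slide has roughly the following shape. This is a generic sketch, not Saavedra's actual benchmark code; the function name `stride_walk` and the byte-granularity read-modify-write are assumptions. Timing a call and dividing by the returned access count gives the average access time, and the loop itself shows two of the listed deficiencies: loop overhead is included in the measurement, and successive accesses are independent, so they can overlap.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: touch every stride-th byte of buf[0..size)
 * `passes` times and return the number of accesses performed.
 * The volatile qualifier forces each access to reach memory code,
 * so the compiler cannot remove the loop.
 */
unsigned long stride_walk(volatile unsigned char *buf, size_t size,
                          size_t stride, unsigned long passes)
{
    unsigned long accesses = 0;
    unsigned long p;
    size_t i;

    assert(buf != NULL && stride > 0);
    for (p = 0; p < passes; ++p) {
        for (i = 0; i < size; i += stride) {
            buf[i]++;          /* one read and one write per element */
            ++accesses;
        }
    }
    return accesses;
}
```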

  16. Example: Isolation of lower cache levels • Idea for Ln measurements • Use sequences as for L1 measurements • Make L1…Ln-1 “transparent” to measurements • Unique in isolating the behavior of Ln so that all higher levels miss • Approach • Use sequences of sequences • Convolution of sequences

  17. Measuring I-Cache Capacity • Approach for Data Cache does not work • Array of pointers → code sequence with branches • Such branches are very predictable • Nearly impossible to get precise timing • Measure time to execute special code sequence of size N statements • Find the biggest N for which there is no significant increase in time per statement

  18. Nano-benchmark • Similar to Instruction Throughput • Parameters (1, 4) • Grow length N • Code size computed • (char *)&&finish - (char *)&&start

  19. Sensitivity • Graph for Pentium M • 9 more in the paper • Performance oscillates • Even after averaging out noise • Cannot wait for jump • Need more robust measurement

  20. Control Engine Script • Start with N=256 • Compute • Mean • Standard deviation • For • Binary-search • Detect jump when time is more than
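The formulas on this slide did not survive in the transcript, so the following is only one plausible reading of the outline: sample at a starting N, compute mean and standard deviation, grow N until a jump is detected, then binary-search for the largest N without a jump. Everything specific here is an assumption: the names `largest_n_without_jump` and `synthetic_time_per_stmt`, the doubling step, and the jump criterion t > mean × (1 + rel) + k × sd. The sketch also assumes that a jump, once seen, persists for all larger N, which slide 19 suggests is not always true on real hardware.

```c
#include <assert.h>

/* Hypothetical measurement callback: time per statement for N statements. */
typedef double (*TimePerStmtFn)(long n);

/* Synthetic stand-in for real timings: cost doubles past 1000 statements. */
double synthetic_time_per_stmt(long n)
{
    return n <= 1000 ? 1.0 : 2.0;
}

/* Assumed jump test: t > mean*(1+rel) + k*sd, using the variance to avoid sqrt(). */
static int is_jump(double t, double mean, double var, double k, double rel)
{
    double d = t - mean * (1.0 + rel);
    return d > 0.0 && d * d > k * k * var;
}

/* Returns the largest N in [n0, n_max] without a jump, or -1 if no jump is seen. */
long largest_n_without_jump(TimePerStmtFn f, long n0, long n_max,
                            int samples, double k, double rel)
{
    double sum = 0.0, sumsq = 0.0, mean, var, t;
    long lo, hi, mid;
    int s;

    assert(f != 0 && n0 > 0 && n_max >= n0 && samples > 0 && k >= 0.0);

    /* Baseline statistics at the starting size. */
    for (s = 0; s < samples; ++s) {
        t = f(n0);
        sum += t;
        sumsq += t * t;
    }
    mean = sum / samples;
    var = sumsq / samples - mean * mean;
    if (var < 0.0)
        var = 0.0;             /* guard against rounding error */

    /* Double N until the time per statement jumps. */
    lo = n0;
    hi = 2 * n0;
    while (hi <= n_max && !is_jump(f(hi), mean, var, k, rel)) {
        lo = hi;
        hi *= 2;
    }
    if (hi > n_max)
        return -1;

    /* Binary search between the last good size and the first bad one. */
    while (hi - lo > 1) {
        mid = lo + (hi - lo) / 2;
        if (is_jump(f(mid), mean, var, k, rel))
            hi = mid;
        else
            lo = mid;
    }
    return lo;
}
```

Called as `largest_n_without_jump(synthetic_time_per_stmt, 256, 1L << 20, 5, 3.0, 0.25)`, the search converges on the synthetic knee at N = 1000. A real measurement would replace the synthetic callback with one that times a generated code sequence of N statements.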

  21. Experimental Results

  22. Pentium 4 • Does not cache ISA instructions, but uops • Trace cache • Measure the number of instructions • Smoothing in the nano-benchmark: minimum of time in

  23. Conclusions • X-Ray: A framework and tool • First to measure instruction cache capacity • Algorithms for precise measurements of some important hardware parameters • Experimental results on many modern architectures • Other X-Ray resources • Memory Hierarchy parameter measurement appeared at SIGMETRICS’05 • CPU parameter measurement appeared at QEST’05 • Improving X-Ray is work in progress…

  24. Current and Future Work • 2-address vs. 3-address code • Out-of-order execution • Number of physical registers • Number / type of functional units • Cache • bandwidth • write mode • sharedness • replacement policy

  25. Thank you! • My E-Mail • kamen@yotov.org • kyotov@us.ibm.com • Cornell Group homepage • http://iss.cs.cornell.edu • This work emerged from a joint project with David Padua’s group at UIUC • http://polaris.cs.uiuc.edu/newframework.html • Download X-Ray! • http://iss.cs.cornell.edu/software/x-ray.aspx
