Automatic Measurement of Instruction Cache Capacity in X-Ray

Kamen Yotov
kyotov@us.ibm.com
IBM T. J. Watson Research Center
Joint work with: Tyler Steele, Sandra Jackson, Keshav Pingali, Paul Stodghill
Department of Computer Science, Cornell University
QEST'05

Motivation: self-optimizing software
• Goal: portable performance
• Self-optimizing software
  • Generates code with parameters whose optimal values depend on the platform (hardware / OS / compiler)
  • Determines optimal parameter values experimentally
  • Uses native C compiler to produce library
• Examples: ATLAS, FFTW, SPIRAL, …

Example: Register Blocking for MMM
• Hardware parameters
  • Number of FP registers (NR)
  • I-Cache Capacity (ICC)
• A simple model for the register tile size for MMM (matrix-matrix multiplication)
  • Yotov et al. IEEE’05
  • MU × NU + MU + NU + Temp ≤ NR
• KU (unroll factor of the K loop)
  • does not depend on NR
  • depends on ICC
• Need to know NR and ICC!
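
The register tile inequality above can be explored with a small search. This is only a sketch: the slide gives the constraint, not the objective, so maximizing MU × NU (with the first tile found winning ties) is an assumption, and the names tile and best_register_tile are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper type: a register tile of MU x NU accumulators. */
struct tile { int mu; int nu; };

/* Return the tile with the largest MU*NU that satisfies
   MU*NU + MU + NU + temp <= nr (the model from the slide).
   Assumption: area is the objective; the first tile found wins ties.
   Returns {0, 0} when no tile fits. */
static struct tile best_register_tile(int nr, int temp)
{
    struct tile best = {0, 0};
    int mu, nu;

    assert(nr >= 0 && temp >= 0);
    for (mu = 1; mu <= nr; mu++) {
        for (nu = 1; nu <= nr; nu++) {
            if (mu * nu + mu + nu + temp > nr)
                continue;               /* tile does not fit in NR registers */
            if (mu * nu > best.mu * best.nu) {
                best.mu = mu;
                best.nu = nu;
            }
        }
    }
    return best;
}
```

With NR = 16 and Temp = 1 this sketch picks a 3 × 3 tile (9 + 3 + 3 + 1 = 16 registers).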

Why not consult the manuals?
• Self-optimizing systems
  • Require online manuals
• Actual hardware values vs. number available for optimization
  • For software optimization, hardware values may not be relevant
  • E.g., the number of hardware registers may not equal the number of registers available for holding program values (register 0 on SPARC)
• Incomplete
  • Parameters like capacity and line size of off-chip caches vary from model to model
  • Even the same model of computer may be shipped with different cache organizations
  • Not usually documented in processor manuals
• Moving target

Automatic Measurement Tools
• lmbench
  • OS benchmark, some CPU / memory benchmarks
  • Larry McVoy, BitMover, Inc.
  • Carl Staelin, HP
• Calibrator
  • Memory hierarchy benchmark
  • Stefan Manegold
  • Centrum voor Wiskunde en Informatica
• MOB
  • Memory hierarchy benchmark
  • Josep Blanquer, Robert Chalmers
  • University of California Santa Barbara

X-Ray
• Set of micro-benchmarks in ANSI C89
  • Download and compile on any architecture (portable)
  • Deduce hardware parameter values from timing results
• Some amount of O/S specific code
  • High-resolution timing routines
  • Super-page allocation
  • Currently supports Linux
  • Windows, Solaris, IRIX, and AIX in the works
• Paradox
  • Compiler optimizations may contaminate timing results
  • Cannot afford to turn off all optimizations

Example: Latency of Integer ADD (Step by Step)
    t = gettime();
    r1 += r2;
    return gettime() - t;
Problem: hard to measure small time intervals accurately

Step by Step (cont.)
    t = gettime();
    while (R--)            /* R is the number of repetitions */
        r1 += r2;
    return gettime() - t;
Problem: loop overhead

Step by Step (cont.)
    t = gettime();
    i = R / U;
    while (i--)            /* loop body unrolled U times */
    {
        r1 += r2;
        r1 += r2;
        ...
        r1 += r2;
    }
    return gettime() - t;
Problem: compiler optimizations

Step by Step (cont.)
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0: loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        ...
        case U: r1 += r2;
                if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    else
        use(r1, r2);
Solution: “volatile int v = 0”

Latency of integer ADD: nano-benchmark C code
Want to measure r1 += r2
Generate C code from specification <r1+=r2, <r1, r2: int>>
    volatile int v = 0;
    volatile int vr = 0;
    register int r1 = vr;
    register int r2 = vr;
    t = gettime();
    i = R / U;
    switch (v)
    {
        case 0: loop:
        case 1: r1 += r2;
        case 2: r1 += r2;
        ...
        case U: r1 += r2;
                if (--i) goto loop;
    }
    if (!v)
        return gettime() - t;
    else
    {
        vr = r1;
        vr = r2;
    }
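
For readers who want to compile the nano-benchmark, here is one concrete instance with U fixed at 4. It is a sketch, not the code X-Ray generates: now_seconds and time_int_add are hypothetical names, and gettime() is replaced by POSIX clock_gettime, which is an assumption about the platform (it is not part of ANSI C89).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for gettime(): monotonic time in seconds (POSIX). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

static volatile int v = 0;   /* always 0, but the compiler cannot assume so */
static volatile int vr = 0;  /* opaque source and sink for r1 and r2 */

/* Time a chain of dependent integer adds, unrolled U = 4 times.
   reps is rounded down to a multiple of 4. Returns elapsed seconds. */
static double time_int_add(long reps)
{
    register int r1 = vr;
    register int r2 = vr;
    long i = reps / 4;
    double t;

    assert(reps >= 4);
    t = now_seconds();
    switch (v) {
    case 0:
    loop:
    case 1: r1 += r2;
    case 2: r1 += r2;
    case 3: r1 += r2;
    case 4: r1 += r2;
        if (--i) goto loop;
    }
    if (!v)
        return now_seconds() - t;
    /* Never reached at run time; keeps r1 and r2 live for the compiler. */
    vr = r1;
    vr = r2;
    return -1.0;
}
```

Dividing the returned time by the number of adds executed gives an estimate of the latency per ADD. The switch on the volatile v makes every case label a possible entry point, which is what stops the compiler from collapsing the chain of adds.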

X-Ray architecture
[Figure: X-Ray architecture]

Instruction Throughput
[Figure: panels "Specification" and "Control Engine"; example shown for N=3, B=1]

Micro-benchmarks in X-Ray
• CPU
  • Frequency
  • Instruction Latency
  • Instruction Throughput
  • Instruction Existence
    • FPU on embedded processors
    • FMA on general purpose processors
  • SMP and SMT
• Memory Hierarchy
  • Number of Registers of various types (int, float, SSE, …)
  • Multilevel Caches, TLB
    • Associativity
    • Block Size
    • Capacity
    • Latency
  • Instruction Cache Capacity

Previous Approaches for Memory Hierarchy Parameters
• Saavedra Benchmark (Hennessy-Patterson)
  • Accesses elements of an array a constant stride apart
  • Measures average memory access time
• Deficiencies
  • Considers all levels simultaneously
  • Works only for capacities that are powers of 2
  • Suffers from a number of implementation-level deficiencies
    • Constant stride accesses
    • Loop overhead problems
    • Overlapping memory operations
    • Prone to compiler “optimizations”
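
A constant-stride benchmark of the kind described above can be sketched as follows. This is an illustration of the general idea, not the Saavedra code: avg_access_ns and its parameters are hypothetical, the timer is POSIX clock_gettime (an assumption), and the sketch deliberately keeps the weaknesses the slide lists, such as loop overhead inside the timed region and independent accesses that the hardware may overlap.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Walk an array of n_bytes with a fixed stride, `passes` times, and
   return the average time per access in nanoseconds.
   Hypothetical helper; loop overhead is included in the result. */
static double avg_access_ns(size_t n_bytes, size_t stride_bytes, long passes)
{
    unsigned char *buf;
    volatile unsigned char *a;   /* volatile keeps the accesses in place */
    struct timespec t0, t1;
    size_t i;
    long p, accesses = 0;
    double ns;

    assert(n_bytes > 0 && stride_bytes > 0 && passes > 0);
    buf = malloc(n_bytes);
    assert(buf != NULL);
    a = buf;
    for (i = 0; i < n_bytes; i++)
        a[i] = 0;                /* touch every byte before timing */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (p = 0; p < passes; p++) {
        for (i = 0; i < n_bytes; i += stride_bytes) {
            a[i]++;
            accesses++;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buf);
    ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
       + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)accesses;
}
```

Plotting the result for growing n_bytes at a fixed stride shows steps where the array stops fitting in each cache level, which is how all levels end up being considered simultaneously.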

Example: Isolation of lower cache levels
• Idea for Ln measurements
  • Use sequences as for L1 measurements
  • Make L1…Ln-1 “transparent” to measurements
  • Unique in isolating the behavior of Ln so that all higher levels miss
• Approach
  • Use sequences of sequences
  • Convolution of sequences
  [Figure: convolution of sequences]

Measuring I-Cache Capacity
• Approach for Data Cache does not work
  • Array of pointers → code sequence with branches
  • Such branches are very predictable
  • Nearly impossible to get precise timing
• Measure time to execute a special code sequence of N statements
• Find the biggest N for which there is no significant increase in time per statement
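
The last bullet can be read as a simple scan over measured samples. The sketch below is a naive version under stated assumptions: samples are ordered by growing N, the first sample serves as the baseline, and "significant" means exceeding the baseline by a relative tolerance rel_tol. The name last_flat_index and the tolerance are hypothetical, not taken from X-Ray.

```c
#include <assert.h>
#include <stddef.h>

/* time_per_stmt[i] is the measured time per statement for the i-th
   (growing) code size. Return the index of the last sample before the
   first one that exceeds the baseline by more than rel_tol.
   Assumption: time_per_stmt[0] is a representative baseline. */
static size_t last_flat_index(const double *time_per_stmt, size_t count,
                              double rel_tol)
{
    size_t i;
    double limit;

    assert(time_per_stmt != NULL && count > 0 && rel_tol >= 0.0);
    limit = time_per_stmt[0] * (1.0 + rel_tol);
    for (i = 1; i < count; i++) {
        if (time_per_stmt[i] > limit)
            break;               /* first significant increase */
    }
    return i - 1;
}
```

A single noisy sample can end this scan too early, so it is only a starting point for a more careful search.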

Nano-benchmark
• Similar to Instruction Throughput
  • Parameters (1, 4)
  • Grow length N
• Code size computed
  • (char *)&&finish - (char *)&&start

Sensitivity
[Figure: graph for Pentium M]
• Graph for Pentium M
  • 9 more in the paper
• Performance oscillates
  • Even after averaging out noise
• Cannot wait for jump
• Need more robust measurement

Control Engine Script
• Start with N=256
• Compute
  • Mean
  • Standard deviation
• For […]
  • Binary-search
  • Detect jump when time is more than […]

Experimental Results

Pentium 4
• Does not cache ISA instructions, but uops
  • Trace cache
• Measure the number of instructions
• Smoothing in the nano-benchmark: minimum of time in […]

Conclusions
• X-Ray: A framework and tool
• First to measure instruction cache capacity
• Algorithms for precise measurements of some important hardware parameters
• Experimental results on many modern architectures
• Other X-Ray resources
  • Memory Hierarchy parameter measurement appeared at SIGMETRICS’05
  • CPU parameter measurement appeared at QEST’05
• Improving X-Ray is work in progress…

Current and Future Work
• 2-address vs. 3-address code
• Out-of-order execution
• Number of physical registers
• Number / type of functional units
• Cache
  • bandwidth
  • write mode
  • sharedness
  • replacement policy

Thank you!
• My E-Mail
  • kamen@yotov.org
  • kyotov@us.ibm.com
• Cornell Group homepage
  • http://iss.cs.cornell.edu
• This work emerged from a joint project with David Padua’s group at UIUC
  • http://polaris.cs.uiuc.edu/newframework.html
• Download X-Ray!
  • http://iss.cs.cornell.edu/software/x-ray.aspx
