Performance monitoring on HP Alpha using DCPI
Paul J. Drongowski
Hewlett Packard Corporation
Paul.Drongowski@hp.com
10 September 2002

Objectives for this presentation
• Give a brief introduction to the HP Continuous Profiling Infrastructure (DCPI)
• Present instruction sampling, a technique that precisely assigns hardware events to instructions on out-of-order execution architectures
• Demonstrate the accuracy of instruction sampling as applied to a small floating point program kernel

HP Continuous Profiling Infrastructure
• In daily use with hundreds of registered customers
• Application and system profiler
• System-wide data collection and analysis
• Practical applications include:
  • Troubleshooting performance
  • Driving compiler and post-link optimization (SPIKE)
  • Guiding hardware/software architectural design decisions

“The goal of the continuous profiling project is to produce runtime execution profiles of unmodified Alpha UNIX programs with such low overhead that customers boot with profiling turned on and don’t turn it off. Continuous profiling is [part of a larger project] to substantially improve the performance of large customer programs.” Dick Sites, 1996.

Features
• Transparent
• Comprehensive, system-wide profiles
• Don’t need to modify source or binary
• Continuous profiling with low overhead (2% to 5%)
• Incorporates many novel, patented techniques
  • Instruction sampling
  • Aggregation during data collection
  • Stall blame analysis
  • Value profiling

Definition: Conventional sampling
• AKA “PC sampling”
• A hardware counter counts occurrences of an event
• The hardware triggers an interrupt when the counter overflows
• The device driver associates the overflow with a program counter value
• Over many samples this builds up a profile of dynamic program behavior (e.g., number of times each instruction retired, number of I-cache misses for each instruction)

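The counter-overflow mechanism above can be sketched in a few lines. This is a minimal, idealized model and not DCPI code: `pc_sample`, the four-instruction loop, and the fixed period of 7 are all hypothetical, and the handler is assumed to see exactly the PC that caused the overflow.

```python
from collections import Counter

def pc_sample(trace, period):
    """Count events; on every `period`-th event, charge one sample to the current PC."""
    profile = Counter()
    counter = 0
    for pc in trace:              # one counted event (e.g. a retire) per entry
        counter += 1
        if counter == period:     # counter overflow raises the interrupt
            profile[pc] += 1      # handler records the PC it observes
            counter = 0
    return profile

# Hypothetical four-instruction loop executed 1000 times.
loop = [0x1000, 0x1004, 0x1008, 0x100C]
profile = pc_sample(loop * 1000, period=7)
for pc in loop:
    print(hex(pc), profile[pc])
```

Because every instruction in the loop retires equally often, the samples spread almost evenly across the four addresses. The model deliberately ignores the delay between overflow and interrupt delivery, which is the source of the skew and smear discussed on the next slide.
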
Problem: skew and smear
• In-order instruction execution
  • Sequential issue
  • Predictable order of instruction execution
  • Predictable skew, very little smear
  • Alpha implementations: 21064, 21164
• Out-of-order instruction execution
  • Non-sequential issue based on data/resource readiness
  • Unpredictable order of instruction execution
  • Unknown skew and smear over the in-flight instruction window
  • Alpha implementations: 21264, 21364

ProfileMe instruction sampling
• Implemented in Alpha 21264A and later
• Eliminates skew and smear
• Basic approach
  • Randomly select an instruction to monitor
  • Capture event information as the instruction executes in the pipeline
  • Trigger an interrupt when the instruction completes or aborts
  • Collect and aggregate instruction information/events
• The program counter value is part of the ProfileMe information, so event attribution to the instruction is precise

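To make the contrast with conventional sampling concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the seven symbolic instruction names, the rule that only `lds_b` incurs the event, the skew model in which the interrupt is recognized 1 to 4 instructions after the overflow, and the uniformly random selection interval. It is not a model of the real 21264 pipeline.

```python
import random
from collections import Counter

LOOP = ["lds_a", "lds_b", "addl", "muls", "adds", "sts", "blt"]
EVENT_PC = "lds_b"   # assumption: only this instruction incurs the event

def run(iterations):
    """Yield (pc, event_occurred) for each executed instruction."""
    for _ in range(iterations):
        for pc in LOOP:
            yield pc, pc == EVENT_PC

def conventional(iterations, period, rng, max_delay=4):
    """Event counter overflows every `period` events; the interrupt is
    recognized 1..max_delay instructions later (assumed skew model)."""
    executed = list(run(iterations))
    profile, count = Counter(), 0
    for i, (_, event) in enumerate(executed):
        if not event:
            continue
        count += 1
        if count == period:
            count = 0
            j = min(i + rng.randint(1, max_delay), len(executed) - 1)
            profile[executed[j][0]] += 1   # charged to whatever PC is seen late
    return profile

def profileme(iterations, mean_interval, rng):
    """Pick instructions at random intervals and record the event flag
    together with the PC of the selected instruction itself."""
    profile = Counter()
    next_pick = rng.randint(1, 2 * mean_interval)
    for i, (pc, event) in enumerate(run(iterations), start=1):
        if i == next_pick:
            if event:
                profile[pc] += 1           # PC travels with the sample
            next_pick += rng.randint(1, 2 * mean_interval)
    return profile

rng = random.Random(0)
conv = conventional(10_000, period=50, rng=rng)
pm = profileme(10_000, mean_interval=50, rng=rng)
print("conventional:", dict(conv))
print("instruction sampling:", dict(pm))
```

Under these assumptions the conventional profile spreads the event over the instructions that follow `lds_b` and never charges `lds_b` itself, while the instruction-sampling profile charges only `lds_b`.
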
Experiment: Measurement of instruction execution frequency
• Key technique needed to compute FLOPS
• How accurate is a sampling-based estimate of individual instruction frequency?
• Accuracy
  • Precision (statistical dispersion)
  • Bias (not addressed here; subject for additional study)
• Method
  • Run FP kernel 1000 times and capture profile data
  • Record number of retire samples for FP instruction in inner loop
  • Record basic block frequency estimation for same instruction
  • Compute and assess descriptive statistics

Example FP kernel

    /* Matrix-Matrix multiply */
    for (i=0;i<INDEX;i++)
      for(j=0;j<INDEX;j++)
        for(k=0;k<INDEX;k++)
          mresult[i][j] =
            mresult[i][j] +
            matrixa[i][k] * matrixb[k][j]
          ;

• Execution time (667 MHz Alpha 21264A)
  • Without DCPI: 52.58 seconds; with DCPI: 54.90 seconds
  • Overhead: 4.41% while collecting 25,000 cycle and ProfileMe samples per second
• 15,751 retire samples expected per inner loop instruction
  • (iterations) / (sample period) = 1,000,000,000 / 63,488

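The two figures on this slide can be checked with a few lines of arithmetic. The value `INDEX = 1000` is an assumption here: it is inferred from the stated 1,000,000,000 inner-loop iterations, not given explicitly in the kernel source.

```python
# Timings and sample period as stated on the slide.
t_without = 52.58          # seconds, without DCPI
t_with = 54.90             # seconds, with DCPI
sample_period = 63_488     # instructions between samples

INDEX = 1000               # assumed: 1000**3 matches the stated iteration count
iterations = INDEX ** 3    # executions of each inner-loop instruction

overhead = (t_with - t_without) / t_without
expected_samples = iterations / sample_period

print(f"overhead: {overhead:.2%}")
print(f"expected retire samples: {round(expected_samples):,}")
```
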
Image-by-image overview
• DCPI provides a top-level view to find candidates for drill-down

   retired
    :count       %    cum%  image
    197520  48.69%  48.69%  /dsk0h/dcpidb/PALcode
    157574  38.85%  87.54%  flops
     48656  12.00%  99.54%  /vmunix
      1451   0.36%  99.89%  /usr/shlib/libc.so
       181   0.04%  99.94%  /usr/bin/dcpid
        71   0.02%  99.96%  /sbin/loader
        51   0.01%  99.97%  . . .

Instruction-by-instruction - ProfileMe

    Retire     BB
    samples   freq  Address         Instruction
      15780  15713  0x120001218 :   lds   $f10, 0(a4)
      15508  15713  0x12000121c     lds   $f11, 0(a5)
      15789  15713  0x120001220     addl  a3, 0x1, a3
      15582  15713  0x120001224     lda   a4, 4(a4)
      15753  15713  0x120001228     lda   t10, -1000(a3)
      15693  15713  0x12000122c     lda   a5, 4000(a5)
      15920  15713  0x120001230     muls  $f10,$f11,$f10
      15607  15713  0x120001234     adds  $f1,$f10,$f1
      15787  15713  0x120001238     sts   $f1, 0(a1)
      15714  15713  0x12000123c     blt   t10, 0x120001218

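All ten instructions belong to one basic block, so they share a single BB frequency value. As an observation on the table above, the rounded mean of the ten retire sample counts equals the listed value of 15713. The sketch below assumes simple averaging; DCPI's actual estimator is not described in these slides and may be more elaborate.

```python
# Retire sample counts for the ten inner-loop instructions, copied from the table.
retire_samples = [15780, 15508, 15789, 15582, 15753,
                  15693, 15920, 15607, 15787, 15714]

# Assumed estimator: every instruction in a basic block executes equally often,
# so pooling their samples gives one estimate for the whole block.
bb_estimate = sum(retire_samples) / len(retire_samples)

print(round(bb_estimate))
```

Pooling ten per-instruction counts into one estimate is also why the block-level figure varies less from run to run than any single instruction's count.
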
Source line summary - ProfileMe

    Retire     BB
    samples   freq  Source line
          0      0  /* Matrix-Matrix multiply */
          0  15713  for (i=0;i<INDEX;i++)
         74  15713    for(j=0;j<INDEX;j++)
      78459  15713      for(k=0;k<INDEX;k++)
      15651  15713        mresult[i][j] =
      15835  15713          mresult[i][j] +
      47322  15713          matrixa[i][k]*matrixb[k][j]
          0      0        ;

Instruction-by-instruction - Conventional

    Retire
    samples  Address         Instruction
      14013  0x120001218 :   lds   $f10, 0(a4)
     291864  0x12000121c     lds   $f11, 0(a5)
      22708  0x120001220     addl  a3, 0x1, a3
          0  0x120001224     lda   a4, 4(a4)
          0  0x120001228     lda   t10, -1000(a3)
          0  0x12000122c     lda   a5, 4000(a5)
      10794  0x120001230     muls  $f10,$f11,$f10
          0  0x120001234     adds  $f1,$f10,$f1
       5365  0x120001238     sts   $f1, 0(a1)
          0  0x12000123c     blt   t10, 0x120001218

Floating add retires
[Figure: histogram of ProfileMe retire samples (x-axis: ProfileMe retire samples, 15200 to 16200; y-axis: frequency, 0 to 10), with -2sd, -sd, +sd and +2sd marked]

Retire samples (1000 runs)
    Minimum sample:      15359
    Maximum sample:      16111
    Average:             15745.859
    Standard deviation:  122.243
    Coeff of variation:  0.776%
    Error:               ± 1.552%

Floating add BB freq est
[Figure: histogram of basic block frequency estimates (x-axis: basic block frequency estimate, 15600 to 15900; y-axis: frequency, 0 to 20), with -2sd, -sd, +sd and +2sd marked]

BB frequency estimates (1000 runs)
    Minimum sample:      15639
    Maximum sample:      15841
    Average:             15745.983
    Standard deviation:  35.745
    Coeff of variation:  0.227%
    Error:               ± 0.454%

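The dispersion figures on the two statistics slides follow from the reported averages and standard deviations. The sketch below assumes that "Error" means twice the coefficient of variation (a two-standard-deviation band relative to the mean), which is consistent with the ±2sd markers in the histograms and reproduces the slide values when the rounded coefficient is doubled. The function name `dispersion` is hypothetical.

```python
def dispersion(mean, sd):
    """Return (coefficient of variation in %, assumed error band in %)."""
    cv_pct = round(100 * sd / mean, 3)
    error_pct = round(2 * cv_pct, 3)   # assumption: error = 2 x rounded CV
    return cv_pct, error_pct

raw = dispersion(15745.859, 122.243)   # raw ProfileMe retire samples
bb = dispersion(15745.983, 35.745)     # basic block frequency estimates

print("raw samples: CV %.3f%%, error +/- %.3f%%" % raw)
print("BB estimate: CV %.3f%%, error +/- %.3f%%" % bb)
```

The ratio of the two standard deviations is about 3.4, so in this experiment the basic block estimate is a little more than three times as precise as the raw per-instruction count.
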
Improving precision
• Square root law: quadrupling the sample size doubles the precision of the estimate
• Practical techniques for increasing sample size
  • Execute test for a longer period of time
  • Aggregate the results of multiple program runs
  • Increase the sampling rate
• DCPI facilitates all three techniques
  • It supports analysis of very large, long running programs
  • It automatically aggregates multiple runs
  • It supports higher, selectable sampling rates

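The square root law can be illustrated with a short simulation. This sketch uses uniformly distributed random numbers as a stand-in for sampled measurements, which is an assumption made purely for illustration; the helper `spread_of_mean` and the sample sizes are hypothetical and unrelated to DCPI's own data.

```python
import random
import statistics

def spread_of_mean(n, trials, rng):
    """Standard deviation of the sample mean over repeated trials of size n."""
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.pstdev(means)

rng = random.Random(1)
s_n = spread_of_mean(100, 2000, rng)    # sample size n
s_4n = spread_of_mean(400, 2000, rng)   # sample size 4n
ratio = s_n / s_4n

print(f"spread with n samples:  {s_n:.5f}")
print(f"spread with 4n samples: {s_4n:.5f}")
print(f"ratio: {ratio:.2f}")
```

The ratio comes out close to 2: four times as many samples roughly halves the spread of the estimate, because the standard error scales as 1/sqrt(n).
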
Conclusions
• DCPI: a practical tool for program analysis
  • An accepted, production-ready tool
  • Transparent, low-overhead, comprehensive
  • Pinpoints performance issues at the instruction and source language statement levels
• Experimental results
  • ProfileMe mitigates the effects of smear and skew that are present in conventional sampling on out-of-order machines
  • Precision as measured in the experiment
    • Raw samples: ±1.552%
    • Basic block frequency estimation: ±0.454%
  • Basic block frequency analysis substantially improves precision
• DCPI and ProfileMe technology can be applied to architectures other than Alpha, such as IA-32 and IPF

HP Continuous Profiling Infrastructure
• Offered as an “Advanced Development Kit” on Tru64 UNIX
• Agree to on-line field test agreement
• Current version is 3.9.6
• Contact: dcpi@hp.com
• URL: http://www.tru64unix.compaq.com/dcpi

Main components and their roles
• Performance counters monitor and count CPU events
• Device driver collects samples and performs first-level data aggregation
• Daemon performs image correlation and second-level aggregation
• Database stores profile data by epoch, host, and image
• Tools access, analyze, and present profile information

[Figure: Performance counters → Device driver → Daemon → Database → Tools; arrows denote flow of profile data]

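The two aggregation levels listed above can be pictured with a small sketch. It is illustrative only and not DCPI's implementation: the sample tuple layout `(pid, pc, event)`, the `load_map` structure, the image base address, and both function names are assumptions.

```python
from collections import Counter

def driver_aggregate(samples):
    """First level (assumed): collapse identical (pid, pc, event) samples into counts."""
    return Counter(samples)

def daemon_aggregate(counts, load_map):
    """Second level (assumed): map each (pid, pc) to (image, offset) and merge
    counts from different processes that run the same image."""
    by_image = Counter()
    for (pid, pc, event), n in counts.items():
        for base, size, image in load_map.get(pid, []):
            if base <= pc < base + size:
                by_image[(image, pc - base, event)] += n
                break
        else:
            by_image[("unknown", pc, event)] += n   # no image covers this PC
    return by_image

# Hypothetical samples from two processes running the same image.
samples = [
    (1, 0x120001218, "retire"),
    (1, 0x120001218, "retire"),
    (2, 0x120001218, "retire"),
    (1, 0x120001230, "retire"),
]
load_map = {
    1: [(0x120000000, 0x10000, "flops")],
    2: [(0x120000000, 0x10000, "flops")],
}

counts = driver_aggregate(samples)
profile = daemon_aggregate(counts, load_map)
for key, n in sorted(profile.items()):
    print(key, n)
```

Aggregating early keeps the volume of data passed between stages small, which fits the low-overhead goal stated earlier in the deck.
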
ProfileMe instruction information
• Program counter
• Instruction was a regular/PAL instruction
• Pipeline trap occurred
• Misprediction trap occurred
• Load-store order trap occurred
• Pipeline trap type
• Instruction was not yet prefetched
• Instruction was killed before register mapping
• Instruction stalled before register mapping
• Instruction retired without trapping (valid)
• Branch was taken
• Auxiliary counts: retire delay, retires in profiling window

Image-by-image overview
• Many aborted instructions due to DTB misses

    Retire    Abort    DTBmiss
    samples   samples  samples  Image
      97480      9699        1  /dsk0h/dcpidb/PALcode
      78253    200815     3878  flops
      24766       597        1  /vmunix
        668       148        0  /usr/shlib/libc.so
         49        31        0  /usr/bin/dcpid
         30         3        0  /sbin/loader

DTB misses
• DTB miss summary for inner loop of flops program

    Retire   Abort    DTBmiss
    samples  samples  samples  Freq  Address         Instruction
       7822    18283       39  7805  0x120001218 :   lds   $f10, 0(a4)
       7677    22059     3812  7805  0x12000121c     lds   $f11, 0(a5)
       7747    19259        0  7805  0x120001220     addl  a3, 0x1, a3
       7797    19377        0  7805  0x120001224     lda   a4, 4(a4)
       7980    19338        0  7805  0x120001228     lda   t10, -1000(a3)
       7865    19415        0  7805  0x12000122c     lda   a5, 4000(a5)
       7682    20811        0  7805  0x120001230     muls  $f10,$f11,$f10
       7831    20576        0  7805  0x120001234     adds  $f1,$f10,$f1
       7865    20715       27  7805  0x120001238     sts   $f1, 0(a1)
       7785    20727        0  7805  0x12000123c     blt   t10, 0x120001218

Dynamic Access to DCPI Data (DADD)
• Provide dynamic, runtime access to performance data
• Client / server relationship
  • Application (client) registers interest with DCPI daemon (server)
  • Daemon serves data to application via shared memory region
• “Virtual counter” API
  • Daemon summarizes profile information into event counts
  • Event counts are written periodically into shared memory region
  • DADD provides virtual counters to PAPI implementation
• Status: Experimental prototype under development

[Figure: Performance counters → Device driver → Daemon (server) → Application (client); arrows denote flow of profile/performance data]