Explore the powerful Origin2400 architecture, code optimization tools, cache strategies, memory types, and system scaling. Understand cache coherence, interconnection networks, and hardware performance for optimized code development.
Course Outline
• Origin2400 Architecture
• Code development and optimization tools
• Cache optimization
• User Environment
Memory Types
[diagram contrasting shared memory (several CPUs attached to a single memory) with distributed memory (each CPU paired with its own local memory)]
Origin2400 Architecture
ccNUMA: cache-coherent, non-uniform memory access
• Physically distributed, globally addressable memory
• Hardware cache coherence
Comparison of approaches:
• Bus-based shared memory systems: easy to program, hard to scale
• Massively parallel distributed memory systems: hard to program, easy to scale
• Scalable shared memory systems (ccNUMA): easy to program, easy to scale (to a point)
Node
• Two R12000 processors (400 MHz)
• 64 MB-4 GB memory (1 GB installed)
• Hub (interface)
[diagram: two processors and local memory connected through the Hub]
System Scaling
[diagrams: pairs of nodes (two processors, memory, and a Hub each) are linked by routers (R); as processors are added, further routers are connected to extend the interconnect]
Origin Node Board
• Two R12000 processors, each with an L2 cache
• 1 GB main memory
• Additional directory memory, used for cache coherence
• Sockets for extra directory memory for systems with more than 32 processors
• Hub interconnect chip (connects the processors, memory/directory, XIO, and NUMALink)
Origin Module
• Each router has six connections: two to nodes and four to other routers.
• Systems with 32 or fewer processors have extra router ports available and can use them for "express" links.
[diagram: Nodes 0-3 attached to Router 0 and Router 1 via NUMALink, with XIO connections to XBOW I/O switches]
Cache Coherence
• The directory maintains state information for each L2 cache line in memory.
• States:
  • unowned - not cached
  • exclusive - one read/write copy
  • shared - one or more read-only copies
  • poisoned - migrated to another node
• The directory includes a bit vector indicating which processors hold a copy of the cache line.
[diagram: processor caches on each node connected through the interconnection network, with a directory alongside each memory]
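As an illustration of the idea only (not the actual Hub or directory hardware format), a directory entry can be modeled as a state plus a sharer bit vector; the C names and fields below are purely hypothetical:

    /* Minimal sketch of a directory entry for directory-based coherence. */
    enum line_state { UNOWNED, EXCLUSIVE, SHARED, POISONED };

    struct dir_entry {
        enum line_state state;    /* coherence state of this cache line              */
        unsigned long   sharers;  /* bit vector: bit p set => processor p may hold a copy */
    };

    /* On a write miss, the home node would invalidate every current sharer
       and then record the requester as the exclusive owner. */
    void grant_exclusive(struct dir_entry *e, int requester)
    {
        /* (send invalidations to each processor whose bit is set in e->sharers) */
        e->sharers = 1UL << requester;
        e->state   = EXCLUSIVE;
    }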
Cache Architecture
• L1 I-cache: 32 KB
• L1 D-cache: 32 KB, 2-way set associative, LRU, writeback, 8-word lines, non-blocking (~10 cycles per miss)
• L2 cache: 8 MB, 2-way set associative, LRU, writeback, 32-word lines, non-blocking (~60+ cycles per miss, ~780 MB/s to memory)
Translation Lookaside Buffer • TLB is used to translate virtual addresses to physical addresses • R10000 TLB has 64 entries. Each entry can translate addresses for 2 pages (default page size is 16KB) • TLB miss costs about the same as a cache miss and causes similar performance issues.
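A quick back-of-the-envelope figure from the numbers above: with 64 entries, 2 pages per entry, and 16 KB pages, the TLB can map 64 x 2 x 16 KB = 2 MB of address space at once; codes that stride through much more memory than that will take frequent TLB misses.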
Origin2000 Bandwidths and Latencies
[table: physical peak, payload peak, and read bandwidths in GB/s; latencies in ns]
R12000 Architecture
• Superscalar: 400 MHz clock, 4 instructions/cycle
• Cache: 8 MB L2 cache, dedicated cache bus, interleaved cache access, non-blocking
• Out-of-order execution: 3 instruction queues
• Branch prediction
R12000 Architecture: Superscalar
• Fetch/decode up to 4 instructions/cycle
• Execute up to 4 instructions/cycle from 5 execution units: load/store, ALU1, ALU2, FPADD, FPMUL
• Instruction set binary compatible with the R8000 and R4000
• 32-bit and 64-bit instructions
• 32 integer registers, 32 floating-point registers
Origin2400 Architecture References
• www.sgi.com/origin/2000
• techpubs.sgi.com
Code Porting/Optimization Objectives • Get the right answers • Identify resource consuming code sections • Utilize optimized system libraries • Let the compiler do the work
Porting Issues (getting the right answers)
• Application Binary Interface (ABI)
  • 32
  • n32 (default; recommended for codes using <2 GB total memory)
  • 64 (required for codes using >2 GB total memory)
• Instruction Set Architecture (ISA)
  • mips2
  • mips3
  • mips4 (default)
Defaults are read from /etc/compiler.defaults.
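For example, an n32, MIPS IV build of a Fortran code would typically be compiled with a command like the following (the source file name is illustrative):

% f90 -n32 -mips4 -O2 prog.f -o prog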
Profiling Tools • perfex - overall code performance • SpeedShop - procedure level performance data • dprof - memory access patterns
R12000 Hardware Performance Registers
• Can select from 32 events
• Two counter registers (can fully count two events per code execution)
• Counter 0 events:
  0 - cycles
  1 - issued instructions
  2 - issued loads
  3 - issued stores
  4 - issued conditionals
  5 - failed conditionals
  6 - branches resolved
  7 - quadwords written back from s-cache
  8 - s-cache data errors (ECC)
  9 - I-cache misses
  10 - L2 cache miss - instruction
  11 - instruction misprediction
  12 - external interventions
  13 - external invalidations
  14 - function unit completion cycles
  15 - graduated instructions
R12000 Hardware Performance Registers (continued)
• Each counter can be set to count one of 16 events: counter 0 counts events 0-15, counter 1 counts events 16-31
• Counter registers are 32-bit registers and can be set to generate an interrupt on overflow
• Counter 1 events:
  16 - cycles
  17 - graduated instructions
  18 - graduated loads
  19 - graduated stores
  20 - graduated store conditionals
  21 - graduated floating-point instructions
  22 - quadwords written back from d-cache
  23 - TLB misses
  24 - mispredicted branches
  25 - d-cache misses
  26 - s-cache misses - data
  27 - data misprediction
  28 - external intervention s-cache hits
  29 - external invalidation s-cache hits
  30 - store/prefetch excl to clean block
  31 - store/prefetch excl to shared block
perfex
• No special compilation needed
• Can monitor two counters exactly, OR monitor all counters (each multiplexed 1/16th of the time; values are then multiplied by 16 to approximate full counts)
• Option to convert counts to estimated times

% perfex -a -y -o data code.x
  (-a: all counters;  -y: estimate times;  -o data: redirect output to the file "data")
perfex Output (edited for presentation)

Based on 250 MHz IP27
Event definitions for cpu version 3.x

                                                                                 Typical
 Event Counter Name                                           Counter Value   Time (sec)
=========================================================================================
  0 Cycles......................................................  898600299008  3594.401196
 16 Cycles......................................................  898600299008  3594.401196
 26 Secondary data cache misses.................................    7034639424  2124.461106
  7 Quadwords written back from scache..........................   18935563200   484.750418
 25 Primary data cache misses...................................    7449172608   268.468181
  2 Issued loads................................................   59030982976   236.123932
 14 ALU/FPU forward progress cycles.............................   48181262304   192.725049
 18 Graduated loads.............................................   46436171712   185.744687
  3 Issued stores...............................................   19988999248    79.955997
 22 Quadwords written back from primary data cache..............    4971802640    76.565761
 19 Graduated stores............................................   18055579056    72.222316
  6 Decoded branches............................................    5225243088    20.900972
 21 Graduated floating point instructions.......................    2699848928    10.799396
 24 Mispredicted branches.......................................    1033609888     5.870904
  9 Primary instruction cache misses............................        374656     0.027005
perfex Output (continued)

 23 TLB misses..................................................          1904     0.000519
 10 Secondary instruction cache misses..........................           256     0.000077
  4 Issued store conditionals...................................           160     0.000001
 20 Graduated store conditionals................................            32     0.000000
 30 Store/prefetch exclusive to clean block in scache...........            32     0.000000
  1 Issued instructions.........................................  147707069072     0.000000
  5 Failed store conditionals...................................             0     0.000000
  8 Correctable scache data array ECC errors....................             0     0.000000
 11 Instruction misprediction from scache way prediction table..           512     0.000000
 12 External interventions......................................       2525856     0.000000
 13 External invalidations......................................       7415216     0.000000
 15 Graduated instructions......................................  136445826704     0.000000
 17 Graduated instructions......................................  136469377216     0.000000
 27 Data misprediction from scache way prediction table.........     804101376     0.000000
 28 External intervention hits in scache........................       1744336     0.000000
 29 External invalidation hits in scache........................       3193680     0.000000
 31 Store/prefetch exclusive to shared block in scache..........             0     0.000000
perfex Output: Statistics

=========================================================================================
Graduated instructions/cycle................................................    0.151843
Graduated floating point instructions/cycle.................................    0.003005
Graduated loads & stores/cycle..............................................    0.071769
Graduated loads & stores/floating point instruction.........................   23.887170
Mispredicted branches/Decoded branches......................................    0.197811
Graduated loads/Issued loads................................................    0.786641
Graduated stores/Issued stores..............................................    0.903276
Data mispredict/Data scache hits............................................    1.939776
Instruction mispredict/Instruction scache hits..............................    0.001368
L1 Cache Line Reuse.........................................................    7.657572
L2 Cache Line Reuse.........................................................    0.058927
L1 Data Cache Hit Rate......................................................    0.884494
L2 Data Cache Hit Rate......................................................    0.055648
Time accessing memory/Total time............................................    0.737507
Time not making progress (probably waiting on memory) / Total time.........     0.946382
L1--L2 bandwidth used (MB/s, average per process)...........................   88.449327
Memory bandwidth used (MB/s, average per process)...........................  334.799259
MFLOPS (average per process)................................................    0.751126   <-- Not good
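These statistics are simple ratios of the raw counts on the previous slides. For example, graduated instructions/cycle is counter 15 divided by counter 0: 136445826704 / 898600299008 ≈ 0.1518, confirming that this code spends most of its time waiting on memory rather than retiring instructions.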
SpeedShop • No special compilation needed • Provides the following types of profiling • Program counter sampling • Ideal time • User time • Hardware counter profiling • Floating-point exception tracing • Heap tracing
SpeedShop: PC Sampling
• Provides an estimate of the time spent by each function in the executable
• Two-step process: execute the code with ssrun, then use prof to examine the results

% ssrun -pcsamp prog
% prof prog.pcsamp.4324
pcsamp output

Summary of statistical PC sampling data (pcsamp)--
     13060: Total samples
   130.600: Accumulated time (secs.)
      10.0: Time per sample (msecs.)
         2: Sample bin width (bytes)
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index]    secs      %   cum.%  samples  function (dso: file, line)
[1]      58.230  44.6%   44.6%     5823  zaver (prog: prog.f, 69)
[2]      37.490  28.7%   73.3%     3749  yaver (prog: prog.f, 50)
[3]      34.460  26.4%   99.7%     3446  xaver (prog: prog.f, 31)
[4]       0.420   0.3%  100.0%       42  main (prog: prog.f, 1)
        130.600 100.0%  100.0%    13060  TOTAL
SpeedShop: Ideal Time
• Estimates the best possible time the code could achieve, by routine
• Useful for identifying routines with cache problems

% ssrun -ideal prog
beginning libraries
        /usr/lib32/libssrt.so
        /usr/lib32/libftn.so
        /usr/lib32/libm.so
ending libraries, beginning prog
% prof prog.ideal.3453
ideal output

Summary of ideal time data (ideal)--
  23468025764: Total number of instructions executed
  26959868891: Total computed cycles
      107.839: Total computed execution time (secs.)
        1.149: Average cycles / instruction
-------------------------------------------------------------------------
Function list, in descending order by exclusive ideal time
-------------------------------------------------------------------------
[index]  excl.secs  excl.%  cum.%      cycles  instructions  calls  function (dso: file, line)
[1]         36.133   33.5%  33.5%  9033236300    7740175400    100  zaver (prog: prog.f, 69)
[2]         35.737   33.1%  66.6%  8934236300    7839175400    100  xaver (prog: prog.f, 31)
[3]         35.737   33.1%  99.8%  8934236300    7839175400    100  yaver (prog: prog.f, 50)
[4]          0.221    0.2% 100.0%    55184326      46134726      1  main (prog: prog.f, 1)
(Hundreds more lines of library calls omitted.)
SpeedShop: Hardware Counter Profiling (prof_hwc)
• The counter is selected with the environment variable _SPEEDSHOP_HWC_COUNTER_NUMBER
• The most commonly used counters have their own experiment names:
  gi_hwc  - graduated instructions
  cy_hwc  - cycles
  ic_hwc  - L1 I-cache misses
  isc_hwc - L2 I-cache misses
  dc_hwc  - L1 D-cache misses
  dsc_hwc - L2 D-cache misses
  tlb_hwc - TLB misses
  gfp_hwc - graduated FP instructions
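A typical run might look like the following (experiment name taken from the list above; the process id in the output file name is illustrative):

% ssrun -dsc_hwc prog
% prof prog.dsc_hwc.<pid>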
SpeedShop
• The -b (or -gprof) option to prof generates a dynamic calling tree: each procedure is listed with the procedures it calls and the procedures that call it.
WorkShop
• One of the WorkShop tools, cvperf, provides a GUI for viewing SpeedShop experiment results.
WorkShop
• ssusage (a SpeedShop program)
  • Runs an executable and prints the resources it used
  • Useful for finding out memory use
  • Example: % ssusage mypgm
WorkShop
• WorkShop also includes a debugger, cvd. The common UNIX debugger, dbx, is also available.
WorkShop
• Other WorkShop components include:
  • cvbuild - build dependency analyzer
  • cvstatic - static source analyzer
  • cvpav - parallel analysis for MP Fortran programs
• WorkShop can be configured to work with a source code revision control system (see cvconfig).
Performance Libraries
• fastm
  • Fast transcendental library
  • Link with -lfastm
  • Faster results at the cost of some accuracy
  • See man libfastm
• SCSL (Scientific Computing Software Library)
  • See man intro_scsl and the man pages referenced therein
  • Signal processing, including FFT, correlation, and convolution
• LAPACK
  • Linear solvers
  • Matrix and vector routines
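A link line using these libraries might look like the following (assuming -lscs is the SCSL link option on your system; check the man pages above):

% f90 -o prog prog.f -lfastm -lscs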
Compilers
• MIPSpro compilers: CC, cc, f90, f77
• Optimizations:
  • Software pipelining (SWP)
  • Inter-procedural analysis (IPA)
  • Loop nest optimizations (LNO)
Compilers: -O[n]
• 0 => no optimization; use only for debugging (this is the default!)
• 1 => simple optimizations
• 2 => conservative optimizations; should not alter results (plain -O means -O2)
• 3 => SWP, LNO, and other aggressive optimizations; may alter results
• -Ofast => -O3 -IPA -OPT:roundoff=3:alias=typed
Compilers: -OPT options
• IEEE_arithmetic=n - conformance with IEEE floating-point arithmetic
  • 1 (default): compliant
  • 2: inexact results may differ (not-a-number, infinity)
  • 3: allows arbitrary, mathematically valid transformations
• roundoff=n - acceptable roundoff-altering optimization, 0-3, where 0 is none and 3 is any
• alias=<name> - pointer aliasing model (see below)
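For instance, aggressive optimization with some latitude in floating-point rounding could be requested with a command line like this (source name illustrative; -OPT sub-options are joined with colons as shown above):

% f90 -O3 -OPT:roundoff=2:IEEE_arithmetic=2 prog.f -o prog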
Compilers: -OPT:alias=<name>
• ANY, COMMON_SCALAR (ANY is the default)
• TYPED, NO_TYPED - pointers to different base types point to distinct objects
• UNNAMED, NO_UNNAMED - pointers never point to named objects
• RESTRICT, NO_RESTRICT - distinct pointers point to distinct, non-overlapping objects
• PARM, NO_PARM - Fortran only
Do not lie to the compiler!
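To see why this matters, consider a hypothetical C routine (the name and arguments are illustrative): under the default alias=any the compiler must assume the two pointers can overlap, so it cannot freely reorder or pipeline the loads and stores; alias=restrict (or the C restrict qualifier) removes that assumption.

    /* With alias=any the compiler assumes dst and src may overlap, forcing
       conservative code.  Promising no overlap (alias=restrict) lets it
       software pipeline the loop aggressively. */
    void scale(double *dst, const double *src, double a, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = a * src[i];
    }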
Software Pipelining

   do i = 1, n
      y(i) = y(i) + a*x(i)
   enddo

Each loop iteration contains:
• 2 loads, 1 store
• 1 multiply-add
• 2 address increments
• loop end test, branch
Superscalar processor slots per cycle:
• 1 load/store
• 1 ALU1, 1 ALU2
• 1 FP add
• 1 FP multiply
Software Pipelining: naive schedule
[instruction-slot diagram: over 8 clock cycles the loop body issues load x, load y, x++, madd, store y, branch, and y++, leaving the FP units idle most of the time]
2 flop / 8 cycles achieved vs. 16 flop / 8 cycles peak: running at 1/8th of peak performance
Software Pipelining: pipelined daxpy
• Load/store is the bottleneck
• Optimize to fully utilize the load/store unit
[instruction-slot diagram: a 14-cycle schedule with the load/store unit kept busy nearly every cycle]
8 flop / 14 cycles achieved vs. 28 flop / 14 cycles peak: running at better than 1/4th of peak performance
Software Pipelining
• Use -O3 to enable pipelining
• Vectorizable loops are well suited for pipelining
• SWP cannot be done if the loop contains:
  • function calls
  • complicated conditionals
  • branching
• SWP is impeded by (see the recurrence sketch below):
  • recurrences between iterations (the IVDEP directive can help)
  • very long loops (split the loop)
  • register overflow (split the loop)
• SWP algorithms are heuristic: schedules are not unique, and finding a schedule may be computationally expensive
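A loop-carried recurrence looks like the hypothetical C fragment below: each iteration needs the result of the previous one, so iterations cannot be overlapped the way the independent daxpy iterations can.

    /* Loop-carried recurrence: y[i] depends on y[i-1], so the iterations
       cannot be overlapped, which defeats software pipelining. */
    void prefix_scale(double *y, const double *x, double a, int n)
    {
        for (int i = 1; i < n; i++)
            y[i] = y[i-1] + a * x[i];
    }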
Inter-Procedural Analysis (IPA)
• Analyzes the entire program
• Precedes other optimizations
• Performs optimizations across procedure boundaries
• Invoke with -IPA
• The compile step finishes quickly; the link step takes much longer
• If any procedure changes, the whole program must be recompiled
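A minimal two-file build with IPA enabled might look like this (file names are illustrative); note that most of the optimization work is deferred to the link step:

% cc -O2 -IPA -c main.c
% cc -O2 -IPA -c solver.c
% cc -O2 -IPA main.o solver.o -o prog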