
Origin System Architecture


krysta

Presentation Transcript


  1. Origin System Architecture Hardware and Software Environment

  2. Scalar Architecture
     Diagram: processor (register file, functional unit (mult, add)), cache, memory. Annotations: cache ~2 GB/s, ~10 cycles; memory ~500 MB/s, ~100 cycles.
     • Reduced Instruction Set (RISC) Architecture:
       • load/store instructions refer to memory
       • functional units operate on items in the register file
     • Memory hierarchy in the Scalar Architecture:
       • Most recently used items are captured in the cache
       • Access to cache is much faster than access to memory
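The cost of the two-level hierarchy above can be illustrated with a simple weighted average. This is a minimal sketch, not taken from the slides: the function name `average_access_cycles` is hypothetical, the model assumes a hit costs the cache latency and a miss costs the memory latency, and the default values are the approximate ~10 and ~100 cycle figures from the diagram.

```python
def average_access_cycles(hit_rate, cache_cycles=10, memory_cycles=100):
    """Average cycles per access for a single cache level (simplified model).

    A hit costs cache_cycles, a miss costs memory_cycles. The defaults are
    the approximate figures from the slide, not measured values.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


if __name__ == "__main__":
    for rate in (0.5, 0.9, 0.99):
        print(f"hit rate {rate:.2f}: {average_access_cycles(rate):.1f} cycles")
```

Under this model even a 90% hit rate leaves the average access close to twice the cache latency, which is why the later slides put so much weight on cache reuse.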

  3. Vector Architecture
     Diagram: processor (vector registers, functional unit (mult, add)), memory; matrix picture C = A x B with row index i and inner index k.
     • Vectors are loaded from memory with the loadv instruction
     • Performance is determined by memory bandwidth
     • Optimization takes the vector length (64 words) into account
     Vector operation:
       DO i=1,n
         DO k=1,n
           C(i,1:n)=C(i,1:n) + A(i,k)*B(k,1:n)
         ENDDO
       ENDDO
     Instruction sequence:
       loadf f2,(r3)      load scalar A(i,k)
       loadv v3,(r3)      load vector B(k,1:n)
       mpyvs v3,v3,v2     calculate A(i,k)*B(k,1:n)
       addvv v4,v4,v3     update C(i,1:n)
     C(i,1:n) is accumulated in a vector register.
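The row update above can be sketched in plain Python to show how the vector length enters. This is an illustrative sketch only: `matmul_row_update` is a hypothetical name, lists stand in for vector registers, and splitting each row into strips of 64 words (with the strip loop placed outside the k loop so the accumulator stays in one "register") is an assumption about how such a loop would be strip-mined, not something stated on the slide.

```python
VECTOR_LENGTH = 64  # vector register length in words, from the slide


def matmul_row_update(A, B, C, vector_length=VECTOR_LENGTH):
    """Compute C(i,1:n) += A(i,k) * B(k,1:n) in strips of vector_length.

    A, B, and C are square matrices given as lists of lists; C is updated
    in place and returned.
    """
    n = len(A)
    for i in range(n):
        for start in range(0, n, vector_length):
            stop = min(start + vector_length, n)
            acc = C[i][start:stop]            # accumulator "vector register"
            for k in range(n):
                a_ik = A[i][k]                # scalar load of A(i,k)
                row = B[k][start:stop]        # vector load of B(k, strip)
                acc = [c + a_ik * b for c, b in zip(acc, row)]
            C[i][start:stop] = acc            # one store per strip
    return C


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_row_update(A, B, [[0, 0], [0, 0]]))
```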

  4. Multiprocessor Architecture
     Diagram: two processors, each with register file, functional unit (mult, add), cache, and cache coherency unit, attached to one shared memory.
     • The cache coherency unit intervenes if two or more processors attempt to update the same cache line
     • All memory (and I/O) is shared by all processors
     • Read/write conflicts between processors on the same memory location are resolved by the cache coherency unit
     • The programming model is an extension of the single-processor programming model

  5. Multicomputer Architecture
     Diagram: two processors, each with register file, functional unit (mult, add), cache, and its own main memory.
     • All memory and I/O paths are independent
     • Data movement across the interconnect is “slow”
     • The programming model is based on message passing
     • Processors explicitly engage in communication by sending and receiving data

  6. Origin 2000 Node Board
     Basic Building Block
     • 2 x R12000 processors
     • 64 MB to 4 GB main memory
     • Hub bandwidth peaks:
       • 780 MB/s [625]: CPUs
       • 780 MB/s [683]: memory
       • 1.56 GB/s [1.25]: XIO link
       • 1.56 GB/s [1.25]: CrayLink
     Diagram: node board with two R1*K processors and their caches, Hub, main memory, directory (extra directory for >32P), XIO, and CrayLink.

  7. O2000 Node Board
     HUB Crossbar ASIC:
     • A single chip integrates all 4 interfaces:
       • Processor interface; two R1x000 processors multiplex on the same bus
       • Memory interface, integrating the memory controller and (directory) cache coherency
       • Interface to the CrayLink interconnect to other nodes in the system
       • Interface to the I/O devices with XIO-to-PCI bridges
     • Memory access characteristics:
       • Read bandwidth, single processor: 460 MB/s sustained
       • Average access latency: 315 ns to restart the processor pipeline
     Diagram annotations:
     • Main memory: SDRAM, up to 4 GB/node (144@50 MHz = 800 MB/s), plus directory SDRAM
     • L2 cache: 1-4-8 MB per R1x000 processor
     • CrayLink: duplex connection (2x23@400 MHz, 2x800 MB/s) to other nodes
     • Input/output on every node: 2x800 MB/s
     • HUB ASIC: 950K gates, 100 MHz, 64-bit, BTE, 64 counters per (4 KB) page

  8. Origin 2000 Switch Technology
     Diagram: node board (two processors with caches, Hub, main memory, directory, extra directory for >32P) with an XBOW providing 6 ports to XIO and a router to other node boards; 16 nodes (N) and 8 routers (R) connected as a ccNUMA hypercube.

  9. O2000 Scalability Principle
     • A distributed switch does scale:
       • A network of crossbars allows for full remote bandwidth
       • The switch components are distributed and modular
     Diagram: two node boards (each with two R1x000 processors, 1-4-8 MB L2 caches, a HUB with processor, memory, link, and I/O interfaces, main memory, and directory SDRAM) joined by the crossbar router network.

  10. Origin 2000 Module
      System Building Block
      • Module features:
        • Up to 8 R12000 CPUs (1-4 nodes)
        • Up to 16 GB physical memory
        • Up to 12 XIO slots
        • 2 XBOW switches
        • 2 router switches
        • 64-bit internal PCI bus (optional)
        • Up to 2.5 [3.1] GB/sec system bandwidth
        • Up to 5.0 [6.2] GB/sec I/O bandwidth

  11. Origin 2000 Module
      • Deskside system
      • 2-8 CPUs
      • 16 GB memory
      • 12 XIO slots
      • SGI 2100 / 2200
      Diagram: 4 nodes (N) and 2 routers (R).

  12. Origin 2000 Single Rack
      • Single-rack system
      • 2-16 CPUs
      • 32 GB memory
      • 24 XIO slots
      • SGI 2400
      Diagram: 8 nodes (N) and 4 routers (R).

  13. Origin 2000 Multi-Rack
      • Multi-rack system
      • 17-32 CPUs
      • 64 GB memory
      • 48 XIO slots
      • 32-processor hypercube building block
      Diagram: 16 nodes (N) and 8 routers (R).

  14. Origin 2000 Large Systems
      • Large multi-rack systems
      • Up to 512 CPUs
      • Up to 1 TB memory
      • 384+ XIO slots
      • SGI 2800

  15. ScalableNode Product Concept
      • Address diverse customer requirements
      • Independent scaling of CPU, I/O, and storage; tailor ratios to suit the application
      • Large dynamic range of product configurations
      • RAS via component isolation
      • Independent evolution and upgrade of system components
      • Maximize leverage of engineering and technology development efforts
      Diagram: Modular Architecture (Interface and Form Factor Standards) linking interconnect subsystems, processor subsystems, and I/O subsystems.

  16. Origin 3000 Hardware Modules (BRICKS)
      • G-brick: Graphics Expansion
      • C-brick: CPU Module
      • R-brick: Router Interconnect
      • I-brick: Base I/O Module
      • P-brick: PCI Expansion
      • X-brick: XIO Expansion
      • D-brick: Disk Storage

  17. Origin 3000 MIPS Node
      Diagram: four R1*000 processors, each with an L2 cache, attached to the Bedrock ASIC with memory/directory.
      • Two independent SysAD interfaces: each 2x O2K bandwidth; 200 MHz, 1600 MB/sec each
      • Memory interface: 4x O2K bandwidth; 200 MHz, 3200 MB/sec; 60% of O2K latency (180 ns local); 8 GB/node (max), DDR SDRAM
      • NUMALink3 network port: 2x O2K bandwidth; 800 MHz, 1600 MB/sec bi-directional
      • XIO+ port: 1.5x O2K bandwidth; 600 MHz, 1200 MB/sec bi-directional
      • 128 nodes / 512 CPUs per system (max)

  18. Origin 3000 CPU Brick (C-brick)
      • 3U high x 28” deep
      • Four MIPS or IA64 CPUs
      • 1-4 DIMM pairs: 256 MB, 512 MB, 1024 MB (premium)
      • 48V DC power input
      • N+1 redundant, hot-plug cooling
      • Independent power on/off
      • Each CPU module can support one I/O brick

  19. Origin 3000 BEDROCK Chip

  20. SGI Origin 3000 Bandwidth: Theoretical vs. Measured (MB/s)
      Diagram: two nodes, each with four CPUs, a hub, and memory. Value pairs in the diagram (measured / theoretical):
      • CPU: 900 / 1600
      • 1150 / 1600 (label not preserved in the transcript)
      • Hub: 2x1250 / 2x1600
      • Memory: 2100 / 3200

  21. STREAMS Copy Benchmark SGI Confidential

  22. Origin 3000 Router Brick (r/R-brick)
      • 2U high x 25” deep
      • Replaces system mid-plane
      • Multiple implementations:
        • r-Brick: 6-port (up to 32 CPUs)
        • R-Brick: 8-port (up to 128 CPUs)
        • metarouter (128 to 512 CPUs)
      • 48V DC power input
      • N+1 redundant, hot-plug cooling
      • Independent power on/off
      • Latency 50% of Origin 2000: 45 ns
      NUMAlink™ 3 Router: 8 NUMAlink™ 3 NW ports; each port 3.2 GB/s (2x O2K bandwidth); 45 ns roundtrip latency (50% of O2K router latency).

  23. SGI Origin 3000 Measured Bandwidth 5000 MB/s Router 2500 2500

  24. SGI NUMA 3 Scalable Architecture (16p - 1 hop)
      Diagram: sixteen R1*000 processors in groups of four, each group on a Bedrock ASIC; the four Bedrock ASICs connect to one 8-port router, which links to other routers.

  25. Origin 3000 I/O Bricks
      I-brick: Base I/O Module
      • Base system I/O: system disk, CD-ROM, 5 PCI slots
      • No need to duplicate starting I/O infrastructure
      P-brick: PCI Expansion
      • 12 industry-standard, 64-bit, 66 MHz slots
      • Supports almost all system peripherals
      • All slots are hot-swap
      X-brick: XIO Expansion
      • Highest performance I/O expansion
      • Supports HIPPI, GSN, VME, HDTV
      • 4 XIO slots per brick
      New I/O bricks (e.g., PCI-X) can be attached via the same XIO+ port.

  26. Types of Computer Architecture, characterised by memory access
      MIMD
      • Multiprocessors (single address space, shared memory)
        • UMA (central memory)
          • PVP (SGI/Cray T90)
          • SMP (Intel SHV, SUN E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
        • NUMA (distributed memory)
          • COMA (KSR-1, DDM)
          • CC-NUMA (SGI Origin2000, Origin3000, Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
          • NCC-NUMA (Cray T3D, IBM SP3)
      • Multicomputers (multiple address spaces)
        • NORMA (no-remote memory access)
          • Cluster (IBM SP2, DEC TruCluster, Microsoft Wolfpack, “Beowulf”, etc.): loosely coupled, multiple OS
          • “MPP” (Intel TFLOPS, TM-5): tightly coupled, single OS
      Abbreviations:
      • MIMD: Multiple Instruction, Multiple Data
      • PVP: Parallel Vector Processor
      • UMA: Uniform Memory Access
      • SMP: Symmetric Multi-Processor
      • NUMA: Non-Uniform Memory Access
      • COMA: Cache Only Memory Architecture
      • NORMA: No-Remote Memory Access
      • CC-NUMA: Cache-Coherent NUMA
      • NCC-NUMA: Non-Cache Coherent NUMA
      • MPP: Massively Parallel Processor

  27. Origin DSM-ccNUMA Architecture: Distributed Shared Memory
      Diagram: processors with caches grouped on Bedrock ASICs, each Bedrock with main memory, directory, and an XIO+ port; the Bedrocks are connected through NUMALink3 and R-Bricks.

  28. Distributed Shared Memory Architecture (DSM)
      Diagram: two processors, each with register file, functional unit (mult, add), cache, cache coherency unit, and its own main memory, joined by an interconnect.
      • Local memory and an independent path to memory, as with the Multicomputer Architecture
      • The memory of all nodes is organized as one logical “shared memory”
      • Non-uniform memory access (NUMA):
        • “Local memory” access is faster than “remote memory” access
      • The programming model is (almost) the same as for the Shared Memory Architecture
        • Data distribution is available for optimization
      • Scalability properties are similar to those of the Multicomputer Architecture

  29. Origin DSM-ccNUMA Architecture: Directory-Based Scalable Cache Coherence
      Diagram: processors with caches grouped on Bedrock ASICs, each Bedrock with main memory, directory, and an XIO+ port; the Bedrocks are connected through NUMALink3 and R-Bricks.

  30. Origin Cache Coherency
      Diagram: a memory page of data blocks (cache lines, 128 Bytes or 32 words each) and the matching directory page, with presence (64 bits) and state (8 bits) per data block.
      • A memory page is divided into data blocks of 32 words or 128 Bytes each (L2 cache line size)
      • Each data request transfers one data block (128 Bytes)
      • Each data block has associated presence and state information
      • If a node (HUB) requests a data block, the corresponding presence bit is set and the state of that cache line is recorded
      • The HUB runs the cache coherency protocol, updating the state of the data block and notifying the nodes for which the presence bit is set
      • States: Unowned (no copies), Shared (read-only copies), Exclusive (one read-write copy), Busy (state in transition)
      • Each L2 cache line contains 4 data blocks of 8 words or 32 Bytes each (L1 data cache line size)
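The presence-bit bookkeeping described above can be sketched as a toy directory entry. This is a simplified illustration, not the actual Origin protocol: the class and method names are hypothetical, the transient Busy state is left out, and the rule that a write invalidates all other holders is the usual behaviour of an invalidation-based directory protocol, assumed here rather than taken from the slide.

```python
UNOWNED, SHARED, EXCLUSIVE = "unowned", "shared", "exclusive"


class DirectoryEntry:
    """Directory state for one 128-byte data block (toy model)."""

    def __init__(self):
        self.state = UNOWNED
        self.presence = 0  # 64-bit mask; bit n set means node n holds a copy

    def sharers(self):
        """Nodes whose presence bit is set."""
        return [n for n in range(64) if (self.presence >> n) & 1]

    def read(self, node):
        """Node requests a read-only copy; returns the nodes to notify."""
        if self.state == EXCLUSIVE and self.presence == 1 << node:
            return []                      # the owner already has the block
        notified = []
        if self.state == EXCLUSIVE:
            notified = self.sharers()      # current owner must give up write access
        self.presence |= 1 << node
        self.state = SHARED
        return notified

    def write(self, node):
        """Node requests read-write ownership; returns the nodes to invalidate."""
        invalidated = [n for n in self.sharers() if n != node]
        self.presence = 1 << node
        self.state = EXCLUSIVE
        return invalidated


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.read(0)
    entry.read(3)
    print(entry.state, entry.sharers())    # two read-only copies
    print(entry.write(5), entry.state)     # nodes 0 and 3 are invalidated
```

The point of the directory is visible in `write`: only the nodes recorded in the presence mask are contacted, so coherence traffic does not have to be broadcast to every node.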

  31. CC-NUMA Architecture: Programming
      Diagram: matrix product C = A x B with the columns distributed over Proc 1, Proc 2, and Proc 3.
      • All data is shared
      • An additional optimization places data close to the processor that does most of the computation on that data
      • Automatic (compiler) optimizations for single-processor and parallel performance
      • The data access (data exchange) is implicit in the algorithm
      • Except for the additional data placement directives, the source is the same as for single-processor programming (SMP principle)
      C every processor holds a column of each matrix:
      C$distribute A(*,block),B(*,block),C(*,block)
      C$omp parallel do
            DO i=1,n
              DO j=1,n
                DO k=1,n
                  C(i,j)=C(i,j) + A(i,k)*B(k,j)
                ENDDO
              ENDDO
            ENDDO
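The effect of a `(*,block)` distribution can be pictured as a mapping from column index to owning processor. The sketch below is an assumption for illustration: the helper names are hypothetical, and the ceiling-based block size is a common convention that may differ in detail from what the SGI compiler does with the directive.

```python
def block_size(n, num_procs):
    """Columns per processor, rounded up (assumed convention)."""
    return -(-n // num_procs)


def block_owner(column, n, num_procs):
    """Processor (0-based) that owns the given 1-based column."""
    return (column - 1) // block_size(n, num_procs)


def block_columns(proc, n, num_procs):
    """1-based columns placed in the local memory of processor `proc`."""
    size = block_size(n, num_procs)
    first = proc * size + 1
    last = min((proc + 1) * size, n)
    return range(first, last + 1)


if __name__ == "__main__":
    n, procs = 10, 3
    for p in range(procs):
        print(p, list(block_columns(p, n, procs)))
```

With the matrices distributed this way, a processor that works on its own block of columns of C finds those columns, and the matching columns of B, in local memory.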

  32. Problems of CC-NUMA Architecture
      • SMP programming style + data placement techniques (directives)
      • “SMP programming cliff”: remote memory latency jumps by a factor of ~3-5, which requires correct data placement
      • Based on a 1 GB/s SCI link; latency/hop ~500 ns
      • 64-128 processor O2000: ta(remote)/ta(local) ~3-5 -> correct data placement
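Why the remote/local ratio makes data placement matter can be seen from a weighted average of the two latencies. This is a minimal sketch: `average_latency_ns` is a hypothetical helper, and the 100 ns local latency in the example is a placeholder value, not a figure from the slides.

```python
def average_latency_ns(remote_fraction, local_ns, remote_ratio):
    """Mean memory latency when a fraction of accesses goes to remote memory.

    remote_ratio is ta(remote)/ta(local), roughly 3 to 5 according to the slide.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_ratio
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        # 100 ns local latency is a placeholder for illustration only
        print(fraction, average_latency_ns(fraction, local_ns=100, remote_ratio=4))
```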

  33. DSM-ccNUMA Memory
      • Shared-memory Systems (SMP): easy to program, hard to scale
      • Massively Parallel Systems (MPP): hard to program, easy to scale
      • Distributed Shared Memory Systems (ccNUMA): easy to program, easy to scale

  34. SGI 3200 (2-8p)
      Router-less configurations in deskside form factor; short rack (17U configuration space).
      Diagram: system topology and rack layout for the minimum (2p) and maximum (8p) systems, built from C-Bricks, an I-Brick, optional P-, I-, or X-Bricks on the XIO+ ports, and power bays.

  35. SGI 3400 (4-32p)
      Full-size rack (39U configuration space).
      Diagram: system topology and rack layout for the minimum (4p) and maximum (32p) systems, built from C-Bricks, r-Bricks (6-port routers), an I-Brick, optional P-, I-, or X-Bricks on the XIO+ ports, and power bays.

  36. SGI 3800 (16-128p)
      Diagram: 128p system topology across racks 1-4, and rack layouts for the minimum (16p) and maximum (128p) systems, built from C-Bricks, R-Bricks (8-port routers), I-Bricks, optional P-, I-, or X-Bricks, and power bays.

  37. SGI 3800 System: 128 processors
      Diagram: eight groups of 16 processors.

  38. SGI 3800 (32-512p)
      Diagram: one quadrant of a 512p system (rack layout of C-Bricks, R-Bricks, an I-Brick, optional P-, I-, or X-Bricks, and power bays).
      512p power estimates:
      • MIPS = 77 kW
      • Itanium™ = 150 kW
      • McKinley = 231 kW
      No I/O or storage is included in the power estimates. Premium memory is required.

  39. Router-to-Router Connections for 256 Processor Systems

  40. 512 Processor Systems

  41. R1xK Family of Processors
      The MIPS R1x000 is an out-of-order, dynamic-scheduling superscalar processor with non-blocking caches.
      • Supports the 64-bit MIPS IV ISA
      • 4-way superscalar
      • Five separate execution units
      • 2 floating point results / cycle
      • 4-way deep speculative execution of branches
      • Out-of-order execution (48 instruction window)
      • Register renaming
      • Two-way set associative non-blocking caches
      • Up to 4 outstanding memory read requests
      • Prefetching of data
      • 1 MB to 8 MB secondary data cache
      • Four user-accessible event counters

  42. Origin 3000 MIPS Processor Roadmap
      Timeline 1999-2003, from Origin 2000 to O3K-MIPS:
      • R10000: 250 MHz, 500 MFlops, 4 MB @ 250 MHz
      • R12000: 300 MHz, 600 MFlops, 8 MB @ 200 MHz
      • R12000A: 400 MHz, 800 MFlops, 8 MB @ 266 MHz
      • R14000(A): 500+ MHz, 1000+ MFlops, 8 MB DDR SRAM @ 250+ MHz
      • R16000: xxx MHz, xxx GFlops
      • R18000: xxx MHz, xxx GFlops

  43. R14000 Cache Interfaces

  44. Memory Hierarchy
      Diagram: device capacity (size) against speed of access (1/clock) for the levels registers (64), L1 cache (32 KB), L2 cache (8 MB), memory, and disk, grouped as cache subsystem, memory, and disk. Access-time labels in the diagram: ~2-3 cycles, ~10 cycles, ~100-300 cycles (NUMA), ~4000 cycles; capacity label: ~1 to 100s of GB.
      Chart: remote latency (ns) for Origin 2000 and Origin 3000 at system sizes from 2p to 512p; the plotted values range from 175 ns to 1169 ns.

  45. Effects of Memory Hierarchy
      Chart labels: 32 KB L1 cache; L2 cache sizes of 1 MB, 2 MB, and 4 MB.

  46. Instruction Latencies (R12K)
      Unit / operation                        Latency   Repeat rate
      Integer units
      ALU 1
        add, sub, logic ops, shift, br        1         1
      ALU 2
        add, sub, logic ops                   1         1
        signed multiply (32/64 bit)           6/10      6/10
          (unsigned multiply: +1 cycle)
        divide (32/64 bit)                    35/67     35/67
      Address unit
        load integer                          2         1
        load floating point                   3         1
        store                                 -         1
        atomic LL,ADD,SC sequence             6         6
      Floating point units
      FPU 1
        add, sub, compare, convert            2         1
      FPU 2
        multiply                              2         1
        multiply-add (madd)                   4         1
      FPU 3
        divide, reciprocal (32/64 bit)        12/19     14/21
        sqrt (32/64 bit)                      18/33     20/35
        rsqrt (32/64 bit)                     30/52     34/56
      A repeat rate of 1 means that, after pipelining, the processor can complete 1 operation per cycle. Thus the peak rates are:
      • Integer operations: 2 int operations/cycle
      • FP operations: 2 fp operations/cycle
      For the R14000@500MHz: 4*500 MHz = 2000 MIPS; 2*500 MHz = 1000 Mflop/s.
      The compiler has this table built in. The goal of compiler scheduling is to find instructions that can be executed in parallel to fill all slots: ILP (Instruction Level Parallelism).

  47. Instruction Latencies: DAXPY Example
        DO I=1,n
          Y(I) = Y(I) + A*X(I)
        ENDDO
      Loop parallelism, per single loop iteration:
      • 2 loads, 1 store
      • 1 multiply-add (madd)
      • 2 address increments
      • 1 loop-end test
      • 1 branch
      Processor parallelism, per processor cycle:
      • 1 load or store
      • 1 ALU1 instruction
      • 1 ALU2 instruction
      • 1 FP add
      • 1 FP multiply
      Analysis:
      • There are 2 loads (x, y) and 1 store (y) = 3 mem ops
      • There are 2 fp operations (+, *), which can be done with 1 madd
      • 3 mem ops require 3 cycles minimum (the processor can do 1 mem op/cycle)
      • Theoretically, in 3 cycles the processor can do 6 fp operations
      • Only 2 fp operations are available in the code
      • Max processor speed is 2fp/6fp = 1/3 of peak on this code
      • i.e., for the R12000@300MHz processor, 600/3 = 200 Mflop/s
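The bound derived above can be reproduced with exact fractions. A minimal sketch: the function name `memory_bound_fraction` is hypothetical, while the limits of 1 memory operation and 2 floating-point operations per cycle are the ones stated in the slides.

```python
from fractions import Fraction


def memory_bound_fraction(loads, stores, flops,
                          mem_ops_per_cycle=1, peak_flops_per_cycle=2):
    """Fraction of peak fp rate when the loop is limited by memory operations."""
    cycles = Fraction(loads + stores, mem_ops_per_cycle)   # minimum cycles per iteration
    return Fraction(flops) / (cycles * peak_flops_per_cycle)


if __name__ == "__main__":
    daxpy = memory_bound_fraction(loads=2, stores=1, flops=2)
    peak_mflops = 2 * 300                                   # R12000 at 300 MHz
    print(daxpy, float(daxpy * peak_mflops))                # 1/3 of peak, 200 Mflop/s
```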

  48. DAXPY Example: Schedules
      Simple schedule (x load delay 3 cycles, madd delay 4 cycles):
        DO I=1,n
          Y(I) = Y(I) + A*X(I)
        ENDDO
        cycle  instructions
        0      ld x    x++
        1      ld y
        2
        3      madd
        4
        5
        6
        7      st y    br    y++
      • 2fp/(8cycles*2fp/cy) = 1/8 peak
      • R12000@300MHz: ~75 Mflop/s
      Unrolled by 2 (x load delay 3 cycles, madd delay 4 cycles):
        DO I=1,n-1,2
          Y(I+0) = Y(I+0) + A*X(I+0)
          Y(I+1) = Y(I+1) + A*X(I+1)
        ENDDO
        cycle  instructions
        0      ld x0
        1      ld x1
        2      ld y0   x+=4
        3      ld y1   madd0
        4      madd1
        5
        6
        7      st y0
        8      st y1   y+=4   br
      • 4fp/(9cycles*2fp/cy) = 2/9 peak
      • R12000@300MHz: ~133 Mflop/s
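The two efficiency figures can be recomputed directly from the schedule tables. In this sketch the dictionaries transcribe the cycle tables from the slide, and the helper `fraction_of_peak` (a hypothetical name) counts each madd as 2 fp operations against a peak of 2 fp operations per cycle.

```python
from fractions import Fraction

# Cycle -> instructions issued in that cycle, transcribed from the slide.
SIMPLE = {0: ["ld x", "x++"], 1: ["ld y"], 3: ["madd"], 7: ["st y", "br", "y++"]}
UNROLLED = {
    0: ["ld x0"], 1: ["ld x1"], 2: ["ld y0", "x+=4"], 3: ["ld y1", "madd0"],
    4: ["madd1"], 7: ["st y0"], 8: ["st y1", "y+=4", "br"],
}


def fraction_of_peak(schedule, peak_flops_per_cycle=2):
    """Fraction of peak fp rate reached by one pass through the schedule."""
    cycles = max(schedule) + 1                      # cycles are numbered from 0
    madds = sum(op.startswith("madd")
                for ops in schedule.values() for op in ops)
    return Fraction(2 * madds, cycles * peak_flops_per_cycle)


if __name__ == "__main__":
    peak_mflops = 2 * 300                           # R12000 at 300 MHz
    for name, schedule in (("simple", SIMPLE), ("unrolled by 2", UNROLLED)):
        share = fraction_of_peak(schedule)
        print(name, share, round(float(share * peak_mflops)))
```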

  49. DAXPY Example: Software Pipelining
      #<swp> replication 0                               #cy
        ld x0    ldc1   $f0,0($1)                        #[0]
        ld x1    ldc1   $f1,-8($1)                       #[1]
        st y2    sdc1   $f3,-8($3)                       #[2]
        st y3    sdc1   $f5,0($3)                        #[3]
        y+=2     addiu  $3,$2,16                         #[3]
                 madd.d $f5,$f2,$f0,$f4                  #[4]
        ld y0    ldc1   $f0,-8($2)                       #[4]
                 madd.d $f3,$f0,$f1,$f4                  #[5]
        x+=2     addiu  $1,$1,16                         #[5]
                 beq    $2,$4,.BB21.daxpy                #[5]
        ld y3    ldc1   $f2,0($3)                        #[5]
      #<swp> replication 1                               #cy
        ld x3    ldc1   $f1,0($1)                        #[0]
        ld x2    ldc1   $f0,-8($1)                       #[1]
        st y1    sdc1   $f3,-8($2)                       #[2]
        st y0    sdc1   $f5,0($2)                        #[3]
        y+=2     addiu  $2,$3,16                         #[3]
                 madd.d $f5,$f2,$f1,$f4                  #[4]
        ld y3    ldc1   $f1,-8($3)                       #[4]
                 madd.d $f3,$f1,$f0,$f4                  #[5]
        x+=2     addiu  $1,$1,16                         #[5]
        ld y0    ldc1   $f2,0($2)                        #[5]
      • Software pipelining is the way to fill all processor slots by mixing iterations
      • The number of replications gives how many iterations are mixed
      • The number of replications depends on the distance (in cycles) between the load and the calculation
      • DAXPY 6-cycle schedule with 4 fp ops: 4fp/(6cy*2fp/cy) = 1/3 peak

  50. DAXPY SWP: Compiler Messages
      • f77 -mips4 -O3 -LNO:prefetch=0 -S daxpy.f
      • With the -S switch the compiler produces the file daxpy.s with assembler instructions and comments about software pipelining schedules:
        #<swps> Pipelined loop line 6 steady state
        #<swps> 50 estimated iterations before pipelining
        #<swps> 2 unrolling before pipelining
        #<swps> 6 cycles per 2 iterations
        #<swps> 4 flops ( 33% of peak)(madds count 2fp)
        #<swps> 2 flops ( 16% of peak)(madds count 1fp)
        #<swps> 2 madds ( 33% of peak)
        #<swps> 6 mem refs (100% of peak)
        #<swps> 3 integer ops ( 25% of peak)
        #<swps> 11 instructions ( 45% of peak)
        #<swps> 2 short trip threshold
        #<swps> 7 ireg registers used.
        #<swps> 6 fgr registers used.
      • The schedule reaches the maximum of 1/3 of peak processor performance, as expected
      • Note: it is necessary to switch off prefetch to attain the max schedule
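The percentages in the compiler report can be cross-checked against the per-cycle issue limits. This is a sketch under stated assumptions: the limits of 2 fp operations, 2 integer operations, 1 memory reference, and 4 instructions per cycle come from the earlier slides; one madd per cycle and truncation (rather than rounding) of the percentages are assumptions chosen because they reproduce the listing. The names `PER_CYCLE` and `percent_of_peak` are hypothetical.

```python
# Per-cycle issue limits. The madd limit is an assumption; the others are
# taken from the processor slides.
PER_CYCLE = {"flops": 2, "madds": 1, "mem refs": 1, "integer ops": 2, "instructions": 4}

CYCLES = 6  # "6 cycles per 2 iterations" from the compiler message
COUNTS = {"flops": 4, "madds": 2, "mem refs": 6, "integer ops": 3, "instructions": 11}


def percent_of_peak(count, cycles, per_cycle):
    """Percentage of peak, truncated to an integer (assumed to match the listing)."""
    return 100 * count // (cycles * per_cycle)


REPORT = {name: percent_of_peak(count, CYCLES, PER_CYCLE[name])
          for name, count in COUNTS.items()}

if __name__ == "__main__":
    for name, percent in REPORT.items():
        print(f"{COUNTS[name]:3d} {name:13s} {percent:3d}% of peak")
```

The 100% figure for memory references is the one that matters: the address unit is saturated, so no schedule for this loop can exceed 1/3 of the floating-point peak.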
