Cray SV1
by Kent Milfeld
U. of Texas, ACCES (Advanced Computing Center for Engineering and Science)

SV1 OUTLINE
• SV1 Processor
• SV1 Memory
• Multi-Streaming Processing (MSP)
• GigaRing

Cray SV1
aurora.hpc.utexas.edu
SMP System: 16 Processors, 16 GB Memory, Vector Processors, 64-bit representation
• SV1-1A, Evolved from J90 Series
• Air Cooled, 4 processors/board

Software Overview
• Queuing system: NQS/NQE
• Compilers and Programming Tools: F90, C++, C { totalview, apprentice, ATExpert, profview, hpm }
• Libraries: libsci, FFIO, IMSL, NAG
• Applications and Tools: abaqus, ls-dyna3d …, G98, gamess, amber, …

Programming Model(s)
• Shared Memory
• Multitasking, OpenMP, Pthreads
• MPI available through MPT (Message Passing Toolkit)
• MSP (MultiStreaming Processing)

Performance Considerations (2x2 diagram: local vs. global, scalar vs. vector)
• Scalar: pipelining or vectorization
• 1 (local): Cache Latency, Instruction Issue: 0.5-1.0*
• 2 (global): Memory Latency, Instruction Issue: 0.5*
• Vector
• 3 (local): Cache Bandwidth, cache blocking: 0.75*
• 4 (global): Memory Bandwidth, Optimal Coding Style: 0.25-0.3*
*Performance Relative to T90

SV1 Processor
• 300 MHz, 0.18 µm technology
• 2 Vector Pipes
• Pipes (8, 9, 16 CP pipes for +, *, /)
• 1.2 GF = 300 MHz * 2 (flops/triad) * 2 (pipes)
• 256KB Vector/Scalar Cache
• Cray 64-bit FP Representation
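
The peak figure above is plain arithmetic: clock rate times floating-point results per pipe per clock times the number of pipes. A minimal sketch that reproduces it, assuming the "2 flops/triad" means one multiply plus one add per pipe per clock (as in a triad a*x + y); the function name `peak_flops` is illustrative and not from the slides:

```python
def peak_flops(clock_hz, flops_per_pipe_per_clock, pipes):
    """Theoretical peak floating-point rate in flop/s."""
    return clock_hz * flops_per_pipe_per_clock * pipes

# Values taken from the slide: 300 MHz, 2 flops per triad, 2 vector pipes.
sv1_cpu_peak = peak_flops(300e6, 2, 2)
print(f"SV1 CPU peak: {sv1_cpu_peak / 1e9:.1f} GF")  # prints "SV1 CPU peak: 1.2 GF"
```

This is a theoretical ceiling only; sustained rates depend on how well a loop keeps both pipes busy.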

[Figure: J90 Block Diagram. Functional Units: Vector, Scalar, Address. Execution registers: 8 S, 8 A, 64 T, 64 B, 8x64 Vector Registers. Shared Registers: 32 SM, 8 SB, 8 ST. Instruction Buffers: 8x32. 128W Cache. Memory.]

[Figure: SV1 Block Diagram, drawn over the J90 Block Diagram. Functional Units: Vector, 2nd Vector, Scalar, Address. Execution registers: 8 S, 8 A, 64 T, 64 B, 8x64 Vector Registers. Shared Registers: 32 SM, 8 SB, 8 ST. Instruction Buffers: 8x32. 32KW “Vector/Scalar” Cache. 128W Cache. Memory.]

Vector/Scalar Cache
• 256KB Cache, 4-way associative, 1 word per cache line
[Figure: PE 1 data path, labels in order: 1.2GF CPU, 9.6 GB/s, 256KB Cache, 9.6 GB/s, VA/VB, Memory fan-in/fan-out, Memory.]
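
The cache geometry on this slide (256KB, 4-way associative, 1 word per line) fixes the number of lines and sets. The sketch below works that out and shows why some strides are cache-unfriendly. It assumes 64-bit (8-byte) words and a simple modulo mapping from word address to set; the slides do not describe the real hardware mapping, so `set_index` is a hypothetical stand-in:

```python
WORD_BYTES = 8             # 64-bit words, per the slides
CACHE_BYTES = 256 * 1024   # 256KB cache
WAYS = 4                   # 4-way associative
LINE_WORDS = 1             # 1 word per cache line

LINES = CACHE_BYTES // (WORD_BYTES * LINE_WORDS)  # 32768 lines (the "32KW" in the block diagram)
SETS = LINES // WAYS                              # 8192 sets of 4 words each

def set_index(word_addr):
    """ASSUMED mapping: the word address modulo the number of sets picks the set."""
    return (word_addr // LINE_WORDS) % SETS

def distinct_sets(stride, n=64):
    """Count the distinct sets touched by n words accessed at a fixed stride."""
    return len({set_index(i * stride) for i in range(n)})

print(LINES, SETS)          # 32768 8192
print(distinct_sets(1))     # 64: unit stride spreads over 64 sets
print(distinct_sets(SETS))  # 1: every access lands in the same set
```

Under this assumed mapping, a 64-element vector accessed with a stride of 8192 words competes for a single 4-word set, so at most 4 of its elements could be resident at once.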

Bandwidth Limits
• CPU Modules: four modules of 4 CPUs each (CPUs 0-3, 4-7, 8-11, 12-15)
• Per CPU: 4 reads, or 2 reads + 2 writes
• Per module (module interface): 8 reads, or 4 writes + 4 reads (8 different sections)
• Memory: 8 sections (section 0 to section 7)
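
The per-CPU and per-module limits above imply that a module is oversubscribed when all four of its CPUs stream from memory at once. A rough sketch of the read-only case follows. It assumes the limits are counted over the same time unit, that the module interface is shared evenly, and that all accesses go to different memory sections; none of this is stated on the slide, and `reads_per_cpu` is an illustrative name:

```python
CPU_READS = 4        # per CPU: 4 reads (or 2 reads + 2 writes)
MODULE_READS = 8     # per module: 8 reads (or 4 writes + 4 reads)
CPUS_PER_MODULE = 4  # 4 processors per board

def reads_per_cpu(active_cpus):
    """Reads available to each CPU when `active_cpus` CPUs on one module
    all read from memory, ASSUMING an even split of the module limit."""
    if not 1 <= active_cpus <= CPUS_PER_MODULE:
        raise ValueError("active_cpus must be between 1 and 4")
    demand = active_cpus * CPU_READS
    return min(demand, MODULE_READS) / active_cpus

for n in range(1, CPUS_PER_MODULE + 1):
    print(n, reads_per_cpu(n))
```

With one or two active CPUs each still gets its full 4 reads; with all four active the module limit halves that to 2 per CPU, which is one reason cache reuse matters for multi-CPU runs.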

SV1 Multiprocessing
• Shared processors: Autotasking (autotask lib: Compiler/Directives); OpenMP (Directives)
• “Dedicated” processors: MSP (Multi-Streaming Processor), 4 CPUs. Implemented in software (by compiler). The compiler creates multiple instruction streams for vector operations on each PE (does not use autotask lib).

[Figure: 8-Pipe MSP. Four 4-PE modules; one MSP groups PE 1, PE 2, PE 3 and PE 4 (8 pipes, 4.8 GFLOPS), with a 6.4 GB/s path to Memory.]
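
To make the MSP idea concrete, the sketch below splits one vectorizable loop into four instruction streams, one per PE, and then into vector-register-sized strips. The slides do not say how the compiler actually divides the work, so the contiguous-chunk scheme and the name `msp_partition` are assumptions for illustration; the strip length of 64 is read off the "8x64 Vector Registers" label in the block diagram:

```python
PES_PER_MSP = 4     # an MSP groups 4 CPUs (PEs)
VECTOR_LENGTH = 64  # elements per vector register (from "8x64 Vector Registers")

def msp_partition(n):
    """Split iterations 0..n-1 into one contiguous chunk per PE (ASSUMED scheme),
    then cut each chunk into half-open (start, stop) strips of at most 64."""
    base, extra = divmod(n, PES_PER_MSP)
    plan, start = [], 0
    for pe in range(PES_PER_MSP):
        size = base + (1 if pe < extra else 0)  # spread any remainder over the first PEs
        strips = [(s, min(s + VECTOR_LENGTH, start + size))
                  for s in range(start, start + size, VECTOR_LENGTH)]
        plan.append(strips)
        start += size
    return plan

plan = msp_partition(1000)
for pe, strips in enumerate(plan, start=1):
    print(f"PE {pe}: iterations {strips[0][0]}..{strips[-1][1] - 1}, {len(strips)} strips")
```

For a 1000-iteration loop each PE receives 250 iterations, processed as three full 64-element strips and one 58-element remainder. Four PEs at 1.2 GF each give the 4.8 GFLOPS peak quoted for the MSP.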

GigaRing
• Two “counter rotating rings”, each 400 MB/sec.
• One GigaRing channel adapter per module.
• Clusters are interconnected through a GigaRing.

[Figure: GigaRing Topology. Labels on the ring: machine, DISK, TAPE, SV1, Ethernet, FDDI, HiPPI.]

GigaRing
• 64-bit Client Interface, Fault Tolerant
• MPN: FDDI, ATM, SCSI, Ethernet
• FCN: RAID-3, 100 MB/sec (2 channels)
• HPN: HiPPI (100/200 MB/s)