SR8000 Concept. Tim Lanfear Hitachi Europe GmbH. t-lanfear@hpcc.hitachi-eu.co.uk. SR8000 Model Range. SR8000 Appearance. Compact Model. Vector vs SMP vs MPP. System Architecture. Cross-bar Inter-node Network. Node (ION). Node (PRN). Node (PRN). CPU. CPU. PCI. System Control.
SR8000 Concept
Tim Lanfear, Hitachi Europe GmbH
t-lanfear@hpcc.hitachi-eu.co.uk
System Architecture
[Diagram labels: Cross-bar Inter-node Network; Node (ION), Node (PRN), Node (PRN); CPU, CPU, PCI; System Control; Network Control; Main Memory; Ether, ATM, HIPPI; Service Processor; Console; RAID Disk]
CPU Architecture
[Diagram labels: Main Memory, Memory Switch, Pre-fetch, Pre-load, Cache, Load, Floating Point Registers, Arithmetic Unit]
• 16 bytes/cycle memory BW
• 128 Kbyte L1 cache
• Pre-fetch and pre-load instructions
• 160 f.p. registers
• 2 f.p. pipelines
• 4 flops/cycle
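The bandwidth and peak-rate figures on this slide imply a machine balance that can be worked out directly. The following is a minimal sketch, assuming 8-byte floating-point operands and a dot-product loop whose two operands both stream from memory; the function names are illustrative and do not come from the slides.

```python
BYTES_PER_CYCLE = 16   # memory bandwidth, from the slide
FLOPS_PER_CYCLE = 4    # peak floating-point rate, from the slide
WORD_BYTES = 8         # assumption: 64-bit operands


def machine_balance():
    """Bytes of memory traffic available per peak flop."""
    return BYTES_PER_CYCLE / FLOPS_PER_CYCLE


def memory_bound_flops_per_cycle(bytes_per_iter, flops_per_iter):
    """Upper bound on flops/cycle for a loop whose data streams from memory."""
    iters_per_cycle = BYTES_PER_CYCLE / bytes_per_iter
    return min(FLOPS_PER_CYCLE, iters_per_cycle * flops_per_iter)


# Dot product: each iteration loads A(i) and B(i), then does one multiply and one add.
dot_bound = memory_bound_flops_per_cycle(2 * WORD_BYTES, 2)
```

Under these assumptions the balance is 4 bytes per flop, and a memory-resident dot product is limited to half the peak rate.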
Slide Window Registers
[Diagram: logical to physical register mapping]
Logical, Base=2: 0 to 15, 16 to 31, 32 to 125, 126-7
Logical, Base=4: 0 to 15, 16 to 31, 32 to 123, 124-7
Physical: sliding part 0 to 127, global part 128 to 159
Legend:
• Registers for all instructions
• Registers for extended instructions only
Notes:
• Fixed registers: 4, 8, 16, 32 (16 illustrated)
• Fixed + sliding = 128
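One way to read the register diagram is as a rotating window: fixed logical registers map to themselves, and the remaining logical registers are offset by the base and wrap around within the sliding part. The sketch below is an inference from the ranges shown on the slide (for example, logical 126-7 wrapping when Base=2), not a statement of the actual hardware rule; `physical_register` is a hypothetical helper.

```python
TOTAL_LOGICAL = 128  # fixed + sliding = 128, from the slide


def physical_register(logical, base, fixed=16):
    """Map a logical register number to a physical one (assumed rotating model)."""
    if not 0 <= logical < TOTAL_LOGICAL:
        raise ValueError("logical register out of range")
    if logical < fixed:
        return logical                      # fixed registers do not slide
    sliding = TOTAL_LOGICAL - fixed         # size of the sliding part
    return fixed + (logical - fixed + base) % sliding


# With 16 fixed registers and Base=2, logical 126 and 127 wrap to the start
# of the sliding part, matching the "126-7" group in the diagram.
wrapped = [physical_register(r, base=2) for r in (126, 127)]
```

Under this model the mapping stays one-to-one for any base, so sliding the window never makes two logical registers share a physical one.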
Instruction Set Extensions
• Load and store with extended registers
• Floating point arithmetic with extended registers
• Slide window control
• Pre-fetch and pre-load
• Thread start-up and finish
• Predicate instructions
SR8000 Programming Instruction Level Parallelism (Pseudo-vector Processing: PVP)
Pre-fetch and Pre-load
[Diagram labels: Main Memory, Memory Switch, Pre-fetch, Pre-load, Cache, Load, Floating Point Registers, Arithmetic Unit]
• Pre-fetch: load cache line from memory to cache
• Pre-load: load one word from memory to register
• 16 streams
Pre-fetch
Iteration 1: PF, latency, LD, use data
Iteration 2: LD, use data
Iteration 3: LD, use data
Iteration 4: LD, use data
Iteration 5: PF, latency, LD, use data
Iteration 6: LD, use data
• Pre-fetch 128 bytes to cache
• Follow by LD to register
Pre-load
Iteration 1: PL, latency, use data
Iteration 2: PL, latency, use data
Iteration 3: PL, latency, use data
Iteration 4: PL, latency, use data
Iteration 5: PL, latency, use data
Iteration 6: PL, latency, use data
• Pre-load 8 bytes to register
• LD not required
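The difference between the two mechanisms shows up in the instruction counts for a streamed array. This is a rough sketch assuming unit-stride access to 8-byte words that start on a cache-line boundary; the helper names are hypothetical.

```python
import math

CACHE_LINE_BYTES = 128  # pre-fetch granularity, from the slide
WORD_BYTES = 8          # pre-load granularity, from the slide


def prefetch_instructions(n_words):
    """Pre-fetches needed to cover n_words contiguous words (one per cache line)."""
    return math.ceil(n_words * WORD_BYTES / CACHE_LINE_BYTES)


def preload_instructions(n_words):
    """Pre-loads needed: one per word, since each brings 8 bytes to a register."""
    return n_words


def load_instructions(n_words, use_preload):
    """Ordinary LDs still required: none with pre-load, one per word with pre-fetch."""
    return 0 if use_preload else n_words


words_per_line = CACHE_LINE_BYTES // WORD_BYTES
```

For contiguous data, pre-fetch amortises one instruction over 16 words but still needs a LD per word; pre-load issues one instruction per word and needs no LD, which may suit strided or indirect access where most of a cache line would go unused.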
Software Pipelining
[Diagram: iterations I=1, I=2, I=3 scheduled with no SWPL, with infinite resource, and with finite resource; the initiation interval is marked; a recurrence is shown as a value defined in one iteration (a=) and used in the next (=a)]
• Resources: registers, f.p. units, instruction issue, memory bandwidth etc
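The effect of the initiation interval can be estimated with the usual software-pipelining cycle count. The sketch below uses the standard textbook model (a new iteration starts every II cycles, and II is bounded below by both resource and recurrence limits); the numbers in the example are made up for illustration.

```python
def cycles_without_swpl(n_iter, iter_latency):
    """Each iteration finishes before the next one starts."""
    return n_iter * iter_latency


def cycles_with_swpl(n_iter, iter_latency, ii):
    """A new iteration starts every ii cycles; the last one still takes iter_latency."""
    if n_iter == 0:
        return 0
    return iter_latency + (n_iter - 1) * ii


def min_initiation_interval(resource_ii, recurrence_ii):
    """The achievable interval is limited by whichever constraint is tighter."""
    return max(resource_ii, recurrence_ii)


# Illustrative numbers only: 3 iterations of 6 cycles each.
serial = cycles_without_swpl(3, 6)
pipelined = cycles_with_swpl(3, 6, min_initiation_interval(2, 1))
```

With unlimited resources and no recurrence the interval could shrink to one cycle; finite registers, f.p. units, issue slots, or memory bandwidth raise it, as does a loop-carried dependence.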
Pseudo-vector Processing
A(:) = A(:) + N
Pseudo-Vector: PF, latency, then LD + ST repeated for each element; the next PF and its latency overlap with the running LD + ST sequence
Vector: VLD, VADD, VST
Effect of PVP
Dot product: S = A(1:N)*B(1:N)
SR8000 Programming Multi-thread Parallelism (Cooperative Microprocessors in a Single Address Space: COMPAS)
COMPAS
COMPAS: Co-operative Micro-Processors in single Address Space
IP: Instruction Processor
[Diagram labels: Multi-dimensional Crossbar Network linking nodes; each node holds several IPs sharing main memory; Automatic Parallel Processing splits a process into threads between COMPAS (Start Inst.) and COMPAS (End Inst.); each thread runs Pre-fetch, Load, Arithmetic, Store, Branch]
Hardware Support and Software
[Diagram, software: one IP runs the scalar part while the other IPs are waiting for startup; after Start Parallel Inst. each IP runs a loop part; after End Parallel Inst. the scalar part resumes]
[Diagram, hardware support: Barrier Synchronization Mechanism connecting the IPs, SC and MS]
IP: Instruction Processor, SC: Storage Controller, MS: Main Storage
Loop Parallelisation

i loop parallelisation
Serial:
    DO i=1,N
      A(i)=B(i)+C(i)
    ENDDO
Parallel:
    [fork]
    DO i=start,end
      A(i)=B(i)+C(i)
    ENDDO
    [join]

j loop parallelisation
Serial:
    DO j=1,M
      W(j)=C(j)+D(j)
      DO i=1,N
        A(i,j)=B(i,j)+W(j)
      ENDDO
    ENDDO
Parallel:
    [fork]
    DO j=start,end
      W(j)=C(j)+D(j)
      DO i=1,N
        A(i,j)=B(i,j)+W(j)
      ENDDO
    ENDDO
    [join]
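The fork/join pattern with per-thread `start,end` bounds can be imitated in a few lines. This is only a sketch: the block-partitioning rule in `block_bounds` is an assumption (the slides do not say how the bounds are chosen), and Python threads stand in for the IPs.

```python
from concurrent.futures import ThreadPoolExecutor


def block_bounds(n, n_threads, tid):
    """1-based inclusive (start, end) for thread tid; hypothetical block split."""
    chunk, extra = divmod(n, n_threads)
    start = tid * chunk + min(tid, extra) + 1
    end = start + chunk - 1 + (1 if tid < extra else 0)
    return start, end


def parallel_add(b, c, n_threads=4):
    """A(i) = B(i) + C(i), with the i loop split across threads."""
    n = len(b)
    a = [0] * n

    def loop_part(tid):
        start, end = block_bounds(n, n_threads, tid)
        for i in range(start, end + 1):       # DO i=start,end
            a[i - 1] = b[i - 1] + c[i - 1]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:   # [fork]
        list(pool.map(loop_part, range(n_threads)))
    # leaving the with-block waits for every thread: [join]
    return a
```

Each thread writes a disjoint slice of `A`, so no locking is needed; that is the property that makes the loop safe to split.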
Loop Parallelisation

i loop parallelisation
Serial:
    DO j=2,M
      DO i=1,N
        A(i,j) = A(i,j-1)+A(i,j)
      ENDDO
    ENDDO
Parallel:
    [fork]
    DO j=2,M
      DO i=start,end
        A(i,j) = A(i,j-1)+A(i,j)
      ENDDO
    ENDDO
    [join]

i loop parallelisation and j loop parallelisation
Serial:
    DO i=1,N
      A(i) = B(i)+C(i)
    ENDDO
    DO j=1,M
      D(j) = E(j)*F(j)
    ENDDO
Parallel:
    [fork]
    DO i=start,end
      A(i) = B(i)+C(i)
    ENDDO
    DO j=start,end
      D(j) = E(j)*F(j)
    ENDDO
    [join]
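In the first example the j loop carries a dependence (column j needs column j-1), so only the i loop is split. The sketch below checks that splitting over i reproduces the serial result; it uses 0-based nested lists indexed `a[i][j]`, assumes a non-empty array, and the block split is again an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_update(a):
    """DO j=2,M / DO i=1,N: A(i,j) = A(i,j-1) + A(i,j), applied to a copy."""
    a = [row[:] for row in a]
    for j in range(1, len(a[0])):
        for i in range(len(a)):
            a[i][j] = a[i][j - 1] + a[i][j]
    return a


def i_parallel_update(a, n_threads=2):
    """Same update with the i range divided among threads; j stays sequential."""
    a = [row[:] for row in a]
    n = len(a)

    def loop_part(tid):
        lo = tid * n // n_threads          # this thread's block of i values
        hi = (tid + 1) * n // n_threads
        for j in range(1, len(a[0])):      # recurrence in j stays inside one thread
            for i in range(lo, hi):
                a[i][j] = a[i][j - 1] + a[i][j]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:   # [fork]
        list(pool.map(loop_part, range(n_threads)))
    return a                                                  # [join] has happened
```

Because each thread only reads and writes its own rows, the recurrence along j never crosses a thread boundary.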
Loop Parallelisation
• *poption parallel: force parallelisation
• *poption tlocal(a,b,i): thread local variables
Serial:
    DO i = 1,N
      CALL sub(a,b,i)
    ENDDO
Parallel:
    [fork]
    DO i = 1,N
      CALL sub(a,b,i)
    ENDDO
    [join]
Section Parallelisation
Execution of independent blocks of code in different threads (sections are always single threaded)
    *poption parallel_sections
    *poption section
          CALL SUB1
    *poption section
          CALL SUB2
    *poption end_parallel_sections
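The same idea, independent blocks each run by one thread with a join at the end, can be sketched with a thread pool. `run_parallel_sections`, `sub1` and `sub2` are illustrative stand-ins, not part of the SR8000 toolchain.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_sections(*sections):
    """Run each section in its own thread and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(sections))) as pool:
        futures = [pool.submit(section) for section in sections]
        return [f.result() for f in futures]   # results in section order


def sub1():
    return sum(range(10))      # stand-in for CALL SUB1


def sub2():
    return max(3, 7)           # stand-in for CALL SUB2


results = run_parallel_sections(sub1, sub2)
```

Each section runs single threaded, as on the slide; the speed-up comes only from running the sections side by side, so it is capped by the longest section.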
Effect of COMPAS
Dot product: S = A(1:N)*B(1:N)
SR8000 Programming Message Passing (MPI)
Remote DMA
Remote DMA Transfer: No Buffering in Kernel, No OS System Call
Normal Transfer: data passes by memory copy through the OS on both nodes
[Diagram labels: Node, Program, OS, Send Buffer, Receive Buffer, data, memory copy, Crossbar Network, Protocol Processing, Context Switch, Interrupt Handling]
Inter-node MPI
[Diagram: three nodes on the Cross-bar Inter-node Network, one MPI process in each]
One MPI process per node; RDMA transfer possible
Intra-node MPI
[Diagram: three nodes on the Cross-bar Inter-node Network, each with several MPI processes communicating through shared memory]
One MPI process per IP; RDMA transfer not possible
SR8000 Parallelism
• Message passing (MPI)
• Multi-thread (COMPAS)
• Instruction level (PVP)
[Diagram labels: Node 1, Node 2]
SR8000 Programming Memory Architecture
Memory Hierarchy
[Diagram labels, top to bottom]
• fp registers (128+32)
• 32 bytes/cycle
• 16 bytes/cycle
• L1 cache (128 Kbyte, 4-way)
• Store buffer (16 entries)
• Other IPs
• 16 bytes/cycle
• Switch
• Memory (2 to 16 Gbyte, 512 banks)
Address Translation
[Diagram labels: Virtual address (Virtual page number, Page offset); Page table; Main memory]
Cache recently used entries of page table in TLB
Large TLB
[Diagram labels: Virtual address (Virtual page number, Page offset); Large page table; Main memory]
Large TLB covers whole address space with 256 entries. Page size 16 Mbyte to 128 Mbyte
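The reach of the TLB follows from the entry count and the page size. A minimal sketch, assuming binary units (1 Mbyte = 2^20 bytes); `tlb_reach` is an illustrative name.

```python
TLB_ENTRIES = 256   # from the slide
MB = 2 ** 20        # assumption: binary megabytes
GB = 2 ** 30


def tlb_reach(page_bytes, entries=TLB_ENTRIES):
    """Bytes of address space that can be mapped without a TLB miss."""
    return entries * page_bytes


reach_smallest_page = tlb_reach(16 * MB)    # 256 x 16 Mbyte
reach_largest_page = tlb_reach(128 * MB)    # 256 x 128 Mbyte
```

With the smallest page the reach is 4 Gbyte and with the largest it is 32 Gbyte, so for the node memory sizes quoted in this deck a suitable page size keeps all of memory mapped at once.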
Memory Address Hashing
[Diagram: address bits 32 to 63; groups of bits are combined by xor to select the memory controller data path and the storage controller data path]
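The purpose of xor-based hashing can be shown with a toy bank-selection function. The slide does not say which address bits are combined, so the choice below (xor of two adjacent 9-bit fields of a line index, giving 512 banks) is purely an assumption for illustration.

```python
N_BANKS = 512     # bank count quoted elsewhere in this deck
BANK_BITS = 9     # log2(512)


def bank_plain(line_index):
    """No hashing: the low bits of the index select the bank."""
    return line_index % N_BANKS


def bank_hashed(line_index):
    """Hypothetical hash: xor the low 9 bits with the next 9 bits."""
    low = line_index & (N_BANKS - 1)
    high = (line_index >> BANK_BITS) & (N_BANKS - 1)
    return low ^ high


def banks_touched(bank_fn, stride, count):
    """Number of distinct banks hit by `count` accesses with the given stride."""
    return len({bank_fn(k * stride) for k in range(count)})
```

With a stride of 512 lines the plain mapping sends every access to the same bank, while the hashed mapping spreads 512 accesses over all 512 banks; unit-stride access still uses every bank in both cases.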
Key Features of SR8000
• High performance RISC CPU with PVP
• High performance node with COMPAS
• High sustained memory bandwidth
• High scalability with fast network
• Low energy and space requirements
SR8000 Programming Performance
Linpack Performance
[Chart: GFlops against number of nodes]
Nodes    GFlops
1        10.88
2        20.50
4        40.76
8        80.25
16       159.51
32       313.32
60       577.49
64       605.30
100      917.15
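The scaling in these figures can be summarised as an efficiency relative to the single-node result. A small sketch using only the numbers on the slide; treating the 1-node figure as the baseline is an assumption, since Linpack problem sizes normally grow with the machine.

```python
# Nodes -> GFlops, copied from the slide.
LINPACK_GFLOPS = {
    1: 10.88, 2: 20.50, 4: 40.76, 8: 80.25, 16: 159.51,
    32: 313.32, 60: 577.49, 64: 605.30, 100: 917.15,
}


def parallel_efficiency(nodes, results=LINPACK_GFLOPS):
    """Measured rate divided by nodes times the single-node rate."""
    return results[nodes] / (nodes * results[1])


efficiencies = {n: parallel_efficiency(n) for n in LINPACK_GFLOPS}
```

On this measure the efficiency stays above 0.8 out to 100 nodes.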
NAS Parallel FT
[Bar chart: GFlops (axis 0 to 35) against Number of Nodes (1, 2, 4, 8) for Class A, Class B and Class C. Data labels in extracted order: 28.78, 27.95, 26.16, 15.10, 14.84, 14.01, 8.37, 8.31, 7.92, 5.39, 5.14]