Next KEK machine Shoji Hashimoto (KEK) @ 3rd ILFT Network Workshop at Jefferson Lab., Oct. 3-6, 2005
KEK supercomputer Leading computing facility of its time • 1985 Hitachi S810/10 350 MFlops • 1989 Hitachi S820/80 3 GFlops • 1995 Fujitsu VPP500 128 GFlops • 2000 Hitachi SR8000 F1 1.2 TFlops • 2006 ??? Shoji Hashimoto (KEK)
Formality • “KEK Large Scale Simulation Program”: a call for proposals of projects to be performed on the supercomputer. • Open to Japanese researchers working on high energy accelerator science (particle and nuclear physics, astrophysics, accelerator physics, material science related to the Photon Factory). • The Program Advisory Committee (PAC) decides the approval and machine-time allocation. Shoji Hashimoto (KEK)
Usage Lattice QCD is the dominant user. • About 60-80% of the computer time goes to lattice QCD. • Among that, ~60% is for the JLQCD collaboration. • Others include Hatsuda-Sasaki, Nakamura et al., Suganuma et al., Suzuki et al. (Kanazawa), … • Simulation for accelerator design is another big user: beam-beam simulation for the KEK B factory. Shoji Hashimoto (KEK)
JLQCD collaboration • 1995~ (on VPP500) • Continuum limit in the quenched approximation: ms, BK, fB, fD Shoji Hashimoto (KEK)
JLQCD collaboration • 2000~ (on SR8000) • Dynamical QCD with the improved Wilson fermion: mV vs mPS², fB, fBs, Kl3 form factor Shoji Hashimoto (KEK)
Around the triangle Shoji Hashimoto (KEK)
The wall Chiral extrapolation: very hard to go beyond ms/2, a problem for every physical quantity. Maybe solved by the new algorithms and machines… (plot: JLQCD Nf=2 (2002), MILC coarse lattice (2004), the new generation of dynamical QCD) Shoji Hashimoto (KEK)
Upgrade Thanks to Hideo Matsufuru (Computing Research Center, KEK) for his hard work. • Upgrade scheduled for March 1st, 2006. • Bids were called for from vendors. • At least 20x more computing power, measured mainly using the QCD codes. • No restriction on architecture (scalar or vector, etc.), but some part must be a shared-memory machine. • The decision was made recently. Shoji Hashimoto (KEK)
The next machine A combination of two systems: • Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak performance. • IBM Blue Gene/L, 10 racks, 57.3 TFlops peak performance. Hitachi Ltd. is the prime contractor. Shoji Hashimoto (KEK)
Hitachi SR11000 K1 Will be announced tomorrow. • POWER5+: 2.1 GHz, dual core, 2 simultaneous multiply/adds per cycle (8.4 GFlops/core), 1.875 MB L2 (on chip), 36 MB L3 (off chip) • 8.5 GB/s chip-memory bandwidth, hardware and software prefetch • 16-way SMP (134.4 GFlops/node), 32 GB memory (DDR2 SDRAM) • 16 nodes (2.15 TFlops) • Interconnect: Federation switch, 8 GB/s (bidirectional) Shoji Hashimoto (KEK)
SR11000 node Shoji Hashimoto (KEK)
16-way SMP Shoji Hashimoto (KEK)
High Density Module Shoji Hashimoto (KEK)
IBM Blue Gene/L • Node: 2 PowerPC 440 cores (dual core), 700 MHz, double FPU (5.6 GFlops/chip), 4 MB on-chip L3 (shared), 512 MB memory. • Interconnect: 3D torus, 1.4 Gbps/link (6 in + 6 out) from each node. • Midplane: 8x8x8 nodes (2.87 TFlops); rack = 2 midplanes. • 10-rack system. All the information in the following comes from the IBM Redbooks (ibm.com/redbooks) and articles in the IBM Journal of Research and Development. Shoji Hashimoto (KEK)
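As a quick check on these numbers, the peak figures for both systems follow from the per-core rates quoted on the two spec slides; the short C sketch below is purely illustrative arithmetic, not part of any benchmark code.

#include <stdio.h>

int main(void) {
    /* Hitachi SR11000 K1: 2.1 GHz POWER5+, 2 fused multiply-adds/cycle = 4 flops/cycle per core */
    double sr_core  = 2.1 * 4;             /* 8.4 GFlops/core */
    double sr_node  = sr_core * 16;        /* 16-way SMP: 134.4 GFlops/node */
    double sr_total = sr_node * 16;        /* 16 nodes: ~2.15 TFlops */

    /* IBM Blue Gene/L: 700 MHz PPC440, double FPU = 2 fused multiply-adds/cycle per core, 2 cores */
    double bg_chip     = 0.7 * 4 * 2;      /* 5.6 GFlops/chip */
    double bg_midplane = bg_chip * 8*8*8;  /* 512 nodes: ~2.87 TFlops */
    double bg_total    = bg_midplane * 2 * 10;  /* 2 midplanes/rack, 10 racks: ~57.3 TFlops */

    printf("SR11000: %.1f GF/core, %.1f GF/node, %.2f TF total\n",
           sr_core, sr_node, sr_total / 1000.0);
    printf("BG/L:    %.1f GF/chip, %.2f TF/midplane, %.1f TF total\n",
           bg_chip, bg_midplane / 1000.0, bg_total / 1000.0);
    return 0;
}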
BG/L system 10 Racks Shoji Hashimoto (KEK)
BG/L node ASIC • Double Floating-Point Unit (FPU) added to the PPC440 core: 2 fused multiply-adds per core per cycle. • Not a true SMP: L1 has no cache coherency, L2 has a snoop; shared 4 MB L3. • Communication between the two cores goes through the “multiported shared SRAM buffer”. • Embedded memory controller and networks. Shoji Hashimoto (KEK)
Compute node modes • Virtual node mode: use both CPUs separately, running a different process on each core; communication via MPI, etc.; memory and bandwidth are shared. • Co-processor mode: use the secondary processor as a co-processor for communication; peak performance is ½. • Hybrid node mode: use the secondary processor also for computation; needs special care because of the L1 cache incoherency; used for Linpack. Shoji Hashimoto (KEK)
QCD code optimization Jun Doi and Hikaru Samukawa (IBM Japan): • Use the virtual node mode • Fully exploit the Double FPU (hand-written assembler code) • Use a low-level communication API Shoji Hashimoto (KEK)
Double FPU • SIMD extension of the PPC440. • 32 pairs of 64-bit FP registers; the primary and secondary files share register addresses. • Quadword load and store. • Primary and secondary pipelines, with a fused multiply-add in each pipe. • Cross operations possible; best suited for complex arithmetic. Shoji Hashimoto (KEK)
Examples Shoji Hashimoto (KEK)
SU(3) matrix*vector
y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];
The complex multiplication u[0][0] * x[0] maps onto two Double-FPU instructions:
FXPMUL (y[0],u[0][0],x[0]) : re(y[0]) = re(u[0][0])*re(x[0]); im(y[0]) = re(u[0][0])*im(x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0]) : re(y[0]) += -im(u[0][0])*im(x[0]); im(y[0]) += im(u[0][0])*re(x[0])
The remaining terms + u[0][1] * x[1] + u[0][2] * x[2] become:
FXCPMADD (y[0],u[0][1],x[1],y[0]) ; FXCXNPMA (y[0],u[0][1],x[1],y[0])
FXCPMADD (y[0],u[0][2],x[2],y[0]) ; FXCXNPMA (y[0],u[0][2],x[2],y[0])
These must be combined with the other rows to avoid a pipeline stall (5-cycle wait). Shoji Hashimoto (KEK)
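For reference, here is a minimal scalar C sketch of the SU(3) matrix-times-vector product that the Double-FPU instruction sequence above implements; the complex type and the function name are illustrative only, not taken from the JLQCD code.

#include <complex.h>

/* Scalar reference for y = U * x with a 3x3 complex matrix U (illustrative,
   not the production code); each u[i][j]*x[j] term corresponds to one
   FXPMUL/FXCPMADD plus FXCXNPMA pair on the Double FPU. */
static void su3_mult(double complex y[3],
                     const double complex u[3][3],
                     const double complex x[3])
{
    for (int i = 0; i < 3; i++)
        y[i] = u[i][0] * x[0] + u[i][1] * x[1] + u[i][2] * x[2];
}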
Scheduling • The 32+32 registers can hold 32 complex numbers: 3x3 (=9) for a gauge link, 3x4 (=12) for a spinor, and two spinors are needed for input and output. • Load the gauge link while computing, using 6+6 registers. • Straightforward for y += U*x, but not so for y += conjg(U)*x. • Use the inline assembler of gcc; xlf and xlc have intrinsic functions. Early xlf/xlc was not good enough to produce this code, but has improved recently. Shoji Hashimoto (KEK)
Parallelization on BG/L Example: 24³x48 lattice. • Use the virtual node mode. • For one midplane, divide the entire lattice onto 2x8x8x8 processors; for one rack, 2x8x8x16 (the 2 is intra-node). • To use more than one rack, a 32³x64 lattice is the minimum. • Each processor holds a 12x3x3x6 (or 12x3x3x3) local lattice, as in the sketch below. Shoji Hashimoto (KEK)
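A small illustrative sketch of the decomposition arithmetic behind those local sizes; the assignment of the processor-grid dimensions to the lattice directions is an assumption made here for clarity.

#include <stdio.h>

int main(void) {
    /* Global 24^3 x 48 lattice divided over the virtual-node processor grid;
       the direction ordering (x, y, z, t) is assumed for illustration. */
    int global[4]         = {24, 24, 24, 48};
    int procs_midplane[4] = { 2,  8,  8,  8};   /* one midplane -> 12x3x3x6 local lattice */
    int procs_rack[4]     = { 2,  8,  8, 16};   /* one rack     -> 12x3x3x3 local lattice */

    for (int d = 0; d < 4; d++)
        printf("dir %d: local size %d (midplane), %d (rack)\n",
               d, global[d] / procs_midplane[d], global[d] / procs_rack[d]);
    return 0;
}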
Communication • Communication is fast: 6 links to nearest neighbors; 1.4 Gbps (bi-directional) per link; latency is 140 ns for one hop. • MPI is too heavy: it needs an additional buffer copy, which wastes cache and memory bandwidth. • Multi-threading is not available in the virtual node mode. • Overlapping computation and communication is not possible within MPI. Shoji Hashimoto (KEK)
“QCD Enhancement Package” Low-level communication API • Directly send/recv by accessing the torus interface FIFO; no copy to a memory buffer. • Non-blocking send; blocking recv. • Up to 224 bytes of data per send/recv (spinor at one site = 192 bytes). • Assumes nearest-neighbor communication. Shoji Hashimoto (KEK)
An example
#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30
BGLNetQuad* fifo;
BGLNet_Send_WaitReady(BGLNET_X_PLUS,fifo,6);
for(i=0;i<Nx;i++){
  // put results into registers 24--29
  BGLNet_Send_Enqueue_Header(fifo);  // create the packet header and put it into the send buffer
  BGLNet_Send_Enqueue(fifo,24);      // put the data into the send buffer
  BGLNet_Send_Enqueue(fifo,25);
  BGLNet_Send_Enqueue(fifo,26);
  BGLNet_Send_Enqueue(fifo,27);
  BGLNet_Send_Enqueue(fifo,28);
  BGLNet_Send_Enqueue(fifo,29);
  BGLNet_Send_Packet(fifo);          // kick!
}
Shoji Hashimoto (KEK)
Benchmark • Wilson solver (BiCGstab): 24³x48 lattice on a midplane (8x8x8 = 512 nodes, half rack); 29.2% of the peak performance, or 32.6% if only the Dslash is measured. • Domain-wall solver (CG): 24³x48 lattice on a midplane, Ns=16; does not fit in the on-chip L3; ~22% of the peak performance. Shoji Hashimoto (KEK)
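For orientation, those fractions of peak translate into the following sustained rates on one midplane; a rough back-of-the-envelope sketch using the 2.87 TFlops midplane peak quoted earlier (the percentages are from the slide, the absolute numbers are computed here).

#include <stdio.h>

int main(void) {
    double midplane_peak = 2867.2;  /* GFlops: 512 nodes x 5.6 GFlops/node */

    /* Fractions of peak quoted on the benchmark slide. */
    printf("Wilson BiCGstab: ~%.0f GFlops\n", 0.292 * midplane_peak);  /* ~837 GFlops */
    printf("Wilson Dslash:   ~%.0f GFlops\n", 0.326 * midplane_peak);  /* ~935 GFlops */
    printf("Domain-wall CG:  ~%.0f GFlops\n", 0.22  * midplane_peak);  /* ~631 GFlops */
    return 0;
}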
Comparison Compared with Vranas @ Lattice 2004: ~50% improvement. Shoji Hashimoto (KEK)
Physics target “Future opportunities: ab initio calculations at the physical quark masses” • Using dynamical overlap fermions • Details are under discussion (actions, algorithms, etc.) • A primitive code has been written; test runs are ongoing on the SR8000. • Many things to do by March… Shoji Hashimoto (KEK)
Summary • The new KEK machine will be made available to the Japanese lattice community on March 1st, 2006. • Hitachi SR11000 (2.15 TF) + IBM Blue Gene/L (57.3 TF) Shoji Hashimoto (KEK)