Altix 4700
ccNUMA Architecture • Distributed Memory - Shared address space
Altix HLRB II – Phase 2 • 19 partitions with 9728 cores • Each with 256 Itanium dual-core processors, i.e., 512 cores • Clock rate 1.6 GHz • 4 Flops per cycle per core • 12.8 GFlop/s per processor (6.4 GFlop/s per core) • 13 high-bandwidth partitions • Blades with 1 processor (2 cores) and 4 GB memory • Front-side bus 533 MHz (8.5 GB/s) • 6 high-density partitions • Blades with 2 processors (4 cores) and 4 GB memory • Same memory bandwidth • Peak performance: 62.3 TFlop/s (6.4 GFlop/s per core) • Memory: 39 TB
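The peak figures follow directly from the clock rate and core count; as a quick check: 1.6 GHz × 4 Flops/cycle = 6.4 GFlop/s per core, 2 cores × 6.4 GFlop/s = 12.8 GFlop/s per processor, and 9728 cores × 6.4 GFlop/s ≈ 62.3 TFlop/s system peak.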
Memory Hierarchy • L1D • 16 KB, 1 cycle latency, 25.6 GB/s bandwidth • cache line size 64 bytes • L2D • 256 KB, 6 cycles, 51 GB/s • cache line size 128 bytes • L3 • 9 MB, 14 cycles, 51 GB/s • cache line size 128 bytes
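These latency steps can be observed on a running system with a pointer-chasing microbenchmark. The sketch below is not from the slides and only illustrates the idea: it walks a randomly permuted cyclic list whose working-set size is chosen to land in L1, L2, L3, or main memory (the sizes used here are illustrative).

/* Pointer-chasing sketch: the average time per dependent load approximates
 * the latency of whichever level of the hierarchy holds the working set. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t n, long iters)
{
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    /* Sattolo's algorithm: one random cycle through all n elements,
     * which defeats hardware prefetching. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    size_t pos = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < iters; k++)
        pos = next[pos];                      /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    if (pos == (size_t)-1) puts("");          /* keep pos live */
    free(next);
    return ns / iters;
}

int main(void)
{
    size_t kib[] = { 8, 128, 4096, 65536 };   /* roughly L1, L2, L3, memory */
    for (int i = 0; i < 4; i++) {
        size_t n = kib[i] * 1024 / sizeof(size_t);
        printf("%6zu KiB working set: %.1f ns per access\n",
               kib[i], chase_ns(n, 10000000L));
    }
    return 0;
}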
Interconnect • NUMAlink 4 • 2 links per blade • Each link 2 × 3.2 GB/s bandwidth • MPI latency 1–5 µs
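The quoted MPI latency can be verified with a standard ping-pong test between two ranks; a minimal sketch follows (not part of the slides; the iteration count and output format are arbitrary choices).

/* Minimal MPI ping-pong: half the round-trip time of a 0-byte message
 * approximates the one-way MPI latency over the interconnect. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000;
    char byte = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);
    MPI_Finalize();
    return 0;
}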
Disks • Direct attached disks (temporary large files) • 600 TB • 40 GB/s bandwidth • Network attached disks (Home Directories) • 60 TB • 800 MB/s bandwidth
Environment • Footprint: 24 m × 12 m • Weight: 103 metric tons • Electrical power: ~1 MW
NUMAlink Building Block • [Diagram] Compute blades and IO blades (PCI/FC, SAN switch, 10 GE) attached to four NUMAlink 4 level-1 routers • 8 cores per blade group with high-bandwidth blades, 16 cores with high-density blades
Interconnection of Partitions • Gray squares • 1 partition with 512 cores • L: Login, B: Batch • Lines • 2 NUMAlink 4 planes with 16 cables • each cable: 2 × 3.2 GB/s
Interactive Partition • Login cores • 32 for compile & test • Interactive batch jobs • 476 cores • managed by PBS • daytime interactive usage • small-scale and nighttime batch processing • single partition only • High-density blades • 4 cores share each blade's memory • [Diagram: core allocation of the interactive partition: 4 OS cores, login blades, and batch blades]
18 Batch Partitions • Batch jobs • 510 (508) cores • managed by PBS • large-scale parallel jobs • single or multi-partition jobs • 5 partitions with high-density blades • 13 partitions with high-bandwidth blades • [Diagram: core allocation per batch partition: 4 OS cores, remaining blades for batch jobs]
Coherence Implementation • SHUB2 supports up to 8192 SHUBs (32768 cores) • Coherence domain up to 1024 SHUBs (4096 cores) • SGI term: "Sharing mode" • Directory with one bit per SHUB • Multiple shared copies are supported • Accesses from other coherence domains • SGI term: "Exclusive sharing mode" • Always translated into exclusive accesses • Only a single copy is supported • Directory stores the address of the SHUB (13 bits)
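The two directory formats can be pictured as follows. This is only an illustrative sketch of the idea; the field and type names are assumptions, and only the widths mentioned on the slide (one presence bit per SHUB within a 1024-SHUB domain, a 13-bit SHUB pointer for remote domains) come from the source.

/* Illustrative sketch of the two directory-entry styles described above;
 * not SGI's actual data layout. */
#include <stdint.h>

#define SHUBS_PER_DOMAIN 1024   /* sharing mode: one presence bit per SHUB */

/* Sharing mode: bit vector, multiple shared copies inside the domain. */
typedef struct {
    uint64_t presence[SHUBS_PER_DOMAIN / 64];  /* bit i set => SHUB i holds a copy */
} dir_sharing_mode_t;

/* Exclusive sharing mode: a line referenced from another coherence domain
 * has exactly one owner; the directory stores that SHUB's number. */
typedef struct {
    unsigned int owner_shub : 13;   /* 13 bits address up to 8192 SHUBs system-wide */
    unsigned int valid      : 1;
} dir_exclusive_mode_t;

/* Example operations on the sketched formats. */
static inline void dir_add_sharer(dir_sharing_mode_t *d, unsigned shub)
{
    d->presence[shub / 64] |= (uint64_t)1 << (shub % 64);
}

static inline unsigned dir_owner(const dir_exclusive_mode_t *d)
{
    return d->owner_shub;
}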
SHMEM Latency Model for Altix • SHMEM get latency is the sum of: • 80 nsec for the function call • 260 nsec for memory latency • 340 nsec for the first hop • 60 nsec per hop • 20 nsec per meter of NUMAlink cable • Example • 64-processor system: max hops is 4, max total cable length is 4 meters • Total SHMEM get latency: 80 + 260 + 340 + 60×4 + 20×4 = 1000 nsec
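The model is simple enough to evaluate directly; a small sketch (the function name is ours, the constants are the ones listed above):

/* Evaluate the SHMEM get latency model from the slide:
 * 80 ns call overhead + 260 ns memory latency + 340 ns first hop
 * + 60 ns per hop + 20 ns per meter of NUMAlink cable. */
#include <stdio.h>

static double shmem_get_latency_ns(int hops, double cable_meters)
{
    return 80.0 + 260.0 + 340.0 + 60.0 * hops + 20.0 * cable_meters;
}

int main(void)
{
    /* The 64-processor example: 4 hops and 4 meters of cable -> 1000 ns. */
    printf("64P worst case: %.0f ns\n", shmem_get_latency_ns(4, 4.0));
    return 0;
}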
Parallel Programming Models • Intra-host (512 cores, one Linux image): OpenMP, Pthreads, MPI, SHMEM, global segments • Intra-coherency domain (4096 cores) and across the entire machine: MPI, SHMEM, global segments
Barrier Synchronization • Frequent in OpenMP, SHMEM, and MPI single-sided operations (MPI_Win_fence) • Tree-based implementation using multiple fetch-op variables to minimize contention on the SHUB • Uses uncached loads to reduce NUMAlink traffic • [Diagram: CPUs polling a fetch-op variable located in the HUB, reached via the router]
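The core mechanism of a fetch-op barrier can be sketched with a single atomic counter and sense reversal. The Altix implementation distributes several such counters over a tree of SHUBs and polls with uncached loads; the plain C11 sketch below is not SGI's implementation and does not model those optimizations.

/* Sense-reversing barrier built on an atomic fetch-and-subtract counter. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* threads still to arrive in this episode */
    atomic_bool sense;     /* flips once per completed barrier episode */
    int         nthreads;
} fetchop_barrier_t;

void barrier_init(fetchop_barrier_t *b, int nthreads)
{
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread keeps its own local_sense (initially false) and passes a
 * pointer to it on every call. */
void barrier_wait(fetchop_barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last arrival: reset the counter and release the others */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        /* spin until the last arrival flips the shared sense flag */
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* on Altix this poll would use an uncached load */
    }
}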
Programming Models • OpenMP within a Linux image • MPI • SHMEM • Shared segments (System V and Global Shared Memory)
SHMEM • Can be used for MPI programs where all processes execute the same code • Enables access within and across partitions • Remotely accessible data: static data and symmetric heap data (shmalloc or shpalloc) • info: man intro_shmem
Example
#include <mpp/shmem.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    long source[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    static long target[10];        /* static => symmetric, remotely accessible */
    int myrank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        /* put 10 elements into target on PE 1 */
        shmem_long_put(target, source, 10, 1);
    }
    shmem_barrier_all();           /* sync sender and receiver */
    if (myrank == 1)
        printf("target[0] on PE %d is %ld\n", myrank, target[0]);

    MPI_Finalize();
    return 0;
}
Global Shared Memory Programming • Allocation of a shared memory segment via the collective GSM_alloc • Similar to memory-mapped files or System V shared segments, but those are limited to a single OS instance • A GSM segment can be distributed across partitions • GSM_ROUNDROBIN: pages are distributed round-robin across processes • GSM_SINGLERANK: places all pages near a single process • GSM_CUSTOM_ROUNDROBIN: each process specifies how many pages should be placed in its memory • Data structures can be placed in this memory segment and accessed from all processes with normal load and store instructions
Example
#include <mpi.h>
#include <mpi_gsm.h>     /* SGI MPT Global Shared Memory interface */
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_LEN 1024   /* illustrative array length */

int main(int argc, char **argv)
{
    int rank, rc, i, placement, flags;
    size_t size;
    int *shared_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    placement = GSM_ROUNDROBIN;
    flags = 0;
    size = ARRAY_LEN * sizeof(int);
    rc = GSM_Alloc(size, placement, flags, MPI_COMM_WORLD, &shared_buf);

    // Have one rank initialize the shared memory region
    if (rank == 0) {
        for (i = 0; i < ARRAY_LEN; i++)
            shared_buf[i] = i;
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // Have every rank verify it can read from the shared memory
    for (i = 0; i < ARRAY_LEN; i++) {
        if (shared_buf[i] != i) {
            printf("ERROR!! element %d = %d\n", i, shared_buf[i]);
            printf("Rank %d - FAILED shared memory test.\n", rank);
            exit(1);
        }
    }

    MPI_Finalize();
    return 0;
}
Summary • Altix 4700 is a ccNUMA system • >60 TFlop/s • MPI messages sent with two-copy or single-copy protocol • Hierarchical coherence implementation • Intranode • Coherence domain • Across coherence domains • Programming models • OpenMP • MPI • SHMEM • GSM