Origin 2000 ccNUMA Architecture
Joe Goyette, Systems Engineer
goyette@sgi.com
Presentation Overview • 1. ccNUMA Basics • 2. SGI’s ccNUMA Implementation (O2K) • 3. Supporting OS Technology • 4. SGI’s NextGen ccNUMA (O3K) (brief) • 5. Q&A
ccNUMA • cc: cache coherent • NUMA: Non-Uniform Memory Access • Memory is physically distributed throughout the system • memory and peripherals are globally addressable • Local memory accesses are faster than remote accesses (Non-Uniform Memory Access = NUMA) • Local accesses on different nodes do not interfere with each other
Typical SMP Model (diagram): processors, each with a snoopy cache, share a central bus connecting main memory and I/O.
Typical MPP Model (diagram): independent nodes, each with its own operating system, processor, main memory, and I/O, connected by an interconnect network (e.g. GSN, 100BaseT, Myrinet).
Scalable Shared-Memory Systems (ccNUMA) • Shared-memory systems (SMP) are easy to program but hard to scale; massively parallel systems (MPP) are easy to scale but hard to program • Scalable cache-coherent memory (ccNUMA) aims to be both easy to program and easy to scale
Origin ccNUMA vs. Other Architectures • Compared with conventional SMP, other NUMA systems, and clusters/MPP, Origin offers: • Single address space • Modular design • All aspects scale as the system grows • Low-latency, high-bandwidth global memory
Origin ccNUMA Advantage (diagram): interconnect bisection bandwidth comparison of Origin 2000 ccNUMA (nodes linked by routers in a hypercube) versus other NUMA, fixed-bus SMP, and clusters/MPP.
IDC: NUMA is the future
"Buses are the preferred approach for SMP implementations because of their relatively low cost. However, scalability is limited by the performance of the bus."
"NUMA SMP ... appears to be the preferred memory architecture for next-generation systems."
- IDC, September 1998

Architecture type    1996 share   1997 share   Change
Bus-based SMP        54.7%        41.0%        -13.7 pts.
NUMA SMP              3.9%        20.8%        +16.9 pts.
Message Passing      16.4%        15.3%         -1.1 pts.
Switch-based SMP     13.0%        12.1%         -0.9 pts.
Uni-processor         9.4%         5.5%         -3.9 pts.
NUMA (uni-node)       1.5%         5.3%         +3.8 pts.

Source: High Performance Technical Computing Market: Review and Forecast, 1997-2002, International Data Corporation, September 1998
SGI’s First Commercial ccNUMA Implementation Origin 2000 Architecture
History of Multiprocessing at SGI (timeline, 1993-2000): Challenge (2-36 CPUs, 1993); Origin 2000 ccNUMA introduced with 2-32 CPUs (1996), later growing to 2-64 and then 2-256 CPUs; Origin 3000 (2-1024 CPUs, 2000).
Origin 2000 Logical Diagram: 32-CPU hypercube (3D) built from 16 nodes (N) connected by 8 routers (R).
Origin 2000 Node Board (basic building block, diagram): two processors with caches attached to a Hub, plus main memory and directory memory (with extended directory for systems >32P).
MIPS R12000 CPU
• 64-bit RISC design, 0.25-micron CMOS process
• Single-chip four-way superscalar RISC dataflow architecture
• 5 fully-pipelined execution units
• Supports speculative and out-of-order execution
• 8MB L2 cache (Origin 2000), 4MB (Origin 200)
• 32KB 2-way set-associative instruction and data caches
• 2,048-entry branch prediction table
• 48-entry active list
• 32-entry two-way set-associative Branch Target Address Cache (BTAC)
• Doubled L2 way prediction table for improved L2 hit rate
• Improved branch prediction using a global history mechanism
• Improved performance monitoring support
• Maintains code and instruction set compatibility with R10000
Memory Hierarchy
• 1. local CPU registers
• 2. local CPU cache (5 ns)
• 3. local memory (318 ns)
• 4. remote memory (554 ns)
• 5. remote caches
• Remote memory latency is only about 1.7x local memory latency (554 ns vs. 318 ns)
Directory-Based Cache Coherency
• Cache coherency: the system hardware guarantees that every cached copy remains a true reflection of the memory data, without software intervention.
• Each directory entry consists of two parts:
• a. an 8-bit integer identifying the node that has exclusive ownership of the data
• b. a bit map recording which nodes have copies of the data in their caches
Cache Example
• 1. Data is read into cache for a thread on CPU 0
• 2. Threads on CPUs 1 and 2 read the data into their caches
• 3. The thread on CPU 2 updates the data in its cache (the cache line is set exclusive)
• 4. Eventually the cache line gets invalidated
• (The sketch below shows how the directory entry might track these steps.)
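The following is a minimal, purely illustrative Fortran sketch (not SGI's Hub hardware logic) of how the directory entry described above (an owner field plus a sharer bit map) might evolve through these four steps; the program and variable names are invented, and one CPU per node is assumed for simplicity:
• program dirdemo
• c-----Illustrative model of one directory entry: an owner node and a
• c-----bit map of sharing nodes.  Assumes one CPU per node.
• integer owner, sharers, NONE
• parameter (NONE = -1)
• owner = NONE
• sharers = 0
• c-----1. CPU 0 reads: its node's bit is set in the sharer bit map
• sharers = ibset(sharers, 0)
• c-----2. CPUs 1 and 2 read: their bits are set as well
• sharers = ibset(sharers, 1)
• sharers = ibset(sharers, 2)
• c-----3. CPU 2 writes: the line goes exclusive to node 2 and the
• c-----other cached copies must be invalidated
• owner = 2
• sharers = ibset(0, 2)
• c-----4. Eventually the line is invalidated and the entry is cleared
• owner = NONE
• sharers = 0
• print *, 'owner =', owner, ' sharers =', sharers
• end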
Router and Interconnect Fabric (Global Switch Interconnect)
• 6-way non-blocking crossbar (9.3 Gbytes/sec)
• Link Level Protocol (LLP) uses CRC error checking
• 1.56 Gbyte/sec (peak full-duplex) per port
• Packet delivery prioritization (credits, aging)
• Uses internal routing table and supports wormhole routing
• Internal buffers (SSR/SSD) down-convert 390MHz external signaling to core frequency
• Three ports connect to external 100-conductor NumaLink cables
Origin 2000 Module (diagram): four node boards (the basic building block, each with two processors and caches, a Hub, main memory, and directory) plug into a midplane with two router boards and an XBOW I/O crossbar.
Modules become Systems: deskside module (2-8 CPUs), rack of 2 modules (16 CPUs), multi-rack of 4 modules (32 CPUs), and so on up to 128 CPUs and beyond.
Origin 2000 Grows to Single Rack
• Single Rack System
• 2-16 CPUs
• 32GB Memory
• 24 XIO I/O slots
Origin 2000 Grows to Multi-Rack
• Multi-Rack System
• 17-32 CPUs
• 64GB Memory
• 48 XIO I/O slots
• 32-processor hypercube building block
Origin 2000 Grows to Large Systems
• Large Multi-Rack Systems
• 2-256 CPUs
• 512GB Memory
• 384 I/O slots
Origin 2000 Bandwidth Scales (chart): STREAM Triad results for Origin 2000/300MHz and Origin 2000/250MHz compared with Sun UE10000, Compaq/DEC 8400, HP/Convex V, and HP/Convex SPP.
Performance on HPC Job Mix (chart): SPECfp_rate95 results for Origin at 300MHz, 250MHz, and 195MHz compared with IBM, DEC, Sun, and HP systems.
Enabling Technologies IRIX: NUMA Aware OS and System Utilities
Default Memory Placement
• Memory is allocated on a "first-touch" basis: on the node where the process that defines the page is running, or as close as possible (to minimize latency)
• Developers should therefore initialize work areas in the newly created threads that will use them (see the sketch below)
• The IRIX scheduler maintains process affinity
• It re-schedules jobs on the processor where they ran last, or on the other CPU in the same node, or as close as possible (to minimize latency)
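A minimal sketch of first-touch initialization, using the OpenMP directives introduced later in this talk; the program name is invented, and the point is only that each thread touches (and therefore places) the portion of the array it will later work on:
• program firsttouch
• integer n, i
• parameter (n = 8*1024*1024)
• real a(n)
• c-----parallel initialization: each thread touches its own chunk first,
• c-----so first-touch placement puts those pages on that thread's node
• !$OMP PARALLEL DO PRIVATE(i), SHARED(a)
• do i = 1, n
• a(i) = 0.0
• end do
• !$OMP END PARALLEL DO
• c-----subsequent parallel work on a(i) now finds its pages locally
• print *, a(1)
• end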
Alternatives to the "first-touch" policy
• Round-Robin Allocation
• Data is distributed at run time among all nodes used for execution
• setenv _DSM_ROUND_ROBIN
Dynamic Page Migration
• IRIX can track run-time memory access patterns and dynamically copy pages to a new node
• This is an expensive operation: it requires a daemon, TLB invalidations, and the memory copy itself
• setenv _DSM_MIGRATION ON
• setenv _DSM_MIGRATION_LEVEL 90
Explicit Placement: source directives
• integer i, j, n, niters
• parameter (n = 8*1024*1024, niters = 1000)
• c-----Note that the distribute directive is used after the arrays
• c-----are declared.
• real a(n), b(n), q
• c$distribute a(block), b(block)
• c-----initialization
• do i = 1, n
• a(i) = 1.0 - 0.5*i
• b(i) = -10.0 + 0.01*(i*i)
• enddo
• c-----real work
• do it = 1, niters
• q = 0.01*it
• c$doacross local(i), shared(a,b,q), affinity(i) = data(a(i))
• do i = 1, n
• a(i) = a(i) + q*b(i)
• enddo
• enddo
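A sketch of how one might compile and run a program using these MIPSpro directives; the -mp flag is my recollection of how the compiler enables c$distribute/c$doacross processing, and the file name explicit_place.f is hypothetical:
• f77 -mp explicit_place.f
• setenv MP_SET_NUM_THREADS 4
• ./a.out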
Explicit Placement: dprof / dplace
• Used for applications that don't use libmp (i.e. explicit sproc, fork, pthreads, MPI, etc.)
• dprof: profiles an application's memory access pattern
• dplace can (see the sketch below):
• Change the page size used
• Enable page migration
• Specify the topology used by the threads of a parallel program
• Indicate resource affinities
• Assign memory ranges to particular nodes
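A hedged sketch of what dplace invocations might look like; the option names are from memory of the IRIX dplace(1) man page and should be treated as illustrative rather than exact, and placement.spec is a hypothetical placement file describing the desired memory/thread topology:
• dplace -place placement.spec ./a.out
• dplace -data_pagesize 64k ./a.out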
SGI 3rd Generation ccNUMA Implementation Origin 3000 Family
Compute Module vs. Bricks (diagram): the Origin 2000 compute module is replaced in Origin 3000 by separate system "bricks": C-Brick (CPUs), R-Brick (router), P-Brick (PCI I/O).
Taking Advantage of Multiple CPUs Parallel Programming Models Available on Origin Family
Many Different Models and Tools To Choose From
• Automatic Parallelization Option: compiler flags
• Compiler source directives: OpenMP, c$doacross, etc.
• Explicit multi-threading: pthreads, sproc
• Message passing APIs: MPI, PVM
Computing Value of π: Simple Serial
• program compute_pi
• integer n, i
• double precision w, x, sum, pi, f, a
• c function to integrate
• f(a) = 4.d0 / (1.d0 + a*a)
• print *, 'Enter number of intervals:'
• read *, n
• c calculate the interval size
• w = 1.0d0/n
• sum = 0.0d0
• do i = 1, n
• x = w * (i - 0.5d0)
• sum = sum + f(x)
• end do
• pi = w * sum
• print *, 'computed pi =', pi
• stop
• end
Automatic Parallelization Option
• Add-on option for the SGI MIPSpro compilers
• The compiler searches for loops that it can parallelize
• f77 -apo compute_pi.f
• setenv MP_SET_NUM_THREADS 4
• ./a.out
OpenMP Source Directives
• program compute_pi
• integer n, i
• double precision w, x, sum, pi, f, a
• c function to integrate
• f(a) = 4.d0 / (1.d0 + a*a)
• print *, 'Enter number of intervals:'
• read *, n
• c calculate the interval size
• w = 1.0d0/n
• sum = 0.0d0
• !$OMP PARALLEL DO PRIVATE(X), SHARED(W), REDUCTION(+:sum)
• do i = 1, n
• x = w * (i - 0.5d0)
• sum = sum + f(x)
• end do
• !$OMP END PARALLEL DO
• pi = w * sum
• print *, 'computed pi =', pi
• stop
• end
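A sketch of compiling and running the OpenMP version; the -mp flag is my recollection of how the MIPSpro compilers enable directive processing, and OMP_NUM_THREADS is the standard OpenMP thread-count variable:
• f77 -mp compute_pi.f
• setenv OMP_NUM_THREADS 4
• ./a.out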
Message Passing Interface (MPI)
• program compute_pi
• include 'mpif.h'
• integer n, i, myid, numprocs, rc, ierr
• double precision w, x, sum, mypi, pi, f, a
• c function to integrate
• f(a) = 4.d0 / (1.d0 + a*a)
• call MPI_INIT(ierr)
• call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
• call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
• if (myid .eq. 0) then
• print *, 'Enter number of intervals:'
• read *, n
• endif
• call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
• c calculate the interval size
• w = 1.0d0/n
• sum = 0.0d0
• do i = myid+1, n, numprocs
• x = w * (i - 0.5d0)
• sum = sum + f(x)
• end do
Message Passing Interface (MPI)
• mypi = w * sum
• c collect all the partial sums
• call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
• $MPI_COMM_WORLD,ierr)
• c node 0 prints the answer
• if (myid .eq. 0) then
• print *, 'computed pi =', pi
• endif
• call MPI_FINALIZE(rc)
• stop
• end
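A sketch of building and running the MPI version on IRIX, assuming SGI's Message Passing Toolkit (MPT) is installed; the library and launcher names below are as I recall them:
• f77 compute_pi.f -lmpi
• mpirun -np 4 ./a.out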