Porting NANOS on SDSM. GOAL: Porting a shared memory environment to distributed memory. What is missing from current SDSMs ? Christian Perez. Who am I ? December 1999 : PhD at LIP, ENS Lyon, France. Data parallel languages, distributed memory, load balancing, preemptive thread migration
Porting NANOS on SDSM • GOAL : Porting a shared memory environment to distributed memory. What is missing from current SDSMs ? • Christian Perez
Who am I ? • December 1999 : PhD at LIP, ENS Lyon, France • Data parallel languages, distributed memory, load balancing, preemptive thread migration • Winter 1999/2000 : TMR at UPC • OpenMP, Nanos, SDSM • October 2000 : INRIA researcher • Distributed programs, code coupling
Contents • Motivation • Related work • Nanos execution model (NthLib) • Nanos on top of 2 SDSMs (JIAJIA & DSM-PM2) • Missing SDSM functionalities • Conclusion
Motivation • OpenMP : emerging standard • simplicity (no data distribution) • Cluster of machines (mono or multiprocessors) • excellent ratio performance / price • OpenMP on top of a cluster !
OpenMP / Cluster : HOW ? • OpenMP paradigm : shared memory • Cluster paradigm : message passing • Use of software DSM system ! • Hardware DSM system : SCI (write: 2 µs) • specific hardware • not yet stable
Related work • Several OpenMP/DSM implementations • OpenMP NOW!, Omni • But, • Modification of OpenMP semantics • One level of parallelism • Do not exploit high performance networks
OpenMP on classical DSM • Compiler extracts shared data from the stack • Expensive local variable creation • shared memory allocation • Modification of the OpenMP standard : • variables default to private instead of shared • New synchronization primitives : • condition variables & semaphores
OpenMP on classical DSM • One level of parallelism (SPMD)
OpenMP source :
    !$omp parallel do
    do i = 1,4
      x(i) = x(i) + x(i+1)
    end do
Generated SPMD code :
    call schedule(lb, ub, …)
    do i = lb, ub
      x(i) = x(i) + x(i+1)
    end do
    call dsm_barrier()
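The `schedule(lb, ub, …)` call above is elided on the slide. As a minimal sketch of what such a call might compute, assuming a static block distribution, the C function below hands each process one contiguous range of iterations. The name `static_schedule` and its parameters are hypothetical and are not taken from any of the DSM systems discussed here.

```c
#include <assert.h>

/* Hypothetical stand-in for the slide's schedule(lb, ub, ...) call.
 * Gives process `rank` (0-based) out of `nprocs` a contiguous block of
 * the inclusive iteration range [first, last].
 * An empty block is reported as *lb > *ub. */
void static_schedule(int first, int last, int rank, int nprocs,
                     int *lb, int *ub)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    int n = last - first + 1;      /* total number of iterations */
    if (n < 0)
        n = 0;

    int chunk = n / nprocs;        /* base block size */
    int extra = n % nprocs;        /* the first `extra` ranks get one more */

    int start = rank * chunk + (rank < extra ? rank : extra);
    int len   = chunk + (rank < extra ? 1 : 0);

    *lb = first + start;
    *ub = *lb + len - 1;
}
```

With the loop `do i = 1,4` and two processes, rank 0 receives `i = 1,2` and rank 1 receives `i = 3,4`. Every process then meets the others at the single global barrier, which is why this scheme offers only one level of parallelism.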
Omni compilation approach Taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/
Our goals • Support OpenMP standard • High performance • Allow exploitation of • multithreading (SMP) • high performance networks
Nanos OpenMP compiler • Convert an OpenMP program to a task graph • Communications via shared memory
    !$omp parallel do
    do i = 1,4
      x(i) = x(i) + x(i+1)
    end do
Resulting tasks : i=1,2 and i=3,4
NthLib runtime support • Nanos compiler generates intermediate code • Communications still via shared memory
    call nthf_depadd(…)
    do nth_p = 1, proc
      nth = nthf_create_1s(…,f,…)
    end do
    call nth_block()

    subroutine f(…)
      x(i) = x(i) + x(i+1)
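The generated code above registers dependencies, creates one nano-thread per processor, then blocks until they finish. A minimal single-threaded sketch of that kind of dependency counting is given below. It is only an illustration: the `task` type and the functions `task_init`, `dep_add`, `task_ready` and `task_finish` are hypothetical names, not the NthLib API, and the exact semantics of `nthf_depadd` are not stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work descriptor: a counter of unfinished predecessors
 * and a single successor to notify on completion. */
typedef struct task {
    int pending;              /* predecessors that have not finished yet */
    struct task *successor;   /* task to notify, or NULL */
    int done;                 /* set once the task has finished */
} task;

void task_init(task *t, task *successor)
{
    t->pending = 0;
    t->successor = successor;
    t->done = 0;
}

/* Declare that `t` must wait for `n` more tasks. */
void dep_add(task *t, int n)
{
    assert(n >= 0);
    t->pending += n;
}

/* A task may run once all of its predecessors have finished. */
int task_ready(const task *t)
{
    return t->pending == 0 && !t->done;
}

/* Mark `t` as finished and release one dependency of its successor. */
void task_finish(task *t)
{
    assert(task_ready(t));
    t->done = 1;
    if (t->successor != NULL) {
        assert(t->successor->pending > 0);
        t->successor->pending--;
    }
}
```

In this sketch the blocked parent plays the role of the successor: it becomes ready again only after every child it created has called `task_finish`. On a DSM, the counter itself is shared data, so each update would also need the mutual exclusion and consistency actions discussed later in the talk.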
NthLib details • Assumes it runs on top of kernel threads • Provides user-level threads (QT) • Stack management (allocation) • Stack initialization (arguments) • Explicit context switch
Nthlib queues • Global/Local • Thread descriptor • Rich functionalities • Work descriptor • High performance
Nthlib : Memory management • Nano-thread descriptor • Successors • Stack • Guard zone • Mutual exclusion • mmap allocation • SLOT_SIZE stack alignment
Porting Nthlib to SDSM • Data consistency • Shared memory management • Nanos threads • JIAJIA implementation • DSM-PM2 implementation • Summary of DSM requirements
Data consistency • Mutual exclusion for defined data structures : Acquire/Release • User level shared memory data : Barrier
Shared memory management • Asynchronous shared memory allocation • Alignment parameter (> PAGE_SIZE) • Global variables / common declarations : not yet supported
Nano-threads • Run-to-block execution model • Shared stacks (father/sons relationship) • Implicit thread migration (scheduler)
JIAJIA • Developed in China by W. Hu, W. Shi & Z. Tang • Public domain DSM • User level DSM • DSM : lock/unlock, barrier, cond. variables • MP : send/receive, broadcast, reduce • Solaris, AIX, Irix, Linux, NT (not distributed)
JIAJIA : Memory Allocation • No control of memory alignment (x2) • Synchronous memory allocation primitive • Development of an RPC version : • Based on the send/receive primitives • Addition of a user level message handler • Problems : • Global lock • Interference with JIAJIA blocking functions
JIAJIA : Discussion • Global barrier for data synchronization : no multiple levels of parallelism • Not thread-aware : no efficient use of SMP nodes
DSM/PM2 • Developed at LIP by G. Antoniu (PhD student) • Public domain • User level, module of PM2 • Generic and multi-protocol DSM • DSM : lock/unlock • MP : LRPC • Linux, Solaris, Irix (32 bits)
PM2 organization • MAD1 : TCP, PVM, MPI, SCI, VIA, SBP • MARCEL : MONO, SMP, ACTIVATION • PM2 • DSM • TBX • NTBX • MAD2 : TCP, MPI, SCI, VIA, BIP • http://www.pm2.org
DSM/PM2 : Memory Allocation • Only static memory allocation : a dynamic memory allocation primitive had to be built • Centralized memory allocation • LRPC to Node 0 • Integration of an alignment parameter • Summer 2000 : dynamic memory allocation ready !
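To illustrate the centralized allocation with an alignment parameter described above, here is a minimal sketch of the bookkeeping that the allocating node (Node 0 on the slide) might keep. It only tracks offsets inside a shared region and assumes a simple bump allocator. The names `shared_heap`, `align_up` and `shared_alloc` are hypothetical and are not part of DSM-PM2, and the LRPC that would carry the request to Node 0 is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator state kept on the allocating node:
 * offsets into one shared region, never reused (bump allocation). */
typedef struct {
    size_t next;    /* first free offset */
    size_t limit;   /* total size of the shared region */
} shared_heap;

#define ALLOC_FAILED ((size_t)-1)

/* Round `off` up to a multiple of `align`, which must be a power of two. */
size_t align_up(size_t off, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}

/* Reserve `size` bytes aligned on `align`.
 * Returns the offset of the block, or ALLOC_FAILED if it does not fit. */
size_t shared_alloc(shared_heap *h, size_t size, size_t align)
{
    size_t start = align_up(h->next, align);

    if (start > h->limit || size > h->limit - start)
        return ALLOC_FAILED;

    h->next = start + size;
    return start;
}
```

An alignment larger than the page size, as NthLib needs for its SLOT_SIZE-aligned stacks, is simply passed as `align`. Serving every request from one node keeps the allocator trivial but makes that node a serialization point.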
DSM/PM2 : marcel descriptor • Page boundary • marcel_t • (sp&MASK)+SLOT_SIZE • NthLib requirement : one kernel thread runs many nano-threads
DSM/PM2 : marcel descriptor Page boundary marcel_t (sp&MASK)+SLOT_SIZE marcel_t* Page boundary *((sp&MASK)+SLOT_SIZE)
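The expression `(sp&MASK)+SLOT_SIZE` relies on stacks living in SLOT_SIZE-aligned slots, so any stack pointer can be mapped to the end of its slot with one mask and one add. The sketch below shows that computation and the extra indirection through a stored descriptor pointer. Several points are assumptions made for illustration: the 64 KiB slot size, the `thread_desc` type standing in for `marcel_t`, and the choice to keep the pointer in the last word inside the slot rather than exactly at the slot end.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed values: a 64 KiB slot, which must be a power of two. */
#define SLOT_SIZE ((uintptr_t)0x10000)
#define MASK      (~(SLOT_SIZE - 1))

/* Hypothetical descriptor standing in for marcel_t. */
typedef struct {
    int id;
} thread_desc;

/* End address of the slot containing `sp`: (sp & MASK) + SLOT_SIZE. */
uintptr_t slot_end(uintptr_t sp)
{
    return (sp & MASK) + SLOT_SIZE;
}

/* Assumed layout: the last pointer-sized word of the slot holds a
 * pointer to the descriptor instead of the descriptor itself. */
thread_desc **desc_cell(uintptr_t sp)
{
    return (thread_desc **)(slot_end(sp) - sizeof(thread_desc *));
}

/* Descriptor of whichever thread currently runs on the stack at `sp`. */
thread_desc *current_desc(uintptr_t sp)
{
    return *desc_cell(sp);
}
```

Storing a pointer rather than the descriptor itself is what lets one kernel thread serve many nano-threads: every nano-thread stack can point at the descriptor of the kernel thread that is currently executing it, and that pointer can be rewritten when the nano-thread is scheduled elsewhere.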
DSM/PM2 : Discussion • Using page level sequential consistency • + No need for barriers (multiple levels of parallelism) • – False sharing : dedicated stack layout (marcel_t*, page boundary, pad, page boundary)
DSM/PM2 : Discussion (cont) • No alternate stack for the signal handler : prefetch pages before context switch, O(n) • Pad to next page before opening parallelism (page boundary, shared data, pad, page boundary)
DSM/PM2 improvement • Availability of an asynchronous DSM malloc • Lazy data consistency protocol in evaluation • eager consistency, multiple writer • scope consistency • Support for stack in shared memory (LINUX)
DSM/PM2 shared stack support marcel_t SEGV stack (sp&MASK)+SLOT_SIZE
DSM/PM2 shared stack support marcel_t SEGV stack SEGV stack (sp&MASK)+SLOT_SIZE
DSM requirement • Support of static global shared variables • Efficient code • remove one indirection level • Enable use of classical compilers • Support for common blocks • « Sharedization » of already allocated memory : dsm_to_shared(void* p, size_t size);
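Since the DSMs discussed here work at page granularity, a primitive such as the proposed `dsm_to_shared(void* p, size_t size)` would presumably have to act on every page touched by the region, not only on the bytes requested. The sketch below computes that covering page range. It is an assumption about how such a primitive might start, not a description of an existing implementation; `page_range` and `covering_pages` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Page-aligned region covering a user buffer. */
typedef struct {
    uintptr_t start;   /* address of the first page */
    size_t length;     /* multiple of the page size, 0 for an empty range */
} page_range;

/* Smallest run of whole pages containing [p, p + size).
 * `page_size` must be a power of two. */
page_range covering_pages(const void *p, size_t size, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uintptr_t mask  = ~((uintptr_t)page_size - 1);
    uintptr_t first = (uintptr_t)p & mask;
    uintptr_t end   = ((uintptr_t)p + size + page_size - 1) & mask;

    page_range r = { first, (size_t)(end - first) };
    return r;
}
```

A 32-byte buffer that straddles a page boundary yields two pages. Any unrelated data sitting in those pages becomes shared as well, which is one reason padding to page boundaries comes up repeatedly in this talk.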
DSM requirement • Support for multiple levels of parallelism • Partial barrier • group management • Dependencies support • like acquire/release but without lock
DSM requirement • Support for multiple levels of parallelism • Partial barrier • group management • Dependencies support • like acquire/release but without lock • e.g. start(1), start(2), stop(1), stop(2), update(1,2)
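A partial barrier with group management differs from a global barrier in that only the members of one group wait for each other, so independent groups (for example nested parallel regions) do not synchronize. The sketch below shows just the arrival counting for such a group. It is single-threaded and deliberately incomplete: a real DSM version would need mutual exclusion on the counter and the consistency actions (page updates or invalidations) for the group's data. The `group_barrier` type and its functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical barrier restricted to one group of threads. */
typedef struct {
    int members;   /* number of threads in the group */
    int arrived;   /* arrivals recorded in the current phase */
    int phase;     /* number of completed phases */
} group_barrier;

void group_barrier_init(group_barrier *b, int members)
{
    assert(members > 0);
    b->members = members;
    b->arrived = 0;
    b->phase = 0;
}

/* Record one arrival. Returns true for the arrival that completes the
 * phase, at which point the group's waiters could be released.
 * Not thread safe: the counter update is unprotected in this sketch. */
bool group_barrier_arrive(group_barrier *b)
{
    b->arrived++;
    if (b->arrived < b->members)
        return false;

    b->arrived = 0;   /* reset for the next phase */
    b->phase++;
    return true;
}
```

Two groups created this way advance independently: completing a phase in one group leaves the other group's counters untouched, which is the property a single global barrier cannot provide.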
Summary of DSM requirements • Support of static global shared variables • « Sharedization » of already allocated memory • Acquire/release primitive • Partial barrier, group management • Asynchronous shared memory allocation • Alignment parameter to memory allocation • Threads (SMP nodes) • Optimized stack management
Conclusion • Nanos successfully ported to 2 DSMs : JIAJIA & DSM-PM2 • DSM requirements to obtain performance : • Support for the MIMD model • Automatic thread migration • Performance ?