Software Distributed Shared Memory (SDSM): MultiView
• SDSM, false sharing
• Solution: MultiView
• Granularity adaptation
• Integrated services
Ayal Itzkovitz, Assaf Schuster
DSM Innovations - MultiView

A multi-core system (simplified)
[Figure: four cores attached to one local memory]
• A parallel program may spawn processes (threads) in order to utilize all computing units
• Processes communicate through shared memory, physically located on the local machine

A distributed system
[Figure: hosts, each with a core and local memory, connected by a network and presenting a virtual shared memory]
• Emulation of the same programming paradigm
• Ultimately: no changes to source/binary code

The First SDSM System
• The first SDSM system: Ivy [Li & Hudak, Yale, ’86]
• Strict memory semantics (Lamport’s sequential consistency)
• Page-based: memory pages are the units of sharing
• The major performance limitation: large page size leads to false sharing
  • Page size: 4 KB (and more)
  • Average object size: 28 bytes
  • About 150 objects on a page

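As a quick check of the last bullet, assuming a 4 KB page means 4096 bytes and taking the 28-byte average object size quoted on the slide:

```latex
\[
\frac{4096\ \text{bytes per page}}{28\ \text{bytes per object}} \approx 146\ \text{objects per page}
\]
```

This is the figure the slide rounds to "about 150". Any two of those objects that are used by different hosts can trigger false sharing on the page.
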
[Figure: Object Distribution]
[Figure: Object Distribution – Memory View, hosts connected by a network]

False Sharing
“…the conventional wisdom remains that the overhead of false sharing […] in page-based consistency protocols is the primary factor limiting the performance of software SDSM” [Amza, Cox, Rajamani, and Zwaenepoel, PPoPP ’97]
“[The] conventional wisdom holds that fine-grain performance and false sharing doom page-based approaches” [Buck and Keleher, IPPS ’98]

Solution: The MultiView Approach
• “MultiView and Millipage – Fine-grain Sharing in Page-based SDSMs” [Itzkovitz and Schuster, OSDI ’99]
• Implement small-size pages through a special memory configuration
Other goals:
• Preserve strict memory consistency [ICS’04, EuroPar’04]
• Utilize low-latency networks (Myrinet, VIA/ServerNet-II, InfiniBand) [Hot-Interconnects’03, IPDPS’04]
• Transparency [EuroPar’03]
• Adaptive sharing granularity [ICPP’00, IPDPS’01 best paper]
• Maximize locality through migration and load sharing [DISC’01]
• Additional “service layers” (garbage collection, data-race detection) [JPDC’01, JPDC’02]

The Traditional Memory Layout
[Figure: variables x, y, z, w, v, u placed next to each other in memory (traditional layout)]
    struct a { … };
    struct b { … };
    int x, y, z;
    main() {
        w = malloc(sizeof(struct a));
        v = malloc(sizeof(struct a));
        u = malloc(sizeof(struct b));
        …
    }

The MultiView Technique
[Figure: the traditional layout next to the MultiView layout of variables x, y, z, w, v, u]
• Variables reside in the same page but are not shared
• The same memory object is mapped through several views (View 1, View 2, View 3)
• Protection is now set independently for each view (for example RW, NA, R)
[Figure: memory layout with the memory object and Views 1 to 3]
[Figure: Host A and Host B each map the memory object through Views 1 to 3; every view carries its own protection (R, RW, or NA)]

Enabling Technology
• Memory-mapped I/O, created for inter-process communication
[Figure: shared memory object]

Implementation: Millipage
• A shared memory object can also be used by a single process to provide the desired functionality
• Windows NT (also Solaris, BSD, Linux)
• CreateFileMapping() and MapViewOfFileEx() for allocating views

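The idea of mapping one memory object through several independently protected views can be sketched with POSIX calls. This is a minimal illustration, not Millipage itself: the slides name the Windows calls, while this sketch swaps in `tmpfile()`, `mmap()`, and `mprotect()`, and the names `mv_object`, `mv_create`, and `mv_protect` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MV_MAX_VIEWS 8   /* assumed upper bound for this sketch */

/* Hypothetical handle: one memory object mapped through several views. */
typedef struct {
    size_t size;                 /* bytes in the memory object (page multiple) */
    int    nviews;               /* number of views currently mapped           */
    void  *view[MV_MAX_VIEWS];   /* base address of each view                  */
} mv_object;

/* Map the same temporary file `nviews` times. Returns 0 on success. */
int mv_create(mv_object *mo, size_t size, int nviews)
{
    if (nviews < 1 || nviews > MV_MAX_VIEWS) return -1;
    FILE *f = tmpfile();                       /* backing memory object */
    if (!f) return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)size) != 0) { fclose(f); return -1; }
    mo->size = size;
    mo->nviews = 0;
    for (int i = 0; i < nviews; i++) {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {                 /* undo the views mapped so far */
            while (mo->nviews > 0) munmap(mo->view[--mo->nviews], size);
            fclose(f);
            return -1;
        }
        mo->view[mo->nviews++] = p;
    }
    fclose(f);   /* the mappings stay valid after the descriptor is closed */
    return 0;
}

/* Change the protection of one view only; the other views are unaffected. */
int mv_protect(mv_object *mo, int v, int prot)
{
    assert(v >= 0 && v < mo->nviews);
    return mprotect(mo->view[v], mo->size, prot);
}
```

With three views of one page, a value written through view 0 is visible through view 1, and setting view 0 to `PROT_NONE` leaves view 1 readable. Accessing each variable only through "its" view is what lets variables on the same physical page carry different protections.
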
Transparency
    mat = malloc(lines*sizeof(int*));
    for(i=0;i<lines;i++)
        mat[i] = malloc(cols*sizeof(int));
    …
    mat[i][j] = mat[i-1][j]+mat[i][j-1];
    …

    mat = malloc(lines*cols*sizeof(int));
    …
    mat[i][j] = mat[i-1][j]+mat[i][j-1];
    …
• 1999:
  • Minipages are allocated at malloc time (via a malloc-like API)
  • Allocation routines should be slightly modified
  • SOR and LU have not been modified at all
  • WATER: changed ~20 lines out of 783 lines
  • IS: changed 5 lines out of 93 lines
  • TSP: changed ~15 lines out of ~400 lines
• 2003: complete transparency
  • Through binary instrumentation/interception of OS calls

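To make "minipages are allocated at malloc time" concrete, here is a small bookkeeping sketch. It is an assumption about how such a malloc-like routine could work, not the Millipage allocator: every allocation becomes one minipage, and minipages that land in the same page receive different view indices. The names `mv_heap`, `mv_alloc`, `PAGE_SIZE`, and `MAX_VIEWS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u   /* assumed page size */
#define MAX_VIEWS 32u     /* assumed cap on minipages (views) per page */

typedef struct {
    size_t   offset;  /* offset of the minipage inside the memory object */
    size_t   size;    /* size in bytes, rounded up to a multiple of 8 */
    unsigned view;    /* view through which the application addresses it */
} minipage;

typedef struct {
    size_t   next;        /* bump pointer into the memory object */
    size_t   capacity;    /* total bytes in the memory object */
    size_t   cur_page;    /* page index of the most recent allocation */
    unsigned used_views;  /* views already handed out in cur_page */
} mv_heap;

void mv_heap_init(mv_heap *h, size_t capacity)
{
    h->next = 0;
    h->capacity = capacity;
    h->cur_page = 0;
    h->used_views = 0;
}

/* Allocate one minipage. Returns 0 on success, -1 if it cannot be placed. */
int mv_alloc(mv_heap *h, size_t size, minipage *out)
{
    if (size == 0 || size > PAGE_SIZE) return -1;   /* sketch: small objects only */
    size = (size + 7u) & ~(size_t)7u;               /* keep 8-byte alignment */
    size_t start = h->next;
    /* A minipage must not straddle a page boundary. */
    if (start / PAGE_SIZE != (start + size - 1) / PAGE_SIZE)
        start = (start / PAGE_SIZE + 1) * PAGE_SIZE;
    size_t page = start / PAGE_SIZE;
    if (page != h->cur_page) { h->cur_page = page; h->used_views = 0; }
    if (h->used_views == MAX_VIEWS) {               /* page ran out of views */
        start = (page + 1) * PAGE_SIZE;
        h->cur_page = page + 1;
        h->used_views = 0;
    }
    if (start + size > h->capacity) return -1;
    out->offset = start;
    out->size   = size;
    out->view   = h->used_views++;
    h->next     = start + size;
    return 0;
}
```

The application-visible address of a minipage would then be the base address of view `view` plus `offset`, so two objects in the same page are reached through different views and can be protected separately.
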
[Figure: SOR SPLASH-II Benchmark]
[Figure: Performance with Fixed Granularity (NBodyW on 8 nodes)]
[Figure: False Sharing vs. Prefetching (WATER)]

Adapting Granularity
[Figure: shared data elements over application run time]
• Adaptation is dynamic, automatic, and transparent

Performance (VIA/ServerNet-II, 2004)
[Figure: two panels, "Water-nsq speedup (one thread per node)" and "Water-nsq speedup (two threads per node)"; x-axis: nodes (1 to 12); y-axis: speedup. Legend: SC/MV - fine granularity; HLRC; Mixed consistency; SC/MV - best static granularity; SC/MV - dynamic granularity]

Integrating Data Race Detection
• Detection at application-variable granularity

Integrating Distributed Garbage Collection (Remote Reference Counting)
• Collection at native application granularity

Questions?

Types of Parallel Systems
[Figure axes: communication efficiency and scalability]
1. In-core multi-threading
2. Multi-core/SMP multi-threading
3. Tightly-coupled cluster, customized interconnect (SGI’s Altix)
4. Tightly-coupled cluster, off-the-shelf interconnect (InfiniBand)
5. WAN, Internet, Grid, peer-to-peer
Traditionally: 1+2 are programmable using shared memory, 3+4 are programmable using message passing, and in 5 peer processes communicate with central control only.
HDSM: systems in 3 move towards presenting a shared memory interface to a physically distributed system.
What about 4 and 5? Software Distributed Shared Memory = SDSM

Matrix Multiplication
    A = malloc(MATSIZE);
    B = malloc(MATSIZE);
    C = malloc(MATSIZE);
    parfor(n) mult(A, B, C);

    mult(id):
        for (line = N*id .. N*(id+1))
            for (col = 0..N)
                C[line,col] = multline(A[line], B[col]);
[Figure: two threads; A and B are read-only matrices (R), C is the write matrix (W)]

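The pseudocode above can be written out as plain C. This is a sketch under assumptions: matrices are square, stored row-major as `double`, and the `parfor` is replaced by calling `mult` once per worker id; the parameter list is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Worker `id` of `nworkers` computes its block of rows of C = A x B.
 * A, B, and C are n x n matrices in row-major order.
 * A and B are only read; each worker writes a disjoint row block of C. */
void mult(int id, int nworkers, int n,
          const double *A, const double *B, double *C)
{
    assert(nworkers > 0 && id >= 0 && id < nworkers);
    int first = (int)((long)n * id / nworkers);        /* first row of the block */
    int last  = (int)((long)n * (id + 1) / nworkers);  /* one past the last row  */
    for (int line = first; line < last; line++) {
        for (int col = 0; col < n; col++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[line * n + k] * B[k * n + col];
            C[line * n + col] = sum;
        }
    }
}
```

Because workers never write the same element of C, any page of C that bounces between hosts does so only because two row blocks happen to share that page, which is the false-sharing case shown on the following slides.
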
Matrix Multiplication
[Figure sequence: two hosts connected by a network compute A x B = C; pages are marked RO or RW; A and B are each sent once]

Matrix Multiplication - False Sharing
[Figure sequence: two hosts connected by a network compute A x B = C; pages are marked RO, RW, or NA; A and B are each sent once]

First Approach: Weak Semantics
• Example: Release Consistency
  • Allow multiple writers to a page (assume exclusive update for any portion of the page)
  • Each page has a twin copy
  • At synchronization time, all pages perform a “diff” against their twins and send the diffs to managers
  • Managers hold master copies
[Figure: two hosts each hold an RW copy of a page and its twin; the diffs are applied]

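The twin-and-diff steps listed above can be sketched in a few lines of C. This is an illustration under assumptions (a 4 KB page treated as 1024 32-bit words, word-granularity diffs); the names `make_twin`, `compute_diff`, and `apply_diff` are hypothetical and do not come from any particular system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_WORDS 1024   /* assumed: 4 KB page of 32-bit words */

typedef struct {
    size_t   index;   /* word position inside the page */
    uint32_t value;   /* new contents of that word     */
} diff_entry;

/* Before the first write, keep an unmodified copy of the page (the twin). */
void make_twin(const uint32_t *page, uint32_t *twin)
{
    memcpy(twin, page, PAGE_WORDS * sizeof *page);
}

/* At synchronization time, record every word that differs from the twin.
 * `out` must have room for PAGE_WORDS entries. Returns the entry count. */
size_t compute_diff(const uint32_t *page, const uint32_t *twin, diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (page[i] != twin[i]) {
            out[n].index = i;
            out[n].value = page[i];
            n++;
        }
    }
    return n;
}

/* The manager merges a received diff into its master copy. */
void apply_diff(uint32_t *master, const diff_entry *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(d[i].index < PAGE_WORDS);
        master[d[i].index] = d[i].value;
    }
}
```

If host A changes word 3 and host B changes word 900 of the same page, each produces a one-entry diff and the master copy ends up with both updates. Note that `compute_diff` still scans the whole page, which is the processing overhead the next slide points out.
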
First Approach: Weak Semantics
• Allow memory to reside in an inconsistent state for time intervals
• Enforce consistency only at synchronization points
• Reaching a consistent view of the memory requires computation
• Reduces (but does not always eliminate) false sharing
• Reduces the number of protocol messages
• Weak memory semantics
• Involves both memory and processing time overhead
• Still: coarse-grain sharing (why diff at locations that were not touched?)

Software DSM Evolution - Weak Semantics
[Timeline; track shown: page-grain, relaxed consistency]
• IVY (Li & Hudak), ’86, Yale
• Munin, ’92, Release Consistency, Rice
• Midway, ’93, Entry Consistency, CMU
• TreadMarks, ’94, Lazy Release Consistency, Rice
• Brazos, ’97, Scope Consistency, Rice

Software DSM Evolution - Multithreading
[Timeline; tracks shown: page-grain (relaxed consistency) and multithreading]
• IVY (Li & Hudak), ’86, Yale
• Munin, ’92, Release Consistency, Rice
• Midway, ’93, Entry Consistency, CMU
• TreadMarks, ’94, Lazy Release Consistency, Rice
• CVM, Millipede, ’96, multi-protocol, Maryland and Technion
• Brazos, ’97, Scope Consistency, Rice
• Quarks, ’98, protocol latency hiding, Utah

Second Approach: Code Instrumentation
• Example: binary rewriting
  • Wrap each load and store with instructions that check whether the data is available locally

Source:
    line += 3;
    v = v - line;

Compiled:
    load  r1, ptr[line]
    load  r2, ptr[v]
    add   r1, 3h
    store r1, ptr[line]
    sub   r2, r1
    store r2, ptr[v]

After code instrumentation:
    push  ptr[line]
    call  __check_r
    load  r1, ptr[line]
    push  ptr[v]
    call  __check_r
    load  r2, ptr[v]
    add   r1, 3h
    push  ptr[line]
    call  __check_w
    store r1, ptr[line]
    push  ptr[line]
    call  __done
    sub   r2, r1
    push  ptr[v]
    call  __check_w
    store r2, ptr[v]
    push  ptr[v]
    call  __done

After optimization:
    push  ptr[line]
    call  __check_w
    load  r1, ptr[line]
    push  ptr[v]
    call  __check_w
    load  r2, ptr[v]
    add   r1, 3h
    store r1, ptr[line]
    push  ptr[line]
    call  __done
    sub   r2, r1
    store r2, ptr[v]
    push  ptr[v]
    call  __done

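The same wrapping can be illustrated at the C level. This is a single-process simulation under assumptions: a fixed block granularity, a per-block access state table, and a stand-in `fetch_block` that only counts what would be a remote fetch. `dsm_load`, `dsm_store`, and the other names are hypothetical, and the `__done` call from the slide is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_BLOCK 8u    /* assumed fixed sharing granularity, in ints */
#define NUM_BLOCKS      64u

typedef enum { ACC_NONE, ACC_READ, ACC_WRITE } access_t;

int      shared_mem[WORDS_PER_BLOCK * NUM_BLOCKS];  /* simulated shared region */
access_t block_state[NUM_BLOCKS];                   /* all ACC_NONE at start   */
unsigned fetches;                                   /* simulated remote fetches */

/* Stand-in for the coherence protocol: obtain the block with access `want`. */
static void fetch_block(size_t blk, access_t want)
{
    block_state[blk] = want;
    fetches++;
}

/* Map an address inside shared_mem to its block index. */
static size_t block_of(const int *p)
{
    size_t word = (size_t)(p - shared_mem);
    assert(word < WORDS_PER_BLOCK * NUM_BLOCKS);
    return word / WORDS_PER_BLOCK;
}

/* Analogue of the check inserted before each load. */
int dsm_load(const int *p)
{
    size_t b = block_of(p);
    if (block_state[b] == ACC_NONE) fetch_block(b, ACC_READ);
    return *p;
}

/* Analogue of the check inserted before each store. */
void dsm_store(int *p, int v)
{
    size_t b = block_of(p);
    if (block_state[b] != ACC_WRITE) fetch_block(b, ACC_WRITE);
    *p = v;
}
```

The source statements `line += 3; v = v - line;` become `dsm_store(line, dsm_load(line) + 3); dsm_store(v, dsm_load(v) - dsm_load(line));`. Every access pays for a check even when the data is already local, which is the overhead the next slide lists.
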
Second Approach: Code Instrumentation
• Provides fine-grain access control, thus avoids false sharing
• Bypasses the page protection mechanism
• Usually a fixed granularity for all application data (so false sharing still occurs)
• Needs a special compiler or binary-level rewriting tools
• Cost:
  • High overheads (even on a single machine)
  • Inflated code
  • Not portable (among architectures)

Software DSM Evolution
[Timeline; tracks shown: page-grain (relaxed consistency), fine-grain (code instrumentation), and multithreading]
• IVY (Li & Hudak), ’86, Yale
• Munin, ’92, Release Consistency, Rice
• Midway, ’93, Entry Consistency, CMU
• TreadMarks, ’94, Lazy Release Consistency, Rice
• Blizzard, ’94, binary instrumentation, Wisconsin
• CVM, Millipede, ’96, multi-protocol, Maryland and Technion
• Shasta, ’97, transparent, works for commercial apps, Digital WRL
• Brazos, ’97, Scope Consistency, Rice
• Quarks, ’98, protocol latency hiding, Utah

MultiView - Overheads
• Application: traverse an array of integers, all packed into minipages
• The number of minipages is derived from the maximum number of views per page
• Limitations of the experiments:
  • 1.63 GB of contiguous address space available
  • Up to 1664 views
  • Need 64 bits!

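One way to read the two limits together, assuming that every view maps the whole memory object and that the object is 1 MB (an assumption; this slide does not state the object size):

```latex
\[
1664\ \text{views} \times 1\ \text{MB per view} = 1664\ \text{MB}
= \frac{1664}{1024}\ \text{GB} = 1.625\ \text{GB} \approx 1.63\ \text{GB}
\]
```

Under that assumption the view count is bounded by virtual address space rather than by physical memory, which would explain the call for 64-bit addressing.
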
MultiView - Overheads
[Figure: x-axis: number of views]
• As expected, committed (physical) memory is constant
• Only a negligible overhead (< 4%), due to TLB misses

MultiView - Taking it to the extreme
[Figure: curves for 1 MB, 2 MB, 4 MB, and 8 MB; a later build of the slide adds an "SDSM" label]
• Beyond the critical points, overhead becomes substantial
• The number of minipages at the critical points is 128K
• Slowdown is due to the L2 cache being exhausted by PTEs

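A rough estimate of why 128K minipages could exhaust the L2 cache, assuming one page-table entry per minipage and 4-byte PTEs as on 32-bit x86 without PAE (both are assumptions; the slides give neither the platform nor the cache size):

```latex
\[
128\text{K PTEs} \times 4\ \text{bytes per PTE} = 131072 \times 4\ \text{bytes} = 512\ \text{KB}
\]
```

Under these assumptions the PTEs touched by one traversal alone occupy about half a megabyte, so they compete with the application data for cache space.
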
The Transparent DSM: System Initialization
• For most DSM systems, initialization is an almost trivial task
• The transparent DSM system cannot use such a simple solution
• To initialize a DSM system transparently, the initialization code has to be injected into the loaded application

Standard Initialization
Standard C application:
    crtStartup:
        …
        call c_init      ; initialize the C runtime library
        …
        call main        ; start the application
        …
    main:
        …
        application code
        …
• The startup code comes from the C standard library and is identical for all C applications
• crtStartup is the entry point of the executable
• The "call main" instruction lies at a fixed offset from crtStartup; we denote this offset as main_call_offset

Transparent DSM System Initialization
Injected DLL:
    DllMain:
        …
        crtStartup = get_entry_point();
        mainPtr = *(crtStartup + main_call_offset);
        *(crtStartup + main_call_offset) = hookedMain;
        …
    mainPtr dd NULL
    hookedMain:
        dsm_init(…);
        dsm_create_thread(…, mainPtr, …);
        …
Application:
    crtStartup:
        …
        call c_init
        …
        call main        ; after patching, this reaches hookedMain
        …
    main:
        …
        application code
        …
• The OS passes control to DllMain() after the DLL has been loaded
• The main thread is resumed
• Initialize the C runtime library
• Initialize the DSM system (the OS API is intercepted, globals are moved to the DSM)
• The application main thread is created using the DSM system thread creation API

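The control flow of this hook can be simulated in portable C. This is only a model: a function pointer `main_call_slot` stands in for the patched operand of "call main", no code is actually patched or injected, and `hookedMain` calls the saved main directly where the slides create a DSM thread. All helper names besides those on the slide are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef int (*main_fn)(void);

char trace[128];   /* records the order in which the steps run */

static void note(const char *step)
{
    strcat(trace, step);
    strcat(trace, ";");
}

/* The application's own main. */
static int app_main(void) { note("main"); return 0; }

/* Models the call target stored at crtStartup + main_call_offset. */
static main_fn main_call_slot = app_main;
static main_fn mainPtr = NULL;   /* original target, saved by dll_main */

static void dsm_init(void) { note("dsm_init"); }

/* Runs in place of main: set up the DSM, then start the real main. */
static int hookedMain(void)
{
    dsm_init();
    return mainPtr();
}

/* Models DllMain of the injected DLL: save the target, then redirect it. */
void dll_main(void)
{
    note("DllMain");
    mainPtr = main_call_slot;
    main_call_slot = hookedMain;
}

/* Models crtStartup: initialize the C runtime, then call through the slot. */
int crt_startup(void)
{
    note("c_init");
    return main_call_slot();
}
```

Calling `dll_main()` and then `crt_startup()` runs the steps in the order listed on the slide: DllMain, C runtime initialization, DSM initialization, and finally the application's main.
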
SDSMs on Emerging Fast Networks
• Fast networking is an emerging technology
• MultiView addresses only one aspect: reducing message sizes
• The next order-of-magnitude improvement shifts from the network layer to the system architectures and protocols that use those networks
• Challenges:
  • Efficiently employ and integrate fast networks
  • Provide a “thin” protocol layer: reduce protocol complexity, eliminate buffer copying, use home-based management, etc.

Adding the Privileged View
[Figure: a memory object holding x, y, z; application views with protections RW, NA, and R; the privileged view is RW]
• Constant read/write permissions
• Separates application threads from SDSM-injected threads
• Atomic updates
  • DSM threads can access (and update) memory while application threads are prohibited
• Direct send/receive
  • Memory-to-memory
  • No buffer copying

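The atomic-update bullet can be sketched with two mappings of one memory object. This is a POSIX illustration under assumptions, not the authors' implementation: `pv_region`, `pv_create`, and `pv_install` are hypothetical names, and the `memcpy` stands in for a network receive that would target the privileged view directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    size_t size;
    void  *app;    /* application view: protection follows the protocol */
    void  *priv;   /* privileged view: always read/write                */
} pv_region;

/* Map one temporary file twice: once inaccessible, once read/write. */
int pv_create(pv_region *r, size_t size)
{
    FILE *f = tmpfile();
    if (!f) return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)size) != 0) { fclose(f); return -1; }
    r->size = size;
    r->app  = mmap(NULL, size, PROT_NONE, MAP_SHARED, fd, 0);
    r->priv = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    fclose(f);   /* the mappings stay valid after the descriptor is closed */
    return (r->app == MAP_FAILED || r->priv == MAP_FAILED) ? -1 : 0;
}

/* DSM side: install incoming data through the privileged view while the
 * application view is still inaccessible, then open it for reading. */
int pv_install(pv_region *r, size_t off, const void *data, size_t len)
{
    if (off > r->size || len > r->size - off) return -1;
    memcpy((char *)r->priv + off, data, len);
    return mprotect(r->app, r->size, PROT_READ);
}
```

Application threads that touch `app` before `pv_install` returns would fault, so they never observe a partially updated minipage, while the DSM thread writes freely through `priv`.
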