SMiLE: Shared Memory Programming
Wolfgang Karl, Martin Schulz
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Technische Universität München
SCI Summer School, Trinity College Dublin, October 3rd, 2000
SMiLE Project at LRR • Computer Architecture Group (W. Karl, M. Schulz) • SMiLE: Shared Memory in a LAN-like Environment • Programming Environments and Tools • SCI Hardware Developments • http://smile.in.tum.de/ • Lehrstuhl für Rechnertechnik und Rechnerorganisation, Prof. Dr. Arndt Bode • Parallel Processing and Architectures • Tools and Environments for Parallel Processing • Applications • Parallel Architectures
Outline • Parallel Processing: Principles • SMiLE Software Infrastructure • Focus on Communication Architecture • SMiLE Tool Environment • Data Locality Optimizations • Continuation by Martin • Shared Memory Programming on SCI
Parallel Processing • Parallel Computer Architectures • Shared Memory Machines • Distributed Memory Machines • Distributed Shared Memory • Parallel Programming Models • Shared Memory Programming • Message Passing • Data Parallel Programming Model
Shared Memory Multiprocessors
[Diagram: several CPUs, each with a cache, connected through an interconnection network (bus, crossbar, multistage) to centralized shared memory modules]
• Global address space
• Uniform Memory Access (UMA)
• Communication / synchronization via shared variables
Parallel Programming Models (1) • Shared Memory • Single global virtual address space • Easier programming model • Implicit data distribution • Support for incremental parallelization • Mainly on tightly coupled systems • SMPs, CC-NUMA • Pthreads, SPMD, OpenMP, HPF
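For illustration, a minimal OpenMP sketch of this model (not taken from the SMiLE code base; array size and contents are arbitrary): one global address space, implicit data distribution, and incremental parallelization of an existing loop with a single pragma.

/* Incremental parallelization in a shared address space with OpenMP. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) b[i] = i;   /* serial initialization      */

    /* Incremental parallelization: this pragma is the only change.       */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];                  /* all threads share a and b  */
        sum += a[i];
    }

    printf("sum = %.0f (threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}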
Shared Memory Flow: Timing
The producer writes to a shared buffer and sets a flag; the consumer detects the set flag and reads the message from the buffer.
Producer thread:
for (i = 0; i < num_bytes; i++) buffer[i] = source[i];
flag = num_bytes;
Consumer thread:
while (flag == 0) ;
for (i = 0; i < flag; i++) dest[i] = buffer[i];
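The following is a minimal, self-contained C sketch of this flag-based producer/consumer flow, assuming a cache-coherent shared memory system; it uses Pthreads and C11 atomics so that the flag update is ordered after the buffer writes. Buffer size and message contents are illustrative, not from the slides.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define MAX_BYTES 64

static char buffer[MAX_BYTES];                 /* shared message buffer   */
static atomic_int flag = 0;                    /* 0 = empty, >0 = length  */

static void *producer(void *arg)
{
    const char *source = (const char *)arg;
    size_t num_bytes = strlen(source) + 1;
    memcpy(buffer, source, num_bytes);         /* write message to buffer */
    atomic_store_explicit(&flag, (int)num_bytes,
                          memory_order_release); /* then publish length   */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    char dest[MAX_BYTES];
    int n;
    while ((n = atomic_load_explicit(&flag, memory_order_acquire)) == 0)
        ;                                      /* spin until flag is set  */
    memcpy(dest, buffer, (size_t)n);           /* read message from buffer*/
    printf("consumer received: %s\n", dest);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, "hello");
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}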
Distributed Memory (DM)
[Diagram: nodes, each with CPU, cache, memory and network interface, connected via an interconnection network]
• No remote memory access (NORMA)
• Communication: message passing
• MPPs, NOWs, clusters: scalability
Parallel Programming Models (2) • Message Passing • Predominant paradigm for DM machines • Straightforward resource abstraction • High-level communication libraries: PVM, MPI • Exploiting the underlying interconnection networks • Complex and more difficult for the user • Explicit data distribution and parallelization • But: performance tuning is more intuitive
Message Passing Flow: Timing
Sender: send a message
• OS call
• Protection check
• Program DMA
• DMA to NI
Receiver: receive a message
• DMA from network to system buffer
• OS interrupt and message decode
• OS copy from system buffer to user buffer
• Reschedule user process
• Receive message
Producer process: send(proc_i, process_i, @sbuffer, num_bytes)
Consumer process: receive(@rbuffer, max_bytes)
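As a concrete counterpart, a minimal MPI sketch of the same producer/consumer exchange (the send/receive pseudo-calls on the slide are not actual MPI names; ranks, tags and buffer sizes here are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char sbuffer[64] = "hello";
    char rbuffer[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* producer: explicit send of num_bytes to process 1 */
        MPI_Send(sbuffer, (int)strlen(sbuffer) + 1, MPI_CHAR,
                 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* consumer: blocking receive into rbuffer (max_bytes = 64) */
        MPI_Recv(rbuffer, 64, MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received: %s\n", rbuffer);
    }

    MPI_Finalize();
    return 0;
}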
Distributed Shared Memory (DSM)
[Diagram: nodes with CPU, cache and local memory connected via an interconnection network; the distributed memories form a global shared address space]
• Distributed memory, shared by all processors
• NUMA: non-uniform memory access; CC-NUMA, COMA
• Combines support for shared memory programming with scalability
Parallel Computer Architecture • Trends in parallel computer architecture • Convergence towards a generic parallel machine organization • Use of commodity off-the-shelf components • Low-cost parallel processing • Comprehensive and high-level development environments
SCI-based PC clusters
[Diagram: PCs with PCI-SCI adapters on an SCI interconnect, forming a global address space]
• NUMA architecture with commodity components
• Hardware-supported DSM with low-latency remote memory access and fast message passing
• Competitive capabilities for less money
• But: new challenges for the software environment
SCI-Based Cluster-Computing • The SMiLE project at LRR-TUM • Shared Memory in a LAN-like Environment • System architecture • SCI-based PC cluster with NUMA characteristics • Software infrastructure for PC clusters • User-level communication architectures on top of SCI’s DSM • Providing message-passing and transparent shared memory on a single platform • Tool environment
SMiLE software layers
[Layer diagram, top to bottom:]
• Target applications / test suites
• High-level SMiLE: NT protocol stack, SISAL on MuSE, SISCI, PVM, SPMD-style model, TreadMarks-compatible API
• Low-level SMiLE: AM 2.0, SS-lib, CML, SCI-VM lib (SCI messaging)
• User/kernel boundary: NDIS driver, SCI drivers & SISCI API, SCI-VM
• SCI hardware: SMiLE & Dolphin adapters, HW monitor
SMiLE software layers: message passing / user-level communication
[Same layer diagram as above, with the message-passing and user-level communication path highlighted]
Message passing using HW-DSM • SMiLE messaging layers • Active Messages • User-level sockets • Common Messaging Layer for PVM and MPI • User-level communication • Remove the OS from the critical path of sending and receiving messages • Mapping parts of the NI into the user’s address space • Avoid context switches and buffering • Direct utilization of the HW-DSM • Buffered remote writes
Principles of the message engines
[Diagram: the sender's send ring buffer on node A is mapped via SCI onto the receive ring buffer on node B; the sender advances the end pointer (remotely) and keeps a synced copy of the receiver's start pointer, while the receiver advances its start pointer, which is mapped back to the sender]
Implementation Remarks • Ring buffer setup • One pair of ring buffers for each connection • Avoiding high overhead access synchronization • On demand establishment of connections • Data transfer • Only pipelined, buffered remote writes • Avoiding inefficient, blocking reads • User level barriers to avoid IOCTL overhead
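A minimal sketch of the ring-buffer scheme from the two previous slides, with ordinary arrays standing in for the SCI-mapped regions: in the real message engine the sender pushes payload and the end pointer with pipelined, buffered remote writes, and the receiver pushes its start pointer back. Names, the ring size, and the single-threaded driver are assumptions for illustration only.

#include <stdint.h>
#include <stdio.h>

#define RING_BYTES 1024

/* Receiver-side state; on the sender these would be SCI mappings. */
static uint8_t  ring[RING_BYTES];    /* receive ring buffer (mapped)      */
static uint32_t end_ptr;             /* written remotely by the sender    */
static uint32_t start_ptr;           /* written locally, mapped back      */

/* Sender: copy the message into the remote ring with remote writes,
 * then advance the (remote) end pointer to publish it. */
static int ring_send(const void *msg, uint32_t n)
{
    uint32_t used = (end_ptr - start_ptr) % RING_BYTES;
    if (n >= RING_BYTES - used) return -1;        /* not enough space      */
    for (uint32_t i = 0; i < n; i++)
        ring[(end_ptr + i) % RING_BYTES] = ((const uint8_t *)msg)[i];
    end_ptr = (end_ptr + n) % RING_BYTES;         /* publish new end       */
    return 0;
}

/* Receiver: consume everything between start and end pointer, then
 * advance the start pointer so the sender sees the space as free. */
static uint32_t ring_recv(void *out, uint32_t max)
{
    uint32_t n = (end_ptr - start_ptr) % RING_BYTES;
    if (n > max) n = max;
    for (uint32_t i = 0; i < n; i++)
        ((uint8_t *)out)[i] = ring[(start_ptr + i) % RING_BYTES];
    start_ptr = (start_ptr + n) % RING_BYTES;     /* free the space        */
    return n;
}

int main(void)
{
    char out[32];
    ring_send("hello over SCI", 15);
    printf("received %u bytes: %s\n", (unsigned)ring_recv(out, sizeof out), out);
    return 0;
}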
SMiLE software layers: true shared memory programming
[Same layer diagram as above, with the shared memory path highlighted: the SPMD-style model and TreadMarks-compatible API on top of the SCI-VM lib and SCI-VM, using the SMiLE driver & Dolphin IRM driver]
DSM performance monitoring (1) • Performance aspects of DSM systems: • High communication performance via hardware-supported DSM • Remote memory access latencies are an order of magnitude higher than local ones • Data locality should therefore be enabled or exploited by programming systems or tools
DSM performance monitoring (2) • Monitoring aspects of DSM systems: • Capture information about the dynamic behavior of a parallel program • Fine-grain communication: communication might occur implicitly on every read or write • Avoid the probe effect that software instrumentation would introduce
SMiLE Monitoring Approach • Event-driven hybrid monitoring system • Network interface with monitoring hardware • Delivers information about the runtime and communication behavior to tools for performance analysis and debugging • Allows on-line steering • Hardware monitor exploits spatial and temporal locality of accesses
The SMiLE Hardware Monitor
[Block diagram: the SMiLE monitor attaches as a probe to the B-Link between the PCI-SCI bridge (SCI in/out to the SCI network) and the PCI local bus; it consists of a B-Link interface, a counter module with a dynamic counter array, a static counter array and an event filter, and a PCI unit]
Features of the SMiLE Monitor • Dynamic monitoring mode • Used on whole physical address space • Creation of global access heuristics • Cache-like swap logic to save hardware resources • Automatic aggregation of neighboring areas • Static monitoring mode • Used on predefined memory areas • Flexible event logic
Monitor’s dynamic mode
[Diagram: each memory reference from the instruction stream is checked against a small tag array; on a hit the corresponding counter (#1 to #5) is incremented, otherwise an entry is evicted into a ring buffer in main memory (managed via head and tail pointers), and an interrupt is raised once the ring buffer is stuffed]
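To make the dynamic mode concrete, here is a small software model of the cache-like counter logic; granularity, array sizes, and the evict-the-coldest policy are assumptions for illustration, not the actual SMiLE hardware parameters (automatic aggregation of neighboring areas is not modelled).

#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 5
#define RING_SIZE    16
#define GRANULARITY  64          /* assumed counting granularity in bytes */

struct entry { uint64_t tag; uint32_t count; int valid; };

static struct entry counters[NUM_COUNTERS];   /* dynamic counter array    */
static struct entry ring[RING_SIZE];          /* ring buffer in main mem. */
static unsigned head, tail;                   /* tail advanced by software */

static void reference(uint64_t phys_addr)
{
    uint64_t tag = phys_addr / GRANULARITY;
    int i, victim = -1;

    for (i = 0; i < NUM_COUNTERS; i++) {
        if (counters[i].valid && counters[i].tag == tag) {
            counters[i].count++;                 /* hit: increment counter  */
            return;
        }
        if (!counters[i].valid) victim = i;      /* prefer a free slot      */
    }
    if (victim < 0) {                            /* all valid: evict coldest */
        victim = 0;
        for (i = 1; i < NUM_COUNTERS; i++)
            if (counters[i].count < counters[victim].count) victim = i;
        ring[head] = counters[victim];           /* spill to ring buffer     */
        head = (head + 1) % RING_SIZE;
        if ((head + 1) % RING_SIZE == tail)      /* nearly full: interrupt   */
            fprintf(stderr, "ring buffer stuffed: raise interrupt\n");
    }
    counters[victim] = (struct entry){ .tag = tag, .count = 1, .valid = 1 };
}

int main(void)
{
    /* feed a few synthetic references: two hot areas and one cold one */
    for (int i = 0; i < 10; i++) { reference(0x1000); reference(0x2040); }
    reference(0x9fc0);
    for (int i = 0; i < NUM_COUNTERS; i++)
        if (counters[i].valid)
            printf("tag %llx: %u accesses\n",
                   (unsigned long long)counters[i].tag, counters[i].count);
    return 0;
}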
Information delivered • All information acquired is based on SCI packets • Physical addresses • Source/target IDs • Information cannot be used directly • Physical addresses inappropriate at user level • Back-translation to source-code level necessary • Need for a monitoring infrastructure • Access to mapping & symbol information • Clean monitoring interface
OMIS: Goals and Design • Goal: Flexible monitoring for distributed systems • Specify interface to be used by tools • Decouple tools and monitor system • Increased portability and availability of tools • OMIS Approach • Interface based on Event-Action paradigm • Events: When should something happen? • Actions: What should happen? • OMIS provides default set of Events and Actions • Tools define relations between Events and Actions
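A purely illustrative C model of the event-action paradigm; this is not the OMIS interface itself, just a sketch of how a tool could define relations between named events and actions that the monitor then triggers.

#include <stdio.h>
#include <string.h>

typedef void (*action_fn)(const char *event, void *ctx);

struct relation { const char *event; action_fn action; void *ctx; };

#define MAX_RELATIONS 16
static struct relation relations[MAX_RELATIONS];
static int num_relations;

/* Tool side: define a relation between an event and an action. */
static int monitor_register(const char *event, action_fn action, void *ctx)
{
    if (num_relations == MAX_RELATIONS) return -1;
    relations[num_relations++] = (struct relation){ event, action, ctx };
    return 0;
}

/* Monitor side: an event has occurred; run every action attached to it. */
static void monitor_event(const char *event)
{
    for (int i = 0; i < num_relations; i++)
        if (strcmp(relations[i].event, event) == 0)
            relations[i].action(event, relations[i].ctx);
}

static void print_action(const char *event, void *ctx)
{
    printf("tool notified: %s (%s)\n", event, (const char *)ctx);
}

int main(void)
{
    monitor_register("remote_page_access", print_action, "update histogram");
    monitor_event("remote_page_access");   /* the monitor detects the event */
    return 0;
}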
Putting it all together
[Diagram: a multi-layered monitoring infrastructure combining the SMiLE HW-DSM monitor, an extensible monitoring API, and global virtual memory for clusters]
Multi-Layered SMiLE monitoring
[Layer diagram, top to bottom:]
• Tools
• Prog. environment extension: high-level programming environment (specific information)
• Prog. model extension: shmem programming model (specific information)
• OMIS/OCM core + OMIS SCI-DSM extension: OMIS/OCM monitor for DSM systems
• SyncMod (statistics of synchronization mechanisms)
• SCI-VM (virtual/physical address mappings, statistics)
• SMiLE PCI-SCI bridge and monitor (physical addresses, node IDs, counters, histograms)
• Node-local resources (CPU counters, cache statistics, OS information)
Advantages • Comprehensive DSM monitoring • Utilization of information from all components • Structure of execution environment maintained • Generic shared memory monitoring • Small model-specific extensions • Flexibility and extensibility • Profit from existing OMIS environment • Easy implementation • Utilization of existing rich tool base
Current Status and Future Work • SCI Virtual Memory • Prototype completed • Work on larger infrastructure in progress • SMiLE Hardware Monitor • Prototype is currently being tested • Simulation environment available • OMIS • OMIS definition and OCM core completed • DSM extension in development
Data locality optimizations • Using the monitor’s static mode • monitoring predefined memory sections • integration of the monitoring concept into the programming model • translation of the application’s data structures into physical addresses for the hardware monitor • relating the monitoring results back to the source code • evaluation of the network behavior and data locality
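As one possible illustration of the address-translation step (Linux-specific and not part of SMiLE; in the real setup the SCI-VM supplies these mappings), a data structure's virtual address can be translated to the physical address the hardware monitor sees via /proc/self/pagemap. Reading page frame numbers requires CAP_SYS_ADMIN on current kernels; otherwise the PFN field reads as zero.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t virt_to_phys(const void *vaddr)
{
    long page = sysconf(_SC_PAGESIZE);
    uint64_t entry = 0, vfn = (uint64_t)(uintptr_t)vaddr / (uint64_t)page;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    if (pread(fd, &entry, sizeof entry, (off_t)(vfn * sizeof entry)) != sizeof entry)
        entry = 0;
    close(fd);
    if (!(entry & (1ULL << 63))) return 0;        /* page not present      */
    uint64_t pfn = entry & ((1ULL << 55) - 1);    /* bits 0-54 hold the PFN */
    return pfn * (uint64_t)page + (uint64_t)(uintptr_t)vaddr % (uint64_t)page;
}

int main(void)
{
    static double block[16][16];                  /* a monitored data block */
    block[0][0] = 1.0;                            /* touch it so it is mapped */
    printf("block at %p -> physical 0x%llx\n",
           (void *)block, (unsigned long long)virt_to_phys(block));
    return 0;
}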
Example application • SPLASH benchmark suite: LU kernel • implements a blocked version of an LU decomposition for dense matrices • solves a system of linear equations • splits the data structure into subblocks of 16x16 values • LU decomposition • split into phases, one for each block • for each phase: analysis of the remote memory accesses
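For reference, a minimal unblocked LU decomposition kernel (Doolittle form, no pivoting); the SPLASH LU benchmark applies such updates block-wise to 16x16 sub-blocks and distributes the blocks across nodes. Matrix size and values here are illustrative only.

#include <stdio.h>

#define N 4

/* Factor a in place: U is stored on and above the diagonal, the
 * unit-lower-triangular L (without its diagonal) below it. */
static void lu_decompose(double a[N][N])
{
    for (int k = 0; k < N; k++) {
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                  /* L(i,k) = A(i,k)/U(k,k) */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];    /* trailing update        */
        }
    }
}

int main(void)
{
    double a[N][N] = {
        { 4, 3, 2, 1 },
        { 3, 4, 3, 2 },
        { 2, 3, 4, 3 },
        { 1, 2, 3, 4 },
    };
    lu_decompose(a);
    for (int i = 0; i < N; i++, puts(""))
        for (int j = 0; j < N; j++)
            printf("%8.4f ", a[i][j]);
    return 0;
}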
Simulation environment • Multiprocessor memory system simulator: LIMES • Shared memory system • local read/write access latency: 1 cycle • remote write latency: 20 cycles • DSM system with x86 nodes • memory distribution at page granularity
Optimization results
[Charts comparing the unoptimized version with the optimized version]
Summary • Parallel Processing Principles • SMiLE Software Infrastructure • Message Passing Communication • Shared Memory Programming • SMiLE Tool Environment • Based on Hardware Monitoring