Compiler, Languages, and Libraries ECE Dept., University of Tehran Parallel Processing Course Seminar Hadi Esmaeilzadeh 810181079 hadi@cad.ece.ut.ac.ir
Introduction • Distributed systems are heterogeneous: • Power • Architecture • Data Representation • Data access latencies are significantly long and vary with the underlying network traffic • Network bandwidths are limited and can vary dramatically with the underlying load
Programming Support Systems: Principles • Principle: each component of the system should do what it does best • The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction
Programming Support Systems: Goals • They should make applications easy to develop • Build applications that are portable across different architectures and computing configurations • Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configurations • Exploit various forms of parallelism to balance the load across a heterogeneous configuration • Minimize the computation time • Match the communication to the underlying bandwidths and latencies • Ensure that performance variability remains within certain bounds
Autoparallelization • The user focuses on what is being computed rather than how • The performance penalty should be no worse than a factor of two • Automatic vectorization • Dependence analysis (see the sketch below) • Asynchronous (MIMD) parallel processing • Symmetric multiprocessors (SMP)
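The distinction that dependence analysis draws can be illustrated with a minimal C++ sketch (illustrative code, not from the original slides): the first loop has no loop-carried dependence and is a legal target for automatic vectorization or parallelization, while the second carries a dependence between iterations and must remain sequential.

```cpp
#include <cstddef>
#include <vector>

// No loop-carried dependence: a[i] depends only on b[i], so iterations are
// independent and an autoparallelizing compiler may vectorize this loop.
void scale(std::vector<double>& a, const std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        a[i] = 2.0 * b[i];
}

// Loop-carried dependence: a[i] reads the value written to a[i - 1] in the
// previous iteration, so automatic vectorization/parallelization is not legal.
void prefix(std::vector<double>& a) {
    for (std::size_t i = 1; i < a.size(); ++i)
        a[i] = a[i - 1] + 1.0;
}

int main() {
    std::vector<double> a(8, 0.0), b(8, 1.0);
    scale(a, b);
    prefix(a);
    return 0;
}
```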
Distributed Memory Architecture • Caches • Higher latency of large memories • Determine how to apportion data to the memories of processors in a way that • Maximizes local memory access • Minimizes communication • Regions of parallel execution have to be large enough to compensate for the overhead of initiation and synchronization • Interprocedural analysis and optimization • Mechanisms that involve the programmer in the design of the parallelization as well as the problem solution will be required
Explicit Communication • Message passing is used to get data from remote memories • A single version of the program runs on all processors • The computation is specialized to specific processors by extracting the processor number and indexing into the processor's own portion of the data (see the SPMD sketch below)
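A minimal SPMD sketch of this idea using MPI (the MPI calls are standard; the problem size and block partitioning are illustrative assumptions): every process runs the same program, queries its rank, and works only on its own slice of the index space.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's number
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

    const int N = 1000;                     // global problem size (assumed)
    int chunk = N / size;                   // contiguous block owned by each process
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0;
    for (int i = lo; i < hi; ++i)           // index only the locally owned range
        local += 0.5 * i;

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```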
Send-Receive Model • Used where there is no shared memory: each processor not only receives the data it needs but also sends the data other processors require • PVM • MPI
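A minimal MPI send-receive sketch (run with at least two processes; the buffer contents are illustrative): both sides participate explicitly, the owner posting a send and the consumer posting a matching receive.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {
        for (int i = 0; i < 4; ++i) buf[i] = i + 1.0;
        // The owner of the data must explicitly send it...
        MPI_Send(buf, 4, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // ...and the consumer must post a matching receive.
        MPI_Recv(buf, 4, MPI_DOUBLE, /*source=*/0, /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %f ... %f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```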
Get-Put Model • The processor that needs data from a remote memory is able to explicitly get it without requiring any explicit action by the remote processor
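One concrete realization of the get-put model is MPI's one-sided communication (MPI-2 RMA); the sketch below is illustrative and assumes at least two processes. Rank 1 fetches a value held by rank 0 without rank 0 issuing any send or receive.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank * 10.0;   // each process exposes one double in a window
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double fetched = 0.0;
    MPI_Win_fence(0, win);
    if (rank == 1)
        // Rank 1 gets rank 0's value; rank 0 takes no explicit action here.
        MPI_Get(&fetched, 1, MPI_DOUBLE, /*target_rank=*/0,
                /*target_disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 1) std::printf("fetched %f from rank 0\n", fetched);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```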
Discussion • The programmer is responsible for: • Decomposition of the computation • Accounting for the power of each individual processor • Load balancing • Layout of the memory • Management of latency • Organization and optimization of communication • Explicit communication can be thought of as an assembly language for grids
Distributed Shared Memory • DSM is a vehicle for hiding the complexities of memory and communication management • The address space appears as flat to the programmer as on a single-processor machine • The hardware/software is responsible for retrieving data from remote memories by generating the needed communication
Hardware Approach • Stanford DASH, HP/Convex Exemplar, SGI Origin • Local cache misses initiate data transfer from remote memory if needed
Software Scheme • Shared Virtual Memory, TreadMarks • Rely on the paging mechanism in the operating system • Whole pages are transferred on demand between nodes • This makes both the granularity and the latency significantly larger • Used in conjunction with relaxed memory consistency models and support for latency hiding
Discussion • The programmer is freed from handling thread packages and parallel loops • Has performance penalties, and is therefore best suited to coarser-grained parallelism • Works best with some help from the programmer on the layout of memory • Is a promising strategy for simplifying the programming model
Data-Parallel Languages • High performance on distributed memory requires: • Allocating data to the various processor memories to maximize locality and minimize communication • For scaling parallelism to hundreds or thousands of processors, data parallelism is necessary • Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout) • These are the foundations for data-parallel languages • Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and PC++ • High Performance Fortran (HPF) and High Performance C++ (HPC++)
HPF • Provides directives for data layout on top of Fortran 90 and Fortran 95 • Directives have no effect on the meaning of the program • They advise the compiler on how to assign elements of the program's arrays and data structures to different processors • These specifications are relatively machine independent • The principal focus is the layout of arrays • Arrays are typically associated with the data domains of the underlying problem • The principal drawback: limited support for problems on irregular meshes • Distribution via a run-time array • Generalized block distribution (blocks may be of different sizes) • For heterogeneous machines, block sizes can be adapted to the powers of the target machines (generalized block distribution); see the sketch below
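A hedged C++ sketch of the generalized block distribution idea for heterogeneous machines (the function name and the relative-power values are illustrative assumptions): block sizes are made proportional to each node's relative power, so faster nodes receive larger slices of the array.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Compute per-node block sizes for n array elements, proportional to each
// node's relative power. Assumes a non-empty power vector.
std::vector<int> block_sizes(int n, const std::vector<double>& power) {
    double total = 0.0;
    for (double p : power) total += p;

    std::vector<int> sizes(power.size());
    int assigned = 0;
    for (std::size_t i = 0; i < power.size(); ++i) {
        sizes[i] = static_cast<int>(n * power[i] / total);
        assigned += sizes[i];
    }
    sizes.back() += n - assigned;   // hand any rounding remainder to the last node
    return sizes;
}

int main() {
    // Three nodes, the second twice as powerful as the others (assumed values).
    for (int b : block_sizes(1000, {1.0, 2.0, 1.0}))
        std::printf("%d ", b);      // prints: 250 500 250
    std::printf("\n");
    return 0;
}
```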
HPC++ • Unsynchronized for-loops • Parallel template libraries, with parallel or distributed data structures as their basis
Task Parallelism • Different components of the same computation are executed in parallel • Different tasks can be allocated to different nodes of the grid • Object parallelism (different tasks may be components of objects of different classes) • Task parallelism need not be restricted to shared-memory systems and can be defined in terms of a communication library
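As a rough single-machine analogue of task parallelism (not a grid runtime; the task names and values are illustrative), the C++ sketch below runs two different components of one computation concurrently with std::async and combines their results.

```cpp
#include <cstdio>
#include <future>

double simulate_flow()   { return 1.5; }   // placeholder component tasks
double simulate_stress() { return 2.5; }

int main() {
    // Launch two distinct tasks concurrently; on a grid these could be
    // dispatched to different nodes instead of local threads.
    auto flow   = std::async(std::launch::async, simulate_flow);
    auto stress = std::async(std::launch::async, simulate_stress);

    double combined = flow.get() + stress.get();   // synchronize on both results
    std::printf("combined = %f\n", combined);
    return 0;
}
```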
HPF 2.0 Extensions for Task Parallelism • Can be implemented on both shared- and distributed-memory systems • Provide a way for a set of cases to be run in parallel, with no communication until synchronization at the end • Remaining problems in using HPF on a computational grid: • Load matching • Communication optimization
Coarse-Grained Software Integration • The complete application is not a single program • It is a collection of programs that must all be run, passing data to one another • The main technical challenge of integration is preventing the performance degradation caused by sequential processing of the various programs • Each program can be viewed as a task • Tasks are collected and matched to the power of the various nodes in the grid
Latency Tolerance • Dealing with long memory or communication latencies • Latency hiding: data communication is overlapped with computation (software prefetching) • Latency reduction: programs are reorganized to reuse more data in local memories (loop blocking for cache; see the sketch below) • Both are more complex to implement on heterogeneous distributed computers • Latencies are large and variable • More time must be spent estimating running times
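A minimal sketch of latency reduction by loop blocking (the matrix and tile sizes are illustrative assumptions): the blocked transpose reuses a tile while it is still in cache instead of striding across whole rows and columns.

```cpp
#include <algorithm>
#include <vector>

constexpr int N = 512;   // matrix dimension (assumed)
constexpr int B = 64;    // block/tile size tuned to the cache (assumed)

// Naive transpose: the t[j * N + i] accesses stride across memory, so cache
// lines are evicted before they can be reused.
void transpose_naive(const std::vector<double>& a, std::vector<double>& t) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            t[j * N + i] = a[i * N + j];
}

// Blocked transpose: work proceeds tile by tile, so both the source and the
// destination tiles stay resident in cache while they are reused.
void transpose_blocked(const std::vector<double>& a, std::vector<double>& t) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < std::min(ii + B, N); ++i)
                for (int j = jj; j < std::min(jj + B, N); ++j)
                    t[j * N + i] = a[i * N + j];
}

int main() {
    std::vector<double> a(N * N, 1.0), t(N * N, 0.0);
    transpose_naive(a, t);
    transpose_blocked(a, t);
    return 0;
}
```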
Load Balancing • Spreading the calculation evenly across processors while minimizing communication • Simulated annealing, neural nets • Recursive bisection: at each stage, the work is divided into two equal parts • For the grid, the power of each node must be taken into account (see the sketch below) • Performance prediction of components is essential
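A hedged sketch of recursive bisection adapted to a heterogeneous grid (all names and the power values are illustrative): at each stage the work is split in proportion to the aggregate power of the two halves of the node set rather than into strictly equal parts.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Assign work items [lo, hi) to the nodes whose relative powers occupy
// power[plo, phi); owner[i] records the node that gets item i.
void bisect(int lo, int hi, const std::vector<double>& power,
            int plo, int phi, std::vector<int>& owner) {
    if (phi - plo == 1) {                       // one node left: it owns the range
        for (int i = lo; i < hi; ++i) owner[i] = plo;
        return;
    }
    int pmid = (plo + phi) / 2;
    double left  = std::accumulate(power.begin() + plo,  power.begin() + pmid, 0.0);
    double right = std::accumulate(power.begin() + pmid, power.begin() + phi,  0.0);
    // Split the work in proportion to the power of each half of the node set.
    int mid = lo + static_cast<int>((hi - lo) * left / (left + right));
    bisect(lo, mid, power, plo, pmid, owner);
    bisect(mid, hi, power, pmid, phi, owner);
}

int main() {
    std::vector<double> power = {1.0, 1.0, 2.0, 4.0};   // relative node speeds (assumed)
    std::vector<int> owner(80, -1);
    bisect(0, 80, power, 0, static_cast<int>(power.size()), owner);
    std::printf("item 0 -> node %d, item 79 -> node %d\n", owner.front(), owner.back());
    return 0;
}
```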
Runtime Compilation • A problem for automatic load balancing (especially on irregular grids): • Unknown loop upper bounds • Unknown array sizes • Inspector/executor model (see the sketch below) • Inspector: executed a single time at run time; establishes a plan for efficient execution • Executor: executed on each iteration; carries out the plan defined by the inspector
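A minimal inspector/executor sketch for an irregular gather y[i] = x[idx[i]] (names and data are illustrative; a real system would turn the plan into a communication schedule): the inspector runs once at run time to build the plan, and the executor reuses it on every iteration.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Plan { std::vector<int> touched; };   // precomputed access plan (assumed shape)

// Inspector: run once at run time, when the index set is finally known.
Plan inspector(const std::vector<int>& idx) {
    Plan p;
    p.touched = idx;   // a real inspector would map indices to owners and build a schedule
    return p;
}

// Executor: carries out the plan on each iteration without re-analysis.
void executor(const Plan& p, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < p.touched.size(); ++i)
        y[i] = x[p.touched[i]];
}

int main() {
    std::vector<double> x = {10.0, 20.0, 30.0, 40.0};
    std::vector<int> idx  = {3, 0, 2};          // irregular, known only at run time
    std::vector<double> y(idx.size(), 0.0);

    Plan plan = inspector(idx);                 // executed a single time
    for (int iter = 0; iter < 5; ++iter)        // executor reused every iteration
        executor(plan, x, y);
    std::printf("%f %f %f\n", y[0], y[1], y[2]);
    return 0;
}
```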
Libraries • Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK) • Data structure library: a parallel data structure is maintained within the library and its representation is hidden from the user (DAGH) • Well suited to OO languages • Provides maximum flexibility to the library developer to manage runtime challenges • Heterogeneous networks • Adaptive gridding • Variable latencies • Drawback: their components are currently treated by compilers as black boxes • Some form of collaboration between compiler and library might be possible, particularly in an interprocedural compilation framework
Programming Tools • Tools like Pablo, Gist and Upshot can show where performance bottlenecks exist • Performance-tuning tools
Future Directions (Assumptions) • The user is responsible for both problem decomposition and assignment • Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power • Some portion of compilation will be invoked after this service
Task Compilation • Constructing a task graph, along with an estimate of the running time of each task • Task graph construction and decomposition • Performance estimation • Restructuring the program to better suit the target grid configuration • Assignment of components of the task graph to the available nodes (see the sketch below) • Java
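A hedged sketch of assigning task-graph components to nodes of different power (task names, estimated times, and node powers are illustrative assumptions, and dependence edges are omitted): tasks with estimated running times are placed greedily on the node with the earliest predicted finishing time.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Task { const char* name; double est_time; };   // cost estimate on a unit-power node

int main() {
    std::vector<Task> tasks = {{"mesh", 40.0}, {"solve", 100.0}, {"vis", 20.0}, {"io", 10.0}};
    std::vector<double> power  = {1.0, 2.0};           // relative node powers (assumed)
    std::vector<double> finish(power.size(), 0.0);     // predicted finish time per node

    // Longest tasks first, each placed on the node that would finish it earliest.
    std::sort(tasks.begin(), tasks.end(),
              [](const Task& a, const Task& b) { return a.est_time > b.est_time; });
    for (const Task& t : tasks) {
        std::size_t best = 0;
        for (std::size_t n = 1; n < power.size(); ++n)
            if (finish[n] + t.est_time / power[n] < finish[best] + t.est_time / power[best])
                best = n;
        finish[best] += t.est_time / power[best];
        std::printf("%s -> node %zu (predicted finish %.1f)\n", t.name, best, finish[best]);
    }
    return 0;
}
```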
Grid Shared Memory (Challenges) • Different nodes have different page sizes and paging mechanisms • Good performance estimation • Managing the system-level interactions that provide DSM
Global Grid Compilation • Providing a programming language and compilation strategy targeted at the grid • A mixture of parallelism styles: data parallelism and task parallelism • Data decomposition • Function decomposition