
COMP60611 Fundamentals of Parallel and Distributed Systems



Presentation Transcript


  1. COMP60611 Fundamentals of Parallel and Distributed Systems Lecture 3 Introduction to Parallel Computers John Gurd, Graham Riley Centre for Novel Computing School of Computer Science University of Manchester

  2. Overview • We focus on the lower of the two implementation-oriented Levels of Abstraction • The Computer Level • Von Neumann sequential architecture • two fundamental ways to go parallel • shared memory • distributed memory • implications for programming language implementations • Summary

  3. Conventional View of Computer Architecture • We start by recalling the traditional view of a state-based computer, as first expounded by John von Neumann (1945). • A finite word-at-a-time memory is attached to a Central Processing Unit (CPU) (a “single core” processor). The memory contains the fixed code and initial data defining the program to be executed. The CPU contains a Program Counter (PC) which points to the next instruction to be executed (initially, the first instruction of the program). The CPU follows the instruction execution cycle, similar to the state-transition cycle described earlier for the Program Level. • The memory is accessed solely by requests of the following form: <read,address> or <write,data,address>
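To make the memory interface concrete, here is a minimal C sketch (not part of the original slides; all names are illustrative) of a word-at-a-time memory that is accessed only through <read,address> and <write,data,address> requests, driven by a skeletal fetch-execute loop:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical word-at-a-time memory: every access is either
 * <read,address> or <write,data,address>, as described above. */
#define MEM_WORDS 1024
static uint32_t memory[MEM_WORDS];

static uint32_t mem_read(uint32_t address)              /* <read,address>       */
{
    return memory[address];
}

static void mem_write(uint32_t data, uint32_t address)  /* <write,data,address> */
{
    memory[address] = data;
}

int main(void)
{
    mem_write(0xDEADBEEF, 0);             /* pretend to load a program          */
    uint32_t pc = 0;                      /* Program Counter: next instruction  */
    for (int step = 0; step < 4; step++) {
        uint32_t instruction = mem_read(pc);   /* fetch                         */
        pc = pc + 1;                           /* advance the PC                */
        /* decode and execute would go here; we just print the fetched word    */
        printf("fetched 0x%08x from address %u\n", instruction, pc - 1);
    }
    return 0;
}
```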

  4. Conventional View of Computer Architecture This arrangement is conveniently illustrated by the following diagram:

  5. Conventional View of Computer Architecture • A memory address is a unique, global identifier for a memory location – the same address always accesses (the value held in) the same location of memory. The range of possible addresses defines the address space of the computer. • All addresses 'reside' in one logical memory; there is therefore only one interface between CPU and memory. • Memory is often organised as a hierarchy, and the address space can be virtual; i.e. two requests to the same logical location may not physically access the same part of the memory hardware --- this is what happens, for example, in systems with cache memory, or with a disk-based paged virtual memory. Access times to a virtual-addressed memory vary considerably depending on where the addressed item currently resides.
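This variation in access time can be observed directly. The rough C sketch below (an illustration only, not from the slides; it assumes a POSIX system for clock_gettime, and a 64-byte cache line) sweeps a small working set that fits in cache and a large one that does not, and reports the average time per access to the same logical memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative only: the same logical memory shows different access times
 * depending on where the data currently resides (cache vs main memory). */
static double ns_per_access(volatile long *a, size_t n, int reps)
{
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i += 8)     /* one access per assumed 64-byte line */
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    return 1e9 * secs / ((double)(n / 8) * reps);
}

int main(void)
{
    size_t small = 4096, large = 16UL * 1024 * 1024;    /* elements of type long */
    volatile long *a = malloc(large * sizeof *a);
    for (size_t i = 0; i < large; i++) a[i] = i;        /* touch every page       */
    printf("small working set: %.2f ns/access\n", ns_per_access(a, small, 20000));
    printf("large working set: %.2f ns/access\n", ns_per_access(a, large, 5));
    free((void *)a);
    return 0;
}
```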

  6. Parallel Computer Architecture • Imagine that we have a parallel program consisting of two threads or two processes, A and B. We need two program counters to run them. There are two ways of arranging this: • Time-sharing – we use a sequential computer, as described above, and arrange that A and B are periodically allowed to use the (single) CPU to advance their activity. When the “current” thread or process is deselected from the CPU, its entire state is remembered so that it can restart from the latest position when it is reselected. • Multiprocessor – we build a structure with two separate CPUs, both accessing a common memory. Code A and Code B will be executed on the two different CPUs. • Time-sharing provides concurrency but no genuine speedup (and hence is not conducive to high performance), but it does get used in certain circumstances. However, we shall ignore it from now on.
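As a concrete view of the "two program counters" idea, the C sketch below (illustrative only; it assumes a POSIX system with the pthreads library, compiled with -pthread) creates two threads, A and B, each an independent flow of control. On a multiprocessor the operating system can run them on two CPUs at once; on a single CPU it falls back to time-sharing them:

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread is an independent flow of control with its own "program
 * counter"; the OS maps the two onto one CPU (time-sharing) or two CPUs. */
static void *work(void *arg)
{
    const char *name = arg;
    for (int i = 0; i < 3; i++)
        printf("thread %s: step %d\n", name, i);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, work, "A");
    pthread_create(&b, NULL, work, "B");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```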

  7. Parallel Computer Architecture In diagrammatic form, the multiprocessor appears as follows:

  8. Parallel Computer Architecture • The structure on the previous slide is known as a shared memory multiprocessor, for obvious reasons. • The memory interface is the same as for the sequential architecture, and both memory and addresses retain the same properties. • However, read accesses to memory now have to remember which CPU they came from so that the read data can be returned to the correct CPU.

  9. Parallel Computer Architecture • But access to the common memory is subject to contention, when both CPUs try to access memory at the same time. The greater the number of parallel CPUs, the worse this contention problem becomes. • A commonly used solution is to split the memory into multiple banks which can be accessed simultaneously. This arrangement is shown below:

  10. Parallel Computer Architecture • In this arrangement, the interconnect directs each memory access to an appropriate memory bank, according to the required address. Addresses may be allocated across the memory banks in many different ways (interleaved, in blocks, etc.). • The interconnect could be a complex switch mechanism, with separate paths from each CPU to each memory bank, but this is expensive in terms of physical wiring. • Hence, cheap interconnect schemes, such as a bus, tend to be used. However, these limit the number of CPUs and memory banks that can be connected together (to a maximum of around 30).
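The mapping from address to memory bank can be illustrated with a small C sketch (illustrative only, not from the slides; real interconnects typically work at cache-line rather than word granularity). It contrasts interleaved allocation, where consecutive addresses cycle around the banks, with block allocation, where each bank holds one contiguous region of the address space:

```c
#include <stdint.h>
#include <stdio.h>

/* Two common ways of allocating addresses across NBANKS memory banks. */
#define NBANKS     4
#define BANK_WORDS 1024      /* capacity of each bank, in words */

/* Interleaved: consecutive addresses go to consecutive banks. */
static unsigned bank_interleaved(uint32_t addr) { return addr % NBANKS; }

/* Blocked: each bank holds one contiguous block of addresses. */
static unsigned bank_blocked(uint32_t addr)     { return addr / BANK_WORDS; }

int main(void)
{
    uint32_t samples[] = { 0, 1, 2, 3, 1024, 2048, 3072, 4095 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t addr = samples[i];
        printf("addr %4u -> interleaved bank %u, blocked bank %u\n",
               addr, bank_interleaved(addr), bank_blocked(addr));
    }
    return 0;
}
```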

  11. Parallel Computer Architecture • Two separate things motivate the next refinement: • Firstly, we can double the capacity of a bus by physically co-locating a CPU and a memory bank and letting them share the same bus interface. • Secondly, we know from analysis of algorithms (and the development of programs from them) that many of the required variables are private to each thread. By placing private variables in the co-located memory, we can avoid having to access the bus in the first place. • Indeed, we don't really need to use a bus for the interconnect. • The resulting structure has the memory physically distributed amongst the CPUs. Each CPU-plus-memory resembles a von Neumann computer, and the structure is called a distributed memory multicomputer.
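The benefit of keeping private variables local can be sketched as follows (a hedged C illustration using pthreads; the names are not from the slides). Each thread accumulates into a variable on its own stack, which lives in its co-located memory or in registers, and touches the shared total exactly once at the end instead of on every iteration:

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread keeps its working data private (stack-local) and performs
 * only one update to shared memory, minimising traffic over the interconnect. */
#define N 1000000

static long shared_total = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg)
{
    long base = *(long *)arg;
    long local = 0;                     /* private to this thread        */
    for (long i = 0; i < N; i++)
        local += base + i;              /* no shared-memory traffic here */

    pthread_mutex_lock(&lock);          /* one shared update at the end  */
    shared_total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    long base_a = 0, base_b = N;
    pthread_create(&a, NULL, partial_sum, &base_a);
    pthread_create(&b, NULL, partial_sum, &base_b);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("total = %ld\n", shared_total);
    return 0;
}
```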

  12. Parallel Computer Architecture The architecture diagram for a distributed memory multicomputer is shown below:

  13. Distributed Computer Architecture • Some distributed memory multicomputer systems have a single address space in which the available addresses are partitioned across the memory banks. • These typically require special hardware support in the interconnect. • Others have multiple address spaces in which each CPU is able to issue addresses only to its 'own' local memory bank. • Finally, interconnection networks range from very fast, very expensive, specialised hardware to ‘the Internet’.

  14. Parallel Computer Architecture • The operation of the single address space version of this architecture, known as distributed shared memory (DSM), is logically unchanged from the previous schemes (shared memory multiprocessor). • However, some memory accesses only need to go to the physically attached local memory bank, while others, according to the address, have to go through the interconnect. This leads to different access times for different memory locations, even in the absence of contention. • This latter property makes distributed shared memory a non-uniform memory access (NUMA) architecture. For high performance, it is essential to place code and data for each thread or process in readily accessible memory banks. • In multiple address space versions of this architecture (known as distributed memory or DM), co-operative parallel action has to be implemented by message-passing software (at least at the level of the runtime system).
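For the multiple-address-space (DM) case, the message passing implied above might look like the following minimal MPI sketch in C (an illustration only, assuming an MPI library is installed; run with something like mpirun -np 2). Process 0 owns the value in its own address space and must explicitly send it to process 1, which has no other way of seeing it:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing sketch for a distributed memory (DM) system:
 * each process has its own address space, so data moves only via messages. */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* to process 1   */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```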

  15. Parallel Computer Architecture • Note that cache memories can be used to solve some of the problems raised earlier; e.g. to reduce the bus traffic in a distributed shared memory architecture. • Many systems have a NUMA structure, but their single address space is virtual. This arrangement is sometimes referred to as virtual shared memory (VSM). • The effect of VSM can be implemented on a DM system entirely in software, in which case it is usually called distributed virtual shared memory (DVSM). • Most very large systems today consist of many shared-memory multicore nodes connected via some form of interconnect.

  16. The Advent of Multicore • A modern multicore processor is essentially a NUMA shared memory multiprocessor on a chip. • Consider a recent offering from AMD, the Opteron quad-core processor: • the next slides show a schematic of a single quad-core processor and a shared memory system consisting of four quad-core processors, i.e. a “quad-quad-core” system, with a total of 16 cores. • The number of cores per processor chip is rising rapidly (to keep up with “Moore’s law”). • Large systems connect thousands of multi-core processors.

  17. Processor: Quad-Core AMD Opteron Source: www.amd.com, Quad-Core AMD Opteron Product Brief

  18. AMD Opteron 4P server architecture Source: www.amd.com, AMD 4P Server and Workstation Comparison

  19. Summary • The transition from a sequential computer architecture to a parallel computer architecture can be made in three distinct ways: • shared memory multiprocessor • distributed shared memory multiprocessor • distributed memory multicomputer • Nothing prevents more than one of these forms of architecture being used together. They are simply different ways of introducing parallelism at the hardware level. • Modern large systems are hybrids, with shared memory within each node and distributed memory across the nodes.

  20. From Program to Computer The final part of this jigsaw is to know how parallel programs get executed in practical parallel and distributed computers. Recall the nature of the parallel programming constructs introduced earlier; we consider their implementation, in both the run-time software library and the underlying hardware.

  21. From Program to Computer • It is perhaps tempting to think that message-passing somehow ‘belongs’ to distributed memory architecture, and data-sharing ‘belongs’ to shared memory architecture (in other words, that programs in the programming model ‘map’ naturally and efficiently to the ‘belonging’ hardware). • But this is not necessarily the case. Either kind of programming model may be (and has been) implemented on either kind of architecture. The two derivations (from sequential to parallel, in software and in hardware) are completely independent of one another. • The key issue in practice is how much ‘overhead’ is introduced by the implementation of each parallel programming construct on a particular parallel architecture.
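To make this independence of model and architecture concrete, here is a hedged C sketch (pthreads assumed; all names are illustrative, not a real runtime system) of message passing implemented on top of shared memory: the "channel" is simply a shared mailbox protected by a mutex and a condition variable, which is roughly what a message-passing runtime does when it targets a shared memory machine. The locking and waiting involved are exactly the kind of implementation overhead referred to above.

```c
#include <pthread.h>
#include <stdio.h>

/* A one-slot "message channel" built on shared memory. */
typedef struct {
    int value;
    int full;                       /* 1 when a message is waiting */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} mailbox_t;

static mailbox_t box = { 0, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

static void msg_send(mailbox_t *m, int value)
{
    pthread_mutex_lock(&m->lock);
    while (m->full)                          /* wait for the slot to empty   */
        pthread_cond_wait(&m->cond, &m->lock);
    m->value = value;
    m->full = 1;
    pthread_cond_signal(&m->cond);
    pthread_mutex_unlock(&m->lock);
}

static int msg_recv(mailbox_t *m)
{
    pthread_mutex_lock(&m->lock);
    while (!m->full)                         /* wait for a message to arrive */
        pthread_cond_wait(&m->cond, &m->lock);
    int value = m->value;
    m->full = 0;
    pthread_cond_signal(&m->cond);
    pthread_mutex_unlock(&m->lock);
    return value;
}

static void *sender(void *arg)   { (void)arg; msg_send(&box, 42); return NULL; }
static void *receiver(void *arg) { (void)arg; printf("received %d\n", msg_recv(&box)); return NULL; }

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}
```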

  22. The Story So Far: Application-Oriented View • Solving a computational problem involves design and development at several distinct Levels of Abstraction. The totality of issues to be considered well exceeds the capabilities of a normal human being. • At Application Level, the description of the problem to be solved is informal. A primary task in developing a solution is to create a formal (mathematical) application model, or specification. Although formal and abstract, an application model implies computational work that must be done and so ultimately determines the performance that can be achieved in an implementation. • Algorithms are procedures, based on discrete data domains, for solving (approximations to) computational problems. An algorithm is also abstract, although it is generally more clearly related to the computer that will implement it than is the corresponding specification.

  23. The Story So Far: Implementation-Oriented View • Concrete implementation of an algorithm is achieved through the medium of a program, which determines how the discrete data domains inherent in the algorithm will be laid out in the memory of the executing computer, and also defines the operations that will be performed on that data, and their relative execution order. Interest is currently focused on parallel execution using parallel programming languages, based on multiple active processes (with message-passing) or multiple threads (with data-sharing). • Performance is ultimately dictated by the available parallel platform, via the efficiency of its support for processes or threads. Hardware architectures are still evolving, but a clear trend is emerging towards distributed memory structure, with various levels of support for sharing data at the Program Level. • Correctness requires an algorithm which is correct with respect to the specification, AND a correct implementation of the algorithm (as well as the correct operation of computer hardware and network infrastructure).
