(Superficial!) Review of Uniprocessor Architecture, Parallel Architectures, and Related Concepts

CS 433, Laxmikant Kale, University of Illinois at Urbana-Champaign, Department of Computer Science

Early machines
• We will present a series of idealized and simplified models
• Read more about the real models in architecture textbooks
• Official prerequisites: CS 232, CS 333
• The idea here is to review the concepts and define our vocabulary
[Figure: a processor connected to a memory with locations 0, 1, ..., k]

Early machines
• Early machines: complex instruction sets and (let's say) no registers
• The processor can access any memory location equally fast
• Instructions:
  • Operations: Add L1, L2, L3 (add the contents of location L1 to those of location L2, and store the result in L3)
  • Branching: Branch to L4 (note that some locations store program instructions)
  • Conditional branching: If (L1 > L2) goto L3
[Figure: a processor connected to a memory with locations 0, 1, ..., k]
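The three instruction kinds on this slide can be sketched as a tiny interpreter. This is only an illustrative Python sketch, not a model of any real machine: the tuple encoding, the `run` function, and the `halt` instruction are assumptions introduced here, and the program is kept in a separate list rather than in memory locations to keep the example short.

```python
def run(program, memory, max_steps=1000):
    """Execute memory-to-memory instructions on `memory` (a list of ints)."""
    pc = 0  # index of the next instruction to execute
    for _ in range(max_steps):
        op, *args = program[pc]
        if op == "add":          # Add L1, L2, L3: memory[L3] = memory[L1] + memory[L2]
            l1, l2, l3 = args
            memory[l3] = memory[l1] + memory[l2]
            pc += 1
        elif op == "branch":     # Branch to L4: unconditional jump
            pc = args[0]
        elif op == "branch_gt":  # If (L1 > L2) goto L3
            l1, l2, target = args
            pc = target if memory[l1] > memory[l2] else pc + 1
        elif op == "halt":       # hypothetical stop instruction (not on the slide)
            return memory
    raise RuntimeError("step limit exceeded")

# Memory layout (an assumption for this example):
# location 0 = counter i, 1 = constant 1, 2 = limit, 3 = running sum
program = [
    ("branch_gt", 2, 0, 2),  # 0: if limit > i, go to instruction 2
    ("halt",),               # 1: otherwise stop
    ("add", 0, 1, 0),        # 2: i = i + 1
    ("add", 3, 0, 3),        # 3: sum = sum + i
    ("branch", 0),           # 4: loop back to the test
]
```

Calling `run(program, [0, 1, 5, 0])` adds up 1 through 5 in location 3; every operand is a memory location, which is the point of the "no registers" model.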

Registers
• Processors are faster than memory
  • They can operate on data held within the processor much faster
• So, create some locations in the processor for storing data
  • Called registers, often with a special register called the accumulator
• Now we need new instructions for dealing with data in registers:
  • Data movement instructions
    • Move from register to memory, memory to register, register to register, and memory to memory
  • Computation instructions
    • In addition to the previous ones, we now add instructions that allow one or more operands to be registers
[Figure: CPU containing a processor and registers, connected to memory]

Load-Store architectures (RISC)
• Do not allow memory locations to be operands
  • For computation as well as control instructions
• The only instructions that reference memory are:
  • Load R, L # move contents of memory location L into register R
  • Store R, L # move contents of register R into memory location L
• Notice that the number of instructions is now dramatically reduced
• Further, allow only relatively simple instructions to do register-to-register operations
  • More complex operations are implemented in software
  • The compiler has a bigger responsibility now
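To make the restriction concrete, here is a hedged Python sketch of how a memory-to-memory `Add L1, L2, L3` could be rewritten for a load-store machine. The function names `lower_add` and `execute`, the register names `R1` to `R3`, and the tuple encoding are all assumptions made for illustration.

```python
def lower_add(l1, l2, l3):
    """Rewrite a memory-to-memory 'Add L1, L2, L3' as load-store instructions."""
    return [
        ("load", "R1", l1),         # Load R1, L1
        ("load", "R2", l2),         # Load R2, L2
        ("add", "R3", "R1", "R2"),  # register-to-register add: R3 = R1 + R2
        ("store", "R3", l3),        # Store R3, L3
    ]

def execute(instructions, memory):
    """Run load/store/add instructions; only load and store touch `memory`."""
    regs = {}
    for op, *args in instructions:
        if op == "load":
            reg, loc = args
            regs[reg] = memory[loc]
        elif op == "store":
            reg, loc = args
            memory[loc] = regs[reg]
        elif op == "add":
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
    return memory
```

One complex instruction becomes four simple ones. In practice the compiler does this translation, which is the "bigger responsibility" mentioned on the slide.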

Caches
• The processor still has to wait for data from memory
  • I.e., Load and Store instructions are slower
  • Although more often the CPU is executing register-only instructions
• Load and store latency
  • Dictionary meaning: latency is the delay between stimulus and response
  • Or: the delay between a data-transfer instruction and the beginning of the data transfer
• But faster SRAM memory is available (although expensive)
• Idea: just like registers, put some more of the data in faster memory
  • Which data?
• Principle of locality (an empirical observation):
  • Data accessed correlates with past accesses, spatially and temporally
  • Without this, caches would be worthless (unless most of the data fits in the cache)

Caches
The processor still issues load and store instructions as before, but the cache controller intercepts the requests and, if the location has been cached, serves them from the cache.
Data transfer between cache and memory is not seen by the processor.
[Figure: processor, cache controller, cache, and memory]

Cache Issues
• Level 2 cache
• Cache lines
  • Bring a bunch of data “at once”:
    • Exploits spatial locality
    • Block transfers are faster
  • 64-128 byte cache lines are typical
  • Trade-off: why larger and larger cache lines aren’t good either
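The effect of line size can be sketched with a small miss counter. This is a simplified Python model under stated assumptions: the cache starts empty and never evicts, and the function name `cold_misses` and the two access patterns are made up for this example.

```python
def cold_misses(addresses, line_size):
    """Count misses for a cache that starts empty and never evicts."""
    cached_lines = set()
    misses = 0
    for addr in addresses:
        line = addr // line_size  # which line of the address space holds addr
        if line not in cached_lines:
            misses += 1           # the whole line is brought in on a miss
            cached_lines.add(line)
    return misses

sequential = range(0, 4096, 8)       # 512 consecutive 8-byte words
strided = range(0, 512 * 512, 512)   # 512 words, each 512 bytes apart

for line_size in (8, 64, 128):
    print(line_size,
          cold_misses(sequential, line_size),
          cold_misses(strided, line_size))
```

With good spatial locality (the sequential sweep), larger lines cut the miss count. With the strided pattern every access still misses, so a larger line only moves more unused bytes per miss, which is one side of the trade-off.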

Cache blocks and Cache Lines
A cache block is a physical part of the cache. A cache line is a section of the address space. A line is brought into a cache block. Of course, line size and block size are the same.
[Figure: processor, cache controller, cache, and memory; line L1 of memory is held in a block of the cache]

Cache Management
• How is the cache managed?
• Its job: given an address, find out whether it is in the cache, and return the contents if so
  • Also, write data back to memory when needed
  • And bring data from memory when needed
• Ideally, a fully associative cache would be good
  • Keep cache lines anywhere in the physical cache
  • But lookup is hard: a line could be in any block

Cache management
• Alternative scheme:
  • Each cache line (i.e., address) has exactly one place in the cache memory where it can be stored
  • Of course, more than one cache line will have the same area of cache memory as its possible target
    • Why? (The address space is much larger than the cache)
  • Only one cache line can live inside a cache block at a time
  • If you want to bring in a new one, the old one must be “emptied”
• A trade-off: set-associative caches
  • Have each line map to more than one (say 4) physical locations
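The three placement schemes can be compared with one small simulator. This is a Python sketch under assumptions the slides do not state: the set index is the line number modulo the number of sets, and the line to evict is the least recently used one. The class and function names are hypothetical.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """ways=1 is direct-mapped; ways=num_blocks is fully associative."""

    def __init__(self, num_blocks, ways, line_size=64):
        if num_blocks % ways != 0:
            raise ValueError("num_blocks must be a multiple of ways")
        self.ways = ways
        self.line_size = line_size
        self.num_sets = num_blocks // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then brought in)."""
        line = addr // self.line_size
        blocks = self.sets[line % self.num_sets]  # the only set this line may use
        if line in blocks:
            blocks.move_to_end(line)       # mark as most recently used
            return True
        if len(blocks) == self.ways:
            blocks.popitem(last=False)     # "empty" the least recently used block
        blocks[line] = True
        return False

def count_misses(cache, addresses):
    return sum(not cache.access(a) for a in addresses)

# Two lines that collide in an 8-block direct-mapped cache (lines 0 and 8).
ping_pong = [0, 512] * 5
```

On `ping_pong`, the direct-mapped cache `SetAssociativeCache(8, 1)` misses on every access because the two lines keep evicting each other, while a 4-way cache of the same size misses only on the first touch of each line.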

Parallel Machines: an abstract introduction
• Our main focus will be on three kinds of machines
  • Bus-based shared memory machines
  • Scalable shared memory machines
    • Cache coherent
    • Hardware support for remote memory access
  • Distributed memory machines

Bus-based machines
[Figure: processing elements PE0, PE1, ..., PE N-1 and memory modules Mem0, Mem1, ..., Memk connected by a shared bus]

Bus-based machines
• Any processor can access any memory location
  • Read and write
• Bus bandwidth is a limiting factor
• Also, how do you deal with two processors changing the same data?
  • Locks (more on this later)
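As a preview of locks, here is a minimal Python sketch in which threads stand in for processors that update the same shared location. The function name `run_counter` and its parameters are assumptions for this example; real machines provide hardware primitives for building such locks.

```python
import threading

def run_counter(num_threads=4, increments=10_000):
    """Each thread adds 1 to a shared counter `increments` times."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:        # only one thread at a time may update the counter
                counter += 1  # read-modify-write of the shared data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock held around each update, the result is always `num_threads * increments`. Without it, two updates can interleave their read and write steps and one of them can be lost.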

Scalable shared memory machines
[Figure: processing elements (PEs) and memory modules (Mem) connected by an interconnection network with support for remote memory access]
• Not popular, as all data is slow to access

Distributed memory machines
[Figure: processing elements PE0, PE1, ..., PEp, each with its own memory (Mem0, Mem1, ..., Memp), connected by an interconnection network]

Introducing caches into the picture!
• Now we have more complex problems:
  • They can’t be fixed by locks alone
  • Copies of the same variable in two different caches may contain different values
• The cache controller must do more
[Figure: PE0, PE1, ..., PE p-1, each with its own cache, sharing memory modules Mem0, Mem1, ..., Mem p-1]
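The stale-copy problem can be sketched with two private caches that know nothing about each other. This is a deliberately broken Python model written for illustration: the `PrivateCache` class, its write-back behavior, and the `demo` function are assumptions, not a description of any real cache controller.

```python
class PrivateCache:
    """A write-back cache with no coherence support (deliberately broken)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory (a dict: address -> value)
        self.lines = {}       # this cache's private copies

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]  # miss: fetch from memory
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value  # update only the local copy

def demo():
    memory = {0: 10}
    cache0, cache1 = PrivateCache(memory), PrivateCache(memory)
    cache0.read(0)       # PE0 caches the variable
    cache1.read(0)       # PE1 caches the same variable
    cache0.write(0, 99)  # PE0 changes it
    return cache0.read(0), cache1.read(0), memory[0]
```

`demo()` returns `(99, 10, 10)`: PE1 and main memory still hold the old value. A lock around the write would not help, because PE1's later read is served from its own stale copy; this is why the cache controller has to do more.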

Distributed memory machines
[Figure: PE0, PE1, ..., PE p-1, each with its own cache and memory (Mem0, Mem1, ..., Mem p-1), connected by an interconnection network]

Writing parallel programs
• Programming model
  • How should a programmer view the parallel machine?
• Sequential programming: von Neumann model
• Parallel programming models:
  • Shared memory (shared address space) model
  • Message passing model
  • Shared objects model
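The difference between the first two parallel models can be hinted at with the same computation written both ways. This is a rough Python sketch: threads and a `queue.Queue` stand in for processors and a network, and the function names are made up for this example. On a real distributed memory machine the workers would not share an address space at all.

```python
import queue
import threading

def _run_workers(worker, data, num_workers):
    """Give each worker thread an interleaved slice of `data`, then wait for all."""
    threads = [threading.Thread(target=worker, args=(data[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def sum_shared_memory(data, num_workers=2):
    """Shared memory view: workers update one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:
            total += partial

    _run_workers(worker, data, num_workers)
    return total

def sum_message_passing(data, num_workers=2):
    """Message passing view: workers send partial sums; nothing is updated in place."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # "send" the partial result

    _run_workers(worker, data, num_workers)
    return sum(mailbox.get() for _ in range(num_workers))  # "receive" and combine
```

Both functions return the same total. The contrast is in how the programmer views the machine: one shared variable that everyone can touch, versus private data plus explicit sends and receives.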
