This lecture provides an overview of parallel computer architecture: communication architecture, the major programming models, and the general structure and fundamental issues of parallel machines. It contrasts abstraction with implementation, presents a layered view spanning parallel applications (CAD, databases, scientific modeling) down to the physical communication medium, and surveys representative machines.
Parallel Computer Architecture (并行计算机体系结构), Lecture 2. March 4, 2019. Wu Junmin (jmwu@ustc.edu.cn)
Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation
Parallel Computer Architecture • Two facets of Computer Architecture: • Defines critical abstractions • especially at the HW/SW boundary • the set of operations and the data types they operate on • Organizational structure that realizes these abstractions • Parallel Computer Arch. = Comp. Arch. + Communication Arch. • Comm. architecture has the same two facets • communication abstraction • primitives at the user/system and hw/sw boundaries
Layered Perspective of PCA • [Figure: layers from top to bottom] • Parallel applications: CAD, database, scientific modeling • Programming models: multiprogramming, shared address, message passing, data parallel • Compilation or library • Communication abstraction (user/system boundary) • Operating systems support • (hardware/software boundary) • Communication hardware • Physical communication medium
Communication Architecture User/System Interface + Organization • User/System Interface: • Comm. primitives exposed to user-level by hw and system-level sw • Implementation: • Organizational structures that implement the primitives: HW or OS • How optimized are they? How integrated into processing node? • Structure of network • Goals: • Performance • Broad applicability • Programmability • Scalability • Low Cost
Communication Abstraction • User-level communication primitives provided • Realizes the programming model • Mapping exists between language primitives of the programming model and these primitives • Supported directly by hw, or via OS, or via user sw • Much debate about what to support in sw and about the gap between layers • Today: • Hw/sw interface tends to be flat, i.e. complexity roughly uniform • Compilers and software play important roles as bridges • Technology trends exert strong influence • Result is convergence in organizational structure • Relatively simple, general-purpose communication primitives
Understanding Parallel Architecture • Traditional taxonomies not very useful • Programming models not enough, nor hardware structures • The same model can be supported by radically different architectures • Compilers, libraries, programs • Design of the user/system and hardware/software interfaces • Constrained from above by programming models and from below by technology • Guiding principles provided by layers • What primitives are provided at the communication abstraction • How programming models map to these • How they are mapped to hardware
Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation
History • [Figure: divergent architectures: application software and system software layered over systolic arrays, SIMD, message passing, dataflow, and shared memory architectures] • Parallel architectures tied closely to programming models • Divergent architectures, with no predictable pattern of growth • Mid-80s renaissance
Programming Model • Conceptualization of the machine that the programmer uses in coding applications • How parts cooperate and coordinate their activities • Specifies communication and synchronization operations • Multiprogramming • no communication or synchronization at the program level • Shared address space • like a bulletin board • Message passing • like letters or phone calls, explicit point to point • Data parallel: • more regimented, global actions on data • Several agents perform an action on separate elements of a data set simultaneously and then exchange information globally before continuing en masse • Implemented with shared address space or message passing
Shared Memory => Shared Addr. Space • Primitives map one to one onto the communication abstraction • Why is it attractive? • Ease of programming • Memory capacity increased by adding modules • I/O by adding controllers and devices • Add processors for processing! • For higher-throughput multiprogramming, or parallel programs
Shared Physical Memory • Any processor can directly reference any memory location • Any I/O controller can access any memory • Operating system can run on any processor, or all • OS uses shared memory to coordinate • Communication occurs implicitly as a result of loads and stores • What about application processes?
Shared Virtual Address Space • Process = address space plus thread(s) of control • Virtual-to-physical mappings can be established so that processes share portions of their address spaces • User-kernel or multiple processes • Multiple threads of control in one address space • Popular approach to structuring OS's • Now standard application capability (e.g. POSIX threads, Java threads) • Writes to shared addresses visible to other threads • Natural extension of the uniprocessor model • conventional memory operations (loads/stores) for communication • special atomic operations for synchronization • a minimal example follows
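A minimal sketch of this model with POSIX threads (the example, including the variable names, is illustrative and not from the lecture): one thread communicates a value through an ordinary shared variable, and the join supplies the synchronization that orders the store before the load.

#include <pthread.h>
#include <stdio.h>

int shared_x = 0;                          /* lives in the shared address space */

void *producer(void *arg) {
    shared_x = 42;                         /* conventional store communicates the value */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);                 /* synchronization: the store above is ordered before the load below */
    printf("consumer read %d\n", shared_x);  /* conventional load sees the other thread's write */
    return 0;
}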
[Figure: virtual address spaces of processes P0 … Pn map their shared portions to common physical addresses, so a store by Pn can be loaded by P0; the private portions (P private 0 … P private n) map to distinct physical memory] Structured Shared Address Space • Ad hoc parallelism used in system code • Most parallel applications have a structured SAS • Same program on each processor • shared variable X means the same thing to each thread
Shared Memory: code segment for computing π
#include <stdio.h>
#define N 100000
main()
{
    double local, pi = 0.0, w, temp;
    long i;
A:  w = 1.0 / N;
B:  #pragma parallel
    #pragma shared(pi, w)
    #pragma local(i, local, temp)
    {
        local = 0.0;
    #pragma pfor iterate(i = 0; N; 1)
        for (i = 0; i < N; i++) {
P:          temp = (i + 0.5) * w;
Q:          local = local + 4.0 / (1.0 + temp * temp);
        }
C:  #pragma critical
        pi = pi + local;
    }
D:  printf("pi is %f\n", pi * w);
}
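The pragmas above are SGI-style directives; as a point of comparison, a sketch of the same computation in OpenMP, their present-day standard counterpart (assuming any OpenMP-capable C compiler), might look like:

#include <stdio.h>
#include <omp.h>
#define N 100000

int main(void) {
    double pi = 0.0, w = 1.0 / N;
    #pragma omp parallel
    {
        double local = 0.0;                  /* private partial sum per thread */
        #pragma omp for
        for (long i = 0; i < N; i++) {
            double temp = (i + 0.5) * w;
            local += 4.0 / (1.0 + temp * temp);
        }
        #pragma omp critical                 /* serialize the update of the shared total */
        pi += local;
    }
    printf("pi is %f\n", pi * w);
    return 0;
}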
Historical Development • “Mainframe” approach • Motivated by multiprogramming • Extends the crossbar used for memory and I/O • Processor cost-limited => crossbar • Bandwidth scales with p • High incremental cost • use multistage networks instead (trading latency and bandwidth) • “Minicomputer” approach • Almost all microprocessor systems have a bus • Motivated by multiprogramming, transaction processing • Used heavily for parallel computing • Called symmetric multiprocessor (SMP) • Latency larger than for a uniprocessor • Bus is the bandwidth bottleneck • caching is key: coherence problem • Low incremental cost
Engineering: Intel Pentium Pro Quad • All coherence and multiprocessing glue in processor module • Highly integrated, targeted at high volume • Low latency and bandwidth
Engineering: SUN Enterprise • Processor + memory cards and I/O cards • 16 cards of either type • All memory accessed over the bus, so symmetric • Higher bandwidth, higher latency bus
Scaling Up • [Figure: “dance hall” organization (processors and caches on one side of the network, memories on the other) vs. distributed-memory organization (a memory module local to each processor node)] • Problem is the interconnect: cost (crossbar) or bandwidth (bus) • Dance-hall: bandwidth still scalable, but lower cost than crossbar • latencies to memory uniform, but uniformly large • Distributed memory or non-uniform memory access (NUMA) • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response); a toy sketch follows • Caching shared (particularly nonlocal) data?
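A toy sketch of the read-request/read-response idea (the message format and the two-node model are invented purely for illustration): a load to a remote address becomes a request message, and the home node answers with the data.

#include <stdio.h>
#include <stdint.h>

typedef struct { int type; int src; uint64_t addr; double data; } Msg;
enum { READ_REQ, READ_RESP };

double node_mem[2][4] = { {1, 2, 3, 4}, {10, 20, 30, 40} };  /* each node's local memory */

/* the home node services a request message */
Msg serve(int home, Msg req) {
    Msg resp = { READ_RESP, home, req.addr, node_mem[home][req.addr] };
    return resp;
}

/* a "remote load" issued by node `me` for address `addr` on node `home` */
double remote_load(int me, int home, uint64_t addr) {
    Msg req = { READ_REQ, me, addr, 0.0 };
    Msg resp = serve(home, req);           /* network delivery is abstracted away here */
    return resp.data;
}

int main(void) {
    printf("node 0 reads node 1, addr 2: %g\n", remote_load(0, 1, 2));
    return 0;
}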
Engineering: Cray T3E • Scale up to 1024 processors, 480MB/s links • Memory controller generates request message for non-local references • No hardware mechanism for coherence • SGI Origin etc. provide this
[Figure: generic architecture, nodes of processor (P), cache ($), and memory (M) connected by a scalable network, shown as the convergence point of systolic arrays, SIMD, message passing, dataflow, and shared memory]
Message Passing Architectures • [Figure: complete computers (P, $, M) connected by a general network] • Complete computer as building block, including I/O • Communication via explicit I/O operations • Programming model • direct access only to the private address space (local memory) • communication via explicit messages (send/receive) • High-level block diagram • Communication integration? • Mem, I/O, LAN, Cluster • Easier to build and scale than SAS • Programming model more removed from basic hardware operations • Library or OS intervention
Message-Passing Abstraction • [Figure: process P executes send X, Q, t (local address X), process Q executes receive Y, P, t (local address Y); the matched pair copies data from P's address space into Q's] • Sender specifies the buffer to be transmitted and the receiving process • Receiver specifies the sending process and the application storage to receive into • Memory-to-memory copy, but need to name processes • Optional tag on send and matching rule on receive • User process names local data and entities in process/tag space too • In the simplest form, the send/recv match achieves a pairwise synchronization event • Other variants too • Many overheads: copying, buffer management, protection
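A minimal MPI sketch of this matching rule (ranks, the tag value, and the buffer names X and Y are chosen only for illustration; run with at least two processes, e.g. mpirun -np 2): rank 0 plays process P and sends X with tag t, while rank 1 plays process Q and receives into Y by naming the sender and the same tag.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, t = 99;
    double X = 3.14, Y = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* "send X, Q, t": data X, destination rank 1, tag t */
        MPI_Send(&X, 1, MPI_DOUBLE, 1, t, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* "receive Y, P, t": storage Y, source rank 0, matching tag t */
        MPI_Recv(&Y, 1, MPI_DOUBLE, 0, t, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", Y);
    }
    MPI_Finalize();
    return 0;
}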
Message Passing: code segment for computing π
#include <stdio.h>
#include <mpi.h>
#define N 100000
main(int argc, char *argv[])
{
    double local = 0.0, pi, w, temp = 0.0;
    long i;
    int taskid, numtask;
A:  w = 1.0 / N;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtask);
B:  for (i = taskid; i < N; i = i + numtask) {
P:      temp = (i + 0.5) * w;
Q:      local = 4.0 / (1.0 + temp * temp) + local;
    }
C:  MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
D:  if (taskid == 0) printf("pi is %f\n", pi * w);
    MPI_Finalize();
}
Evolution of Message-Passing Machines • Early machines: FIFO on each link • HW close to the programming model • synchronous ops • topology central (hypercube algorithms) • CalTech Cosmic Cube (Seitz, CACM Jan. 1985)
Diminishing Role of Topology • Shift to general links • DMA, enabling non-blocking ops • Buffered by system at destination until recv • Store&forward routing • Diminishing role of topology • Any-to-any pipelined routing • node-network interface dominates communication time • Simplifies programming • Allows richer design space • grids vs hypercubes • Intel iPSC/1 -> iPSC/2 -> iPSC/860 • Cost comparison: store-and-forward H × (T0 + n/B) vs. pipelined (cut-through) T0 + H·Δ + n/B, for H hops, an n-byte message, link bandwidth B, startup overhead T0, and per-hop delay Δ (a numeric comparison follows)
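A small numeric illustration of the two expressions (all parameter values below are made up for illustration):

#include <stdio.h>

int main(void) {
    double T0 = 1.0e-6;     /* per-message startup overhead, seconds */
    double delta = 0.1e-6;  /* per-hop routing delay, seconds */
    double B = 100e6;       /* link bandwidth, bytes/second */
    double n = 4096;        /* message size, bytes */
    int H = 10;             /* number of hops */

    double store_fwd = H * (T0 + n / B);        /* H x (T0 + n/B) */
    double pipelined = T0 + H * delta + n / B;  /* T0 + H*delta + n/B */
    printf("store-and-forward: %g s, pipelined: %g s\n", store_fwd, pipelined);
    return 0;
}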
Building on the mainstream: IBM SP-2 • Made out of essentially complete RS6000 workstations • Network interface integrated in I/O bus (bw limited by I/O bus)
Berkeley NOW • 100 Sun Ultra2 workstations • Intelligent network interface • proc + mem • Myrinet network • 160 MB/s per link • 300 ns per hop
Toward Architectural Convergence • Evolution and the role of software have blurred the boundary • Send/recv supported on SAS machines via buffers • Can construct a global address space on MP machines (global address = node number | local address; a sketch follows) • Page-based (or finer-grained) shared virtual memory • Hardware organization converging too • Tighter NI integration even for MP (low-latency, high-bandwidth) • Hardware SAS passes messages • Even clusters of workstations/SMPs are parallel systems • Emergence of fast system area networks (SAN) • Programming models distinct, but organizations converging • Nodes connected by general networks and communication assists • Implementations also converging, at least in high-end machines
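A sketch of that encoding (the 16/48-bit split and the helper names are arbitrary choices for illustration): the high bits of a global address name the node, the low bits the local address within it.

#include <stdint.h>
#include <stdio.h>

#define NODE_BITS 16

static inline uint64_t make_ga(uint64_t node, uint64_t local_addr) {
    return (node << (64 - NODE_BITS)) | local_addr;        /* GA = node | local */
}
static inline uint64_t ga_node(uint64_t ga)  { return ga >> (64 - NODE_BITS); }
static inline uint64_t ga_local(uint64_t ga) { return ga & ((1ULL << (64 - NODE_BITS)) - 1); }

int main(void) {
    uint64_t ga = make_ga(5, 0x1000);
    printf("node %llu, local 0x%llx\n",
           (unsigned long long)ga_node(ga), (unsigned long long)ga_local(ga));
    return 0;
}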
Convergence: Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network • Convergence allows lots of innovation, within framework • Integration of assist with node, what operations, how efficiently... • For example
Data Parallel Systems • Programming model • Operations performed in parallel on each element of data structure • Logically single thread of control, performs sequential or parallel steps • Conceptually, a processor associated with each data element • Architectural model • Array of many simple, cheap processors with little memory each • Processors don’t sequence through instructions • Attached to a control processor that issues instructions • Specialized and general communication, cheap global synchronization • Original motivations • Matches simple differential equation solvers • Centralize high cost of instruction fetch/sequencing
Data Parallel: code segment for computing π
#define N 100000
main()
{
    double local[N], temp[N], pi, w;
    long i;
A:  w = 1.0 / N;
B:  forall (i = 0; i < N; i++) {
P:      local[i] = (i + 0.5) * w;
Q:      temp[i] = 4.0 / (1.0 + local[i] * local[i]);
    }
C:  pi = sum(temp);
D:  printf("pi is %f\n", pi * w);
}
(forall and sum are data-parallel constructs, not standard C.)
Connection Machine (Tucker, IEEE Computer, Aug. 1988)
Evolution and Convergence • SIMD popular when the cost savings of a centralized sequencer were high • 1960s, when a CPU was a cabinet • Replaced by vector processors in the mid-70s • More flexible w.r.t. memory layout and easier to manage • Revived in the mid-80s when 32-bit datapath slices just fit on a chip • Simple, regular applications have good locality • Programming model converges with SPMD (single program multiple data) • need fast global synchronization • Structured global address space, implemented with either SAS or MP
CM-5 • Repackaged SparcStation • 4 per board • Fat-Tree network • Control network for global synchronization
Dataflow Architecture • [Figure: dataflow graph for a = (b+1)*(b-c), d = c*e, f = a*d, built from +, -, and * nodes] • Program is represented by a graph of essential data dependences • An instruction may execute whenever its data operands are available • The graph is spread arbitrarily over a collection of processors
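A minimal sketch of the firing rule for that graph (the node table, token routing, and input values are invented for illustration): each node fires as soon as both of its operand tokens have arrived, and its result token is forwarded to the consumer node.

#include <stdio.h>

typedef struct {
    char op;            /* '+', '-', or '*' */
    double operand[2];
    int present[2];     /* has a token arrived at this slot? */
    int dest_node;      /* consumer of the result (-1 = program output) */
    int dest_port;      /* operand slot at the consumer */
} Node;

void send_token(Node g[], int node, int port, double value) {
    g[node].operand[port] = value;
    g[node].present[port] = 1;
}

int main(void) {
    double b = 2.0, c = 3.0, e = 4.0, f = 0.0;
    /* nodes: 0: b+1   1: b-c   2: a=(b+1)*(b-c)   3: d=c*e   4: f=a*d */
    Node g[5] = {
        {'+', {0, 0}, {0, 0}, 2, 0},
        {'-', {0, 0}, {0, 0}, 2, 1},
        {'*', {0, 0}, {0, 0}, 4, 0},
        {'*', {0, 0}, {0, 0}, 4, 1},
        {'*', {0, 0}, {0, 0}, -1, 0},
    };
    /* initial tokens from the environment */
    send_token(g, 0, 0, b); send_token(g, 0, 1, 1.0);
    send_token(g, 1, 0, b); send_token(g, 1, 1, c);
    send_token(g, 3, 0, c); send_token(g, 3, 1, e);

    int done[5] = {0}, fired = 1;
    while (fired) {                       /* keep sweeping until nothing can fire */
        fired = 0;
        for (int i = 0; i < 5; i++) {
            if (done[i] || !g[i].present[0] || !g[i].present[1]) continue;
            double r = g[i].op == '+' ? g[i].operand[0] + g[i].operand[1]
                     : g[i].op == '-' ? g[i].operand[0] - g[i].operand[1]
                     :                  g[i].operand[0] * g[i].operand[1];
            if (g[i].dest_node >= 0) send_token(g, g[i].dest_node, g[i].dest_port, r);
            else f = r;                   /* final result token */
            done[i] = 1;
            fired = 1;
        }
    }
    printf("f = %g (expected %g)\n", f, (b + 1) * (b - c) * (c * e));
    return 0;
}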
Basic execution pipeline • [Figure: token queue -> waiting-matching (token store) -> instruction fetch (program store) -> execute -> form token -> network and back to the token queue] • Token: a message consisting of data and the tag of its destination node
Systolic Architecture • [Figure: linear systolic array for 1-D convolution; the weights w1..w4 stay in the cells while the x values stream through, and each cell computes xout = x, x = xin, yout = yin + w*xin] • A given algorithm is represented directly as a collection of specialized computational units connected in a regular, space-efficient pattern • Data move through the system at regular “heartbeats” • Example: Y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
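A sketch of the arithmetic the array performs (the weight and input values are illustrative, and only the per-cell data path is modeled, not the cycle-by-cycle movement): chaining the yout = yin + w*xin step across the four cells produces Y(i).

#include <stdio.h>

#define NW 4

/* one systolic cell's contribution to the running partial sum */
double cell(double yin, double w, double xin) { return yin + w * xin; }

int main(void) {
    double w[NW] = {1.0, 2.0, 3.0, 4.0};       /* w1..w4 held in the cells */
    double x[8]  = {1, 2, 3, 4, 5, 6, 7, 8};   /* x(1)..x(8) streaming in */
    for (int i = 0; i + NW <= 8; i++) {        /* Y(i) = w1*x(i) + ... + w4*x(i+3) */
        double y = 0.0;
        for (int k = 0; k < NW; k++)
            y = cell(y, w[k], x[i + k]);       /* pass the partial sum to the next cell */
        printf("Y(%d) = %g\n", i + 1, y);
    }
    return 0;
}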
Flynn's Taxonomy • Classifies machines by the number of instruction streams × the number of data streams: SISD, SIMD, MISD, MIMD
Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation
[Figure: generic architecture as the convergence point of systolic arrays, SIMD, message passing, dataflow, and shared memory]
Convergence: Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network • Convergence allows lots of innovation, within framework • Integration of assist with node, what operations, how efficiently... • For example