This lecture provides an overview of parallel computer architecture: communication architecture, the major programming models, and the general structure and fundamental issues of parallel machines. It contrasts abstraction with implementation, presents a layered view spanning parallel applications (CAD, databases, scientific modeling) down to the physical communication medium, and surveys representative machines.
Parallel Computer Architecture (并行计算机体系结构), Lecture 2. March 4, 2019. Wu Junmin (jmwu@ustc.edu.cn)
Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation
Parallel Computer Architecture • Two facets of Computer Architecture: • Defines critical abstractions • especially at the HW/SW boundary • the set of operations and the data types they operate on • Organizational structure that realizes these abstractions • Parallel Computer Arch. = Comp. Arch. + Communication Arch. • Comm. architecture has the same two facets • communication abstraction • primitives at the user/system and hw/sw boundaries
Layered Perspective of PCA • [Figure: layers from top to bottom] • Parallel applications: CAD, database, scientific modeling • Programming models: multiprogramming, shared address, message passing, data parallel • Compilation or library • Communication abstraction (user/system boundary) • Operating systems support • (hardware/software boundary) • Communication hardware • Physical communication medium
Communication Architecture User/System Interface + Organization • User/System Interface: • Comm. primitives exposed to user-level by hw and system-level sw • Implementation: • Organizational structures that implement the primitives: HW or OS • How optimized are they? How integrated into processing node? • Structure of network • Goals: • Performance • Broad applicability • Programmability • Scalability • Low Cost
Communication Abstraction • User-level communication primitives provided • Realizes the programming model • Mapping exists between language primitives of the programming model and these primitives • Supported directly by hw, or via OS, or via user sw • Much debate about what to support in sw and about the gap between layers • Today: • Hw/sw interface tends to be flat, i.e. complexity roughly uniform • Compilers and software play important roles as bridges • Technology trends exert strong influence • Result is convergence in organizational structure • Relatively simple, general-purpose communication primitives
Understanding Parallel Architecture • Traditional taxonomies not very useful • Programming models not enough, nor hardware structures • The same model can be supported by radically different architectures • Compilers, libraries, programs • Design of the user/system and hardware/software interfaces • Constrained from above by programming models and from below by technology • Guiding principles provided by layers • What primitives are provided at the communication abstraction • How programming models map to these • How they are mapped to hardware
Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation
History • [Figure: divergent architectures: application software and system software layered over systolic arrays, SIMD, message passing, dataflow, and shared memory architectures] • Parallel architectures tied closely to programming models • Divergent architectures, with no predictable pattern of growth • Mid-80s renaissance
Programming Model • Conceptualization of the machine that the programmer uses in coding applications • How parts cooperate and coordinate their activities • Specifies communication and synchronization operations • Multiprogramming • no communication or synchronization at the program level • Shared address space • like a bulletin board • Message passing • like letters or phone calls, explicit point to point • Data parallel: • more regimented, global actions on data • Several agents perform an action on separate elements of a data set simultaneously and then exchange information globally before continuing en masse • Implemented with shared address space or message passing
Shared Memory => Shared Addr. Space • Primitives map one to one onto the communication abstraction • Why is it attractive? • Ease of programming • Memory capacity increased by adding modules • I/O by adding controllers and devices • Add processors for processing! • For higher-throughput multiprogramming, or parallel programs
Shared Physical Memory • Any processor can directly reference any memory location • Any I/O controller can access any memory • Operating system can run on any processor, or all • OS uses shared memory to coordinate • Communication occurs implicitly as a result of loads and stores • What about application processes?
Shared Virtual Address Space • Process = address space plus thread(s) of control • Virtual-to-physical mappings can be established so that processes share portions of their address spaces • User-kernel or multiple processes • Multiple threads of control in one address space • Popular approach to structuring OS's • Now standard application capability (e.g. POSIX threads, Java threads) • Writes to shared addresses visible to other threads • Natural extension of the uniprocessor model • conventional memory operations (loads/stores) for communication • special atomic operations for synchronization • a minimal example follows
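A minimal sketch of this model with POSIX threads (the example, including the variable names, is illustrative and not from the lecture): one thread communicates a value through an ordinary shared variable, and the join supplies the synchronization that orders the store before the load.

#include <pthread.h>
#include <stdio.h>

int shared_x = 0;                          /* lives in the shared address space */

void *producer(void *arg) {
    shared_x = 42;                         /* conventional store communicates the value */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);                 /* synchronization: the store above is ordered before the load below */
    printf("consumer read %d\n", shared_x);  /* conventional load sees the other thread's write */
    return 0;
}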
[Figure: virtual address spaces of processes P0 … Pn map their shared portions to common physical addresses, so a store by Pn can be loaded by P0; the private portions (P private 0 … P private n) map to distinct physical memory] Structured Shared Address Space • Ad hoc parallelism used in system code • Most parallel applications have a structured SAS • Same program on each processor • shared variable X means the same thing to each thread
Shared Memory: code segment for computing π
#include <stdio.h>
#define N 100000
main()
{
    double local, pi = 0.0, w, temp;
    long i;
A:  w = 1.0 / N;
B:  #pragma parallel
    #pragma shared(pi, w)
    #pragma local(i, local, temp)
    {
        local = 0.0;
    #pragma pfor iterate(i = 0; N; 1)
        for (i = 0; i < N; i++) {
P:          temp = (i + 0.5) * w;
Q:          local = local + 4.0 / (1.0 + temp * temp);
        }
C:  #pragma critical
        pi = pi + local;
    }
D:  printf("pi is %f\n", pi * w);
}
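The pragmas above are SGI-style directives; as a point of comparison, a sketch of the same computation in OpenMP, their present-day standard counterpart (assuming any OpenMP-capable C compiler), might look like:

#include <stdio.h>
#include <omp.h>
#define N 100000

int main(void) {
    double pi = 0.0, w = 1.0 / N;
    #pragma omp parallel
    {
        double local = 0.0;                  /* private partial sum per thread */
        #pragma omp for
        for (long i = 0; i < N; i++) {
            double temp = (i + 0.5) * w;
            local += 4.0 / (1.0 + temp * temp);
        }
        #pragma omp critical                 /* serialize the update of the shared total */
        pi += local;
    }
    printf("pi is %f\n", pi * w);
    return 0;
}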
Historical Development • “Mainframe” approach • Motivated by multiprogramming • Extends the crossbar used for memory and I/O • Processor cost-limited => crossbar • Bandwidth scales with p • High incremental cost • use multistage networks instead (trading latency and bandwidth) • “Minicomputer” approach • Almost all microprocessor systems have a bus • Motivated by multiprogramming, transaction processing • Used heavily for parallel computing • Called symmetric multiprocessor (SMP) • Latency larger than for a uniprocessor • Bus is the bandwidth bottleneck • caching is key: coherence problem • Low incremental cost
Engineering: Intel Pentium Pro Quad • All coherence and multiprocessing glue in processor module • Highly integrated, targeted at high volume • Low latency and bandwidth
Engineering: SUN Enterprise • Processor + memory cards and I/O cards • 16 cards of either type • All memory accessed over the bus, so symmetric • Higher bandwidth, higher latency bus
Scaling Up • [Figure: “dance hall” organization (processors and caches on one side of the network, memories on the other) vs. distributed-memory organization (a memory module local to each processor node)] • Problem is the interconnect: cost (crossbar) or bandwidth (bus) • Dance-hall: bandwidth still scalable, but lower cost than crossbar • latencies to memory uniform, but uniformly large • Distributed memory or non-uniform memory access (NUMA) • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response); a toy sketch follows • Caching shared (particularly nonlocal) data?
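A toy sketch of the read-request/read-response idea (the message format and the two-node model are invented purely for illustration): a load to a remote address becomes a request message, and the home node answers with the data.

#include <stdio.h>
#include <stdint.h>

typedef struct { int type; int src; uint64_t addr; double data; } Msg;
enum { READ_REQ, READ_RESP };

double node_mem[2][4] = { {1, 2, 3, 4}, {10, 20, 30, 40} };  /* each node's local memory */

/* the home node services a request message */
Msg serve(int home, Msg req) {
    Msg resp = { READ_RESP, home, req.addr, node_mem[home][req.addr] };
    return resp;
}

/* a "remote load" issued by node `me` for address `addr` on node `home` */
double remote_load(int me, int home, uint64_t addr) {
    Msg req = { READ_REQ, me, addr, 0.0 };
    Msg resp = serve(home, req);           /* network delivery is abstracted away here */
    return resp.data;
}

int main(void) {
    printf("node 0 reads node 1, addr 2: %g\n", remote_load(0, 1, 2));
    return 0;
}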
Engineering: Cray T3E • Scale up to 1024 processors, 480MB/s links • Memory controller generates request message for non-local references • No hardware mechanism for coherence • SGI Origin etc. provide this
[Figure: generic architecture, nodes of processor (P), cache ($), and memory (M) connected by a scalable network, shown as the convergence point of systolic arrays, SIMD, message passing, dataflow, and shared memory]
Message Passing Architectures • [Figure: complete computers (P, $, M) connected by a general network] • Complete computer as building block, including I/O • Communication via explicit I/O operations • Programming model • direct access only to the private address space (local memory) • communication via explicit messages (send/receive) • High-level block diagram • Communication integration? • Mem, I/O, LAN, Cluster • Easier to build and scale than SAS • Programming model more removed from basic hardware operations • Library or OS intervention
Message-Passing Abstraction • [Figure: process P executes send X, Q, t (local address X), process Q executes receive Y, P, t (local address Y); the matched pair copies data from P's address space into Q's] • Sender specifies the buffer to be transmitted and the receiving process • Receiver specifies the sending process and the application storage to receive into • Memory-to-memory copy, but need to name processes • Optional tag on send and matching rule on receive • User process names local data and entities in process/tag space too • In the simplest form, the send/recv match achieves a pairwise synchronization event • Other variants too • Many overheads: copying, buffer management, protection
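A minimal MPI sketch of this matching rule (ranks, the tag value, and the buffer names X and Y are chosen only for illustration; run with at least two processes, e.g. mpirun -np 2): rank 0 plays process P and sends X with tag t, while rank 1 plays process Q and receives into Y by naming the sender and the same tag.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, t = 99;
    double X = 3.14, Y = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* "send X, Q, t": data X, destination rank 1, tag t */
        MPI_Send(&X, 1, MPI_DOUBLE, 1, t, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* "receive Y, P, t": storage Y, source rank 0, matching tag t */
        MPI_Recv(&Y, 1, MPI_DOUBLE, 0, t, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %f\n", Y);
    }
    MPI_Finalize();
    return 0;
}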
Message Passing: code segment for computing π
#include <stdio.h>
#include <mpi.h>
#define N 100000
main(int argc, char *argv[])
{
    double local = 0.0, pi, w, temp = 0.0;
    long i;
    int taskid, numtask;
A:  w = 1.0 / N;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtask);
B:  for (i = taskid; i < N; i = i + numtask) {
P:      temp = (i + 0.5) * w;
Q:      local = 4.0 / (1.0 + temp * temp) + local;
    }
C:  MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
D:  if (taskid == 0) printf("pi is %f\n", pi * w);
    MPI_Finalize();
}
Evolution of Message-Passing Machines • Early machines: FIFO on each link • HW close to the programming model • synchronous ops • topology central (hypercube algorithms) • CalTech Cosmic Cube (Seitz, CACM Jan. 1985)
Diminishing Role of Topology • Shift to general links • DMA, enabling non-blocking ops • Buffered by system at destination until recv • Store&forward routing • Diminishing role of topology • Any-to-any pipelined routing • node-network interface dominates communication time • Simplifies programming • Allows richer design space • grids vs hypercubes • Intel iPSC/1 -> iPSC/2 -> iPSC/860 • Cost comparison: store-and-forward H × (T0 + n/B) vs. pipelined (cut-through) T0 + H·Δ + n/B, for H hops, an n-byte message, link bandwidth B, startup overhead T0, and per-hop delay Δ (a numeric comparison follows)
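A small numeric illustration of the two expressions (all parameter values below are made up for illustration):

#include <stdio.h>

int main(void) {
    double T0 = 1.0e-6;     /* per-message startup overhead, seconds */
    double delta = 0.1e-6;  /* per-hop routing delay, seconds */
    double B = 100e6;       /* link bandwidth, bytes/second */
    double n = 4096;        /* message size, bytes */
    int H = 10;             /* number of hops */

    double store_fwd = H * (T0 + n / B);        /* H x (T0 + n/B) */
    double pipelined = T0 + H * delta + n / B;  /* T0 + H*delta + n/B */
    printf("store-and-forward: %g s, pipelined: %g s\n", store_fwd, pipelined);
    return 0;
}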
Building on the mainstream: IBM SP-2 • Made out of essentially complete RS6000 workstations • Network interface integrated in I/O bus (bw limited by I/O bus)
Berkeley NOW • 100 Sun Ultra2 workstations • Intelligent network interface • proc + mem • Myrinet network • 160 MB/s per link • 300 ns per hop
Toward Architectural Convergence • Evolution and the role of software have blurred the boundary • Send/recv supported on SAS machines via buffers • Can construct a global address space on MP machines (global address = node number | local address; a sketch follows) • Page-based (or finer-grained) shared virtual memory • Hardware organization converging too • Tighter NI integration even for MP (low-latency, high-bandwidth) • Hardware SAS passes messages • Even clusters of workstations/SMPs are parallel systems • Emergence of fast system area networks (SAN) • Programming models distinct, but organizations converging • Nodes connected by general networks and communication assists • Implementations also converging, at least in high-end machines
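A sketch of that encoding (the 16/48-bit split and the helper names are arbitrary choices for illustration): the high bits of a global address name the node, the low bits the local address within it.

#include <stdint.h>
#include <stdio.h>

#define NODE_BITS 16

static inline uint64_t make_ga(uint64_t node, uint64_t local_addr) {
    return (node << (64 - NODE_BITS)) | local_addr;        /* GA = node | local */
}
static inline uint64_t ga_node(uint64_t ga)  { return ga >> (64 - NODE_BITS); }
static inline uint64_t ga_local(uint64_t ga) { return ga & ((1ULL << (64 - NODE_BITS)) - 1); }

int main(void) {
    uint64_t ga = make_ga(5, 0x1000);
    printf("node %llu, local 0x%llx\n",
           (unsigned long long)ga_node(ga), (unsigned long long)ga_local(ga));
    return 0;
}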
Convergence: Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network • Convergence allows lots of innovation, within framework • Integration of assist with node, what operations, how efficiently... • For example
Data Parallel Systems • Programming model • Operations performed in parallel on each element of data structure • Logically single thread of control, performs sequential or parallel steps • Conceptually, a processor associated with each data element • Architectural model • Array of many simple, cheap processors with little memory each • Processors don’t sequence through instructions • Attached to a control processor that issues instructions • Specialized and general communication, cheap global synchronization • Original motivations • Matches simple differential equation solvers • Centralize high cost of instruction fetch/sequencing
Data Parallel: code segment for computing π
#define N 100000
main()
{
    double local[N], temp[N], pi, w;
    long i;
A:  w = 1.0 / N;
B:  forall (i = 0; i < N; i++) {
P:      local[i] = (i + 0.5) * w;
Q:      temp[i] = 4.0 / (1.0 + local[i] * local[i]);
    }
C:  pi = sum(temp);
D:  printf("pi is %f\n", pi * w);
}
(forall and sum are data-parallel constructs, not standard C.)
Connection Machine (Tucker, IEEE Computer, Aug. 1988)
Evolution and Convergence • SIMD popular when the cost savings of a centralized sequencer were high • 1960s, when a CPU was a cabinet • Replaced by vector processors in the mid-70s • More flexible w.r.t. memory layout and easier to manage • Revived in the mid-80s when 32-bit datapath slices just fit on a chip • Simple, regular applications have good locality • Programming model converges with SPMD (single program multiple data) • need fast global synchronization • Structured global address space, implemented with either SAS or MP
CM-5 • Repackaged SparcStation • 4 per board • Fat-Tree network • Control network for global synchronization
Dataflow Architecture • [Figure: dataflow graph for a = (b+1)*(b-c), d = c*e, f = a*d, built from +, -, and * nodes] • Program is represented by a graph of essential data dependences • An instruction may execute whenever its data operands are available • The graph is spread arbitrarily over a collection of processors
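A minimal sketch of the firing rule for that graph (the node table, token routing, and input values are invented for illustration): each node fires as soon as both of its operand tokens have arrived, and its result token is forwarded to the consumer node.

#include <stdio.h>

typedef struct {
    char op;            /* '+', '-', or '*' */
    double operand[2];
    int present[2];     /* has a token arrived at this slot? */
    int dest_node;      /* consumer of the result (-1 = program output) */
    int dest_port;      /* operand slot at the consumer */
} Node;

void send_token(Node g[], int node, int port, double value) {
    g[node].operand[port] = value;
    g[node].present[port] = 1;
}

int main(void) {
    double b = 2.0, c = 3.0, e = 4.0, f = 0.0;
    /* nodes: 0: b+1   1: b-c   2: a=(b+1)*(b-c)   3: d=c*e   4: f=a*d */
    Node g[5] = {
        {'+', {0, 0}, {0, 0}, 2, 0},
        {'-', {0, 0}, {0, 0}, 2, 1},
        {'*', {0, 0}, {0, 0}, 4, 0},
        {'*', {0, 0}, {0, 0}, 4, 1},
        {'*', {0, 0}, {0, 0}, -1, 0},
    };
    /* initial tokens from the environment */
    send_token(g, 0, 0, b); send_token(g, 0, 1, 1.0);
    send_token(g, 1, 0, b); send_token(g, 1, 1, c);
    send_token(g, 3, 0, c); send_token(g, 3, 1, e);

    int done[5] = {0}, fired = 1;
    while (fired) {                       /* keep sweeping until nothing can fire */
        fired = 0;
        for (int i = 0; i < 5; i++) {
            if (done[i] || !g[i].present[0] || !g[i].present[1]) continue;
            double r = g[i].op == '+' ? g[i].operand[0] + g[i].operand[1]
                     : g[i].op == '-' ? g[i].operand[0] - g[i].operand[1]
                     :                  g[i].operand[0] * g[i].operand[1];
            if (g[i].dest_node >= 0) send_token(g, g[i].dest_node, g[i].dest_port, r);
            else f = r;                   /* final result token */
            done[i] = 1;
            fired = 1;
        }
    }
    printf("f = %g (expected %g)\n", f, (b + 1) * (b - c) * (c * e));
    return 0;
}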
Basic execution pipeline • [Figure: token queue -> waiting-matching (token store) -> instruction fetch (program store) -> execute -> form token -> network and back to the token queue] • Token: a message consisting of data and the tag of its destination node
Systolic Architecture • [Figure: linear systolic array for 1-D convolution; the weights w1..w4 stay in the cells while the x values stream through, and each cell computes xout = x, x = xin, yout = yin + w*xin] • A given algorithm is represented directly as a collection of specialized computational units connected in a regular, space-efficient pattern • Data move through the system at regular “heartbeats” • Example: Y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)
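A sketch of the arithmetic the array performs (the weight and input values are illustrative, and only the per-cell data path is modeled, not the cycle-by-cycle movement): chaining the yout = yin + w*xin step across the four cells produces Y(i).

#include <stdio.h>

#define NW 4

/* one systolic cell's contribution to the running partial sum */
double cell(double yin, double w, double xin) { return yin + w * xin; }

int main(void) {
    double w[NW] = {1.0, 2.0, 3.0, 4.0};       /* w1..w4 held in the cells */
    double x[8]  = {1, 2, 3, 4, 5, 6, 7, 8};   /* x(1)..x(8) streaming in */
    for (int i = 0; i + NW <= 8; i++) {        /* Y(i) = w1*x(i) + ... + w4*x(i+3) */
        double y = 0.0;
        for (int k = 0; k < NW; k++)
            y = cell(y, w[k], x[i + k]);       /* pass the partial sum to the next cell */
        printf("Y(%d) = %g\n", i + 1, y);
    }
    return 0;
}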
Flynn's Taxonomy • Classifies machines by the number of instruction streams × the number of data streams: SISD, SIMD, MISD, MIMD
Overview • Understanding Communication Architecture • Major programming models • General structure and fundamental issues • Abstraction vs. Implementation
[Figure: generic architecture as the convergence point of systolic arrays, SIMD, message passing, dataflow, and shared memory]
Convergence: Generic Parallel Architecture • Node: processor(s), memory system, plus communication assist • Network interface and communication controller • Scalable network • Convergence allows lots of innovation, within framework • Integration of assist with node, what operations, how efficiently... • For example