Computer Architecture Guidance
Keio University, AMANO, Hideharu, hunga@am.ics.keio.ac.jp
Contents
Techniques of two key architectures for future system LSIs:
• Parallel Architectures
• Reconfigurable Architectures
• Advanced uni-processor architecture → Special Course of Microprocessors (by Prof. Yamasaki, Fall term)
Class
• Lecture using PowerPoint (70 mins.)
• The ppt file is uploaded to the web site http://www.am.ics.keio.ac.jp, and you can download/print it before the lecture.
• When the file is uploaded, a notification message is sent to you by e-mail.
• Textbook: “Parallel Computers” by H. Amano (Sho-ko-do)
• Exercise (20 mins.): a simple design or calculation on design issues
Evaluation • Exercise on Parallel Programming using SCore on RHiNET-2 (50%) • Exercise after every lecture (50%)
Computer Architecture 1: Introduction to Parallel Architectures
Keio University, AMANO, Hideharu, hunga@am.ics.keio.ac.jp
Parallel Architecture A parallel architecture consists of multiple processing units which work simultaneously. • Purposes • Classifications • Terms • Trends
Boundary between Parallel Machines and Uniprocessors
• Uniprocessors: ILP (Instruction Level Parallelism) — a single program counter; parallelism inside/between instructions
• Parallel machines: TLP (Thread Level Parallelism) — multiple program counters; parallelism between processes and jobs
This definition follows Hennessy & Patterson’s Computer Architecture: A Quantitative Approach.
Performance Improvement vs. Tight Coupling
Two directions of development converge on on-chip implementation:
• Increasing the number of simultaneously issued instructions: single pipeline → multiple instruction issue → multiple thread execution → on-chip implementation
• Tighter coupling of multiple processors: connecting multiple processors → shared memory, shared registers → on-chip implementation
Purposes of providing multiple processors
• Performance: a job can be executed quickly with multiple processors
• Fault tolerance: if a processing unit is damaged, the total system remains available (redundant systems)
• Resource sharing: multiple jobs share memory and/or I/O modules for cost-effective processing (distributed systems)
• Low power: high performance with low-frequency operation
Parallel architecture: performance centric!
Flynn’s Classification
Based on the number of instruction streams, M (Multiple) / S (Single), and the number of data streams, M/S:
• SISD: uniprocessors (including superscalar and VLIW)
• MISD: no practical machine exists (analog computers are sometimes cited)
• SIMD
• MIMD
SIMD (Single Instruction Stream, Multiple Data Streams)
• All processing units execute the same instruction
• Low degree of flexibility
• Illiac-IV/MMX type (coarse grain)
• CM-2 type (fine grain)
[diagram: a single instruction memory broadcasts each instruction to many processing units, each with its own data memory]
Two types of SIMD
• Coarse grain: each node performs floating-point numerical operations
  • ILLIAC-IV, BSP, GF-11
  • Multimedia instructions in recent high-end CPUs
  • Dedicated on-chip approach: NEC’s IMEP
• Fine grain: each node performs only operations of a few bits
  • ICL DAP, CM-2, MP-2
  • Image/signal processing
  • Connection Machines extended the application area to artificial intelligence (CmLisp)
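As a concrete illustration of the multimedia-instruction style of coarse-grain SIMD mentioned above, here is a minimal sketch in C using x86 SSE intrinsics (an assumption chosen for illustration; none of the machines above are x86). A single `_mm_add_ps` instruction adds four floating-point lanes at once:

```c
/* Minimal sketch of coarse-grain SIMD via multimedia instructions.
   Illustrative only; compile with: gcc -msse simd_add.c */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE: 128-bit registers holding 4 floats */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction adds all 4 lanes */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);       /* 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}
```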
A processing unit of CM-2
[diagram: a 1-bit serial ALU with flags A, B, F, OP, C, a context flag, and 256-bit memory per PE]
Element of CM-2
[diagram: one LSI chip holds a 4×4 processor array (16 PEs) with 256-bit × 16 PE RAM and a router with 12 links; 4096 chips are connected in a hypercube, giving 64K PEs]
The future of SIMD
• Coarse grain SIMD
  • A large-scale supercomputer like Illiac-IV/GF-11 will not revive.
  • Multimedia instructions will continue to be used.
  • Special-purpose on-chip systems will become popular.
• Fine grain SIMD
  • Advantageous for specific applications like image processing
  • General-purpose machines are difficult to build (cf. CM-2 → CM-5)
MIMD
• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible
[diagram: processors connected through interconnection networks to memory modules holding instructions and data]
Classification of MIMD machines: structure of shared memory
• UMA (Uniform Memory Access model) provides shared memory which can be accessed from all processors in the same manner.
• NUMA (Non-Uniform Memory Access model) provides shared memory, but it is not uniformly accessed.
• NORA/NORMA (No Remote Memory Access model) provides no shared memory; communication is done with message passing.
UMA
• The simplest structure of shared memory machine
• The natural extension of uniprocessors
• An OS extended from a single-processor OS can be used.
• Programming is easy (see the sketch below).
• System size is limited.
• Bus connected or switch connected
• A total system can be implemented on a single chip: on-chip multiprocessor / chip multiprocessor / single-chip multiprocessor
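To show why programming a UMA machine is easy, here is a minimal shared-memory sketch using POSIX threads (an illustrative assumption; it is not the lecture’s exercise environment). Every thread reads the same global array directly, with no explicit communication:

```c
/* Minimal sketch of the UMA programming model: all threads see one
   shared memory, so they communicate through ordinary global arrays.
   Compile with: gcc uma_sum.c -pthread */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];               /* shared: visible to all threads */
static double partial[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NTHREADS;
    double sum = 0.0;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        sum += data[i];
    partial[id] = sum;               /* each thread writes its own slot */
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(th[t], NULL);
        total += partial[t];
    }
    printf("sum = %.0f\n", total);   /* 1000000 */
    return 0;
}
```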
An example of UMA: bus connected
[diagram: PUs, each with a snoop cache, attached to a shared bus together with the main memory]
SMP (Symmetric MultiProcessor), on-chip multiprocessor
Switch connected UMA
[diagram: CPUs with local memories attached through interfaces to a switch, which connects to the main memory]
The gap between switch and bus is becoming small.
NUMA
• Each processor has a local memory and accesses other processors’ memories through the network.
• Address translation and cache control often make the hardware structure complicated.
• Scalable: programs for UMA can run without modification, and performance improves with system size.
• Competitive with WS/PC clusters running software DSM.
Typical structure of NUMA
[diagram: nodes 0–3 connected by an interconnection network; the logical address space is divided into regions 0–3, one per node]
Classification of NUMA
• Simple NUMA: remote memory is not cached; simple structure, but the access cost of remote memory is large.
• CC-NUMA (Cache-Coherent NUMA): cache consistency is maintained by hardware; the structure tends to be complicated.
• COMA (Cache Only Memory Architecture): no home memory; data migrates to the node that uses it; complicated control mechanism.
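The following toy sketch illustrates the partitioned logical address space of a simple NUMA shown two slides above. The four-node layout and the contiguous per-node mapping are assumptions made for illustration, not any real machine’s scheme:

```c
/* Toy sketch of a simple NUMA's partitioned logical address space
   (cf. the "Typical structure of NUMA" figure). Assumption: the space
   is divided into four equal contiguous regions, one per node. */
#include <stdio.h>
#include <stdint.h>

#define NODES       4
#define NODE_MEM_SZ 0x40000000ULL   /* 1 GiB of local memory per node (assumed) */

typedef struct { int node; uint64_t offset; } location_t;

/* Translate a global (logical) address into (home node, local offset). */
static location_t translate(uint64_t addr) {
    location_t loc;
    loc.node   = (int)(addr / NODE_MEM_SZ);
    loc.offset = addr % NODE_MEM_SZ;
    return loc;
}

int main(void) {
    uint64_t addr = 0x80000000ULL + 0x1234;  /* falls in node 2's region */
    location_t loc = translate(addr);
    printf("addr 0x%llx -> node %d, offset 0x%llx\n",
           (unsigned long long)addr, loc.node,
           (unsigned long long)loc.offset);
    /* An access to this address from node 0 is a remote access; in a
       simple NUMA it is not cached and is therefore expensive. */
    return 0;
}
```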
Cray’s T3D: A simple NUMA supercomputer (1993) • Using Alpha 21064
SGI Origin: bristled hypercube
[diagram: main memory connected directly to the Hub chip, which attaches to the network]
One cluster consists of 2 PEs.
SGI’s CC-NUMA Origin 3000 (2000)
• Using R12000
DDM (Data Diffusion Machine): an example of COMA
[diagram: hierarchical structure; details not recoverable]
NORA/NORMA
• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance
• Hard to program
• Inter-PU communication is the key issue in cluster computing
• The fastest machine is almost always a NORA (The Earth Simulator is the exception)
Fujitsu’s NORA: AP1000 (1990)
• Mesh connection
• SPARC
Intel’s Paragon XP/S (1991)
• Mesh connection
• i860
PC Cluster
• Beowulf cluster (NASA’s Beowulf Project, 1994, led by Sterling)
  • Commodity components
  • TCP/IP
  • Free software
• Others
  • Commodity components
  • High-performance networks like Myrinet / Infiniband
  • Dedicated software
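As a taste of message-passing programming on a NORA machine, here is a minimal sketch using MPI, the de facto standard library on Beowulf-style clusters (the exercises in this class use SCore on RHiNET-2 instead; this example is only illustrative):

```c
/* Minimal message-passing sketch: with no shared memory, rank 0 must
   explicitly send data that rank 1 explicitly receives.
   Typical build/run: mpicc ping.c -o ping && mpirun -np 2 ./ping */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```

Contrast this with the UMA sketch earlier: here every data exchange is an explicit send/receive pair, which is why NORA machines are simple to build but hard to program.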
Terms (1)
• Multiprocessors: MIMD machines with shared memory
• Strict definition (by Enslow Jr.): shared memory, shared I/O, distributed OS, homogeneous
• Extended definition: all parallel machines (wrong usage)
Terms (2)
• Multicomputer: MIMD machines without shared memory, i.e., NORA/NORMA
• Array computer: a machine consisting of an array of processing elements, i.e., SIMD; also misused for a supercomputer for array calculation (wrong usage)
• Loosely coupled / tightly coupled: loosely coupled = NORA, tightly coupled = UMA. But are NORAs really loosely coupled?? Don’t use these terms if possible.
Classification
• Stored programming based
  • SIMD: fine grain / coarse grain
  • MIMD
    • Multiprocessors: UMA (bus connected, switch connected), NUMA (simple NUMA, CC-NUMA, COMA)
    • Multicomputers: NORA
• Others: systolic architecture, data flow architecture, mixed control, demand driven architecture
Systolic Architecture
[diagram: data streams x and y flow through a computational array]
Data streams are inserted into an array of special-purpose computing nodes in a certain rhythm. Introduced further in the Reconfigurable Architectures lectures.
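The rhythm of a systolic array can be imitated in software. The following toy simulation is a sketch only: the 3-tap FIR filter, weights, and sizes are assumptions for illustration. An input stream flows through an array of cells, one cell per beat, and each cell holds a fixed weight:

```c
/* Toy simulation of a 1-D systolic array computing a 3-tap FIR filter:
   y[n] = w0*x[n] + w1*x[n-1] + w2*x[n-2].
   Each cell holds one fixed weight; the input stream x advances one
   cell per "beat", and each cell contributes its product to the sum. */
#include <stdio.h>

#define TAPS 3
#define LEN  8

int main(void) {
    double w[TAPS] = {0.5, 0.3, 0.2};   /* weights resident in the cells */
    double x[LEN]  = {1, 2, 3, 4, 5, 6, 7, 8};
    double cell_x[TAPS] = {0};          /* x value currently held in each cell */

    for (int n = 0; n < LEN; n++) {
        /* one beat: shift the input stream one cell to the right */
        for (int k = TAPS - 1; k > 0; k--)
            cell_x[k] = cell_x[k - 1];
        cell_x[0] = x[n];

        /* each cell multiplies its weight by its current x value */
        double y = 0.0;
        for (int k = 0; k < TAPS; k++)
            y += w[k] * cell_x[k];
        printf("y[%d] = %.2f\n", n, y);
    }
    return 0;
}
```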
Data flow machines
A process is driven by the data.
[diagram: dataflow graph with two + nodes and two × nodes computing (a+b)×(c+(d×e))]
Also introduced further in Reconfigurable Systems.
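Here is a toy sketch of the data-driven firing rule: each node fires as soon as both of its operand tokens have arrived, with no program counter imposing an order. The graph is the slide’s example, (a+b)×(c+(d×e)); the token-passing helper is purely hypothetical:

```c
/* Toy sketch of data-driven execution: a node "fires" (computes) only
   when both operand tokens are present, regardless of program order. */
#include <stdio.h>

typedef struct {
    double val[2];
    int    arrived;   /* how many operand tokens have arrived so far */
    char   op;        /* '+' or '*' */
} node_t;

/* Deliver one operand token; fire the node when both are present. */
static int send_token(node_t *n, double v, double *result) {
    n->val[n->arrived++] = v;
    if (n->arrived < 2) return 0;                  /* still waiting */
    *result = (n->op == '+') ? n->val[0] + n->val[1]
                             : n->val[0] * n->val[1];
    return 1;                                      /* fired */
}

int main(void) {
    double a = 1, b = 2, c = 3, d = 4, e = 5, t, u, y;
    node_t mul1 = {.op = '*'}, add1 = {.op = '+'},
           add2 = {.op = '+'}, mul2 = {.op = '*'};

    send_token(&mul1, d, &t); send_token(&mul1, e, &t);  /* d*e  = 20 */
    send_token(&add1, c, &u); send_token(&add1, t, &u);  /* c+20 = 23 */
    send_token(&add2, a, &t); send_token(&add2, b, &t);  /* a+b  = 3  */
    send_token(&mul2, t, &y); send_token(&mul2, u, &y);  /* 3*23 = 69 */
    printf("(a+b)*(c+(d*e)) = %.0f\n", y);
    return 0;
}
```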
Exercise 1
• In this class, a PC cluster, RHiNET, is used for the parallel programming exercises. Is RHiNET classified as a Beowulf cluster? Why or why not?
• If you take this class, send your answer with your name and student number to hunga@am.ics.keio.ac.jp