Signalling in the Heterogeneous Architecture Multiprocessor Paradigm Antonio Núñez, Victor Reyes, Tomás Bautista Keynote IUMA, Institute for Applied Microelectronics, ULPGC A. Nunez
Index • MPSoC Architectures -> Hetero MPSoC • Communication Architectures -> Split Transport and Signalling Networks • Previous and Related work • Our SystemC Based Modelling Approach • Experiments • Conclusions A. Nunez
Technological Forecasts • Moore's Law: the number of transistors per chip doubles every two years • ITRS: GALS, NoC, SoC, MPSoC A. Nunez
Processor to DRAM Performance Gap [Figure: performance vs. time, 1980–2000, log scale: CPU ("Moore's Law") improves ~60%/year, DRAM ~7%/year; the processor-memory performance gap grows ~50%/year] A. Nunez
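The ~50%/year gap is just the two quoted rates compounded against each other; a few lines of plain C++ make the arithmetic explicit (the 1980 starting point and normalisation to 1.0 are illustrative assumptions, not data from the figure):

```cpp
#include <cstdio>

// Compound the per-year improvement rates quoted on the slide:
// CPU performance grows ~60%/year, DRAM ~7%/year, so the relative
// gap grows by roughly 1.60 / 1.07 - 1 ≈ 50% per year.
int main() {
    double cpu = 1.0, dram = 1.0;
    for (int year = 1980; year <= 2000; ++year) {
        printf("%d  CPU %9.1f  DRAM %5.2f  gap %8.1fx\n",
               year, cpu, dram, cpu / dram);
        cpu  *= 1.60;   // processor performance per year
        dram *= 1.07;   // DRAM performance per year
    }
    return 0;
}
```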
Logic to Memory Area Gap A. Nunez
Logic to Productivity Gap A. Nunez
-> Platform based design -> Communication architectures A. Nunez
Index • MPSoC Architectures -> Hetero MPSoC • Communication Architectures -> Split Transport and Signalling Networks • Previous and Related work • Our SystemC Based Modelling Approach • Experiments • Conclusions A. Nunez
Processor Architecture Paradigms (cfr. Ungerer et al., Patterson et al., Tenhunnen et al., Computer special issue) • Processor/Memory/Switch: processor-, memory-, or communications-dominated systems; communications architecture • Processor-Mono: speed-up of a single-threaded application (Patt, Sohi…): advanced superscalar, trace cache, superspeculative, multiscalar processors • Processor-Multi: speed-up of multi-threaded applications: simultaneous multithreading (SMT), chip multiprocessors (CMPs) • Memory, Processor-in-Memory, IRAM, others (Patterson) • Network on Chip: homo and hetero, many variants (Mihal, Tenhunnen, Goosens) A. Nunez
Monoprocessor: Superflow Processor • Fine granularity, at the data-word level • The Superflow processor speculates on: • instruction flow: two-phase branch predictor combined with a trace cache • register data flow: source operand value prediction, constant value prediction, value stride prediction (speculate on constant, incremental increases in operand values), and dependence prediction (predict register value dependences between instructions) • memory data flow: prediction of load values and load addresses, and alias prediction A. Nunez
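The stride-prediction idea above can be pictured in a few lines of plain C++. This is an illustrative predictor only, not the Superflow design; the PC-indexed table and update policy are assumptions:

```cpp
#include <cstdio>
#include <cstdint>
#include <unordered_map>

// Minimal per-instruction stride value predictor (illustrative only):
// remember the last produced value and the last observed stride,
// then predict last + stride for the next instance of the same PC.
struct StrideEntry { int64_t last = 0; int64_t stride = 0; bool valid = false; };

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table;  // keyed by instruction PC
public:
    int64_t predict(uint64_t pc) {
        const StrideEntry& e = table[pc];
        return e.valid ? e.last + e.stride : 0;
    }
    // Train with the actually produced value once the instruction retires.
    void update(uint64_t pc, int64_t value) {
        StrideEntry& e = table[pc];
        if (e.valid) e.stride = value - e.last;
        e.last = value;
        e.valid = true;
    }
};

int main() {
    StridePredictor p;
    for (int64_t v : {100, 104, 108, 112}) {   // a constant-stride value sequence
        printf("predicted %lld, actual %lld\n",
               static_cast<long long>(p.predict(0x400be0)),
               static_cast<long long>(v));
        p.update(0x400be0, v);
    }
    return 0;
}
```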
Com-arch in Superflow Processor A. Nunez
Multiscalar Processors • A program is represented as a control flow graph (CFG), where basic blocks are nodes, and arcs represent flow of control. • A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task. • The tasks are distributed to a number of parallel PEs within a processor. • Each PE fetches and executes instructions belonging to its assigned task. • The primary constraint: it must preserve the sequential program semantics. A. Nunez
[Figure: Multiscalar mode of execution: tasks A, B, D and E from the CFG are assigned to PEs 0–3, with data values forwarded between the PEs] A. Nunez
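A rough sketch of the task-distribution step in plain C++ (the task type and the round-robin assignment are assumptions for illustration; a real multiscalar sequencer also forwards register values and squashes misspeculated tasks):

```cpp
#include <cstdio>
#include <vector>
#include <string>

// Illustrative only: assign CFG tasks to parallel PEs in program order,
// the way the multiscalar sequencer walks the CFG in task-sized steps
// without inspecting the instructions inside a task.
struct Task { std::string name; };

int main() {
    const int num_pes = 4;
    std::vector<Task> tasks = {{"A"}, {"B"}, {"D"}, {"E"}};  // one speculative CFG path
    for (size_t i = 0; i < tasks.size(); ++i)
        printf("Task %s -> PE %zu\n", tasks[i].name.c_str(), i % num_pes);
    return 0;
}
```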
Com-arch in Multiscalar processor A. Nunez
Multiscalar, Trace and Speculative Multithreaded Processors • Multiscalar: A program is statically partitioned into tasks which are marked by annotations of the CFG. • Trace Processor: Tasks are generated from traces of the trace cache. • Speculative multithreading: Tasks are otherwise dynamically constructed. • Common target: increase single-thread program performance by dynamically exploiting thread-level speculation in addition to instruction-level parallelism. • Here a "thread" means a "HW thread" A. Nunez
Multis: Exploiting Additional, More Coarse-Grained Parallelism • CMPs (chip multiprocessors, or multiprocessor chips): • integrate two or more complete processors on a single chip, • every functional unit of a processor is duplicated. • SMTs (simultaneous multithreaded processors): • store multiple contexts in different register sets on the chip, • the functional units are multiplexed between the threads, • instructions of different contexts are executed simultaneously. A. Nunez
CMPs-Homo: Com-arch by shared global memory [Figure: four processors accessing a shared global memory directly; shared global memory, no caches] A. Nunez
CMPs-Homo: Com-arch by shared primary cache [Figure: four processors sharing a primary cache, backed by a secondary cache and global memory] A. Nunez
CMPs-Homo: Com-arch by global memory and caches [Figure: processors with private primary caches and shared secondary caches over a shared global memory; variants: shared caches and memory, shared secondary cache] A. Nunez
Com-arch in Hydra: A Single-Chip Multiprocessor [Figure: four CPUs, each with primary I- and D-caches and a memory controller, connected by centralized bus arbitration mechanisms to an on-chip secondary cache (SRAM array), Rambus memory interface to off-chip DRAM main memory, I/O bus interface and DMA, all on a single chip] A. Nunez
CMPs-Hetero: Communications Architecture • Architectures found in today's heterogeneous processors for platform-based design • E.g. CPU cores, AMBA buses, internal/external shared memories [Figure: RISC core, engines and external I/O attached to an AMBA bus and a shared bus, with internal/external memory] A. Nunez
Multithreaded Processors • Aim: Latency tolerance • What is the problem? Load access latencies measured on an Alpha Server 4100 SMP with four Alpha 21164 processors are: • 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor, • 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache, • 80 cycles for a miss that is served by the memory, and • 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory. A. Nunez
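For simulation purposes those measured numbers can be folded into a simple latency model; the enum names below are ours, the cycle counts are the ones quoted on the slide:

```cpp
#include <cstdio>

// Simple memory-hierarchy latency model using the Alpha Server 4100
// numbers quoted above (cycles as seen by the requesting processor).
enum class MissLevel { L2Hit, L3Hit, Memory, DirtyRemote };

int load_latency_cycles(MissLevel level) {
    switch (level) {
        case MissLevel::L2Hit:       return 7;    // primary miss, hits on-chip L2
        case MissLevel::L3Hit:       return 21;   // L2 miss, hits board-level L3
        case MissLevel::Memory:      return 80;   // served by main memory
        case MissLevel::DirtyRemote: return 125;  // dirty line in another CPU's cache
    }
    return 0;
}

int main() {
    printf("dirty remote miss costs %d cycles\n",
           load_latency_cycles(MissLevel::DirtyRemote));
    return 0;
}
```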
Multithreading • Multithreading • The ability to pursue two or more threads of control in parallel within a processor pipeline. • Advantage: The latencies that arise in the computation of a single instruction stream are filled by computations of another thread. • Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors. A. Nunez
Approaches to Multithreaded Processors • Cycle-by-cycle interleaving: • An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle. • Block interleaving: • The instructions of a thread are executed successively until an event occurs that may cause latency; this event induces a context switch. • Simultaneous multithreading (SMT): • Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor, • combining wide superscalar instruction issue with multithreading. A. Nunez
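The difference between the first two policies comes down to when the fetch stage switches threads. A minimal plain-C++ sketch of the two policies (the thread representation and the faked latency event are assumptions for illustration):

```cpp
#include <cstdio>
#include <vector>

// Illustrative fetch-slot allocation for two interleaving policies.
// Cycle-by-cycle: rotate threads every cycle.
// Block: stay on one thread until it reports a long-latency event.
struct Thread { int id; bool long_latency_event; };

int main() {
    std::vector<Thread> threads = {{0, false}, {1, false}, {2, false}};
    int current = 0;
    for (int cycle = 0; cycle < 6; ++cycle) {
        // Cycle-by-cycle interleaving: a different thread every cycle.
        int cc_thread = cycle % static_cast<int>(threads.size());
        // Block interleaving: switch only on a latency-causing event
        // (here we fake such an event on cycle 3).
        threads[current].long_latency_event = (cycle == 3);
        if (threads[current].long_latency_event)
            current = (current + 1) % static_cast<int>(threads.size());
        printf("cycle %d: cycle-by-cycle fetches T%d, block fetches T%d\n",
               cycle, cc_thread, current);
    }
    return 0;
}
```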
[Figure: Multithreading versus non-multithreading approaches, time in processor cycles: (a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, (c) block interleaving multithreaded scalar with context switches] A. Nunez
[Figure: Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP), issue slots vs. time in processor cycles: (a) SMT, (b) CMP] A. Nunez
Combining SMT and Multimedia • Start with a wide-issue superscalar general-purpose processor • Enhance by simultaneous multithreading • Enhance by multimedia unit(s) • Enhance by on-chip RAM memory for constants and local variables A. Nunez
The SMT Multimedia Processor A. Nunez
IPC of Maximum Processor Models A. Nunez
Combining CMP-hetero and Multimedia • Start with a general-purpose processor • Enhance by hierarchical-bus com-arch • Enhance by hardware accelerators and coprocessors including multimedia unit(s) • Enhance by on-chip RAM memories for constants, local variables, frames… A. Nunez
Real implementation example: Philips Eclipse architecture instance for video coding A. Nunez
CMP or SMT? • The performance race between SMT and CMP is not yet decided. • CMP is easier to implement, but only SMT has the ability to hide latencies. • A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue. • A separation of the thread queues is a possible solution, although it does not remove the central instruction issue. • A combination of simultaneous multithreading with CMP may be superior. • Research: combine the SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread • thread-level speculation • close to multiscalar A. Nunez
Processor-in-Memory • Technological trends have produced a large and growing gap between processor speed and DRAM access latency. • Today, it takes dozens of cycles for data to travel between the CPU and main memory. • The CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines. • Much of this complexity is devoted to hiding memory access latency. • Memory wall: the phenomenon that access times increasingly limit system performance. • Memory-centric design is envisioned for the future A. Nunez
PIM or Intelligent RAM (IRAM) • PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks. • PIM or IRAM merge processor and memory into a single chip. • Advantages: • The processor-DRAM gap in access speed keeps growing; PIM provides higher bandwidth and lower latency for (on-chip) memory accesses. • DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches. • On-chip memory may be treated as main memory, in contrast to a cache, which is just a redundant memory copy. • PIM decreases energy consumption in the memory system due to the reduction of off-chip accesses. • VIRAM, CODE A. Nunez
V-IRAM-2: 0.13 µm, fast logic, 1 GHz; 16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB [Figure: 2-way superscalar vector processor with 8K I-cache and 8K D-cache, vector instruction queue, vector registers and load/store units configurable as 8 x 64, 16 x 32 or 32 x 16 lanes, serial I/O, and a memory crossbar switch to the on-chip DRAM banks] A. Nunez
NoC Processor Architecture • Network-on-chip, specialized PEs, advanced interconnect technologies • Will use packet network architectures in 2010 [Figure: PEs, DSP, PE array, on-chip memory, external memory, external I/O and a packet network controller connected through switch nodes] A. Nunez
Mescal Communication Architecture: General Paradigm • The Mescal Communication Architecture is a general, coarse-grained on-chip interconnection scheme for various system components such as processing elements, memory and other communicating elements. [Figure: processing elements with caches ($) and memories connected through switches and a bridge to the NoC] A. Nunez
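A minimal plain-C++ sketch of the packet/switch-node idea behind such an interconnect (the packet fields and the per-node routing table are our assumptions, not the Mescal interfaces):

```cpp
#include <cstdio>
#include <map>
#include <utility>

// Illustrative packet-switched hop: each switch node forwards a packet
// toward the destination PE through a locally chosen output port.
struct Packet { int src_pe; int dst_pe; unsigned payload; };

class SwitchNode {
    int id_;
    std::map<int, int> route_;   // destination PE -> output port
public:
    SwitchNode(int id, std::map<int, int> route) : id_(id), route_(std::move(route)) {}
    int forward(const Packet& p) const {
        auto it = route_.find(p.dst_pe);
        int port = (it != route_.end()) ? it->second : 0;  // port 0 as default route
        printf("switch %d: packet %d -> %d leaves on port %d\n",
               id_, p.src_pe, p.dst_pe, port);
        return port;
    }
};

int main() {
    SwitchNode s(0, {{1, 2}, {2, 3}});   // PE 1 via port 2, PE 2 via port 3
    s.forward({0, 1, 0xCAFE});
    s.forward({0, 2, 0xBEEF});
    return 0;
}
```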
NoC Communication Architecture A. Nunez
NoC: Example for a bus A. Nunez
Index • MPSoC Architectures -> Hetero MPSoC • Communication Architectures -> Split Transport and Signalling Networks • Previous and Related work • Our SystemC Based Modelling Approach • Experiments • Conclusions A. Nunez
Today's Communication Architecture Paradigms: Topology • Single, shared transport and signalling channel • Point-to-point • Bus • Hierarchical bus • Switch • Crossbar • Multistage… • Ring • Trees • Network • Circuit switched • Packet switched, connectionless • Packet switched, connection-oriented… A. Nunez
Today's Communication Architecture Paradigms: Topology • Split transport and signalling • Transport: topology (bus, hierarchical bus, switch, ring, network…) • Signalling (addresses and routing, services, synchronisation): carried on an associated channel (with its own topology) or on a common channel (with its own topology…) • Protocol layer stack: the software/process view of how hardware signalling is generated requires mapping onto actual interfaces; see the sketch below A. Nunez
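A plain-C++ sketch of what "split transport and signalling" means for a single transfer. The struct names and fields are illustrative assumptions; the point is that control information travels on its own network, separate from the bulk payload:

```cpp
#include <cstdint>
#include <vector>

// Illustrative split of one transfer into two messages on two networks:
// the signalling network carries addressing/routing and service info,
// the transport network carries only the bulk payload.
struct SignallingMsg {
    uint32_t src_id;        // initiator
    uint32_t dst_id;        // target (address / routing information)
    uint32_t service;       // requested service / transaction type
    uint32_t transfer_id;   // lets the target match the payload to this request
};

struct TransportMsg {
    uint32_t transfer_id;           // matches the signalling message
    std::vector<uint8_t> payload;   // bulk data moved on the transport network
};

int main() {
    SignallingMsg sig{3, 7, 1, 42};                          // control, small and latency-critical
    TransportMsg  data{42, std::vector<uint8_t>(64, 0xAB)};  // 64-byte payload, bandwidth-critical
    return sig.transfer_id == data.transfer_id ? 0 : 1;      // matched by id at the target
}
```

Whether a separate transfer_id-style tag is needed depends on the associated-channel vs. common-channel choice listed above.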
Today's Communications Architecture Paradigms: Bandwidth • Application granularity • Transport granularity: fine grain, medium grain, coarse grain; bus sizes, transfer sizes • Traffic characterization: e.g. streaming, burstiness, request intervals, space-time distribution (a minimal traffic-generator sketch follows) A. Nunez
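When characterising traffic (streaming vs. bursty, request intervals), a tiny generator like the following is often enough as a first model; the distributions and their parameters are arbitrary assumptions, not measured workload data:

```cpp
#include <cstdio>
#include <random>

// Illustrative bursty traffic generator: geometric burst lengths and
// exponential idle gaps between bursts (parameters are arbitrary).
int main() {
    std::mt19937 rng(42);
    std::geometric_distribution<int> burst_len(0.25);   // mean burst ≈ 4 requests
    std::exponential_distribution<double> gap(0.1);     // mean idle gap ≈ 10 cycles

    double t = 0.0;
    for (int burst = 0; burst < 3; ++burst) {
        int n = 1 + burst_len(rng);
        for (int i = 0; i < n; ++i)
            printf("t=%8.1f  request %d of burst %d\n", t++, i, burst);
        t += gap(rng);   // idle interval before the next burst
    }
    return 0;
}
```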
Today's Communications Architecture Paradigms: Protocols • Protocols • Mapping of high-level signalling primitives • Communications-to-architecture mapping • Mapping of access policies: priorities, static and dynamic (see the arbiter sketch below) • Traffic and flow control: burstiness, request intervals, concurrency A. Nunez
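Static vs. dynamic access policies can be sketched as a small arbiter in plain C++; this is a generic illustration, not any specific bus protocol:

```cpp
#include <cstdio>
#include <vector>

// Illustrative bus arbiter: the static policy always grants the lowest
// requesting master index; round-robin is a simple dynamic policy.
int grant_static(const std::vector<bool>& req) {
    for (size_t m = 0; m < req.size(); ++m)
        if (req[m]) return static_cast<int>(m);
    return -1;                       // no requester this cycle
}

int grant_round_robin(const std::vector<bool>& req, int last_granted) {
    const int n = static_cast<int>(req.size());
    for (int i = 1; i <= n; ++i) {
        int m = (last_granted + i) % n;
        if (req[m]) return m;
    }
    return -1;
}

int main() {
    std::vector<bool> req = {true, false, true};   // masters 0 and 2 request the bus
    printf("static grant: %d, round-robin grant (after 0): %d\n",
           grant_static(req), grant_round_robin(req, 0));
    return 0;
}
```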
Today's Communications Architecture Paradigms: Signalling • Addressing and routing info • Service info • Handshake and command sync strobes • Mapping of high-level signalling primitives • Communications-to-architecture mapping • Mapping of access policies: priorities, static and dynamic • Traffic and flow control: burstiness, request intervals, concurrency, streaming… A. Nunez
Com-arch Modelling: Ptolemy-Mescal, UC Berkeley (Ptolemy I & II, Mescal, UCSD-Dey, PR-Vissers, Goosens, Lippen.., TIMA-Jerraya..) • Components for channels: • Synchronous digital bus (shared or point-to-point) • ARM AMBA bus • IBM CoreConnect bus • Analog channel • Actors encapsulate the physical layer • Each actor has a common interface to make experimentation possible • The Ptolemy actor interface is at a higher level than the channel's actual electrical interface A. Nunez
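The "common interface" idea can be pictured as an abstract channel class that the different bus models implement. The class and method names below are ours, not the Ptolemy/Mescal APIs; a minimal plain-C++ sketch:

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

// Illustrative common channel interface: the simulator talks to every
// channel model the same way, while each subclass hides the details of
// the physical layer it represents (shared bus, AMBA, CoreConnect, ...).
class Channel {
public:
    virtual ~Channel() = default;
    virtual void write(uint32_t addr, const std::vector<uint8_t>& data) = 0;
    virtual std::vector<uint8_t> read(uint32_t addr, size_t len) = 0;
};

class SharedBusChannel : public Channel {
public:
    void write(uint32_t addr, const std::vector<uint8_t>& data) override {
        printf("shared bus: write %zu bytes to 0x%08x\n", data.size(), addr);
    }
    std::vector<uint8_t> read(uint32_t addr, size_t len) override {
        printf("shared bus: read %zu bytes from 0x%08x\n", len, addr);
        return std::vector<uint8_t>(len, 0);
    }
};

int main() {
    SharedBusChannel bus;
    Channel& ch = bus;               // the experiment sees only the common interface
    ch.write(0x1000, {1, 2, 3, 4});
    ch.read(0x1000, 4);
    return 0;
}
```

Swapping SharedBusChannel for another subclass changes the modelled physical layer without touching the experiment code, which is the point of the common actor interface.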