Signalling in the Heterogeneous Architecture Multiprocessor Paradigm Antonio Núñez, Victor Reyes, Tomás Bautista Keynote IUMA, Institute for Applied Microelectronics, ULPGC A. Nunez
Index • MPSoC Architectures -> Hetero MPSoC • Communication Architectures -> Split Transport and Signalling Networks • Previous and Related work • Our SystemC Based Modelling Approach • Experiments • Conclusions A. Nunez
Technological Forecasts • Moore's Law: the number of transistors per chip doubles every two years • ITRS: GALS, NoC, SoC, MPSoC A. Nunez
Processor to DRAM Performance Gap [Figure: performance vs. time, 1980–2000, log scale: CPU ("Moore's Law") improves ~60%/year, DRAM ~7%/year; the processor-memory performance gap grows ~50%/year] A. Nunez
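The ~50%/year gap is just the two quoted rates compounded against each other; a few lines of plain C++ make the arithmetic explicit (the 1980 starting point and normalisation to 1.0 are illustrative assumptions, not data from the figure):

```cpp
#include <cstdio>

// Compound the per-year improvement rates quoted on the slide:
// CPU performance grows ~60%/year, DRAM ~7%/year, so the relative
// gap grows by roughly 1.60 / 1.07 - 1 ≈ 50% per year.
int main() {
    double cpu = 1.0, dram = 1.0;
    for (int year = 1980; year <= 2000; ++year) {
        printf("%d  CPU %9.1f  DRAM %5.2f  gap %8.1fx\n",
               year, cpu, dram, cpu / dram);
        cpu  *= 1.60;   // processor performance per year
        dram *= 1.07;   // DRAM performance per year
    }
    return 0;
}
```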
Logic to Memory Area Gap A. Nunez
Logic to Productivity Gap A. Nunez
-> Platform based design -> Communication architectures A. Nunez
Index • MPSoC Architectures -> Hetero MPSoC • Communication Architectures -> Split Transport and Signalling Networks • Previous and Related work • Our SystemC Based Modelling Approach • Experiments • Conclusions A. Nunez
Processor Architecture Paradigms (cfr. Ungerer et al., Patterson et al., Tenhunnen et al., Computer special issue) • Processor/Memory/Switch: processor-, memory-, or communications-dominated systems; communications architecture • Processor-Mono: speed-up of a single-threaded application (Patt, Sohi…): advanced superscalar, trace cache, superspeculative, multiscalar processors • Processor-Multi: speed-up of multi-threaded applications: simultaneous multithreading (SMT), chip multiprocessors (CMPs) • Memory, Processor-in-Memory, IRAM, others (Patterson) • Network on Chip: homo and hetero, many variants (Mihal, Tenhunnen, Goosens) A. Nunez
Monoprocessor: Superflow Processor • Fine granularity, at the data-word level • The Superflow processor speculates on: • instruction flow: two-phase branch predictor combined with a trace cache • register data flow: source operand value prediction, constant value prediction, value stride prediction (speculate on constant, incremental increases in operand values), and dependence prediction (predict register value dependences between instructions) • memory data flow: prediction of load values and load addresses, and alias prediction A. Nunez
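The stride-prediction idea above can be pictured in a few lines of plain C++. This is an illustrative predictor only, not the Superflow design; the PC-indexed table and update policy are assumptions:

```cpp
#include <cstdio>
#include <cstdint>
#include <unordered_map>

// Minimal per-instruction stride value predictor (illustrative only):
// remember the last produced value and the last observed stride,
// then predict last + stride for the next instance of the same PC.
struct StrideEntry { int64_t last = 0; int64_t stride = 0; bool valid = false; };

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table;  // keyed by instruction PC
public:
    int64_t predict(uint64_t pc) {
        const StrideEntry& e = table[pc];
        return e.valid ? e.last + e.stride : 0;
    }
    // Train with the actually produced value once the instruction retires.
    void update(uint64_t pc, int64_t value) {
        StrideEntry& e = table[pc];
        if (e.valid) e.stride = value - e.last;
        e.last = value;
        e.valid = true;
    }
};

int main() {
    StridePredictor p;
    for (int64_t v : {100, 104, 108, 112}) {   // a constant-stride value sequence
        printf("predicted %lld, actual %lld\n",
               static_cast<long long>(p.predict(0x400be0)),
               static_cast<long long>(v));
        p.update(0x400be0, v);
    }
    return 0;
}
```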
Com-arch in Superflow Processor A. Nunez
Multiscalar Processors • A program is represented as a control flow graph (CFG), where basic blocks are nodes, and arcs represent flow of control. • A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task. • The tasks are distributed to a number of parallel PEs within a processor. • Each PE fetches and executes instructions belonging to its assigned task. • The primary constraint: it must preserve the sequential program semantics. A. Nunez
[Figure: Multiscalar mode of execution: tasks A, B, D and E from the CFG are assigned to PEs 0–3, with data values forwarded between the PEs] A. Nunez
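A rough sketch of the task-distribution step in plain C++ (the task type and the round-robin assignment are assumptions for illustration; a real multiscalar sequencer also forwards register values and squashes misspeculated tasks):

```cpp
#include <cstdio>
#include <vector>
#include <string>

// Illustrative only: assign CFG tasks to parallel PEs in program order,
// the way the multiscalar sequencer walks the CFG in task-sized steps
// without inspecting the instructions inside a task.
struct Task { std::string name; };

int main() {
    const int num_pes = 4;
    std::vector<Task> tasks = {{"A"}, {"B"}, {"D"}, {"E"}};  // one speculative CFG path
    for (size_t i = 0; i < tasks.size(); ++i)
        printf("Task %s -> PE %zu\n", tasks[i].name.c_str(), i % num_pes);
    return 0;
}
```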
Com-arch in Multiscalar processor A. Nunez
Multiscalar, Trace and Speculative Multithreaded Processors • Multiscalar: A program is statically partitioned into tasks which are marked by annotations of the CFG. • Trace Processor: Tasks are generated from traces of the trace cache. • Speculative multithreading: Tasks are otherwise dynamically constructed. • Common target: increase single-thread program performance by dynamically exploiting thread-level speculation in addition to instruction-level parallelism. • Here a "thread" means a "HW thread" A. Nunez
Multis: Exploiting Additional, More Coarse-Grained Parallelism • CMPs (chip multiprocessors, or multiprocessor chips): • integrate two or more complete processors on a single chip, • every functional unit of a processor is duplicated. • SMTs (simultaneous multithreaded processors): • store multiple contexts in different register sets on the chip, • the functional units are multiplexed between the threads, • instructions of different contexts are executed simultaneously. A. Nunez
CMPs-Homo: Com-arch by shared global memory [Figure: four processors accessing a shared global memory directly; shared global memory, no caches] A. Nunez
CMPs-Homo: Com-arch by shared primary cache [Figure: four processors sharing a primary cache, backed by a secondary cache and global memory] A. Nunez
CMPs-Homo: Com-arch by global memory and caches [Figure: processors with private primary caches and shared secondary caches over a shared global memory; variants: shared caches and memory, shared secondary cache] A. Nunez
Com-arch in Hydra: A Single-Chip Multiprocessor [Figure: four CPUs, each with primary I- and D-caches and a memory controller, connected by centralized bus arbitration mechanisms to an on-chip secondary cache (SRAM array), Rambus memory interface to off-chip DRAM main memory, I/O bus interface and DMA, all on a single chip] A. Nunez
CMPs-Hetero: Communications Architecture • Architectures found in today's heterogeneous processors for platform-based design • E.g. CPU cores, AMBA buses, internal/external shared memories [Figure: RISC core, engines and external I/O attached to an AMBA bus and a shared bus, with internal/external memory] A. Nunez
Multithreaded Processors • Aim: Latency tolerance • What is the problem? Load access latencies measured on an Alpha Server 4100 SMP with four Alpha 21164 processors are: • 7 cycles for a primary cache miss which hits in the on-chip L2 cache of the 21164 processor, • 21 cycles for an L2 cache miss which hits in the L3 (board-level) cache, • 80 cycles for a miss that is served by the memory, and • 125 cycles for a dirty miss, i.e., a miss that has to be served from another processor's cache memory. A. Nunez
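For simulation purposes those measured numbers can be folded into a simple latency model; the enum names below are ours, the cycle counts are the ones quoted on the slide:

```cpp
#include <cstdio>

// Simple memory-hierarchy latency model using the Alpha Server 4100
// numbers quoted above (cycles as seen by the requesting processor).
enum class MissLevel { L2Hit, L3Hit, Memory, DirtyRemote };

int load_latency_cycles(MissLevel level) {
    switch (level) {
        case MissLevel::L2Hit:       return 7;    // primary miss, hits on-chip L2
        case MissLevel::L3Hit:       return 21;   // L2 miss, hits board-level L3
        case MissLevel::Memory:      return 80;   // served by main memory
        case MissLevel::DirtyRemote: return 125;  // dirty line in another CPU's cache
    }
    return 0;
}

int main() {
    printf("dirty remote miss costs %d cycles\n",
           load_latency_cycles(MissLevel::DirtyRemote));
    return 0;
}
```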
Multithreading • Multithreading • The ability to pursue two or more threads of control in parallel within a processor pipeline. • Advantage: The latencies that arise in the computation of a single instruction stream are filled by computations of another thread. • Multithreaded processors are able to bridge latencies by switching to another thread of control - in contrast to chip multiprocessors. A. Nunez
Approaches to Multithreaded Processors • Cycle-by-cycle interleaving: • An instruction of another thread is fetched and fed into the execution pipeline at each processor cycle. • Block interleaving: • The instructions of a thread are executed successively until an event occurs that may cause latency; this event induces a context switch. • Simultaneous multithreading (SMT): • Instructions are simultaneously issued from multiple threads to the FUs of a superscalar processor, • combining wide superscalar instruction issue with multithreading. A. Nunez
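The difference between the first two policies comes down to when the fetch stage switches threads. A minimal plain-C++ sketch of the two policies (the thread representation and the faked latency event are assumptions for illustration):

```cpp
#include <cstdio>
#include <vector>

// Illustrative fetch-slot allocation for two interleaving policies.
// Cycle-by-cycle: rotate threads every cycle.
// Block: stay on one thread until it reports a long-latency event.
struct Thread { int id; bool long_latency_event; };

int main() {
    std::vector<Thread> threads = {{0, false}, {1, false}, {2, false}};
    int current = 0;
    for (int cycle = 0; cycle < 6; ++cycle) {
        // Cycle-by-cycle interleaving: a different thread every cycle.
        int cc_thread = cycle % static_cast<int>(threads.size());
        // Block interleaving: switch only on a latency-causing event
        // (here we fake such an event on cycle 3).
        threads[current].long_latency_event = (cycle == 3);
        if (threads[current].long_latency_event)
            current = (current + 1) % static_cast<int>(threads.size());
        printf("cycle %d: cycle-by-cycle fetches T%d, block fetches T%d\n",
               cycle, cc_thread, current);
    }
    return 0;
}
```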
[Figure: Multithreading versus non-multithreading approaches, time in processor cycles: (a) single-threaded scalar, (b) cycle-by-cycle interleaving multithreaded scalar, (c) block interleaving multithreaded scalar with context switches] A. Nunez
[Figure: Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP), issue slots vs. time in processor cycles: (a) SMT, (b) CMP] A. Nunez
Combining SMT and Multimedia • Start with a wide-issue superscalar general-purpose processor • Enhance by simultaneous multithreading • Enhance by multimedia unit(s) • Enhance by on-chip RAM memory for constants and local variables A. Nunez
The SMT Multimedia Processor A. Nunez
IPC of Maximum Processor Models A. Nunez
Combining CMP-hetero and Multimedia • Start with a general-purpose processor • Enhance by hierarchical-bus com-arch • Enhance by hardware accelerators and coprocessors including multimedia unit(s) • Enhance by on-chip RAM memories for constants, local variables, frames… A. Nunez
Real implementation example: Philips Eclipse architecture instance for video coding A. Nunez
CMP or SMT? • The performance race between SMT and CMP is not yet decided. • CMP is easier to implement, but only SMT has the ability to hide latencies. • A functional partitioning is not easily reached within an SMT processor due to the centralized instruction issue. • A separation of the thread queues is a possible solution, although it does not remove the central instruction issue. • A combination of simultaneous multithreading with CMP may be superior. • Research: combine the SMT or CMP organization with the ability to create threads with compiler support or fully dynamically out of a single thread • thread-level speculation • close to multiscalar A. Nunez
Processor-in-Memory • Technological trends have produced a large and growing gap between processor speed and DRAM access latency. • Today, it takes dozens of cycles for data to travel between the CPU and main memory. • The CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines. • Much of this complexity is devoted to hiding memory access latency. • Memory wall: the phenomenon that access times increasingly limit system performance. • Memory-centric design is envisioned for the future A. Nunez
PIM or Intelligent RAM (IRAM) • PIM (processor-in-memory) or IRAM (intelligent RAM) approaches couple processor execution with large, high-bandwidth, on-chip DRAM banks. • PIM or IRAM merge processor and memory into a single chip. • Advantages: • The processor-DRAM gap in access speed keeps growing; PIM provides higher bandwidth and lower latency for (on-chip) memory accesses. • DRAM can accommodate 30 to 50 times more data than the same chip area devoted to caches. • On-chip memory may be treated as main memory, in contrast to a cache, which is just a redundant memory copy. • PIM decreases energy consumption in the memory system due to the reduction of off-chip accesses. • VIRAM, CODE A. Nunez
V-IRAM-2: 0.13 µm, fast logic, 1 GHz; 16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB [Figure: 2-way superscalar vector processor with 8K I-cache and 8K D-cache, vector instruction queue, vector registers and load/store units configurable as 8 x 64, 16 x 32 or 32 x 16 lanes, serial I/O, and a memory crossbar switch to the on-chip DRAM banks] A. Nunez
NoC Processor Architecture • Network-on-chip, specialized PEs, advanced interconnect technologies • Will use packet network architectures in 2010 [Figure: PEs, DSP, PE array, on-chip memory, external memory, external I/O and a packet network controller connected through switch nodes] A. Nunez
Mescal Communication Architecture: General Paradigm • The Mescal Communication Architecture is a general, coarse-grained on-chip interconnection scheme for various system components such as processing elements, memory and other communicating elements. [Figure: processing elements with caches ($) and memories connected through switches and a bridge to the NoC] A. Nunez
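A minimal plain-C++ sketch of the packet/switch-node idea behind such an interconnect (the packet fields and the per-node routing table are our assumptions, not the Mescal interfaces):

```cpp
#include <cstdio>
#include <map>
#include <utility>

// Illustrative packet-switched hop: each switch node forwards a packet
// toward the destination PE through a locally chosen output port.
struct Packet { int src_pe; int dst_pe; unsigned payload; };

class SwitchNode {
    int id_;
    std::map<int, int> route_;   // destination PE -> output port
public:
    SwitchNode(int id, std::map<int, int> route) : id_(id), route_(std::move(route)) {}
    int forward(const Packet& p) const {
        auto it = route_.find(p.dst_pe);
        int port = (it != route_.end()) ? it->second : 0;  // port 0 as default route
        printf("switch %d: packet %d -> %d leaves on port %d\n",
               id_, p.src_pe, p.dst_pe, port);
        return port;
    }
};

int main() {
    SwitchNode s(0, {{1, 2}, {2, 3}});   // PE 1 via port 2, PE 2 via port 3
    s.forward({0, 1, 0xCAFE});
    s.forward({0, 2, 0xBEEF});
    return 0;
}
```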
NoC Communication Architecture A. Nunez
NoC: Example for a bus A. Nunez
Index • MPSoC Architectures -> Hetero MPSoC • Communication Architectures -> Split Transport and Signalling Networks • Previous and Related work • Our SystemC Based Modelling Approach • Experiments • Conclusions A. Nunez
Today's Communication Architecture Paradigms: Topology • Single, shared transport and signalling channel • Point-to-point • Bus • Hierarchical bus • Switch • Crossbar • Multistage… • Ring • Trees • Network • Circuit switched • Packet switched, connectionless • Packet switched, connection-oriented… A. Nunez
Today's Communication Architecture Paradigms: Topology • Split transport and signalling • Transport: topology (bus, hierarchical bus, switch, ring, network…) • Signalling (addresses and routing, services, synchronisation): carried on an associated channel (with its own topology) or on a common channel (with its own topology…) • Protocol layer stack: the software/process view of how hardware signalling is generated requires mapping onto actual interfaces; see the sketch below A. Nunez
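A plain-C++ sketch of what "split transport and signalling" means for a single transfer. The struct names and fields are illustrative assumptions; the point is that control information travels on its own network, separate from the bulk payload:

```cpp
#include <cstdint>
#include <vector>

// Illustrative split of one transfer into two messages on two networks:
// the signalling network carries addressing/routing and service info,
// the transport network carries only the bulk payload.
struct SignallingMsg {
    uint32_t src_id;        // initiator
    uint32_t dst_id;        // target (address / routing information)
    uint32_t service;       // requested service / transaction type
    uint32_t transfer_id;   // lets the target match the payload to this request
};

struct TransportMsg {
    uint32_t transfer_id;           // matches the signalling message
    std::vector<uint8_t> payload;   // bulk data moved on the transport network
};

int main() {
    SignallingMsg sig{3, 7, 1, 42};                          // control, small and latency-critical
    TransportMsg  data{42, std::vector<uint8_t>(64, 0xAB)};  // 64-byte payload, bandwidth-critical
    return sig.transfer_id == data.transfer_id ? 0 : 1;      // matched by id at the target
}
```

Whether a separate transfer_id-style tag is needed depends on the associated-channel vs. common-channel choice listed above.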
Today's Communications Architecture Paradigms: Bandwidth • Application granularity • Transport granularity: fine grain, medium grain, coarse grain; bus sizes, transfer sizes • Traffic characterization: e.g. streaming, burstiness, request intervals, space-time distribution (a minimal traffic-generator sketch follows) A. Nunez
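When characterising traffic (streaming vs. bursty, request intervals), a tiny generator like the following is often enough as a first model; the distributions and their parameters are arbitrary assumptions, not measured workload data:

```cpp
#include <cstdio>
#include <random>

// Illustrative bursty traffic generator: geometric burst lengths and
// exponential idle gaps between bursts (parameters are arbitrary).
int main() {
    std::mt19937 rng(42);
    std::geometric_distribution<int> burst_len(0.25);   // mean burst ≈ 4 requests
    std::exponential_distribution<double> gap(0.1);     // mean idle gap ≈ 10 cycles

    double t = 0.0;
    for (int burst = 0; burst < 3; ++burst) {
        int n = 1 + burst_len(rng);
        for (int i = 0; i < n; ++i)
            printf("t=%8.1f  request %d of burst %d\n", t++, i, burst);
        t += gap(rng);   // idle interval before the next burst
    }
    return 0;
}
```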
Today's Communications Architecture Paradigms: Protocols • Protocols • Mapping of high-level signalling primitives • Communications-to-architecture mapping • Mapping of access policies: priorities, static and dynamic (see the arbiter sketch below) • Traffic and flow control: burstiness, request intervals, concurrency A. Nunez
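Static vs. dynamic access policies can be sketched as a small arbiter in plain C++; this is a generic illustration, not any specific bus protocol:

```cpp
#include <cstdio>
#include <vector>

// Illustrative bus arbiter: the static policy always grants the lowest
// requesting master index; round-robin is a simple dynamic policy.
int grant_static(const std::vector<bool>& req) {
    for (size_t m = 0; m < req.size(); ++m)
        if (req[m]) return static_cast<int>(m);
    return -1;                       // no requester this cycle
}

int grant_round_robin(const std::vector<bool>& req, int last_granted) {
    const int n = static_cast<int>(req.size());
    for (int i = 1; i <= n; ++i) {
        int m = (last_granted + i) % n;
        if (req[m]) return m;
    }
    return -1;
}

int main() {
    std::vector<bool> req = {true, false, true};   // masters 0 and 2 request the bus
    printf("static grant: %d, round-robin grant (after 0): %d\n",
           grant_static(req), grant_round_robin(req, 0));
    return 0;
}
```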
Today's Communications Architecture Paradigms: Signalling • Addressing and routing info • Service info • Handshake and command sync strobes • Mapping of high-level signalling primitives • Communications-to-architecture mapping • Mapping of access policies: priorities, static and dynamic • Traffic and flow control: burstiness, request intervals, concurrency, streaming… A. Nunez
Com-arch Modelling: Ptolemy-Mescal, UC Berkeley (Ptolemy I & II, Mescal, UCSD-Dey, PR-Vissers, Goosens, Lippen.., TIMA-Jerraya..) • Components for channels: • Synchronous digital bus (shared or point-to-point) • ARM AMBA bus • IBM CoreConnect bus • Analog channel • Actors encapsulate the physical layer • Each actor has a common interface to make experimentation possible • The Ptolemy actor interface is at a higher level than the channel's actual electrical interface A. Nunez
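The "common interface" idea can be pictured as an abstract channel class that the different bus models implement. The class and method names below are ours, not the Ptolemy/Mescal APIs; a minimal plain-C++ sketch:

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

// Illustrative common channel interface: the simulator talks to every
// channel model the same way, while each subclass hides the details of
// the physical layer it represents (shared bus, AMBA, CoreConnect, ...).
class Channel {
public:
    virtual ~Channel() = default;
    virtual void write(uint32_t addr, const std::vector<uint8_t>& data) = 0;
    virtual std::vector<uint8_t> read(uint32_t addr, size_t len) = 0;
};

class SharedBusChannel : public Channel {
public:
    void write(uint32_t addr, const std::vector<uint8_t>& data) override {
        printf("shared bus: write %zu bytes to 0x%08x\n", data.size(), addr);
    }
    std::vector<uint8_t> read(uint32_t addr, size_t len) override {
        printf("shared bus: read %zu bytes from 0x%08x\n", len, addr);
        return std::vector<uint8_t>(len, 0);
    }
};

int main() {
    SharedBusChannel bus;
    Channel& ch = bus;               // the experiment sees only the common interface
    ch.write(0x1000, {1, 2, 3, 4});
    ch.read(0x1000, 4);
    return 0;
}
```

Swapping SharedBusChannel for another subclass changes the modelled physical layer without touching the experiment code, which is the point of the common actor interface.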