Other Architectures & Examples

Multithreaded architectures
Dataflow architectures
Multiprocessor examples
1st May, 2006
Anshul Kumar, CSE IITD

Context switching
• Delays and poor resource utilization due to:
  • data/control hazards
  • cache misses
  • waiting for some event
• Solution: context switch to another thread
• Context switch mechanism:
  • operating system: slow
  • hardware: fast

Multithreaded architecture
• Hardware context switching
• Models
  • control flow or hybrid (control flow, data flow)
• Granularity
  • fine grain or coarse grain
• Memory organization
  • shared? distributed? cache coherent?
• No. of threads
  • small, medium, large
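The difference between fine-grain and coarse-grain switching can be illustrated with a small scheduling sketch. This is only a toy model under stated assumptions: each thread is a list of operation names, the string "miss" stands for any long-latency event, and the function names `fine_grain` and `coarse_grain` are hypothetical, not taken from any of the machines discussed here.

```python
from collections import deque


def fine_grain(threads):
    """Fine grain: switch to the next thread after every instruction."""
    queues = [deque(t) for t in threads]
    schedule = []
    while any(queues):
        for tid, q in enumerate(queues):
            if q:
                schedule.append((tid, q.popleft()))
    return schedule


def coarse_grain(threads, stall="miss"):
    """Coarse grain: run one thread until it hits a long-latency event."""
    queues = [deque(t) for t in threads]
    schedule = []
    tid = 0
    while any(queues):
        q = queues[tid]
        while q:
            op = q.popleft()
            schedule.append((tid, op))
            if op == stall:          # assumed marker for a cache miss or wait
                break
        tid = (tid + 1) % len(queues)
    return schedule


if __name__ == "__main__":
    demo = [["a1", "miss", "a2"], ["b1", "b2"]]
    print(fine_grain(demo))
    print(coarse_grain(demo))
```

In the fine-grain schedule the two threads alternate on every issue slot, while in the coarse-grain schedule thread 0 keeps the pipeline until its "miss" and only then yields to thread 1.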

ILP and Multithreading
[Figure: comparison of ILP, coarse MT, fine MT and SMT. Source: Hennessy and Patterson]

Chip level multithreading
Executing instructions from multiple threads within one processor chip at the same time.
• Multithreading: interleaved issue of multiple instructions from different threads
• Simultaneous multithreading (SMT): issue multiple instructions from multiple threads in one cycle
• Chip-level multiprocessing (CMP or multicore): integrate two or more superscalar processors into one chip, each executing one thread independently
• Any combination of multithreading/SMT/CMP
Source: Wikipedia

Historical Examples
Machine | Granularity | Procs | Threads/proc | Memory | Year
HEP from Denelcor | fine | max 16 | 8 active, 64 max | shared, centralized | 1978
Tera | fine | max 256 | 128 | distributed shared | 1990
Alewife (MIT) | coarse | max 512 (Sparcle) | 1 active, 3 loaded | CC | 1990

Modern examples
• Pentium 4: Hyperthreading
• MIPS MT: 8 cores with 4 threads each
• IBM Power 5: dual core, 2 threads each
• UltraSPARC T1: fine-grained multithreading

HEP
[Figure: HEP block diagram. Labelled components: control loop, 8-stage pipeline, scheduler function unit, PSW queue, program memory, matching unit, increment control, registers, operand fetch, SFU, FU1, FU2, ..., FUn, path to/from data memory]

Control Flow & Data Flow models
• Control Flow (von Neumann)
  • control flows through a sequence of instructions; branches can alter the flow
  • instructions get data from or put data in memory
  • explicit parallelism through control operators (fork/join)
• Data Flow
  • instructions are triggered by availability of data
  • data flows from instruction to instruction
  • explicit parallelism

Dataflow Model
[Figure: dataflow graph. A and B feed a - node producing A-B; B and the constant 1 feed a + node producing B+1; both results feed a * node]
R = (A-B)*(B+1)

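Dataflow Program
[Figure: dataflow program for R = (A-B)*(B+1). Labelled nodes: L1 (compute B), L2, L3, L4. Operators: +, -, *. Destination fields shown: L2/2, L3/1, L4/1, L4/2, L6/1. Node L4 (*) receives A-B and B+1 and sends its result to L6/1]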
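A data-driven evaluation of R = (A-B)*(B+1) can be sketched as follows. This is a minimal illustration, not the encoding used on the slide: the node names "sub", "add", "mul" and the sink "out" are assumptions, and destinations are written as (node, port) pairs in the spirit of the label/port fields in the figure. A node fires as soon as both of its operand slots hold a token, regardless of the order in which tokens arrive.

```python
import operator

# node name -> (operation, number of operands, list of (destination, port)).
# All names here are illustrative; "out" is a sink outside the graph.
PROGRAM = {
    "sub": (operator.sub, 2, [("mul", 0)]),
    "add": (operator.add, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, [("out", 0)]),
}


def run(program, initial_tokens):
    """Fire each node once all of its operand slots hold a token."""
    slots = {name: {} for name in program}   # operands waiting at each node
    results = {}
    tokens = list(initial_tokens)            # each token: (node, port, value)
    while tokens:
        node, port, value = tokens.pop(0)
        if node not in program:              # token left the graph: a result
            results[node] = value
            continue
        op, arity, dests = program[node]
        slots[node][port] = value
        if len(slots[node]) == arity:        # all operands present: fire
            out = op(slots[node][0], slots[node][1])
            slots[node].clear()
            tokens.extend((d, p, out) for d, p in dests)
    return results


def evaluate(a, b):
    """Inject A, B and the constant 1, then return R = (A-B)*(B+1)."""
    tokens = [("sub", 0, a), ("sub", 1, b), ("add", 0, b), ("add", 1, 1)]
    return run(PROGRAM, tokens)["out"]


if __name__ == "__main__":
    print(evaluate(5, 2))
```

No program counter appears anywhere in the sketch: the only thing that orders the subtraction, addition and multiplication is the arrival of their operands.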

Static Dataflow Architecture
[Figure: static dataflow processing element. Labelled components: activity store, fetch unit, FU1, FU2, ..., FUn, instruction queue, update unit, link to/from other PEs]

Tagged-token dataflow architecture
[Figure: tagged-token dataflow processing element. Labelled components: matching unit, matching store, instruction/data memory, fetch unit, FU1, FU2, ..., FUn, token queue, form token unit, link to/from other PEs]
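The role of the matching unit and matching store can be sketched in a few lines. This is a simplified model under stated assumptions: every instruction takes exactly two operands, a tag is any hashable value that distinguishes activations of the same instruction, and the class name `MatchingStore` and method `arrive` are hypothetical. A token waits in the store until its partner with the same instruction and tag arrives, at which point the operand pair is released for execution.

```python
class MatchingStore:
    """Toy matching store for two-operand instructions (illustrative only)."""

    def __init__(self):
        # (instruction, tag) -> (port, value) of the token that arrived first
        self.waiting = {}

    def arrive(self, instr, tag, port, value):
        """Return the operand pair if the partner is waiting, else store the token."""
        key = (instr, tag)
        if key in self.waiting:
            other_port, other_value = self.waiting.pop(key)
            operands = [None, None]
            operands[port] = value
            operands[other_port] = other_value
            return tuple(operands)        # ready to be sent to the fetch unit
        self.waiting[key] = (port, value)
        return None


if __name__ == "__main__":
    store = MatchingStore()
    print(store.arrive("sub", tag=0, port=0, value=10))   # first token waits
    print(store.arrive("sub", tag=1, port=0, value=20))   # other activation waits
    print(store.arrive("sub", tag=0, port=1, value=4))    # partner for tag 0
```

Because the tag is part of the key, tokens belonging to different activations of the same instruction never match each other, which is what lets several activations be in flight at once.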

UMA Examples
• Earlier approach: large number of processors (e.g. Denelcor HEP, NYU Ultracomputer)
• Now realized: good only for a small number of processors (e.g. Encore Multimax, 1980s; SGI Power Challenge, 1990s)

SGI Power Challenge
• 18 MIPS R8000
• 16 GB RAM, 8-way interleaved
• 4 Power Channel-2, each 320 MB/s (I/O bus)
• Power Path-2: split transaction shared bus (256 bit data, 40 bit address)
• Snoopy cache coherence protocol

NUMA Examples
• BBN TC2000
• IBM RP3
• Hector
• Cray T3D

Hector
• Hierarchical structure
  • global ring
  • local rings
  • stations
    • Proc module (P+C+M)
    • I/O module

Hector
[Figure: Hector organization. Stations sit on local rings, and the local rings connect to a global ring. Inside a station, a station bus links the station controller, the Proc modules and an I/O module]

Cray T3D
• Alpha 21064 processors
• Cray Y-MP host
• up to 128 GB memory
• 4x4x4 3D torus, configurable up to 8x8x8
• 2 PEs in each node
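The wraparound links of a 3D torus, and the PE counts implied by the figures above, can be sketched as follows. This is only an illustration: addressing nodes by (x, y, z) coordinates and the function names `torus_neighbors` and `pe_count` are assumptions, not the T3D's actual routing interface.

```python
def torus_neighbors(node, dims):
    """Return the six neighbours of node=(x, y, z) in a torus of size dims."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic provides the wraparound link at each edge.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def pe_count(dims, pes_per_node=2):
    """Total PEs for a torus of the given dimensions (2 PEs per node)."""
    x, y, z = dims
    return x * y * z * pes_per_node


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    print(pe_count((4, 4, 4)), pe_count((8, 8, 8)))
```

With 2 PEs per node, a 4x4x4 torus holds 128 PEs and the 8x8x8 configuration holds 1024.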

CC-NUMA examples
Machine | Nodes | Mem | Cache | Net
Wisconsin Multicube | single proc | per col bus | snoopy | bus grid
Aquarius Multimulti | single proc | per node | snoopy + directory | bus grid
Stanford Dash | cluster: 4 R3000 + FPU on bus | per cluster | snoopy + directory | pair of meshes
Stanford Flash | single proc: T5 + magic chip | per node | directory | 2D mesh
Convex Exemplar | hyper node: 8 PA-RISC | per hyper node | SCI | X bar (hyper node), multi rings
Magic chip: memory + I/O + network controller

COMA examples
• DDM (Data Diffusion Machine)
  • single bus (split transaction)
  • can be made hierarchical
• KSR 1
  • hierarchical rings
  • distributed directory is a matrix: rows for pages, columns for caches

Distr Mem Arch Examples
Machine | Comp. proc | Comm. proc | Vec. proc | Switch | Topology
nCUBE2 | custom | custom | | | hyper cube
iPSC2 | i386 | yes | yes | | hyper cube
Intel Paragon | i860 | i860 | | custom | 2D mesh
Genesis | i870 | i870 | | custom | 2 level X bar
Manna | i860 | i860 | | 16x16 X bar | hierarch.
Parsytec | P.PC601 | T805 | | C004 | 3D mesh
Transtech Paramid | i860 | T805 | | C004 | variable
IBM SP2 | Power2 | i860 | | custom | fat tree
Meiko C32 | SPARC | custom | Fujitsu | custom | fat tree
Parsys SN9800 | T900 | T900 | | C104 | hierarch sw

References
• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures: A Design Space Approach", Addison Wesley, 1997.
