CA406 Computer Architecture Networks
Data Flow - Summary • Fine-Grain Dataflow • Suffered from comms network overload! • Coarse-Grain Dataflow • Monsoon ... • Overtaken by commercial technology!! • A sad “fact-of-life” • It’s almost impossible to generate the funds for non-“mainstream” computer architecture research • $n × 10^8 required • Non-mainstream = interesting!
Data Flow - Summary • As a software model … • Functional languages • Dataflow in a different guise! • Theoretically important • Practically? Inefficient (= slow!!) … ask your CS colleagues! • Cilk - based on C • Used on CIIPS Myrmidons • Uses a dataflow model • Threads become ready for execution when their data is generated • Message-passing efficiency • Without explicit data transfer & synchronisation!
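The classic Cilk example shows the model. A minimal sketch in the MIT Cilk-5 dialect (it needs the Cilk compiler, not plain cc): spawn lets a child run in parallel, and sync is the dataflow step, the function resumes only once its inputs x and y have been generated.

```c
/* Classic Cilk fib: dataflow-style threading in a C dialect.
   Sketch only -- requires the MIT Cilk-5 compiler.           */
cilk int fib(int n)
{
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n - 1);  /* child may execute in parallel      */
        y = spawn fib(n - 2);  /* ditto                              */
        sync;                  /* wait until both results (the data) */
                               /* have been generated                */
        return x + y;
    }
}
```

Synchronisation rides on the data itself: no explicit message passing or locks appear in the source.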
Networks • Network Topology (or shape) • Vital to efficient parallel algorithms • Communication is the limiting factor! • Ideal • Cross-bar • Any-to-any • Non-blocking • Except when two sources target the same receiver • Realisable • But only for limited order (number of ports)
Networks • Cross-bars • Achilles • 8 × 8 • Full duplex • Simultaneous input and output at each port • 32-bit data path • Target: 1 Gbyte/second total throughput • but the 3-D arrangement was needed to achieve • bandwidth • high order
Networks • Cross-bars • Achilles • Hardware almost trivial! • Single FPGA on each level • Programmable • VHDL models • Several topologies • Just by changing the software!
Networks - More than 8 PEs • Simple • Use two 8 × 8 routers! • but… the link between the two routers gets a lot of traffic!
Networks - Fat tree • Problem: • High-traffic links between PEs can become a bottleneck • Solution: Fat-tree • Links higher up the tree are “fatter” • Sustainable bandwidth between all PEs is the same
Networks - Performance Metrics • Metrics for comparing network topologies • Diameter • Maximum distance (in hops) between any pair of nodes • Determines latency • Bisection Bandwidth • Aggregate bandwidth over any “cut” which divides the network in half • Determines throughput • Crossbar • Diameter: 1 • Every PE is directly connected to the router, so a single “hop” suffices • Bisection Bandwidth: (n/2) × b bytes/sec • b is the bandwidth of a single link
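The two metrics can be written down as code. A minimal sketch, assuming an n-port crossbar with per-link bandwidth b; the type and function names are illustrative, not from the lecture:

```c
/* Sketch: performance metrics for an n-port crossbar.
   b is the bandwidth of a single link (bytes/s).      */
typedef struct {
    int    diameter;      /* max hops between any pair of PEs     */
    double bisection_bw;  /* bytes/s across a cut halving the PEs */
} Metrics;

Metrics crossbar_metrics(int n, double b)
{
    Metrics m;
    m.diameter     = 1;              /* any PE reaches any other in one hop */
    m.bisection_bw = (n / 2.0) * b;  /* n/2 links can cross the cut at once */
    return m;
}
```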
Networks - Performance Metrics • Metrics for comparing network topologies • To connect n PEs with m × m crossbars • Single link bandwidth b bytes/s • Simple: n = 14 with 2 switches (one port on each 8 × 8 switch is used by the inter-switch link, leaving 7 + 7 for PEs) • Diameter: 3 (PE → switch → switch → PE) • Bisection Bandwidth: b (only the single inter-switch link crosses the cut)
Networks - Performance Metrics • Fat-tree • Diameter: 2 log_m n • Height is log_m n • Worst-case distance - up to the root and down again • Bisection Bandwidth: b · n/2 bytes/sec • Links are fatter higher up the tree
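Both figures follow directly from the tree's shape; a one-line derivation (standard reasoning, not spelled out on the slide):

```latex
\text{diameter}
  = \underbrace{\log_m n}_{\text{up to the root}}
  + \underbrace{\log_m n}_{\text{down again}}
  = 2\log_m n,
\qquad
B_{\text{bisection}} = \frac{n}{2}\,b \ \text{bytes/s}
```

The bisection figure holds because any cut separating half the PEs must pass near the root, and a fat tree sizes the upper links so their aggregate capacity matches the n/2 leaf links below them.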
Networks - Performance Metrics • Mesh • Diameter: 2√n − 2 • Bisection Bandwidth: b·√n bytes/sec • Order: 4
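For a √n × √n mesh these follow from the grid geometry (again a standard derivation, added for completeness):

```latex
\text{diameter}
  = \underbrace{(\sqrt{n}-1)}_{\text{across}}
  + \underbrace{(\sqrt{n}-1)}_{\text{down}}
  = 2\sqrt{n}-2,
\qquad
B_{\text{bisection}} = b\sqrt{n} \ \text{bytes/s}
```

The diameter is the Manhattan distance between opposite corners; a vertical cut down the middle severs exactly √n horizontal links, giving the bisection figure.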
Networks - Performance Metrics • Hypercube • Hypercube of order m • Link two order m−1 hypercubes with 2^(m−1) links • Number of PEs: n = 2^m • Order: log_2 n = m • [Figure: two order-2 hypercubes joined to form an order-3 hypercube]
Networks - Hypercubes • Embedding property • In an n PE hypercube, we have hypercubes of size n/2, n/4, … • Number PEs with binary numbers • 000, 001, 010, 011, 100, … • Joining two hypercubes • add one binary digit to the numbering • Each PE is connected to every PE whose index differs in only one bit
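That last property gives a one-line adjacency test. A minimal sketch in C (the function name is mine, not the lecture's): two indices are neighbours iff their XOR has exactly one bit set, i.e. is a power of two.

```c
#include <stdbool.h>

/* Two PEs in a hypercube are directly linked iff their binary
   indices differ in exactly one bit position.                 */
bool are_neighbours(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;                        /* bits where a and b differ */
    return diff != 0 && (diff & (diff - 1)) == 0; /* exactly one bit is set    */
}
```

For example, in an order-3 cube PE 010 is linked to 011, 000 and 110: one neighbour per bit position.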
Networks - Hypercubes • Embedding property • Partitioning tasks • Allocate to sub-cubes • Sub-tasks allocated to sub-cubes of that cube, etc.
VLIW - Very Long Instruction Word • Instruction word: multiple operations • n RISC-style instructions • Architecture: fixed set of functional units • Each FU matched to a “slot” in the instruction
VLIW - Very Long Instruction Word • Compiler responsible for allocating instructions to words • Burden squarely on compiler • Needs to produce a near-optimal schedule • Inevitable: large number of empty slots! • Lower code density • Similar to superscalar • but instruction issue flexibility missing • VLIW simpler ⇒ faster? • Re-compilation needed • Each new generation will have a different functional unit mix
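To make the slot/word picture concrete, here is a hedged sketch of one VLIW instruction word; the five-slot mix is a hypothetical machine, not any particular VLIW design:

```c
/* Hypothetical VLIW instruction word: one slot per functional unit. */
typedef struct {
    unsigned alu0;    /* integer ALU slot 0  */
    unsigned alu1;    /* integer ALU slot 1  */
    unsigned fpu;     /* floating-point slot */
    unsigned mem;     /* load/store slot     */
    unsigned branch;  /* branch slot         */
} VLIWWord;

#define NOP 0u  /* slots the compiler cannot fill are padded with NOPs */
```

Every cycle the whole word issues at once; when the compiler cannot find five independent operations, the unused slots carry NOPs, which is exactly why code density suffers.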
Synchronous Logic Systems • Clock distribution • Major problem for the chip architect • Clock skew must be kept below 100-200 ps over the whole die • ~10% of cycle time • Small changes • May require re-engineering the whole chip • Re-checking for data hazards & logic races
Synchronous Logic Systems • Clock distribution • Power consumption • Major problem at 30 W+ per chip • CMOS logic consumes power only on switching • but synchronous systems clock a lot of logic on every cycle • Clock is distributed to every subsystem • Even if the logic of the subsystem is disabled!
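The standard first-order model of CMOS dynamic power makes the point quantitative (a textbook formula, not from the slides):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

Here α is the fraction of the switched capacitance C that actually toggles per cycle, V_dd is the supply voltage and f the clock frequency; a global clock keeps α high in every subsystem it reaches, even idle ones.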
Synchronous Logic Systems • Clock distribution • Power consumption • Worst-case propagation delay • Determines maximum clock speed • Clock edge must wait until all logic has settled • Temperature and fabrication-process variations • Force even slower clocks • Design is simpler • Logic designers have experience • Good tools
Asynchronous Logic Systems • Clock distribution • No longer a problem • Synchronisation bundled with data • Circuits are composable • No global clock … • No need to re-engineer a whole chip to change one section! • Known correct circuits can be combined • Power consumption • Circuits switch only when they’re computing • Potentially very low power consumption • May be the biggest attraction of asynch systems!
Asynchronous Logic Systems • Clock distribution problem removed • Circuits are composable • Power consumption • Average-case propagation delay • Completion signal generated when the result is available • Independent of temperature and fabrication-process variations • Design is harder • Experience will remove this?