Hardwired networks on chip for FPGAs and their applications. Kees Goossens (TU Delft, NXP), Muhammad Aqeel Wahlah (TU Delft). overview: applications • network on chip • FPGA • key ideas • hardwired NOC • unified interconnect
Hardwired networks on chip for FPGAs and their applications
Kees Goossens (TU Delft, NXP)
Muhammad Aqeel Wahlah (TU Delft)
overview
• applications
• network on chip
• FPGA
• key ideas
  • hardwired NOC
  • unified interconnect
  • data coercion / type casting
• application: dynamic partial reconfiguration
  • multiple concurrent applications
  • multiplex sub-applications (“hardware tasks”)
• example
• conclusions
applications
• task / function mapped on IP
  • includes local storage / buffering
• application: set of communicating IPs / tasks / ...
  • data, control, code
  • communication via connections
• use case: set of concurrent applications
[diagram: blocks BA, A1, A2, BAC, C1, C2, C3, T1, T2, T3]
network on chip (NOC)
• connects ports on hardware blocks (IP)
  • data, control
• connections: virtual wires
  • real-time / quality of service
• programmable at run-time
  • set up & remove connections by programming control registers in the NOC
• styles of communication
  • address-based / memory-mapped
  • streaming
[diagram: IPs (T1, T2, T3, A1, A2, BA, BAC) attached through NIs to the routers (R) of the NOC]
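Setting up and removing a connection by programming control registers can be sketched as below. This is a minimal illustration only: the register names (`ENABLE`, `DEST`, `SLOTS`), the number of connections per NI and the write order are assumptions made for the example, not the register map of any real NOC.

```python
class NetworkInterface:
    """Toy NI whose behaviour is defined entirely by MMIO-style registers."""

    def __init__(self, name, num_connections=4):
        self.name = name
        # One register set per connection; the register names are hypothetical.
        self.regs = [{"ENABLE": 0, "DEST": None, "SLOTS": 0}
                     for _ in range(num_connections)]

    def write(self, conn, reg, value):
        """Model a single MMIO register write."""
        self.regs[conn][reg] = value


def setup_connection(ni, conn, dest, slots):
    """Open a virtual wire: program destination and bandwidth, then enable."""
    ni.write(conn, "DEST", dest)
    ni.write(conn, "SLOTS", slots)
    ni.write(conn, "ENABLE", 1)   # enable last, once the settings are in place


def remove_connection(ni, conn):
    """Close the virtual wire: disable first, then release its settings."""
    ni.write(conn, "ENABLE", 0)
    ni.write(conn, "SLOTS", 0)
    ni.write(conn, "DEST", None)


ni_a1 = NetworkInterface("NI_A1")
setup_connection(ni_a1, conn=0, dest="NI_A2", slots=2)
state_after_setup = dict(ni_a1.regs[0])
remove_connection(ni_a1, conn=0)
state_after_remove = dict(ni_a1.regs[0])
```

The point of the sketch is that no bitstream is involved: a connection exists or not purely as a result of a handful of register writes at run-time.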
FPGA fabric
• soft IP are configured in
  • configurable elements (LUT)
  • and switch boxes (not shown)
  • with a given configuration granularity (frame) using the configuration interconnect (ICAP)
• hard IP
  • CPU
  • on-chip memories (BRAM, ...)
  • off-chip memory interfaces
  • decryption IP
  • etc.
• configuration: bitstream loading
• programming / control: set MMIO registers
• Xilinx terminology (frames, ICAP, etc.)
[diagram: LUT regions, IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories, ICAP]
application on FPGA
• design an application as for ASIC
  • IPs, interconnect, storage, sw
• but map on soft & hard IP resources
• traditionally have separate soft data and control interconnects
• could also use soft NOC for both
[diagram: A1, A2, BA and BAC mapped on frames, joined by a soft control interconnect and a soft data interconnect; hard blocks: IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories, ICAP]
multiple applications on FPGA
• interconnects and IPs of different applications share reconfiguration regions (frames)
• dynamic reconfiguration is global, not partial
[diagram: A1, A2, BA, BAC, T1, T2 and T3 mapped on LUT regions, joined by a soft control interconnect and a soft data interconnect; hard blocks: IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories, ICAP]
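Why shared frames make reconfiguration effectively global can be illustrated with a small sketch. The placement below is entirely made up for the example (frame numbers and which block occupies which frame are assumptions); it only shows the reasoning: rewriting the frames of one block disturbs every other block that has logic in the same frames.

```python
# Hypothetical placement: the set of frames that each soft block occupies.
placement = {
    "A1": {0, 1},
    "A2": {1, 2},
    "T1": {2, 3},
    "soft data interconnect": {0, 1, 2, 3},
}


def disturbed_by(reconfigured_block, placement):
    """Return the other blocks sharing at least one frame with the given block."""
    frames = placement[reconfigured_block]
    return sorted(name for name, used in placement.items()
                  if name != reconfigured_block and used & frames)


affected = disturbed_by("T1", placement)
print(affected)
```

In this made-up placement, swapping T1 touches frames that also hold A2 and the soft interconnect, so an unrelated application is affected as well.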
overview
• applications
• network on chip
• FPGA
• key ideas
  • hardwired NOC → improved performance:cost
  • unified interconnect → flexibility
  • data coercion / type casting → cool (and useful) applications
• application: dynamic partial reconfiguration
  • multiple concurrent applications
  • multiplex sub-applications (“hardware tasks”)
• example
• conclusions
1. hardwired interconnect
• replace soft interconnect(s) by hard interconnect(s)
• connect reconfigurable regions of LUTs (CFR)
• bit-level reconfigurability (CFR)
  • switch boxes
• transaction-level reconfigurability (NOC)
  • routers, NIs
  • memory mapped / streaming
[Hecht FPL’05]
[diagram: hard interconnect(s) connecting IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories, ICAP and the CFRs holding A1, A2, BA, BAC, T1, T2, T3]
1. hardwired interconnect
• ~35× smaller area
• ~3.5× higher speed
• ~150× better perf:cost ratio (bits/sec/area)
• ~200× smaller configuration footprint (program MMIO, no bitstream)
• ~200× faster soft IP load & boot
• dynamic partial reconfiguration
  • no constraints on soft IP placement due to communication
• loss of flexibility
  • fewer LUTs
  • CFR = frame
• 7% hard NOC
[based on Virtex4 & Aethereal NOC, Goossens NOCS’08]
[diagram: hard interconnect(s) connecting IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories, ICAP and the CFRs holding C1, C2, C3, BAC, T1, T2, T3]
performance & cost
• essentially, it all depends on
  • area soft:hard ≈ 35:1
  • speed hard:soft ≈ 3.5:1
  • configuration footprint of soft NOC (bitstream) : programming footprint of hard NOC (MMIO registers) ≈ 214:1
• resulting in
  • boot time soft:hard ≈ 200:1
  • functional performance:cost (bits/sec/area) soft:hard ≈ 1:147
performance & cost
• configuration speed
  • 1.9 Gb/s for dedicated configuration interconnect (ICAP)
  • 8 Gb/s for hard NOC
• programming speed
  • 118 MHz soft NOC
  • 500 MHz hard NOC
• configuration footprint for soft NOC
  • 1.8 Mb (8300 LUTs per router+NI)
• programming footprint for hard NOC
  • 2100 bit per connection
• thus to configure & program an NI
  • 1 msec for soft NOC
  • 10.6 μsec for hard NOC
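Some of the figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted on the slides, except for one clearly marked assumption: the number of connections per NI is not given, and the value 4 is our guess, chosen because it happens to reproduce the quoted 214:1 footprint ratio. The 10.6 μsec figure for the hard NOC is not derived here.

```python
# Figures quoted on the slides.
ICAP_RATE_BPS = 1.9e9             # dedicated configuration interconnect (ICAP)
SOFT_FOOTPRINT_BITS = 1.8e6       # bitstream for one soft router + NI
HARD_BITS_PER_CONNECTION = 2100   # MMIO programming footprint per connection
SOFT_CLOCK_HZ = 118e6
HARD_CLOCK_HZ = 500e6

# Time to load the soft NOC bitstream through ICAP: footprint / rate.
soft_config_time_s = SOFT_FOOTPRINT_BITS / ICAP_RATE_BPS

# Ratio of the two programming clocks.
clock_ratio = HARD_CLOCK_HZ / SOFT_CLOCK_HZ

# ASSUMPTION (not on the slides): four connections are programmed per NI.
ASSUMED_CONNECTIONS_PER_NI = 4
hard_footprint_bits = ASSUMED_CONNECTIONS_PER_NI * HARD_BITS_PER_CONNECTION
footprint_ratio = SOFT_FOOTPRINT_BITS / hard_footprint_bits

print(f"soft configuration time: {soft_config_time_s * 1e3:.2f} ms")
print(f"hard:soft clock ratio:   {clock_ratio:.2f}")
print(f"footprint ratio:         {footprint_ratio:.0f}:1")
```

The first result, about 0.95 ms, is consistent with the quoted 1 msec for the soft NOC; the clock ratio of about 4.2 is of the same order as the ≈3.5 speed ratio quoted earlier.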
2. unified interconnect
• one interconnect (e.g. NOC) for
  • data for functional mode
  • control for programming
  • bitstreams for configuration
• dynamic partitioning of different interconnects
[diagram: single hard interconnect connecting IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories, ICAP and the CFRs holding A1, A2, BA, BAC, T1, T2, T3]
3. data coercion
• data = control = bitstream = test = …
• connect a data port to a configuration port
  • decrypt bitstreams
[diagram: single hard interconnect connecting IO processor, CPU, de/encrypt accelerator, off-chip memory, on-chip memories and CFRs; traffic labelled bitstream and data]
3. data coercion
• data = control = bitstream = test = …
• connect a data port to a configuration port
  • decrypt bitstreams
  • relocate bitstreams
  • run-time compute / optimise bitstreams
    • JIT, peephole
[diagram: single hard interconnect connecting IO processor, CPU, PH, IP, de/encrypt accelerator, off-chip memory, on-chip memories and CFRs; traffic labelled bitstream]
3. data coercion
• data = control = bitstream = test = …
• connect a data port to a configuration port
  • decrypt bitstreams
  • relocate bitstreams
  • run-time compute / optimise bitstreams
    • JIT, peephole
• data port to test port (NOC as TAM)
  • on-line (structural) testing
  • on-chip test-vector generation
[diagram: single hard interconnect connecting IO processor, CPU, PH, IP, de/encrypt accelerator, off-chip memory, on-chip memories and CFRs; traffic labelled bitstream]
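The idea that the same bytes are data or bitstream depending only on the port they arrive at can be sketched as follows. Everything here is illustrative: the port classes, the `send` helper and the byte values are invented for the example, and the XOR function is a toy stand-in for the de/encrypt accelerator, not a real cipher.

```python
def xor_crypt(data: bytes, key: int) -> bytes:
    """Stand-in for the de/encrypt accelerator (toy XOR, not a real cipher)."""
    return bytes(b ^ key for b in data)


class ConfigPort:
    """Configuration port of a CFR: whatever arrives is treated as bitstream."""
    def __init__(self):
        self.loaded = b""

    def receive(self, payload: bytes):
        self.loaded += payload


class DataPort:
    """Ordinary data port of an IP: whatever arrives is treated as data."""
    def __init__(self):
        self.buffer = b""

    def receive(self, payload: bytes):
        self.buffer += payload


def send(payload: bytes, stages, sink):
    """Move bytes over one connection, through optional processing stages."""
    for stage in stages:
        payload = stage(payload)
    sink.receive(payload)


KEY = 0x5A
plain_bitstream = b"\x01\x02\x03\x04"        # made-up frame contents
stored = xor_crypt(plain_bitstream, KEY)     # encrypted copy, e.g. in off-chip memory

cfr_config = ConfigPort()
ip_data = DataPort()

# Same bytes, same interconnect: only the destination port decides their meaning.
send(stored, [lambda p: xor_crypt(p, KEY)], cfr_config)   # decrypt, then configure
send(stored, [], ip_data)                                 # deliver as plain data
```

Relocation or run-time optimisation of bitstreams would, in this picture, simply be additional stages placed between the memory and the configuration port.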
overview
• applications
• network on chip
• FPGA
• key ideas
  • hardwired NOC
  • unified interconnect
  • data coercion / type casting
• application: dynamic partial reconfiguration
  • multiple concurrent applications
  • multiplex sub-applications (“hardware tasks”)
• example
• conclusions
dynamic partial reconfiguration: idea
• “hardware operating system” implements run-time scheduling of
  • multiple concurrent applications
    • independent applications on own virtual platform
    • no communication, no interference
    • “performance virtualisation”
    • activation given by user, environment, etc.
[diagram: timeline (time axis) with labels app T, A, AC, app D; blocks BA, A1, A2, BAC, C1, C2, C3, T1, T2, T3]
dynamic partial reconfiguration: idea
• “hardware operating system” implements run-time scheduling of
  • multiple concurrent applications
  • parts of single applications (soft IP, “hardware tasks”)
    • multiplex parts of a single application on same resources
[diagram: timeline (time axis) with labels app T, A, C, app D; sub-app A or sub-app C; blocks BA, A1, A2, C1, C2, C3]
dynamic partial reconfiguration: idea
• “hardware operating system” implements run-time scheduling of
  • multiple concurrent applications
  • parts of single applications (soft IP, “hardware tasks”)
    • multiplex parts of a single application on same resources
    • internal state
[diagram: timeline (time axis) with labels app T, A, C, app D, state; blocks BA, A1, A2, BAC, C1, C2, C3]
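Multiplexing two sub-applications on the same resources while keeping their internal state can be sketched as below. The `CFR` class, the `switch` and `run` helpers and the schedule are all hypothetical; "state" is reduced to a step counter and bitstream loading to an assignment, purely to show the save / load / restore order.

```python
class CFR:
    """One configurable region that holds a single sub-application at a time."""
    def __init__(self):
        self.loaded = None
        self.state = None


def switch(cfr, saved_states, sub_app):
    """Swap sub_app into the CFR, preserving the state of the one swapped out."""
    if cfr.loaded is not None:
        saved_states[cfr.loaded] = cfr.state     # save internal state
    cfr.loaded = sub_app                         # stands in for bitstream loading
    cfr.state = saved_states.get(sub_app, 0)     # restore, or start fresh


def run(cfr, steps):
    """Pretend to execute: the 'state' is just a step counter."""
    cfr.state += steps


cfr = CFR()
saved = {}
schedule = [("A", 3), ("C", 2), ("A", 4)]   # hypothetical run-time schedule
for sub_app, steps in schedule:
    switch(cfr, saved, sub_app)
    run(cfr, steps)
switch(cfr, saved, "C")   # final swap so that A's state is written back
```

After this schedule sub-application A has accumulated all seven of its steps even though it was swapped out in between, which is the property that persistent internal state has to guarantee.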
dynamic partial reconfiguration: implementation
• system manager
  • resource management (CFR, NOC, memory, …)
  • inter-application virtual platforms
[diagram: timeline (time axis) with labels T, A, C, BAC; system manager and two application managers]
dynamic partial reconfiguration: implementation
• system manager
  • resource management (CFR, NOC, memory, …)
  • inter-application virtual platforms
  • intra-application phases
  • NOC programming
  • soft IP / (sub)-application configuration (incl. clock, reset)
  • bottleneck?
[diagram: timeline (time axis) with labels A, C, BAC; system manager and application manager]
dynamic partial reconfiguration: implementation
• system manager
• application manager
  • application programming
[diagram: timeline (time axis) with labels T, A, C, BAC; system manager and two application managers]
dynamic partial reconfiguration: implementation
• system manager
• application manager
  • application programming
  • intra-application persistent data management
[diagram: timeline (time axis) with labels A, C, BAC, state; system manager and application manager; blocks BA, A1, A2, BAC, C1, C2, C3]
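The split between a system manager (resources, virtual platforms between applications) and an application manager (phases inside one application) can be sketched as follows. The class and method names, the CFR names and the "first free CFRs" allocation policy are assumptions made for this example; only the division of responsibilities comes from the slides.

```python
class ApplicationManager:
    """Manages phases (sub-applications) inside the CFRs it was granted."""

    def __init__(self, app, cfrs):
        self.app = app
        self.cfrs = cfrs
        self.phase = None

    def enter_phase(self, phase):
        self.phase = phase   # stands in for NOC programming + configuration


class SystemManager:
    """Hands out CFRs so that each application gets its own virtual platform."""

    def __init__(self, cfrs):
        self.free = set(cfrs)
        self.owner = {}      # CFR name -> owning application

    def start(self, app, needed):
        """Reserve `needed` CFRs for app, or refuse without side effects."""
        if needed > len(self.free):
            return None
        granted = sorted(self.free)[:needed]
        for cfr in granted:
            self.free.remove(cfr)
            self.owner[cfr] = app
        return ApplicationManager(app, granted)

    def stop(self, app):
        """Release every CFR owned by app."""
        for cfr in [c for c, a in self.owner.items() if a == app]:
            del self.owner[cfr]
            self.free.add(cfr)


system = SystemManager(["CFR0", "CFR1", "CFR2", "CFR3"])
app_t = system.start("T", needed=3)
refused = system.start("AC", needed=2)   # refused: only one CFR is left
system.stop("T")
app_ac = system.start("AC", needed=2)
app_ac.enter_phase("A")
```

Because the application manager only ever sees the CFRs it was granted, phase changes inside one application cannot touch the resources of another, which is the "no interference" property of the virtual platforms.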
overview
• applications
• FPGA
• network on chip
• key ideas
  • hardwired NOC
  • unified interconnect
  • data coercion / type casting
• application: dynamic partial reconfiguration
  • multiple concurrent applications
  • multiplex sub-applications (“hardware tasks”)
• example
• conclusions
modelling
• SystemC
  • bit & cycle accurate NOC model
  • behavioural CFR models
    • accurate bitstream structure
  • behavioural hard IP models
• model
  • starting / stopping of applications
    • dynamic, based on user input
  • starting / stopping of sub-applications
    • dynamic, based on flow of data
  • configuration: loading of bitstreams for soft IP; clock & reset
  • programming: of NOC, system & sub-application managers
  • management of persistent state
example
• system manager
  • program NOC for configuration
[diagram: single hard interconnect connecting IO processor, CPU (system manager), application manager, de/encrypt accelerator, off-chip memory, on-chip memories and the CFRs for A1, A2, BA, BAC]
example
• system manager
  • program NOC for configuration
  • configure: load bitstreams
    • including bitstream syntax, etc.
[diagram: single hard interconnect connecting IO processor, CPU (system manager), application manager, de/encrypt accelerator, off-chip memory, on-chip memories and the CFRs for A1, A2, BA, BAC; traffic labelled bitstream, programming, data]
example
• system manager
  • program NOC for configuration
  • configure: load bitstreams
  • program NOC for (sub)-application A
[diagram: single hard interconnect connecting IO processor, CPU (system manager), application manager, de/encrypt accelerator, off-chip memory, on-chip memories and the CFRs for A1, A2, BA, BAC; traffic labelled bitstream, programming, data]
example
• system manager
  • program NOC for configuration
  • configure: load bitstreams
  • program NOC for (sub)-application A
  • program & start application manager
    • including clocking & reset
[diagram: single hard interconnect connecting IO processor, CPU (system manager), application manager, de/encrypt accelerator, off-chip memory, on-chip memories and the CFRs for A1, A2, BA, BAC; traffic labelled bitstream, programming, data]
example
• system manager
  • program NOC for configuration
  • configure: load bitstreams
  • program NOC for (sub)-application A
  • program & start application manager
• application manager
  • programs & starts sub-app A
    • soft IP function is modelled by CFR
[diagram: single hard interconnect connecting IO processor, CPU (system manager), application manager, de/encrypt accelerator, off-chip memory, on-chip memories and the CFRs for A1, A2, BA, BAC; traffic labelled bitstream, programming, data]
example
• system manager
  • program NOC for configuration
  • configure: load bitstreams
  • program NOC for (sub)-application A
  • program & start application manager
• application manager
  • programs & starts sub-app A
• sub-application A runs
[diagram: single hard interconnect connecting IO processor, CPU (system manager), application manager, de/encrypt accelerator, off-chip memory, on-chip memories and the CFRs for A1, A2, BA, BAC; traffic labelled bitstream, programming, data]
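The order of the steps in this example can be written down as a small trace. The `Trace` class and the two functions are hypothetical helpers for illustration; they only record who issues which step and in what order, and do not model the NOC, the bitstreams or clocking and reset.

```python
class Trace:
    """Records the order in which steps are issued."""
    def __init__(self):
        self.steps = []

    def log(self, actor, action):
        self.steps.append((actor, action))


def application_manager(trace, sub_app):
    """Steps taken by the application manager once it has been started."""
    trace.log("application manager", f"program and start sub-app {sub_app}")
    trace.log(f"sub-app {sub_app}", "run")


def system_manager(trace, sub_app):
    """Steps taken by the system manager, ending with the hand-over."""
    trace.log("system manager", "program NOC for configuration")
    trace.log("system manager", "configure: load bitstreams")
    trace.log("system manager", f"program NOC for (sub)-application {sub_app}")
    trace.log("system manager", "program and start application manager")
    application_manager(trace, sub_app)


trace = Trace()
system_manager(trace, "A")
for actor, action in trace.steps:
    print(f"{actor}: {action}")
```

Note that the NOC is programmed twice: first so that bitstreams can reach the configuration ports, then again for the functional connections of the (sub)-application itself.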
conclusions
• ideas:
  • hardwired NOC → performance:cost
  • unified interconnects → hardware multi-tasking
  • data coercion / type casting → cool & useful
• very detailed model
  • many simplifications & restrictions
• many open issues
  • design flow: soft IP placement, binding, relocation, etc. [Madsen?]
  • application model:
    • extend use-case model with intra-application dynamism
    • more general notions of persistent state
  • implementation: separation of system & application managers