ECE 697F Reconfigurable Computing Lecture 4 Contrasting Processors: Fixed and Configurable

ECE 697FReconfigurable ComputingLecture 4Contrasting Processors: Fixed and Configurable

Overview • Three types of FPGAs • EEPROM • SRAM • Antifuse • SRAM FPGA architectural choices. • FPGA logic blocks -> size versus performance. • FPGA switch boxes • State-of-the-art • Research issues in architecture.

What is Computation? • Calculating predictable data outputs from data inputs. What should we expect from a computing device? • Gives correct answer. • Takes up finite space • Computes in finite time • Can solve all problems? • Compilation • Implementation Other issues

Performance Full Custom Gate Array high FPGA Processor low Compilation time low high Compilation • How long does it take to “map” an idea to hardware? • Why is the processor so “easy” to target for compilation?

What are variables in Computation? • Time -> How long does it take to compute the answer? • Area -> How much silicon space is required to determined the answer? • Processor generally fixes computing area. Problem evaluated over time through instructions. • FPGA can create flexible amount of computing area. Effectively, the configuration memory is the computing instruction.

8λ 8λ 8λ 3λ 5λ spacing metal 3 overlap metal 2+3 Measuring Feature Size • Current FPGAs follow the same technology curve as microprocessors. • Difficult to compare device sizes across generations so we use a fixed metric, lambda ( ). • Lambda defines basic feature sizes in the VLSI device. λ

Toward Computational Comparison Dehon metrics: Computationaldensityofadevice 4 input gate-evaluations λ2 x s Processor: 2 x NALU x WALU Aproc x tcycle FPGA: N4lut Aarray x tcycle

A A B B Degradation • FPGA can’t really be clocked at 1/7 ns due to interconnect. • Consider the Bubblesort block from the first class. compare If (A > B) { H = A; L = B; } else { H = B; L = A; } H requires 33 LUT delays Co 0 0 0 1 0 1 1 1 S 0 1 1 0 1 0 0 1 Ci 0 0 0 0 1 1 1 1 A 0 0 1 1 0 0 1 1 B 0 1 0 1 0 1 0 1

New Comparison • Processor required three cycles at 500 MHz • FPGA requires 33 LUTs delays per computation. Could consider other parts of design.

A6 A0 A2 A4 A7 A1 A3 A5 H H H H L L L L Parallelization • How this performance factor change over time? – through parallelization. For a given operation ge/(λ2.s) seems the same -> 7 However, multiple comparisons could be performed in parallel. Now FPGA metric is 28 Of course, device may be only partially filled.

constant variable Specialization • Example: encryption

A B operation +, -, |, x Instructions • Many applications have little parallelism or have variable hardware requirements during execution. • Here using more area doesn’t increase computational density. • Better to reuse hardware through instructions

op multi-bit Single-Instruction Multiple Data • Same instruction distributed to fine-grained cells. • Typically organized as 2-D array • Ideal for image processing • Typically fixed hardware located in cell

. . . . . . From local state or other array elements To local state or other array elements Global Instruction common to all elements Computation Unit for SIMD • Performs different operation on every cycle • Easy to distribute instructions on device (use global lines) • Some local storage for data in each tile

. . . . . . From local state or other array elements To local state or other array elements Static instruction distinct for each array element Computation Unit for FPGA • Performs same operation on every cycle • No global distribution of instructions at all (stored locally) • Also has local storage for data.

Context Identifier Address Inputs (Inst. Store) . . . Computation Unit (LUT) . . . in Programming may differ for each element out Hybrid Architecture • Configuration selects operation of computation unit • Context identifier changes over time to allow change in functionality • DPGA – Dynamically Programmable Gate Array

A3 A2 A1 + + + O2 O1 O3 B1 B3 B2 DPGA • How many applications require this flexibility • Efficient techniques needed to schedule when functionality shifts. • Added configuration allows for functionality to change quickly • Doubles SRAM storage requirement A0 + O0 B0 context identifier

Actxt80Kl2 dense encoding Abase800Kl2 Slides: courtesy DeHon Actxt :Abase = 1:10 Multicontext Organization/Area

Example: DPGA Prototype

FPGA vs. DPGA Compare

Example: DPGA Area

Context ID Config Store LUT Out Configuration Caching • What if I swap out some unused configurations while they are not used? • Separate hardware to write given locations in hardware (config mem) and not interrupt circuit operation • Just like cache prefetching

Hierarchical FPGA • Predictable Delay • Two dimensional layout • Limited connectivity

s D Q Q D Buffering • Pipelining interconnect comes at an area cost • Also could consider buffering Unpipelined s Pipelined s 18 transistors

What about this circuit? • Retiming needed for hierarchical device. • Number of registers proportional to longest path. Complicates design Software, debugging Need to schedule communication LUT

PLD (Programmable Logic Device) • All layers already exist • Designers can purchase an IC • Connections on the IC are either created or destroyed to implement desired functionality • Field-Programmable Gate Array (FPGA) very popular • Benefits • Low NRE costs, almost instant IC availability • Drawbacks • Penalty on area, cost (perhaps $30 per unit), performance, and power • Acknowledgement: Mishra

Compilation/ Synthesis Libraries/ IP Test/ Verification System specification System synthesis Hw/Sw/ OS Model simulat./ checkers Compilation/Synthesis: Automates exploration and insertion of implementation details for lower level. Behavioral specification Behavior synthesis Cores Hw-Sw cosimulators Libraries/IP: Incorporates pre-designed implementation from lower abstraction level into higher level. RT specification RT synthesis RT components HDL simulators Test/Verification: Ensures correct functionality at each level, thus reducing costly iterations between levels. Logic specification Logic synthesis Gates/ Cells Gate simulators To final implementation Design Technology • The manner in which we convert our concept of desired system functionality into an implementation

10,000 100,000 1,000 10,000 100 1000 Logic transistors per chip (in millions) Gap Productivity (K) Trans./Staff-Mo. 10 100 IC capacity 1 10 0.1 1 productivity 0.01 0.1 0.001 0.01 1981 1983 1985 1987 1989 1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 Design productivity gap • 1981 leading edge chip required 100 man-months • 10,000 transistors / 100 transistors/month • 2002 leading edge chip requires 30K man-months • 150,000,000 / 5000 transistors/month • Designer cost increase from $1M to $300M

Team 15 60000 16 16 18 50000 19 40000 23 24 30000 Months until completion 20000 43 Individual 10000 0 10 20 30 40 Number of designers The mythical man-month • In theory, adding designers to team reduces project completion time • In reality, productivity per designer decreases due to complexities of team management and communication overhead • In the software community, known as “the mythical man-month” (Brooks 1975) • At some point, can actually lengthen project completion time! • 1M transistors, one designer=5000 trans/month • Each additional designer reduces for 100 trans/month • So 2 designers produce 4900 trans/month each

Summary • Interesting similarities between processor and reconfigurable device • Processors are reconfigured on every clock cycle using an instruction • FPGAs configured once at beginning of computation • DPGAs blur the line – run-time reconfiguration • Numerous challenges to reconfiguration • When • How • Performance benefit?

ECE 697F Reconfigurable Computing Lecture 4 Contrasting Processors: Fixed and Configurable

ECE 697F Reconfigurable Computing Lecture 4 Contrasting Processors: Fixed and Configurable

Presentation Transcript

ECE 636 Reconfigurable Computing Lecture 3 Field Programmable Gate Arrays II

ECE 636 Reconfigurable Computing Lecture 9 Multi-FPGA System Software

ECE 636 Reconfigurable Computing Lecture 16 Power Reductions Techniques for FPGAs

ECE 636 Reconfigurable Computing Lecture 1 Course Introduction Prof. Russell Tessier

ECE 506 Reconfigurable Computing Lecture 8 FPGA Placement

Lecture 14 Reconfigurable Computing Basics

ECE 506 Reconfigurable Computing Lecture 7 FPGA Placement

ECE 636 Reconfigurable Computing Lecture 12 High-Level Compilation

ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review

Lecture 11: Interfaces, I/O and Configurable Processors

ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture

ECE 636 Reconfigurable Computing Lecture 11 Reconfigurable Computing Applications

Configurable, reconfigurable, and run-time reconfigurable computing

ECE 697F Reconfigurable Computing Lecture 5 Technology Mapping: Packing Logic into LUTs

ECE 636 Reconfigurable Computing Lecture 5 FPGA Routing

ECE 697F Reconfigurable Computing Lecture 6 Mapping to Embedded Memory and PLAs

ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

ECE 636 Reconfigurable Computing Lecture 15 Reconfigurable Coprocessors

ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

ECE 697F Reconfigurable Computing Lecture 11 FPGA-Based System Design

Reconfigurable Computing Lecture 14

Lecture 13: (Re)configurable Computing