HPEC using FPGAs

HPEC using FPGAs: Challenges and Benefits • Dr. Aravind Dasu, Assistant Professor, Electrical & Computer Engineering

Utah State University • Cache Valley, 90 miles north of Salt Lake City • David Sant Engineering Innovation Building

Agenda • On-board computing for Spacecraft • A primer on FPGAs (5 slides) • HPEC using FPGAs (26 slides) • The Polymorphic Systolic Array Framework • Improving productivity • Enabling real time and responsive reconfiguration • Future technologies for FPGAs • Acknowledgements

On-board Computing • Civilian and Military space missions getting more complex • Need to support several types of data from several types of sensors • Missions will require spacecraft computer to be more responsive • Need for In-situ data processing (signal processing) • Not just compression, but data analysis, decision making etc. • Power budget, form factors of spacecraft computer extremely tight • State of the art RadHard microprocessor from BAE systems or RISC processor? • Aging workhorse, time to upgrade big time

So, what do we upgrade to? • Commodity Microprocessors • Cell, GPU, Many/Multi core • Very powerful • Blows out the power budget • RadHard parts need to be custom ordered • Commodity DSP chips • Good as long as you stick to just one chip • RadHard parts can be custom ordered • Commodity Reconfigurable chips • FPGAs (field programmable gate arrays) • Can perform like a custom silicon chip • Best performance/power ratios • RadHard parts already available with steady roadmap from Xilinx

Programming perspective • Microprocessors • Optimistic view point • DSP chips • FPGAs Raw ingredients Frozen pizza Take ‘n’ bake

Quick Primer on FPGAs • Mixture of blocks on a die • Some dedicated • DSP (MAC units) • PPC (optional) • RAM • Some programmable • Look Up Tables (LUT) • Gazillions of network switches • Hidden • Special circuit • ICAP (internal configuration access port)

Simple View of Programming an FPGA Each bit stored in a small memory cell (SRAM) • An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic All computations are assumed to be based on Boolean Logic So, Problem solving concept => algorithms Algorithms => Discrete set of simple tasks (add/multiply…) Simple tasks => A set of Boolean functions talking to each other Boolean function=> simple manipulation of 1 and 0 bits NMOS transistor

Programming an FPGA • Each Look Up Table (LUT) has a unique mailing address • 16 bits go into each Look Up Table (LUT) • Each routing switch has a unique mailing address • One bit for each switch • Executable for an FPGA is sequence of bits that have to be delivered precisely to each LUT and Switch Box • This binary/executable is called “Configuration Bitstream” or simply “Bitstream”
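A minimal sketch of the look-up table idea, assuming the 16 bits delivered to each LUT are the truth table of a 4-input Boolean function. The class name `Lut4` and the helper `config_for` are hypothetical names used only for illustration; they do not model any vendor's actual LUT layout.

```python
class Lut4:
    """Toy 4-input look-up table: 16 configuration bits, one per input combination."""

    def __init__(self, config_bits):
        if not 0 <= config_bits < (1 << 16):
            raise ValueError("a 4-input LUT holds exactly 16 configuration bits")
        self.config_bits = config_bits

    def evaluate(self, a, b, c, d):
        # The four inputs act as an address into the 16 stored bits.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return (self.config_bits >> index) & 1


def config_for(function):
    """Build the 16-bit configuration word for any Boolean function of 4 inputs."""
    bits = 0
    for index in range(16):
        a, b, c, d = index & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        if function(a, b, c, d):
            bits |= 1 << index
    return bits


# The same hardware cell becomes a parity gate or an AND gate
# depending only on which 16 bits were loaded into it.
xor4 = Lut4(config_for(lambda a, b, c, d: a ^ b ^ c ^ d))
and4 = Lut4(config_for(lambda a, b, c, d: a & b & c & d))
```

The point of the sketch is that "programming" changes only the stored bits, never the cell itself, which is why a bitstream is just a long sequence of such bits plus the addresses they belong to.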

Programming an FPGA • Programming the FPGA is like having a Mailman deliver bits to each address correctly • Slow process • But a Bitstream is slightly more complex • Each FPGA is like a Country (has a unique code) • A “Bitstream” before entering the chip has to undergo security clearance (CRC or cyclic redundancy check) • Port of Entry = ICAP • FPGA addresses are hierarchical (state, county, city, suburb, house address) • Term used for encoding all this overhead is “Frame Address” • All this address stuff is overhead • Actual useful stuff is inside the mail envelope
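The mail analogy can be sketched as a toy packet format: a hierarchical address on the envelope, the useful configuration bits inside, and a CRC checked at the port of entry. Everything about the layout here is an assumption for illustration (the field names `block`, `row`, `column`, `minor`, the 8-bit field widths, and the use of CRC-32 from `zlib`); the real Xilinx bitstream format is different.

```python
import struct
import zlib


def pack_frame(address, payload):
    """Wrap configuration bits in a toy envelope: address, length, payload, CRC."""
    block, row, column, minor = address            # hypothetical hierarchy, 8 bits each
    far = (block << 24) | (row << 16) | (column << 8) | minor
    body = struct.pack(">II", far, len(payload)) + payload
    return body + struct.pack(">I", zlib.crc32(body))


def unpack_frame(packet):
    """Check the CRC, then split the envelope back into address and payload."""
    body = packet[:-4]
    (crc,) = struct.unpack(">I", packet[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch: frame rejected at the port of entry")
    far, length = struct.unpack(">II", body[:8])
    address = (far >> 24, (far >> 16) & 0xFF, (far >> 8) & 0xFF, far & 0xFF)
    return address, body[8:8 + length]


packet = pack_frame((1, 2, 3, 4), b"\x0f\xf0")
overhead_bytes = len(packet) - 2                   # everything except the 2 payload bytes
```

In this toy layout a 2-byte payload carries 12 bytes of envelope, which is the sense in which "all this address stuff is overhead".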

So what does a real configured/programmed FPGA look like? Before Programming Nice clean plate Empty LUTs, Switches…. After Programming Messy plate of spaghetti Configured LUTs, Switches…. All those green things are wires that have been setup to carry data between LUTs, FFs etc…

High Performance Embedded Computing (HPEC) using FPGAs • Signal processing algorithms • Wildly useful and hence widely used • Computationally quite parallel/pipeline-amenable • Proven to be accelerate-able by Systolic Array designs on FPGAs • The Good of FPGAs: • FPGAs claim to have orders of magnitude performance advantage over DSP chips (www.xilinx.com, www.altera.com) • They can be reconfigured partially and dynamically • The Bad (no, the Ugly): • Productivity is the biggest barrier • The number of signal processing folks willing to adopt FPGAs is small and stagnant • Partial dynamic reconfiguration is very slow compared to processing speeds

Elaborating the Good of FPGAs: Extreme DSP computing

Elaborating the Good of FPGAs: Partial Dynamic Reconfiguration • At some point in time, the FPGA holds 4 parallel processing circuits for application α and 6 parallel processing circuits for application β • Abruptly, say we need to quickly increase parallelism support for application α (5) • At the cost of taking away parallelism support for the other application, because we did not have enough space on the chip to support high levels of parallelism for both applications, or there was a power budget we couldn't satisfy • Can we dynamically reconfigure the chip without disturbing the execution of either application? And do it fast enough? • Remember, programming the FPGA is a very, very, very slow process RELATIVE to the execution speeds of applications • [Figure: animation frames of the FPGA floorplan, labelled with 4 or 5 parallel processing circuits for application α and 6 or 7 for application β]

Productivity • It’s a funny thing in the FPGA world • FPGA programmers are essentially VLSI design guys • They don’t buy $5K parts to get average performance • Every clock cycle is precious • Every LUT/FF/MAC/BRAM is precious • They don’t adopt new programming languages in a hurry • They love to have full control over every operation

Productivity, so what does it mean? • Wants an entire system on FPGA modeled, performance predicted, designed, implemented, debugged, verified, guaranteed timing closure, low power, high throughput…. • Done really really fast, just like software • And then wants to make some minor changes and do it quickly all over again, just like software…

Why can't new designs be compiled, loaded onto FPGAs and tested super fast? • Need to look at traditional design flow • Hardware-Software partition (quick) • Create macro and micro architectures for hardware portion (a month, two months..) • Write bug free VHDL/Verilog code for architectures (a few months) • Synthesize, translate, map, place and route (5 to 15 hours) • Simulate • If there is a functional or timing bug, you pay a penalty of a few days to weeks • Load configuration onto chip • Test again. • If there is a timing bug, you pay a penalty of several weeks • If you decide to make a micro architecture change, go back to step 2 • Good luck trying to finish your project on time and budget • This will still not get you a dynamically reconfigurable design

One way to Improve Productivity • Stick to the traditional design flow as much as possible • FPGA users are once bitten twice shy • Very conservative and believe in the existing flow • But introduce structure into the flow, i.e. physical structure, macro-architecture structure • Make Partial Dynamic Reconfiguration (PDR) almost automatic • FPGA designers are not conversant with PDR designs

Augmented Design Flow: Exclusively for Signal Processing Algorithms • Hardware-Software Partitioning (just a concept and specific to an application) • Structured Macro-architecture via Floor Planning • Generic structure applicable to many algorithms • Structure Micro-architecture design • Project, Schedule data flow model of Sig. Proc. Kernel onto things called Sockets of Macro-architecture • Well understood process • Embed dynamic reconfiguration capability • New technology • Works in tandem with Macro-architecture • Code, Synthesize…. • Test on chip

Structured Macro-architecture • Some important Terms/Elements: • Socket: A physical region on the FPGA chip reserved by designer to be loaded with/configured with a PE. This is also called a Partial Reconfiguration Region (PRR) • Switch Box: A circuit that makes the array of Sockets re-partition-able • PE/Processing Element: A circuit/bitstream to implement a signal processing kernel’s systolic array data-flow functionality. To activate a socket, a PE must be loaded into it

Socket/PRR: Under the Hood Yellow box: A socket/PRR It contains BRAMs, MACs and LUTs/FFs (purple and blue/green/black stuff) If you want to dynamically reconfigure the parallelism of Systolic Arrays on an FPGA: All PRRs must be created with identical resources of MACs, BRAMs, LUTs, FFs. Physical fabric of Virtex SX 35 FPGA

Switch Box: Stuff that makes the Array of Sockets Re-partition-able Simple circuit Need to set mux sel lines & fifo controls Resides in static region on FPGA Change SB connections to change partitioning of sockets/PRRs between systolic array kernels’ nodes

Ok, time to port Macro-architecture Framework onto Chip

What really happened when we tried it Virtex 4 SX 35 • Static region • (luminescent green stuff) • Microprocessor • Switch Boxes • Cache • Controller • PRRs/Sockets • (white boxes) • To be filled with Systolic Array Processing Elements

Now to the Micro-architecture…First, Hardware Software Partitioning Example: Extended Kalman Filter (EKF). A critical navigation algorithm and a nasty signal processing kernel. All stuff with rounded edges are tasks that can change based on physics of the problem. So put it all in software (Microblaze). All else is consistent and so put them in hardware (PolySAF)

Designing/Deriving the Processing Element: Example EKF • Works on Faddeev Algorithm to compute Schur complement
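As a rough software illustration of the Faddeev algorithm named on this slide: stack the compound matrix [[A, B], [-C, D]], use row operations to annihilate the -C block, and the lower-right block that remains is D + C·A⁻¹·B, which is the Schur complement of A in that compound matrix. The function name `faddeev` and the use of exact `Fraction` arithmetic are assumptions made for clarity; this is a sequential sketch, not the systolic-array data flow of the slide's PE.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inverse(A) * B by Gaussian elimination on [[A, B], [-C, D]]."""
    n = len(A)
    top = [[Fraction(x) for x in row_a + row_b] for row_a, row_b in zip(A, B)]
    bottom = [[Fraction(-x) for x in row_c] + [Fraction(x) for x in row_d]
              for row_c, row_d in zip(C, D)]
    M = top + bottom
    for k in range(n):
        # Pick a nonzero pivot among the remaining rows of the [A B] block.
        pivot = next((r for r in range(k, n) if M[r][k] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate column k from every row below, including the [-C D] rows.
        for r in range(k + 1, len(M)):
            factor = M[r][k] / M[k][k]
            if factor:
                M[r] = [x - factor * y for x, y in zip(M[r], M[k])]
    # The -C block is now zero; what sits beside it is D + C * inverse(A) * B.
    return [row[n:] for row in M[n:]]


A = [[2, 1], [1, 3]]
B = [[1], [2]]
schur = faddeev(A, B, [[1, 1]], [[0]])
solution = faddeev(A, B, [[1, 0], [0, 1]], [[0], [0]])   # C = I, D = 0 gives inverse(A) * B
```

Choosing C and D differently lets the same elimination produce a linear solve, a matrix inverse, or a multiply-accumulate, which is one reason a single structure of this kind can serve several matrix steps.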

One of the many possible ways Port

Code, Synthesize, …Optimize • Port: Code, synthesize, Translate, Map, Place and Route • For One Socket/PRR (just a few days worth of work) • Move Nets around to meet timing: Manually pick up a wire in this small bowl of spaghetti of wires, and move it around. • Nuisance of a task, but necessary • But you need to do it only in one PRR (just a few hours worth of work) • Copy Locally optimized bitstream/circuit of the one PRR to all PRRs • Automatically obtain Global Timing closure for the PolySAF • If Microprocessor, Cache are retained for multiple designs, then global timing closure for whole chip is also automatically gifted to you

Have we answered the Productivity problem? Time to Grade the Approach • Need to look at traditional design flow • Hardware-Software partition (quick) • Create macro and micro architectures for hardware portion (a month, two months..) • Applicable to a wide range of Sig. Proc. Algorithms • Write bug free VHDL/Verilog code for architectures (a few months) • Reuse most of the macro structure and code only for one PRR • Synthesize, translate, map, place and route (5 to 15 hours) • Do for only one PRR • Simulate • If there is a functional or timing bug, you pay a penalty of a few days to weeks • Load configuration onto chip • Test again. • If there is a timing bug, you pay a penalty of several weeks • If you decide to make a micro architecture change, go back to step 3 • Good luck trying to finish your project on time and budget

Want the details, the math, the algorithms etc? • Read this paper • A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, “Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms,” IET/IEE Computers & Digital Techniques. Vol 4, Issue 1. Jan 2010. • Author preprint available online at Reconfigurable Computing Group • www.usu.edu/rcg

Now, onto Partial Dynamic Reconfiguration in the PolySAF 3 nodes EKF 2 nodes DWT Detach Socket 2 nodes EKF 2 nodes DWT Reconfigure Reset new PRR Re-attach 2 nodes EKF 3 nodes DWT DWT: discrete wavelet transform. The kernel used in JPEG 2000 image compression
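The detach, reconfigure, reset, re-attach sequence on this slide can be sketched as simple bookkeeping over the socket array. This is a toy model under stated assumptions: the class `SocketArray` and its method names are hypothetical, and the real steps (switch box control, bitstream loading, reset) are reduced to comments.

```python
from collections import Counter


class SocketArray:
    """Toy bookkeeping model of PolySAF sockets (PRRs); not the authors' controller."""

    def __init__(self, owners):
        # One entry per socket: the kernel whose PE is loaded, or None while detached.
        self.owners = list(owners)
        self.log = []

    def node_counts(self):
        return dict(Counter(k for k in self.owners if k is not None))

    def _record(self, step):
        self.log.append((step, self.node_counts()))

    def move_socket(self, donor, receiver):
        index = self.owners.index(donor)   # choose one socket used by the donor kernel
        self.owners[index] = None          # detach: the switch box isolates the socket
        self._record("detach")
        loaded = receiver                  # reconfigure: load the receiver's PE bitstream
        self._record("reconfigure")
        self._record("reset")              # reset the freshly configured PRR
        self.owners[index] = loaded        # re-attach: the switch box joins it to the receiver
        self._record("re-attach")
        return index


array = SocketArray(["EKF", "EKF", "EKF", "DWT", "DWT"])
array.move_socket("EKF", "DWT")
```

The log shows the intermediate state from the slide: while the socket is detached, both kernels keep running on 2 nodes each, and only the re-attach step raises DWT to 3 nodes.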

How to Physically Reconfigure PRR? • Known Methods

Comparison of all known options Best known technique: from Microsoft Research Labs (2008) eMIPS project Too Slow, Too expensive (hogs up valuable on-chip BRAMs)

Embedding Dynamic Reconfiguration into the System • Active Bitstream (PRR) to PRR: Hardware Circuit • [Figure: block diagram of the FPGA showing the ARC, the ICAP wrapper, the ICAP, a snoop path, a source PRR holding the active bitstream, and destination PRRs]

Accelerated Relocation Circuit (ARC) • Manipulate Frame addresses • FAR is Frame address register • Lots of unnecessary overhead can be avoided • No need for CRC processing
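A minimal sketch of the relocation idea, assuming frames can be represented as an address plus a payload and that the source and destination PRRs have identical resources. The `Frame` fields and the `relocate` function are hypothetical; real frame addresses carry more fields, and the ARC does this in hardware rather than software.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    row: int       # hypothetical address fields, simplified for illustration
    column: int
    minor: int
    data: bytes


def relocate(frames, source_origin, destination_origin):
    """Retarget frames read from a source PRR to a destination PRR of identical shape."""
    d_row = destination_origin[0] - source_origin[0]
    d_col = destination_origin[1] - source_origin[1]
    # Only the addresses change; the configuration payload is reused untouched.
    return [f._replace(row=f.row + d_row, column=f.column + d_col) for f in frames]


source = [
    Frame(0, 4, 0, b"\xaa\x55"),
    Frame(0, 4, 1, b"\x0f\xf0"),
    Frame(0, 5, 0, b"\x12\x34"),
]
moved = relocate(source, source_origin=(0, 4), destination_origin=(1, 10))
```

Because the payload bytes are copied as-is, the work per frame is one address rewrite, which is consistent with the slide's point that much of the usual overhead can be avoided.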

Results: reconfiguration times in milliseconds • Footprint of ARC: 1064 LUTs, 638 FFs and 1 BRAM • All systems run @ 100 MHz • * Estimated values for state of the art competing technologies

Next steps…Improve, Formalize and Collaborate • Performance prediction Model • Predict how big circuit will be, how it will perform using Excel and Matlab • Big leap in productivity • Arithmetic Precision manipulation is extraordinarily powerful when it comes to FPGAs • If the right non-IEEE precision can be chosen for a Sig. Proc. App., then you can save medium to massive amounts of area and power in the circuit mapped onto the FPGA • Great opportunity for Small Satellites • Efficient communication between Microprocessor and PolySAF via threads • Validate and brutally test this on a large number of algorithms (FFTs, Filters, Hyperspectral processing…..) • NASA can help with this • Technology is attractive for software defined radios, precision navigation…
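To illustrate the precision trade-off, here is a small sketch that quantizes values to a fixed-point format with a chosen number of fractional bits and measures the worst-case error. The helper names `quantize` and `max_error` are hypothetical, and the sketch only shows the accuracy side; how much area or power a narrower word saves depends on the circuit and is not modelled here.

```python
def quantize(value, fractional_bits):
    """Round a real value to the nearest multiple of 2**-fractional_bits."""
    scale = 1 << fractional_bits
    return round(value * scale) / scale


def max_error(samples, fractional_bits):
    """Worst-case absolute quantization error over a set of samples."""
    return max(abs(s - quantize(s, fractional_bits)) for s in samples)


samples = [0.1, 0.2, 0.3, 0.7, 0.9]
# Error for 4, 8 and 12 fractional bits; each is bounded by half of one step.
errors = {bits: max_error(samples, bits) for bits in (4, 8, 12)}
```

If an application tolerates the error at, say, 8 fractional bits, there is no need to pay for a full IEEE floating-point datapath, which is the opportunity the slide points to.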

Kaleidoscope: Future of FPGA • Near term • Maybe better tools to program and debug FPGAs? • Mentor’s Catapult, AutoESL compiler, Synfora compiler…. • Maybe some sort of standardization in FPGA programming • Hopefully DARPA HPCS program will produce something • Longer term (Revolutionary things to come) • Vertically Integrated FPGA + DRAM on a single chip • 1000x improvement in performance/watt • Visit Micron Research Center at USU to learn more • www.usu.edu/mrc

Acknowledgements • Joe Bredekamp and the NASA AISR program • Applied Information Systems Research • Funding from NASA is valuable • Focused research • Want my technology to be adopted for real missions • Xilinx and Mentor Graphics (donated > $100K worth of software) • My Grad Students
