Accelerating Applications using FPGAs Satnam Singh, Microsoft Research, Cambridge UK


A Heterogeneous Future

Example Speedup: DNA Sequence Matching

Why are regular computers not fast enough?

FPGAs are the Lego of Hardware

• multiple independent multi-ported memories
• hard and soft embedded processors
• fine-grain parallelism and pipelining

The heart of an FPGA

LUT4 (OR)

LUT4 (AND)

LUTs are higher order functions
[Figure: lut1, lut2, lut3 and lut4 blocks with inputs i0 to i3 and output o]
inv = lut1 not
and2 = lut2 (&&)
mux = lut3 (λs d0 d1 . if s then d1 else d0)
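The idea that a LUT is a higher order function can be tried out in software. The following is a minimal C# sketch, not the notation used on the slide: a LUT constructor takes an ordinary Boolean function, tabulates it into the truth table an n-input LUT would hold, and returns a function that only performs table lookups. The names LutSketch, Table, Lut1, Lut2, Lut3, Inv, And2 and Mux are illustrative assumptions.

```csharp
using System;

public static class LutSketch
{
    // Build the 2^n-entry truth table an n-input LUT would store,
    // by evaluating f on every input combination. inputs[0] models i0.
    static bool[] Table(int n, Func<bool[], bool> f)
    {
        var table = new bool[1 << n];
        for (int addr = 0; addr < table.Length; addr++)
        {
            var inputs = new bool[n];
            for (int bit = 0; bit < n; bit++)
                inputs[bit] = ((addr >> bit) & 1) == 1;
            table[addr] = f(inputs);
        }
        return table;
    }

    // Each LutN takes a function and returns a "circuit" that only reads the table.
    public static Func<bool, bool> Lut1(Func<bool, bool> f)
    {
        bool[] t = Table(1, i => f(i[0]));
        return i0 => t[i0 ? 1 : 0];
    }

    public static Func<bool, bool, bool> Lut2(Func<bool, bool, bool> f)
    {
        bool[] t = Table(2, i => f(i[0], i[1]));
        return (i0, i1) => t[(i0 ? 1 : 0) | (i1 ? 2 : 0)];
    }

    public static Func<bool, bool, bool, bool> Lut3(Func<bool, bool, bool, bool> f)
    {
        bool[] t = Table(3, i => f(i[0], i[1], i[2]));
        return (i0, i1, i2) => t[(i0 ? 1 : 0) | (i1 ? 2 : 0) | (i2 ? 4 : 0)];
    }

    // The three definitions from the slide, expressed with the helpers above.
    public static readonly Func<bool, bool> Inv = Lut1(x => !x);
    public static readonly Func<bool, bool, bool> And2 = Lut2((a, b) => a && b);
    public static readonly Func<bool, bool, bool, bool> Mux =
        Lut3((s, d0, d1) => s ? d1 : d0);
}
```

For example, LutSketch.Mux(true, false, true) selects d1 and evaluates to true, while LutSketch.Mux(false, false, true) selects d0 and evaluates to false.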

FPGAs as Co-Processors
• XD2000i: FPGA in-socket accelerator for Intel FSB
• XD2000F: FPGA in-socket accelerator for AMD socket F
• XD1000: FPGA co-processor module for socket 940

What kind of problems fit well on FPGA?

• scientific computing
• data mining
• search
• image processing
• financial analytics
[Diagram labels: opportunity, challenge]

Fibonacci Example 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...

entity fib is
  port (signal clk, rst : in bit ;
        signal fibnr : out natural) ;
end entity fib ;

architecture behavioural of fib is
  signal lastFib, currentFib : natural ;
begin
  compute_fibs : process
  begin
    wait until clk'event and clk = '1' ;
    if rst = '1' then
      lastFib <= 0 ;
      currentFib <= 1 ;
    else
      currentFib <= lastFib + currentFib ;
      lastFib <= currentFib ;
    end if ;
  end process compute_fibs ;
  fibnr <= currentFib ;
end architecture behavioural ;
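In the VHDL process, both signal assignments in the else branch read the register values from before the clock edge, so the order of the two statements does not matter. A minimal C# sketch of that behaviour, as a software model only: the class name FibModel and its Clock method are illustrative and not part of the original design.

```csharp
using System;

public sealed class FibModel
{
    public int LastFib { get; private set; }
    public int CurrentFib { get; private set; }

    // Models one rising clock edge. Both next-state values are computed
    // from the old register contents before either register is updated,
    // which mirrors VHDL signal assignment inside a clocked process.
    public void Clock(bool rst)
    {
        int nextLast = rst ? 0 : CurrentFib;
        int nextCurrent = rst ? 1 : LastFib + CurrentFib;
        LastFib = nextLast;
        CurrentFib = nextCurrent;
    }
}
```

After one Clock(true) followed by repeated Clock(false) calls, CurrentFib steps through 1, 1, 2, 3, 5, 8, which corresponds to the fibnr output after reset.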

demonstration...

[Diagram labels: data parallel descriptions; FPGA hardware (VHDL); GPU code (Accelerator); C++ SMP]

The Accidental Semi-colon ;

Kiwi
[Diagram labels: gate-level VHDL/Verilog, Kiwi, C-to-gates; structural, parallel imperative, imperative (C); thread 1, thread 2, thread 3 separated by ";"; jpeg.c]

[Diagram labels: Kiwi Library, circuit model, Kiwi.cs, JPEG.cs, Visual Studio, Kiwi Synthesis, multi-thread simulation, debugging, verification, circuit implementation, JPEG.v]

[Diagram: a parallel C# program with Thread 1, Thread 2 and Thread 3; each thread passes through "C to gates" to give a circuit, and the circuits together give the Verilog for the system]

Our Implementation
• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
  • The .NET stack is analyzed and removed.
  • The control structure of the code is analyzed and broken into basic blocks, which are then composed.
  • The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.

System Composition
• We need a way to separately develop components and then compose them together.
• Don't invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).

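Writing to a Channel
using System.Threading;   // Monitor lives in System.Threading

public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            // Block until the single slot is free.
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }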

Reading from a Channel
    public T Read()
    {
        T r;
        lock (this)
        {
            // Block until a value has been written.
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
}   // end of class Channel<T>

[Layer diagram, top to bottom]
user applications
domain specific languages
rendezvous, join patterns, transactional memory, data parallelism
systems level concurrency constructs: threads, events, monitors, condition variables

class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;
    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();
        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
    }
}   // end of class FIFO2
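The producer/consumer arrangement of FIFO2 can also be exercised as ordinary software. The sketch below is an assumption-laden stand-in rather than the Kiwi version: it uses only System.Threading, carries its own copy of the one-place channel from the earlier slides, omits Kiwi.Pause(), and adds a hypothetical Run method that drains chan2 so the results can be inspected.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Local copy of the one-place channel, so the sketch is self-contained.
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty) Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }

    public T Read()
    {
        lock (this)
        {
            while (empty) Monitor.Wait(this);
            empty = true;
            T r = datum;
            Monitor.PulseAll(this);
            return r;
        }
    }
}

public static class Fifo2Sketch
{
    // Runs producer -> chan1 -> consumer -> chan2 and collects `count` results.
    public static List<int> Run(int count)
    {
        var chan1 = new Channel<int>();
        var chan2 = new Channel<int>();

        var producer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) chan1.Write(i);
        });
        var consumer = new Thread(() =>
        {
            while (true) chan2.Write(2 * chan1.Read());
        });
        consumer.IsBackground = true;   // the consumer loops forever, as on the slide

        producer.Start();
        consumer.Start();

        var results = new List<int>();
        for (int i = 0; i < count; i++) results.Add(chan2.Read());
        producer.Join();
        return results;
    }
}
```

Because each channel holds a single value and there is one writer and one reader per channel, the values arrive in order: Fifo2Sketch.Run(10) yields 0, 2, 4 and so on up to 18.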

Filter Example
[Diagram legend: thread; one-place channel]

// 'size' (the number of filter taps) is assumed to be defined elsewhere in the class.
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}

Transposed Filter

static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
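Each Tap computes x * w + y, and chaining such taps through registers gives the transposed form of an FIR filter. The following sequential C# sketch models that structure in software, assuming the textbook transposed arrangement (every tap sees the current sample, and partial sums move one tap closer to the output per sample). It is not the channel-based network from the slides; the names TransposedFirSketch, Filter and regs are illustrative.

```csharp
using System;

public static class TransposedFirSketch
{
    // Transposed-form FIR: tap k computes x * weights[k] + regs[k], where
    // regs[k] holds the output of tap k+1 from the previous sample.
    public static int[] Filter(int[] weights, int[] input)
    {
        int n = weights.Length;
        var regs = new int[n];              // regs[n-1] stays 0 (no tap beyond the last)
        var result = new int[input.Length];

        for (int s = 0; s < input.Length; s++)
        {
            int x = input[s];
            int y = 0;
            for (int k = 0; k < n; k++)
            {
                int tapOut = x * weights[k] + regs[k];
                if (k == 0)
                    y = tapOut;             // tap 0 drives the filter output
                else
                    regs[k - 1] = tapOut;   // registered for tap k-1 on the next sample
            }
            result[s] = y;
        }
        return result;
    }
}
```

With weights {1, 2, 3} and input {1, 0, 0, 2}, Filter returns {1, 2, 3, 2}, the same values a direct-form (sliding window) filter produces for that input.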

Inter-thread Communication and Synchronization
// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}

// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate()
    {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j + 1]);
    });
    tapThread.Start();
}
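One way to approach the quiz: in C#, all iterations of a for loop share a single loop variable, so a delegate that captures i sees whatever value i holds when the delegate eventually runs, whereas a local declared inside the loop body is a fresh variable per iteration. The sketch below shows the difference without threads; the class and method names are illustrative.

```csharp
using System;
using System.Collections.Generic;

public static class CaptureSketch
{
    // Every delegate captures the same loop variable i.
    public static int[] CaptureLoopVariable(int n)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < n; i++)
            actions.Add(() => i);
        return Invoke(actions);
    }

    // Each delegate captures its own per-iteration copy j.
    public static int[] CaptureLocalCopy(int n)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < n; i++)
        {
            int j = i;
            actions.Add(() => j);
        }
        return Invoke(actions);
    }

    // Runs the delegates only after the loop that created them has finished.
    static int[] Invoke(List<Func<int>> actions)
    {
        var values = new int[actions.Count];
        for (int k = 0; k < actions.Count; k++)
            values[k] = actions[k]();
        return values;
    }
}
```

CaptureLoopVariable(3) returns {3, 3, 3}, while CaptureLocalCopy(3) returns {0, 1, 2}. In the tap-wiring loop, a thread that starts after i has advanced could otherwise be handed the wrong weight and channels.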

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.Research.DataParallelArrays;
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;

namespace ForOxford
{
    class Program
    {
        static void Main(string[] args)
        {
            PA.InitGPU();
            IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
            IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
            IPA is3 = new IPA(4, is1.Shape);
            is3 = PA.Add(is1, is2);
            IPA result = PA.Evaluate(is3);
            int[] ra1;
            PA.ToArray(result, out ra1);
            foreach (int i in ra1)
                Console.Write(i + " ");
            Console.WriteLine("");
        }
    }
}

Example: Bitmap Blur (Using Accelerator v1.1.1)
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;

float[,] Blur(float[] kernel)
{
    FPA pa = new FPA(bitmap);
    // Convolve in X direction
    FPA resultX = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultX += PA.Shift(pa, 0, i) * kernel[i];
    }
    // Convolve in Y direction.
    FPA resultY = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultY += PA.Shift(resultX, i, 0) * kernel[i];
    }
    float[,] result;
    PA.ToArray(resultY, out result);
    return result;
}

Expression Graphs
FPA pa = new FPA(bitmap);
// Convolve in X direction
FPA rX = new FPA(0, pa.Shape);
for (int i = 0; i < kernel.Length; i++)
{
    rX += PA.Shift(pa, 0, i) * kernel[i];
}
[Expression graph: rX = Shift(pa, 0, 0) * k[0] + Shift(pa, 0, 1) * k[1] + …]

class Program
{
    static void Main(string[] args)
    {
        IPA.InitGPU();
        IPA ipa1 = new IPA(5, new int[] { 1, 2, 3, 4, 5 });
        IPA ipa2 = new IPA(5, new int[] { 10, 20, 30, 40, 50 });
        IPA ipa3 = new IPA(5, new int[] { 21, 5, 7, 4, 8 });
        IPA ipa4 = new IPA(5, new int[] { 4, 1, 7, 2, 5 });
        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);   // was PA.Add(is1, is2): those names are not declared in this program
        IPA result = PA.Multiply(ipa4,
                         PA.Subtract(ipa3, PA.Add(ipa1, ipa2)));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
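The Expression Graphs slide suggests that nested calls such as Multiply, Subtract and Add describe a graph that is only computed when the result is converted to an array. A minimal C# sketch of that deferred style, assuming nothing about how Accelerator is implemented internally: the Expr, Leaf and Binary types are hypothetical stand-ins and run on the CPU.

```csharp
using System;

// A node in a deferred element-wise array expression.
public abstract class Expr
{
    public abstract int[] Evaluate();

    // Operators only build graph nodes; no arithmetic happens here.
    public static Expr operator +(Expr a, Expr b) => new Binary(a, b, (x, y) => x + y);
    public static Expr operator -(Expr a, Expr b) => new Binary(a, b, (x, y) => x - y);
    public static Expr operator *(Expr a, Expr b) => new Binary(a, b, (x, y) => x * y);
}

public sealed class Leaf : Expr
{
    readonly int[] data;
    public Leaf(int[] data) { this.data = data; }
    public override int[] Evaluate() => (int[])data.Clone();
}

public sealed class Binary : Expr
{
    readonly Expr left, right;
    readonly Func<int, int, int> op;

    public Binary(Expr left, Expr right, Func<int, int, int> op)
    {
        this.left = left;
        this.right = right;
        this.op = op;
    }

    // Walks the graph and applies the operation element by element.
    // Both operands are assumed to have the same length.
    public override int[] Evaluate()
    {
        int[] l = left.Evaluate();
        int[] r = right.Evaluate();
        var result = new int[l.Length];
        for (int i = 0; i < l.Length; i++)
            result[i] = op(l[i], r[i]);
        return result;
    }
}
```

With leaves holding the four arrays from the program above, the expression ipa4 * (ipa3 - (ipa1 + ipa2)) is just a small tree until Evaluate() is called, at which point this sketch produces {40, -17, -182, -80, -235}.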
