Accelerating Applications using FPGAs
Satnam Singh, Microsoft Research, Cambridge UK
A Heterogeneous Future
Example Speedup: DNA Sequence Matching
Why are regular computers not fast enough?
FPGAs are the Lego of Hardware
• multiple independent multi-ported memories
• hard and soft embedded processors
• fine-grain parallelism and pipelining
LUTs are higher order functions
[Figure: four lookup-table blocks, lut1 (input i), lut2 (inputs i1, i0), lut3 (inputs i2, i1, i0) and lut4 (inputs i3, i2, i1, i0), each with output o]
inv = lut1 not
and2 = lut2 (&&)
mux = lut3 (λs d0 d1 . if s then d1 else d0)
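As a rough illustration of the idea on this slide, the same three definitions can be written in C#, treating a LUT as a function that takes a Boolean function and returns a circuit element. The names Lut1, Lut2 and Lut3 and the truth-table representation are assumptions made for this sketch; they are not taken from any FPGA library.

```csharp
using System;

public static class LutSketch
{
    // A k-input LUT is a 2^k-entry truth table. Each factory samples the
    // supplied function once per input combination, then answers by lookup.
    public static Func<bool, bool> Lut1(Func<bool, bool> f)
    {
        bool[] table = { f(false), f(true) };
        return i => table[i ? 1 : 0];
    }

    public static Func<bool, bool, bool> Lut2(Func<bool, bool, bool> f)
    {
        var table = new bool[4];
        for (int n = 0; n < 4; n++)
            table[n] = f((n & 1) != 0, (n & 2) != 0);
        return (a, b) => table[(a ? 1 : 0) | (b ? 2 : 0)];
    }

    public static Func<bool, bool, bool, bool> Lut3(Func<bool, bool, bool, bool> f)
    {
        var table = new bool[8];
        for (int n = 0; n < 8; n++)
            table[n] = f((n & 1) != 0, (n & 2) != 0, (n & 4) != 0);
        return (a, b, c) => table[(a ? 1 : 0) | (b ? 2 : 0) | (c ? 4 : 0)];
    }

    // The three definitions from the slide.
    public static readonly Func<bool, bool> Inv = Lut1(x => !x);
    public static readonly Func<bool, bool, bool> And2 = Lut2((a, b) => a && b);
    public static readonly Func<bool, bool, bool, bool> Mux =
        Lut3((s, d0, d1) => s ? d1 : d0);

    public static void Main()
    {
        Console.WriteLine(Inv(true));               // False
        Console.WriteLine(And2(true, true));        // True
        Console.WriteLine(Mux(true, false, true));  // True: s selects d1
    }
}
```

The point of the sketch is that the function argument is only consulted while the table is being filled in; after that the "hardware" is just a lookup, which is what makes a LUT a higher order function.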
FPGAs as Co-Processors
• XD2000i: FPGA in-socket accelerator for Intel FSB
• XD2000F: FPGA in-socket accelerator for AMD socket F
• XD1000: FPGA co-processor module for socket 940
• scientific computing
• data mining
• search
• image processing
• financial analytics
opportunity / challenge
Fibonacci Example 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...
entity fib is
  port (signal clk, rst : in bit ;
        signal fibnr : out natural) ;
end entity fib ;

architecture behavioural of fib is
  signal lastFib, currentFib : natural ;
begin
  compute_fibs : process
  begin
    wait until clk'event and clk='1' ;
    if rst = '1' then
      lastFib <= 0 ;
      currentFib <= 1 ;
    else
      currentFib <= lastFib + currentFib ;
      lastFib <= currentFib ;
    end if ;
  end process compute_fibs ;
  fibnr <= currentFib ;
end architecture behavioural ;
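The VHDL process relies on both signal assignments taking effect together at the clock edge, so lastFib receives the old value of currentFib. A minimal software model of that behaviour, written in C# for comparison, might look like the following. The class and method names are hypothetical, and both registers are assumed to start at 0, the default initial value of a VHDL natural.

```csharp
using System;
using System.Collections.Generic;

public static class FibCircuitModel
{
    // Returns the value seen on fibnr after each rising clock edge.
    // rst[n] is the reset input sampled at edge n.
    public static List<int> Run(bool[] rst)
    {
        int lastFib = 0, currentFib = 0;   // register contents before the first edge
        var trace = new List<int>();
        foreach (bool reset in rst)
        {
            // Compute both next-state values from the OLD register contents...
            int nextLast, nextCurrent;
            if (reset)
            {
                nextLast = 0;
                nextCurrent = 1;
            }
            else
            {
                nextCurrent = lastFib + currentFib;
                nextLast = currentFib;
            }
            // ...then update the two registers together, like VHDL signals.
            lastFib = nextLast;
            currentFib = nextCurrent;
            trace.Add(currentFib);
        }
        return trace;
    }

    public static void Main()
    {
        // One reset cycle followed by seven free-running cycles.
        bool[] rst = { true, false, false, false, false, false, false, false };
        Console.WriteLine(string.Join(" ", Run(rst)));  // 1 1 2 3 5 8 13 21
    }
}
```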
[Figure: data parallel descriptions targeting FPGA hardware (VHDL), GPU code (Accelerator) and C++ SMP]
Kiwi
[Figure: design styles compared: structural (gate-level VHDL/Verilog), imperative C (C-to-gates, jpeg.c) and parallel imperative (Kiwi, with thread 1, thread 2 and thread 3)]
[Figure: Kiwi tool flow: JPEG.cs and the Kiwi Library (Kiwi.cs, circuit model) are used in Visual Studio for multi-thread simulation, debugging and verification, and in Kiwi Synthesis for a circuit implementation (JPEG.v)]
[Figure: each thread of a parallel C# program (Thread 1, Thread 2, Thread 3) is passed through C to gates to give a circuit; the circuits together form the Verilog for the system]
Our Implementation
• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
  • The .NET stack is analyzed and removed.
  • The control structure of the code is analyzed and broken into basic blocks which are then composed.
  • The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.
System Composition
• We need a way to separately develop components and then compose them together.
• Don’t invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).
Writing to a Channel
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }
Reading from a Channel
    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
}
user applications
domain specific languages
rendezvous, join patterns, transactional memory, data parallelism
systems level concurrency constructs: threads, events, monitors, condition variables
class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();
    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }
    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();
        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
    }
}
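Because the channel is built only from monitors, the producer/consumer pattern above can also be exercised as ordinary multi-threaded software. The following is a minimal sketch under a few assumptions: it omits the Kiwi attributes and Kiwi.Pause(), bounds the consumer loop so the program terminates, and uses a hypothetical class name OnePlaceChannel with a private lock object instead of lock (this).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// One-place channel following the Write/Read slides, with a private lock object.
public class OnePlaceChannel<T>
{
    private readonly object gate = new object();
    private T datum;
    private bool empty = true;

    public void Write(T v)
    {
        lock (gate)
        {
            while (!empty) Monitor.Wait(gate);   // block while the slot is occupied
            datum = v;
            empty = false;
            Monitor.PulseAll(gate);
        }
    }

    public T Read()
    {
        lock (gate)
        {
            while (empty) Monitor.Wait(gate);    // block until a value arrives
            empty = true;
            T r = datum;
            Monitor.PulseAll(gate);
            return r;
        }
    }
}

public static class Fifo2Demo
{
    // Producer -> chan1 -> consumer (doubles each value) -> chan2 -> caller.
    public static List<int> Run(int count)
    {
        var chan1 = new OnePlaceChannel<int>();
        var chan2 = new OnePlaceChannel<int>();

        var producer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) chan1.Write(i);
        });
        var consumer = new Thread(() =>
        {
            for (int n = 0; n < count; n++) chan2.Write(2 * chan1.Read());
        });
        producer.Start();
        consumer.Start();

        var results = new List<int>();
        for (int n = 0; n < count; n++) results.Add(chan2.Read());
        producer.Join();
        consumer.Join();
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Run(10)));  // 0 2 4 6 8 10 12 14 16 18
    }
}
```

With one writer and one reader per channel the values arrive in order, so the output is deterministic even though the thread interleaving is not.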
Filter Example
[Figure: filter taps implemented as threads, linked by one-place channels]
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];

    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;

    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];

        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
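A quick way to check the sequential filter is to feed it an impulse, which should reproduce the weights, and a step, which should accumulate them. The sketch below repeats the algorithm in a self-contained form; it assumes that size is the number of weights, which the slide does not state, and the class name FirDemo is hypothetical.

```csharp
using System;

public static class FirDemo
{
    // Same algorithm as SequentialFIRFunction, with size = weights.Length (assumption).
    public static int[] SequentialFir(int[] weights, int[] input)
    {
        int size = weights.Length;
        int[] window = new int[size];          // zero-initialised by the runtime
        int[] result = new int[input.Length];

        for (int i = 0; i < input.Length; i++)
        {
            // Shift in the new x value.
            for (int j = size - 1; j > 0; j--)
                window[j] = window[j - 1];
            window[0] = input[i];

            // Weighted sum over the current window.
            int sum = 0;
            for (int z = 0; z < size; z++)
                sum += weights[z] * window[z];
            result[i] = sum;
        }
        return result;
    }

    public static void Main()
    {
        int[] weights = { 3, 5, 7 };

        // Impulse response: the weights appear one per sample.
        Console.WriteLine(string.Join(" ",
            SequentialFir(weights, new[] { 1, 0, 0, 0, 0 })));  // 3 5 7 0 0

        // Step response: running sum of the weights.
        Console.WriteLine(string.Join(" ",
            SequentialFir(weights, new[] { 1, 1, 1, 1 })));     // 3 8 15 15
    }
}
```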
static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
Inter-thread Communication and Synchronization
// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}
// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate()
    {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j + 1]);
    });
    tapThread.Start();
}
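One answer to the quiz: in a C# for loop there is a single variable i shared by every iteration, and an anonymous delegate captures the variable rather than its value at that moment. A tap thread that starts late could therefore see a later value of i, possibly even size. Copying into a local j declared inside the loop body gives each delegate its own variable. The sketch below shows the difference without threads; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class CaptureDemo
{
    // Every lambda captures the same loop variable i.
    public static List<int> CaptureLoopVariable(int size)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < size; i++)
            actions.Add(() => i);

        var seen = new List<int>();
        foreach (var a in actions) seen.Add(a());   // runs after the loop: i == size
        return seen;
    }

    // Each lambda captures its own copy j, created fresh per iteration.
    public static List<int> CaptureLocalCopy(int size)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < size; i++)
        {
            int j = i;
            actions.Add(() => j);
        }

        var seen = new List<int>();
        foreach (var a in actions) seen.Add(a());
        return seen;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", CaptureLoopVariable(3)));  // 3 3 3
        Console.WriteLine(string.Join(" ", CaptureLocalCopy(3)));     // 0 1 2
    }
}
```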
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.Research.DataParallelArrays;
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;

namespace ForOxford
{
    class Program
    {
        static void Main(string[] args)
        {
            PA.InitGPU();
            IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
            IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
            IPA is3 = new IPA(4, is1.Shape);
            is3 = PA.Add(is1, is2);
            IPA result = PA.Evaluate(is3);
            int[] ra1;
            PA.ToArray(result, out ra1);
            foreach (int i in ra1)
                Console.Write(i + " ");
            Console.WriteLine("");
        }
    }
}
Example: Bitmap Blur (Using Accelerator v1.1.1)
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;

float[,] Blur(float[] kernel)
{
    FPA pa = new FPA(bitmap);

    // Convolve in X direction
    FPA resultX = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultX += PA.Shift(pa, 0, i) * kernel[i];
    }

    // Convolve in Y direction.
    FPA resultY = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultY += PA.Shift(resultX, i, 0) * kernel[i];
    }

    float[,] result;
    PA.ToArray(resultY, out result);
    return result;
}
Expression Graphs
FPA pa = new FPA(bitmap);

// Convolve in X direction
FPA rX = new FPA(0, pa.Shape);
for (int i = 0; i < kernel.Length; i++)
{
    rX += PA.Shift(pa, 0, i) * kernel[i];
}
[Figure: expression graph for rX: pa feeds Shift (0,0) and Shift (0,1); each shifted array is multiplied by k[0], k[1], … and the products are accumulated with + into rX]
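The idea behind the expression graph is that the loop computes nothing; each operator only records a node, and the work happens when the result is finally requested. A minimal one-dimensional sketch of that idea in plain C# follows. Every type here (Node, Fill, Data, Shift, Scale, Add) is hypothetical and written for this illustration; none is an Accelerator class, and the edge-clamping behaviour of Shift is an assumption of the sketch.

```csharp
using System;

public abstract class Node
{
    public abstract int[] Eval();    // evaluate the whole sub-graph
    public abstract string Show();   // print the graph structure
    public static Node operator +(Node a, Node b) => new Add(a, b);
    public static Node operator *(Node a, int k) => new Scale(a, k);
}

public sealed class Fill : Node
{
    private readonly int fillValue, length;
    public Fill(int fillValue, int length) { this.fillValue = fillValue; this.length = length; }
    public override int[] Eval()
    {
        var r = new int[length];
        for (int i = 0; i < length; i++) r[i] = fillValue;
        return r;
    }
    public override string Show() => $"Fill({fillValue})";
}

public sealed class Data : Node
{
    private readonly string name;
    private readonly int[] values;
    public Data(string name, int[] values) { this.name = name; this.values = values; }
    public override int[] Eval() => (int[])values.Clone();
    public override string Show() => name;
}

public sealed class Shift : Node
{
    private readonly Node src;
    private readonly int by;
    public Shift(Node src, int by) { this.src = src; this.by = by; }
    public override int[] Eval()
    {
        int[] s = src.Eval();
        var r = new int[s.Length];
        for (int i = 0; i < s.Length; i++)   // reads past the end clamp to the edge (assumption)
            r[i] = s[Math.Min(Math.Max(i + by, 0), s.Length - 1)];
        return r;
    }
    public override string Show() => $"Shift({src.Show()}, {by})";
}

public sealed class Scale : Node
{
    private readonly Node src;
    private readonly int k;
    public Scale(Node src, int k) { this.src = src; this.k = k; }
    public override int[] Eval()
    {
        int[] s = src.Eval();
        for (int i = 0; i < s.Length; i++) s[i] *= k;
        return s;
    }
    public override string Show() => $"({src.Show()} * {k})";
}

public sealed class Add : Node
{
    private readonly Node a, b;
    public Add(Node a, Node b) { this.a = a; this.b = b; }
    public override int[] Eval()
    {
        int[] x = a.Eval(), y = b.Eval();
        for (int i = 0; i < x.Length; i++) x[i] += y[i];
        return x;
    }
    public override string Show() => $"({a.Show()} + {b.Show()})";
}

public static class GraphDemo
{
    public static Node BuildRowConvolution(int[] data, int[] kernel)
    {
        Node pa = new Data("pa", data);
        Node rX = new Fill(0, data.Length);
        for (int i = 0; i < kernel.Length; i++)
            rX += new Shift(pa, i) * kernel[i];   // builds nodes only
        return rX;
    }

    public static void Main()
    {
        Node rX = BuildRowConvolution(new[] { 1, 2, 3, 4 }, new[] { 1, 10 });
        Console.WriteLine(rX.Show());                      // the recorded graph
        Console.WriteLine(string.Join(" ", rX.Eval()));    // 21 32 43 44
    }
}
```

Deferring evaluation like this is what gives a library the chance to inspect the complete graph and choose a target before any data is touched.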
class Program
{
    static void Main(string[] args)
    {
        PA.InitGPU();
        IPA ipa1 = new IPA(5, new int[] { 1, 2, 3, 4, 5 });
        IPA ipa2 = new IPA(5, new int[] { 10, 20, 30, 40, 50 });
        IPA ipa3 = new IPA(5, new int[] { 21, 5, 7, 4, 8 });
        IPA ipa4 = new IPA(5, new int[] { 4, 1, 7, 2, 5 });
        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply(ipa4,
                         PA.Subtract(ipa3, PA.Add(ipa1, ipa2)));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
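To see what the data-parallel expression should produce, the same element-wise arithmetic can be done on the CPU with ordinary arrays. The sketch below does not use Accelerator at all; the helper names Add, Subtract and Multiply are plain LINQ wrappers written for this check, not the library's methods.

```csharp
using System;
using System.Linq;

public static class CpuReference
{
    // Element-wise helpers over equal-length arrays.
    public static int[] Add(int[] a, int[] b) => a.Zip(b, (x, y) => x + y).ToArray();
    public static int[] Subtract(int[] a, int[] b) => a.Zip(b, (x, y) => x - y).ToArray();
    public static int[] Multiply(int[] a, int[] b) => a.Zip(b, (x, y) => x * y).ToArray();

    // result = ipa4 * (ipa3 - (ipa1 + ipa2)), element by element.
    public static int[] Expression(int[] ipa1, int[] ipa2, int[] ipa3, int[] ipa4) =>
        Multiply(ipa4, Subtract(ipa3, Add(ipa1, ipa2)));

    public static void Main()
    {
        int[] ipa1 = { 1, 2, 3, 4, 5 };
        int[] ipa2 = { 10, 20, 30, 40, 50 };
        int[] ipa3 = { 21, 5, 7, 4, 8 };
        int[] ipa4 = { 4, 1, 7, 2, 5 };

        Console.WriteLine(string.Join(" ",
            Expression(ipa1, ipa2, ipa3, ipa4)));   // 40 -17 -182 -80 -235

        // The earlier four-element addition example, for comparison.
        Console.WriteLine(string.Join(" ",
            Add(new[] { 1, 2, 3, 4 }, new[] { 5, 6, 7, 8 })));   // 6 8 10 12
    }
}
```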