Eliminating the Hardware/Software Divide
Satnam Singh, Microsoft Research Cambridge, UK
IRQ, NMI
locks, monitors, condition variables, spin locks, priority inversion
multiple independent multi-ported memories; hard and soft embedded processors; fine-grain parallelism and pipelining
LUTs are just higher order functions
[Figure: lut1, lut2, lut3 and lut4 blocks with inputs i0 to i3 and output o]
inv = lut1 not
and2 = lut2 (&&)
mux = lut3 (\s d0 d1 -> if s then d1 else d0)
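The idea on this slide can be sketched in C#, the language used for the FIR example later in the deck. This is a minimal illustration, not the library the slide is taken from: the class name `Lut` and the methods `Lut1`, `Lut2`, `Lut3` are hypothetical. Each takes an ordinary Boolean function, evaluates it once per input combination to fill a truth table, and returns a function that only looks values up in that table, as a hardware LUT does.

```csharp
using System;

public static class Lut
{
    // 1-input LUT: store f's two possible outputs, then index by the input.
    public static Func<bool, bool> Lut1(Func<bool, bool> f)
    {
        bool[] table = { f(false), f(true) };
        return i0 => table[i0 ? 1 : 0];
    }

    // 2-input LUT: four table entries, addressed by the bits (i1, i0).
    public static Func<bool, bool, bool> Lut2(Func<bool, bool, bool> f)
    {
        var table = new bool[4];
        for (int n = 0; n < 4; n++)
            table[n] = f((n & 1) != 0, (n & 2) != 0);
        return (i0, i1) => table[(i0 ? 1 : 0) | (i1 ? 2 : 0)];
    }

    // 3-input LUT: eight table entries, addressed by the bits (i2, i1, i0).
    public static Func<bool, bool, bool, bool> Lut3(Func<bool, bool, bool, bool> f)
    {
        var table = new bool[8];
        for (int n = 0; n < 8; n++)
            table[n] = f((n & 1) != 0, (n & 2) != 0, (n & 4) != 0);
        return (i0, i1, i2) => table[(i0 ? 1 : 0) | (i1 ? 2 : 0) | (i2 ? 4 : 0)];
    }

    // The three circuits named on the slide, built by passing functions in.
    public static readonly Func<bool, bool> Inv = Lut1(b => !b);
    public static readonly Func<bool, bool, bool> And2 = Lut2((a, b) => a && b);
    public static readonly Func<bool, bool, bool, bool> Mux =
        Lut3((s, d0, d1) => s ? d1 : d0);
}
```

For instance, `Lut.Mux(true, false, true)` selects `d1` and `Lut.Mux(false, false, true)` selects `d0`, matching the `if s then d1 else d0` definition.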
XC6VLX760: 758,784 logic cells, 864 DSP blocks, 1,440 dual-ported 18Kb RAMs
32-bit integer adder (32/474,240), >700MHz
332x1440
14820 sim-adds
1,037,400,000,000 additions/second
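The adder count can be reproduced with a little arithmetic. The sketch below assumes that "32/474,240" means 32 LUTs used per adder out of 474,240 LUTs available on the device; the class and member names are hypothetical. Dividing gives 14,820 adders, and multiplying by a clock rate gives additions per second. Note that the quoted 1,037,400,000,000 equals 14,820 adders at 70 MHz, whereas at 700 MHz the same product is 10,374,000,000,000.

```csharp
public static class AdderThroughput
{
    // Assumption: LUTs available on the device and LUTs used per 32-bit adder.
    public const long TotalLuts = 474_240;
    public const long LutsPerAdder = 32;

    // Number of adders that fit if every LUT is spent on adders: 14,820.
    public const long Adders = TotalLuts / LutsPerAdder;

    // Additions per second if every adder completes one add per clock cycle.
    public static long AdditionsPerSecond(long clockHz) => Adders * clockHz;
}
```

`AdderThroughput.AdditionsPerSecond(70_000_000)` returns 1,037,400,000,000, and `AdderThroughput.AdditionsPerSecond(700_000_000)` returns 10,374,000,000,000.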
XD2000i: FPGA in-socket accelerator for Intel FSB
XD2000F: FPGA in-socket accelerator for AMD socket F
XD1000: FPGA co-processor module for socket 940
Case Study – Spam Filtering (Alessandro Forin, MSR Redmond)
• Benchmark
  • ~50,000 regular expressions from the Forefront Team (snapshot from their Exchange server in Aug ‘09)
• Performance
  • Up to 6000x faster than standard Intel processors
  • Capable of processing at the line rate of gigabit Ethernet
• Power Requirement
  • 7 – 10 watts rather than 200++ watts
Software Version: “E-mail Server”, Reg Ex Processing, ~1 Message/Sec, 200++ Watts
FPGA Version: “E-mail Server”, Reg Ex Processing, ~6000 Messages/Sec, <10 Watts
René Müller (ETH) FPGAs + SQL [VLDB]
CPU FPGA
541 seconds 1896 seconds
scientific computing, data mining, search, image processing, financial analytics
opportunity, challenge
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    // Assumption: one filter tap per weight; the slide does not show where size is defined.
    int size = weights.Length;
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
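The sequential filter above keeps a sliding window of the most recent samples, so at step i the window holds input[i], input[i - 1], and so on, with zeros before the start of the input. A minimal sketch of the same computation written directly as y[i] = sum over k of weights[k] * input[i - k] is given below. It is not from the slides: the names `FirSketch` and `DirectFir` are hypothetical, and it assumes the filter has one tap per weight. The form makes it visible that each output depends only on the inputs, not on earlier outputs, which is what makes the filter a candidate for data-parallel execution.

```csharp
public static class FirSketch
{
    // Direct-form FIR: each output is a weighted sum of the current and
    // previous input samples. Samples before index 0 are treated as zero.
    public static int[] DirectFir(int[] weights, int[] input)
    {
        var result = new int[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            int sum = 0;
            // k <= i skips the terms that would read before the first sample.
            for (int k = 0; k < weights.Length && k <= i; k++)
                sum += weights[k] * input[i - k];
            result[i] = sum;
        }
        return result;
    }
}
```

With weights {1, 2} and input {1, 2, 3}, `FirSketch.DirectFir` returns {1, 4, 7}: 1·1, then 1·2 + 2·1, then 1·3 + 2·2.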
ray of light
Signal, Esterel, SHIM, Accelerator, RapidMind/Ct, Streams-C, Bluespec, Liquid Metal, Feldspar, PRET-C
embedded DSLs, high level software, machine learning, universal language?
GPU, FPGA, DSP
Gannet, grand unification theory, polyglots
Our High Level Synthesis Projects
• Kiwi: concurrent C# programs for control-oriented applications [David Greaves, Univ. Cambridge]
• shape analysis: synthesis of dynamic data structures (C) [MPI and CMU]
• Accelerator/FPGA: synthesis of data parallel programs in C++/C#/F# [MSR Redmond]
• HLINQ eDSLs [Gavin Bierman]
• + compilation of self-recursive Haskell functions to FPGA circuits!
Redmond Accelerator Team: Barry Bond, Kerry Hammil, Lubomir Litchev, <anonymous other person>
Effort vs. Reward
CUDA, OpenCL, HLSL, DirectCompute, Accelerator
low effort, low reward; medium effort, medium reward; high effort, high reward