Eliminating the Hardware/Software Divide

Satnam Singh, Microsoft Research Cambridge, UK

IRQ, NMI

locks, monitors, condition variables, spin locks, priority inversion

multiple independent multi-ported memories
hard and soft embedded processors
fine-grain parallelism and pipelining

LUTs are just higher order functions
[Diagram: lut1, lut2, lut3 and lut4 blocks, with inputs i0 to i3 and output o]
inv = lut1 not
and2 = lut2 (&&)
mux = lut3 (\s d0 d1 -> if s then d1 else d0)
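The idea on this slide can be tried out in plain Haskell. What follows is a minimal software model, not the actual circuit library: it assumes each lutN simply applies the Boolean function it is handed, whereas a real hardware description library would emit a netlist node. The tuple-based input types and the helper lut2Table are hypothetical names introduced for illustration.

```haskell
-- Hypothetical software model of LUTs as higher-order functions.
-- Assumption: a LUT is modelled by the Boolean function it implements;
-- a real circuit library would build a netlist node instead.

lut1 :: (Bool -> Bool) -> Bool -> Bool
lut1 f = f

lut2 :: (Bool -> Bool -> Bool) -> (Bool, Bool) -> Bool
lut2 f (i0, i1) = f i0 i1

lut3 :: (Bool -> Bool -> Bool -> Bool) -> (Bool, Bool, Bool) -> Bool
lut3 f (i0, i1, i2) = f i0 i1 i2

-- The three gates from the slide, each defined by passing a function to a LUT.
inv :: Bool -> Bool
inv = lut1 not

and2 :: (Bool, Bool) -> Bool
and2 = lut2 (&&)

mux :: (Bool, Bool, Bool) -> Bool
mux = lut3 (\s d0 d1 -> if s then d1 else d0)

-- Illustrative helper: the truth table a 2-input LUT would be programmed
-- with, listed with i0 varying fastest.
lut2Table :: (Bool -> Bool -> Bool) -> [Bool]
lut2Table f = [f i0 i1 | i1 <- [False, True], i0 <- [False, True]]
```

Loaded into GHCi, mux (False, True, False) selects d0 and mux (True, True, False) selects d1. The point of the model is that the function argument plays the role of the LUT's programming bits, which lut2Table makes explicit by enumerating them.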

XC6VLX760: 758,784 logic cells, 864 DSP blocks, 1,440 dual ported 18Kb RAMs
332x1440
32-bit integer Adder (32/474,240), >700MHz
14,820 sim-adds: 1,037,400,000,000 additions/second
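The adder count and throughput on this slide can be rechecked with a few lines of arithmetic. The sketch below assumes that 474,240 is the device's total count of whatever resource a 32-bit adder consumes 32 of, and that every adder completes one addition per clock cycle. The names are made up for illustration.

```haskell
-- Figures copied from the slide; all names are illustrative.
-- Assumption: 474,240 is the total number of units of the resource
-- that one 32-bit adder uses 32 of.
totalUnits, unitsPerAdder :: Integer
totalUnits = 474240
unitsPerAdder = 32

-- Number of 32-bit adders that fit on the device at once.
adders :: Integer
adders = totalUnits `div` unitsPerAdder

-- Additions per second, assuming each adder completes one add per cycle.
additionsPerSecond :: Integer -> Integer
additionsPerSecond clockHz = adders * clockHz
```

Under these assumptions adders evaluates to 14820, matching the slide. additionsPerSecond 700000000 gives 10,374,000,000,000, while the slide's 1,037,400,000,000 is what additionsPerSecond 70000000 returns, so the quoted clock rate and the quoted total appear to differ by a factor of ten. The slide text alone does not say which one was intended.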

XD2000i FPGA in-socket accelerator for Intel FSB XD2000F FPGA in-socket accelerator for AMD socket F XD1000 FPGA co-processor module for socket 940

Case Study – Spam Filtering (Alessandro Forin, MSR Redmond)
• Benchmark
  • ~50,000 regular expressions from the Forefront Team (snapshot from their Exchange server in Aug '09)
• Performance
  • Up to 6000x faster than standard Intel processors
  • Capable of processing at line rate of gigabit Ethernet
• Power Requirement
  • 7 – 10 watts rather than 200++ watts

Software Version: "E-mail Server" with Reg Ex Processing, ~1 Message/Sec, 200++ Watts
FPGA Version: "E-mail Server" with Reg Ex Processing, ~6000 Messages/Sec, <10 Watts

René Müller (ETH) FPGAs + SQL [VLDB]

CPU FPGA

541 seconds 1896 seconds

scientific computing, data mining, search, image processing, financial analytics
opportunity, challenge

The Accidental Semi-colon ;

public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    // Assumption: the slide uses `size` without defining it; here it is
    // taken to be the number of weights.
    int size = weights.Length;
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
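For contrast with the loop-based version, the same filter can be written as a whole-array expression: a sum of shifted, weighted copies of the input, with no statement ordering imposed. The Haskell sketch below is an illustration of that formulation rather than code from the talk. It assumes the window length equals the number of weights, and the names fir and shiftRight are hypothetical.

```haskell
-- Data-parallel FIR sketch: result[i] = sum over z of weights[z] * input[i - z],
-- with samples before the start of the input treated as zero.
-- Assumption: the window size equals the number of weights.
fir :: [Int] -> [Int] -> [Int]
fir weights input =
  foldr (zipWith (+)) (replicate n 0)
        [ map (w *) (shiftRight z input) | (z, w) <- zip [0 ..] weights ]
  where
    n = length input
    -- Delay the signal by z samples, filling with zeros and keeping length n.
    shiftRight z xs = take n (replicate z 0 ++ xs)
```

For example, fir [1, 2] [1, 2, 3] yields [1, 4, 7], the same values the sequential loop produces for those inputs. Each shifted copy, each multiplication and each element of the final sum is independent of the others, which is the kind of freedom a hardware or GPU back end can exploit and which the nested loops hide behind an explicit order of evaluation.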

PLDI 1998

PLDI 2003

PLDI 2010

POPL 1998

POPL 2002

POPL 2010

ray of light
Signal, Esterel, SHIM, Accelerator, RapidMind/Ct, Streams-C, Bluespec, Liquid Metal, Feldspar, PRET-C

embedded DSLs, high level software, machine learning, universal language?, GPU, FPGA, DSP, Gannet, grand unification theory, polyglots

Our High Level Synthesis Projects
• Kiwi: concurrent C# programs for control-oriented applications [David Greaves, Univ. Cambridge]
• shape analysis: synthesis of dynamic data structures (C) [MPI and CMU]
• Accelerator/FPGA: synthesis of data parallel programs in C++/C#/F# [MSR Redmond]
• HLINQ eDSLs [Gavin Bierman]
+ compilation of self-recursive Haskell functions to FPGA circuits!

Redmond Accelerator Team
Barry Bond, Kerry Hammil, Lubomir Litchev, <anonymous other person>

Effort vs. Reward
[Chart: CUDA, OpenCL, HLSL, DirectCompute and Accelerator placed on a scale running from low effort / low reward through medium effort / medium reward to high effort / high reward]

Accelerator
