Hardware-Software Codesign Kermin Fleming Computer Science & Artificial Intelligence Lab

Hardware-Software Codesign
Kermin Fleming
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology
Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar
http://csg.csail.mit.edu/6.375

Hello, world!

Hardware (Bluespec):

    module mkHello#(TOP_LEVEL_WIRES wires);
        CHANNEL_IFC channel <- mkChannel(wires); // has a software counterpart
        Reg#(Bit#(8)) count <- mkReg(0);
        Reg#(Bit#(5)) state <- mkReg(0);

        // wait for software to send the repeat count
        rule init (count == 0);
            let n <- channel.recv();
            count <= n;
            state <= 0;
        endrule

        // send one character per firing; state 16 ends one greeting
        rule hello (count != 0);
            case (state)
                0: channel.send('H');
                1: channel.send('e');
                2: channel.send('l');
                3: channel.send('l');
                ...
                16: count <= count - 1;
            endcase
            if (state != 16)
                state <= state + 1;
            else
                state <= 0;
        endrule
    endmodule

Software (C):

    #include <stdio.h>
    #include <stdlib.h>

    int main (int argc, char* argv[])
    {
        int n = atoi(argv[1]);
        for (int i = 0; i < n; i++) {
            printf("Hello, world!\n");
        }
        return 0;
    }

Today’s Lecture
• Case Study: IMDCT
• Interfacing with HW
• Extracting Parallelism
• Automated Solutions
  • Bluespec Inc.: SCE-MI
  • Intel/MIT: LEAP RRR

Ogg Vorbis Pipeline
• Ogg Vorbis is an audio compression format roughly comparable to other compression formats, e.g. MP3, AAC, WMA
• Input is a stream of compressed bits
• The stream is parsed into frame residues and floor “predictions”
• The summed frequency results are converted to time-valued sequences
• Final frames are windowed to smooth out irregularities
• IMDCT takes the most computation
[Figure: pipeline stages: Bits → Stream Parser → Residue Decoder / Floor Decoder → IMDCT → Windowing → PCM Output]

IMDCT
Suppose we want to use hardware to accelerate the FFT/IFFT computation.

    Array imdct(int N, Array vx){
        // preprocessing loop
        for(i = 0; i < N; i++){
            vin[i]   = convertLo(i,N,vx[i]);
            vin[i+N] = convertHi(i,N,vx[i]);
        }
        // do the IFFT
        vifft = ifft(2*N, vin);
        // postprocessing loop
        for(i = 0; i < N; i++){
            int idx = bitReverse(i);
            vout[idx] = convertResult(i,N,vifft[i]);
        }
        return vout;
    }

IMDCT

    Array imdct(int N, Array vx){
        // preprocessing loop
        for(i = 0; i < N; i++){
            vin[i]   = convertLo(i,N,vx[i]);
            vin[i+N] = convertHi(i,N,vx[i]);
        }
        // call the hardware (replaces the software call vifft = ifft(2*N, vin))
        vifft = call_hw(2*N, vin);
        // postprocessing loop
        for(i = 0; i < N; i++){
            int idx = bitReverse(i);
            vout[idx] = convertResult(i,N,vifft[i]);
        }
        return vout;
    }

• Implement or find a hardware IFFT
• How will the HW/SW communication work?
• How do we explore design alternatives?
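The postprocessing loop relies on bitReverse, which the slides use but never define. A minimal sketch of one possible definition follows. The function name bit_reverse and the explicit bits parameter are assumptions: the slides do not say how the index width is chosen, so here the caller passes it in.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the low `bits` bits of `i`.
// Hypothetical helper: the slides call bitReverse(i) without showing its
// definition or the index width, so the width is an explicit argument here.
inline std::uint32_t bit_reverse(std::uint32_t i, unsigned bits) {
    assert(bits <= 32);
    std::uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);  // append the current lowest bit of i to r
        i >>= 1;
    }
    return r;
}
```

With a 3-bit index, for example, bit_reverse(6, 3) maps binary 110 to 011.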

HW Accelerator in a system
• Communication via bus
• DMA transfer?
• Accelerators are all multiplexed on bus
• Possibly introduces conflicts
• Fair sharing of bus bandwidth
[Figure: Software on the CPU connected over the Bus (PCI Express) to HW IFFT Accelerator 1 and HW IFFT Accelerator 2]

The HW Interface
• SW calls turn into a set of memory-mapped calls through the bus
• Three communication tasks
  • Set size of IFFT
  • Enter data stream
  • Take output out
[Figure: accelerator ports setSize, inputData and outputData attached to the Bus (PCI Express)]
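The three communication tasks can be written down as a small software-side interface. The sketch below is only an illustration: IfftPort and LoopbackPort are hypothetical names, the 32-bit word type is an assumption, and the loopback class merely echoes a completed frame so the calling pattern can be tried without hardware. It performs no IFFT and no bus access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical software-side view of the accelerator's three ports.
struct IfftPort {
    virtual ~IfftPort() = default;
    virtual void setSize(std::size_t n) = 0;  // set size of IFFT
    virtual void put(std::int32_t word) = 0;  // enter data stream
    virtual std::int32_t get() = 0;           // take output out
};

// Stand-in for the hardware: releases the input words unchanged once a
// full frame of `size_` words has arrived.
class LoopbackPort : public IfftPort {
public:
    void setSize(std::size_t n) override { size_ = n; }
    void put(std::int32_t word) override {
        in_.push_back(word);
        if (in_.size() == size_) {  // frame complete: move it to the output
            out_.insert(out_.end(), in_.begin(), in_.end());
            in_.clear();
        }
    }
    std::int32_t get() override {
        assert(!out_.empty());      // real hardware would block here instead
        std::int32_t w = out_.front();
        out_.pop_front();
        return w;
    }
    std::size_t pending() const { return out_.size(); }
private:
    std::size_t size_ = 0;
    std::deque<std::int32_t> in_, out_;
};
```

Coding against the abstract port rather than raw addresses lets the same driver loop run against a software model first and the memory-mapped device later.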

Data Compatibility Issue
IFFT takes Complex fixed point numbers. How do we represent such numbers in C and in RTL?

C++:

    template <typename F, typename I>
    struct FixedPt{
        F fract;
        I integer;
    };

    template <typename T>
    struct Complex{
        T rel;
        T img;
    };

Verilog:

    typedef struct {
        bit [31:0] fract;
        bit [31:0] integer;
    } FixedPt;

    typedef struct {
        FixedPt rel;
        FixedPt img;
    } Complex_FixedPt;

Data Compatibility
• Keeping the HW and SW representations consistent is tedious and error prone
  • Issues of endianness (bit and byte)
  • Layout changes based on C compiler (gcc vs. icc vs. msvc++)
• Some SW representations do not have a natural HW analog
  • What is a pointer? Do we disallow passing trees and lists directly?
• Ideally translation should be automatically generated
Let us assume that the data compatibility issues have been solved and focus on control issues.
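One common way to sidestep compiler-dependent layout and host endianness is to serialize each field explicitly rather than copying the struct's bytes. The sketch below assumes the 32-bit fields of the Verilog declaration, a little-endian byte order on the wire, and the field order rel.fract, rel.integer, img.fract, img.integer. All three are illustrative choices, as are the names Wire, pack and unpack; the slides do not fix a wire format.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct FixedPt { std::uint32_t fract; std::uint32_t integer; };
struct ComplexFixedPt { FixedPt rel; FixedPt img; };

// Write a 32-bit value least-significant byte first, whatever the host order.
inline void put_le32(std::uint8_t* dst, std::uint32_t v) {
    for (int b = 0; b < 4; ++b)
        dst[b] = static_cast<std::uint8_t>((v >> (8 * b)) & 0xFFu);
}

// Read back a 32-bit value written by put_le32.
inline std::uint32_t get_le32(const std::uint8_t* src) {
    std::uint32_t v = 0;
    for (int b = 0; b < 4; ++b)
        v |= static_cast<std::uint32_t>(src[b]) << (8 * b);
    return v;
}

using Wire = std::array<std::uint8_t, 16>;  // assumed 128-bit wire word

// Assumed field order: rel.fract, rel.integer, img.fract, img.integer.
inline Wire pack(const ComplexFixedPt& c) {
    Wire w{};
    put_le32(&w[0],  c.rel.fract);
    put_le32(&w[4],  c.rel.integer);
    put_le32(&w[8],  c.img.fract);
    put_le32(&w[12], c.img.integer);
    return w;
}

inline ComplexFixedPt unpack(const Wire& w) {
    return ComplexFixedPt{ FixedPt{get_le32(&w[0]), get_le32(&w[4])},
                           FixedPt{get_le32(&w[8]), get_le32(&w[12])} };
}
```

Because every byte position is written by hand, struct padding and compiler choice no longer affect what the hardware sees. The cost is exactly the tedium the slide mentions, which is why generated translators are attractive.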

First Attempt at Acceleration

    Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx){
        // preprocessing loop
        for(i = 0; i < N; i++){
            vin[i]   = convertLo(i,N,vx[i]);
            vin[i+N] = convertHi(i,N,vx[i]);
        }
        // sets size
        pcie_ifc.setSize(2*N);
        // sends 1 element per call
        for(i = 0; i < 2*N; i++)
            pcie_ifc.put(vin[i]);
        // gets 1 element per call; software blocks until response exists
        for(i = 0; i < 2*N; i++)
            vifft[i] = pcie_ifc.get();
        // postprocessing loop
        for(i = 0; i < N; i++){
            int idx = bitReverse(i);
            vout[idx] = convertResult(i,N,vifft[i]);
        }
        return vout;
    }

Exposing more details

    // mem-mapped hw register
    volatile int* hw_flag = …
    // mem-mapped hw frame buffer
    volatile int* fbuffer = …

    Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx){
        …
        assert(*hw_flag == IDLE);
        // copy the input frame into the hardware buffer
        for(cnt = 0; cnt < n; cnt++)
            *(fbuffer + cnt) = frame[cnt];
        *hw_flag = GO;
        // busy-wait until the hardware reports completion
        while(*hw_flag != IDLE) {;}
        // copy the result back (index advanced once per iteration)
        for(cnt = 0; cnt < n*2; cnt++)
            frame[cnt] = *(fbuffer + cnt);
        …
    }

What happens if SW has a cache?

Issues
• Are the internal hardware conditions exposed correctly by the hw_flag control register?
• Blocking SW is problematic:
  • Prevents the processor from doing anything while the accelerator is in use
  • Hard to pipeline the accelerator
  • Does not handle variation in timing well

Driving a Pipelined HW

    …
    int pid = fork();
    if(pid){
        // producer process
        while(…) {
            …
            for(i = 0; i < 2*N; i++)
                pcie.put(vin[i]);
        }
    } else {
        // consumer process
        while(…){
            for(i = 0; i < 2*N; i++)
                v[i] = pcie.get();
            …
        }
    }

• Multiple processes exploit pipeline parallelism in the IFFT accelerator.
• How does the BSV exert back pressure on the producer thread?
• How does the consumer thread exert back pressure on the BSV module?
• What if our frames are really large? Could the HW begin working before the entire frame is transmitted?
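Back pressure in both directions can be modeled in software with a bounded channel: a full channel stalls the producer and an empty one stalls the consumer. The sketch below is an analogy rather than the PCIe driver's actual mechanism. BoundedChannel is a hypothetical class, and because it uses a mutex it applies to threads within one process, not to the fork()ed processes shown on the slide.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class BoundedChannel {
public:
    explicit BoundedChannel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    // Blocks while the channel is full: back pressure on the producer.
    void put(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < capacity_; });
        q_.push_back(std::move(v));
        not_empty_.notify_one();
    }
    // Blocks while the channel is empty: the consumer waits for data.
    T get() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return v;
    }
    // Non-blocking variant: reports a full channel instead of waiting.
    bool try_put(T v) {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.size() >= capacity_) return false;
        q_.push_back(std::move(v));
        not_empty_.notify_one();
        return true;
    }
private:
    const std::size_t capacity_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<T> q_;
};
```

The capacity plays the role of the buffering between software and the accelerator: a larger value decouples the two sides more, while a capacity of one forces them into lock step.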

Data Parallelism 1

    …
    SyncQueue<Complex<…>> workQ;
    int pid = fork();
    // both threads do same work
    while(…) {
        Complex<FixedPt>* vin = workQ.pop();
        …
        for(i = 0; i < 2*N; i++)
            pcie.put(vin[i]);
        for(i = 0; i < 2*N; i++)
            v[i] = pcie.get();
        …
    }

• How do we isolate each thread’s use of the HW accelerator?
• Do two synchronization points (workQ and the HW accelerator) cause our design to deadlock?
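One simple answer to the isolation question is to treat a whole put/get exchange as a single transaction guarded by a lock. The sketch below assumes threads in one process and uses hypothetical names throughout: EchoDevice is a stand-in that returns words unchanged, and ExclusiveAccelerator is the wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Stand-in device: returns words in the order they were sent (no IFFT).
struct EchoDevice {
    std::deque<int> fifo;
    void put(int w) { fifo.push_back(w); }
    int get() { int w = fifo.front(); fifo.pop_front(); return w; }
};

// Holds one lock for the whole put/get transaction so that frames from
// different threads cannot interleave on the shared device.
template <typename Device>
class ExclusiveAccelerator {
public:
    explicit ExclusiveAccelerator(Device& dev) : dev_(dev) {}

    std::vector<int> process(const std::vector<int>& frame) {
        std::lock_guard<std::mutex> guard(lock_);
        for (int w : frame) dev_.put(w);
        std::vector<int> out;
        out.reserve(frame.size());
        for (std::size_t i = 0; i < frame.size(); ++i)
            out.push_back(dev_.get());
        return out;
    }
private:
    Device& dev_;
    std::mutex lock_;
};
```

The trade-off is that transactions are fully serialized, so threads no longer overlap inside the accelerator's pipeline. Popping from the work queue before calling process, rather than while holding any other lock, keeps the two synchronization points from nesting.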

Data Parallelism 2

    PCIE get_hw(int pid){
        if(pid==0) return pcieA;
        else return pcieB;
    }

    …
    SyncQueue<Complex<…>> workQ;
    int pid = fork();
    // both threads do same work
    while(…) {
        Complex<FixedPt>* vin = workQ.pop();
        …
        for(i = 0; i < 2*N; i++)
            get_hw(pid).put(vin[i]);
        for(i = 0; i < 2*N; i++)
            v[i] = get_hw(pid).get();
        …
    }

• By giving each thread its own HW accelerator, we have further increased data parallelism
• If the HW is not the bottleneck this could be a waste of resources.
• Do we multiplex the use of the physical bus between the two threads?

Multithreading without threads or processes

    int icnt = 0, ocnt = 0;
    Complex iframe[sz];
    Complex oframe[sz];
    …
    // IMDCT loop
    while(…){
        …
        // producer “thread”: at most 2 sends per pass
        for(i = 0; i < 2 && icnt < n; i++)
            if(pcie.can_put())
                pcie.put(iframe[icnt++]);
        // consumer “thread”: at most 2 receives per pass
        for(i = 0; i < 2 && ocnt < n*2; i++)
            if(pcie.can_get())
                oframe[ocnt++] = pcie.get();
        …
    }

• Embedded execution environments often have little or no OS support, so multithreading must be emulated in user code
• Getting the arbitration right is a complex task
• All existing issues are compounded by the complexity of the duplicated state for each “thread”
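The polling pattern above can be exercised against a small software model. In the sketch below every name is hypothetical: PollDevice is a two-entry echo FIFO standing in for the accelerator, Progress holds the per-"thread" counters that a real thread would keep on its stack, and step performs one pass of the loop. Unlike the slide, the output frame here has the same length as the input frame, since the stand-in only echoes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical non-blocking device model: a small FIFO that echoes its input.
struct PollDevice {
    std::size_t capacity = 2;
    std::deque<int> fifo;
    bool can_put() const { return fifo.size() < capacity; }
    bool can_get() const { return !fifo.empty(); }
    void put(int w) { fifo.push_back(w); }
    int get() { int w = fifo.front(); fifo.pop_front(); return w; }
};

// State that a real thread would keep implicitly, held explicitly instead.
struct Progress {
    std::size_t icnt = 0;  // words sent so far
    std::size_t ocnt = 0;  // words received so far
};

// One scheduling pass: at most `burst` sends, then at most `burst` receives.
// Never blocks; returns true once the whole output frame has been filled.
inline bool step(PollDevice& dev, const std::vector<int>& iframe,
                 std::vector<int>& oframe, Progress& p,
                 std::size_t burst = 2) {
    for (std::size_t i = 0; i < burst && p.icnt < iframe.size(); ++i)
        if (dev.can_put()) dev.put(iframe[p.icnt++]);
    for (std::size_t i = 0; i < burst && p.ocnt < oframe.size(); ++i)
        if (dev.can_get()) oframe[p.ocnt++] = dev.get();
    return p.ocnt == oframe.size();
}
```

The burst limit is the arbitration policy: it bounds how long either "thread" can run before the other gets a turn, and other work can be interleaved between calls to step.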

The message
• Writing SW which can safely exploit HW parallelism is difficult…
• Particularly difficult if shared resources (e.g. bus) are involved
There is a need for automated solutions that do a good job.


Bluespec Co-design: SCE-MI
• Circuit verification is difficult
• Billions of cycles of gate-level simulation
• How do we retain cycle accuracy?
• Use SCE-MI
*Target: WiFi Transceiver

SCE-MI
• Use gated clocks to preserve cycle-accuracy
• Circuit internals run at “Model Clock”
• “Model Clock” ticks only when inputs and outputs to the circuit stabilize
• Another Co-design problem
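The gating rule can be illustrated with a toy software model. The sketch below is not the SCE-MI API: GatedModel and its members are hypothetical, "stable" is simplified to "an input word is available and the output has room", and the circuit's work is a placeholder increment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

struct GatedModel {
    std::deque<int> in;              // words delivered by the input side
    std::deque<int> out;             // words waiting for the output side
    std::size_t out_capacity = 1;
    std::uint64_t host_cycles = 0;   // free-running clock
    std::uint64_t model_cycles = 0;  // gated clock seen by the circuit

    // Advance the free-running clock by one cycle. The model clock ticks
    // only if an input is available and the output has room; otherwise the
    // circuit is frozen for this cycle.
    bool host_tick() {
        ++host_cycles;
        if (in.empty() || out.size() >= out_capacity) return false;  // gated
        int x = in.front();
        in.pop_front();
        out.push_back(x + 1);        // placeholder for one cycle of circuit work
        ++model_cycles;
        return true;
    }
};
```

Since the circuit only advances on model clock ticks, its behavior per model cycle is the same however slowly the software side supplies inputs or drains outputs. That independence is what preserves cycle accuracy.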

Bluespec SCE-MI
• Used already in Lab
• With a controlled clock on the FPGA
• Bluespec has a rich SCE-MI library
• Get/Put transactors provided
• User provides C++ and HW transactors for exotic interfaces

Intel/MIT: LEAP RRR
• Asynchronous Remote Request-Response stack for FPGA
• Uses common Client/Server paradigm
• Similar in many respects to Bluespec SCE-MI
• Constrained user interface
• Open, many platforms supported

Client/Server interfaces
• Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose

    interface Client #(req_t, resp_t);
        interface Get#(req_t)  request;
        interface Put#(resp_t) response;
    endinterface

    interface Server #(req_t, resp_t);
        interface Put#(req_t)  request;
        interface Get#(resp_t) response;
    endinterface

[Figure: a client and a server connected back to back; the client's get (req_t) feeds the server's put and the server's get (resp_t) feeds the client's put, each over enable/ready/data wires]
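The duality is easy to see in a software analog of these interfaces. The C++ sketch below mirrors the Bluespec declarations; Fifo, forward_request and forward_response are hypothetical helpers added for illustration and do not correspond to Bluespec library functions.

```cpp
#include <cassert>
#include <deque>

template <typename T>
struct Get { virtual ~Get() = default; virtual T get() = 0; };

template <typename T>
struct Put { virtual ~Put() = default; virtual void put(const T& v) = 0; };

// A client produces requests and consumes responses.
template <typename Req, typename Resp>
struct Client { Get<Req>& request; Put<Resp>& response; };

// A server is the dual: it consumes requests and produces responses.
template <typename Req, typename Resp>
struct Server { Put<Req>& request; Get<Resp>& response; };

// A FIFO exposes both ends, so it can back either side of a connection.
template <typename T>
struct Fifo : Get<T>, Put<T> {
    std::deque<T> q;
    T get() override {
        assert(!q.empty());
        T v = q.front();
        q.pop_front();
        return v;
    }
    void put(const T& v) override { q.push_back(v); }
};

// Move one request from client to server.
template <typename Req, typename Resp>
void forward_request(Client<Req, Resp>& c, Server<Req, Resp>& s) {
    s.request.put(c.request.get());
}

// Move one response from server back to client.
template <typename Req, typename Resp>
void forward_response(Client<Req, Resp>& c, Server<Req, Resp>& s) {
    c.response.put(s.response.get());
}
```

Every Get on one side lines up with a Put of the same type on the other, which is why a client and a server with matching request and response types can always be wired together mechanically.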

RRR Specification Language

    // ----------------------------------------
    // create a new service called ISA_EMULATOR
    // ----------------------------------------
    service ISA_EMULATOR
    {
        // --------------------------------
        // declare services provided by CPU
        // --------------------------------
        server CPU <- FPGA;
        {
            method UpdateRegister(in REG_INDEX, in REG_VALUE);
            method Emulate(in INST_INFO, out INST_ADDR);
        };

        // ---------------------------------
        // declare services provided by FPGA
        // ---------------------------------
        server FPGA <- CPU;
        {
            method SyncRegister(in REG_INDEX, in REG_VALUE);
        };
    };

LEAP Abstraction Layers: RRR
[Figure: layer stack with an FPGA column and a CPU column; labels: Channel IO (one per column), FPGA Platform, Physical Devices, Kernel Driver]

LEAP Abstraction Layers: RRR
[Figure: the same stack extended upward; RRR specification files generate a Client Stub and a Server Stub, which sit on an RRR Client/Server Manager in each column, above Channel IO, FPGA Platform, Physical Devices and Kernel Driver; columns: FPGA, CPU]

LEAP Abstraction Layers: RRR

User code (client side):

    ClientStubs.ISA_EMULATOR iemu;
    ...
    iemu.UpdateRegister.Request(REG_R27, regFile[REG_R27]);
    ...
    iemu.Emulate.Request(inst);
    ...
    tgtPC <- iemu.Emulate.Response();

User code (server side):

    ISA_EMULATOR::UpdateRegister(REG_INDEX i, REG_VALUE v)
    {
        regFile[i] = v;
    }

    ISA_EMULATOR::Emulate(INST_INFO inst)
    {
        // emulate the instruction
        return target_PC;
    }

[Figure: User Code on top of the Client Stub and Server Stub, above the RRR Client/Server Managers, Channel IO, FPGA Platform, Physical Devices and Kernel Driver; columns: FPGA, CPU]

LEAP Abstraction Layers: RRR
[Figure: a User Application spanning both columns on top of several stubs, above the RRR Client/Server Managers, Channel IO, FPGA Platform, Physical Devices and Kernel Driver; columns: FPGA, CPU]

Conclusion
• Writing SW which can safely exploit HW parallelism is difficult…
• Several automated tools are available
• Development ongoing
