30/3/2014 • Performed by: Yahel Ben-Avraham and Yaron Rimmer • Instructor: Mony Orbach • Bi-semesterial, 2012 - 2014 • RISC processor implementation using Bluespec, part 2 - final presentation
Project goals • Goal: Implementing and analyzing a RISC processor using Bluespec Verilog • Part A: • Studying the working environment, the BSV language and the basic processor implementation. • Implementing a simple RISC processor. • Running a simple test bench on the FPGA system.
Project goals • Goal: Implementing and analyzing RISC Processor using Bluespec Verilog • Part B: • Ramp up the design: • Wider instruction set • Branch prediction (and flushing) • Hazard detection unit and extended Data forwarding • Performance counters • Run the design on the FPGA system
Pipeline Datapath • [Figure: pipeline stages FETCH, DEC, EXE, MEM1, MEM2, WB, with the Instruction Memory, Branch Predictor, Forwarding, Register File and Memory blocks]
Fetch • Tag the instruction’s metadata (PC, cycle) • Fetch the requested instruction from the instruction memory • Update next PC • Get next PC’s branch prediction and branch address • Check for Jump command
Decode • Fully parse the received instruction • Pre-fetch data from registers potentially in use
Execute • According to the instruction’s opcode: • ALU instruction: compute the result • Memory instruction: calculate memory address to read / write to • Branch instruction: check if branch is taken and update branch resolution • Data forwarding
Memory 1 • Send a read / write request to the BRAM • Write : data is immediately stored • Read: wait for response in the next cycle • Otherwise, pass the incoming data
Memory 2 (mem / skipmem) • Implemented in two rules: • For memory read: get BRAM response • Otherwise, pass the incoming struct
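The split between the two memory stages can be illustrated with a small software model. This is only a sketch in C++ (the language of the test bench), not the BSV design itself: the class name `BramModel`, the 256-word size and the method names are assumptions made for illustration. It shows a write taking effect immediately, while a read request made in one cycle yields its response in the next.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical software model of a BRAM port with one-cycle read latency.
// The size below is an assumption, not taken from the project.
constexpr std::size_t kBramWords = 256;

class BramModel {
public:
    // MEM1, write: data is stored immediately, no response is produced.
    void write(std::uint32_t addr, std::uint32_t data) {
        mem_[addr % kBramWords] = data;
    }

    // MEM1, read: only the request is latched; data is not available yet.
    void requestRead(std::uint32_t addr) {
        pendingAddr_ = addr % kBramWords;
        hasPending_ = true;
    }

    // Advance one clock cycle: a pending read turns into a response.
    void tick() {
        if (hasPending_) {
            response_ = mem_[pendingAddr_];
            hasResponse_ = true;
            hasPending_ = false;
        }
    }

    // MEM2: returns true and fills `out` if a read response is waiting;
    // returns false otherwise (the "skipmem" case, data is passed through).
    bool takeResponse(std::uint32_t& out) {
        if (!hasResponse_) return false;
        out = response_;
        hasResponse_ = false;
        return true;
    }

private:
    std::array<std::uint32_t, kBramWords> mem_{};
    std::uint32_t pendingAddr_ = 0;
    std::uint32_t response_ = 0;
    bool hasPending_ = false;
    bool hasResponse_ = false;
};
```

In this model, calling `takeResponse` in the same cycle as `requestRead` finds nothing; the value only becomes available after one `tick()`.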
Writeback • Save needed data to the register file • Register 0 – read only • Communication with the wrapper • Data and statistics
Branch Prediction • 2-bit saturating local counter (initialized to WNT, weakly not taken) • Prediction is acquired in the Fetch stage • Stored and passed along the pipeline • Branch resolution is determined in the Execute stage • BP is updated accordingly • Wrong prediction? • Correcting the PC • Flushing the Decode and Execute stages
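A 2-bit saturating counter of the kind listed above can be sketched as follows. This is a generic C++ illustration, not the project's BSV code: the names `BpState`, `TwoBitPredictor` and `mispredicted` are hypothetical, and only a single counter is modeled (how counters are indexed per branch is not described here).

```cpp
#include <cassert>
#include <cstdint>

// States of a 2-bit saturating counter, ordered from "not taken" to "taken".
enum class BpState : std::uint8_t { StrongNT = 0, WeakNT = 1, WeakT = 2, StrongT = 3 };

struct TwoBitPredictor {
    BpState state = BpState::WeakNT;  // initial state WNT, as in the slides

    // Prediction read in the Fetch stage.
    bool predictTaken() const {
        return state == BpState::WeakT || state == BpState::StrongT;
    }

    // Update with the branch resolution from the Execute stage.
    // The counter moves one step and saturates at both ends.
    void update(bool taken) {
        int s = static_cast<int>(state);
        if (taken && s < 3) ++s;
        if (!taken && s > 0) --s;
        state = static_cast<BpState>(s);
    }
};

// True when the prediction carried along the pipeline disagrees with the
// resolved outcome, i.e. the PC must be corrected and younger stages flushed.
inline bool mispredicted(bool predicted, bool actual) {
    return predicted != actual;
}
```

Starting from WNT, one taken branch is enough to flip the prediction to "taken", while two consecutive mispredictions are needed to leave a strong state.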
Forwarding • 4 global forwarding registers • Each containing (when valid) address, value, cycle • Writing: at the end of the Execute stage • Reading: at the beginning of the Execute stage • Invalidating: by aging after the Execute stage
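The forwarding scheme can be pictured with a small C++ model. It is a sketch under stated assumptions, not the BSV implementation: the names `FwdEntry` and `ForwardingFile`, the replacement policy (first invalid slot, otherwise the oldest entry) and the maximum age of 3 cycles are all hypothetical choices made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One forwarding register: (when valid) register address, value, cycle.
struct FwdEntry {
    bool valid;
    std::uint8_t reg;     // destination register address
    std::uint32_t value;  // result computed in Execute
    std::uint64_t cycle;  // cycle in which the entry was written
};

// Assumed aging threshold; the real value is not given in the slides.
constexpr std::uint64_t kMaxAge = 3;

class ForwardingFile {
public:
    ForwardingFile() {
        for (auto& e : entries_) e = FwdEntry{false, 0, 0, 0};
    }

    // Beginning of Execute: find the newest valid entry for `reg`.
    bool lookup(std::uint8_t reg, std::uint32_t& out) const {
        const FwdEntry* best = nullptr;
        for (const auto& e : entries_) {
            if (e.valid && e.reg == reg && (!best || e.cycle > best->cycle)) best = &e;
        }
        if (!best) return false;
        out = best->value;
        return true;
    }

    // End of Execute: record a result in the first invalid slot,
    // or overwrite the oldest entry if all four are valid.
    void record(std::uint8_t reg, std::uint32_t value, std::uint64_t cycle) {
        FwdEntry* slot = &entries_[0];
        for (auto& e : entries_) {
            if (!e.valid) { slot = &e; break; }
            if (e.cycle < slot->cycle) slot = &e;
        }
        *slot = FwdEntry{true, reg, value, cycle};
    }

    // After Execute: invalidate entries older than kMaxAge cycles.
    void age(std::uint64_t now) {
        for (auto& e : entries_) {
            if (e.valid && now - e.cycle > kMaxAge) e.valid = false;
        }
    }

private:
    std::array<FwdEntry, 4> entries_;  // 4 global forwarding registers
};
```

Keeping the cycle in each entry lets a reader pick the most recent value when the same register was written twice, and lets aging drop entries once the value can be assumed to have reached the register file.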
Forwarding – cont. • Special case: register read after memory load • Stalling registers hold the address of the register that a load will write to • If needed, the Execute stage is stalled by keeping the current command in the dec/exec FIFO
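The read-after-load check can be sketched in C++ as below. This is an illustrative model only: `LoadTracker` and its methods are hypothetical names, and the rule that register 0 never causes a stall is an assumption derived from register 0 being read only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Tracks destination registers of loads whose data has not arrived yet
// (a software stand-in for the "stalling registers").
class LoadTracker {
public:
    // A load entered the memory stages; its destination is now pending.
    void loadIssued(std::uint8_t destReg) { pending_.push_back(destReg); }

    // The load's data is available; remove one matching pending entry.
    void loadCompleted(std::uint8_t destReg) {
        auto it = std::find(pending_.begin(), pending_.end(), destReg);
        if (it != pending_.end()) pending_.erase(it);
    }

    // Execute must stall (the command stays in the dec/exec FIFO) when
    // either source register is the destination of a pending load.
    // Assumption: register 0 is read only, so it never creates a dependency.
    bool mustStall(std::uint8_t src1, std::uint8_t src2) const {
        for (std::uint8_t r : pending_) {
            if (r != 0 && (r == src1 || r == src2)) return true;
        }
        return false;
    }

private:
    std::vector<std::uint8_t> pending_;
};
```

A command whose sources do not match any pending load proceeds normally; one that does match is simply not dequeued until `loadCompleted` clears the dependency.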
The working environment • Xilinx FPGA development board (Virtex-5 family) • Programming the FPGA using JTAG • Communication with the DUT using PCIE • The platform enables: • Synthesis of the design to the FPGA • Reading and writing to memories • Performance counters
SCEMI’s working methods “Standard Co-Emulation Modeling Interface” • 2 working methods • TCP/IP simulation • FPGA emulation • Establishes communication between a port on the SW end and a FIFO on the HW end • Parcels (data structs) are delivered in both directions
System layers – PCIE emulation • [Figure: PC side: Linux O.S., C++ executable (TB), input files; link: SCEMI, DUT to PCIE; FPGA side: DUT (Wrapper containing the Datapath)]
System layers – TCP/IP simulation • [Figure: same layers as the PCIE setup, with the DUT (Bsim_dut: Wrapper containing the Datapath) running in simulation and connected to the C++ executable (TB) over TCP/IP]
Our SCEMI platform – SW side • Compiled C++ code (the TB) is loaded with input files • Sends and receives messages from the DUT using incoming / outgoing ports • We chose to use a “Stop & Wait” protocol • Performs the following actions: • Loads the DUT’s instruction memory • Loads the DUT’s register file • Signals the DUT to run • When done, collects the relevant information • Register file • Run statistics
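The “Stop & Wait” idea can be sketched in plain C++ as follows. This does not use the SCEMI API: `Message`, `LoopbackDut` and `sendAll` are hypothetical stand-ins, and the acknowledgement format (echoing the message kind) is an assumption made so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical message format; real SCEMI parcels are project-specific.
struct Message {
    std::uint32_t kind;     // e.g. load instruction word, load register, run
    std::uint32_t payload;
};

// Stand-in for the DUT side: stores each message and queues one ack for it.
class LoopbackDut {
public:
    void send(const Message& m) {
        received.push_back(m);
        acks_.push_back(m.kind);
    }
    bool tryReceive(std::uint32_t& ack) {
        if (acks_.empty()) return false;
        ack = acks_.front();
        acks_.pop_front();
        return true;
    }
    std::vector<Message> received;

private:
    std::deque<std::uint32_t> acks_;
};

// Stop & Wait: send one message, then wait for its acknowledgement before
// sending the next. Returns the number of acknowledged messages.
template <typename Link>
std::size_t sendAll(Link& link, const std::vector<Message>& msgs) {
    std::size_t acked = 0;
    for (const Message& m : msgs) {
        link.send(m);
        std::uint32_t ack = 0;
        while (!link.tryReceive(ack)) {
            // Poll until the other side answers.
        }
        if (ack != m.kind) break;  // unexpected reply: stop the transfer
        ++acked;
    }
    return acked;
}
```

With at most one message in flight, the test bench never overruns the hardware FIFOs, at the cost of one round trip per message.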
Our SCEMI platform – HW side • Our top level module (the Wrapper, which is our DUT) • Receives and sends messages to the TB using FIFOs • Contains the Datapath itself as a black box • Performs commands from the TB • Loads the instruction memory and the register file • Initializes all the registers and starts / stops the run of the datapath • Receives data from the datapath (from the WB stage) and relays it back to the TB
Putting the design to the test • As a concluding test, we wrote a Bubble Sort in assembly: it loads 10 unsorted numbers into the memory, sorts them, and displays them in the register file. • The code uses almost all of the instruction set, and practically every feature in the design.
for (i = 0; i < length - 1; ++i) {
    for (j = 0; j < length - i - 1; ++j) {
        if (array[j] > array[j + 1]) {
            int tmp = array[j];
            array[j] = array[j + 1];
            array[j + 1] = tmp;
        }
    }
}
Critical example – Bubble sort • The program works successfully in the BSV simulation and the TCP/IP simulation. • Results are incorrect in the PCIE emulation.
Isolating the problem • Trying to isolate the problem – store 4 numbers, and read them into the register file • 4 ADDI , 4 STORE , 4 LOAD • Encountered unexplained yet repeating results • This is only one of many debugging attempts
Isolating the problem • Expected result: consistent with simulation • FPGA result: [figure] • Padding with 1 NOP between ADDI and ST: [figure] • Padding with 2 or more NOPs: [figure]
Further investigation • Dismissing possible issues • Design fault – works flawlessly in simulations • Clearing the design between runs • Investigating Xilinx compilation files • Place and route – margins are positive • No noteworthy warnings • Consulting with Danny Hofshi, Mony Orbach, Yuval H. Nacson • We were unable to solve the problem.
Problem characterization • The FPGA differs in behavior from both the BSV and the TCP/IP simulations • Related to the Store command – storing into the BRAM memory • Occurs when performing multiple stores in a row • Xilinx reports show no timing warnings
Project usage and integration • The project is designed modularly, so that it can be easily modified and enhanced in the future • “Black Box” design • Integration-oriented information and a step-by-step walkthrough for using the system are given in a designated section of the project’s final report
Summary and conclusions • Fine line between high- and low- level implementation • Easy to write, modify and understand • Excellent simulation environment • Differences between simulation and FPGA • Automatic optimization – good and bad