MIPS in Haste – VLSI Architectures 048878 Final Project – Alon Naveh 056465909, Eitan Zahavi 058742057
Overview • MIPS in Haste is an exercise in asynchronous design • The goal is to practice real-world asynchronous design • Must contain all elements of the processor • Required to show real programs running • No requirement for a full ISA • What was achieved? • A 5-stage pipeline design • Hazard detection, bypass and stall • Running Fibonacci and GCD programs • Extensive unit-level tests • Learned a lot about Haste • Conclusions were drawn A. Naveh, E. Zahavi
Design Challenges • Pipelining • A high-performance processor requires a pipeline • Stages are concurrent processes • Each stage takes input from the previous stage, does its work and drives the next stage • Write-back needs to loop back into Decode • So Decode must be able to handle two inputs without blocking A. Naveh, E. Zahavi
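To make the stage-as-process idea concrete, the following is a minimal Python sketch (not Haste code) that models two stages as concurrent processes handshaking over blocking channels; the channel and stage names mirror the design, but the payloads and the end-of-program marker are illustrative assumptions.

```python
import threading
import queue

# Blocking, capacity-1 queues stand in for Haste handshake channels.
f2d = queue.Queue(maxsize=1)   # Fetch -> Decode
d2e = queue.Queue(maxsize=1)   # Decode -> Execute

def fetch_stage(program):
    # A stage is its own process: produce work and drive the next stage.
    for instr in program:
        f2d.put(instr)         # blocks until Decode accepts (handshake)
    f2d.put(None)              # end-of-program marker (illustrative only)

def decode_stage():
    while True:
        instr = f2d.get()      # wait on the previous stage
        if instr is None:
            d2e.put(None)
            break
        d2e.put(("decoded", instr))   # drive the next stage

threading.Thread(target=fetch_stage, args=([0x20090009, 0x20120012],)).start()
threading.Thread(target=decode_stage).start()
while (item := d2e.get()) is not None:
    print(item)
```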
Design Challenges • Hazards Handling • A pipelined processor naturally encounters a number of hazards • Synchronous hazard logic uses state information from all stages • Asynchronous hazard logic must be local to every stage • The read-after-write (RAW) hazard is handled by Execute via a local bypass mechanism • The use-after-load hazard is handled by a bypass mechanism from the Write-back stage to the Execute stage • The branch-condition-calculation hazard is solved by stalling the pipe • To minimize the stall, the branch condition is calculated in Decode, introducing a stall in Fetch A. Naveh, E. Zahavi
Design Challenges • Memory Interface • We were required to enable plugging the asynchronous MIPS into a “synchronous” system • So we could not use asynchronous memories • We use clocked memories for instructions and data • We assume the read-valid time < half the clock cycle A. Naveh, E. Zahavi
Supported Instructions • We implemented a limited instruction set • But enough to run reasonable programs: • add, addi, sub, subi, and, or • lw, sw • slt • beq, bne, bltz, bgez, blez, bgtz • Only unsigned arithmetic is implemented A. Naveh, E. Zahavi
Top Level Diagram A. Naveh, E. Zahavi
Top Level Signals A. Naveh, E. Zahavi
Top Level Signals - Cont A. Naveh, E. Zahavi
Fetch Stage – Tasks and Assumptions • Task: read the next command from the instruction memory • Stores the PC • Calculates the next PC value (by adding 4) • For branches, uses the PC value from the Decode stage • The memory interface: • Assumes address setup time < clock low period • Assumes memory data-out valid time < clock high period A. Naveh, E. Zahavi
Fetch Code • Wait for the falling edge of the instruction-memory clock • Sample the instruction • Send the instruction and next PC to the Decode stage • Read the possible branch control and address from the Decode stage • Calculate the next PC value: take the branch target or just use PC+4 • Wait for the rising edge of the instruction-memory clock A. Naveh, E. Zahavi
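The Haste source itself is not shown on the slide; as a hedge, here is a behavioral Python sketch of the same fetch loop. The helper names (imem_read, wait_negedge, wait_posedge) and the channel payload layout are assumptions for illustration, not the project's actual identifiers.

```python
def fetch_loop(imem_read, wait_negedge, wait_posedge, f2d, d2f):
    """Behavioral model of the Fetch loop described above (not Haste code)."""
    pc = 0
    while True:
        wait_negedge()                       # falling edge of the instruction-memory clock
        instr = imem_read(pc)                # sample the instruction
        f2d.put((instr, pc + 4))             # send instruction and next PC to Decode
        take_branch, branch_pc = d2f.get()   # branch control and address from Decode
        pc = branch_pc if take_branch else pc + 4
        wait_posedge()                       # rising edge of the instruction-memory clock
```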
Decode Stage – Tasks and Assumptions • Receives inputs from the Fetch and Write-Back stages • From Fetch: the next command to be executed • From Write-back: the result, the destination register, and whether to write it • Designed to ensure the write-back happens before the next instruction’s registers are read • By counting the number of outstanding instructions in the following pipe stages • A state variable counts how many outstanding instructions were sent to the following pipe stages • Cannot read the next instruction if 3 or more instructions are outstanding A. Naveh, E. Zahavi
Decode Code – Without Hazards • If there is data waiting on the channel from the write-back • Write it (if required) to the RF • Decrement the number of outstanding instructions • If there is a new instruction waiting from the fetch stage and the number of outstanding instructions < 3 • Obtain a new instruction from the fetch and decode it into the set of control signals, • register numbers and immediate value (sign-extend it) • Look up the required registers in the RF (serially) • Increment the number of outstanding instructions in parallel to the above • Send the new command to the execute stage A. Naveh, E. Zahavi
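As a rough illustration of this flow (not the Haste implementation), the sketch below models one Decode iteration in Python. The field extraction and channel payloads are assumptions; the outstanding-instruction bookkeeping follows the slide.

```python
def decode_step(w2d, f2d, d2e, rf, outstanding):
    """One iteration of the simplified (hazard-free) Decode loop; names are illustrative."""
    if not w2d.empty():                        # data waiting from Write-back?
        reg, value, do_write = w2d.get()
        if do_write:
            rf[reg] = value                    # write the result into the RF
        outstanding -= 1                       # one fewer instruction in flight
    if not f2d.empty() and outstanding < 3:    # room left in the following stages?
        instr, next_pc = f2d.get()
        opcode = instr >> 26                   # decode into control signals ...
        rs, rt = (instr >> 21) & 0x1F, (instr >> 16) & 0x1F
        imm = instr & 0xFFFF
        imm -= 0x10000 if imm & 0x8000 else 0  # ... and the sign-extended immediate
        a, b = rf[rs], rf[rt]                  # serial register-file lookups
        outstanding += 1                       # done in parallel with the above in Haste
        d2e.put((opcode, a, b, imm, rt))       # hand the decoded command to Execute
    return outstanding
```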
Decode – With Branch/Jump • An additional FIFO stores, per outstanding instruction: • whether a register will be updated • the register name • On a new branch command • The FIFO is checked for a match against the branch-related registers • If there is a match, no new command is fetched • When the registers are provided by Write-Back, the branch decision is evaluated and the Fetch is provided with the next PC A. Naveh, E. Zahavi
Decode Code - Final • If there is data waiting on the channel from the write-back • Write it (if required) to the RF • Decrement the number of outstanding instructions • Pop the outstanding-commands FIFO • If there is a new instruction waiting from the fetch stage, the number of outstanding instructions is < 3, and we are not stalled • Obtain a new instruction from the fetch and decode it into the set of control signals, • register numbers and immediate value (sign-extend it) • Increment the number of outstanding instructions in parallel to the above • Declare a stall • If stalled • Update the stall by calling isOutstanding • If the instruction is a branch and we are not stalled, decide whether to take the branch; otherwise no branch is taken • If there is no stall: look up the required registers in the RF (serially), send the branch decision to the Fetch, send the new command to the execute stage and push the dependency into the FIFO A. Naveh, E. Zahavi
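The branch-dependency check itself reduces to matching the branch source registers against the dependency FIFO. Below is a self-contained Python sketch of that check; the function name and FIFO entry layout are assumptions standing in for the isOutstanding task named above.

```python
from collections import deque

def is_outstanding(dep_fifo, rs, rt):
    """True if a branch source register is still pending a write-back.
    dep_fifo holds (will_write_reg, reg_number) pairs for in-flight instructions."""
    return any(will_write and reg in (rs, rt) for will_write, reg in dep_fifo)

# Two in-flight instructions write $s2 (reg 18) and $s1 (reg 17).
dep_fifo = deque([(True, 18), (True, 17)], maxlen=3)
print(is_outstanding(dep_fifo, rs=17, rt=18))   # True -> the branch must stall
```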
Execute – Tasks and Assumptions • Receive the decoded instructions and controls • Perform whatever calculations are needed • Forward results and pipelined controls to memory • Main input pipeline is from the Decode stage • A secondary input pipeline comes from the write-back stage, containing data read from memory A. Naveh, E. Zahavi
Execute Code – Basic Flow • Wait for a new instruction from the d2e pipe • Select the proper input for the SrcA port of the ALU (SrcA always comes from the d2e pipe – it is the Rs register) • Select the proper input for the SrcB port of the ALU: • If the Imm input is required (ALUSrcE == 1), select it for port B; • else select SrcB (which is the Rt register) • Perform the arithmetic or logic operation • Select the write-back register number from Rt or Rd (RegDstE control) • Output the result to the e2m pipe A. Naveh, E. Zahavi
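A compact Python sketch of this operand selection and ALU step follows; the control-record packaging is an assumption, but the ALUSrcE and RegDstE decisions match the slide.

```python
def execute_basic(ctrl, rs_val, rt_val, imm, rt, rd):
    """Basic Execute flow: operand muxes, ALU op, write-back register select (sketch)."""
    src_a = rs_val                                   # SrcA is always the Rs value
    src_b = imm if ctrl["ALUSrcE"] else rt_val       # Imm vs. Rt for SrcB
    ops = {"add": lambda a, b: (a + b) & 0xFFFFFFFF,
           "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
           "and": lambda a, b: a & b,
           "or":  lambda a, b: a | b,
           "slt": lambda a, b: int(a < b)}           # unsigned compare, as implemented
    result = ops[ctrl["ALUOp"]](src_a, src_b)
    wb_reg = rd if ctrl["RegDstE"] else rt           # RegDstE control selects Rd or Rt
    return result, wb_reg

print(hex(execute_basic({"ALUSrcE": 0, "ALUOp": "add", "RegDstE": 1},
                        0xA5A5A5A5, 0x5A5A5A5A, 0, rt=5, rd=0xA)[0]))  # 0xffffffff
```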
Execute Block Diagram A. Naveh, E. Zahavi
Execute – Hazards Handling – RAW • RAW is caused by the 3-stage delay required for the Execute result to be written back into the register file • If the current instruction uses a register updated within the last two instruction cycles, the data provided by the Decode stage is stale • Unlike in a traditional clocked implementation, the RAW logic is done inside the Execute stage • A two-stage bypass FIFO in the Execute holds: • the target register number • a register-write indication • the ALU result • For every instruction, the source registers are matched against the FIFO content • If a match is found, the data from the FIFO is used instead of the Decode data A. Naveh, E. Zahavi
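A minimal Python sketch of such a two-entry bypass FIFO is shown below; the class and field names are assumptions, but the match-and-override behavior follows the slide.

```python
from collections import deque

class BypassFifo:
    """Two-entry bypass FIFO held locally inside Execute (illustrative sketch)."""
    def __init__(self):
        self.entries = deque(maxlen=2)       # (target_reg, reg_write, alu_result)

    def push(self, target_reg, reg_write, alu_result):
        self.entries.appendleft((target_reg, reg_write, alu_result))

    def lookup(self, src_reg):
        # Newest entry first: the most recent in-flight result for src_reg wins.
        for target, write, value in self.entries:
            if write and target == src_reg:
                return True, value           # bypass hit: use FIFO data, not Decode data
        return False, None                   # miss: the Decode-supplied value is current

fifo = BypassFifo()
fifo.push(target_reg=9, reg_write=True, alu_result=0x2A)
print(fifo.lookup(9))    # (True, 42) -> RAW hazard resolved by the local bypass
```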
Execute – Hazards Handling – UaL • The use-after-load hazard is caused by the 3-cycle delay required for a memory-read result to reach back into the register file • Need to use the Write-back data of the memory-read result • The two-stage FIFO is extended to also track memory-read instructions (lw) and results obtained from write-back • The bypass detects a match for memory-reads as well and muxes the bypass data from the FIFO into the proper ALU input port • A w2e (Write-back to Execute) pipe was added to supply memory-read results to the bypass mechanism • Since the bypass mechanism may not yet have the memory-read data available in the bypass FIFO, a stall mechanism was added to the Execute pipe A. Naveh, E. Zahavi
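To illustrate the extended check (again a sketch, not the Haste code): a load entry whose data has not yet arrived over w2e forces a stall, while a filled entry is bypassed just like a RAW hit. The entry layout below is an assumption.

```python
def ual_check(fifo_entries, src_reg):
    """Use-after-load check: entries are (target_reg, is_load, value), newest first;
    value is None for an lw whose result has not yet been delivered by Write-back."""
    for target, is_load, value in fifo_entries:
        if target == src_reg:
            if is_load and value is None:
                return "stall", None         # data still in flight on the w2e pipe
            return "bypass", value           # use the FIFO value instead of Decode's
    return "no-hazard", None

entries = [(8, True, None), (9, False, 7)]   # lw into $8 pending; add result for $9 ready
print(ual_check(entries, 8))                 # ('stall', None)
print(ual_check(entries, 9))                 # ('bypass', 7)
```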
Execute Code – Complete Flow • The w2e pipe is first probed for new memory-read results (the pipe will not send data unless a memory read has completed – which is always true when the pipe is empty). • If a new memory-read result is available, obtain the data from the w2e pipe and then call wb_update to update the FIFO. Then set a “potential_upd” indication for the next execute activities in the flow. • Else continue to the next stage in the flow • If no “Stall” is required, probe whether a new instruction is available • If so, obtain the Decode information and set the “new_cmd” flag • Else continue to the next stage • If the new_cmd flag is set and either no “Stall” is asserted or “potential_upd” has been set by the w2e pipe read, the isEOutstanding task is called to evaluate dependencies and update the Stall condition. • If no “Stall” is required, execution is started: • The SrcA mux is activated if isEOutstanding detected a valid bypass source for SrcA • The pre-SrcB mux is activated if isEOutstanding detected a valid bypass source for SrcB • The Imm mux is activated for SrcB if required by the instruction controls • The instruction is executed (as before) • The write-back register number is selected (as before) • The “new_cmd” flag is cleared • Push_lcl is called to update the bypass FIFO with the current instruction’s parameters and outcome, and push out the oldest instruction from the 2-entry queue • The result is output to the e2m pipe • Else continue to the beginning of the loop (check for w2e and d2e inputs) • Else continue to the beginning of the loop A. Naveh, E. Zahavi
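The control skeleton of that loop can be sketched as follows in Python. The helper callables (wb_update, is_e_outstanding, run_alu) and the command-record fields are placeholders for the tasks named above, so this is a behavioral outline under those assumptions rather than the actual Haste process.

```python
def execute_loop(w2e, d2e, e2m, fifo, wb_update, is_e_outstanding, run_alu):
    """Behavioral outline of the full Execute loop with w2e updates and stalls."""
    stall, new_cmd, cmd = False, False, None
    while True:
        potential_upd = False
        if not w2e.empty():                    # probe: memory-read result waiting?
            reg, data = w2e.get()
            wb_update(fifo, reg, data)         # fill the pending bypass-FIFO entry
            potential_upd = True
        if not stall and not d2e.empty():      # probe: new decoded instruction?
            cmd, new_cmd = d2e.get(), True
        if new_cmd and (not stall or potential_upd):
            stall, byp_a, byp_b = is_e_outstanding(fifo, cmd)   # re-evaluate dependencies
            if not stall:
                result, wb_reg = run_alu(cmd, byp_a, byp_b)     # muxes + ALU as before
                fifo.push(wb_reg, cmd["reg_write"], result)     # oldest FIFO entry drops out
                e2m.put((cmd["ctrl"], result, wb_reg))
                new_cmd = False                # otherwise loop back and probe again
```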
Memory – Tasks and Assumptions • Provides the read/write address of the data memory and the control signals • The read data is actually accessed by the Write-Back stage • This partition allows for both fast and slow memories (compared to the MIPS speed) • Assumes: • Memory data-valid timing < clock low time • Data and address setup time < clock high time • With the above assumptions the unit can safely drive the memory at the falling edge of the clock A. Naveh, E. Zahavi
Memory – Code • Check the e2m pipe for a new instruction • Wait for negedge clk • Drive the Write_enable, Address and Write_data onto the memory control and input data bus • Wait for posedge clk • Clear the Write_enable (to prevent false writes in case clk *is* much faster than the memory stage and the following pipe) • Send the instruction controls and ALUOut from the previous stage to the Write-back via the m2w pipe A. Naveh, E. Zahavi
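As a hedge against the terse bullet form, here is a Python pseudomodel of one Memory-stage iteration; the clock-edge waits, control names and bus packaging are assumptions.

```python
def memory_step(e2m, m2w, mem, wait_negedge, wait_posedge):
    """One Memory-stage iteration (behavioral sketch, not Haste)."""
    ctrl, alu_out, write_data, wb_reg = e2m.get()   # new instruction from Execute
    wait_negedge()                                  # drive the memory on the falling edge
    if ctrl["MemWrite"]:
        mem[alu_out] = write_data                   # Write_enable + Address + Write_data
    wait_posedge()
    # In hardware, Write_enable is cleared here so a fast clock cannot cause false writes.
    m2w.put((ctrl, alu_out, wb_reg))                # forward controls and ALUOut to Write-back
```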
Write-back – Tasks and Assumptions • The Write-back stage is triggered by the m2w pipe signaling that a new instruction is ready • Its main task is to sample the valid data input – from memory or from the Memory stage – and pass it to the Decode stage for the register write • Due to the memory-read hazard described earlier, Write-back also sends memory-read results, along with the register they target, back to the Execute stage via the w2e pipe A. Naveh, E. Zahavi
Write-back – Code • Sample the m2w pipe for the control information and ALUOut data (in the case of R-type register-write operations) • Wait for negedge clk • Capture the memory-read bus • If this is a memory-read operation (MemToReg == 1), select the memory output as the write-back data • Else use the previous stage’s ALUOut as the write-back data • If a memory-read operation, send the data and register number to the Execute unit (w2e pipe) • Send the result to the Decode unit via the w2d pipe A. Naveh, E. Zahavi
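A matching Python sketch of one Write-back iteration is given below; the mem_read_bus helper and control field names are assumptions, while the MemToReg selection and the w2e/w2d hand-offs follow the slide.

```python
def writeback_step(m2w, w2d, w2e, mem_read_bus, wait_negedge):
    """One Write-back iteration (behavioral sketch, not Haste)."""
    ctrl, alu_out, wb_reg = m2w.get()            # control info and ALUOut from Memory
    wait_negedge()
    mem_data = mem_read_bus()                    # capture the memory-read bus
    result = mem_data if ctrl["MemToReg"] else alu_out
    if ctrl["MemToReg"]:
        w2e.put((wb_reg, result))                # feed the Execute bypass (use-after-load)
    w2d.put((wb_reg, result, ctrl["RegWrite"]))  # hand the result to Decode for the RF write
```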
Design Validation - Overview • Design verification was undertaken at two levels • Per-stage testing: • Written to test the basic functionality of each block • In many cases implemented as interface signal sequencing (and not as assembly code) • Full-CPU testing: • Written to test end-to-end operation – including the more complex stalls and bypasses • Uses assembly language as the source • The methodology included: • Creating a top level in HASTE, encapsulating the asynchronous portion • A <top_name>_tb.v test-bench in Verilog, which includes the memories, the clock generation, and in some cases the end pipes • Outputs: both HASTE waveforms and Verilog $monitor printouts were used as appropriate A. Naveh, E. Zahavi
Fetch stage stand-alone verification • Goals: • Test the synchronous “wire” interface to the Instruction Memory • Test the f2d channel signaling – activated per instruction fetched • The d2f channel – used for BR address feedback into the PC – is not tested in this flow • Mechanism – fetch test-bench: • A top level instantiates the fetch module • Stubs that ack both f2d and d2f pipe requests sent by the fetch block • An HLV implementation of a memory with read functionality and a clock generator • The content is pre-loaded into memory • The test bench just runs the Fetch unit and tracks the read operations from memory A. Naveh, E. Zahavi
Fetch stand-alone test results • The memory address increments by 4 every clock and the relevant instruction is read out • The same data is placed on the f2d channel • The expectation here is that the memory clock (and the synchronous logic) will be much faster than the asynchronous MIPS fetch-stage operation A. Naveh, E. Zahavi
Decode stand-alone verification • Goals: • Test the Decode branch hazard • Expected to stall up to 3 cycles if the registers the BR uses are written by the previous 3 instructions (and are therefore still in the pipeline) • Expected to stall 1 cycle for a non-dependent BR due to the need to reload the Fetch with a new address and discard the non-relevant instruction • Non-goals: ensuring correct decode signal generation • Mechanism – top and test-bench overview • A top layer instantiates the Decode instead of exposing it to the test-bench directly • Required since Decode “probes” its w2d and f2d pipes – allowed only for passive pipes at the top level • A fetch stub drives instructions pre-loaded from memory into the f2d pipe • A w2d pipe stub generates dummy writes to reg $1, but at a 3-“stage” delay to emulate a full CPU. This is required to allow the BR dependency stall to operate • A d2e pipe stub just pulls data out of the pipe A. Naveh, E. Zahavi
Decode – stand-alone: cont
00100000000010010000000000001001 // addi $t1, $zero, 9 ($t1 = 9) # create data dependency on t1
00100000000100100000000000010010 // addi $s2, $zero, 18 ($s2 = 18) # create data dependency on s2
00100000000100010000000000010001 // addi $s1, $zero, 17 ($s1 = 17) # create data dependency on s1
00010110010100010000000000000000 // bne $s1, $s2, 0 (if ($s1 != $s2) goto 0) # see a stall for 3 writebacks
00100000000010000000000000001000 // addi $t0, $zero, 8 ($t0 = 8) # create data dependency on t0
00100000000100010000000000010001 // addi $s1, $zero, 17 ($s1 = 17) # create data dependency on s1
00010010010010001111111111111101 // beq $s2, $t0, -3 (if ($s2 == $t0) goto -3) # see a stall for 2 writebacks
00100000000100100000000000010010 // addi $s2, $zero, 18 ($s2 = 18) # create data dependency on s2
00100000000010010000000000001001 // addi $t1, $zero, 9 ($t1 = 9) # create data dependency on t1
00100000000100010000000000010001 // addi $s1, $zero, 17 ($s1 = 17) # create data dependency on s1
00011110010000000000000000000000 // bgtz $s2, 0 (if ($s2 > 0) goto 0) # see a stall for 1 writeback
00100000000010010000000000001001 // addi $t1, $zero, 9 ($t1 = 9) # create data dependency on t1
00100000000100010000000000010001 // addi $s1, $zero, 17 ($s1 = 17) # create data dependency on s1
00011010010000001111111111111101 // blez $s2, -3 (if ($s2 <= 0) goto -3) # see no stall
00010000000000001111111111111111 // beq $zero, $zero, -1 (if ($zero == $zero) goto -1) # loop forever
• Expectations: • The 1st BR will stall 3 writeback cycles ($S1) • The 2nd BR will stall 2 writeback cycles ($T0) • The 3rd BR will stall only 1 writeback cycle (no dependencies) • The 4th BR does not happen (condition not met) -> no stall • The 5th BR will stall 1 cycle (no dependencies) A. Naveh, E. Zahavi
Decode – outputs - 1 • As can be expected, the 1st BR stalls for 3 cycles, and the 2nd BR stalls for 2 cycles A. Naveh, E. Zahavi
Decode – outputs - 2 • The 3rd BR stalls for only one cycle; • The 4th BR does not stall at all (condition not met) • The 5th BR stalls again 1 cycle (BR misprediction penalty) A. Naveh, E. Zahavi
Execute stand-alone testing • Goals: • Test correct execution of add, add_imm, sub, and, or, and slt (all done only for unsigned inputs!) • Test the d2e and e2m pipe operation • Non-goals: hazard handling and w2e pipe testing • Mechanism used: • A top layer that instantiates the Execute instead of exposing it to the test-bench directly • Required since Execute “probes” its w2e and d2e pipes – allowed only for passive pipes at the top level • A test bench that pushes new decoded instructions into the d2e pipe and pulls the results from the e2m pipe • Output tracking: Verilog $monitor printing of inputs and outputs A. Naveh, E. Zahavi
Execute – stand-alone testing - cont Instructions tested and expected results: • // add Rd, Rs, Rt # (Rs == 0xA5A5A5A5, Rt == 0x5A5A5A5A ; Rt = 5, Rd = 0xa) • Expected result: $0xa <= 0xFFFFFFFF • // addi Rt, Rs, 0x25252525 # (Rs == 0xA5A5A5A5; Rt = 5) • Expected result: $0x5 <= 0xCACACACA • // sub Rd, Rs, Rt # (Rs == 0xA5A5A5A5, Rt == 0x5A5A5A5A ; Rt = 3, Rd = 0x11) • Expected result: $0x11 <= 0x4B4B4B4B. • Note: as we implemented only unsigned arithmetic this is actually executing “subu Rd, Rs, Rt” • // and Rd, Rs, Rt # (Rs == 0xA5A5A5A5, Rt == 0x5AFF5AFF; Rt = 3, Rd = 0x1F) • Expected result: $0x1F <= 0x00A500A5 • // or Rd, Rs, Rt # (Rs == 0xA5A5A5A5, Rt == 0x5A005A00; Rt = 3, Rd = 0x5) • Expected result: $0x5 <= 0xFFA5FFA5 • // slt Rd, Rs, Rt # (Rs == 0x05A5A5A5, Rt == 0x5A005A00; Rt = 3, Rd = 0x5) • Expected result: $0x5 <= 0x00000001 (Rs < Rt) • // slt Rd, Rs, Rt # (Rs == 0x05A5A5A5, Rt == 0x05A5A5A5; Rt = 3, Rd = 0x5) • Expected result: $0x5 <= 0x00000000 (Rs !< Rt) A. Naveh, E. Zahavi
Execute – output trace (1) A. Naveh, E. Zahavi
Execute – output trace (2) A. Naveh, E. Zahavi
Memory and Write-back Verification • Goals: • Test the memory interfacing and read/write operations • Test the e2m, w2e and w2d pipes • Mechanisms used: • Since both units in our implementation operate on the Memory module, both are tested together under a single test-bench • A top-level layer instantiates the memory and writeback modules • A test-bench that includes a stub that drives the e2m with operations and stubs that pull the results from the w2d and w2e pipes • The memory and clock are implemented in the test-bench A. Naveh, E. Zahavi
Memory+writeback testing - cont • Test flow: • Initializes memory with mem[i] <= i • Does a series of writes and reads: • // write 0x101 to addr-1 • // write 0x102 to addr-2 • // write 0x104 to addr-4 • // write 0x108 to addr-8 • // write 0x110 to addr-16 • // write 0x120 to addr-32 • // write 0x140 to addr-64 • // write 0x180 to addr-128 • // read addr-0 to reg 8 • // read addr-1 to reg 0 • // read addr-2 to reg 1 • // read addr-4 to reg 2 • // read addr-16 to reg 8 • // read addr-32 to reg 16 • // read addr-64 to reg 31 • // read addr-128 to reg 9 • // write 0xACACACAC to addr-127 • // read addr-63 to reg 10 A. Naveh, E. Zahavi
Memory-WB test outputs - 1 A. Naveh, E. Zahavi
Memory-WB test outputs - 2 A. Naveh, E. Zahavi
Top Level “full chip” testing - overview • This testing level was used to test the more complex functions as well as the full stage interconnect • All Execute bypasses and stalls were tested • Full Decode and Execute correctness was tested • Full Write-back-to-Decode data transfer and correct data writes were tested • Mechanism: • Programs are compiled (assembly) and loaded into memory • The test-bench implements the instruction memory, data memory and the clock • The top level instantiates all of the stages using the pipes and exposes only the memory interface (the link into the synchronous world) • Outputs are tracked via the HASTE viewer and the Verilog $monitor command A. Naveh, E. Zahavi
RAW Hazard testing • Goal: test correct register-write data bypass inside the Execute stage, by reusing a result before it is written back to the register file • Outputs: correct arithmetic results on the Result_W bus show that the bypasses are switched correctly A. Naveh, E. Zahavi
Fibonacci full chip • Goal: test UaL hazards and BR dependency hazards as well as some more Decode-Execute correctness • Outputs: • Tracking the e2m pipe sequencing for execute stalls • Tracking the d2e pipe for stalls on BR dependencies • Tracking calculation results as written to memory (DM_WR_DATA when DM_WE == ‘1’): should be the Fibonacci sequence starting from the 3rd element (2, 3, 5, 8, 13, 21, …) A. Naveh, E. Zahavi
Fibonacci full chip - cont A. Naveh, E. Zahavi
GCD – Full Chip • Goal: test RAW and BR stalls as well as regular execution • Output: • Track memory write of $s0; will have GCD result (‘5’) • Track pipe transfers for stall behavior A. Naveh, E. Zahavi
GCD full chip - outputs A. Naveh, E. Zahavi
Development Environment • Directory Structure • mips – this is the Haste design directory holding the type.ht, mips.ht, fetch.ht, decode.ht, execute.ht, memory.ht and writeback.ht design files • test – this directory holds the test-bench and top-level container modules • build – used for building and running the model. Holds the Makefile and run_prog for automating the tests • Pipe-stage tests • A test-bench named <stage>_top.v is designed per stage and holds the test for that unit • Top-level (full chip) tests • The top-level tests – programs – are stored in files named <test>.mem • mips_tb.v is a simple container that loads the program into the instruction memory and the data into the data memory, and then toggles the clock and responds to the MIPS memory interface A. Naveh, E. Zahavi
Development Environment - Cont • Makefile usage • Automating the build and simulation of the code improves productivity • The Makefile accepts an external variable DES=<top module> and then builds and runs the simulation for that block. The targets the Makefile provides are: • all – for a complete flow, including simulation and htview • build – compiling and preparing the model without simulation and htview • clean – for cleaning the directory, e.g. when changing DES • compile – for just running htcomp • map – for just running htmap • dot – for running htmap to produce a dot file • sim – for invoking Verilog-XL in batch mode • view – for invoking htview • Memory-based program execution automation: • A simple script named run_prog was designed • Accepts a single parameter – the name of the memory file containing the program • Copies it to a local file and invokes the simulator • After the simulation it calls htview with a parameter to use a predefined signal set • Google Open Source Project Hosting environment • Needed to support development by two different programmers • We decided to use Google Open Source project hosting • It allows sharing, tracking and maintaining versions of the source code in its Subversion repository • It provides a Wiki interface for recording our design decisions and issues • The project home page can be found at http://code.google.com/p/mips-in-haste/ A. Naveh, E. Zahavi
Conclusions • The Haste language allows developing an asynchronous design using a high-level, Verilog-like environment, including a viewer and debugger • Benefits • Asynchronous design using a simplified, synchronous-like HDL • A Verilog interface as a means for synchronous design integration and testing • The viewer allows tracking pipe states and viewing their internals like a bus A. Naveh, E. Zahavi
Conclusions • Restrictions and issues • Pipe interconnects between “stages” prevent an efficient implementation of global logic (such as hazard resolution), since visibility into blocks is limited • The pipe design proved to be quite complex when waiting on a “potential” input coming in from another stage. It requires delicate use of “probes” and conditional execution • Debugging is not intuitive and quite difficult: • It cannot be done outside of the Haste viewer, since signal naming and hierarchy are not preserved by the compiler • There are no debug hooks to tell what was executed by the machine – this could be solved by compiler instrumentation if run in debug mode • The viewer did not seem to have search capabilities (find next signal switch, etc.) • The Haste reference documentation was also quite cumbersome. Real code examples were scarce. The descriptions were otherwise very “formal” • For example, in the Decode and Execute we needed to probe two channels • We could not find documentation precluding the use of two successive “if probe()” or “sel probe()” statements – however the resulting circuit did not function according to our expectations • The solution was to use a single select statement • Interfacing into a clocked environment requires tricky coding and in essence “synchronizes” the whole block execution timing to that clock (not as asynchronous as we’d like to assume) A. Naveh, E. Zahavi