
ARMOR: Asynchronous RISC Microprocessor


Presentation Transcript


  1. ARMOR: Asynchronous RISC Microprocessor. Technion - Israel Institute of Technology, High-Speed Digital Systems Laboratory, Faculty of Electrical Engineering. Mid-Semester Presentation. Submitted by: Tziki Oz-Sinay, Ori Lempel. Supervised by: Rony Mitleman

  2. Milestones Reached
      • Development platform selected
        • Balsa over Petrify+VHDL
      • Micro-Architecture Specification (MAS) completed
        • functional block partition, datapath interface defined
        • asynchronous handshaking protocol defined
      • Detailed asynchronous pseudo-code implementation written
      • Balsa code writing, dynamic simulation and synthesis started

  3. Development Platform Selection
      Two development environments were examined:
      Balsa:
      • Language for synthesising large asynchronous circuits and systems
      • Compiles to a small, parametric set of handshake components
      • Balsa flow
      • Balsa initial flow
      Petrify:
      • A synthesis tool for Petri Nets and asynchronous controllers
      • Reads a Petri Net and generates another Petri Net, which is simpler than the original description but behaviorally similar

  4. Development Platform Selection (cont.)
      Balsa’s Advantages
      • One development environment → easier debugging and integration
      • Synthesis implements a delay-insensitive circuit → implementation is transparent to the developer (no need for timing analysis)
      • Control channels are automatically created at compilation
      • High-level language → easier to learn

  5. Development Platform Selection (cont.)
      Petrify’s Advantages
      • A more mature environment than Balsa
      • When using Petrify, the core of the system is written in VHDL → all of the tools/flows are well known and supported in the lab
      • Petrify’s output is translated to Verilog, while Balsa only supports EDIF synthesis → higher-level output, compatible with Altera

  6. The Balsa Environment Was Chosen
      This imposes new hardware requirements:
      • A simplified design, comprising an in-order pipeline and no external memory, will be synthesized on a Xilinx Spartan FPGA
      • The complete design will later be implemented on a Xilinx Virtex-II Pro

  7. Handshake Protocol
      [Figure: a push channel and a pull channel, each with REQ, ACK and n-bit DATA wires; timing diagram of the 4-phase protocol showing DATA, REQ and ACK]
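The 4-phase protocol named on this slide can be sketched in Python as follows. This is a minimal model that assumes the conventional return-to-zero sequence on a push channel (sender raises REQ with valid data, receiver raises ACK, sender lowers REQ, receiver lowers ACK); the class and method names are illustrative and are not taken from the slides or from Balsa.

```python
class PushChannel:
    """Toy push channel: the sender drives REQ and DATA, the receiver drives ACK."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None
        self.trace = []  # (req, ack) recorded after every transition

    def _log(self):
        self.trace.append((self.req, self.ack))

    def transfer(self, value):
        """Run one complete 4-phase cycle and return the value the receiver latched."""
        assert (self.req, self.ack) == (0, 0), "channel must be idle"
        # Phase 1: sender places data, then raises REQ.
        self.data = value
        self.req = 1
        self._log()
        # Phase 2: receiver latches the data, then raises ACK.
        latched = self.data
        self.ack = 1
        self._log()
        # Phase 3: sender sees ACK and lowers REQ (data may change from here on).
        self.req = 0
        self._log()
        # Phase 4: receiver sees REQ low and lowers ACK; the channel is idle again.
        self.ack = 0
        self._log()
        return latched


ch = PushChannel()
received = [ch.transfer(v) for v in (0x0001, 0xBEEF)]
```

Every transfer costs four signal transitions, and both wires return to zero before the next transfer can start. In a pull channel the roles of data producer and initiator are swapped: the receiver raises REQ and the sender answers with data and ACK.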

  8. ARMOR Pipestages
      [Figure: pipeline block diagram. Stage labels: Fetch, Decode, Rename, Execute, Write Back, Retire, plus the Instruction Cache, the Data Cache and the Out Of Order Engine. Signal labels: BranchDecision, PC[15:0], Inst[15:0], VInst[15:0], Op[3:0], Imm[11:0], LSrc[3:0], LDst[3:0], PDst[3:0], SrcVal1[15:0], SrcVal2[15:0], Val[15:0], ALU0PDst[3:0], ALU0Res[15:0], ALU1PDst[3:0], ALU1Res[15:0], MemPDst[3:0], DataIn[15:0], DataOut[15:0], Addr[15:0], ReadWrite#]

  9. Instruction Fetch Unit (IFU)
      • Function:
        • Fetch the instruction pointed to by the PC register from the instruction cache.
        • Execute the jump instruction.
        • Calculate branch addresses, speculatively fetch branch target instructions and stall the pipeline pending the branch decision.
      [Figure: IFU datapath. Labels: branch offset, branch decision, adder (+), PC+2, to instruction cache, branch instruction, next instruction, to ID]

  10. Instruction Decoder (ID)
      • Function:
        • Tag instructions by type (REGREG, REGIMM, MEM, BRANCH).
        • Queue up to 4 issue-pending instructions, thus allowing continuous instruction fetching in case instruction issue stalls.
      [Figure: instruction queue with V and Inst columns and head and tail pointers]
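The 4-entry issue queue described on this slide can be sketched as a small circular buffer with one valid (V) bit per slot. This is an illustration only: the class name, the neutral pointer names `read_ptr` and `write_ptr`, and the stall-by-returning-`False` convention are assumptions, not details from the slides.

```python
class IssueQueue:
    """Circular queue of issue-pending instructions with a valid (V) bit per slot."""

    def __init__(self, depth=4):
        self.depth = depth
        self.inst = [None] * depth
        self.valid = [False] * depth
        self.read_ptr = 0   # oldest entry, next to be issued
        self.write_ptr = 0  # next free slot

    def can_accept(self):
        # Fetch only has to stall when the slot under write_ptr is still occupied.
        return not self.valid[self.write_ptr]

    def push(self, inst):
        if not self.can_accept():
            return False
        self.inst[self.write_ptr] = inst
        self.valid[self.write_ptr] = True
        self.write_ptr = (self.write_ptr + 1) % self.depth
        return True

    def pop(self):
        if not self.valid[self.read_ptr]:
            return None  # queue is empty
        inst = self.inst[self.read_ptr]
        self.valid[self.read_ptr] = False
        self.read_ptr = (self.read_ptr + 1) % self.depth
        return inst


q = IssueQueue()
# Issue is stalled: four fetches are absorbed, the fifth is refused.
accepted = [q.push(i) for i in ("i0", "i1", "i2", "i3", "i4")]
first = q.pop()               # issue resumes and frees one slot
accepted_after = q.push("i4")
drained = [q.pop() for _ in range(5)]
```

While issue is stalled, fetch can run ahead by up to four instructions, which is the buffering effect the slide describes.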

  11. Register Alias Table (RAT)
      [Figure: out-of-order engine block diagram. Blocks: RAT, ROB, RRF, RS0, RS1, ALU0, ALU1, Data Cache. Labels: Inst from ID, branches, non-mem inst, mem inst, non-branch inst, BranchDecision to IFU]

  12. Register Alias Table (RAT)
      • Function:
        • Register renaming: map logical sources/destinations to physical registers (ROB/RRF entries):
          • Allocate physical destination (PDst) pointers during instruction issue
          • Reset pointers during retirement (CAM-match logic)
        • Monitor data-readiness of physical sources/destinations:
          • Reset ready-bit during instruction issue
          • Set ready-bit during writeback (CAM-match logic)
      [Figure: RAT table with PDst and Ready columns and one row per logical register R0 to R7]
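The renaming and ready-bit bookkeeping listed above can be sketched as follows. The sketch assumes eight logical registers (R0 to R7, as in the figure) and uses `None` to mean "the value lives in the RRF"; the class, method names and that encoding are hypothetical. The loops stand in for the CAM-match logic, which in hardware compares all rows in parallel.

```python
class RegisterAliasTable:
    """One row per logical register: a PDst pointer and a ready bit."""

    def __init__(self, num_logical=8):
        # pdst[r] is None when logical register r currently lives in the RRF.
        self.pdst = [None] * num_logical
        self.ready = [True] * num_logical

    def rename_source(self, lsrc):
        """Return (pointer, ready); a None pointer means 'read the RRF'."""
        return self.pdst[lsrc], self.ready[lsrc]

    def issue(self, ldst, new_pdst):
        # Allocate a new physical destination and reset the ready bit.
        self.pdst[ldst] = new_pdst
        self.ready[ldst] = False

    def writeback(self, wb_pdst):
        # CAM match: every row pointing at the written-back PDst becomes ready.
        for r, p in enumerate(self.pdst):
            if p == wb_pdst:
                self.ready[r] = True

    def retire(self, retired_pdst):
        # CAM match: a row is reset only if it still points at the retired PDst;
        # a newer rename of the same logical register must be left alone.
        for r, p in enumerate(self.pdst):
            if p == retired_pdst:
                self.pdst[r] = None
                self.ready[r] = True


rat = RegisterAliasTable()
rat.issue(ldst=0, new_pdst=5)
before_wb = rat.rename_source(0)
rat.writeback(5)
after_wb = rat.rename_source(0)
rat.issue(ldst=0, new_pdst=6)    # a younger instruction also writes R0
rat.retire(5)                    # the older writer retires: no match in the RAT
still_renamed = rat.rename_source(0)
rat.writeback(6)
rat.retire(6)
back_in_rrf = rat.rename_source(0)
```

The second half of the usage shows why retirement needs a match rather than an unconditional reset: when the older writer of R0 retires, the table already points at the younger one.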

  13. ReOrder Buffer (ROB)
      [Figure: out-of-order engine block diagram, as on slide 11]

  14. ReOrder Buffer (ROB)
      • Structure: circular buffer of 24 entries (PDsts), each one holding all relevant data for a single instruction:
        • Op Code and Op Type
        • LDst
        • PSrc1: pointer, value and status
        • PSrc2: pointer, value and status (if needed)
        • Immediate (if needed)
        • Writeback Result
        • Dispatched, Valid bits
      • Large register file: 24 entries * 71 bits/entry = 1704 bits

  15. ReOrder Buffer (ROB) (cont.)
      • Function:
        • Hold all instructions currently in the execution window (issue → retirement).
        • Determine data-readiness of each instruction by CAM-matching WB buses vs. the entry’s PSrc pointers.
        • Dispatch data-ready instructions out-of-order to the appropriate RS (to be explained…☺).
        • Retire PDsts of executed instructions in-order to the Real Register File (RRF).
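The allocate, write back and retire-in-order life cycle of a ROB entry can be sketched as below. This is a deliberately reduced model: each entry keeps only LDst, the result and a done flag, and source tracking and dispatch are omitted. The class, the field names and the pointer names `oldest` and `next_free` are illustrative assumptions; only the 24-entry size and the 71 bits per entry come from the slides.

```python
ROB_ENTRIES = 24
BITS_PER_ENTRY = 71
TOTAL_BITS = ROB_ENTRIES * BITS_PER_ENTRY  # storage estimate quoted on the slide


class ReorderBuffer:
    def __init__(self, size=ROB_ENTRIES, num_logical=8):
        self.size = size
        self.entries = [None] * size
        self.oldest = 0      # next entry to retire
        self.next_free = 0   # next PDst to allocate
        self.count = 0
        self.rrf = [0] * num_logical  # Real Register File

    def issue(self, ldst):
        """Allocate the next entry; its index serves as the instruction's PDst."""
        if self.count == self.size:
            return None  # ROB full: issue must stall
        pdst = self.next_free
        self.entries[pdst] = {"ldst": ldst, "result": None, "done": False}
        self.next_free = (pdst + 1) % self.size
        self.count += 1
        return pdst

    def writeback(self, pdst, result):
        entry = self.entries[pdst]
        entry["result"] = result
        entry["done"] = True

    def retire(self):
        """Retire executed instructions strictly in program order."""
        retired = []
        while self.count and self.entries[self.oldest]["done"]:
            entry = self.entries[self.oldest]
            self.rrf[entry["ldst"]] = entry["result"]
            retired.append(self.oldest)
            self.entries[self.oldest] = None
            self.oldest = (self.oldest + 1) % self.size
            self.count -= 1
        return retired


rob = ReorderBuffer()
p0 = rob.issue(ldst=0)
p1 = rob.issue(ldst=1)
rob.writeback(p1, 7)     # the younger instruction finishes first
early = rob.retire()     # nothing retires: the oldest entry is not done
rob.writeback(p0, 3)
late = rob.retire()      # both retire, oldest first
```

Execution may complete out of order, but the RRF is only ever updated from the oldest entry, so architectural state changes in program order.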

  16. Dispatch Algorithm:
      • 3 independent iterators, scanning the ROB from tail to head:
        • BranchRS Iterator: searches for the oldest branch instruction yet to be dispatched.
        • MemRS Iterator: searches for the oldest memory instruction yet to be dispatched.
        • RegOpRS Iterator: searches for the oldest data-ready non-branch/memory instruction yet to be dispatched.
      • Iterators’ independence does not cause conflicts → no need for arbitration!
      • Problem: unbalanced dispatching can clog one ALU and starve the other, leading to diminished performance.
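The three iterators can be sketched as three independent scans over the same ROB contents. In this sketch the ROB is a plain list ordered oldest first, entries are dictionaries with hypothetical `kind`, `ready` and `dispatched` fields, and the type tags are the ones from the decoder slide. One assumption goes beyond the slide: the branch and memory iterators are taken to wait until their oldest candidate is data-ready, rather than skipping past it.

```python
REGOP_KINDS = {"REGREG", "REGIMM"}


def entry(kind, ready):
    return {"kind": kind, "ready": ready, "dispatched": False}


def oldest_undispatched(rob, kinds):
    """First not-yet-dispatched entry of the given kinds, scanning oldest to newest."""
    for idx, e in enumerate(rob):
        if e["kind"] in kinds and not e["dispatched"]:
            return idx
    return None


def oldest_ready_regop(rob):
    """Oldest data-ready register op; older ops that are not ready are skipped."""
    for idx, e in enumerate(rob):
        if e["kind"] in REGOP_KINDS and not e["dispatched"] and e["ready"]:
            return idx
    return None


def dispatch_step(rob):
    picks = {
        "BranchRS": oldest_undispatched(rob, {"BRANCH"}),
        "MemRS": oldest_undispatched(rob, {"MEM"}),
        "RegOpRS": oldest_ready_regop(rob),
    }
    # Assumption: branches and memory ops dispatch in order, so an iterator
    # whose oldest candidate is not ready dispatches nothing this step.
    for name in ("BranchRS", "MemRS"):
        idx = picks[name]
        if idx is not None and not rob[idx]["ready"]:
            picks[name] = None
    for idx in picks.values():
        if idx is not None:
            rob[idx]["dispatched"] = True
    return picks


rob = [
    entry("MEM", ready=False),
    entry("REGREG", ready=False),
    entry("BRANCH", ready=True),
    entry("REGIMM", ready=True),
    entry("MEM", ready=True),
]
picks = dispatch_step(rob)
```

Because each iterator only ever selects entries of its own instruction class, two iterators can never pick the same entry, which is why no arbitration is needed.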

  17. Dispatch Algorithm (cont.):
      • Solution: the ROB maintains a load-balance counter, ranging from -4 to 3:
        • incremented upon branch issue and memory dispatch
        • decremented upon memory issue and branch dispatch
      • The RegOpRS Iterator dispatches data-ready instructions according to the following rules:
      [Table: dispatch rules, not captured in the transcript]
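The counter updates listed above can be sketched as follows. The range of -4 to 3 is from the slide and fits a 3-bit two's-complement value. Clamping at the limits is an assumption (the slide does not say whether the counter saturates or wraps), and the steering rules that consume the counter are not reproduced because they are missing from the transcript.

```python
class LoadBalanceCounter:
    LO, HI = -4, 3  # range given on the slide

    def __init__(self):
        self.value = 0

    def _add(self, delta):
        # Assumption: clamp at the range limits instead of wrapping around.
        self.value = max(self.LO, min(self.HI, self.value + delta))

    def on_event(self, kind, event):
        if (kind, event) in (("BRANCH", "issue"), ("MEM", "dispatch")):
            self._add(+1)
        elif (kind, event) in (("MEM", "issue"), ("BRANCH", "dispatch")):
            self._add(-1)
        return self.value


events = [
    ("BRANCH", "issue"),
    ("BRANCH", "issue"),
    ("MEM", "issue"),
    ("BRANCH", "dispatch"),
    ("MEM", "dispatch"),
]
counter = LoadBalanceCounter()
history = [counter.on_event(kind, event) for kind, event in events]

saturated = LoadBalanceCounter()
for _ in range(5):
    saturated.on_event("BRANCH", "issue")
```

As long as the counter stays inside its range, its value equals the number of issued but undispatched branches minus the number of issued but undispatched memory ops, so its sign indicates which ALU has more type-specific work waiting.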

  18. Reservation Stations (RS)
      [Figure: out-of-order engine block diagram, as on slide 11]

  19. Reservation Stations (RS)
      • Function:
        • Buffer data-ready instructions for both ALUs, so as to minimize (or even eliminate!) execution idle time
        • Sort instructions according to type/priority for each ALU:
          • ALU0 – branch ops vs. non-branch/memory ops
          • ALU1 – memory ops vs. non-branch/memory ops
      [Figure: RS0 holds a Branch Op queue (fields V, Op, Src1, Src2, PDst) and a Non Branch/Mem Op queue (fields V, Op, Src1, Src2, Imm, PDst); RS1 holds a Mem Op queue and a Non Branch/Mem Op queue (fields V, Op, Src1, Src2, Imm, PDst)]

  20. ALUs
      [Figure: out-of-order engine block diagram, as on slide 11]

  21. ALUs
      • Function:
        • Continuously execute instructions from the respective RS and drive their associated PDsts and results on the WB buses.
        • Prioritize instructions:
          • branch ops have precedence over other ops on ALU0 → result (branch decision) is driven to IFU
          • memory ops have precedence over other ops on ALU1 → result (address) is driven to DCache
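The fixed-priority selection described above can be sketched with two queues per reservation station. The function name and the instruction labels are placeholders; the only behaviour taken from the slide is that the type-specific queue (branches on ALU0, memory ops on ALU1) is always served before the general register-op queue.

```python
from collections import deque


def next_op(priority_queue, regop_queue):
    """Pick the next op for an ALU: the type-specific queue always wins."""
    if priority_queue:
        return priority_queue.popleft()
    if regop_queue:
        return regop_queue.popleft()
    return None  # both queues empty: the ALU idles


# RS0 feeds ALU0: branch ops take precedence over register ops.
rs0_branch = deque(["branch_a"])
rs0_regop = deque(["regop_a", "regop_b"])
alu0_order = [next_op(rs0_branch, rs0_regop) for _ in range(4)]

# RS1 feeds ALU1: memory ops take precedence over register ops.
rs1_mem = deque(["mem_a", "mem_b"])
rs1_regop = deque(["regop_c"])
alu1_order = [next_op(rs1_mem, rs1_regop) for _ in range(3)]
```

Serving branches and memory ops first shortens the time until a branch decision reaches the IFU or an address reaches the data cache, at the cost of delaying register ops queued on the same ALU.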

  22. Data Cache
      [Figure: out-of-order engine block diagram, as on slide 11]

  23. Data Cache
      • Function:
        • Read/write memory operands in-order (according to the address and RdWr# signal from ALU1) and drive their PDsts (and results, for LW ops) on the WB buses.
        • Queue up to 4 pending memory access instructions, thus allowing ALU1 to execute successive LW/SW ops without stalling.

  24. Timeline
      • ASAP (bureaucracy…)
        • Install Balsa 3.3, including netlist technology, on the Lion server
        • Increase Linux user quotas
        • Install Exceed terminal server in the lab so that we can connect remotely to the Lion server
      • 4/3/04 (final report, first semester):
        • Asynchronous simulation of a complete data-path flow through the pipeline:
            mov R0, 1
            add R0, 1

  25. Balsa Initial Flow

  26. Balsa Flow
