ARMOR: Asynchronous RISC Microprocessor
Technion - Israel Institute of Technology
Faculty of Electrical Engineering
High Speed Digital Systems Laboratory
Mid-Semester Presentation
Submitted by: Tziki Oz-Sinay, Ori Lempel
Supervised by: Rony Mitleman
Milestones Reached
• Development platform selected
  • Balsa over Petrify+VHDL
• Micro-Architecture Specification (MAS) completed
  • Functional block partition, datapath interface defined
  • Asynchronous handshaking protocol defined
• Detailed asynchronous pseudo-code implementation written
• Balsa code writing, dynamic simulation and synthesis started
Development Platform Selection
Two development environments were examined:
Balsa:
• Language for synthesising large asynchronous circuits and systems
• Compiles to a small, parametric set of handshake components
• Balsa flow
• Balsa initial flow
Petrify:
• A synthesis tool for Petri Nets and asynchronous controllers
• Reads a Petri Net and generates another Petri Net, which is simpler than the original description but behaviorally similar
Development Platform Selection (cont.)
Balsa’s Advantages
• One development environment → easier debugging and integration
• Synthesis produces a delay-insensitive circuit → the implementation is transparent to the developer (no need for timing analysis)
• Control channels are created automatically at compilation
• High-level language → easier to learn
Development Platform Selection (cont.)
Petrify’s Advantages
• A more mature environment than Balsa
• When using Petrify, the core of the system is written in VHDL → all of the tools/flows are well known and supported in the lab
• Petrify’s output is translated to Verilog, while Balsa only supports EDIF synthesis → higher-level output, compatible with Altera
The Balsa Environment Was Chosen
This choice imposes new hardware requirements:
• A simplified design, comprising an in-order pipeline and no external memory, will be synthesized on a Xilinx Spartan FPGA
• The complete design will later be implemented on a Xilinx Virtex-II Pro
Handshake Protocol
[Figure: a push channel and a pull channel, each with REQ, ACK and n-bit DATA signals; timing diagram of the 4-phase protocol showing DATA, REQ and ACK]
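The 4-phase handshake on the two channel types can be sketched as an ordered list of signal events. This is an illustrative Python model, not Balsa output. The function names are hypothetical, and the exact point at which DATA becomes valid relative to REQ/ACK is an assumption (one common return-to-zero ordering); the slide only names the signals.

```python
def four_phase_push(data):
    """Signal events for one transfer on a push channel.

    Assumption: the sender drives DATA and REQ, the receiver drives ACK.
    """
    return [
        ("DATA", data),  # sender places valid data on the n-bit bus
        ("REQ", 1),      # phase 1: sender raises request
        ("ACK", 1),      # phase 2: receiver has taken the data
        ("REQ", 0),      # phase 3: sender withdraws request
        ("ACK", 0),      # phase 4: receiver withdraws acknowledge, channel idle
    ]


def four_phase_pull(data):
    """Signal events for one transfer on a pull channel.

    Assumption: the receiver drives REQ, the sender drives ACK and DATA.
    """
    return [
        ("REQ", 1),      # phase 1: receiver asks for data
        ("DATA", data),  # sender places valid data on the n-bit bus
        ("ACK", 1),      # phase 2: sender signals that data is valid
        ("REQ", 0),      # phase 3: receiver has taken the data
        ("ACK", 0),      # phase 4: sender releases the bus, channel idle
    ]
```

In both cases REQ and ACK each rise and fall once per transfer, which is what makes the protocol four-phase.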
ARMOR Pipestages
[Figure: pipeline diagram with the stages Fetch, Decode, Rename, Execute, Write Back and Retire, the Instruction Cache, the Data Cache and the Out Of Order Engine. Signal labels: PC[15:0], Inst[15:0], VInst[15:0], Op[3:0], LSrc[3:0], LDst[3:0], PDst[3:0], Imm[11:0], SrcVal1[15:0], SrcVal2[15:0], Val[15:0], ALU0PDst[3:0], ALU0Res[15:0], ALU1PDst[3:0], ALU1Res[15:0], MemPDst[3:0], BranchDecision, Addr[15:0], DataIn[15:0], DataOut[15:0], ReadWrite#]
Instruction Fetch Unit (IFU)
• Function:
  • Fetch the instruction pointed to by the PC register from the instruction cache.
  • Execute the jump instruction.
  • Calculate branch addresses, speculatively fetch branch target instructions and stall the pipeline pending the branch decision.
[Figure: IFU datapath. An adder combines the branch offset with the PC; the branch decision selects between PC+2 (next instruction) and the branch target as the address sent to the instruction cache; the branch instruction is passed to the ID]
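A minimal sketch of the address selection described above, in Python. The function names are hypothetical, and computing the target as PC + offset (rather than PC+2 + offset) is an assumption; the slide only shows an adder fed by the branch offset.

```python
ADDR_MASK = 0xFFFF  # PC[15:0] is 16 bits wide


def fetch_candidates(pc, branch_offset):
    """Return (next sequential address, branch target address)."""
    next_pc = (pc + 2) & ADDR_MASK
    # Assumption: the offset is relative to the branch's own PC.
    target = (pc + branch_offset) & ADDR_MASK
    return next_pc, target


def resolve(pc, branch_offset, branch_taken):
    """Pick the fetch address once the branch decision is known.

    branch_taken is None while the decision is pending, in which case the
    function returns None to model the pipeline stall.
    """
    next_pc, target = fetch_candidates(pc, branch_offset)
    if branch_taken is None:
        return None
    return target if branch_taken else next_pc
```

Both candidate addresses are available before the decision arrives, so the IFU can fetch the target speculatively and only needs the BranchDecision signal to choose between them.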
Instruction Decoder (ID)
• Function:
  • Tag instructions by type (REGREG, REGIMM, MEM, BRANCH).
  • Queue up to 4 issue-pending instructions, thus allowing continuous instruction fetching in case instruction issue stalls.
[Figure: instruction queue with V (valid) and Inst fields per entry, and tail and head pointers]
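The 4-entry queue can be modelled as a small circular buffer. The sketch below is illustrative Python, not the Balsa implementation; the class and method names are hypothetical, and the roles of the pointers (head = next free slot, tail = oldest entry) are an assumption.

```python
class InstQueue:
    """4-entry circular instruction queue (illustrative model of the ID buffer)."""

    DEPTH = 4  # "up to 4 issue-pending instructions"

    def __init__(self):
        self.slots = [None] * self.DEPTH  # None models a cleared V (valid) bit
        self.head = 0   # assumed: next free slot, written when an instruction arrives
        self.tail = 0   # assumed: oldest instruction, read on issue
        self.count = 0

    def can_accept(self):
        """Fetching may continue as long as the queue is not full."""
        return self.count < self.DEPTH

    def push(self, inst):
        """Enqueue a decoded instruction; returns False when the queue is full."""
        if not self.can_accept():
            return False
        self.slots[self.head] = inst
        self.head = (self.head + 1) % self.DEPTH
        self.count += 1
        return True

    def pop(self):
        """Issue the oldest instruction; returns None when the queue is empty."""
        if self.count == 0:
            return None
        inst = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % self.DEPTH
        self.count -= 1
        return inst
```

While issue is stalled, `push` keeps succeeding until four instructions are buffered; only then does fetch have to wait.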
[Figure: block diagram of the out-of-order engine. Blocks: RAT, ROB, RRF, RS0, RS1, ALU0, ALU1, Data Cache. Edge labels: Inst from ID, branches, non-mem inst, mem inst, non-branch inst, BranchDecision to IFU]
Register Alias Table (RAT)
• Function:
  • Register renaming – map logical sources/destinations to physical registers (ROB/RRF entries):
    • Allocate physical destination (PDst) pointers during instruction issue
    • Reset pointers during retirement (CAM-match logic)
  • Monitor data-readiness of physical sources/destinations:
    • Reset ready-bit during instruction issue
    • Set ready-bit during writeback (CAM-match logic)
[Figure: RAT table with PDst and Ready columns and one row per logical register R0 to R7]
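The renaming and ready-bit rules above can be sketched as follows. This is an illustrative Python model with hypothetical names; representing "value lives in the RRF" as `None` is an assumption about the encoding, and the CAM match is modelled as a plain loop over all rows.

```python
class RegisterAliasTable:
    """Illustrative RAT model: one (PDst, Ready) row per logical register."""

    NUM_LOGICAL = 8  # R0..R7

    def __init__(self):
        # Assumption: None means the register's value lives in the RRF.
        self.pdst = [None] * self.NUM_LOGICAL
        self.ready = [True] * self.NUM_LOGICAL

    def lookup(self, lsrc):
        """Map a logical source to (physical pointer, ready bit)."""
        return self.pdst[lsrc], self.ready[lsrc]

    def issue(self, ldst, new_pdst):
        """Allocate a PDst for the destination and reset its ready bit."""
        self.pdst[ldst] = new_pdst
        self.ready[ldst] = False

    def writeback(self, wb_pdst):
        """Set the ready bit of every row whose pointer matches (CAM match)."""
        for reg in range(self.NUM_LOGICAL):
            if self.pdst[reg] == wb_pdst:
                self.ready[reg] = True

    def retire(self, retired_pdst):
        """Reset the pointer of every row that still maps to the retired PDst."""
        for reg in range(self.NUM_LOGICAL):
            if self.pdst[reg] == retired_pdst:
                self.pdst[reg] = None
                self.ready[reg] = True
```

The match on retirement matters when a register has been renamed again in the meantime: a newer mapping must not be reset by the retirement of an older instruction.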
ReOrder Buffer (ROB)
• Structure: circular buffer of 24 entries (PDsts), each one holding all relevant data for a single instruction:
  • Op Code and Op Type
  • LDst
  • PSrc1 – pointer, value and status
  • PSrc2 – pointer, value and status (if needed)
  • Immediate (if needed)
  • Writeback Result
  • Dispatched, Valid bits
• Large register file: 24 entries * 71 bits/entry = 1704 bits
• Function:
  • Hold all instructions currently in the execution window (issue → retirement).
  • Determine data-readiness of each instruction by CAM-matching the WB buses against the entry’s PSrc pointers.
  • Dispatch data-ready instructions out-of-order to the appropriate RS (to be explained…☺).
  • Retire PDsts of executed instructions in-order to the Real Register File (RRF).
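The wakeup and in-order retirement behaviour can be sketched as follows. This is an illustrative Python model with hypothetical names; it uses a `deque` instead of explicit head/tail pointers and tracks only the fields needed for the example, not the full 71-bit entry.

```python
from collections import deque


class RobEntry:
    def __init__(self, pdst, ldst, psrcs):
        self.pdst = pdst                # physical destination (entry id)
        self.ldst = ldst                # logical destination register
        # Pending source pointer -> value (None until seen on a WB bus).
        self.src_vals = {p: None for p in psrcs}
        self.result = None
        self.done = False               # set once the result is written back

    def data_ready(self):
        return all(v is not None for v in self.src_vals.values())


class ReorderBuffer:
    SIZE = 24

    def __init__(self):
        self.entries = deque()          # leftmost entry is the oldest
        self.rrf = {}                   # Real Register File: ldst -> value

    def issue(self, entry):
        if len(self.entries) >= self.SIZE:
            return False                # window full, issue must stall
        self.entries.append(entry)
        return True

    def writeback(self, wb_pdst, value):
        """CAM-match one WB bus against every entry."""
        for e in self.entries:
            if wb_pdst in e.src_vals:   # wake up a waiting source
                e.src_vals[wb_pdst] = value
            if e.pdst == wb_pdst:       # record the entry's own result
                e.result = value
                e.done = True

    def retire(self):
        """Retire finished entries strictly in program order."""
        retired = []
        while self.entries and self.entries[0].done:
            e = self.entries.popleft()
            self.rrf[e.ldst] = e.result
            retired.append(e.pdst)
        return retired
```

A younger instruction that finishes early stays in the buffer until every older entry has retired, which keeps the RRF consistent with program order.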
Dispatch Algorithm:
• 3 independent iterators, scanning the ROB from tail to head:
  • BranchRS Iterator – searches for the oldest branch instruction yet to be dispatched.
  • MemRS Iterator – searches for the oldest memory instruction yet to be dispatched.
  • RegOpRS Iterator – searches for the oldest data-ready non-branch/memory instruction yet to be dispatched.
• The iterators are independent and do not conflict → no need for arbitration!
• Problem: unbalanced dispatching can clog one ALU and starve the other, leading to diminished performance.
Dispatch Algorithm (cont.):
• Solution: the ROB maintains a load-balance counter, ranging from -4 to 3:
  • incremented upon branch issue and memory dispatch
  • decremented upon memory issue and branch dispatch
• The RegOpRS Iterator dispatches data-ready instructions according to the following rules:
[Table: RegOpRS dispatch rules]
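The counter updates listed above can be sketched as follows. This is an illustrative Python model with hypothetical names. Clamping at the range limits is an assumption (the slide gives only the range, which fits a 3-bit signed value), and the RegOpRS dispatch rules that consume the counter are not modelled here.

```python
class LoadBalanceCounter:
    """Illustrative model of the ROB load-balance counter."""

    MIN, MAX = -4, 3

    def __init__(self):
        self.value = 0

    def _add(self, delta):
        # Assumption: the counter saturates instead of wrapping around.
        self.value = max(self.MIN, min(self.MAX, self.value + delta))

    def branch_issued(self):
        self._add(+1)

    def memory_dispatched(self):
        self._add(+1)

    def memory_issued(self):
        self._add(-1)

    def branch_dispatched(self):
        self._add(-1)
```

As long as the counter does not hit a limit, its value equals the number of branches awaiting dispatch minus the number of memory instructions awaiting dispatch, which indicates which of the two ALUs has more type-specific work queued up.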
Reservation Stations (RS)
[Figure: RS0 holds a Branch Op entry (fields shown include V, Src1, Src2, PDst) and a Non Branch/Mem Op entry (V, Op, Src1, Src2, Imm, PDst); RS1 holds a Mem Op entry and a Non Branch/Mem Op entry (each V, Op, Src1, Src2, Imm, PDst)]
• Function:
  • Buffer data-ready instructions for both ALUs, so as to minimize (or even eliminate!) execution idle time
  • Sort instructions according to type/priority for each ALU:
    • ALU0 – branch ops vs. non-branch/memory ops
    • ALU1 – memory ops vs. non-branch/memory ops
ALUs
• Function:
  • Continuously execute instructions from the respective RS and drive their associated PDsts and results on the WB buses.
  • Prioritize instructions:
    • Branch ops have precedence over other ops on ALU0 → the result (branch decision) is driven to the IFU
    • Memory ops have precedence over other ops on ALU1 → the result (address) is driven to the DCache
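The priority rule can be sketched as a small selection function. This is illustrative Python; the function name, the slot arguments and the destination labels are hypothetical, and an empty RS slot (valid bit clear) is modelled as `None`.

```python
def select_next(alu_id, priority_slot, regop_slot):
    """Choose the next RS entry for an ALU and where its result goes.

    priority_slot: the branch entry for ALU0, the memory entry for ALU1.
    regop_slot:    the non-branch/memory entry of the same RS.
    Returns (instruction, destination) or (None, None) when both are empty.
    """
    if priority_slot is not None:
        # Branch decision goes to the IFU, memory address to the DCache.
        destination = "IFU" if alu_id == 0 else "DCache"
        return priority_slot, destination
    if regop_slot is not None:
        return regop_slot, "WB"  # ordinary result, driven on the WB bus
    return None, None
```

For example, `select_next(0, "beq", "add")` picks the branch first, and the `add` is executed on a later step once the branch slot is empty.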
Data Cache
• Function:
  • Read/write memory operands in-order (according to the address and RdWr# signal from ALU1) and drive their PDsts (and results, for LW ops) on the WB buses.
  • Queue up to 4 pending memory access instructions, thus allowing ALU1 to execute successive LW/SW ops without stalling.
Timeline
• ASAP (bureaucracy…)
  • Install Balsa 3.3, including netlist technology, on Lion server
  • Increase Linux user quotas
  • Install Exceed terminal server in lab so that we can remotely connect to Lion server
• 4/3/04 (final report, first semester):
  • Asynchronous simulation of a complete data-path flow through the pipeline:
    mov R0, 1
    add R0, 1