380 likes | 649 Views
HAsim FPGA-Based Processor Models: Basic Models. Michael Adler Elliott Fleming Michael Pellauer Joel Emer. Managing Complexity. How do we reduce the work of writing software models? Re-use components Split functional & timing models
E N D
HAsim FPGA-Based Processor Models: Basic Models Michael AdlerElliott FlemingMichael PellauerJoel Emer
Managing Complexity • How do we reduce the work of writing software models? • Re-use components • Split functional & timing models • Model only what is necessary to compute target performance • Limit data manipulation in timing models, e.g.: • Manage cache tags but not data • Let functional model handle DEC and EXE details • Same on FPGA, plus: • Use software for operations that don’t affect model speed
Functional Model • ISA semantics only – no timing • Similarities to software models: • ISA-agnostic code implements core functions: • Register file • Virtual to physical address translation • Store buffer • ISA-specific code: • Decode • Execute • HAsim currently implements Alpha and SMIPS
Functional Model is Hybrid FPGA / Software • Difficult but infrequent tasks do not need to be accelerated • HAsim uses gem5 for: • Loading programs • Target machine address space management • Emulation of rare & complex instructions
Functional Model is Latency Insensitive (LI) • LI design enables hybrid FPGA / software implementation • Functional memory is cached on FPGA, homed on host • Emulate system calls (gem5 user-mode) in software • Area-efficientFPGA implementation • Store buffer uses a hash table instead of a CAM • Choose number of register-file read ports to handle the common case • Serialized rewind
Time Functional Memory Mem Read Interface Private Mem Cache Private Mem Cache Mem Read Interface Hit Miss Central Cache Central Cache Hit FPGA Miss Mem Interface Mem Interface Memory (m5) Software
Reducing Model Complexity: Shared Functional Model Functional Pipeline • Similar philosophy to software models: • Write ISA functional model once • Functional machine state is completely managed • Timing models can be ISA-independent • Each functional pipeline stage behaves like a request/response FIFO ISPASS 2008 Paper:Quick Performance Models Quickly: Timing-Directed Simulation on FPGAs ITranslate Fetch FunctionalState Decode Execute DTranslate Memory LocalCommit GlobalCommit
FPGA Functional Detail: Token Indexed Scoreboard • Functional model maintains a “token” for each active instruction • Like a pointer, but better on hardware than new, delete and global storage • Token indexed state in functional scoreboard: • Decode information (isLoad, isStore, …) • Architectural to physical register mapping • Memory information • Model verification logic (e.g. prove that an instruction properly flows through all functional stages.)
Functional Model API (Named Connections) • Connection_Client#(FUNCP_REQ_DO_ITRANSLATE, • FUNCP_RSP_DO_ITRANSLATE) linkToITR <- mkConnection_Client("funcp_doITranslate"); • Connection_Client#(FUNCP_REQ_GET_INSTRUCTION, • FUNCP_RSP_GET_INSTRUCTION) linkToFET <- mkConnection_Client("funcp_getInstruction"); • Connection_Client#(FUNCP_REQ_GET_DEPENDENCIES, • FUNCP_RSP_GET_DEPENDENCIES) linkToDEC <- mkConnection_Client("funcp_getDependencies"); • Connection_Client#(FUNCP_REQ_GET_RESULTS, • FUNCP_RSP_GET_RESULTS) linkToEXE <- mkConnection_Client("funcp_getResults"); • Connection_Client#(FUNCP_REQ_DO_DTRANSLATE, • FUNCP_RSP_DO_DTRANSLATE) linkToDTR <- mkConnection_Client("funcp_doDTranslate"); • Connection_Client#(FUNCP_REQ_DO_LOADS, • FUNCP_RSP_DO_LOADS) linkToLOA <- mkConnection_Client("funcp_doLoads"); • Connection_Client#(FUNCP_REQ_DO_STORES, • FUNCP_RSP_DO_STORES) linkToSTO <- mkConnection_Client("funcp_doSpeculativeStores"); • Connection_Client#(FUNCP_REQ_COMMIT_RESULTS, • FUNCP_RSP_COMMIT_RESULTS) linkToLCO <- mkConnection_Client("funcp_commitResults"); • Connection_Client#(FUNCP_REQ_COMMIT_STORES, • FUNCP_RSP_COMMIT_STORES) linkToGCO <- mkConnection_Client("funcp_commitStores"); • Connection_Send#(CONTROL_MODEL_CYCLE_MSG) linkModelCycle <- mkConnection_Send("model_cycle"); • Connection_Send#(CONTROL_MODEL_COMMIT_MSG) linkModelCommit <- mkConnection_Send("model_commits");
Functional: Sample Data Structures • // FUNCP_REQ_DO_ITRANSLATE • typedefstruct • { • CONTEXT_ID contextId; • ISA_ADDRESS virtualAddress; // Virtual address to translate • } • FUNCP_REQ_DO_ITRANSLATE • deriving (Eq, Bits); • // FUNCP_RSP_DO_ITRANSLATE • typedefstruct • { • CONTEXT_ID contextId; • MEM_ADDRESS physicalAddress; // Result of translation. • MEM_OFFSET offset; // Offset of the instruction. • Bool fault; // Translation failure: fault will be raised on • // attempts to commit this token. physicalAddress • // is on the guard page, so it can still be used • // in order to simplify timing model logic. • BoolhasMore; // More translations coming? (The fetch spans two addresses.) • } • FUNCP_RSP_DO_ITRANSLATE • deriving (Eq, Bits);
Key Components • How do I start running? • Signal completion? • Execute instructions?
The “Starter” • Global controller is in software • Orchestrates program loading • Tells local controllers when to begin • Local controllers (FPGA) • Associated with target machine pipeline stage models • LI (like most of the API) LOCAL_CONTROLLER#(MAX_NUM_CPUS) localCtrl <- mkLocalController(inports, outports); let cpu_iid <- localCtrl.startModelCycle(); localCtrl.endModelCycle(cpu_iid, 1);
Timing: Starting the Pipeline • rule stage1_itrReq (True); • // Begin a new model cycle. • let cpu_iid <- localCtrl.startModelCycle(); • linkModelCycle.send(cpu_iid); • debugLog.nextModelCycle(cpu_iid); • …
Must Drive the Functional Pipeline Functional Pipeline ITranslate Fetch FunctionalState Decode Execute DTranslate Memory LocalCommit GlobalCommit
Timing Model Timing Pipeline Functional Pipeline IP • Timing & functional models communicate state using tokens • Minimal timing model: • Only state is IP • Drives a single token at a time ITranslate Fetch FunctionalState Decode Execute Next IP DTranslate Memory LocalCommit GlobalCommit
Timing: Stage 1 (ITranslate) • rule stage1_itrReq (True); • // Begin a new model cycle. • let cpu_iid <- localCtrl.startModelCycle(); • linkModelCycle.send(cpu_iid); • debugLog.nextModelCycle(cpu_iid); • // Translate next pc. • Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid]; • let ctx_id = getContextId(cpu_iid); • linkToITR.makeReq(initFuncpReqDoITranslate(ctx_id, pc)); • debugLog.record_next_cycle(cpu_iid, • $format("Translating virtual address: 0x%h", pc)); • endrule pcPool is the only state in the timing model!
Timing: Fetch • rule stage2_itrRsp_fetReq (True); • // Get the ITrans response started by stage1_itrReq • let rsp = linkToITR.getResp(); • linkToITR.deq(); • let cpu_iid = getCpuInstanceId(rsp.contextId); • debugLog.record(cpu_iid, • $format("ITR Responded, hasMore: %0d", rsp.hasMore)); • // Fetch the next instruction • linkToFET.makeReq(initFuncpReqGetInstruction(rsp.contextId,rsp.physicalAddress, • rsp.offset)); • debugLog.record(cpu_iid, • $format("Fetching physical address: 0x%h, offset: 0x%h",rsp.physicalAddress, rsp.offset)); • endrule
Timing: Execute Response • rule stage5_exeRsp (True); • // Get the execution result • let exe_resp = linkToEXE.getResp(); • linkToEXE.deq(); • let tok = exe_resp.token; • let res = exe_resp.result; • let cpu_iid = tokCpuInstanceId(tok); • Reg#(ISA_ADDRESS) pc = pcPool[cpu_iid]; • // If it was a branch we must update the PC. • case (res) matches • tagged RBranchTaken .addr: • begin • pc <= addr; • end • tagged RBranchNotTaken .addr: • begin • pc <= pc + exe_resp.instructionSize; • end • default: • begin • pc <= pc + exe_resp.instructionSize; • end • endcase • . . .
Timing: Execute (Minimal Requirement) • Send the PC to the functional model • Receive the updated PC from the functional model
Functional: Execute • Read input registers required from token scoreboard • Wait for input registers to have valid values • Functional model supports OOO models but enforces execution in a valid order • Timing model must execute instructions in a valid order or simulation will deadlock • Forward values to an ISA-specific data path • Write result to physical register(s) • Functional model may return without finishing result register write (e.g. floating point emulation) • State returned to the timing model (e.g. branch) resolution must be computed before return
Time Hybrid Instruction Emulation (Infrequent/Complicated Instructions) FPGA Execute Execute FunctionalCache … … Sync Registers Emulate Instruction Sync Registers Emulation Done RRRLayer Write Back orInvalidate Done Write Line Ack MemoryServer EmulationServer EmulationServer Software Functional Instruction Simulator
Unpipelined Model Source • Single source file: unpipelined-pipeline.bsv
Model Performance Goal: Pipeline Parallelism Functional Pipeline IPs • Unpipelined target necessarily serialized functional model calls • Pipelined target could run each stage in parallel • Must solve one problem: how to manage time ITranslate Fetch FunctionalState Decode Execute Next IPs DTranslate Memory LocalCommit GlobalCommit
Managing Time: A-Ports and Soft Connections • FPGA cycles != simulated cycles • HAsim computes target time algorithmically • We are building a timing model, NOT a prototype • 1:ncycle mapping would force us to slow the timing clock to the longest operation, even if it is infrequent • 1:nwould force us either to fit an entire design on the FPGA or synchronize clock domains
curCC Option #1: Global Controller [rejected] Controller • Central controller advances cycle when all modules are ready • Slowest possible cycle no longer dictates throughput • However: • Place & route becomes difficult • Long signal to global controller is on the critical path FET DEC EXE MEM WB
Option #2: A-Ports • Extension of Asimnamed channels • FIFO with user-specified latency and capacity • Pass exactly one message per cycleper port FET DEC EXE MEM WB 1 1 1 1 0 2 • Beginning of model cycle: read all input ports • End of model cycle: write all output ports ISFPGA 2008 Paper:A-Ports: An Efficient Abstraction for Cycle-Accurate Performance Models on FPGAs
Now the Timing Model is LI! • Each target machine pipeline stage may take multiple FPGA cycles • This reduces • Algorithmic complexity • FPGA timing pressure • Less work per cycle • May use FPGA area-efficient code (e.g. linear search replacing a CAM)
Target Machine Model Pipeline Stage Conventions • Separate source module for each target pipeline stage • Private local controller for each module • Local controllers automatically assemble themselves onto a ring • All are managed by the global controller • All external communication is via A-Ports • Functional model API is only exception (we have debated switching to A-Ports)
Example: In-order Decode FPGA Pipeline Stages These FPGA stages represent one target machine cycle: • Receive register scoreboard writebacks from EXE and MEM stages • Consume faults from EXE and COMMIT stages (trigger rewind) • Consume dependence info (scoreboard state and writebacks) • Attempt issue (if data ready and EXE slot available) • Update local state (scoreboard) inorder-decode-stage.bsv
Key Observation: Parallel Target Machine Yields Parallel FPGA Model • Each pipeline stage is a separate module • Only dependence between modules is through A-Ports • A-Ports are parallel • Parallelism in the model is proportional to the parallelism of the target
Pipeline Parallelism Functional Pipeline IPs • Model of a pipelined design naturally runs pipelined on an FPGA • Detailed model of a pipelined design runs faster than a trivial, unpipelined model! ITranslate Fetch FunctionalState Decode Execute Next IPs DTranslate Memory LocalCommit GlobalCommit
What Makes Modeling Caches Hard on FPGAs? • Storage • Not much else – cache management is just a set of pipeline stages
Modeled Caches Are Not as Big as You Might Think • Only model tags – no data in the timing model • Cache model is LI • Connected only by A-Ports • FPGA-latency of the cache tag storage is irrelevant! • Tag storage on the FPGA may be hierarchical • Build FPGA cache to model a target-machine cache, but they are unrelated (LEAP provides automatically generated caches) • Terminology gets messy, but the implementation does not. LEAP scratchpad storage has the same interface as an array.
Multiple Cores and Shared Cache • Later, we will consider multiplexed timing pipelines • On-chip network connecting a shared cache becomes interesting(interleaving varies with OCN topology)