1 / 26

HPEC 2012

Fast Functional Simulation with a Dynamic Language Craig S. Steele, Exogi LLC, USA JP Bonn, Exogi LLC, USA. HPEC 2012. Fast Functional Simulation with a Dynamic Language Craig S. Steele and JP Bonn Exogi LLC Las Vegas, NV, USA. Presentation Outline. Motivation for Fast Functional Simulator

chinue
Download Presentation

HPEC 2012

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fast Functional Simulation with a Dynamic LanguageCraig S. Steele, Exogi LLC, USAJP Bonn, Exogi LLC, USA

  2. HPEC 2012 Fast Functional Simulation with a Dynamic Language Craig S. Steele and JP Bonn Exogi LLCLas Vegas, NV, USA

  3. Presentation Outline • Motivation for Fast Functional Simulator • Programmable Components • Static versus Dynamic Simulators • Dynamic Binary Translation • Lua Scripting Language • Dynamic Language Tracing JIT compilers • Instruction Pipeline in Software • Simulating Heterogeneous Devices

  4. Motivation for Fast Functional Sim • Larger systems (including SoCs, FPGAs) • Many programmable components • Non-programmable have functional spec • Lower-level simulations slow, getting slower • Waterfall (or over-the-wall) HW/SW flow • Software is late from the start, unintegrated • Fast functional sim enables SW developers • Future resilient systems need autonomic SW

  5. Static vs. Dynamic ISS • Fetch-decode-execute iterative interpreter • 10+ MIPS • SW analog of a non-pipelined CPU • Pre-decode binary instructions • Convert each instruction to “C” function call • Compile C function call to native code • Do more of the above at runtime • Support JIT and self-modifying code

  6. Dynamic Binary Translation • Dynamic Binary Translator (DBT) does it all • QEMU (Quick EMUlator) Fabrice Ballard • Hundreds rather than tens of simulated MIPS • Basic blocks (BBs), not single instructions • Can be used for whole-system simulation • Mainly PC-oriented virtual machines • Also some embedded non-PC systems • Support for several instruction sets

  7. Code Generation DSL • Model target instruction set with “micro-ops” • Tiny Code Generator (TCG) • Second Generation DSL • Another app-specific language to remember • Each instruction set is a different port • Supportive user community is tiny • Simulator code is very different from TCG • New device may need new TCG micro-ops • Code differs for prog & non-prog devices

  8. Static vs. Dynamic Languages • Static Languages: C / C++ • Variables have a type • Compiler detects type mismatches • Compiled separately from execution • Optimize lots of possible paths • Dynamic Languages • Data is typed, variable can hold any datum • Type is detected at runtime • “Scripting” languages are often interpreted

  9. Lua Language • Small, multi-paradigm “scripting” language • Supports a functional programming style • First-class functions & closures • Small conceptual and resource footprint • Well under 300KB memory footprint • Good embeddability and portability • Good interface to C

  10. Lua Language Quirks • Only composite data type is table • Associative array • Supports both integer and hashed keys • Limited native data types • “Number” type: double or dual double/integer • Dynamic type inference/specialization • No standardized class-based object system • Many object systems, may not intermix • Tables & closures reduce centrality of classes

  11. LuaJIT Compiler • Tracing Just-in-Time compilers • Excellent for dynamic languages • Data types can be observed at runtime • Optimization and specialization of types • Source compiled to register-based bytecode • Fast assembly-coded interpreter for tracing • “Hot code” chunks optimized to native binary • Excellent C Foreign-Function Interface (FFI)

  12. C One-Instr. Fetch-Execute Loop for (int i=0; i < nIters; i++) { #if TRACE printf("%08d| PC = 0x%08X\n", i, node->curPC); if (node->curPC == pausePC) pauseCount++; #endif uint32_t deltaPC = sfp((void *)di, (void *)node); node->curPC = node->nextPC; di = virtual_icache_entry(node->nextPC,node); sfp = di->fPtr; node->nextPC += deltaPC; }

  13. Lua One-BB Fetch-Execute Loop for i = 1, n_BBs do -- 140 MOps vs 572 MOps with single-entry cache if cur_pc ~= next_pc then -- avoid table lookup if no change cur_pc = next_pc cur_instr = loc_imem[cur_pc] end next_pc = cur_instr(cpu) end

  14. Closures as Object Substitutes 1 -- Generated generic code for all “xor” instructions -- Anonymous function is a “factory” for XOR/XORI instr. instances -- When factory function is called, function arguments are persistently bound to values return function(seq_instr,seq_pc,br_pc,cpu,op_fcns,store_RX,valA,valB) local result = op_fcns.lm32_alu_fcn_xor(valA,valB) return store_RX(seq_instr,seq_pc,br_pc,cpu,result) end

  15. Closures as Object Substitutes 2 -- Generated generic code for all “bge” branch instructions -- Anonymous function is a “factory” for BGE/BGEU instr. instances -- When factory function is called, function arguments are persistently bound to values return function(seq_instr,seq_pc,br_pc,cpu,op_fcns,select_pc,valA,valB) local result = op_fcns.lm32_cond_fcn_bge(valA,valB) local next_pc = select_pc(seq_pc,br_pc,result) return next_pc -- Dynamic return types: (integer | function ref) end

  16. Tail-Call Optimization • Where does “TCOSim” name come from? • TC: if last statement in a function is a fcn call • A tail call is a GOTO for functional languages • Tail-call Optimization (TCO) • Recursion without a stack • Multiple functions can be optimized as one • LuaJIT tracing compiler optimizes calls away • Functions are organizational, not executable

  17. Pipelined Instruction Definition Example: 3-operand RISC ALU instruction class local function lm32_build_RR_format_instr(format,instr) … local load_valB = lm32_load_RZ[instr[4]] … return load_valB(seq_instr, seq_pc,cpu,lm32_ops,store_val,alu_op,load_valA) …return load_valA (seq_instr, … … return alu_op(seq_instr, … … return seq_instr(cpu) Example: Helper ALU operation shared by XOR/XORI local function lm32_alu_fcn_xor(x,y) local result = bit_bxor(x + y) return result end

  18. Basic Block Instruction Fusion • Non-branching instructions end with tail call return seq_instr(cpu) • Branch instructions return new PC lm32_operation_builder_string["BI"] = [[ return function(seq_instr, seq_pc,br_pc,cpu,op_fcns,select_pc,valA,valB) local result = op_fcns.lm32_cond_fcn_%s(valA,valB) local next_pc = select_pc(seq_pc,br_pc,result) return next_pc end ]]

  19. Basic C vs. TCOSim ISS Speed C-based Single-Instruction Sim: Tens of MIPS

  20. Simulating Heterogeneous Devices • Simulator calls a chain of instructions • Basic block, ending with branch • Device/CPU is a parameter • CPU or other programmed HW • With HDL design, most HW is a program • Devices can be intermixed in BB chain • Multiple CPU HW threads (or hetero CPUs) • DMA devices • Event polling devices, e.g., interrupt control

  21. Future Directions • Integrate gdb debug server • Compile existing C server code for target • Run gdb server as “virtual” thread on sim • Should be fairly portable to real HW • Support external events, e.g., interrupts • Interleave polling instruction blocks • Asynchronous callback to inner loop • Multi-threaded simulation host support • Self-host LuaJIT compiler within sim target

  22. Conclusion • Lua: small dynamic scripting language • Easy dynamic-code-creating programs • Fast edit-run-expletive cycle • Can be fast with a tracing JIT compiler • Surprisingly good fit for functional simulation • Loose typing good for heterogenous devices • Conventional coding, not exotic ISS DSL • Small enough and fast enough to go meta

  23. Meta-Meta http://xkcd.com/917/

  24. end --Thank You! Craig Steele, steele@exogi.com

  25. Caching in Lua • Lua has one primitive composite data type • Table is associative array • Almost any data type can be used as index • Integers are optimized like C arrays • Other index types are used for hashtable • Tables are both arrays and hashtables • Meta-tables allow custom index operations • Using one table to cache another is very easy

  26. A Use Case of Caching in Lua • Instruction-memory decoding can be either • Static: binary available at start, unchanging • Dynamic: binary is JITed or self-modifying • Step 0 – Get binary encoded instruction • Step 1 – Decode to table, e.g., {"add”,1,0,21} • Step 2 – Build instruction as stack of Lua fcns • Step 3 – Link instructions into basic blocks • Step 4 – Locate BBs at absolute addresses • Every step can be cached and “on demand”

More Related