Fast Functional Simulation with a Dynamic Language
Craig S. Steele and JP Bonn, Exogi LLC, Las Vegas, NV, USA
HPEC 2012
Presentation Outline
• Motivation for Fast Functional Simulator
• Programmable Components
• Static versus Dynamic Simulators
• Dynamic Binary Translation
• Lua Scripting Language
• Dynamic Language Tracing JIT compilers
• Instruction Pipeline in Software
• Simulating Heterogeneous Devices
Motivation for Fast Functional Sim
• Larger systems (including SoCs, FPGAs)
• Many programmable components
• Non-programmable have functional spec
• Lower-level simulations slow, getting slower
• Waterfall (or over-the-wall) HW/SW flow
• Software is late from the start, unintegrated
• Fast functional sim enables SW developers
• Future resilient systems need autonomic SW
Static vs. Dynamic ISS
• Fetch-decode-execute iterative interpreter
• 10+ MIPS
• SW analog of a non-pipelined CPU
• Pre-decode binary instructions
• Convert each instruction to “C” function call
• Compile C function call to native code
• Do more of the above at runtime
• Support JIT and self-modifying code
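The first two styles on this slide can be contrasted with a minimal sketch. The toy three-instruction ISA, the register layout, and every name below are invented for illustration and are not taken from the simulator described in the talk. The first runner re-decodes each instruction every time it executes; the second converts each instruction to a function call once, ahead of execution.

```lua
-- Toy ISA, invented for illustration only.
local program = {
  { "addi", 1, 0, 5 },   -- r1 = r0 + 5
  { "addi", 2, 1, 7 },   -- r2 = r1 + 7
  { "add",  3, 1, 2 },   -- r3 = r1 + r2
  { "halt" },
}

-- Style 1: iterative interpreter that decodes on every execution.
local function run_interpreted(prog)
  local regs = { [0] = 0, 0, 0, 0 }
  local pc = 1
  while true do
    local ins = prog[pc]
    local op = ins[1]
    if op == "addi" then
      regs[ins[2]] = regs[ins[3]] + ins[4]
    elseif op == "add" then
      regs[ins[2]] = regs[ins[3]] + regs[ins[4]]
    elseif op == "halt" then
      return regs
    end
    pc = pc + 1
  end
end

-- Style 2: pre-decode each instruction once into a function call.
local builders = {
  addi = function(rd, rs, imm)
    return function(regs) regs[rd] = regs[rs] + imm end
  end,
  add = function(rd, ra, rb)
    return function(regs) regs[rd] = regs[ra] + regs[rb] end
  end,
}

local function predecode(prog)
  local decoded = {}
  for i, ins in ipairs(prog) do
    if ins[1] == "halt" then break end
    decoded[i] = builders[ins[1]](ins[2], ins[3], ins[4])
  end
  return decoded
end

local function run_predecoded(decoded)
  local regs = { [0] = 0, 0, 0, 0 }
  for i = 1, #decoded do decoded[i](regs) end
  return regs
end

local regs_a = run_interpreted(program)
local regs_b = run_predecoded(predecode(program))
```

Both runners leave the same register state; the pre-decoded form simply moves the opcode dispatch out of the execution loop.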
Dynamic Binary Translation
• Dynamic Binary Translator (DBT) does it all
• QEMU (Quick EMUlator) by Fabrice Bellard
• Hundreds rather than tens of simulated MIPS
• Basic blocks (BBs), not single instructions
• Can be used for whole-system simulation
• Mainly PC-oriented virtual machines
• Also some embedded non-PC systems
• Support for several instruction sets
Code Generation DSL
• Model target instruction set with “micro-ops”
• Tiny Code Generator (TCG)
• Second Generation DSL
• Another app-specific language to remember
• Each instruction set is a different port
• Supportive user community is tiny
• Simulator code is very different from TCG
• New device may need new TCG micro-ops
• Code differs for prog & non-prog devices
Static vs. Dynamic Languages
• Static Languages: C / C++
• Variables have a type
• Compiler detects type mismatches
• Compiled separately from execution
• Optimize lots of possible paths
• Dynamic Languages
• Data is typed, variable can hold any datum
• Type is detected at runtime
• “Scripting” languages are often interpreted
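A small sketch of the dynamic side of this comparison, using only standard Lua. The variable names are arbitrary. One variable holds a number, a string, a function, and a table in turn, and a type mismatch only surfaces when the offending operation actually runs.

```lua
-- The value carries the type; the variable can hold any datum.
local slot = 42
local seen = { type(slot) }
slot = "forty-two"
seen[#seen + 1] = type(slot)
slot = function() return 42 end
seen[#seen + 1] = type(slot)
slot = { 4, 2 }
seen[#seen + 1] = type(slot)

local summary = table.concat(seen, ",")

-- A mismatch is detected at runtime: adding 1 to a table raises an error,
-- which pcall catches and reports as a false status.
local ok = pcall(function() return slot + 1 end)
```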
Lua Language
• Small, multi-paradigm “scripting” language
• Supports a functional programming style
• First-class functions & closures
• Small conceptual and resource footprint
• Well under 300KB memory footprint
• Good embeddability and portability
• Good interface to C
Lua Language Quirks
• Only composite data type is table
• Associative array
• Supports both integer and hashed keys
• Limited native data types
• “Number” type: double or dual double/integer
• Dynamic type inference/specialization
• No standardized class-based object system
• Many object systems, may not intermix
• Tables & closures reduce centrality of classes
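The table points above can be illustrated with a short sketch in standard Lua. The field names and addresses are made up for the example. The same data type serves as a dense array, a record with string keys, a sparse map keyed by address, and a map keyed by a function value.

```lua
-- Dense integer keys 1..3 behave like an array.
local regs = { 10, 20, 30 }

-- String keys behave like record fields.
local cpu = { name = "lm32", regs = regs }
cpu["pc"] = 0

-- Sparse integer keys: an instruction memory indexed by address.
local imem = {}
imem[0x1000] = "add"

-- Almost any value can be a key, including a function.
local key_fn = print
cpu[key_fn] = "functions can be keys too"
```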
LuaJIT Compiler
• Tracing Just-in-Time compilers
• Excellent for dynamic languages
• Data types can be observed at runtime
• Optimization and specialization of types
• Source compiled to register-based bytecode
• Fast assembly-coded interpreter for tracing
• “Hot code” chunks optimized to native binary
• Excellent C Foreign-Function Interface (FFI)
C One-Instr. Fetch-Execute Loop
for (int i = 0; i < nIters; i++) {
#if TRACE
    printf("%08d| PC = 0x%08X\n", i, node->curPC);
    if (node->curPC == pausePC) pauseCount++;
#endif
    uint32_t deltaPC = sfp((void *)di, (void *)node);
    node->curPC = node->nextPC;
    di = virtual_icache_entry(node->nextPC, node);
    sfp = di->fPtr;
    node->nextPC += deltaPC;
}
Lua One-BB Fetch-Execute Loop
for i = 1, n_BBs do
  -- 140 MOps vs 572 MOps with single-entry cache
  if cur_pc ~= next_pc then -- avoid table lookup if no change
    cur_pc = next_pc
    cur_instr = loc_imem[cur_pc]
  end
  next_pc = cur_instr(cpu)
end
Closures as Object Substitutes 1
-- Generated generic code for all “xor” instructions
-- Anonymous function is a “factory” for XOR/XORI instr. instances
-- When factory function is called, function arguments are persistently bound to values
return function(seq_instr,seq_pc,br_pc,cpu,op_fcns,store_RX,valA,valB)
  local result = op_fcns.lm32_alu_fcn_xor(valA,valB)
  return store_RX(seq_instr,seq_pc,br_pc,cpu,result)
end
Closures as Object Substitutes 2
-- Generated generic code for all “bge” branch instructions
-- Anonymous function is a “factory” for BGE/BGEU instr. instances
-- When factory function is called, function arguments are persistently bound to values
return function(seq_instr,seq_pc,br_pc,cpu,op_fcns,select_pc,valA,valB)
  local result = op_fcns.lm32_cond_fcn_bge(valA,valB)
  local next_pc = select_pc(seq_pc,br_pc,result)
  return next_pc -- Dynamic return types: (integer | function ref)
end
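The idea of arguments being persistently bound when a factory is called can be shown with a simplified sketch. The factory `make_branch`, the helper `cond_bge`, and the register layout are hypothetical stand-ins, not the generated code from the slides. Each call to the factory yields an independent instruction instance that remembers its own targets and register indices, which is the role an object with fields would play in a class-based design.

```lua
-- Hypothetical condition helper, standing in for a generated op function.
local function cond_bge(a, b) return a >= b end

-- Factory: every argument is captured by the returned closure.
local function make_branch(cond_fcn, seq_pc, br_pc, ra, rb)
  return function(cpu)
    if cond_fcn(cpu.regs[ra], cpu.regs[rb]) then
      return br_pc      -- branch taken
    end
    return seq_pc       -- fall through to the sequential PC
  end
end

local cpu = { regs = { [0] = 0, 5, 9 } }   -- r0 = 0, r1 = 5, r2 = 9

-- Two instances of the same instruction class with different bindings.
local bge_a = make_branch(cond_bge, 0x104, 0x200, 1, 2)  -- 5 >= 9 is false
local bge_b = make_branch(cond_bge, 0x108, 0x300, 2, 1)  -- 9 >= 5 is true

local pc_a = bge_a(cpu)
local pc_b = bge_b(cpu)
```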
Tail-Call Optimization
• Where does “TCOSim” name come from?
• TC: if last statement in a function is a fcn call
• A tail call is a GOTO for functional languages
• Tail-call Optimization (TCO)
• Recursion without a stack
• Multiple functions can be optimized as one
• LuaJIT tracing compiler optimizes calls away
• Functions are organizational, not executable
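A minimal sketch of "recursion without a stack" in standard Lua, which guarantees proper tail calls. The function names are arbitrary. Because each `return f(...)` is a tail call, the recursion depth below does not grow the call stack, and two mutually recursive functions behave like a pair of GOTO targets.

```lua
-- Self-recursion in tail position: runs a million levels deep
-- without growing the stack.
local function count_down(n, acc)
  if n == 0 then return acc end
  return count_down(n - 1, acc + 1)   -- tail call
end

-- Mutual tail calls: control bounces between two functions.
local ping, pong
function ping(n, hops)
  if n == 0 then return hops end
  return pong(n - 1, hops + 1)        -- tail call
end
function pong(n, hops)
  if n == 0 then return hops end
  return ping(n - 1, hops + 1)        -- tail call
end

local total = count_down(1000000, 0)
local hops = ping(100001, 0)
```

Note that `return 1 + f(n - 1)` would not qualify, since the addition still has to run after the call returns.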
Pipelined Instruction Definition
Example: 3-operand RISC ALU instruction class
local function lm32_build_RR_format_instr(format,instr)
  …
  local load_valB = lm32_load_RZ[instr[4]]
  …
  return load_valB(seq_instr, seq_pc,cpu,lm32_ops,store_val,alu_op,load_valA)
  …
  return load_valA (seq_instr, …
  …
  return alu_op(seq_instr, …
  …
  return seq_instr(cpu)
Example: Helper ALU operation shared by XOR/XORI
local function lm32_alu_fcn_xor(x,y)
  local result = bit_bxor(x, y) -- XOR takes both operands as separate arguments
  return result
end
Basic Block Instruction Fusion
• Non-branching instructions end with tail call
  return seq_instr(cpu)
• Branch instructions return new PC
lm32_operation_builder_string["BI"] = [[
return function(seq_instr, seq_pc,br_pc,cpu,op_fcns,select_pc,valA,valB)
  local result = op_fcns.lm32_cond_fcn_%s(valA,valB)
  local next_pc = select_pc(seq_pc,br_pc,result)
  return next_pc
end
]]
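The two rules on this slide, plus the `%s` template technique, can be combined in a reduced sketch. This is a simplification under stated assumptions: the templates, the `build` helper, and the register numbering are hypothetical, and the `%s` slot is filled with a Lua operator rather than the name of an op-function as in the slide. Source text is specialized with `string.format`, compiled with `loadstring` (Lua 5.1 and LuaJIT) or `load` (later versions), and the resulting instructions are linked back to front into one basic block.

```lua
local load_chunk = loadstring or load   -- Lua 5.1/LuaJIT vs. 5.2+

-- Non-branching instruction: ends with a tail call to its successor.
local alu_template = [[
return function(seq_instr, rd, ra, rb)
  return function(cpu)
    local r = cpu.regs
    r[rd] = r[ra] %s r[rb]
    return seq_instr(cpu)
  end
end
]]

-- Branch instruction: ends the block by returning the next PC.
local branch_template = [[
return function(seq_pc, br_pc, ra, rb)
  return function(cpu)
    local r = cpu.regs
    if r[ra] %s r[rb] then return br_pc end
    return seq_pc
  end
end
]]

-- Specialize a template for one operator and compile it to a factory.
local function build(template, operator)
  local chunk = assert(load_chunk(string.format(template, operator)))
  return chunk()
end

local make_add = build(alu_template, "+")
local make_sub = build(alu_template, "-")
local make_bge = build(branch_template, ">=")

-- Link a three-instruction basic block, last instruction first.
local br   = make_bge(0x10C, 0x200, 3, 0)   -- if r3 >= r0 then goto 0x200
local sub  = make_sub(br, 3, 1, 2)          -- r3 = r1 - r2
local head = make_add(sub, 1, 1, 2)         -- r1 = r1 + r2

local cpu = { regs = { [0] = 0, 4, 3, 0 } } -- r1 = 4, r2 = 3
local next_pc = head(cpu)                   -- one call runs the whole block
```

Calling the head of the chain executes every instruction in the block and hands back only the branch result, so the outer loop performs one lookup per basic block instead of one per instruction.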
Basic C vs. TCOSim ISS Speed
• C-based Single-Instruction Sim: Tens of MIPS
Simulating Heterogeneous Devices
• Simulator calls a chain of instructions
• Basic block, ending with branch
• Device/CPU is a parameter
• CPU or other programmed HW
• With HDL design, most HW is a program
• Devices can be intermixed in BB chain
• Multiple CPU HW threads (or hetero CPUs)
• DMA devices
• Event polling devices, e.g., interrupt control
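One way to picture "device is a parameter" is the following sketch. It is an assumption-laden toy, not the talk's implementation: the block names, the round-robin scheduler, and the shared `memory` table are all invented. A CPU-like device and a DMA-like device each run blocks that take the device as their argument and return the key of the next block, so the same loop can interleave both.

```lua
local memory = { 11, 22, 33, 0, 0, 0 }

-- Two unlike devices; loose typing lets each carry its own fields.
local cpu = { kind = "cpu", pc = "loop", acc = 0 }
local dma = { kind = "dma", pc = "copy", src = 1, dst = 4, remaining = 3 }

-- Each block receives the device and returns the next block's key.
local blocks = {
  loop = function(dev)
    dev.acc = dev.acc + 1
    return "loop"
  end,
  copy = function(dev)
    memory[dev.dst] = memory[dev.src]
    dev.src, dev.dst = dev.src + 1, dev.dst + 1
    dev.remaining = dev.remaining - 1
    if dev.remaining == 0 then return "idle" end
    return "copy"
  end,
  idle = function(dev)
    return "idle"
  end,
}

-- Round-robin: one block per device per step.
local devices = { cpu, dma }
for step = 1, 5 do
  for _, dev in ipairs(devices) do
    dev.pc = blocks[dev.pc](dev)
  end
end
```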
Future Directions
• Integrate gdb debug server
• Compile existing C server code for target
• Run gdb server as “virtual” thread on sim
• Should be fairly portable to real HW
• Support external events, e.g., interrupts
• Interleave polling instruction blocks
• Asynchronous callback to inner loop
• Multi-threaded simulation host support
• Self-host LuaJIT compiler within sim target
Conclusion
• Lua: small dynamic scripting language
• Easy dynamic-code-creating programs
• Fast edit-run-expletive cycle
• Can be fast with a tracing JIT compiler
• Surprisingly good fit for functional simulation
• Loose typing good for heterogeneous devices
• Conventional coding, not exotic ISS DSL
• Small enough and fast enough to go meta
Meta-Meta http://xkcd.com/917/
end --Thank You! Craig Steele, steele@exogi.com
Caching in Lua
• Lua has one primitive composite data type
• Table is associative array
• Almost any data type can be used as index
• Integers are optimized like C arrays
• Other index types are used for hashtable
• Tables are both arrays and hashtables
• Meta-tables allow custom index operations
• Using one table to cache another is very easy
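A minimal sketch of one table caching another through a metatable, in standard Lua. The `imem` contents and the `decode` function are placeholders chosen for the example. The `__index` metamethod fires only on a miss, fills the cache with `rawset`, and every later lookup of the same address is a plain table read.

```lua
-- Backing table: address -> raw word (placeholder values).
local imem = { [0x100] = 0x1234, [0x104] = 0x5678 }

local decode_calls = 0
local function decode(word)          -- stand-in for an expensive step
  decode_calls = decode_calls + 1
  return { opcode = word % 64, raw = word }
end

-- Cache table: __index runs only when the key is absent.
local decoded = setmetatable({}, {
  __index = function(cache, addr)
    local word = imem[addr]
    if word == nil then return nil end
    local entry = decode(word)
    rawset(cache, addr, entry)       -- later reads hit the table directly
    return entry
  end,
})

local first = decoded[0x100]         -- miss: decodes and stores
local second = decoded[0x100]        -- hit: no metamethod call
```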
A Use Case of Caching in Lua
• Instruction-memory decoding can be either
• Static: binary available at start, unchanging
• Dynamic: binary is JITed or self-modifying
• Step 0 – Get binary encoded instruction
• Step 1 – Decode to table, e.g., {"add",1,0,21}
• Step 2 – Build instruction as stack of Lua fcns
• Step 3 – Link instructions into basic blocks
• Step 4 – Locate BBs at absolute addresses
• Every step can be cached and performed “on demand”
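Steps 0 through 2 can be sketched as two chained on-demand caches. Everything specific here is an assumption made for the example: the four-byte toy encoding (opcode, rd, ra, imm), the `lazy_table` helper, and the `store_instr` invalidation hook for the self-modifying case. Linking into basic blocks and placing them at addresses (steps 3 and 4) are left out.

```lua
-- Generic on-demand cache: compute a value on first access, then keep it.
local function lazy_table(compute)
  return setmetatable({}, { __index = function(t, key)
    local value = compute(key)
    rawset(t, key, value)
    return value
  end })
end

-- Step 0: binary image, toy encoding opcode(8) rd(8) ra(8) imm(8).
local binary = { [0x100] = 0x00010015 }     -- addi r1, r0, 21
local OPCODES = { [0] = "addi" }

local function field(word, shift)
  return math.floor(word / 2^shift) % 256
end

-- Step 1: decode to a table such as {"addi", 1, 0, 21}.
local decode_calls = 0
local decoded = lazy_table(function(addr)
  decode_calls = decode_calls + 1
  local w = binary[addr]
  return { OPCODES[field(w, 24)], field(w, 16), field(w, 8), field(w, 0) }
end)

-- Step 2: build an executable Lua function from the decoded table.
local builders = {
  addi = function(rd, ra, imm)
    return function(cpu) cpu.regs[rd] = cpu.regs[ra] + imm end
  end,
}
local built = lazy_table(function(addr)
  local d = decoded[addr]
  return builders[d[1]](d[2], d[3], d[4])
end)

-- Dynamic case: a store into instruction memory drops the stale entries.
local function store_instr(addr, word)
  binary[addr] = word
  rawset(decoded, addr, nil)
  rawset(built, addr, nil)
end

local cpu = { regs = { [0] = 0, 0 } }
built[0x100](cpu)                           -- decodes and builds on demand
local first_result = cpu.regs[1]
built[0x100](cpu)                           -- served from both caches
store_instr(0x100, 0x00010007)              -- now addi r1, r0, 7
built[0x100](cpu)
local second_result = cpu.regs[1]
```

For a static binary the invalidation hook is never called, so each address is decoded and built at most once.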