Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010
Introduction • Lutiac is an experimental soft processor • Designed for very small programs • roughly 200 instructions • roughly 200 words of data • Take a drastic step to reduce the size of the processor • Measure its area and speed • Compare to NIOS II
Typical Microprocessor • [Figure: block diagram of a typical microprocessor. Labels: PC with +1 incrementer, Instruction Memory, Decoder (to control points), A registers, B registers, ALU, From Outside World, To Outside World]
Typical Microprocessor • Typical Microprocessor consists of: • data path (registers, ALU, ...) • controller (PC, instruction memory, decoder) • Data path has control inputs • register file read addresses • register file write address • register file write enable • instruction is add/subtract/and/or/copy/...
Control Inputs • Control inputs are driven from the decoder • Decoder driven from current instruction • Current instruction determined by program counter • If instruction memory never changes: • current instruction is a constant function of the program counter • so control inputs depend entirely on the value of the program counter
Control Inputs Are Function of PC • If we have small programs (≤ 64 total instructions) • program counter only needs 6 bits • Each control input is a function of 6 PC bits • could be replaced by a 6-lut • Entire decoder is a set of 6-luts • Instruction memory isn’t needed at all, and can be removed
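The argument on this slide can be checked with a small software model. The sketch below is only an illustration: the three control signals and the toy instruction set (`add`, `sub`, `and`, `nop`) are assumptions made for the example, not Lutiac's actual control points. It folds a fixed instruction memory and its decoder into one 64-entry truth table per control bit, which is the function a single 6-LUT can hold.

```python
# Hypothetical control signals for a toy data path (not Lutiac's real ones).
CONTROL_NAMES = ["reg_write_enable", "alu_is_sub", "alu_is_and"]


def decode(instruction):
    """Conventional decoder: instruction mnemonic -> control bits."""
    return {
        "reg_write_enable": int(instruction in ("add", "sub", "and")),
        "alu_is_sub": int(instruction == "sub"),
        "alu_is_and": int(instruction == "and"),
    }


def pc_bits(program_length):
    """Number of PC bits needed to address every instruction."""
    return max(1, (program_length - 1).bit_length())


def build_luts(program):
    """Fold instruction memory + decoder into one truth table per control bit.

    Each table is an integer mask: bit `pc` of the mask is the value of
    that control input when the program counter equals `pc`.
    """
    luts = {name: 0 for name in CONTROL_NAMES}
    for pc, instruction in enumerate(program):
        for name, value in decode(instruction).items():
            luts[name] |= value << pc
    return luts


def lut_read(mask, pc):
    """Look up one control bit directly from the program counter."""
    return (mask >> pc) & 1


# A fixed 64-instruction program (arbitrary contents for the example).
program = ["add", "sub", "nop", "and"] * 16
luts = build_luts(program)
```

Once `luts` has been built, `lut_read(luts[name], pc)` gives the same value as `decode(program[pc])[name]` for every `pc`, so the model never needs to consult `program` again. That is the software analogue of removing the instruction memory.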
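Drastic Step - Delete Instruction Memory • [Figure: the same block diagram with the Instruction Memory crossed out (X). Remaining labels: PC with +1 incrementer, Decoder (to control points), A registers, B registers, ALU, From Outside World, To Outside World]
Lutiac • [Figure: Lutiac block diagram with no instruction memory. Labels: PC with +1 incrementer, Decoder (to control points), A registers, B registers, ALU, From Outside World, To Outside World]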
Another Way to Think About It • At the point in a normal soft processor where the instruction is read from the instruction memory:
    instruction = instruction_memory[pc];
    if(instruction is this) do this;
    if(instruction is that) do that;
    ...
• Replace by a case statement based on the pc:
    case(pc)
      0: do this;
      1: do that;
      2: do the other thing;
      ...
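To make the equivalence concrete, here is a minimal Python sketch of both views for one fixed three-instruction program. The two operations (`inc_a`, `add_a_to_b`), the register names, and the wrap-around of the program counter are all assumptions chosen for the example. The slides do not say how Lutiac handles the end of a program.

```python
PROGRAM = ["inc_a", "add_a_to_b", "inc_a"]


def run_with_instruction_memory(instruction_memory, steps):
    """Normal processor view: fetch the instruction, then decode it."""
    regs = {"a": 0, "b": 0}
    pc = 0
    for _ in range(steps):
        op = instruction_memory[pc]  # instruction fetch
        if op == "inc_a":
            regs["a"] += 1
        elif op == "add_a_to_b":
            regs["b"] += regs["a"]
        pc = (pc + 1) % len(instruction_memory)  # assumed wrap-around
    return regs


def run_hard_wired(steps):
    """Lutiac-style view: behaviour is selected directly by the pc."""
    regs = {"a": 0, "b": 0}
    pc = 0
    for _ in range(steps):
        if pc == 0:
            regs["a"] += 1
        elif pc == 1:
            regs["b"] += regs["a"]
        elif pc == 2:
            regs["a"] += 1
        pc = (pc + 1) % 3  # assumed wrap-around
    return regs
```

For any number of steps, `run_with_instruction_memory(PROGRAM, n)` and `run_hard_wired(n)` return the same register contents. The second function contains no program storage at all: the program has become part of the control logic.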
Lutiac Implementation • Built a very simple prototype 16-bit processor that uses hard-wired programs instead of an instruction memory • 3 stage pipeline • decode: sets read addresses on register file • execute: computes results, sets up register file writes • write back: register file write • One cycle per instruction
Lutiac Implementation • No data memory, just registers • no fixed instruction format, so no hard limit on number of registers • One input port from outside world, one output port • Simple assembler converts my_program.s file into an equivalent Verilog processor description
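The slides do not show the assembler's input syntax or the Verilog it produces, so the following is only a sketch of the general idea. It assumes a hypothetical instruction form `(dest, op, src_a, src_b)` and hypothetical signal names (`rd_a`, `rd_b`, `alu_op`, `wr`, `OP_*`), and emits one `case` arm per program counter value.

```python
def emit_case(program):
    """Return a Verilog-style case statement with one arm per instruction.

    `program` is a list of (dest, op, src_a, src_b) tuples, where the
    register fields are integers and `op` is a mnemonic such as "add".
    All signal and constant names in the output are illustrative only.
    """
    lines = ["always @(*) begin", "  case (pc)"]
    for pc, (dest, op, src_a, src_b) in enumerate(program):
        lines.append(
            f"    {pc}: begin rd_a = {src_a}; rd_b = {src_b}; "
            f"alu_op = OP_{op.upper()}; wr = {dest}; end"
        )
    # Unused pc values fall through to a do-nothing arm.
    lines.append(
        "    default: begin rd_a = 0; rd_b = 0; alu_op = OP_NOP; wr = 0; end"
    )
    lines.append("  endcase")
    lines.append("end")
    return "\n".join(lines)


example = emit_case([(2, "add", 0, 1), (3, "sub", 2, 0)])
```

Printing `example` shows a `case (pc)` block with arms `0:` and `1:` plus a `default:` arm. Because register numbers appear as plain constants in each arm rather than as fields of a fixed-width instruction word, nothing in this scheme caps the number of registers.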
Experiments • Measure size and speed of Lutiac, varying: • number of different kinds of instructions in the program • size of the program • number of registers used • Used Quartus 8.0 (2 years ago now) • Stratix IV chips of various sizes, fastest speed grade • Each Stratix IV LAB contains 20 FFs + roughly 10 6-LUTs • Some LABs can be re-configured as 640-bit RAMs • known as “MLABs” • Will compare to NIOS II at the end, but for now, remember that a medium-sized NIOS II uses 58 LABs and 11 M9K RAMs
Lutiac Size vs. Instruction Mix • Each program contains 64 random instructions, chosen from the allowed instruction types
Effect of Program Size • Size grows linearly as program size increases beyond 64 instructions, roughly 1 LAB for every 20 additional instructions
Effect of Number of Registers • Very large Lutiac (512 random instructions) grows by the number of MLABs needed to hold additional registers • Would save area if we used M9Ks instead of MLABs once we needed more than 96 16-bit registers
Scalability of Multiple Lutiac Cores • Chained N identical 64-instruction Lutiac cores together • LABs grow by 14.5 per core • Fmax drops as Quartus placement worsens • Ran out of DSP blocks above 256 cores
Comparison to NIOS II • Very inexact • NIOS II is 32 bits, Lutiac is 16 bits • NIOS II also has memory interfaces, caches, traps, ... • Configure NIOS II systems with 4K bytes of RAM • allows up to 1K words of instructions or data • Lutiac has no RAM, all instructions and data in MLABs • Lutiac and NIOS II both use four 18x18 multipliers (Multiplier/Accumulate mode)
Comparison to NIOS II • Back of the envelope guess (± factor of 2x) • Un-optimized 32-bit Lutiac (25 LABs, 177 MHz) is nearly twice the size of a 16-bit Lutiac and runs at 0.75 times its speed • 32-bit Lutiac vs. NIOS IIs speed ratio = 177 / 235 ≈ 0.75 • 32-bit Lutiac vs. NIOS IIs area ratio • (25 LABs + DSP) / (58 LABs + 11 M9K RAMs + DSP) ≈ 0.3 • 32-bit Lutiac vs. NIOS IIs throughput/area • (177/235) / 0.3 ≈ 2.5x • 32-bit Lutiac vs. NIOS IIe throughput/area • NIOS IIe is the smallest NIOS II, but isn’t pipelined, so takes 5 cycles/instruction • (177/368 × 5/1) / ((25 LABs + DSP) / (37 LABs + 6 M9K RAMs)) ≈ 4.5x
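The arithmetic on this slide can be reproduced in a few lines. The clock speeds, cycle counts, the 0.3 area ratio, and the 4.5x result are taken from the slide. The slide does not say how M9K RAMs and DSP blocks are weighted against LABs, so the sketch takes the first area ratio as a given input and only back-computes the area ratio implied by the second result.

```python
lutiac32_mhz = 177.0
nios2s_mhz = 235.0
nios2e_mhz = 368.0
lutiac_cycles_per_instr = 1
nios2e_cycles_per_instr = 5

# Area ratio of 32-bit Lutiac to NIOS IIs, taken as given from the slide.
area_ratio_vs_s = 0.3

# Both processors are treated as one cycle per instruction here,
# so the throughput ratio is just the clock ratio.
speed_ratio_vs_s = lutiac32_mhz / nios2s_mhz
throughput_per_area_vs_s = speed_ratio_vs_s / area_ratio_vs_s

# NIOS IIe needs 5 cycles per instruction, Lutiac needs 1.
throughput_ratio_vs_e = (lutiac32_mhz / nios2e_mhz) * (
    nios2e_cycles_per_instr / lutiac_cycles_per_instr
)

# Area ratio implied by the slide's 4.5x throughput/area figure.
implied_area_ratio_vs_e = throughput_ratio_vs_e / 4.5

# For comparison: the ratio if only LABs were counted.
lab_only_ratio_vs_e = 25 / 37
```

With these inputs, `throughput_per_area_vs_s` comes out at about 2.5 and `throughput_ratio_vs_e` at about 2.4. The implied area ratio against NIOS IIe is about 0.53, lower than the LAB-only ratio of about 0.68, which suggests the slide's estimate charges NIOS IIe for its M9K RAMs as well.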
Lutiac Disadvantages • Limited to very small programs (200 instructions or so) • Must re-synthesize circuit every time program changes • instruction memory replaced by LUTs • would need good simulation tools • or a debug version of the processor that did have an instruction memory
Lutiac Advantages • Circuit is smaller, less complex than standard soft processor • One less stage in the pipeline • no instruction memory read required • Program contents are exposed to logic synthesis • data path components that aren’t used will be removed by synthesis • circuit may be smaller and faster
Lutiac Advantages • Flexible and powerful • wide range of useful instructions can be available • if not used by program, they will be synthesized away • easy to add specialized instructions if needed • Not limited by a fixed instruction word width or encoding • can use as many registers as the program wants
Lutiac Advantages • Processor self configures based on program • no “mega-wizard” needed • if multiplier/adder/etc. isn’t used, synthesis will leave it out • Data path can adapt to the program • Examples: • if program ever references a register immediately after writing to it, create a bypass register; else leave bypass register out of circuit • if multiplier and adder were used in parallel, create a separate copy of the register file for the multiplier; else have it share the adder’s register file
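Because the program is known when the processor is generated, decisions like these reduce to simple scans over the instruction list. The sketch below is an illustration under stated assumptions: instructions are hypothetical `(dest, op, src_a, src_b)` tuples, the program is straight-line (no branches or wrap-around are considered), and only the adjacent-instruction case named on the slide is checked.

```python
def needs_bypass(program):
    """True if any instruction reads the register written by the one before it."""
    for prev, curr in zip(program, program[1:]):
        prev_dest = prev[0]
        curr_sources = curr[2:]
        if prev_dest in curr_sources:
            return True
    return False


def units_used(program):
    """Set of operation kinds the program actually uses."""
    return {op for _, op, *_ in program}


back_to_back = [
    ("r2", "add", "r0", "r1"),
    ("r3", "sub", "r2", "r0"),  # reads r2 right after it is written
]
spaced_out = [
    ("r2", "add", "r0", "r1"),
    ("r4", "add", "r0", "r0"),
    ("r3", "sub", "r2", "r0"),  # r2 was written two instructions earlier
]
```

Here `needs_bypass(back_to_back)` is true and `needs_bypass(spaced_out)` is false, so a generator could emit the bypass register only for the first program. Likewise, `"mul"` is absent from `units_used(back_to_back)`, so no multiplier would need to be instantiated. In the flow the slides describe, logic synthesis makes the second decision automatically by removing unused hardware.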
Conclusions • For small programs, it is possible to build 16-bit soft processors using only 12-25 LABs (plus multiplier) • smaller and faster than smallest 32-bit NIOS II (37 LABs, 6 M9K RAMs) • with instructions/second on the same order as the mid-size NIOS II (58 LABs, 11 M9K RAMs) • size advantage over NIOS II disappears as program size approaches 1000 instructions