CS252 Project Presentation: Optimizing the Leon Soft Core
Project Outline • Goal: Reduce the size of Leon on FPGAs • Our motivation for using Leon: • RAMP research: emulation of multiprocessors • Analysis: • LUT breakdown • Optimizations: • Circuit Level • Architectural Level
Leon Overview • 32-bit SPARC V8 compliant processor • 7-stage pipeline, in-order • Separate L1 instruction and data caches • Configurable cache size, associativity, replacement policy • Optional Memory Management Unit • AMBA bus interface to memory and peripherals • Supports Symmetric Multiprocessing • Open-source (Gaisler Research)
Area analysis • Configuration • MMU: Combined I/D-TLB, 2-entry only • Integer MUL/DIV enabled • Cache: Direct-mapped I/D cache • Variables • DSU - Debug support unit • Target clock • 20 MHz - easy to achieve • 200 MHz - over-constrained
Why it’s BIG • Debugging Support • More MUXes • One additional pipeline stage • Useful for RAMP emulation / bootstrapping • Integer unit (IU) is over 50% • Barrel shifter • Pipeline control (forwarding)
Circuit Level Optimizations • Store LRU bits in Block RAMs instead of Flip Flops • Also saves LUTs • One-hot encoding for signals • Synthesis tool does a good job of 1-hot encoding for many signals (e.g., state encoding) • Applied this to the cache output • Instead of data(set), we can use data(0) or data(1) or data(2) or data(3) • Useful only for multiway caches • LUT savings: ~ 100 LUTs
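The one-hot cache output idea can be illustrated with a small behavioral model. This is only a sketch in Python, not the Leon VHDL: it assumes that each way's data is gated by its own hit bit and the gated words are ORed together, which is one common reading of "data(0) or data(1) or data(2) or data(3)". The function names and the sample values are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def select_indexed(data, way):
    """Binary-indexed select, modeling data(set): an N-to-1 mux on an encoded index."""
    return data[way]

def select_one_hot(data, hit):
    """One-hot select: AND each way with its hit bit, then OR all ways together.

    Assumes at most one entry of `hit` is 1 (a hit in a single way).
    """
    result = 0
    for word, h in zip(data, hit):
        gate = MASK32 if h else 0   # replicate the hit bit across the word
        result |= word & gate
    return result

# Hypothetical 4-way example: way 2 hits.
ways = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
hit = [0, 0, 1, 0]
assert select_one_hot(ways, hit) == select_indexed(ways, 2)
```

With a single way there is nothing to select, which is consistent with the slide's note that the change only helps multiway caches.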
Circuit Level Optimizations • Use fast-carry chain logic • Provided 30% savings in LUT usage for TLB entries • Multipliers for barrel shifter • Left shift by b is the same as multiplication by 2^b • Savings of ~ 100 LUTs
LUTs for Integer Mul / Div • 2195 of 18429* LUTs for the entire two-core system (12%) • 11.5% of Leon3 core • *(Xilinx ISE)
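The arithmetic behind a multiplier-based shifter can be checked with a short model. The sketch below assumes 32-bit operands and a full 64-bit product: a left shift by b is the low word of x * 2^b, and a logical right shift by b is the high word of x * 2^(32-b). The function names are hypothetical, and how the slides' design actually wires the multiplier is not stated in the source.

```python
MASK32 = 0xFFFFFFFF

def shl_via_mul(x, b):
    """Left shift by b (0..31): low 32 bits of x * 2**b."""
    return (x * (1 << b)) & MASK32

def shr_via_mul(x, b):
    """Logical right shift by b (0..31): high 32 bits of x * 2**(32 - b).

    Note: b == 0 needs the constant 2**32, which does not fit in 32 bits,
    so real hardware would have to special-case it.
    """
    return (x * (1 << (32 - b))) >> 32

x = 0xDEADBEEF
for b in range(32):
    assert shl_via_mul(x, b) == (x << b) & MASK32
    assert shr_via_mul(x, b) == x >> b
```

On FPGAs with hard multiplier blocks, this trades LUT-based mux stages for a multiplier that may already be present.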
Didn’t your mother teach you to share? • Savings of ~350 LUTs for prototype • Only multiplier shared • Only two cores • 10% could become 5%, 2.5%, 1%, ... as more cores share • Even more for MAC
Operand MUXes • 32-bit, 7-to-1 MUX • 32-bit, 5-to-1 MUX
Operand MUXes • 313 LUTs + 64 MUXes each
Integer Pipeline Changes • Remove all forwarding • Single thread: Just stall • Fine Grain Multithreading could boost performance • LUTs saved: 27-37% • Maximum frequency improvement: 20%
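The shrinking percentages follow from simple amortization. As a rough sketch, assume a shared unit costs about 10% of one core and that the cost of arbitration logic is ignored (an assumption; the slides do not quantify it). The per-core overhead is then the unit's fraction divided by the number of cores sharing it. The function name is hypothetical.

```python
def per_core_overhead(unit_fraction, cores_sharing):
    """Per-core cost of a unit shared equally among `cores_sharing` cores.

    Ignores arbitration logic and contention (assumed negligible here).
    """
    return unit_fraction / cores_sharing

for n in (1, 2, 4, 10):
    print(f"{n:2d} cores sharing: {per_core_overhead(0.10, n):.1%} per core")
```

Under this model, 2, 4, and 10 cores sharing one unit give the 5%, 2.5%, and 1% figures on the slide.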
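Why fine-grain multithreading can recover the performance lost by removing forwarding can be seen in a toy model. The sketch below assumes an in-order pipeline that issues one instruction per cycle, round-robin thread interleaving, and a hypothetical `result_latency` (cycles from issue until a result can be read back from the register file). None of these values come from the slides.

```python
def stall_cycles(result_latency, dep_distance, threads=1):
    """Stall cycles a dependent instruction sees when there is no forwarding.

    result_latency: cycles after issue until the producer's result is readable
                    from the register file (hypothetical parameter).
    dep_distance:   how many instructions later, in the same thread, the consumer is.
    threads:        threads interleaved round-robin, one issue slot per cycle.
    """
    # With round-robin interleaving, consecutive instructions of one thread
    # are `threads` cycles apart, so the consumer naturally issues this late:
    natural_gap = dep_distance * threads
    return max(0, result_latency - natural_gap)

# Back-to-back dependent instructions, assumed latency of 4 cycles.
single = stall_cycles(result_latency=4, dep_distance=1, threads=1)
interleaved = stall_cycles(result_latency=4, dep_distance=1, threads=4)
assert single == 3 and interleaved == 0
```

In this model a single thread pays the full stall, while enough interleaved threads fill those cycles with independent work.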
Conclusions • CAD tools already perform many optimizations • Remove unused logic • Infer technology-dependent logic from HDL source, e.g., fast carry chain logic • Optimize logic globally
Conclusions • Optimization is possible • Higher levels yield (much) greater savings • Circuit Level: 200-300 LUTs • Architectural Level: 1000+ LUTs • Sharing: ~700 per core • Total: 35-40% savings