Shobana Padmanabhan Phillip Jones, David Schuehler, Praveen Krishnamurthy, Scott Friedman, Huakai Zhang, Ron Cytron, John Lockwood, Roger Chamberlain, Jason Fritts Washington University in St. Louis http://liquid.arl.wustl.edu Funded by NSF under grant 03-13203 Sep 22.
Liquid Architecture: Extracting & Improving Micro-architecture Performance on Reconfigurable Architectures
Application performance [Diagram labels: architecture, compiler, algorithm]
Customization cost/performance tradeoff • Generic processor: cheap but application-agnostic; compilers exist; compiler optimization is the key • Reconfigurable logic: the subject of our study; architecture and compiler research are the key • Customized logic: ideal for an application but expensive; logic/architecture research is the key [Spectrum labels: Generic, FPGA, Custom]
Liquid architecture combines the best of all options • Standard architecture: standardized ISA, existing compilers; not optimized for any specific application; fixed instructions and hardware; ~$200 to $500 • Custom architecture on an integrated circuit: one-of-a-kind, nonstandard; optimized for a specific application; fixed instructions and hardware; ~$500,000 to $1,000,000+ • Liquid architecture on an FPGA: ISA + extras, can use modified open-source tools; hardware can be optimized for a specific application; reconfigurable ISA (~100 µs to 100 ms; person-hours, not $millions); ~$200 to $2,000
Hardware platform overview [Diagram: a development workstation reaches the FPX FPGA over the Internet. Labels on the FPGA: instrumentation and variations; interface support modules (VHDL) for memory, network interface chip, …; standard ISA: SPARC V8] FPX research was supported by NSF: ANI-0096052 and Xilinx Corp.
Hardware platform details [Diagram: on the FPX FPGA, the LEON core with its cache controller, I-cache, and D-cache connects over the AHB address/data bus to the memory controller for SRAM/SDRAM] • LEON: SPARC V8 compatible • Open soft core
Application execution [Diagram: the program (BLASTN, DNA sequence comparison) is compiled with gcc on the workstation; the binary passes through the control S/W interface and the command controller on the FPGA into SRAM/SDRAM, where the LEON core on the FPX runs it]
Application runtime [Diagram: same platform; results and timing return from the FPGA to the workstation] Slow! Where is time spent?
Software approach to profiling “time” Start with the program → introduce timers → run the instrumented program → execution timings • Timers must account for their own overhead • The instrumented program will run slower • Instrumentation skews the runtime because it affects system behavior such as the cache, …
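As an illustration of the software approach described on this slide, here is a minimal Python sketch of timer-based instrumentation that tries to subtract the cost of reading the clock. The class name `TimerProfiler`, its methods, and the calibration strategy are all assumptions made for this example; they are not taken from the slides or from any particular profiling tool.

```python
import time
from typing import Callable, Dict


class TimerProfiler:
    """Accumulates per-function time, minus an estimate of the timer's own cost."""

    def __init__(self, clock: Callable[[], int]) -> None:
        self.clock = clock
        self.totals: Dict[str, int] = {}
        self.overhead = self._calibrate()

    def _calibrate(self, samples: int = 100) -> int:
        # Take back-to-back clock readings; the smallest gap approximates
        # the cost that each start/stop pair adds to a measurement.
        gaps = []
        for _ in range(samples):
            t0 = self.clock()
            t1 = self.clock()
            gaps.append(t1 - t0)
        return min(gaps)

    def timed(self, name: str, func: Callable[[], object]) -> object:
        # Wrap one call with start/stop readings and accumulate the
        # overhead-corrected elapsed time under the given name.
        start = self.clock()
        result = func()
        stop = self.clock()
        elapsed = max(0, stop - start - self.overhead)
        self.totals[name] = self.totals.get(name, 0) + elapsed
        return result


if __name__ == "__main__":
    profiler = TimerProfiler(time.perf_counter_ns)
    profiler.timed("sum_squares", lambda: sum(i * i for i in range(100_000)))
    print("estimated timer overhead (ns):", profiler.overhead)
    print("accumulated time (ns):", profiler.totals)
```

Even with the correction, the wrapper still runs extra instructions and touches extra memory, so the second and third bullets on the slide continue to apply: the instrumented program is slower and its cache behavior differs from the original.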
Cycle-accurate profiling for free [Diagram: a Statistics Module is added on the FPGA; it observes the LEON program counter (pc) on the event monitor bus, and the workstation requests timings through the control S/W interface and command controller]
Liquid architecture: cycle-accurate profiling for free • Choose the methods to profile from the user interface, which lists the functions in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd) with a Time / Cycles column • Each selected method maps to an address range [Lo, Hi], e.g. findMatch: 0x4000027C to 0x400003EF • The Stats Module receives PC and CLK from the Event Monitor Bus • For each range, comparators test Lo ≤ PC ≤ Hi (e.g. 0x4000027C ≤ 0x4000035A ≤ 0x400003EF) and increment (INCR) that range's counter; a second range, 0x400005D8 to 0x4000061F, has its own comparators and counter • Counter values go to the Command Controller
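The comparator-and-counter scheme on these slides can be modeled in software to make the logic concrete. The sketch below is a behavioral model only, not the authors' VHDL: `RangeCounter` and `run_stats_module` are hypothetical names, the PC trace is made up, and it assumes the module sees one PC value per clock cycle. The two address ranges are the ones shown on the slides; the slides do not clearly name the function for the second range, so it is labeled generically.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RangeCounter:
    name: str
    lo: int
    hi: int
    cycles: int = 0

    def clock(self, pc: int) -> None:
        # Comparator pair: lo <= pc <= hi enables INCR for this cycle.
        if self.lo <= pc <= self.hi:
            self.cycles += 1


def run_stats_module(counters: List[RangeCounter], pc_trace: Iterable[int]) -> None:
    # Every counter sees every PC value, as parallel hardware would.
    for pc in pc_trace:
        for counter in counters:
            counter.clock(pc)


counters = [
    RangeCounter("findMatch", 0x4000027C, 0x400003EF),
    # Function name for this range is not legible in the slides.
    RangeCounter("second_range", 0x400005D8, 0x4000061F),
]

# Made-up trace: two cycles inside findMatch, one in the second range,
# and one outside both ranges.
pc_trace = [0x4000035A, 0x4000035E, 0x400005D8, 0x40000700]
run_stats_module(counters, pc_trace)

for c in counters:
    print(f"{c.name}: {c.cycles} cycles")
```

Because the comparison happens beside the processor rather than inside the program, the profiled application needs no added instructions, which is the point the slides make with "for free".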
Cycle-accurate profiling for free [Diagram: same platform; the workstation requests timings from the Statistics Module and receives per-function results] findMatch: 500 ms • coreLoop: 300 ms
“Where time was spent” for BLASTN… • Cycle-accurate profiling • No application overhead • Hence, at full speed
Cycle-accurate profiling for free [Diagram: same platform with the Statistics Module on the event monitor bus] Is cache the problem?
Software approach to profiling “cache” • Not possible to profile by coding! • Instead, simulate cache behavior: program → cache simulator → timings • Slow! Cannot afford to simulate the entire program, so the program must be scaled down
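To show what "simulate cache behavior" involves, here is a minimal sketch of a direct-mapped cache simulator driven by an address trace. It is not the simulator the authors used, which the slides do not identify. The 1 KB size matches the D-cache size mentioned later in the deck; the 32-byte line size, the direct-mapped organization, and the sample trace are assumptions for illustration.

```python
from typing import Iterable, List, Optional, Tuple


class DirectMappedCache:
    """Tracks hits and misses for a direct-mapped cache (tags only, no data)."""

    def __init__(self, size_bytes: int = 1024, line_bytes: int = 32) -> None:
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags: List[Optional[int]] = [None] * self.num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        # Split the address into a line index and a tag, then compare tags.
        block = address // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # Miss: the new line replaces the old one.
        self.misses += 1
        return False


def simulate(trace: Iterable[int], cache: DirectMappedCache) -> Tuple[int, int]:
    # One software step per memory access; this loop is the costly part.
    for address in trace:
        cache.access(address)
    return cache.hits, cache.misses


cache = DirectMappedCache()
trace = [0x0, 0x4, 0x20, 0x400, 0x0]  # made-up addresses
print(simulate(trace, cache))  # prints (1, 4)
```

Every memory access of the real program becomes several interpreted operations here, which is why simulating a full run is so slow and why the slides suggest scaling the program down.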
How do we detect and report cache behavior using Liquid Architecture?
Liquid architecture: cache behavior for free • The interface extends to include cache behavior options: for each function in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd), the columns are Time / Cycles and Cache Hits / Misses (Read, Write)
Cache profiling [Diagram: same platform with the Statistics Module on the event monitor bus] • Cache behavior: hit and miss signals in LEON are fed into the Event Monitoring Bus • The Statistics Module counts these events • Reported to the workstation: read hits and misses, write hits and misses
[Chart: % cache hit rate per function for a 1 KB D-cache] Function-wise cache profiling, in reasonable time
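The hit-rate percentage on this chart follows directly from the four event counts the Statistics Module reports (read hits, read misses, write hits, write misses). The sketch below shows that arithmetic. `CacheCounts` is a hypothetical name, combining reads and writes into one rate is an assumption about how the chart was computed, and the numbers are invented for illustration; they are not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class CacheCounts:
    read_hits: int
    read_misses: int
    write_hits: int
    write_misses: int

    def hit_rate_percent(self) -> float:
        # hit rate = hits / (hits + misses), over reads and writes together.
        hits = self.read_hits + self.write_hits
        total = hits + self.read_misses + self.write_misses
        return 100.0 * hits / total if total else 0.0


# Invented counts, keyed by function name, purely to show the calculation.
per_function = {
    "findMatch": CacheCounts(read_hits=900, read_misses=50,
                             write_hits=40, write_misses=10),
    "coreLoop": CacheCounts(read_hits=0, read_misses=0,
                            write_hits=0, write_misses=0),
}

for name, counts in per_function.items():
    print(f"{name}: {counts.hit_rate_percent():.1f}% hit rate")
```

A function with no recorded accesses reports 0.0 here rather than raising a division error; a real report might prefer to show it as "no data".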
Liquid architecture enables fast, accurate results • Seconds: fast, but no cache performance data available • Days: so slow you wouldn’t do this on the whole program • ½ hour: practical, reasonably fast, totally accurate
Can profile all other aspects of micro-architecture too… [Interface columns for each function in .text (main, findMatch, addQuery, computeKey, computeBase, coreLoop, fillQuery, Rnd): Time / Cycles, Cache Hits / Misses (Read, Write), Pipeline Stalls, Branch Predict]
How do we use the profiling info to improve application performance?
Reconfiguration [Diagram: same platform, showing the program and gcc on the workstation and a replacement cache controller, I-cache, and D-cache for the LEON core on the FPX]
[Chart: cache hits after D-cache reconfiguration] Conclusion for the “large” run: the D-cache doesn’t make much difference; the hit rate is already very high.