
RAMP Gold


Presentation Transcript


  1. RAMP Gold RAMPants {rimas,waterman,yunsup}@cs Parallel Computing Laboratory University of California, Berkeley

  2. A Survey of μArch Simulation Trends Typical ISCA 2008 papers simulated about twice as many instructions as those in 1998. So what?

  3. A Survey of μArch Simulation Trends Something seems broken here…

  4. A Survey of μArch Simulation Trends Something is clearly broken here.

  5. Something is Rotten in the State of California • A median ISCA ‘08 paper’s simulations run for fewer than four OS scheduling quanta! • We run yesterday’s apps at yesteryear’s timescales • And attempt to model N communicating cores with O(1/N) instructions per core?! • The problem is that simulators are too slow • Irony: since performance scales as sqrt(complexity), simulated instructions per wall-clock second falls as processors get faster

  6. RAMP Gold: Our Solution • RAMP Gold is an FPGA-based, 100 MIPS manycore simulator • Only 100x slower than real-time • Economical: RTL is BSD-licensed; commodity HW

  7. Our Target Machine 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$ / interconnect to DRAM

  8. RAMP Gold Architecture • Mapping the target machine directly to an FPGA is inefficient • Solution: split timing and functionality • The timing logic decides how many target cycles an instruction sequence should take • Simulating the functionality of an instruction might take multiple host cycles • Target time and host time are orthogonal
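The timing/functionality split on this slide can be sketched in a few lines. This is an illustrative model only, not RAMP Gold's RTL: the class names and instruction latencies below are invented for the example.

```python
# Illustrative sketch of a functional/timing split (names and latencies
# are assumptions, not RAMP Gold's actual design).

class FunctionalModel:
    """Executes instructions; each one may take multiple host steps."""
    def __init__(self, program):
        self.program = program
        self.pc = 0

    def step(self):
        insn = self.program[self.pc]
        self.pc += 1
        return insn

class TimingModel:
    """Decides how many *target* cycles each instruction should take."""
    COSTS = {"add": 1, "load": 2, "mul": 3}   # assumed target latencies

    def __init__(self):
        self.target_cycles = 0

    def account(self, insn):
        self.target_cycles += self.COSTS.get(insn, 1)

# Target time advances per the timing model, independent of how many
# host cycles the functional model spent: the two clocks are orthogonal.
func = FunctionalModel(["add", "load", "mul", "add"])
timing = TimingModel()
for _ in range(4):
    timing.account(func.step())
print(timing.target_cycles)  # 1 + 2 + 3 + 1 = 7
```

Because only the timing model's parameters define the target machine, they can be changed at runtime without resynthesizing the functional pipeline.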

  9. Function/Timing Split Advantages • Flexibility • Can configure target at runtime • Synthesize design once, change target model parameters at will • Efficient FPGA resource usage • Example 1: model a 2-cycle FPU in 10 host cycles • Example 2: model a 16MB L2$ using only 256KB host BRAM to store tags/metadata

  10. Host Multithreading How are we going to model 64 cores? Rather than build 64 pipelines, we time-multiplex one pipeline: each host cycle, the pipeline issues from a different core’s state • Single hardware pipeline with multiple copies of CPU state • No bypass paths required • Not a multithreaded target!

  11. Cache Modeling The cache model maintains tag, state, and protocol bits internally (no data); each address is split into tag/index/offset fields and the stored tags are compared across all ways, up to the maximum associativity • Hit: don’t stall • Miss: stall an arbitrary number of target cycles Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall
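A tags-only cache model like the one described can be sketched as follows. The geometry and miss penalty are made-up parameters for illustration, not RAMP Gold's configuration:

```python
# Tags-only cache model: stores only tag metadata (no data array) and
# answers "how many target cycles should this access stall?"
# LINE, SETS, and MISS_PENALTY are illustrative assumptions.

LINE = 64          # bytes per line
SETS = 256         # direct-mapped for brevity
MISS_PENALTY = 20  # target stall cycles on a miss

tags = [None] * SETS

def access(addr):
    index = (addr // LINE) % SETS      # index field selects the set
    tag = addr // (LINE * SETS)        # tag field is compared
    if tags[index] == tag:
        return 0                       # hit: don't stall
    tags[index] = tag                  # fill the line's tag on a miss
    return MISS_PENALTY                # miss: stall target cycles

stall = access(0x1000) + access(0x1000)
print(stall)  # 20: first access misses, second hits
```

Because only tags and state are stored, a large target cache fits in little host BRAM, which is exactly the 16MB-L2-in-256KB trick from the previous slide.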

  12. Putting it all together The host pipeline consists of ifetch, decode, register-access, memory, and exception stages, fed by an instruction cache and a data cache, backed by a memory controller, and instrumented by the cache model & performance counters • Resource Utilization (XC5VLX110T) • LUTs – 14%, BRAM – 23% • We can fit 3 pipelines on one FPGA!

  13. Infrastructure

  14. Our accomplishments this semester

  15. HARDware ain’t no joke

  16. Sample Use Case: L1 D$ Tradeoffs • Assume we have a 64-core CMP with private 16KB direct-mapped L1 D$ • In the next tech gen, we can fit either of these improved configurations in a clock cycle: • 32KB direct-mapped L1 • 16KB 4-way set-associative L1 • Which should we choose?

  17. Sample Use Case: L1 D$ Tradeoffs Evidently, the associative cache is superior. It took longer to make these slides than to run these 10+ billion-instruction simulations
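The experiment behind these two slides can be sketched as a trace-driven comparison of the two candidate configurations. The trace and cache routine below are illustrative assumptions, not the actual workload or simulator:

```python
# Sketch of the L1 D$ tradeoff experiment: run one address trace through
# both candidate configurations and compare miss counts. The trace is a
# made-up worst case for the direct-mapped cache.

def misses(trace, size_bytes, ways, line=32):
    sets = size_bytes // (line * ways)
    cache = [[] for _ in range(sets)]   # each set: up to `ways` tags, LRU order
    miss = 0
    for addr in trace:
        s = (addr // line) % sets
        tag = addr // (line * sets)
        if tag in cache[s]:
            cache[s].remove(tag)        # hit: refresh LRU position
        else:
            miss += 1
            if len(cache[s]) == ways:
                cache[s].pop(0)         # evict least-recently-used tag
        cache[s].append(tag)
    return miss

# Two addresses 32KB apart conflict in the direct-mapped cache but
# coexist in the 4-way one:
trace = [0x0000, 0x8000] * 100
print(misses(trace, 32 * 1024, 1))   # 32KB direct-mapped: 200 misses
print(misses(trace, 16 * 1024, 4))   # 16KB 4-way:          2 misses
```

This kind of sweep is cheap at 100 MIPS, which is the slide's point: the 10+ billion-instruction runs finish in minutes of wall-clock time.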

  18. Future Directions • RAMP Gold closes two critical feedback loops • Expedient HW/SW co-tuning is within our grasp • Simulations can now be run on a thermal timescale, enabling the exploration of temperature-aware scheduling policies • We intend to explore both avenues!

  19. DEMO: Damascene Pipeline stages: Image → Convert Colorspace → Textons: K-means → Bg / Cga / Cgb / Texture Gradient → Combine → Intervening Contour → Generalized Eigensolver → Oriented Energy Combination → Combine, Normalize → Non-max Suppression → Contours
