
An Analytical Model for CMPs

Spring 2003, ECE/CS 757, University of Wisconsin - Madison. Peter McClone, Kim-Huei Low.


Presentation Transcript


  1. An Analytical Model for CMPs Spring 2003 ECE/CS 757 University of Wisconsin - Madison Peter McClone Kim-Huei Low

  2. Overview • Introduction • Simulation Environment • Analytical Model • Results • Future work • Conclusion

  3. Introduction • Performance limits of superscalar processors • Relative chip area increases • Chip Multiprocessors (CMP) • Design by simulation data • Time constraint • Analytical Model

  4. CMP System Model • [Figure: CPU0 … CPUN, each with its own iL1 and dL1 caches, above a single L2 and Memory] • Like Piranha • Multi-programmed workload • SimpleMP?

  5. CacheTracer

  6. Processor Model • Most area-efficient model • Area vs performance • SimpleScalar is used

  7. SimpleScalar • Generate address traces • Ignore instruction addresses • ~100% hit rate • Fetched directly from memory, no allocation in L2 • Very little interference in L2 • Performance

  8. Simulator Combination • 1. Generate address traces using word-size cache blocks at all levels and minimal cache sizes • 2. Feed the traces into CacheTracer to simulate the cache interference at the L2 level and generate the statistics needed for the model computation • 3. Use the model to compute performance estimates for variations of all of the cache parameters

  9. Analytical Model • A mathematical equation for IPC • Combination of observations made in research papers and from intuitive knowledge • Based mainly on a detailed cache model that was the focus of the project • Processor model is very simple

  10. Analytical Model • 3-part miss-rate model: $M(C,t) = M(C,t)_{\text{startup}} + M(C,t)_{\text{nonstationary}} + M(C,t)_{\text{intrinsic}}$ • C is a cache configuration • t is measured in time granules of r references each

  11. Analytical Model • Startup misses are measured by the number of unique blocks accessed in the first time granule, u(B) • The miss rate is u(B) divided by the total number of references, r·t: • $M(C,t)_{\text{startup}} = \frac{u(B)}{r\,t}$

  12. Analytical Model • Nonstationary misses are caused by the change in the working set of a process • They are the difference between the total number of unique blocks accessed to this point, U(B), and the blocks attributed to startup: • $M(C,t)_{\text{nonstationary}} = \frac{U(B) - u(B)}{r\,T}$ • The sum of the startup and nonstationary misses is simply the number of unique blocks in the trace.
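The startup and nonstationary terms can be estimated directly from an address trace. The following is a minimal sketch, not the authors' CacheTracer code: the function name and arguments are hypothetical, and because the slides do not define T, the sketch assumes that both t and T equal the number of granules in the whole trace.

```python
def startup_and_nonstationary(addresses, block_size, r):
    """Estimate u(B), U(B) and the two miss-rate terms from a trace.

    addresses  -- list of integer byte addresses (hypothetical input format)
    block_size -- cache block size B in bytes
    r          -- references per time granule
    """
    blocks = [addr // block_size for addr in addresses]

    # u(B): unique blocks touched in the first time granule.
    u = len(set(blocks[:r]))
    # U(B): unique blocks touched in the trace so far (here, the whole trace).
    U = len(set(blocks))

    # Assumption: t and T both equal the trace length in granules.
    granules = len(blocks) / r
    m_startup = u / (r * granules)
    m_nonstationary = (U - u) / (r * granules)
    return u, U, m_startup, m_nonstationary


if __name__ == "__main__":
    trace = [0, 4, 8, 64, 68, 128, 0, 4]
    print(startup_and_nonstationary(trace, block_size=64, r=4))
```

Under that assumption the two terms add up to U(B) divided by the trace length, which matches the last bullet of slide 12.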

  13-18. Analytical Model (one slide, built up in steps) • Intrinsic misses are caused inherently by program behavior • The natural sequence of accesses causes cache lines to displace each other • $M(C,t)_{\text{intrinsic}} = \frac{c}{r}\left[u(B) - \sum_{d=0}^{D} S \cdot d \cdot P(B,d)\right]$ • S is the number of sets • P(B,d) is the probability that d blocks of size B map into a given cache set • The sum counts the blocks that land in sets with no more than D blocks mapped to them, where D is the set size • This many blocks cannot cause misses, because no more blocks map to the set than it can hold • However, the other blocks (u(B) minus the sum) may cause misses. This is estimated by multiplying by the measured collision rate c
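The intrinsic term can be evaluated numerically once P(B,d) is known. The slides do not say how P(B,d) is obtained, so the sketch below assumes that the u(B) blocks map to sets uniformly at random, which gives a binomial distribution. The function names are hypothetical, and D is taken to be the set size (associativity).

```python
from math import comb


def set_occupancy_prob(u, S, d):
    """P(B,d) under the assumed uniform random mapping: probability that
    exactly d of the u blocks map into one given set out of S."""
    return comb(u, d) * (1 / S) ** d * (1 - 1 / S) ** (u - d)


def intrinsic_miss_rate(u, S, D, c, r):
    """c/r * [u - sum_{d=0}^{D} S * d * P(B,d)].

    u -- unique blocks per granule, u(B) (integer)
    S -- number of sets
    D -- set size (associativity)
    c -- measured collision rate
    r -- references per time granule
    """
    # Blocks sitting in sets that hold at most D blocks cannot collide.
    non_colliding = sum(S * d * set_occupancy_prob(u, S, d) for d in range(D + 1))
    return (c / r) * (u - non_colliding)


if __name__ == "__main__":
    # Direct-mapped (D=1) versus 2-way (D=2), same number of sets.
    print(intrinsic_miss_rate(u=8, S=4, D=1, c=0.5, r=100))
    print(intrinsic_miss_rate(u=8, S=4, D=2, c=0.5, r=100))
```

A quick sanity check on the formula: when D is at least u(B), every block fits in its set, the sum equals u(B), and the intrinsic term vanishes.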

  19. Analytical Model • Combining all three components of the miss rate results in the equation: $M(C,t) = \frac{u(B)}{r\,t} + \frac{U(B) - u(B)}{r\,T} + \frac{c}{r}\left[u(B) - \sum_{d=0}^{D} S \cdot d \cdot P(B,d)\right]$ • Average memory access time can then be computed directly: $\text{AMA} = (1 - M_{L1}) \cdot \text{HitTime}_{L1} + M_{L1} (1 - M_{L2}) \cdot \text{HitTime}_{L2} + M_{L1} M_{L2} \cdot \text{HitTime}_{\text{main}}$
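The average memory access time on slide 19 is a weighted sum over the three places a reference can be satisfied. A minimal sketch follows; the function name and the latency values in the example are illustrative assumptions, not figures from the presentation.

```python
def average_memory_access(m_l1, m_l2, hit_l1, hit_l2, hit_main):
    """AMA for a two-level hierarchy backed by main memory.

    m_l1, m_l2 -- L1 and L2 miss rates (fractions between 0 and 1)
    hit_*      -- access times in cycles for L1, L2 and main memory
    """
    return ((1 - m_l1) * hit_l1            # served by L1
            + m_l1 * (1 - m_l2) * hit_l2   # missed L1, served by L2
            + m_l1 * m_l2 * hit_main)      # missed both, served by memory


if __name__ == "__main__":
    # Illustrative numbers only: 10% L1 misses, half of those also miss in L2.
    print(average_memory_access(0.1, 0.5, hit_l1=1, hit_l2=10, hit_main=100))
```

The three weights sum to one, so the result always lies between the fastest and slowest access time.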

  20. Analytical Model • The final model then incorporates the AMA into a simple processor model based on: • Issue Width (IW) • Issue Capabilities (IC) • Average memory penalty (AMP) • Average memory access time (AMA) • $P_i = IW_i \cdot [(AMA_i + 2)/3 \cdot AMP_i]$ • Total performance is the sum over all processors • $P_{cmp} = \sum_{i=1}^{N_c} P_i$
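The per-processor estimate and the chip total can be sketched as below. The slide's expression is ambiguous about grouping, so this sketch assumes a literal left-to-right reading, (AMA + 2) / 3 multiplied by AMP; the function names and input values are hypothetical.

```python
def processor_performance(iw, ama, amp):
    """P_i = IW_i * [(AMA_i + 2) / 3 * AMP_i], read left to right (assumption)."""
    return iw * ((ama + 2) / 3 * amp)


def cmp_performance(cores):
    """P_cmp: sum of P_i over all cores.

    cores -- iterable of (iw, ama, amp) tuples, one per processor
    """
    return sum(processor_performance(iw, ama, amp) for iw, ama, amp in cores)


if __name__ == "__main__":
    # Two identical cores with illustrative parameters.
    print(cmp_performance([(4, 6.4, 0.5), (4, 6.4, 0.5)]))
```

Because the total is a plain sum, the model treats cores as independent once each core's AMA already reflects the interference measured at the shared L2.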

  21. Results

  22. Future work • Modify SimpleMP or RSIM • Modify SMP or DSM system to CMP system • Fix or create the multi-programmed loader • Incorporate processor parameters into model • Issue width, instruction window size, branch predictor policy, # of ALUs, etc.

  23. Conclusion • Questions?

  24. References • [1] A. Agarwal, M. Horowitz, and J. Hennessy. An analytical cache model. Computer Systems Lab. Rep. TR 86-304, Stanford Univ., Stanford, Calif., Sept. 1986. • [2] J. Huh, D. Burger, and S. W. Keckler. Exploring the design space of future CMPs. In The 10th International Conference on Parallel Architectures and Compilation Techniques, pages 199-210, September 2001. • [3] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A scalable architecture based on single-chip multiprocessing. In The 27th Annual International Symposium on Computer Architecture, pages 282-293, June 2000.
