
Paper Presentation


Presentation Transcript


  1. Paper Presentation A Helper Thread Based Dynamic Cache Partitioning Scheme for Multithreaded Applications 2013/06/10 Yun-Chung Yang Kandemir, M., Yemliha, T., Kultursay, E. Pennsylvania State Univ., University Park, PA, USA Design Automation Conference (DAC), 2011, 48th ACM/EDAC/IEEE, pp. 954–959

  2. Outline • Abstract • Related Work • Motivation • Difference Between Inter- and Intra-Application Partitioning • Proposed Method • Experimental Results • Conclusion

  3. Abstract Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, this paper (i) shows that different threads of a multithreaded application can have different cache space requirements, (ii) proposes a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies, (iii) presents a comprehensive experimental analysis of the proposed scheme, and (iv) reports average improvements of 17.1% on the SPECOMP suite and 18.6% on the PARSEC suite.

  4. Related Work • Resource management: processor cores [6], shared cache [5, 4, 8, 11, 12, 17, 18, 20], off-chip bandwidth [3, 10, 13] • Application granularity: intra-application shared-cache partitioning [16] • This paper: extends intra-application partitioning to multiple layers of the cache hierarchy

  5. Motivation • Ran the facesim (PARSEC) and art (SPECOMP) applications • Evaluated six schemes and recorded the Average Memory Access Time (AMAT) • No-partition • Uniform • Nonuniform • Nonuniform-L2 • Nonuniform-L3 • Dynamic • Dynamic outperforms the rest • Dividing the application into fixed epochs and repartitioning at each epoch performs best

  6. Difference Between Inter & Intra App. • The objectives and the implementations of cache partitioning differ • Intra-application cache partitioning tries to minimize the latency of the slowest thread • Handled by the runtime system or a dynamic compiler • Inter-application cache partitioning tries to optimize workload throughput • An OS problem

  7. The Proposed Method • A dynamic partitioning system built around a helper thread, whose main responsibility is to partition the cache space allocated to the application so as to maximize its performance • Its components: Performance Monitoring, Performance Modeling, and System Interfacing

  8. Proposed Method (cont.) • Each OS epoch is composed of many application epochs, and each application epoch is divided into five phases (sketched below) • Performance Monitoring • Performance Modeling • Resource Partitioning • System Interfacing • Application Execution
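
A minimal sketch of how the helper thread could sequence these five phases once per application epoch. This is illustrative only, not the paper's code: the counter-reading and partitioning logic are simple stand-ins (with fabricated numbers) so the skeleton runs end to end.

    import random

    def read_amat(thread_id):
        # Stand-in for reading hardware counters; returns a fake AMAT sample.
        return random.uniform(4.0, 12.0)

    def helper_thread(thread_ids, num_epochs, l2_ways=16):
        model = {t: [] for t in thread_ids}          # AMAT history per thread
        for epoch in range(num_epochs):
            # 1. Performance Monitoring: collect per-thread AMAT samples
            samples = {t: read_amat(t) for t in thread_ids}
            # 2. Performance Modeling: keep a history the partitioner can query
            for t in thread_ids:
                model[t].append(samples[t])
            # 3. Resource Partitioning: give more ways to higher-AMAT threads
            total = sum(samples.values())
            partition = {t: max(1, round(l2_ways * samples[t] / total))
                         for t in thread_ids}
            # 4. System Interfacing: the paper uses a system call; we just print
            print(f"epoch {epoch}: L2 ways -> {partition}")
            # 5. Application Execution: threads run until the next epoch boundary

    helper_thread(thread_ids=[0, 1, 2, 3], num_epochs=3)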

  9. Performance Monitoring • Uses the Average Memory Access Time (AMAT) as the measure of a thread's cache performance • AMAT: the ratio of the total cycles spent on memory instructions to the total number of instructions • Depends on the cache partition sizes • Takes the different levels of the cache hierarchy into account
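
Written out, the slide's per-thread definition (note that it normalizes by all instructions, not only by memory accesses) is:

    AMAT_i = (total cycles thread i spends on memory instructions) / (total instructions executed by thread i)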

  10. Performance Modeling • Needs to predict the impact of increasing or decreasing the cache space given to a thread • Each thread is expressed as a 3D plot • X and Y axes: cache space allocated from L2 and L3, respectively • For thread i, the sampled points d(sL2, sL3) are used to build its dynamic model • Purpose: predict the performance of a thread under allocations that have not been tried
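
One plausible way to turn sampled points into predictions is interpolation. The sketch below is not the paper's model: it simply bilinearly interpolates AMAT between four fabricated samples d(sL2, sL3) to estimate an untried allocation.

    # Illustrative only: fabricated (L2 ways, L3 ways) -> measured AMAT samples.
    samples = {
        (2, 4): 11.0, (2, 8): 9.5,
        (4, 4): 9.0,  (4, 8): 7.8,
    }

    def predict_amat(s_l2, s_l3, grid=samples):
        xs = sorted({x for x, _ in grid})
        ys = sorted({y for _, y in grid})
        # Clamp to the sampled range, then find the enclosing grid cell.
        s_l2 = min(max(s_l2, xs[0]), xs[-1])
        s_l3 = min(max(s_l3, ys[0]), ys[-1])
        x0 = max(x for x in xs if x <= s_l2)
        x1 = min(x for x in xs if x >= s_l2)
        y0 = max(y for y in ys if y <= s_l3)
        y1 = min(y for y in ys if y >= s_l3)
        # Bilinear weights; a degenerate axis falls back to linear.
        tx = 0.0 if x0 == x1 else (s_l2 - x0) / (x1 - x0)
        ty = 0.0 if y0 == y1 else (s_l3 - y0) / (y1 - y0)
        bot = grid[(x0, y0)] * (1 - tx) + grid[(x1, y0)] * tx
        top = grid[(x0, y1)] * (1 - tx) + grid[(x1, y1)] * tx
        return bot * (1 - ty) + top * ty

    print(predict_amat(3, 6))   # estimate between the sampled allocations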

  11. Cache Space Partitioning • For the ith L2 cache, qL2,i denotes the total number of cache ways allocated to this application • These qL2,i ways are shared by the application's mL2,i threads • The number of ways allocated to the kth thread is denoted sL2,i(k)
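
In this notation the per-thread allocations cannot exceed the application's share of the cache (my formalization of the constraint implied by the slide, not copied from the paper):

    sL2,i(1) + sL2,i(2) + ... + sL2,i(mL2,i) <= qL2,i

and symmetrically for each L3 cache.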

  12. Cache Space Partitioning Algorithm • P[t] denotes the cache resources (the numbers of ways in L2 and L3) • A generic sketch of such an allocation loop is given below
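
The slide presents the paper's algorithm as a figure that is not reproduced in this transcript. As a stand-in, here is a generic greedy way-allocation loop in the same spirit: ways are handed out one at a time to whichever thread's modeled AMAT improves the most. This is not the paper's actual algorithm; predict stands in for the performance model of slide 10.

    # Illustrative greedy partitioner (not the paper's algorithm).
    def partition_ways(num_threads, total_ways, predict):
        # predict(t, w) -> modeled AMAT of thread t when given w ways
        ways = [1] * num_threads              # every thread gets at least 1 way
        for _ in range(total_ways - num_threads):
            gains = [predict(t, ways[t]) - predict(t, ways[t] + 1)
                     for t in range(num_threads)]
            best = max(range(num_threads), key=lambda t: gains[t])
            ways[best] += 1
        return ways

    # Toy model: thread 1 is twice as cache-hungry as thread 0 (made-up numbers).
    toy = lambda t, w: (10.0 if t == 0 else 20.0) / w
    print(partition_ways(num_threads=2, total_ways=8, predict=toy))   # [3, 5]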

  13. System Interfacing • New partition information is delivered to the OS through a system call • A new instruction is added to the ISA, with operands: COID = core ID, CLVL = cache level, CAID = cache ID, W = 64-bit-wide way allocation
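
The slide names the instruction's operand fields but not their encoding. The packing below is purely hypothetical, to make the interface concrete: only the field names and the 64-bit width of W come from the slide; all other field widths are assumptions.

    # Hypothetical operand encoding for the partitioning instruction.
    def encode_partition_op(coid, clvl, caid, w_mask):
        assert 0 <= coid < 2**8       # core ID (width assumed: 8 bits)
        assert 0 <= clvl < 2**2       # cache level, e.g. 2 = L2, 3 = L3 (assumed)
        assert 0 <= caid < 2**6       # cache ID (width assumed: 6 bits)
        assert 0 <= w_mask < 2**64    # way-allocation bit vector (64 bits per slide)
        return (coid << 72) | (clvl << 70) | (caid << 64) | w_mask

    # Example: give ways 0-3 of L2 cache 1 to core 5.
    print(hex(encode_partition_op(coid=5, clvl=2, caid=1, w_mask=0b1111)))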

  14. What We Want to Know • The experimental environment • The comparison with other schemes • Average Memory Access Time (the main target of the performance monitoring) • Execution cycles

  15. Experiment Environment • SIMICS and GEMS are used to model the target multicore architecture • Runs the SPECOMP and PARSEC applications • Uses 120 million instructions as the application epoch length

  16. Experiment Environment (cont.) • Eight schemes were evaluated, recording the average memory access time • No-partition • Uniform – ways divided as evenly as possible among the cores • Static Best – the best static partition, found through exhaustive search • Dynamic – the proposed method • Dynamic-L2 – partitions only L2 • Dynamic-L3 – partitions only L3 • L2+L3 – partitions both levels, with a separate performance model for each • Ideal – the optimal strategy

  17. Improved Performance • The results show that the scheme balances the data access latencies of the different threads • As execution proceeds, all threads converge to an AMAT of about 8 cycles

  18. Conclusion • Intra-application cache partitioning for multithreaded applications • A dynamic model able to partition the cache at multiple layers • Average improvements of 17.1% on SPECOMP and 18.6% on PARSEC • My Comment • Reminds me of the importance of software and hardware cooperation • Thread management is a main issue in CMPs
