
ACCESS: Smart Scheduling for Asymmetric Cache CMPs


Presentation Transcript


  1. ACCESS: Smart Scheduling for Asymmetric Cache CMPs Xiaowei Jiang†, Asit Mishra‡, Li Zhao†, Ravi Iyer†, Zhen Fang†, Sadagopan Srinivasan†, Srihari Makineni†, Paul Brett†, Chita Das‡ Intel Labs (Oregon)† Penn State‡

  2. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  3. Motivation • Applications tend to have non-uniform cache capacity requirements • Symmetric cache designs lead to energy inefficiency • Two alternatives: virtual asymmetry (Core0/Core1 share one cache) vs. physical asymmetry (Core0 with a small cache, Core1 with a large cache) • [Figure: the two CMP configurations, labeled Virtual Asymmetry and Physical Asymmetry]

  4. Benefit of Physically Asymmetric Caches • Fits the asymmetry in working set size (WSS) across apps • Apps with small WSS and streaming apps go on the small cache • Apps with large WSS go on the large cache • Helps improve energy per instruction • The large cache can be power-gated when not in use • The smaller cache enables a lower operating voltage (e.g., 512KB at 0.8V vs. 4MB at 1.0V) • Fits the needs of heterogeneous-core architectures • Asymmetric cores naturally call for asymmetric caches

  5. Challenges in Asymmetric Caches • What H/W support is needed? • H/W must expose certain cache stats to the OS • What OS scheduler changes are needed? • The scheduler must be aware of the underlying cache asymmetry • A new scheduling policy is needed to exploit the cache asymmetry

  6. Contribution of ACCESS • ACCESS Architecture • Enables asymmetric caches • ACCESS Prediction Engine (APE) • Online runtime measurement of cache stats • Stats exposed to the OS • Asymmetric Cache Scheduler (ACS) • Finds the best-performing schedule with one-time training • Handles both private and shared caches • Estimates shared-cache contention effects with simple heuristics • O(1) complexity • Real-machine measurements show >20% performance improvement over the default Linux scheduler

  7. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  8. Related Work • OS schedulers for heterogeneous-core architectures • Li et al., HPCA'10 • Kumar et al., MICRO'03, ISCA'04 • OS scheduler and H/W approaches for mitigating cache contention effects • Chandra et al., HPCA'05 • Kim et al., PACT'05

  9. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  10. ACCESS Architecture • Runs tasks with one-time training • The APE measures/predicts each task's cache stats on the big and small caches • The OS builds the schedule based on the cache stats supplied by the APE

  11. Access Prediction Engine • Provides cache stats for each app on each cache (as if running alone) • Shadow tag = a cache without the data array • Uses set sampling to reduce size and the number of accesses • With multiple shadow tags, even though App1 and App2 share the cache, we can still measure the stats of App1 and App2 running alone on the 4MB (L) and 512KB (S) caches • [Figure: 4MB LLC with its controller and 16-way tag array (sets 0-4095), plus per-app 4MB and 512KB shadow tags over sampled sets, each reporting hit/miss for App1 and App2] (a small software model follows below)
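A shadow tag is just a tag array with replacement state, so its behavior is easy to model in software. Below is a minimal Python sketch of a set-sampled shadow tag, useful for reasoning about what the APE measures; the LRU policy, 64B lines, and the 1-in-16 sampling ratio are illustrative assumptions, not details stated in the talk.

```python
# Minimal software model of a set-sampled shadow tag (tag array, no data array).
# Illustrative only: the real APE is hardware; LRU replacement, 64B lines, and
# the 1-in-16 sampling ratio below are assumptions, not details from the talk.
from collections import OrderedDict

class ShadowTag:
    def __init__(self, size_bytes, ways=16, line=64, sample_every=16):
        self.num_sets = size_bytes // (ways * line)
        self.ways, self.line = ways, line
        self.sample_every = sample_every      # track only 1 of every N sets
        self.sets = {}                        # set index -> LRU-ordered tags
        self.hits = self.misses = 0

    def access(self, addr):
        block = addr // self.line
        idx = block % self.num_sets
        if idx % self.sample_every:           # set sampling: skip most sets
            return
        tag = block // self.num_sets
        lru = self.sets.setdefault(idx, OrderedDict())
        if tag in lru:
            lru.move_to_end(tag)              # refresh LRU position on a hit
            self.hits += 1
        else:
            self.misses += 1
            if len(lru) >= self.ways:
                lru.popitem(last=False)       # evict the LRU way
            lru[tag] = True

# One shadow tag per (app, cache size) yields per-app "running alone" stats
# for both the 4MB (L) and 512KB (S) configurations:
large, small = ShadowTag(4 << 20), ShadowTag(512 << 10)
```

Feeding each app's LLC accesses into its own pair of shadow tags produces stand-alone hit/miss counts for both cache sizes, even while the apps actually share one physical cache.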

  12. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  13. Asymmetric Cache Scheduler • Goal of the scheduler: improve overall thread performance • Perform as little training as possible to minimize training overhead • Thread stats available to the scheduler • Instruction count, etc. • Cache misses of each thread running alone on each cache • In practice, we find that the schedule with the minimal overall MPI (misses per instruction) yields the best overall performance

  14. ACS Examples [Figure: 2T case with private small/large caches per core; 4T case with core pairs sharing a small or large cache] • Private caches, e.g., the 2T case (T1, T2) • Calculate <MPI_T1_L, MPI_T1_S> and <MPI_T2_L, MPI_T2_S> • Compute the MPIsum of all possible schedules • MPIsum1 = MPI_T1_L + MPI_T2_S • MPIsum2 = MPI_T1_S + MPI_T2_L • Pick min(MPIsum1, MPIsum2) • Shared caches, e.g., the 4T case (T1-T4) • Calculate <MPI_Ti_L, MPI_Ti_S> for each Ti • Compute the MPIsum of all possible schedules • MPIsum = MPI_TiTj_L + MPI_TxTy_S • MPI_TiTj_L and MPI_TxTy_S are estimated • Pick the schedule with the minimal MPIsum (a sketch of the 2T selection follows below)
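For the private-cache 2T case the selection reduces to a two-way comparison. A minimal sketch; the function name and MPI values are made up for illustration:

```python
# Pick the 2T private-cache schedule with the minimal overall MPI.
# mpi[t] = (MPI on the large cache, MPI on the small cache), as measured
# by the APE with each thread running alone; values are illustrative.
mpi = {"T1": (0.40, 0.50), "T2": (0.45, 0.90)}

def best_2t_schedule(t1, t2):
    s1 = mpi[t1][0] + mpi[t2][1]   # schedule 1: t1 on L, t2 on S
    s2 = mpi[t1][1] + mpi[t2][0]   # schedule 2: t1 on S, t2 on L
    return ((t1, t2), s1) if s1 <= s2 else ((t2, t1), s2)

(on_large, on_small), mpi_sum = best_2t_schedule("T1", "T2")
print(f"{on_large} -> large cache, {on_small} -> small cache, MPIsum={mpi_sum:.2f}")
# With these inputs: T2 -> large, T1 -> small, MPIsum=0.95
```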

  15. Estimating Cache Contention Effect • Task: given MPI_Ti_L/S and MPI_Tj_L/S, estimate MPI_TiTj_L/S • Cache power law (Hartstein et al.): MR_new = MR_old * (C_new/C_old)^(-α), and likewise MPI_new = MPI_old * (C_new/C_old)^(-α) • We can compute α for each thread: α = -log_(C_L/C_S)(MPI_Ti_L / MPI_Ti_S) • α measures how sensitive the app is to cache capacity (a worked sketch follows below)
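Since both stand-alone MPIs are already measured by the shadow tags, α falls out of a single logarithm. A small sketch; the MPI values are illustrative:

```python
import math

# Cache power law (Hartstein et al.): MPI_new = MPI_old * (C_new/C_old)**(-alpha).
# With stand-alone MPIs on the 4MB (L) and 512KB (S) caches, alpha follows
# directly; the MPI values here are illustrative.
C_L, C_S = 4 << 20, 512 << 10
mpi_L, mpi_S = 0.40, 0.50

alpha = -math.log(mpi_L / mpi_S, C_L / C_S)   # alpha = -log_{C_L/C_S}(MPI_L/MPI_S)

# The same law then predicts MPI at any other effective capacity, e.g. 2MB:
mpi_2mb = mpi_L * ((2 << 20) / C_L) ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted MPI at 2MB = {mpi_2mb:.3f}")
```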

  16. Estimating Cache Contention Effect (cont.) • Estimating the cache occupancy of Ti when Ti and Tj share a cache
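The transcript does not include this slide's occupancy formula, so the sketch below is only a plausible stand-in, not the paper's method: it assumes each thread's share of a shared cache is proportional to its stand-alone miss rate (a thread that misses more refills more lines), then applies the power law from the previous slide to each thread's estimated share.

```python
# Plausible stand-in for the omitted occupancy heuristic (an assumption, not
# the paper's formula): split capacity in proportion to stand-alone MPI, then
# apply the power law MPI_new = MPI_old * (C_new/C_old)**(-alpha) per thread.
def estimate_shared_mpi(mpi_i, mpi_j, alpha_i, alpha_j):
    share_i = mpi_i / (mpi_i + mpi_j)         # Ti's assumed capacity fraction
    share_j = 1.0 - share_i                   # Tj gets the rest
    mpi_i_shared = mpi_i * share_i ** (-alpha_i)
    mpi_j_shared = mpi_j * share_j ** (-alpha_j)
    return mpi_i_shared + mpi_j_shared        # estimated MPI_{TiTj} on this cache
```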

  17. Scheduler Compute Overhead • Computing and sorting all possible schedules has O(n^2) complexity • To arrive at the best schedule, the number of thread migrations may be unbounded

  18. O(1) ACS • Goals of O(1) ACS • O(1) complexity • A bounded number of thread migrations to arrive at the best schedule • O(1) ACS algorithm • On each thread (Ti) arrival, compare the MPIsum of 6 cases: • 1. Ti on L • 2. Ti on L, migration candidate on L -> S • 3. Ti on L, migration candidate on L <-> migration candidate on S • 4. Ti on S • 5. Ti on S, migration candidate on S -> L • 6. Ti on S, migration candidate on S <-> migration candidate on L • Pick the best schedule among 1-6 • Update the migration candidates based on the 2nd-best schedule (see the sketch below)
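Because only the arriving thread and the two designated migration candidates are considered, each arrival scores exactly six schedules regardless of thread count. A sketch under those rules; names are illustrative, and this simplified scoring just sums stand-alone MPIs, as in the example on the next slide, omitting the shared-cache contention correction:

```python
# O(1) ACS decision on arrival of thread t. sum_L/sum_S are the aggregate MPIs
# currently on the large/small caches; cand_L/cand_S are the migration
# candidates; mpi_L/mpi_S map each thread to its stand-alone MPI per cache.
def o1_acs(t, sum_L, sum_S, mpi_L, mpi_S, cand_L, cand_S):
    cases = {
        "1: t on L":                  (sum_L + mpi_L[t], sum_S),
        "2: t on L, cand_L->S":       (sum_L + mpi_L[t] - mpi_L[cand_L],
                                       sum_S + mpi_S[cand_L]),
        "3: t on L, cand_L<->cand_S": (sum_L + mpi_L[t] - mpi_L[cand_L] + mpi_L[cand_S],
                                       sum_S - mpi_S[cand_S] + mpi_S[cand_L]),
        "4: t on S":                  (sum_L, sum_S + mpi_S[t]),
        "5: t on S, cand_S->L":       (sum_L + mpi_L[cand_S],
                                       sum_S + mpi_S[t] - mpi_S[cand_S]),
        "6: t on S, cand_S<->cand_L": (sum_L - mpi_L[cand_L] + mpi_L[cand_S],
                                       sum_S + mpi_S[t] + mpi_S[cand_L] - mpi_S[cand_S]),
    }
    # Rank by MPIsum; the 2nd-best schedule is used to refresh the candidates.
    ranked = sorted(cases.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return ranked[0], ranked[1]

mpi_L = {"T1": 0.40, "T2": 0.45, "T3": 0.60}
mpi_S = {"T1": 0.50, "T2": 0.90, "T3": 0.75}
best, second = o1_acs("T3", 0.45, 0.50, mpi_L, mpi_S, "T2", "T1")
print(best)   # case 1 ("T3 on L") wins with MPIsum 1.55
```

Running this on the inputs of the next slide reproduces its table: case 1 wins with MPIsum 1.55, and case 5 is the runner-up used to update the candidates.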

  19. O(1) ACS Example

Per-thread MPIs (measured running alone):

Thread | MPI on L | MPI on S
T1     | 0.40     | 0.50
T2     | 0.45     | 0.90
T3     | 0.60     | 0.75

State at t0 (before T3 arrives):

Threads on L | Threads on S | Candidate on L | Candidate on S | MPI on L | MPI on S | MPIsum
T2           | T1           | T2             | T1             | 0.45     | 0.50     | 0.95

ACS computation at t1 (T3 arrives):

Case | Schedule               | MPI_L | MPI_S | MPIsum
1    | T3 on L                | 1.05  | 0.50  | 1.55
2    | T3 on L, T2->S         | 0.60  | 1.40  | 2.00
3    | T3 on L, T2->S, T1->L  | 1.00  | 0.90  | 1.90
4    | T3 on S                | 0.45  | 1.25  | 1.70
5    | T3 on S, T1->L         | 0.85  | 0.75  | 1.60
6    | T3 on S, T1->L, T2->S  | 0.40  | 1.65  | 2.05

Case 1 wins (MPIsum = 1.55). State after t1:

Threads on L | Threads on S | Candidate on L | Candidate on S | MPI on L | MPI on S | MPIsum
T2,T3        | T1           | T3             | T1             | 1.05     | 0.50     | 1.55

  20. O(1) ACS Efficacy • Constant computation overhead • Always achieves at least 97% of the best schedule's performance

  21. Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions

  22. Evaluation Setup • Real-machine measurements on a Xeon 5160 • 4 cores at 3GHz • 32KB split L1 caches • Each pair of cores shares an L2: 4MB for one pair, 512KB for the other • ACS scheduler • Implemented in Linux 2.6.32 • Enables fast thread migration • Since no APE h/w is available, MPIs are profiled offline with 2% error applied (to account for the effects of set sampling) • Benchmarks • 17 C/C++ SPEC2006 benchmarks • 2T and 4T workloads covering both cache-sensitive (S) and cache-insensitive (I) benchmarks • Run until the first thread exits

  23. Evaluation Results of ACS (2T) • Performance improvement in all 70 cases • 20% average speedup • Demonstrates the efficacy of ACS

  24. Evaluation Results of ACS (4T) • Performance improvement in all 30 cases • 31% average speedup • Demonstrates the efficacy of ACS and of the cache contention estimation

  25. Conclusions • We have proposed the ACCESS architecture • Enables physically asymmetric caches • ACCESS Prediction Engine • Uses shadow tags to conduct online cache simulation • We have also proposed the ACS scheduler • One-time training, using the MPIsum metric to derive the best-performing schedule • A practical approach to estimating shared-cache contention effects • O(1) ACS scheduler • Minimizes scheduler computation overhead • Limits thread migrations • Real-platform measurements show >20% speedup over the Linux scheduler

  26. Thanks!
