ACCESS: Smart Scheduling for Asymmetric Cache CMPs Xiaowei Jiang†, Asit Mishra‡, Li Zhao†, Ravi Iyer†, Zhen Fang†, Sadagopan Srinivasan†, Srihari Makineni†, Paul Brett†, Chita Das‡ Intel Labs (Oregon)† Penn State‡
Agenda • Motivation • Related Work • ACCESS Architecture • ACS Scheduler • Evaluation Results • Conclusions
Motivation • Applications tend to have non-uniform cache capacity requirements • Energy inefficiency • [Figure: two dual-core (Core0, Core1) configurations, labeled Virtual Asymmetry (a single cache) and Physical Asymmetry (a small cache and a large cache)]
Benefits of Physically Asymmetric Caches • Fit the asymmetry in working set size (WSS) of apps • Apps with a small WSS and streaming apps run on the small cache • Apps with a large WSS run on the large cache • Help improve energy per instruction • Large cache can be power gated when not in use • Smaller cache enables lower operating voltage (512KB: 0.8 V, 4MB: 1.0 V) • Fit the needs of heterogeneous-core architectures • Asymmetric cores naturally need asymmetric caches
Challenges in Asymmetric Caches • What H/W support is needed? • H/W exposes certain cache stats to the OS • What OS scheduler changes are needed? • Scheduler must be aware of the underlying cache asymmetry • New scheduling policy to exploit cache asymmetry
Contributions of ACCESS • ACCESS Architecture • Enables asymmetric caches • ACCESS Prediction Engine • Runtime online measurement of cache stats • Stats exposed to the OS • Asymmetric Cache Scheduler • Finds the best-performing schedule with one-time training • Handles both private and shared caches • Estimates shared cache contention effects with simple heuristics • O(1) complexity • Real-machine measurements show >20% performance improvement over the default Linux scheduler
Related Work • OS schedulers for heterogeneous-core architectures • Li et al. HPCA’10 • Kumar et al. Micro’03, ISCA’04 • OS scheduler or H/W approaches for mitigating cache contention effects • Chandra et al. HPCA’05 • Kim et al. PACT’05
ACCESS Architecture • Run tasks with one-time training • APE (ACCESS Prediction Engine) measures/predicts each task's cache stats on the big and small caches • OS chooses a schedule based on the cache stats provided by APE
Access Prediction Engine • Provides cache stats for each app on each cache (running alone) • Shadow tags (shadow tag = cache without the data array) • Use set sampling to reduce size and number of accesses • With multiple shadow tags, even though App 1 and App 2 share the cache, we can still measure the cache stats of App 1 and App 2 running alone on the 4MB (L) and 512KB (S) caches • [Figure: 4 MB LLC tag array (Set 0 to Set 4095, Way 0 to Way 15) with a controller hit/miss signal, alongside per-app 4 MB and 512 KB shadow tags for App 1 and App 2 holding sampled sets]
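The shadow-tag idea can be illustrated with a small software model. This is a minimal sketch, not the APE hardware: the class name `ShadowTags`, the LRU replacement policy, the 1-in-32 sampling rate, and the 512-set by 16-way layout for the 512KB cache are all assumptions made for illustration. Only the 4 MB geometry (4096 sets, 16 ways, hence 64-byte lines) follows from the slide.

```python
from collections import OrderedDict


class ShadowTags:
    """Tag-only LRU cache model (no data array) that tracks sampled sets only."""

    def __init__(self, num_sets, ways, line_bytes=64, sample_every=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.sample_every = sample_every  # hypothetical sampling rate
        self.sets = {}      # set index -> OrderedDict of tags in LRU order
        self.accesses = 0   # accesses that landed in sampled sets
        self.misses = 0     # misses among those accesses

    def access(self, addr):
        line = addr // self.line_bytes
        index = line % self.num_sets
        if index % self.sample_every != 0:
            return  # unsampled set: no shadow tags kept for it
        tag = line // self.num_sets
        lru = self.sets.setdefault(index, OrderedDict())
        self.accesses += 1
        if tag in lru:
            lru.move_to_end(tag)         # hit: mark as most recently used
        else:
            self.misses += 1
            lru[tag] = True
            if len(lru) > self.ways:
                lru.popitem(last=False)  # evict the least recently used tag

    def miss_ratio(self):
        return self.misses / self.accesses if self.accesses else 0.0


def make_shadow_pair(sample_every=32):
    # 4 MB = 4096 sets x 16 ways x 64 B; 512 KB assumed as 512 sets x 16 ways.
    large = ShadowTags(4096, 16, sample_every=sample_every)
    small = ShadowTags(512, 16, sample_every=sample_every)
    return large, small


# One app sweeping a 1 MB working set twice, observed by both shadow tags.
large, small = make_shadow_pair()
for _ in range(2):
    for addr in range(0, 1 << 20, 64):
        large.access(addr)
        small.access(addr)
```

Keeping one such pair per app is what lets the stats reflect each app running alone, even when the apps share the real cache. In this sweep the working set fits the large model but thrashes the small one.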
Asymmetric Cache Scheduler • Goal of the scheduler: improve overall thread performance • Perform as little training as possible to minimize training overhead • Per-thread stats available to the scheduler • Instruction count etc. • Cache misses of each thread running alone on each cache • In practice, we find that the schedule with the minimal overall MPI (misses per instruction) yields the best overall performance
ACS Examples • Private caches, e.g. 2T case (T1, T2) • calculate $\langle MPI_{T1,L}, MPI_{T1,S} \rangle$, $\langle MPI_{T2,L}, MPI_{T2,S} \rangle$ • compute $MPI_{sum}$ of all possible schedules • $MPI_{sum1} = MPI_{T1,L} + MPI_{T2,S}$ • $MPI_{sum2} = MPI_{T1,S} + MPI_{T2,L}$ • pick $\min(MPI_{sum1}, MPI_{sum2})$ • Shared caches, e.g. 4T case (T1, T2, T3, T4) • calculate $\langle MPI_{Ti,L}, MPI_{Ti,S} \rangle$ • compute $MPI_{sum}$ of all possible schedules • $MPI_{sum} = MPI_{TiTj,L} + MPI_{TxTy,S}$ • $MPI_{TiTj,L}$ and $MPI_{TxTy,S}$ are estimated • pick the minimum $MPI_{sum}$ • [Figure: 2T case with Core0 and Core1, each with a private cache (one small, one large); 4T case with Core0 to Core3 sharing a small cache and a large cache]
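The private-cache case amounts to evaluating each assignment and keeping the one with the lowest MPI sum. The sketch below is illustrative only: the function name `best_private_schedule` and the dictionary layout are assumptions, and the MPI values for T1 and T2 are borrowed from the O(1) ACS example slide.

```python
from itertools import permutations


def best_private_schedule(mpi, caches=("L", "S")):
    """mpi maps thread -> {"L": MPI on large cache, "S": MPI on small cache}.

    Each thread gets its own private cache; returns (MPIsum, assignment)
    for the assignment with the smallest MPI sum.
    """
    threads = list(mpi)
    best = None
    for assignment in permutations(caches, len(threads)):
        total = sum(mpi[t][c] for t, c in zip(threads, assignment))
        if best is None or total < best[0]:
            best = (total, dict(zip(threads, assignment)))
    return best


mpi = {
    "T1": {"L": 0.40, "S": 0.50},
    "T2": {"L": 0.45, "S": 0.90},
}
total, schedule = best_private_schedule(mpi)
# MPIsum1 = 0.40 + 0.90 = 1.30 and MPIsum2 = 0.50 + 0.45 = 0.95,
# so T2 goes to the large cache and T1 to the small one.
```

T2 loses far more than T1 when moved to the small cache, which is why the minimum-sum schedule gives T2 the large cache even though T1 also prefers it.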
Estimating Cache Contention Effect • Task: given $MPI_{Ti,L/S}$ and $MPI_{Tj,L/S}$, estimate $MPI_{TiTj,L/S}$ • Cache power law (Hartstein et al.): $MR_{new} = MR_{old} \cdot (C_{new}/C_{old})^{-\alpha}$, and likewise $MPI_{new} = MPI_{old} \cdot (C_{new}/C_{old})^{-\alpha}$ • We can compute $\alpha$ for each thread: $\alpha = -\log_{C_L/C_S}(MPI_{Ti,L}/MPI_{Ti,S})$ • $\alpha$ measures how sensitive the app is to cache capacity
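As a worked instance of the power law, the sketch below derives a thread's α from its two standalone MPI measurements and then extrapolates its MPI to another capacity. The function names are hypothetical, and the sample numbers (MPI of 0.45 on the 4MB cache and 0.90 on the 512KB cache) are T2's values from the O(1) ACS example slide.

```python
import math


def alpha(mpi_l, mpi_s, c_l, c_s):
    """alpha = -log base (C_L/C_S) of (MPI_L / MPI_S)."""
    return -math.log(mpi_l / mpi_s) / math.log(c_l / c_s)


def scale_mpi(mpi_old, c_old, c_new, a):
    """Power law: MPI_new = MPI_old * (C_new / C_old) ** (-alpha)."""
    return mpi_old * (c_new / c_old) ** (-a)


# Capacities in KB. An 8x larger cache halves the MPI, so alpha = 1/3.
a = alpha(mpi_l=0.45, mpi_s=0.90, c_l=4096, c_s=512)

# Sanity check: extrapolating from the small cache back to 4 MB
# recovers the measured large-cache MPI.
mpi_at_4mb = scale_mpi(0.90, c_old=512, c_new=4096, a=a)

# MPI the thread would see if it effectively held only half of the 4 MB cache.
mpi_at_2mb = scale_mpi(0.45, c_old=4096, c_new=2048, a=a)
```

A thread with α near zero barely reacts to capacity changes, while a larger α marks a cache-sensitive thread whose MPI rises quickly as its share of the cache shrinks.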
Estimating Cache Contention Effect (cont.) • Estimating cache occupancy for Ti when Ti,Tj share cache
Scheduler Compute Overhead • Computing and sorting all possible schedules has O(n^2) complexity • To arrive at the best schedule, the number of thread migrations might be unbounded
O(1) ACS • Goal of O(1) ACS • O(1) complexity • Limited number of thread migrations to arrive at the best schedule • O(1) ACS algorithm • For each thread (Ti) arrival, compare the MPIsum of 6 cases • 1. Ti on L • 2. Ti on L, migration candidate on L -> S • 3. Ti on L, migration candidate on L <-> migration candidate on S • 4. Ti on S • 5. Ti on S, migration candidate on S -> L • 6. Ti on S, migration candidate on S <-> migration candidate on L • Pick the best schedule among cases 1-6 • Update the migration candidates based on the 2nd best schedule
O(1) ACS Example

MPIs
Thread   MPI on L   MPI on S
T1       0.40       0.50
T2       0.45       0.90
T3       0.60       0.75

State at t0
Thread on L   Thread on S   Candidate on L   Candidate on S   MPI on L   MPI on S   MPIsum
T2            T1            T2               T1               0.45       0.50       0.95

ACS computation at t1
Case   MPIL   MPIS   MPIsum   Schedule
1      1.05   0.50   1.55     T3 on L
2      0.60   1.40   2.00     T3 on L, T2->S
3      1.00   0.90   1.90     T3 on L, T2->S, T1->L
4      0.45   1.25   1.70     T3 on S
5      0.85   0.75   1.60     T3 on S, T1->L
6      0.40   1.65   2.05     T3 on S, T1->L, T2->S

State after t1
Thread on L   Thread on S   Candidate on L   Candidate on S   MPI on L   MPI on S   MPIsum
T2,T3         T1            T3               T1               1.05       0.50       1.55
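The six-case comparison can be sketched as follows, using the MPI values from the example above. Two simplifications are assumptions of this sketch, not the scheduler's actual method: the MPI of threads sharing a cache is taken as the plain sum of their standalone MPIs (which is what the example numbers add up to), and the migration-candidate update is omitted. The function name `o1_acs_arrival` is hypothetical.

```python
def o1_acs_arrival(mpi, on_l, on_s, cand_l, cand_s, new):
    """Rank the six O(1) ACS cases for a newly arrived thread, lowest MPIsum first.

    Returns a list of (MPIsum, case number, threads on L, threads on S).
    """
    def total(l_set, s_set):
        # Additive stand-in for the shared-cache MPI estimate (assumption).
        # A real implementation would keep running sums instead of re-adding.
        return (sum(mpi[t]["L"] for t in l_set)
                + sum(mpi[t]["S"] for t in s_set))

    L, S = set(on_l), set(on_s)
    cases = {
        1: (L | {new}, S),                                       # Ti on L
        2: ((L - {cand_l}) | {new}, S | {cand_l}),               # Ti on L, cand L -> S
        3: ((L - {cand_l}) | {new, cand_s},
            (S - {cand_s}) | {cand_l}),                          # Ti on L, swap candidates
        4: (L, S | {new}),                                       # Ti on S
        5: (L | {cand_s}, (S - {cand_s}) | {new}),               # Ti on S, cand S -> L
        6: ((L - {cand_l}) | {cand_s},
            (S - {cand_s}) | {new, cand_l}),                     # Ti on S, swap candidates
    }
    return sorted(
        (round(total(l, s), 2), n, sorted(l), sorted(s))
        for n, (l, s) in cases.items()
    )


mpi = {
    "T1": {"L": 0.40, "S": 0.50},
    "T2": {"L": 0.45, "S": 0.90},
    "T3": {"L": 0.60, "S": 0.75},
}
ranked = o1_acs_arrival(mpi, on_l={"T2"}, on_s={"T1"},
                        cand_l="T2", cand_s="T1", new="T3")
best = ranked[0]  # case 1: T3 joins T2 on L, MPIsum 1.55
```

Because only six schedules are evaluated per arrival, the work does not grow with the number of possible schedules, and at most two existing threads migrate.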
O(1) ACS Efficacy • Constant computation overhead • Always within 97% of the best schedule
Evaluation Setup • Real-machine measurements on a Xeon 5160 • 4 cores at 3 GHz • 32KB split L1 caches • Each pair of cores shares an L2 cache (4MB and 512KB, respectively) • ACS scheduler • Implemented in Linux 2.6.32 • Enables fast thread migration • Since no APE H/W is available, MPIs are profiled offline with 2% error applied (to account for the effects of set sampling) • Benchmarks • 17 C/C++ SPEC2006 benchmarks • 2T and 4T workloads that cover both cache-sensitive (S) and cache-insensitive (I) benchmarks • Run until the first thread exits
Evaluation Results of ACS (2T) • Performance improvement in all 70 cases • Avg 20% speedup • Demonstrates the efficacy of ACS
Evaluation Results of ACS (4T) • Performance improvement in all 30 cases • Avg 31% speedup • Demonstrates the efficacy of ACS and of the cache contention estimation
Conclusions • We have proposed the ACCESS architecture • Enforces physically asymmetric caches • ACCESS Prediction Engine • Uses shadow tags to conduct online cache simulation • We have also proposed the ACS scheduler • One-time training, using the MPIsum metric to derive the best-performing schedule • Practical approach to estimating shared cache contention effects • O(1) ACS scheduler • Minimizes scheduler computation overhead • Limits thread migrations • Real-platform measurements show >20% speedup over the Linux scheduler