Explore the benefits and challenges of running programs on Heterogeneous Chip Multi-Processors, optimizing performance through a Latency-Aware Asymmetry CMP scheduling algorithm with Last Level Cache support.
Processes Scheduling on Heterogeneous Multi-core Architecture with Hardware Support
Shouqing Hao
Institute of Computing Technology, Chinese Academy of Sciences
haoshouqing@ict.ac.cn
Contents
  Introduction
  Hardware support for LLC-miss latency
  LA-ACMP scheduling algorithm
  Evaluation and analysis
Introduction
  Heter-CMP: Heterogeneous Chip Multi-Processor
    Composed of some big cores and some small cores
    Big cores: large area, high power, high performance
      Suited to CPU-bound programs, serial programs, ...
    Small cores: small area, low power, low performance
      Suited to memory-bound programs, parallel programs, ...
  Advantages
    Make good use of chip resources
    Reduce wasted power and performance
  Challenges
    Identify applications' behavior during execution
    Schedule each program to the proper core
Hardware Support (1)
  Identify programs' behavior
    Last level cache (LLC) miss latency
      An LLC miss leads to a memory access
      Memory accesses incur high latency
      This reduces a program's efficiency during execution
      The program cannot make full use of the core's performance
  Scheduling rules
    Programs with high LLC miss latency should be scheduled to small cores
    Programs with low LLC miss latency should be scheduled to big cores
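The two scheduling rules can be sketched as a simple ranking. This is a minimal illustration, not the scheduler from the slides: the function name `assign_core_types`, the program names, and the idea of giving the `n_big` lowest-latency programs to big cores are all assumptions made for the example.

```python
def assign_core_types(latencies, n_big):
    """Map each program to 'big' or 'small' by its LLC miss latency.

    latencies: dict of program name -> measured LLC miss latency (hypothetical units)
    n_big:     number of big cores available (assumption: one program per big core)
    """
    # Rank programs from lowest to highest LLC miss latency.
    ranked = sorted(latencies, key=latencies.get)
    # Low-latency programs go to big cores, the rest to small cores.
    return {name: ("big" if i < n_big else "small")
            for i, name in enumerate(ranked)}


# Hypothetical measurements for four programs on a 1-big, 3-small platform.
placement = assign_core_types(
    {"prog_a": 900, "prog_b": 50, "prog_c": 400, "prog_d": 700}, n_big=1
)
```

Here `prog_b`, which has the lowest latency, is the only program placed on the big core.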
Hardware Support (2)
  Identify programs' behavior
    Last level cache (LLC) miss latency
  Mechanism
    LLC miss delay is the period between a miss request and its miss response
    Un-overlapped, overlapped
    Record the LLC miss latency of each core, with hardware support
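The slide names the overlapped and un-overlapped cases without saying how each is counted. One possible reading, sketched below, is that overlapping misses should not be charged twice: the sketch computes both the plain sum of per-miss delays and the length of the merged request-to-response intervals. The function name `miss_latency` and the cycle numbers are hypothetical, and the real counters are in hardware.

```python
def miss_latency(intervals):
    """Return (summed, merged) LLC miss delay in cycles.

    intervals: list of (request_cycle, response_cycle) pairs for one core.
    summed:    every miss charged in full, even when misses overlap.
    merged:    overlapping misses counted once (length of the interval union).
    """
    summed = sum(resp - req for req, resp in intervals)

    merged = 0
    cur_start = cur_end = None
    for req, resp in sorted(intervals):
        if cur_end is None or req > cur_end:
            # A new, non-overlapping miss window starts; close the previous one.
            if cur_end is not None:
                merged += cur_end - cur_start
            cur_start, cur_end = req, resp
        else:
            # This miss overlaps the current window; extend it if needed.
            cur_end = max(cur_end, resp)
    if cur_end is not None:
        merged += cur_end - cur_start
    return summed, merged


# Two overlapping misses followed by an isolated one (hypothetical cycles).
summed, merged = miss_latency([(0, 100), (50, 150), (300, 380)])
```

For this input the summed delay is 280 cycles while the merged delay is 230 cycles, because the first two misses share 50 cycles.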
Hardware Support (3)
  Implemented on Godson-3A
  Record LLC miss requests and responses for each core, with hardware support
LA-ACMP Schedule Algorithm (1)
  LA-ACMP: Latency-Aware Asymmetry CMP
  Identify heterogeneity of cores
    Based on Linux kernel 2.6.18
    Calculate the BogoMIPS value of each core to evaluate each core's performance
  Workload assignment balance
    Using the Scaled Load method
    L = N/P: each core's scaled load
      N: number of workloads in the core's queue
      P: the processor's performance
    If Lmax - Lmin <= 1, the workload assignment is balanced
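The Scaled Load balance test can be written out directly from the definitions above. In this sketch the function names and the relative performance values (2 for a fast core, 1 for a slow core) are illustrative assumptions; in the slides, P comes from each core's BogoMIPS value.

```python
def scaled_loads(queue_lengths, performance):
    """L = N / P for each core."""
    return [n / p for n, p in zip(queue_lengths, performance)]


def is_balanced(queue_lengths, performance):
    """The assignment is balanced when Lmax - Lmin <= 1."""
    loads = scaled_loads(queue_lengths, performance)
    return max(loads) - min(loads) <= 1


# Assumed relative performance: one fast core and three slow cores.
perf = [2, 1, 1, 1]

balanced = is_balanced([4, 2, 2, 2], perf)    # loads 2, 2, 2, 2
unbalanced = is_balanced([6, 1, 1, 1], perf)  # loads 3, 1, 1, 1
```

With these numbers the fast core can hold twice as many queued workloads as a slow core before the assignment counts as unbalanced.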
LA-ACMP Schedule Algorithm (2)
  LLC-delay buffer
    Attach an LLC-delay buffer to each run-queue
    Save each task's LLC miss latency in it
LA-ACMP Schedule Algorithm (3)
  Update the LLC-delay buffer
    When a thread starts running, clear its LLC-delay value
    When the thread exhausts its time slice, save its LLC-delay value
    When a thread migrates from queue A to queue B, migrate its LLC-delay value as well
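The three update rules can be modelled with a small user-space sketch. The class `RunQueue` and its method names are hypothetical and only mirror the rules on the slide; the actual implementation lives in the Linux 2.6.18 scheduler and reads the delay from hardware counters.

```python
class RunQueue:
    """A run-queue with an attached LLC-delay buffer (task id -> saved delay)."""

    def __init__(self):
        self.llc_delay = {}

    def add(self, tid):
        # A newly queued task starts with no recorded delay.
        self.llc_delay.setdefault(tid, 0)

    def on_run(self, tid):
        # Rule 1: clear the value when the thread starts running.
        self.llc_delay[tid] = 0

    def on_slice_expired(self, tid, measured_delay):
        # Rule 2: save the delay measured during the time slice.
        self.llc_delay[tid] = measured_delay

    def migrate_to(self, tid, other):
        # Rule 3: the saved delay travels with the thread to the other queue.
        other.llc_delay[tid] = self.llc_delay.pop(tid)


queue_a, queue_b = RunQueue(), RunQueue()
queue_a.add(1)
queue_a.on_run(1)
queue_a.on_slice_expired(1, measured_delay=1200)  # hypothetical cycle count
queue_a.migrate_to(1, queue_b)
```

After the migration, queue B holds the delay value that was saved on queue A, so the scheduler does not lose the thread's history when it moves.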
LA-ACMP Schedule Algorithm (4)
  LA-ACMP algorithm
    Executed when load balance is evaluated
    Does not disturb the balance
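The slide does not spell out the algorithm's steps. One way to satisfy both stated constraints is to exchange tasks pairwise, so queue lengths (and therefore scaled loads) stay the same. The sketch below is an assumption along those lines, not the authors' implementation; `swap_by_latency` and the task ids are hypothetical, and task ids are assumed unique across queues.

```python
def swap_by_latency(big_queue, small_queue):
    """Swap one task pair so the big core sheds its highest-delay task.

    Each queue is a dict of task id -> saved LLC delay.
    Returns True if a swap happened. Queue lengths never change,
    so a balanced assignment stays balanced.
    """
    if not big_queue or not small_queue:
        return False
    worst_on_big = max(big_queue, key=big_queue.get)
    best_on_small = min(small_queue, key=small_queue.get)
    # Swap only if it actually moves a higher-delay task off the big core.
    if big_queue[worst_on_big] <= small_queue[best_on_small]:
        return False
    small_queue[worst_on_big] = big_queue.pop(worst_on_big)
    big_queue[best_on_small] = small_queue.pop(best_on_small)
    return True


big = {1: 900, 2: 100}    # hypothetical delays on a big core's queue
small = {3: 50, 4: 700}   # hypothetical delays on a small core's queue
first = swap_by_latency(big, small)
second = swap_by_latency(big, small)
```

The first call moves task 1 to the small core and task 3 to the big core; the second call finds nothing worth swapping and leaves both queues alone.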
Evaluation and analysis (1)
  Platform
    Godson-3A-heter
    Four cores: one runs at 1 GHz, three run at 500 MHz
    Uses asynchronous FIFOs for synchronization
  Benchmark
    SPEC CPU2000
Evaluation and analysis (2)
  Applications' execution speedup
    Compared to the original OS
    LLC miss rate: 15.4% performance improvement
    LLC miss delay: 19.8% performance improvement
  Application groups with higher heterogeneity get a higher performance improvement
    The third group has the highest improvement
    The second group has the lowest improvement