Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Michela Becchi and Patrick Crowley Applied Research Lab Washington University in St. Louis
Context • Chip Multiprocessor (CMP): several processors on the same chip • Supports a high degree of Thread Level Parallelism • Overcomes limitations of wide-issue superscalar uniprocessor systems • Applications w/ limited Instruction Level Parallelism • Complexity, area occupancy and manufacturing costs • Heterogeneous CMP • Coexistence of differing cores and caches
Motivations • Area-complexity tradeoff in CMP • Many “simple” processors/caches • High thread level parallelism (TLP) • Few “sophisticated” processors/caches • High instruction level parallelism (ILP) • Multi-programmed computing environment • Computing needs vary • Across threads • Over time
Problem • Can Heterogeneous CMP be better? • Under which conditions/workload? • How to exploit hardware heterogeneity?
Goal • Heterogeneous CMP more flexible than homogeneous CMP • Thread diversity • Applications w/ “multiphase” behavior • Varying degree of thread level parallelism • Dynamic core assignment and thread migration
Approach • Simulation • Mix of event-based and trace-based simulation • Analysis of a heterogeneous set of benchmarks on two different processors (Alpha 21164 and Alpha 21264) • Simulation of CMP configurations: • Homogeneous vs. heterogeneous • Static vs. two dynamic assignment policies
Hardware setup • Same Instruction Set Architecture • Mono-threaded • Unified L2 cache (4MB/4-way/128B blocks) • Main memory – L2: 2GB/s bus • 2.1 GHz clock
Workload definition • 11 programs from SPEC2000, ref input set • INT: gzip, gcc, crafty, parser, bzip2 • FP: wupwise, swim, mgrid, galgel, equake, lucas • # of running threads: from 1 to 40 • Data points: average across 100 simulations on random workload selections
Benchmark behavior: EV6 vs. EV5 • M5 uni-processor simulations • 2.5B instructions executed • Relative statistics on windows of 1M clock cycles • Results: • IPC (instructions per clock cycle): EV5 from 0.35 to 1.15, EV6 from 0.45 to 1.8 • Branch predictor accuracy: no remarkable variation • L1 cache misses: varying impact across programs and not directly correlated with IPC
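The windowed statistics above can be sketched in a few lines. This is a hypothetical illustration, not the M5 simulator's code: `ipc_per_window` and the sample counts are assumptions, with the 1M-cycle window size taken from the slide.

```python
# Illustrative sketch: computing per-window IPC from instruction counts,
# mirroring the slide's "relative statistics on windows of 1M clock cycles".
# Names and sample values are assumptions for demonstration.

WINDOW = 1_000_000  # clock cycles per measurement window (from the slide)

def ipc_per_window(instr_counts, window=WINDOW):
    """instr_counts[i] = instructions retired during window i."""
    return [n / window for n in instr_counts]

# e.g. an EV5-like core spanning the slide's reported 0.35-1.15 IPC range
samples = [350_000, 700_000, 1_150_000]
print(ipc_per_window(samples))  # [0.35, 0.7, 1.15]
```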
[Figure: IPC ratios, EV6 vs. EV5 — sample workloads of five gzip threads plus one crafty or one lucas thread, run across EV6 and EV5 cores]
CMP systems • Core configurations (100 mm² area) • homogeneous: 4EV6, 20EV5 • heterogeneous: 1EV6&15EV5, 2EV6&10EV5, 3EV6&5EV5 • Assignment policies: • Random and pseudo-best static • Round-Robin • IPC-driven • Thread migration modeled as a context switch • Upper bound on # of clock cycles required to transfer architectural state and refill caches • Comparison metric: speedup with respect to a single EV6
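The comparison metric can be made concrete with a minimal sketch (not the paper's simulator): speedup of a CMP configuration relative to a single EV6, using aggregate IPC as the throughput proxy. Function names and the sample IPC values are illustrative assumptions.

```python
# Hypothetical sketch of the slide's comparison metric: aggregate
# throughput of a configuration divided by single-EV6 throughput.

def aggregate_ipc(per_core_ipc):
    # total instructions retired per cycle across all cores
    return sum(per_core_ipc)

def speedup_vs_ev6(per_core_ipc, ev6_baseline_ipc):
    # speedup with respect to one EV6 core
    return aggregate_ipc(per_core_ipc) / ev6_baseline_ipc

# e.g. 2 EV6s at 1.8 IPC plus 10 EV5s at 0.9 IPC vs. one EV6 at 1.8 IPC
print(round(speedup_vs_ev6([1.8] * 2 + [0.9] * 10, 1.8), 2))  # 7.0
```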
Homogeneous configurations [Figure: speedup under low vs. high thread-level parallelism]
Static assignment • Tasks statically assigned to cores • 2 flavors: • Random • Pseudo-best • Runtime characteristics of tasks known in advance • Term of comparison • Heuristic: • Sort tasks based on their IPC on the two cores • Assign to each EV6 twice as many threads as to each EV5 • Drawbacks: • Idle EV6s remain idle (unless there are unassigned threads) • Slow threads on EV5s penalize overall performance
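The pseudo-best heuristic above can be sketched as follows. This is a hedged reconstruction, not the paper's code: the data layout, function name, and the way slot counts are scaled to the thread count are assumptions; only the ranking by per-core IPC and the 2:1 EV6/EV5 share come from the slide.

```python
# Hypothetical sketch of the pseudo-best static heuristic: rank threads
# by their EV6/EV5 IPC ratio and give each EV6 twice the share of an EV5.

def pseudo_best_static(threads, n_ev6, n_ev5):
    """threads: list of (name, ipc_ev6, ipc_ev5) tuples.
    Returns (threads_for_ev6s, threads_for_ev5s)."""
    # threads that benefit most from EV6 come first
    ranked = sorted(threads, key=lambda t: t[1] / t[2], reverse=True)
    ev6_slots = 2 * n_ev6            # each EV6 gets double an EV5's share
    total_slots = ev6_slots + n_ev5
    n_to_ev6 = round(len(threads) * ev6_slots / total_slots)
    return ranked[:n_to_ev6], ranked[n_to_ev6:]

# demo with made-up per-benchmark IPC values
threads = [("gzip", 1.8, 0.9), ("lucas", 1.1, 1.0),
           ("swim", 1.3, 1.0), ("gcc", 1.6, 0.8)]
ev6, ev5 = pseudo_best_static(threads, 1, 2)
print([t[0] for t in ev6])  # ['gzip', 'gcc']
```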
Quality of static assignments Random: homogeneous better than heterogeneous Best: requires a priori knowledge of benchmark characteristics
Round Robin assignment • Dynamic assignment policy • Periodic rotation of threads across cores: • swap_period • # EV6s < # EV5s ⇒ several swap periods for a complete rotation • Pros: • EV6s never idle • Better load balancing • Cons: • Runtime behavior of threads ignored
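The periodic rotation can be sketched with a one-step rotate of the thread-to-core mapping; repeating it every swap_period eventually moves every thread through every core, and with fewer EV6s than EV5s a thread waits several swap periods between EV6 visits. This is an illustrative sketch, not the simulator's implementation; the deque layout is an assumption.

```python
# Hypothetical sketch of round-robin rotation: index i of the deque is
# core i, and each swap_period the mapping rotates by one position.
from collections import deque

def rotate(assignment):
    """Rotate the thread-to-core mapping by one core per swap period."""
    assignment.rotate(1)
    return assignment

cores = deque(["t0", "t1", "t2", "t3"])  # say core 0 is the only EV6
rotate(cores)
print(list(cores))  # ['t3', 't0', 't1', 't2'] -- t3 now runs on the EV6
```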
Round Robin vs. static assignment • RR outperforms static assignment across all degrees of TLP • RR w/ 10 EV5s ~ homogeneous w/ 20 EV5s at high degrees of TLP
IPC Driven assignment • Dynamic assignment policy • Goal: assign to EV6s jobs having a greater speedup on them • EV6/EV5 IPC ratio as control metric • Three causes of migration • Learning (forced migration) • EV6 core becoming idle • Variation in IPC ratios (IPC-driven migration)
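A minimal sketch of the selection step in the IPC-driven policy, assuming the slide's three triggers: threads with no measured ratio are forced onto an EV6 first (learning), and otherwise the threads with the highest EV6/EV5 IPC ratio claim the EV6s. The function name and dict layout are assumptions, not the paper's API.

```python
# Hypothetical sketch of IPC-driven core selection: unknown threads are
# forced onto EV6s to learn their ratio, then the highest-ratio threads
# keep the EV6s; re-running this after each measurement window models
# migration on idle EV6s and on ratio changes.

def pick_for_ev6(ratios, n_ev6):
    """ratios: dict thread -> measured EV6/EV5 IPC ratio (None = unknown).
    Returns the threads to place on the n_ev6 EV6 cores."""
    unknown = [t for t, r in ratios.items() if r is None]  # learning
    known = sorted((t for t, r in ratios.items() if r is not None),
                   key=lambda t: ratios[t], reverse=True)
    return (unknown + known)[:n_ev6]

print(pick_for_ev6({"gzip": 1.9, "lucas": 1.1, "gcc": None}, 2))
```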
Dynamic assignments IPC-driven assignment better for high TLP Limited performance increase may not justify complicated schemes
Components of Speedup • # threads ≤ # cores: effect of load balancing independent of the dynamic policy • # threads ≤ # EV6s: no reassignment • # threads ~ # cores: no load balancing • # threads > # EV6s: load balancing
Conclusions • Analysis • Multi-programmed computing environment (from SPEC2000) • Two homogeneous and three heterogeneous CMP configurations (two core types) • Two static and two dynamic assignment policies • Dynamic assignment policy on heterogeneous CMP configurations • accommodates a broad range of degrees of thread parallelism • outperforms static assignment by 20% to 40% on average (80% in extreme cases) • a simple Round Robin policy can suffice, especially in case of a limited degree of thread level parallelism
Questions Thanks • Dr. Patrick Crowley • Applied Research Lab and Storage Based Supercomputing Group at Washington University in St. Louis • Anonymous Reviewers • YOU ALL!
Forced migrations • Variation of IPC as triggering factor • Initially • According to “program phases” • Different programs have different phase durations • Phase changes observed on different cores at the same time
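One way to realize "variation of IPC as triggering factor" is a drift test against a thread's running average. This is purely an illustrative sketch: the 20% threshold and the function name are assumptions, not values from the paper.

```python
# Hypothetical phase-change trigger: flag a migration candidate when a
# thread's latest windowed IPC drifts from its running average by more
# than a threshold (20% here is an assumed value, not from the paper).

def phase_changed(ipc_history, new_ipc, threshold=0.2):
    if not ipc_history:
        return False  # nothing learned yet; handled by forced migration
    avg = sum(ipc_history) / len(ipc_history)
    return abs(new_ipc - avg) / avg > threshold

print(phase_changed([1.0, 1.05, 0.95], 0.6))   # True
print(phase_changed([1.0, 1.05, 0.95], 1.02))  # False
```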