Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Michela Becchi and Patrick Crowley Applied Research Lab Washington University in St. Louis
Context • Chip Multiprocessor (CMP): several processors on the same chip • Supports a high degree of Thread Level Parallelism • Overcomes limitations of wide-issue superscalar uniprocessor systems • Applications w/ limited Instruction Level Parallelism • Complexity, area occupancy and manufacturing costs • Heterogeneous CMP • Coexistence of differing cores and caches
Motivations • Area-complexity tradeoff in CMP • Many “simple” processors/caches • High thread level parallelism (TLP) • Few “sophisticated” processors/caches • High instruction level parallelism (ILP) • Multi-programmed computing environment • Computing needs vary • Across threads • Over time
Problem • Can Heterogeneous CMP be better? • Under which conditions/workload? • How to exploit hardware heterogeneity?
Goal • Heterogeneous CMP more flexible than homogeneous CMP • Thread diversity • Applications w/ “multiphase” behavior • Varying degree of thread level parallelism • Dynamic core assignment and thread migration
Approach • Simulation • Mix of event-based and trace-based simulation • Analysis of a heterogeneous set of benchmarks on two different processors (Alpha 21164 and Alpha 21264) • Simulation of CMP configurations: • Homogeneous vs. heterogeneous • Static vs. two dynamic assignment policies
Hardware setup • Same Instruction Set Architecture • Mono-threaded • Unified L2 cache (4MB/4-way/128B blocks) • Main memory – L2: 2GB/s bus • 2.1 GHz clock
Workload definition • 11 programs from SPEC2000, ref input set • INT: gzip, gcc, crafty, parser, bzip2 • FP: wupwise, swim, mgrid, galgel, equake, lucas • # of running threads: from 1 to 40 • Data points: average across 100 simulations on random workload selections
Benchmark behavior: EV6 vs. EV5 • M5 uni-processor simulations • 2.5B instructions executed • Relative statistics on windows of 1M clock cycles • Results: • IPC (instructions per clock cycle): EV5 from 0.35 to 1.15, EV6 from 0.45 to 1.8 • Branch predictor accuracy: no remarkable variation • L1 cache misses: varying impact across programs and not directly correlated with IPC
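The windowed statistics above can be sketched in a few lines. This is a hypothetical illustration, not the M5 simulator's code: `ipc_per_window` and the sample counts are assumptions, with the 1M-cycle window size taken from the slide.

```python
# Illustrative sketch: computing per-window IPC from instruction counts,
# mirroring the slide's "relative statistics on windows of 1M clock cycles".
# Names and sample values are assumptions for demonstration.

WINDOW = 1_000_000  # clock cycles per measurement window (from the slide)

def ipc_per_window(instr_counts, window=WINDOW):
    """instr_counts[i] = instructions retired during window i."""
    return [n / window for n in instr_counts]

# e.g. an EV5-like core spanning the slide's reported 0.35-1.15 IPC range
samples = [350_000, 700_000, 1_150_000]
print(ipc_per_window(samples))  # [0.35, 0.7, 1.15]
```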
[Figure: IPC ratios, EV6 vs. EV5 — sample workloads of five gzip threads plus one crafty or one lucas thread, run across EV6 and EV5 cores]
CMP systems • Core configurations (100 mm² area) • homogeneous: 4EV6, 20EV5 • heterogeneous: 1EV6&15EV5, 2EV6&10EV5, 3EV6&5EV5 • Assignment policies: • Random and pseudo-best static • Round-Robin • IPC-driven • Thread migration modeled as a context switch • Upper bound on # of clock cycles required to transfer architectural state and refill caches • Comparison metric: speedup with respect to a single EV6
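The comparison metric can be made concrete with a minimal sketch (not the paper's simulator): speedup of a CMP configuration relative to a single EV6, using aggregate IPC as the throughput proxy. Function names and the sample IPC values are illustrative assumptions.

```python
# Hypothetical sketch of the slide's comparison metric: aggregate
# throughput of a configuration divided by single-EV6 throughput.

def aggregate_ipc(per_core_ipc):
    # total instructions retired per cycle across all cores
    return sum(per_core_ipc)

def speedup_vs_ev6(per_core_ipc, ev6_baseline_ipc):
    # speedup with respect to one EV6 core
    return aggregate_ipc(per_core_ipc) / ev6_baseline_ipc

# e.g. 2 EV6s at 1.8 IPC plus 10 EV5s at 0.9 IPC vs. one EV6 at 1.8 IPC
print(round(speedup_vs_ev6([1.8] * 2 + [0.9] * 10, 1.8), 2))  # 7.0
```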
Homogeneous configurations [Figure: speedup under low vs. high thread-level parallelism]
Static assignment • Tasks statically assigned to cores • 2 flavors: • Random • Pseudo-best • Runtime characteristics of tasks known in advance • Term of comparison • Heuristic: • Sort tasks based on their IPC on the two cores • Assign to each EV6 twice as many threads as to each EV5 • Drawbacks: • Idle EV6s remain idle (unless there are unassigned threads) • Slow threads on EV5s penalize overall performance
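The pseudo-best heuristic above can be sketched as follows. This is a hedged reconstruction, not the paper's code: the data layout, function name, and the way slot counts are scaled to the thread count are assumptions; only the ranking by per-core IPC and the 2:1 EV6/EV5 share come from the slide.

```python
# Hypothetical sketch of the pseudo-best static heuristic: rank threads
# by their EV6/EV5 IPC ratio and give each EV6 twice the share of an EV5.

def pseudo_best_static(threads, n_ev6, n_ev5):
    """threads: list of (name, ipc_ev6, ipc_ev5) tuples.
    Returns (threads_for_ev6s, threads_for_ev5s)."""
    # threads that benefit most from EV6 come first
    ranked = sorted(threads, key=lambda t: t[1] / t[2], reverse=True)
    ev6_slots = 2 * n_ev6            # each EV6 gets double an EV5's share
    total_slots = ev6_slots + n_ev5
    n_to_ev6 = round(len(threads) * ev6_slots / total_slots)
    return ranked[:n_to_ev6], ranked[n_to_ev6:]

# demo with made-up per-benchmark IPC values
threads = [("gzip", 1.8, 0.9), ("lucas", 1.1, 1.0),
           ("swim", 1.3, 1.0), ("gcc", 1.6, 0.8)]
ev6, ev5 = pseudo_best_static(threads, 1, 2)
print([t[0] for t in ev6])  # ['gzip', 'gcc']
```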
Quality of static assignments Random: homogeneous better than heterogeneous Best: requires a priori knowledge of benchmark characteristics
Round Robin assignment • Dynamic assignment policy • Periodic rotation of threads across cores: • swap_period • # EV6s < # EV5s ⇒ several swap periods for a complete rotation • Pros: • EV6s never idle • Better load balancing • Cons: • Runtime behavior of threads ignored
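The periodic rotation can be sketched with a one-step rotate of the thread-to-core mapping; repeating it every swap_period eventually moves every thread through every core, and with fewer EV6s than EV5s a thread waits several swap periods between EV6 visits. This is an illustrative sketch, not the simulator's implementation; the deque layout is an assumption.

```python
# Hypothetical sketch of round-robin rotation: index i of the deque is
# core i, and each swap_period the mapping rotates by one position.
from collections import deque

def rotate(assignment):
    """Rotate the thread-to-core mapping by one core per swap period."""
    assignment.rotate(1)
    return assignment

cores = deque(["t0", "t1", "t2", "t3"])  # say core 0 is the only EV6
rotate(cores)
print(list(cores))  # ['t3', 't0', 't1', 't2'] -- t3 now runs on the EV6
```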
Round Robin vs. static assignment • RR outperforms static assignment across all degrees of TLP • RR w/ 10 EV5s ~ homogeneous w/ 20 EV5s at high degrees of TLP
IPC Driven assignment • Dynamic assignment policy • Goal: assign to EV6s jobs having a greater speedup on them • EV6/EV5 IPC ratio as control metric • Three causes of migration • Learning (forced migration) • EV6 core becoming idle • Variation in IPC ratios (IPC-driven migration)
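A minimal sketch of the selection step in the IPC-driven policy, assuming the slide's three triggers: threads with no measured ratio are forced onto an EV6 first (learning), and otherwise the threads with the highest EV6/EV5 IPC ratio claim the EV6s. The function name and dict layout are assumptions, not the paper's API.

```python
# Hypothetical sketch of IPC-driven core selection: unknown threads are
# forced onto EV6s to learn their ratio, then the highest-ratio threads
# keep the EV6s; re-running this after each measurement window models
# migration on idle EV6s and on ratio changes.

def pick_for_ev6(ratios, n_ev6):
    """ratios: dict thread -> measured EV6/EV5 IPC ratio (None = unknown).
    Returns the threads to place on the n_ev6 EV6 cores."""
    unknown = [t for t, r in ratios.items() if r is None]  # learning
    known = sorted((t for t, r in ratios.items() if r is not None),
                   key=lambda t: ratios[t], reverse=True)
    return (unknown + known)[:n_ev6]

print(pick_for_ev6({"gzip": 1.9, "lucas": 1.1, "gcc": None}, 2))
```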
Dynamic assignments IPC-driven assignment better for high TLP Limited performance increase may not justify complicated schemes
Components of Speedup • # threads ≤ # cores: effect of load balancing independent of the dynamic policy • # threads ≤ # EV6s: no reassignment • # threads ~ # cores: no load balancing • # threads > # EV6s: load balancing
Conclusions • Analysis • Multi-programmed computing environment (from SPEC2000) • Two homogeneous and three heterogeneous CMP configurations (two core types) • Two static and two dynamic assignment policies • Dynamic assignment policy on heterogeneous CMP configurations • accommodates a broad range of degrees of thread parallelism • outperforms static assignment by 20% to 40% on average (80% in extreme cases) • a simple Round Robin policy can suffice, especially in case of a limited degree of thread level parallelism
Questions Thanks • Dr. Patrick Crowley • Applied Research Lab and Storage Based Supercomputing Group at Washington University in St. Louis • Anonymous Reviewers • YOU ALL!
Forced migrations • Variation of IPC as triggering factor • Initially • According to “program phases” • Different programs have different phase durations • Phase changes observed on different cores at the same time
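One way to realize "variation of IPC as triggering factor" is a drift test against a thread's running average. This is purely an illustrative sketch: the 20% threshold and the function name are assumptions, not values from the paper.

```python
# Hypothetical phase-change trigger: flag a migration candidate when a
# thread's latest windowed IPC drifts from its running average by more
# than a threshold (20% here is an assumed value, not from the paper).

def phase_changed(ipc_history, new_ipc, threshold=0.2):
    if not ipc_history:
        return False  # nothing learned yet; handled by forced migration
    avg = sum(ipc_history) / len(ipc_history)
    return abs(new_ipc - avg) / avg > threshold

print(phase_changed([1.0, 1.05, 0.95], 0.6))   # True
print(phase_changed([1.0, 1.05, 0.95], 1.02))  # False
```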