Exploiting Heterogeneous Architectures • Alex Beutel, John Dickerson, Vagelis Papalexakis • 15-740/18-740 Computer Architecture, Fall 2012 • In-Class Discussion, Tuesday 10/16/2012
Heterogeneous Hardware Systems • Multiple-CPU, single-GPU systems • Asymmetric Multicore Processors (AMPs) • Combination of general-purpose big and small cores • Trade-off between performance and power consumption • Usually "on-chip" AMPs • Single-ISA architectures • Like AMPs, but all cores share the same instruction set • "Small" cores can support in-order execution • "Big" cores can support out-of-order execution
Outline • BIS: José A. Joao et al., "Bottleneck identification and scheduling in multithreaded applications," in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '12). • YinYang: Ting Cao et al., "The yin and yang of power and performance for asymmetric hardware and managed software," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). • PIE: Kenzo Van Craeynest et al., "Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12). • SMS: Rachata Ausavarungnirun et al., "Staged memory scheduling: achieving high performance and scalability in heterogeneous systems," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12).
Bottleneck Identification and Scheduling in Multithreaded Applications • Focuses on the problem of removing bottlenecks • A big problem in many systems: bottlenecks prevent scaling to many threads • Critical sections, pipeline stages, and barriers are a few examples • On ACMPs, previous research shows that "big" cores can be used to accelerate serializing bottlenecks • But with limited fine-grained adaptivity and generality • Authors propose BIS • Key insight: the costliest bottlenecks are those that make other threads wait longest • Software and hardware cooperate to detect bottlenecks • Accelerates them using one or more "big" cores of the ACMP
Bottlenecks • Amdahl’s serial portion • Critical sections • Barriers • Pipeline stages
Bottleneck Identification • Software is used to identify bottlenecks • Instructions such as BottleneckCall, BottleneckReturn, and BottleneckWait give feedback to the BIS system • The BIS system keeps track of bottlenecks and their thread waiting cycles (TWC) (with optimizations) • (see the annotation sketch below)
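A minimal sketch of how these annotations might wrap a lock-protected critical section. The new instructions are ISA extensions, so they appear here only as comments around ordinary C; the bottleneck ID `BID_COUNTER` and the lock idiom are illustrative assumptions, not code from the paper:

```c
/* Hedged sketch: annotating a critical section for BIS-style
 * bottleneck identification. BottleneckCall/Return/Wait are the
 * paper's proposed instructions; their placement here is illustrative. */

#define BID_COUNTER 7   /* hypothetical static ID for this bottleneck */

void increment_shared_counter(volatile int *lock, long *counter)
{
    /* While the lock is held elsewhere, the thread would execute
     * BottleneckWait BID_COUNTER, letting hardware accumulate the
     * thread waiting cycles (TWC) attributed to this bottleneck. */
    while (__sync_lock_test_and_set(lock, 1)) {
        /* BottleneckWait BID_COUNTER */
    }

    /* BottleneckCall BID_COUNTER, target
     * -- marks entry; hardware may ship this region to a big core */
    (*counter)++;   /* the serialized work */
    /* BottleneckReturn BID_COUNTER -- marks exit */

    __sync_lock_release(lock);
}
```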
Scheduling in Multithreaded Applications (with ACMPs) • Take the N bottlenecks with the highest TWC and accelerate them • Many acceleration methods are possible; the authors focus on assigning bottlenecks to the bigger cores of the ACMP • The worst bottlenecks are sent from small cores to a big core and kept in a Scheduling Buffer • Many edge cases are handled, such as avoiding false serialization • Also extended to systems with multiple large cores • (a selection sketch follows)
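A rough software analogue of the selection step: rank bottlenecks by accumulated TWC and mark the top N for big-core acceleration. The paper implements this in hardware (a Bottleneck Table plus the Scheduling Buffer); the structure names, table layout, and sort-based selection here are assumptions for illustration:

```c
/* Hedged sketch of BIS-style selection: pick the N bottlenecks with
 * the largest thread-waiting-cycle (TWC) counts and flag them for
 * acceleration on a big core. Names and sizes are illustrative. */
#include <stdlib.h>

struct bottleneck {
    int      bid;        /* bottleneck ID */
    unsigned twc;        /* accumulated thread waiting cycles */
    int      accelerate; /* 1 => run on a big core when next executed */
};

static int by_twc_desc(const void *a, const void *b)
{
    unsigned ta = ((const struct bottleneck *)a)->twc;
    unsigned tb = ((const struct bottleneck *)b)->twc;
    return (ta < tb) - (ta > tb);   /* descending by TWC */
}

void pick_top_bottlenecks(struct bottleneck *tbl, int count, int n_big_slots)
{
    qsort(tbl, count, sizeof *tbl, by_twc_desc);
    for (int i = 0; i < count; i++)
        tbl[i].accelerate = (i < n_big_slots);
}
```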
The Yin/Yang Metaphor • Hardware: heterogeneous multi-cores balance power and performance • Users increasingly care about performance per energy (PPE) rather than absolute performance • Software: the move toward managed languages with virtual machines, like Java (JVM), C# (.NET), and JavaScript • Yang of heterogeneous hardware: exposed hardware adds complexity • Yin of managed languages: the VM handles that exposed complexity for the programmer • Yang of VM languages: overhead • Yin of heterogeneous hardware: small cores can alleviate that overhead
Yin and Yang of Power and Performance: Overview • Virtual machines consume substantial extra computation time and energy (~40%) • Java VM service breakdown (~37% total): • ~10% garbage collection • ~12% JIT compilation • ~15% executing untouched (cold) instructions via the interpreter • Paper: exploits the parallelism, asynchrony, non-criticality, and hardware sensitivity of the GC, JIT, and interpreter tasks by placing each on the right type of core
Yin and Yang of Power and Performance: Overview • Garbage collection: asynchronous, parallel, and does not benefit from a high clock rate; use low-power cores with high memory bandwidth • JIT: asynchronous, somewhat parallel, and non-critical; small cores are powerful enough • Interpreter: on the critical path and not asynchronous; it inherits the application's parallelism; again, low-power cores generally suffice • (a thread-placement sketch follows)
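One way such placement could be expressed in software today, as opposed to the paper's VM-internal scheduling: pin a VM service thread to the small cores via Linux CPU affinity. This is a sketch under stated assumptions, not the paper's mechanism; in particular, which core IDs are "small" is platform-specific, and cores 4-7 here are an assumption:

```c
/* Hedged sketch: pinning a VM service thread (e.g., a GC worker) to
 * assumed small cores on Linux. Core IDs 4-7 are illustrative. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_to_small_cores(pthread_t gc_thread)
{
    cpu_set_t small;
    CPU_ZERO(&small);
    for (int cpu = 4; cpu <= 7; cpu++)   /* assumed small-core IDs */
        CPU_SET(cpu, &small);
    /* GC is parallel, asynchronous, and memory-bound, so low-frequency
     * cores with good memory bandwidth serve it well (per the paper). */
    return pthread_setaffinity_np(gc_thread, sizeof(small), &small);
}
```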
YinYang Experimental Evaluation • Power: the authors measure the power overhead of VM services; the VM does consume significant power, making it a good candidate for heterogeneous systems • Performance per energy (PPE): most results are reported this way rather than in absolute terms • Moving the JIT and GC to lower-clocked cores increases PPE by 9-13% • GC is very memory-bound, so it does great on low-power cores • JIT is less memory-bound, but embarrassingly parallel, so it still does well on low-power cores • The interpreter's PPE improvement is less stark, but still present
PIE: Performance Impact Estimation • A heterogeneous multi-core architecture features big, powerful, power-hungry core(s) and small, weak, energy-efficient core(s) • How do we map workloads onto the appropriate cores to maximize "speed per energy"? • PIE is a static or dynamic scheduler that takes both memory-level parallelism (MLP) and instruction-level parallelism (ILP) into account to predict how well a job will do on each type of core • Static: schedule each job once for its entire duration • Dynamic: push parts of jobs to the appropriate cores as they run
PIE (contd.) • Motivation: intuition is wrong • Intuition: "Compute-heavy jobs should go on the heavyweight 'big' cores, while memory-heavy jobs can do well enough on 'small' cores." • The authors' experiments show instead that big cores do well on MLP-intensive jobs, while small cores do well on ILP-intensive jobs
More PIE • One way to schedule is to randomly sample job-to-core mappings, learn, and choose the best • But sampling has high overhead! • PIE instead estimates a job's performance on core type B while the job is running on core type A • The estimate aggregates: • A base instruction-level parallelism (ILP) component • A memory-level parallelism (MLP) component that is a function of the core's microarchitecture and the job's observed cache misses • The scheduler takes the estimates of the job's performance on each core type and decides where to place the job • (a simplified estimator is sketched below)
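To make the estimation step concrete, here is a deliberately simplified, hedged version of the idea: split measured CPI into a compute component and a memory-stall component, then rescale each by how much better the other core type exploits ILP and MLP. The paper's actual estimators are more detailed; the function name, struct, and ratio parameters here are illustrative assumptions:

```c
/* Hedged, simplified sketch of a PIE-style cross-core CPI estimate.
 * The linear rescaling model and the ratio inputs are illustrative
 * simplifications of the paper's estimators. */

struct perf_sample {
    double cpi_base;   /* non-memory (compute) CPI, measured on small core */
    double cpi_mem;    /* memory-stall CPI, measured on small core */
};

/* Predict CPI on a big core from a sample gathered on a small core.
 * A big out-of-order core exploits more ILP and more MLP, shrinking
 * both CPI components; ratios > 1 encode that advantage. */
double estimate_cpi_on_big(struct perf_sample s,
                           double ilp_ratio,   /* big ILP / small ILP */
                           double mlp_ratio)   /* big MLP / small MLP */
{
    return s.cpi_base / ilp_ratio + s.cpi_mem / mlp_ratio;
}
```

The scheduler would compare such estimates across jobs and map each job where its predicted slowdown from running on a small core is smallest.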
Multiple CPU Single GPU Systems • Memory becomes the critical resource • GPU access patterns differ vastly from CPU ones • GPUs generate significantly more requests • A GPU spawns many concurrent threads • Increased contention between the GPU and the CPUs • Need a memory controller that: • Schedules the memory accesses • Ensures fairness • Is scalable and easy to implement • Current approaches are not robust in the presence of both a GPU and CPUs
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems • Proposes a multi-stage approach to application-aware memory scheduling • Handles interference between bandwidth-demanding and non-demanding applications (e.g., GPU and CPU, respectively) • Simplified hardware implementation, thanks to decoupling the memory controller into multiple stages • Improves CPU performance without degrading GPU performance • The authors test many settings (CPU-only, GPU-only, CPU & GPU) • and compare against existing approaches
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems • Sophisticated approaches that prioritize memory accesses require overly complex logic • e.g., CAM (content-addressable) memories • SMS uses a three-stage approach (sketched structurally below): • Batch formation: per-source aggregation of memory requests into batches • Batch scheduler: prioritizes batches coming from latency-critical applications (e.g., CPU ones) • DRAM command scheduler: FIFO queues per DRAM bank; each batch from stage 2 is placed onto these FIFOs
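A structural sketch of the three stages as data structures, to make the decoupling concrete. Field names, queue depths, and the exact priority rule are illustrative assumptions; SMS's real batch scheduler mixes shortest-job-first-style and round-robin policies in hardware:

```c
/* Hedged structural sketch of SMS's three stages. Sizes and the
 * priority rule below are illustrative, not the paper's hardware. */
#include <stdbool.h>

enum source_kind { SRC_CPU, SRC_GPU };

struct mem_request { unsigned long row; unsigned bank; };

struct batch {                      /* Stage 1: batch formation */
    enum source_kind src;           /* one queue per requesting source */
    struct mem_request reqs[32];    /* requests to the same DRAM row */
    int count;
};

struct bank_fifo {                  /* Stage 3: DRAM command scheduler */
    struct mem_request fifo[64];    /* simple in-order queue per bank */
    int head, tail;
};

/* Stage 2: the batch scheduler picks the next batch to issue.
 * CPU batches are latency-sensitive, so they are favored over the
 * GPU's bandwidth-heavy ones; the tie-break rule is illustrative. */
bool prefer(const struct batch *a, const struct batch *b)
{
    if (a->src != b->src)
        return a->src == SRC_CPU;   /* prioritize CPU over GPU */
    return a->count < b->count;     /* shorter batch first */
}
```

Because each stage only performs a simple, local decision (row grouping, source priority, in-order issue), none of them needs the associative CAM lookups that make monolithic application-aware schedulers expensive.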
Discussion • Setting aside the single-ISA assumption, how can we combine ideas from the first three papers? • It seems we could incorporate ILP and MLP into the queuing decisions in BIS • The VM could also become more dynamic • BIS, PIE, and YinYang all assume heterogeneous multi-core systems with the same instruction set architecture (ISA). How would these papers change if we assumed different ISAs? • Probably not fundamentally. It would complicate the prediction part of PIE (you'd need some context-aware performance scaling between ISAs), but it wouldn't break it. • Similarly, you'd need some scaling for the VM work, but nothing breaks there, either. • Maybe the class has an opinion: do we need to keep ISAs homogeneous across a heterogeneous multi-core? What do we gain or lose from this?