
Exploiting Heterogeneous Architectures



  1. Exploiting Heterogeneous Architectures Alex Beutel, John Dickerson, Vagelis Papalexakis 15-740/18-740 Computer Architecture, Fall 2012 In-Class Discussion, Tuesday 10/16/2012

  2. Heterogeneous Hardware Systems • Multiple CPU Single GPU systems • Asymmetric Multicore Processors (AMPs) • Combination of general-purpose big and small cores • Trade-off between performance and power consumption • Usually on-chip AMPs • Single-ISA architectures • Similar to AMPs, but with the same instruction set across all cores • “small” cores can use in-order execution • “big” cores can use out-of-order execution

  3. Overview

  4. Outline • BIS: José A. Joao et al., “Bottleneck identification and scheduling in multithreaded applications,” in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’12). • YinYang: Ting Cao et al., “The yin and yang of power and performance for asymmetric hardware and managed software,” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12). • PIE: Kenzo Van Craeynest et al., “Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE),” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12). • SMS: Rachata Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).

  5. Bottleneck Identification and Scheduling in Multithreaded Applications • Focuses on the problem of removing bottlenecks • A big problem in many systems – bottlenecks prevent scaling to many threads • Critical sections, pipeline stalls, and barriers are a few examples • In ACMPs, previous research shows that “big” cores can be used to handle (serializing) bottlenecks • But with limited fine-grained adaptivity and generality • Authors propose BIS • Key insight – the costliest bottlenecks are those that make other threads wait longest • Software and hardware cooperate to detect bottlenecks • BIS accelerates them using one or more “big” cores of the ACMP

  6. Bottlenecks • Amdahl’s serial portion • Critical sections • Barriers • Pipeline stages

  7. Bottleneck Identification • Software is used to identify bottlenecks • Instructions such as BottleneckCall, BottleneckReturn, and BottleneckWait give feedback to the BIS system • The BIS system keeps track of bottlenecks and thread waiting cycles (TWC) (with optimizations)
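To make the feedback loop concrete, here is a minimal Python model of the TWC accounting (all names are ours; in BIS this is a hardware table updated via the BottleneckWait instruction, not software):

    from collections import defaultdict

    class BottleneckTable:
        """Toy software model of BIS's per-bottleneck bookkeeping."""
        def __init__(self):
            self.twc = defaultdict(int)  # bottleneck id -> thread waiting cycles

        def bottleneck_wait(self, bottleneck_id, cycles):
            # A thread reports that it stalled for `cycles` cycles waiting
            # on this bottleneck (e.g., a contended critical section).
            self.twc[bottleneck_id] += cycles

        def costliest(self, n):
            # The key BIS insight: rank bottlenecks by how long they make
            # OTHER threads wait, not by their own execution time.
            return sorted(self.twc, key=self.twc.get, reverse=True)[:n]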

  8. Scheduling in Multithreaded Applications (with ACMPs) • Take the N bottlenecks with the highest TWC and accelerate them • There are many ways to accelerate; the authors focus on assigning bottlenecks to bigger cores in the ACMP • Send the worst bottlenecks from the small cores to a big core and keep them in a Scheduling Buffer • Many edge cases are handled, such as avoiding false serialization • Also extended to the multiple-large-core context
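The acceleration decision itself is simple once TWC is available; a hedged toy sketch (the paper's hardware version adds the Scheduling Buffer management and false-serialization checks this omits):

    def accelerate_on_big_core(twc, scheduling_buffer, n=1):
        # twc: bottleneck id -> accumulated thread waiting cycles
        # scheduling_buffer: the big core's queue of bottlenecks to execute.
        # Real BIS adds checks here, e.g. to avoid false serialization
        # (two independent bottlenecks needlessly sharing one big core).
        for bottleneck_id in sorted(twc, key=twc.get, reverse=True)[:n]:
            scheduling_buffer.append(bottleneck_id)

    # Example: bottleneck 7 made threads wait longest, so it goes first.
    buf = []
    accelerate_on_big_core({3: 120, 7: 900, 9: 450}, buf, n=2)
    print(buf)  # [7, 9]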

  9. Bottleneck Identification and Scheduling in Multithreaded Applications

  10. The Yin/Yang Metaphor • Hardware: heterogeneous multi-cores balance power and performance • Everyone cares about performance per energy (PPE) instead of absolute performance • Software: a move toward managed programming languages with virtual machines, like Java (JVM), C# (.NET), and JavaScript • Yang of heterogeneous hardware: exposed hardware adds complexity • Yin of managed languages: the VM handles all that exposed complexity for the programmer • Yang of VM languages: overhead • Yin of heterogeneous hardware: small cores can alleviate that overhead
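To make the PPE metric concrete, a toy calculation (all numbers invented for illustration): a core that is 1.5x slower but draws 5x less power still wins on performance per energy.

    def ppe(time_s, power_w):
        performance = 1.0 / time_s    # jobs per second
        energy = power_w * time_s     # joules spent on the job
        return performance / energy   # PPE: performance per joule

    big   = ppe(time_s=1.0, power_w=10.0)  # fast but power-hungry
    small = ppe(time_s=1.5, power_w=2.0)   # 1.5x slower, 5x less power
    print(small / big)                     # ~2.2x better PPE on the small core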

  11. Yin and Yang of Power and Performance: Overview • Virtual machines consume a lot of extra computation time and energy (~40%) • Java VM-related breakdown (~37% total): • 10% garbage collection • 12% JIT • 15% executing untouched instructions via the interpreter • Paper: exploits the GC, JIT, and interpreter tasks by placing them on the right types of cores, using a combination of parallelism, asynchrony, non-criticality, and hardware sensitivity

  12. Yin and Yang of Power and Performance: Overview • Garbage collection – asynchronous, can use many cores, does not benefit from a high clock rate. Use a low-power core with high memory bandwidth • JIT – asynchronous, somewhat parallel, and non-critical. A small core is powerful enough • Interpreter – on the critical path and not asynchronous; it uses the application's parallelism. Again, low-power cores generally suffice
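These placement rules amount to a small lookup policy; a hedged Python paraphrase (the service characterizations restate the slide, but the names and strings are ours, not an API from the paper):

    # VM services go to small, low-power cores; the GC additionally
    # wants memory bandwidth. Application code defaults to big cores.
    VM_SERVICE_POLICY = {
        "gc":          "small core with high memory bandwidth",  # async, parallel, memory-bound
        "jit":         "small core",                             # async, non-critical, parallel enough
        "interpreter": "small cores",                            # critical path, uses the app's parallelism
    }

    def place(service):
        return VM_SERVICE_POLICY.get(service, "big core")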

  13. YinYang Experimental Evaluation • Power: they measure the power overhead of the VM services; the VM does eat power, so it is a good candidate for heterogeneous systems • Performance per energy (PPE): most results are reported this way instead of as absolute performance • Moving the JIT and GC to lower-clocked cores increases PPE (by 9-13%) • GC is very memory-bound, so it does great on low-power cores • JIT is less memory-bound, but embarrassingly parallel, so it is still a good fit for low-power cores • The interpreter's PPE improvement is less stark, but still there

  14. YinYang Experimental Evaluation

  15. PIE: Performance Impact Estimation • A heterogeneous multi-core architecture is one that features big, powerful, power-hungry core(s) and small, weak, energy-efficient core(s) • How do we map workloads onto the appropriate cores to maximize "speed per energy"? • PIE is a static or dynamic scheduler that takes both memory-level parallelism (MLP) and instruction-level parallelism (ILP) into account to predict how well a job will do on different types of cores • Static: schedule each job once for its whole duration • Dynamic: push parts of jobs to the appropriate cores as they run

  16. PIE (contd.) • Motivation: the intuition is wrong: • “Compute-heavy jobs should go on the heavyweight 'big' cores, while memory-heavy jobs can do well enough on 'small' cores.” • The authors run experiments and find that big cores do well on MLP-intensive jobs, while small cores do well on ILP-intensive jobs

  17. More PIE • One way to schedule is to randomly sample job-core mappings, learn, and choose the best • High overhead! • PIE instead estimates a job's performance on core type B while the job is running on core type A • The estimate is an aggregate of: • A base instruction-level parallelism (ILP) score • A memory-level parallelism (MLP) component that is a function of the processor's architecture and the cache misses observed for the specific job • The scheduler takes the estimated performance of the job on the different core types and decides where to put it
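The shape of that estimate, as a heavily simplified Python sketch (real PIE derives the scaling terms from hardware counters and architectural parameters; everything here is illustrative):

    def estimate_cpi_on_other_core(cpi_base, cpi_mem, ilp_scale, mlp_scale):
        # cpi_base: non-memory CPI observed on the current core (ILP term)
        # cpi_mem:  memory-stall CPI observed on the current core (MLP term)
        # ilp_scale / mlp_scale: how much better or worse the OTHER core
        # exploits ILP / overlaps misses; invented knobs standing in for
        # the paper's counter-driven formulas.
        return cpi_base * ilp_scale + cpi_mem * mlp_scale

    def pick_core(est_cpi_big, est_cpi_small):
        # Lower predicted CPI means higher predicted performance.
        return "big" if est_cpi_big < est_cpi_small else "small"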

  18. PIE works well

  19. Multiple CPU Single GPU Systems • Memory becomes the critical resource • GPU accesses are vastly different from CPU ones • GPUs generate significantly more requests • The GPU spawns many different threads • Increased contention between the GPU and the CPUs • Need to design a memory controller that: • Schedules the memory accesses • Ensures fairness • Is scalable and easy to implement • Current approaches are not robust to the presence of both a GPU and CPUs

  20. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems • Proposes a multi-stage approach to application-aware memory scheduling • Handles interference between bandwidth-demanding apps and non-demanding ones (e.g., the GPU and the CPUs, respectively) • Simplified hardware implementation, because the memory controller is decoupled across multiple stages • Improves CPU performance without degrading GPU performance • The authors test many settings (e.g., CPU only, GPU only, CPU & GPU) • And compare against existing approaches

  21. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems • Sophisticated approaches that prioritize memory accesses require overly complex logic • E.g., large CAM structures • SMS instead uses a three-stage approach (sketched below): • Batch Formation: per-source aggregation of memory requests into batches • Batch Scheduler: prioritizes batches coming from latency-critical apps (e.g., CPU ones) • DRAM Command Scheduler: FIFO queues per DRAM bank; each batch from Stage 2 is placed on these FIFOs
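A structural Python sketch of the three stages (queue behavior only; DRAM timing, row-buffer hits, and the actual batching heuristics are omitted, and all names are ours):

    from collections import defaultdict, deque, namedtuple

    Request = namedtuple("Request", ["bank", "row"])  # toy memory request

    class StagedMemoryScheduler:
        """Structural sketch of SMS's three decoupled stages."""
        def __init__(self, num_banks):
            self.batches = defaultdict(list)                       # Stage 1 state
            self.bank_fifos = [deque() for _ in range(num_banks)]  # Stage 3 state

        def enqueue(self, source, request):
            # Stage 1 (Batch Formation): group requests per source
            # (a CPU core or the GPU) to preserve locality.
            self.batches[source].append(request)

        def schedule_batches(self, latency_critical):
            # Stage 2 (Batch Scheduler): dispatch whole batches, preferring
            # latency-critical sources (CPUs) over the bandwidth-hungry GPU.
            for source in sorted(self.batches, key=lambda s: s not in latency_critical):
                for req in self.batches[source]:
                    self.bank_fifos[req.bank].append(req)
            self.batches.clear()

        def issue(self):
            # Stage 3 (DRAM Command Scheduler): each bank drains a plain
            # FIFO in order -- no expensive CAM search across all requests.
            return [fifo.popleft() for fifo in self.bank_fifos if fifo]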

  22. SMS works well

  23. Discussion • Setting aside the single-ISA assumption, how can we combine ideas from the first three papers? • It seems we can incorporate ILP and MLP into the queuing decisions in BIS • The VM can also become more dynamic • The BIS, PIE, and YinYang papers assume heterogeneous multi-core systems with the same instruction set architecture (ISA). How would these papers change if we assumed different ISAs? • I'm thinking not fundamentally. It would make the prediction part of PIE more complicated (you'd need some context-aware performance scaling between ISAs), but it wouldn't break it • Similarly, you'd need some scaling for the VM work, but it wouldn't break anything there, either • Maybe the class has an opinion on this? Do we need to keep ISAs homogeneous across a heterogeneous multi-core? What do we gain or lose from this?
