
Course Goals

Dive into the theory of parallel algorithms, design efficient solutions, and analyze performance to achieve significant speed-ups on real hardware. Explore the integration of CPUs and GPUs, the impact on mainstream computing, and the evolution towards XMT architectures. Discover the balance between fine-grained access and locality for optimized data movement and bandwidth utilization. Anticipate the future of general-purpose parallel programming and the convergence of CPU and GPU technologies.

talamantes

Presentation Transcript


  1. Course Goals
     1. Introduction to the theory of parallel algorithms. Parallel algorithmic thinking: obtaining good speed-ups over the best serial algorithm.
        • Class presentations: study the theory of parallel algorithms; design & asymptotic analysis of parallel algorithms.
        • Programming: hard speed-ups on real HW; “most” advanced algorithms → understanding; YOU can do it; off-line…
     2. Feasibility of an on-chip general-purpose parallel computer.
        • Focus: single task completion. Throughput: important, but different.
        • Overview.
     3. Affecting CPUs and/or GPUs. Report.
        • Greater integration of CPU and parallel processors
        • More shared memory over local memory
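To make "speed-ups over the best serial algorithm" concrete, here is a minimal sketch of the work-depth view commonly used in the asymptotic analysis of parallel algorithms. It is not taken from the slides: the function name `pram_sum` is hypothetical, and the "parallel" rounds are only simulated serially in Python. Summing n values serially takes n - 1 additions one after another; a balanced binary tree performs the same n - 1 additions (the work) in about log2(n) rounds (the depth), assuming enough processors to perform all additions of a round at once.

```python
def pram_sum(values):
    """Balanced-binary-tree summation, simulated round by round.

    Each loop iteration models one parallel time step in which all
    pairwise additions of that round are assumed to run concurrently.
    Returns (total, work, depth).
    """
    level = list(values)
    work = 0   # total number of additions over all rounds
    depth = 0  # number of parallel rounds
    while len(level) > 1:
        # All additions in this list comprehension are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        work += len(nxt)
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element out is carried up unchanged
        level = nxt
        depth += 1
    return (level[0] if level else 0), work, depth


total, work, depth = pram_sum(range(1, 17))
print(total, work, depth)  # 136 15 4
```

For 16 inputs the sketch reports 15 additions in 4 rounds, against 15 sequential steps for the serial loop: the same work, but a much shorter critical path.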

  2. Is mainstream computing on an XMT trajectory?
     Argue: happening through:
     • Integration
     • Support of fine-grained irregular apps: (beyond deep) machine learning*
     • Business competition
     * SW driving the marketplace; machine learning apps: XMT 3.3X over top-of-the-line GPU. GPU improved multi-core.

  3. Tighter integration of CPU & parallel execution
     • Multi-cores: integrated CPU and integrated graphics; e.g., 72 graphics processors (GrPU) on chip in 2017 (Intel).
     • Issue: parallelism executed by contractors vs subcontractors.
     • Image: independent players vs conductor/pianist + orchestra.
     • Classic multi-core: contractors; all processors equal.
     • Recent multi-cores: CPU is the conductor of the GrPUs. CPU↔GrPUs: tight private↔shared cache transition.
     • GPUs: in the past, CPU ↔ GPU exchanged jobs “over a fence”. Uniform address space as of the 2010s (Nvidia).
     But, there is more…

  4. Traditional GPUs: optimize pico-Joules. Next: optimize irregular parallel algorithms
     Dally 2009¹ (traditional GPUs: optimize pico-Joules). Motivation: data motion is energy costly.
     • Minimize expensive data movement
     • Optimize use of scarce bandwidth
     • Provide rich, explicitly managed storage hierarchy to reduce demand, increase utilization
     • “Efficiency = Locality” → far behind on irregular apps. We were not shy pointing this out…
     Dally, US patent 2015² (optimize irregular parallel algorithms):
     • Local caches mean separate memories, separate functionalities. Bad use of capacities (e.g., bandwidth).
     • Combine several logically separate memories into a single unified memory.
     • Single set of shared memory banks. Not local!
     • Dynamic allocation of bandwidth³
     Similar to the UMD PRAM-On-Chip XMT architecture: balance of fine-grained access and locality!
     ¹ Source: keynote “End of denial architectures”. (Per 2015 patent: “Traditional GPUs”.)
     ² Patent “Unified Streaming Multiprocessor Memory”; reflects increasing size of shared caches in GPUs starting ~2012, with much better performance on irregular parallel programs.
     ³ See the streaming multiprocessor of the Nov’17 Nvidia Volta, including register file.

  5. More integration or more discrete parallel accelerators?
     • Discrete accelerators: dedicated hardware towards a specific app can be greatly optimized.
     • Integrated parallelism: CPUs take some silicon, but if the CPU count is low, the chip need not fall behind on peak. Can do better in mixed serial/parallel tasks. Less power.
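Why mixed serial/parallel tasks matter can be illustrated with Amdahl's law, a standard model that the slides do not cite explicitly: if a fraction s of the running time is inherently serial, p parallel units give a speed-up of at most 1 / (s + (1 - s) / p). The sketch below is only an illustration under that model; the function name `amdahl_speedup` and the sample numbers are assumptions, not measurements from the presentation.

```python
def amdahl_speedup(serial_fraction, parallel_units):
    """Upper bound on speed-up under Amdahl's law.

    serial_fraction: share of the original run time that cannot be parallelized (0..1).
    parallel_units:  number of processors applied to the parallel part (>= 1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if parallel_units < 1:
        raise ValueError("parallel_units must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_units)


# Hypothetical workload: 10% serial code, 64 parallel units.
print(round(amdahl_speedup(0.10, 64), 2))
# Same workload with 1024 parallel units: the serial part dominates.
print(round(amdahl_speedup(0.10, 1024), 2))
```

Under this model, adding parallel units yields diminishing returns once the serial fraction dominates, which is one way to read the slide's point that a strong CPU tightly coupled to the parallel hardware can do better on mixed tasks.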

  6. (Approximate) Chronology of business competition: integrated solution vs discrete solution
     • 1990s. Integrated: ISA SIMD extensions (MMX). Discrete: CPU + graphics card. Winner.
     • 2000s. Integrated: ? Discrete: CPU + GPU. Winner.
     • 2010s. Integrated: integrated GPU (Iris/Pro/Plus, HD). Winner in the low-/mid-end & mobile. Discrete: CPU + (“discrete”) GPU. Still: winner in the high-end.
     • OpenCL? Accommodates both!
     • 2020s. Winner. Unless … new apps. Can matrix-mult-type suffice? Is the ball in the machine-learning court?
     For fun: contrast with Wall Street.

  7. OpenCL™ (Open Computing Language)
     “A multi-vendor open standard for general-purpose parallel programming of heterogeneous systems that include CPUs, GPUs, and other processors. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high performance compute servers, desktop computer systems, and handheld devices.”
     More level playing field around general-purpose parallel programming?

  8. Anticipate
     Mounting pressure towards support of:
     • Ease of programming; math induction?
     • Fine-grained irregular parallel programs; GPU papers…
     • Parallel (PRAM) algorithms
     • Teaching every CS major some notion of parallelism
     Time will tell whether, how and when this will unfold.
     Wild card: competition in the CPU space. However, I foresee a strong XMT/PRAM trajectory.
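As an illustration of what a fine-grained irregular PRAM algorithm looks like, the following sketch ranks the nodes of a linked list by pointer jumping, a classic PRAM technique. It is not taken from the slides: the name `list_rank` and the array encoding of the list are assumptions, and each synchronous PRAM step is simulated serially. Every node repeatedly replaces its pointer by its successor's pointer, so the memory accesses follow the data rather than a regular stride.

```python
def list_rank(successor):
    """Rank each node of a linked list by pointer jumping.

    successor[i] is the index of the node after i; the tail points to itself.
    Returns (rank, rounds), where rank[i] is the number of links from i to the tail.
    """
    n = len(successor)
    nxt = list(successor)
    # Invariant: rank[i] is the distance from i to nxt[i].
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    # Continue until every pointer has reached the tail.
    while any(nxt[nxt[i]] != nxt[i] for i in range(n)):
        # One synchronous step: all nodes read the old arrays, then all write.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
        rounds += 1
    return rank, rounds


# List order 3 -> 1 -> 4 -> 0 -> 2, with node 2 as the tail.
print(list_rank([2, 4, 2, 1, 0]))  # ([1, 3, 0, 4, 2], 2)
```

The number of rounds grows logarithmically with the list length, but each round touches every node, so this simple version does more total work than a serial traversal; it is shown for its access pattern, not as a work-optimal algorithm.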

  9. Concrete direction • https://software.intel.com/en-us/vtune-amplifier-help-gpu-opencl-application-analysis-view
