Ubiquitous Parallelism Are You Equipped To Code For Multi- and Many- Core Platforms?
Agenda • Introduction/Motivation • Why Parallelism? Why now? • Survey of Parallel Hardware • CPUs vs. GPUs • Conclusion • How Can I Start?
Talk Goal • Encourage undergraduates to answer the call of the era of parallelism • Education • Software Engineering
Why Parallelism? Why now? • You’ve already been exposed to parallelism • Bit Level Parallelism • Instruction Level Parallelism • Thread Level Parallelism
Why Parallelism? Why now? • Single-threaded performance has plateaued • Silicon Trends • Power Consumption • Heat Dissipation
Why Parallelism? Why now? • Issue: Power & Heat • Good: Cheaper to have more cores, but each is slower • Bad: Breaks the hardware/software contract
Why Parallelism? Why now? • Hardware/Software Contract • Maintain backward compatibility with existing code
Agenda • Introduction/Motivation • Why Parallelism? Why now? • Survey of Parallel Hardware • CPUs vs. GPUs • Conclusion • How Can I Start?
Personal Mobile Device Space • iPhone 5: 2 CPU cores / 3 GPU cores • Galaxy S3: 4 CPU cores / 4 GPU cores
Desktop Space • AMD Opteron 6272: 16 CPU cores • Rare To Have a "Single Core" CPU • Clock Speeds < 3.0 GHz • Power Wall • Heat Dissipation
Desktop Space • AMD Radeon 7970: 2048 GPU cores • General Purpose • Power Efficient • High Performance • Not All Problems Can Be Done on the GPU
Warehouse Space (HokieSpeed) • Each node: • 2x Intel Xeon 5645 (6 cores each) • 2x NVIDIA C2050 (448 GPU cores each) • 209 nodes • 2,508 CPU cores • 187,264 GPU cores
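Those totals are just the per-node counts scaled up (quick arithmetic, not shown on the slide):
209 nodes × 2 CPUs × 6 cores = 2,508 CPU cores
209 nodes × 2 GPUs × 448 cores = 187,264 GPU cores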
Convergence in Computing • Three Classes: • Warehouse • Desktop • Personal Mobile Device • Main Criteria • Power, Performance, Programmability
Agenda • Introduction/Motivation • Why Parallelism? Why now? • Survey of Parallel Hardware • CPUs vs. GPUs • Conclusion • How Can I Start?
What is a CPU? • Analogy: the CPU is an SR-71 jet • Capacity: 2 passengers • Top Speed: 2200 mph
What is a GPU? • Analogy: the GPU is a Boeing 747 • Capacity: 605 passengers • Top Speed: 570 mph
CPU Architecture • Latency Oriented (Speculation)
APU = CPU + GPU • Accelerated Processing Unit • Both CPU + GPU on the same die
CPUs, GPUs, APUs • How to handle parallelism? • How to extract performance? • Can I just throw processors at a problem?
CPUs, GPUs, APUs • CPUs: Multi-threading (2-16 threads) • GPUs: Massive multi-threading (100,000+ threads) • Depends on Your Problem
Agenda • Introduction/Motivation • Why Parallelism? Why now? • Survey of Parallel Hardware • CPUs vs. GPUs • Conclusion • How Can I Start?
How Can I Start? • CUDA Programming • You most likely have a CUDA-enabled GPU if you have a recent NVIDIA card
How Can I Start? • CPU or GPU Programming • Use OpenCL (your laptop can most likely run it)
How Can I Start? • Undergraduate research • Senior/Grad Courses: • CS 4234 – Parallel Computation • CS 5510 – Multiprocessor Programming • ECE 4504/5504 – Computer Architecture • CS 5984 – Advanced Computer Graphics
In Summary … • Parallelism is here to stay • How does this affect you? • How fast is fast enough? • Are we content with current computer performance?
Thank you! • Carlo del Mundo • Senior, Computer Engineering • Website: http://filebox.vt.edu/users/cdel/ • E-mail: cdel@vt.edu
Programming Models • pthreads • MPI • CUDA • OpenCL
pthreads • A POSIX/UNIX C API to create and destroy threads
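A minimal create/join sketch (my illustration, not from the slides), assuming a POSIX system; the worker function and the thread count of 4 are arbitrary:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4                 /* arbitrary count, for illustration */

/* Each created thread runs this function; the argument is its index. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* Create the threads... */
    for (long i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* ...and wait for (join) them, which releases their resources. */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], NULL);

    return 0;
}

Compile with cc -pthread.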
MPI (Message Passing Interface) • A communications standard • "Send and Receive" messages between nodes
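A minimal send-and-receive sketch (my illustration, not from the slides), assuming an MPI installation; run it with at least two ranks, e.g. mpirun -np 2 ./a.out:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

    if (rank == 0) {
        value = 42;                         /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}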
CUDA • Massive multi-threading (100,000+) • Thread-level parallelism
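As a concrete sketch (my illustration, not from the slides): the classic vector add, where every element gets its own thread, so a single launch spawns roughly a million threads; the array size, the block size of 256, and the names are arbitrary, and error checking is omitted:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* One thread per element: c[i] = a[i] + b[i]. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread ID */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                  /* ~1 million elements => ~1 million threads */
    size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements. */
    int blocks = (n + 255) / 256;
    vecAdd<<<blocks, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);           /* expect 3.0 */
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}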
OpenCL • A heterogeneous programming model that targets many kinds of devices (CPUs, GPUs, APUs)
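A compact sketch in C (my illustration, not from the slides): the kernel source is compiled at run time for whichever device is found, so the same program can target a GPU, a CPU, or an APU; the SAXPY kernel, the sizes, and the GPU-then-default-device fallback are my choices, and error checking is omitted:

#include <stdio.h>
#include <CL/cl.h>

/* The kernel is plain text, compiled at run time for the chosen device. */
static const char *src =
    "__kernel void saxpy(float a, __global const float *x, __global float *y) {\n"
    "    int i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float x[N], y[N], a = 3.0f;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    /* Ask for a GPU first; fall back to the default device (often the CPU). */
    if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL) != CL_SUCCESS)
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, sizeof x, x, NULL);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof y, y, NULL);

    clSetKernelArg(k, 0, sizeof a,  &a);
    clSetKernelArg(k, 1, sizeof dx, &dx);
    clSetKernelArg(k, 2, sizeof dy, &dy);

    size_t global = N;                      /* one work-item per element */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, sizeof y, y, 0, NULL, NULL);

    printf("y[0] = %f\n", y[0]);            /* expect 3*1 + 2 = 5.0 */
    return 0;
}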
Comparisons • Comparison of pthreads, MPI, CUDA, and OpenCL • † Productivity is subjective and draws from my experiences
Parallel Applications • Vector Add • Matrix Multiplication
Vector Add • Serial: loop N times → N cycles† • Parallel: assume you have N cores → 1 cycle† • † Assume 1 add = 1 cycle
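Spelled out as code (a sketch under the slide's assumption that one add takes one cycle): the serial version is a single loop over all N elements, one add after another; the parallel version gives each of the N cores (or threads) exactly one element, as in the vecAdd CUDA kernel sketched earlier, so all N adds can happen at once:

/* Serial vector add: N iterations executed one after another (~N adds in sequence). */
void vec_add_serial(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Parallel idea: remove the loop and assign one i per core/thread
   (e.g. the vecAdd CUDA kernel above), so the N adds proceed concurrently. */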