ICIEV 2014 Dhaka University, Bangladesh Invited Talk 12: “Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Presented by: Dr. Abu Asaduzzaman Assistant Professor in Computer Architecture and Director of CAPPLab Department of Electrical Engineering and Computer Science (EECS) Wichita State University (WSU), USA May 23, 2014
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GPUs • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion QUESTIONS? Any time, please!
Introduction Single-Core to Multicore Architecture • History of Computing • The word "computer" appeared in print as early as 1613 (computing itself is older) • Von Neumann architecture (1945) – a single memory holds both data and instructions • Harvard architecture (1944) – separate data memory and instruction memory • Single-Core Processors • In most modern processors: split CL1 (I1, D1), unified CL2, … • Intel Pentium 4, AMD Athlon Classic, … • Popular Programming Languages • C, …
Introduction (Single-Core to) Multicore Architecture • Input • Process/Store • Output (cache not shown) • Multi-tasking • Time sharing (juggling!) Courtesy: Jernej Barbič, Carnegie Mellon University
Introduction Single-Core "Core" • A thread is a running "process" on a single core Courtesy: Jernej Barbič, Carnegie Mellon University
Introduction Thread 1: Integer (INT) Operation (Pipelining Technique) [Pipeline diagram: 1. Instruction Fetch → 2. Instruction Decode → 3. Operand(s) Fetch → 4. Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → 5. Result Write Back]
Introduction Thread 2: Floating Point (FP) Operation (Pipelining Technique) [Pipeline diagram: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Floating Point Operation → Result Write Back]
Introduction Threads 1 and 2: INT and FP Operations (Pipelining Technique) POSSIBLE? [Pipeline diagram: one core; Thread 1 uses the integer (ALU) unit while Thread 2 uses the floating point unit]
Performance Threads 1 and 2: INT and FP Operations (Pipelining Technique) POSSIBLE? [Pipeline diagram: one core; Thread 1 uses the integer (ALU) unit while Thread 2 uses the floating point unit]
Performance Improvement Threads 1 and 3: Integer Operations POSSIBLE? [Pipeline diagram: one core; Threads 1 and 3 both need the integer (ALU) unit]
Performance Improvement Threads 1 and 3: Integer Operations (Multicore) POSSIBLE? [Diagram: Core 1 runs Thread 1 and Core 2 runs Thread 3, each core with its own full pipeline]
Performance Improvement Threads 1, 2, 3, and 4: INT & FP Operations (Multicore) POSSIBLE? [Diagram: Core 1 runs Thread 1 (INT) and Thread 2 (FP); Core 2 runs Thread 3 (INT) and Thread 4 (FP)]
More Performance? Threads 1, 2, 3, and 4: INT & FP Operations (Multicore) POSSIBLE? [Diagram: Core 1 runs Thread 1 (INT) and Thread 2 (FP); Core 2 runs Thread 3 (INT) and Thread 4 (FP)]
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GPUs • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion
SMT enabled Multicore CPU with Manycore GPU for Ultimate Performance! Parallel/Concurrent Computing Parallel Processing – It is not fun! • Let's play a game: Paying the lunch bill together • Started with $30; spent $29 ($27 + $2) • Where did $1 go?
Performance Improvement Simultaneous Multithreading (SMT) • Thread • A running program (or code segment) is a process • A process can be divided into multiple processes/threads • Simultaneous Multithreading (SMT) • Multiple threads running in a single processor at the same time • Multiple threads running in multiple processors at the same time • Multicore Programming Language support • C with OpenMP, Open MPI, CUDA, …
Performance Improvement Identify Challenges • Sequential data-independent problems • C[] ← A[] + B[] • C[5] ← A[5] + B[5] • A'[] ← A[] • A'[5] ← A[5] • SMT capable multicore processor; CUDA/GPU Technology [Diagram: Core 1 and Core 2 work on independent elements]
Performance Improvement • CUDA/GPU Programming • GP-GPU Card • A GPU card with 16 streaming multiprocessors (SMs) • Inside each SM: • 32 cores • 64KB shared memory • 32K 32-bit registers • 2 schedulers • 4 special function units • CUDA • GPGPU Programming Platform
Performance Improvement CPU-GPU Technology • Tasks/Data exchange mechanism • Serial Computations – CPU • Parallel Computations - GPU
Performance Improvement GPGPU/CUDA Technology • The host (CPU) executes a kernel in GPU in 4 steps (Step 1) CPU allocates and copies data to GPU CUDA API: cudaMalloc(), cudaMemcpy()
Performance Improvement GPGPU/CUDA Technology • The host (CPU) executes a kernel in GPU in 4 steps (Step 2) CPU sends function parameters and instructions to GPU CUDA API: myFunc<<<Blocks, Threads>>>(parameters)
Performance Improvement GPGPU/CUDA Technology • The host (CPU) executes a kernel in GPU in 4 steps (Step 3) GPU executes instructions as scheduled, in warps (Step 4) Results are copied back to host memory (RAM) using cudaMemcpy()
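The four steps above can be sketched in CUDA/C host code. This is a hedged illustration rather than code from the talk: the vecAdd kernel, the problem size, and the launch geometry are assumptions chosen for clarity.

```cuda
#include <stdlib.h>

// Hypothetical vector-add kernel (vecAdd, n, and the launch
// configuration are illustrative assumptions).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    float *dA, *dB, *dC;

    // Step 1: allocate device memory and copy input data to the GPU.
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 2: send function parameters and launch the kernel.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Step 3: the GPU executes the threads as scheduled, in warps of
    // 32; the host may continue other work or synchronize here.
    cudaDeviceSynchronize();

    // Step 4: copy the results back to host memory (RAM).
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```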
Performance Improvement Case Study 1 (data independent computation without GPU/CUDA) • Matrix Multiplication [Figures: matrices; test systems]
Performance Improvement Case Study 1 (data independent computation without GPU/CUDA) • Matrix Multiplication [Charts: Execution Time; Power Consumption]
Performance Improvement Case Study 2 (data dependent computation without GPU/CUDA) • Heat Transfer on 2D Surface [Charts: Execution Time; Power Consumption]
Performance Improvement Case Study 3 (data dependent computation with GPU/CUDA) • Fast Effective Lightning Strike Simulation • The lack of lightning strike protection for composite materials limits their use in many applications.
Performance Improvement Case Study 3 (data dependent computation with GPU/CUDA) • Fast Effective Lightning Strike Simulation • Laplace’s Equation • Simulation • CPU Only • CPU/GPU w/o shared memory • CPU/GPU with shared memory
Performance Improvement Case Study 4 (MATLAB vs GPU/CUDA) • Different simulation models • Traditional sequential program • CUDA program (no shared memory) • CUDA program (with shared memory) • Traditional sequential MATLAB • Parallel MATLAB • CUDA/C parallel programming of the finite difference method based Laplace's equation demonstrates up to 257x speedup and 97% energy savings over a parallel MATLAB implementation while solving a 4Kx4K problem with reasonable accuracy.
Performance Improvement Identify More Challenges • Sequential data-independent problems • C[] ← A[] + B[] • C[5] ← A[5] + B[5] • A'[] ← A[] • A'[5] ← A[5] • SMT capable multicore processor; CUDA/GPU Technology • Sequential data-dependent problems • B'[] ← B[] • B'[5] ← {B[4], B[5], B[6]} • Communication needed between Core 1 and Core 2
Performance Improvement Develop Solutions • Task Regrouping • Create threads • Data Regrouping • Regroup data • Data for each thread • Threads with G2s first • Then, threads with G1s (Step 2 of 5) CPU copies data to GPU CUDA API: cudaMemcpy()
Performance Improvement Assess the Solutions • What is the Key? • Synchronization • With synchronization • Without synchronization • Speed vs. Accuracy • Threads with G2s first • Then, threads with G1s
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GP-GPU • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion
Energy-Efficient Computing Kansas Unique Challenge • Climate and Energy • Protect the environment from harm caused by climate change • Conserve natural energy resources
Energy-Efficient Computing "Power" Analysis • CPU with multiple GPUs • GPU usage varies • Power Requirements • NVIDIA GTX 460 (336-core) - 160W [1] • Tesla C2075 (448-core) - 235W [2] • Intel Core i7 860 (4-core, 8-thread) - 150-245W [3, 4] • Dynamic GPU Selection • Depending on • the "tasks"/threads • GPU usage [Diagram: one CPU connected to multiple GPUs]
Energy-Efficient Computing CPU-to-GPU Memory Mapping • GPU Shared Memory • Improves performance • CPU to GPU global memory • GPU global to shared • Data Regrouping • CPU to GPU global memory
Teaching Low-Power HPC Systems Integrate Research into Education • CS 794 – Multicore Architectures Programming • Multicore Architecture • Simultaneous Multithreading • Parallel Programming • Moore’s law • Amdahl’s law • Gustafson’s law • Law of diminishing returns • Koomey's law
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GP-GPU • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion
WSU CAPPLab CAPPLab • Computer Architecture & Parallel Programming Laboratory (CAPPLab) • Physical location: 245 Jabara Hall, Wichita State University • URL: http://www.cs.wichita.edu/~capplab/ • E-mail: capplab@cs.wichita.edu; Abu.Asaduzzaman@wichita.edu • Tel: +1-316-WSU-3927 • Key Objectives • Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields. • Teach advanced-level computer systems & architecture, parallel programming, and related courses.
WSU CAPPLab “People First” • Students • Kishore Konda Chidella, PhD Student • Mark P Allen, MS Student • Chok M. Yip, MS Student • Deepthi Gummadi, MS Student • Collaborators • Mr. John Metrow, Director of WSU HiPeCC • Dr. Larry Bergman, NASA Jet Propulsion Laboratory (JPL) • Dr. Nurxat Nuraje, Massachusetts Institute of Technology (MIT) • Mr. M. Rahman, Georgia Institute of Technology (Georgia Tech) • Dr. Henry Neeman, University of Oklahoma (OU)
WSU CAPPLab Resources • Hardware • 3 CUDA Servers – CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6GB GDDR5 memory • 2 CUDA PCs – CPU: Xeon E5506, … • Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3, Kepler card) via remote access to WSU (HiPeCC) • 2 CUDA enabled Laptops • More … • Software • CUDA, OpenMP, and Open MPI (C/C++ support) • MATLAB, VisualSim, CodeWarrior, more (as needed)
WSU CAPPLab Scholarly Activities • WSU became a "CUDA Teaching Center" for 2012-13 • Grants from NSF, NVIDIA, M2SYS, Wiktronics • Teaching Computer Architecture and Parallel Programming • Publications • Journal: 21 published; 3 under preparation • Conference: 57 published; 2 under review; 6 under preparation • Book Chapter: 1 published; 1 under preparation • Outreach • USD 259 Wichita Public Schools • Wichita Area Technical and Community Colleges • Open to collaboration
WSU CAPPLab Research Grants/Activities • Grants • WSU: ORCA • NSF – KS NSF EPSCoR First Award • M2SYS-WSU Biometric Cloud Computing Research Grant • Teaching (Hardware/Financial) Award from NVIDIA • Teaching (Hardware/Financial) Award from Xilinx • Proposals • NSF: CAREER (working/pending) • NASA: EPSCoR (working/pending) • U.S.: Army, Air Force, DoD, DoE • Industry: Wiktronics LLC, NetApp Inc, M2SYS Technology
Thank You! ICIEV-2014; Dhaka, Bangladesh; 2014 "Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!" QUESTIONS? Contact: Abu Asaduzzaman E-mail: abuasaduzzaman@ieee.org Phone: +1-316-978-5261 http://www.cs.wichita.edu/~capplab/