ICIEV 2014 Dhaka University, Bangladesh Invited Talk 12: “Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Presented by: Dr. Abu Asaduzzaman Assistant Professor in Computer Architecture and Director of CAPPLab Department of Electrical Engineering and Computer Science (EECS) Wichita State University (WSU), USA May 23, 2014
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GPUs • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion QUESTIONS? Any time, please!
Introduction Single-Core to Multicore Architecture • History of Computing • The word "computer" appeared in print as early as 1613 (computing itself is older) • Von Neumann architecture (1945) – a single memory holds both data and instructions • Harvard architecture (1944) – separate data memory and instruction memory • Single-Core Processors • In most modern processors: split CL1 (I1, D1), unified CL2, … • Intel Pentium 4, AMD Athlon Classic, … • Popular Programming Languages • C, …
Introduction (Single-Core to) Multicore Architecture • Input • Process/Store • Output (cache not shown) • Multi-tasking • Time sharing (juggling!) Courtesy: Jernej Barbič, Carnegie Mellon University
Introduction Single-Core "Core" • A thread is a running "process" on a single core Courtesy: Jernej Barbič, Carnegie Mellon University
Introduction Thread 1: Integer (INT) Operation (Pipelining Technique) [Pipeline diagram: 1. Instruction Fetch → 2. Instruction Decode → 3. Operand(s) Fetch → 4. Integer Operation (Arithmetic Logic Unit) or Floating Point Operation → 5. Result Write Back]
Introduction Thread 2: Floating Point (FP) Operation (Pipelining Technique) [Pipeline diagram: Instruction Fetch → Instruction Decode → Operand(s) Fetch → Floating Point Operation → Result Write Back]
Introduction Threads 1 and 2: INT and FP Operations (Pipelining Technique) POSSIBLE? [Pipeline diagram: one core; Thread 1 uses the integer (ALU) unit while Thread 2 uses the floating point unit]
Performance Threads 1 and 2: INT and FP Operations (Pipelining Technique) POSSIBLE? [Pipeline diagram: one core; Thread 1 uses the integer (ALU) unit while Thread 2 uses the floating point unit]
Performance Improvement Threads 1 and 3: Integer Operations POSSIBLE? [Pipeline diagram: one core; Threads 1 and 3 both need the integer (ALU) unit]
Performance Improvement Threads 1 and 3: Integer Operations (Multicore) POSSIBLE? [Diagram: Core 1 runs Thread 1 and Core 2 runs Thread 3, each core with its own full pipeline]
Performance Improvement Threads 1, 2, 3, and 4: INT & FP Operations (Multicore) POSSIBLE? [Diagram: Core 1 runs Thread 1 (INT) and Thread 2 (FP); Core 2 runs Thread 3 (INT) and Thread 4 (FP)]
More Performance? Threads 1, 2, 3, and 4: INT & FP Operations (Multicore) POSSIBLE? [Diagram: Core 1 runs Thread 1 (INT) and Thread 2 (FP); Core 2 runs Thread 3 (INT) and Thread 4 (FP)]
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GPUs • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion
SMT enabled Multicore CPU with Manycore GPU for Ultimate Performance! Parallel/Concurrent Computing Parallel Processing – It is not fun! • Let's play a game: Paying the lunch bill together • Started with $30; spent $29 ($27 + $2) • Where did $1 go?
Performance Improvement Simultaneous Multithreading (SMT) • Thread • A running program (or code segment) is a process • A process can be divided into multiple processes/threads • Simultaneous Multithreading (SMT) • Multiple threads running in a single processor at the same time • Multiple threads running in multiple processors at the same time • Multicore Programming Language support • C with OpenMP, Open MPI, CUDA, …
Performance Improvement Identify Challenges • Sequential data-independent problems • C[] ← A[] + B[] • C[5] ← A[5] + B[5] • A'[] ← A[] • A'[5] ← A[5] • SMT capable multicore processor; CUDA/GPU Technology [Diagram: Core 1 and Core 2 work on independent elements]
Performance Improvement • CUDA/GPU Programming • GP-GPU Card • A GPU card with 16 streaming multiprocessors (SMs) • Inside each SM: • 32 cores • 64KB shared memory • 32K 32-bit registers • 2 schedulers • 4 special function units • CUDA • GPGPU Programming Platform
Performance Improvement CPU-GPU Technology • Tasks/Data exchange mechanism • Serial Computations – CPU • Parallel Computations - GPU
Performance Improvement GPGPU/CUDA Technology • The host (CPU) executes a kernel in GPU in 4 steps (Step 1) CPU allocates and copies data to GPU CUDA API: cudaMalloc(), cudaMemcpy()
Performance Improvement GPGPU/CUDA Technology • The host (CPU) executes a kernel in GPU in 4 steps (Step 2) CPU sends function parameters and instructions to GPU CUDA API: myFunc<<<Blocks, Threads>>>(parameters)
Performance Improvement GPGPU/CUDA Technology • The host (CPU) executes a kernel in GPU in 4 steps (Step 3) GPU executes instructions as scheduled, in warps (Step 4) Results are copied back to host memory (RAM) using cudaMemcpy()
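The four steps above can be sketched in CUDA/C host code. This is a hedged illustration rather than code from the talk: the vecAdd kernel, the problem size, and the launch geometry are assumptions chosen for clarity.

```cuda
#include <stdlib.h>

// Hypothetical vector-add kernel (vecAdd, n, and the launch
// configuration are illustrative assumptions).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    float *dA, *dB, *dC;

    // Step 1: allocate device memory and copy input data to the GPU.
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 2: send function parameters and launch the kernel.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Step 3: the GPU executes the threads as scheduled, in warps of
    // 32; the host may continue other work or synchronize here.
    cudaDeviceSynchronize();

    // Step 4: copy the results back to host memory (RAM).
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```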
Performance Improvement Case Study 1 (data independent computation without GPU/CUDA) • Matrix Multiplication [Figures: matrices; test systems]
Performance Improvement Case Study 1 (data independent computation without GPU/CUDA) • Matrix Multiplication [Charts: Execution Time; Power Consumption]
Performance Improvement Case Study 2 (data dependent computation without GPU/CUDA) • Heat Transfer on 2D Surface [Charts: Execution Time; Power Consumption]
Performance Improvement Case Study 3 (data dependent computation with GPU/CUDA) • Fast Effective Lightning Strike Simulation • The lack of lightning strike protection for composite materials limits their use in many applications.
Performance Improvement Case Study 3 (data dependent computation with GPU/CUDA) • Fast Effective Lightning Strike Simulation • Laplace’s Equation • Simulation • CPU Only • CPU/GPU w/o shared memory • CPU/GPU with shared memory
Performance Improvement Case Study 4 (MATLAB vs GPU/CUDA) • Different simulation models • Traditional sequential program • CUDA program (no shared memory) • CUDA program (with shared memory) • Traditional sequential MATLAB • Parallel MATLAB • CUDA/C parallel programming of the finite difference method based Laplace's equation demonstrates up to 257x speedup and 97% energy savings over a parallel MATLAB implementation while solving a 4Kx4K problem with reasonable accuracy.
Performance Improvement Identify More Challenges • Sequential data-independent problems • C[] ← A[] + B[] • C[5] ← A[5] + B[5] • A'[] ← A[] • A'[5] ← A[5] • SMT capable multicore processor; CUDA/GPU Technology • Sequential data-dependent problems • B'[] ← B[] • B'[5] ← {B[4], B[5], B[6]} • Communication needed between Core 1 and Core 2
Performance Improvement Develop Solutions • Task Regrouping • Create threads • Data Regrouping • Regroup data • Data for each thread • Threads with G2s first • Then, threads with G1s (Step 2 of 5) CPU copies data to GPU CUDA API: cudaMemcpy()
Performance Improvement Assess the Solutions • What is the Key? • Synchronization • With synchronization • Without synchronization • Speed vs. Accuracy • Threads with G2s first • Then, threads with G1s
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GP-GPU • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion
Energy-Efficient Computing Kansas Unique Challenge • Climate and Energy • Protect the environment from harm caused by climate change • Conserve natural energy resources
Energy-Efficient Computing "Power" Analysis • CPU with multiple GPUs • GPU usage varies • Power Requirements • NVIDIA GTX 460 (336-core) - 160W [1] • Tesla C2075 (448-core) - 235W [2] • Intel Core i7 860 (4-core, 8-thread) - 150-245W [3, 4] • Dynamic GPU Selection • Depending on • the "tasks"/threads • GPU usage [Diagram: one CPU connected to multiple GPUs]
Energy-Efficient Computing CPU-to-GPU Memory Mapping • GPU Shared Memory • Improves performance • CPU to GPU global memory • GPU global to shared • Data Regrouping • CPU to GPU global memory
Teaching Low-Power HPC Systems Integrate Research into Education • CS 794 – Multicore Architectures Programming • Multicore Architecture • Simultaneous Multithreading • Parallel Programming • Moore’s law • Amdahl’s law • Gustafson’s law • Law of diminishing returns • Koomey's law
“Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!” Outline ► • Introduction • Single-Core to Multicore Architectures • Performance Improvement • Simultaneous Multithreading (SMT) • (SMT enabled) Multicore CPU with GP-GPU • Energy-Efficient Computing • Dynamic GPU Selection • CAPPLab • “People First” • Resources • Research Grants/Activities • Discussion
WSU CAPPLab CAPPLab • Computer Architecture & Parallel Programming Laboratory (CAPPLab) • Physical location: 245 Jabara Hall, Wichita State University • URL: http://www.cs.wichita.edu/~capplab/ • E-mail: capplab@cs.wichita.edu; Abu.Asaduzzaman@wichita.edu • Tel: +1-316-WSU-3927 • Key Objectives • Lead research in advanced-level computer architecture, high-performance computing, embedded systems, and related fields. • Teach advanced-level computer systems & architecture, parallel programming, and related courses.
WSU CAPPLab “People First” • Students • Kishore Konda Chidella, PhD Student • Mark P Allen, MS Student • Chok M. Yip, MS Student • Deepthi Gummadi, MS Student • Collaborators • Mr. John Metrow, Director of WSU HiPeCC • Dr. Larry Bergman, NASA Jet Propulsion Laboratory (JPL) • Dr. Nurxat Nuraje, Massachusetts Institute of Technology (MIT) • Mr. M. Rahman, Georgia Institute of Technology (Georgia Tech) • Dr. Henry Neeman, University of Oklahoma (OU)
WSU CAPPLab Resources • Hardware • 3 CUDA Servers – CPU: Xeon E5506, 2x 4-core, 2.13 GHz, 8GB DDR3; GPU: Tesla C2075, 14x 32 cores, 6GB GDDR5 memory • 2 CUDA PCs – CPU: Xeon E5506, … • Supercomputer (Opteron 6134, 32 cores per node, 2.3 GHz, 64 GB DDR3, Kepler card) via remote access to WSU (HiPeCC) • 2 CUDA enabled Laptops • More … • Software • CUDA, OpenMP, and Open MPI (C/C++ support) • MATLAB, VisualSim, CodeWarrior, more (as needed)
WSU CAPPLab Scholarly Activities • WSU became a "CUDA Teaching Center" for 2012-13 • Grants from NSF, NVIDIA, M2SYS, Wiktronics • Teaching Computer Architecture and Parallel Programming • Publications • Journal: 21 published; 3 under preparation • Conference: 57 published; 2 under review; 6 under preparation • Book Chapter: 1 published; 1 under preparation • Outreach • USD 259 Wichita Public Schools • Wichita Area Technical and Community Colleges • Open to collaboration
WSU CAPPLab Research Grants/Activities • Grants • WSU: ORCA • NSF – KS NSF EPSCoR First Award • M2SYS-WSU Biometric Cloud Computing Research Grant • Teaching (Hardware/Financial) Award from NVIDIA • Teaching (Hardware/Financial) Award from Xilinx • Proposals • NSF: CAREER (working/pending) • NASA: EPSCoR (working/pending) • U.S.: Army, Air Force, DoD, DoE • Industry: Wiktronics LLC, NetApp Inc, M2SYS Technology
Thank You! ICIEV-2014; Dhaka, Bangladesh; 2014 "Discovering Energy-Efficient High-Performance Computing Systems? WSU CAPPLab may help!" QUESTIONS? Contact: Abu Asaduzzaman E-mail: abuasaduzzaman@ieee.org Phone: +1-316-978-5261 http://www.cs.wichita.edu/~capplab/