Parallel Programming

Pingpeng Yuan

Parallel Programming • What • Why • How • Goal • Exam

What is Parallel Programming? • Coordinating multiple processing elements to solve a problem

Parallelism - A simplistic understanding • Multiple tasks at once. • Distribute work into multiple execution units. • Two approaches: • Data Parallelism: split the data into chunks, and have each computing unit process its own chunk. • Functional or Control Parallelism: divide the problem into distinct tasks, and have each processing unit handle its own task.
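The two approaches can be sketched in Python with the standard-library `concurrent.futures` module. This is a minimal illustration, not part of the original slides: the helper names (`chunked`, `data_parallel_sum`, `functional_parallel_stats`) are hypothetical, and threads are used only to show the structure (in CPython they do not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into n_chunks contiguous, nearly equal pieces."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def data_parallel_sum(data, n_workers=4):
    """Data parallelism: every worker applies the same operation to its own chunk."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunked(data, n_workers)))
    # Combine the per-chunk results into the final answer.
    return sum(partial_sums)


def functional_parallel_stats(data):
    """Functional parallelism: each worker runs a different task on the same data."""
    tasks = {"min": min, "max": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

In the first function the work is divided by data (same code, different chunks); in the second it is divided by task (different code, same data).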

Why • Technology Trend • Application Needs

Human Architecture! [Figure: growth/performance versus age, with "Vertical" and "Horizontal" growth labels]

Computational Power Improvement [Figure: C.P.I. versus number of processors, comparing multiprocessor and uniprocessor]

General Technology Trends • Microprocessor performance increases 50% - 100% per year • Clock frequency doubles every 3 years • Transistor count quadruples every 3 years

Clock Frequency Growth Rate (Intel family) • 30% per year
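The 30% annual rate can be related to the "doubles every 3 years" rule of thumb by compounding. The following is a small sketch of that arithmetic; the function names are illustrative and not from the slides.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a constant annual growth rate (0.30 means 30%)."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_factor(annual_growth, years):
    """Total multiplicative growth after compounding for the given number of years."""
    return (1 + annual_growth) ** years


if __name__ == "__main__":
    # 30% per year doubles in a little under 3 years.
    print(round(doubling_time_years(0.30), 2))
    print(round(growth_factor(0.30, 3), 3))
```

At 30% per year the doubling time is about 2.6 years, and three years of growth gives a factor of about 2.2, which is consistent with "doubles every 3 years" as a rough figure.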

Intel Many Integrated Core (MIC) [Figure: 32-core version of MIC]

Tilera’s 100 cores (June 2011) • Tilera has introduced a range of processors (64-bit Gx family: 36 cores, 64 cores and 100 cores), aiming to take on Intel in servers that handle high-throughput web applications • 64-bit cores running up to 1.5GHz • Manufactured in 40nm technology

Top500 Paradigm Change in HPC ….

GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)

The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure

GPU Will Top the List in Nov 2010

Transistor Count Growth Rate (Intel family) • Transistor count grows much faster than clock rate • 40% per year, an order of magnitude more contribution in 2 decades

How to Use More Transistors • Improve single threaded performance via architecture: • Not keeping up with potential given by technology • Use transistors for memory structures to improve data locality • Use parallelism • Instruction-level • Thread level

Similar Story for Storage (Transistor Count)

Trends in DRAM Capabilities • DRAM densities to double every 3 years • Projections for DRAM densities revised downwards over time • Current densities at 4Gb/die • DRAM data rates to double every 4-5 years • Projections for DRAM data rates revised upwards over time • Current data-rates at 2.2 Gb/s

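Similar Story for Storage • The gap between memory capacity and memory access speed is even more pronounced • Memory capacity grew 1000x from 1980 to 1995, about 50% per year • Latency fell by only about 3% per year (only 2x from 1980-95) • Memory bandwidth increased 2x • Processors get faster and memories get larger, so memory becomes relatively slower • More data must be transferred in parallel • More levels of cache are needed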

Memory hierarchy • Each level can be viewed as a cache for the level below it • CPU registers: 100 bytes, < 1 ns • L1 cache: 32KB, 1 ns • L2 cache: 256KB, 4 ns • Primary memory: 1GB, 60 ns • Secondary storage: 1TB, 10 ms • Tertiary storage: 1PB, 1 s - 1 hr
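To see why treating each level as a cache for the next one pays off, the average access time can be estimated from the latencies on this slide. The sketch below is illustrative only: the hit rates (95% for L1, 90% for L2) are assumptions chosen for the example, not figures from the slides, and the function name is hypothetical.

```python
def average_access_time_ns(levels):
    """Expected access time for a hierarchy given as (latency_ns, hit_rate) pairs.

    Levels are listed fastest first. An access pays a level's latency only if
    it missed in every faster level; the last level should have hit_rate 1.0.
    """
    total = 0.0
    reach_probability = 1.0  # probability that an access gets as far as this level
    for latency_ns, hit_rate in levels:
        total += reach_probability * latency_ns
        reach_probability *= 1.0 - hit_rate
    return total


if __name__ == "__main__":
    # Latencies from the slide; hit rates are ASSUMED for illustration.
    hierarchy = [
        (1.0, 0.95),   # L1 cache: 1 ns
        (4.0, 0.90),   # L2 cache: 4 ns
        (60.0, 1.00),  # primary memory: 60 ns
    ]
    print(average_access_time_ns(hierarchy))
```

Under these assumed hit rates the average access costs about 1.5 ns, far closer to the L1 latency than to the 60 ns of primary memory.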

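Similar Story for Storage • Parallelism improves the efficiency of each level without increasing access time • Parallelism and locality apply inside the storage system as well • Memory chips fetch many bits at once, then pipeline them out over a narrow channel • Buffers hold the most recently accessed data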

Disk trends • Disks too: Parallel disks plus caching • Disk capacity, 1975-1989 • doubled every 3+ years • 25% improvement each year • factor of 10 every decade • Still exponential, but far less rapid than processor performance • Disk capacity, 1990-recently • doubling every 12 months • 100% improvement each year • factor of 1000 every decade • Capacity growth 10x as fast as processor performance!

Disk trends • Only a few years ago, we purchased disks by the megabyte • Today, 1 GB (a billion bytes) costs $0.05 from Dell (previously $1, then $0.50) • => 1 TB costs $50 (previously $1K, then $500), 1 PB costs $50K (previously $1M, then $500K) • Technology is amazing • Flying a 747 6” above the ground • Reading/writing a strip of postage stamps

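In summary: rapid growth • Processor speed • Storage capacity • The gap between bandwidth on one side and latency and clock frequency on the other • Parallelism is the inevitable trend in the development of computer architecture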

Commodity computer systems 19462003 General-purpose computing: Serial. 5KHz4GHz. 2004 General-purpose computing goes parallel. Clock frequency growth flat. #Transistors/chip 19802011: 29K30B! #”cores”: ~dy-2003

If you want your program to run significantly faster … you’re going to have to parallelize it

Drivers of Parallel Computing – Application needs ref: http://www.nvidia.com/object/tesla_computing_solutions.html

Applications of Parallel Processing

Why Do We Need Parallel Processing? • Reasonable running time = fraction of an hour to several hours (10^3-10^4 s) • In this time, a TIPS/TFLOPS machine can perform 10^15-10^16 operations • Example 1: Southern oceans heat modeling (10-minute iterations): 4096 E-W regions, 1024 N-S regions, 12 layers in depth; 300 GFLOP per iteration × 300 000 iterations per 6 yrs = 10^16 FLOP • Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice): 10^9 lattice points × 1000 FLOP/point × 10 000 time steps = 10^16 FLOP • Example 3: Monte Carlo simulation of nuclear reactor: 10^11 particles to track (for 1000 escapes) × 10^4 FLOP/particle = 10^15 FLOP • Decentralized supercomputing (from Mathworld News, 2006/4/7): a grid of tens of thousands of networked computers discovers 2^30402457 − 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits)
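The operation counts in Examples 2 and 3, and the running-time estimate on a TFLOPS machine, can be checked with a few lines of arithmetic. This is a sketch under the slide's own figures; the function names are illustrative, and a sustained rate of exactly 10^12 FLOP/s is an idealizing assumption.

```python
TFLOPS = 10**12  # assumed sustained rate of a TFLOPS machine, in FLOP per second


def total_flop(*factors):
    """Multiply the workload factors (points, FLOP per point, steps, ...)."""
    result = 1
    for factor in factors:
        result *= factor
    return result


def runtime_seconds(flop, machine_flops=TFLOPS):
    """Idealized running time: total work divided by the machine's rate."""
    return flop / machine_flops


if __name__ == "__main__":
    # Example 2: 10^9 lattice points x 1000 FLOP/point x 10 000 time steps.
    fluid = total_flop(10**9, 1000, 10_000)
    # Example 3: 10^11 particles x 10^4 FLOP/particle.
    monte_carlo = total_flop(10**11, 10**4)
    print(fluid, runtime_seconds(fluid))
    print(monte_carlo, runtime_seconds(monte_carlo))
```

Both workloads land in the 10^3-10^4 s window that the slide calls a reasonable running time, which is the point of the comparison: they are feasible only at roughly TFLOPS rates.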

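The Big Data Era • According to an IDC report, the world's total data volume was 2.7ZB in 2012 and is projected to reach 35ZB by 2020. • Categories of big data: • Internet data • Scientific data • Multimedia data • Industry application data, such as financial data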

What Makes it Big Data? • Volume • Velocity • Variety • Value [Figure: data sources such as social media, blogs, and smart meters feeding a binary data stream]

Numbers • How much data is there in the world? • 800 Terabytes, 2000 • 160 Exabytes, 2006 • 500 Exabytes (Internet), 2009 • 2.7 Zettabytes, 2012 • 35 Zettabytes by 2020 • How much data is generated in ONE day? • 7 TB, Twitter • 10 TB, Facebook • Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, 2011

Big Data Use Cases

How • Practice is the sole criterion for testing truth

Parallel Programming • Course content structure • Parallel Architectures • Parallel Algorithms • Parallel Programming

Goal • Most people in the research community agree that there are at least two kinds of parallel programmers that will be important to the future of computing • Programmers that understand how to write software, but are naïve about parallelization and mapping to architecture • Programmers that are knowledgeable about parallelization, and mapping to architecture, so can achieve high performance

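Teaching Plan • 32 class hours in total • 4 hours: course introduction + parallel computing system architecture • 4 hours: fundamentals of parallel algorithms • 24 hours: parallel programming

Assessment Requirements • Grading: coursework (attendance + 1 doc) + exam (weighting 20:80) • 1 doc • For a given technical problem in parallel computing, critique the relevant existing solutions and propose improvements • The critique should focus on the novel contributions, the existing problems, and possible next steps for research.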
