CSCI-4320/6360: Parallel Programming & Computing
West Hall, Tues./Fri. 12–1:20 p.m.
Introduction, Syllabus & Prelims
Prof. Chris Carothers, Computer Science Department, MRC 309a
Office Hrs: Tuesdays, 1:30 – 3:30 p.m.
chrisc@cs.rpi.edu
www.rpi.edu/~carotc/COURSES/PARALLEL/SPRING-2012
Let’s Look at the Syllabus…
• See the syllabus on the course webpage.
To Make a Fast Parallel Computer You Need a Faster Serial Computer… well, sorta…
• Review of…
• Instructions…
• Instruction processing…
• Putting it together: why the heck do we care about or need a parallel computer?
• i.e., they are really cool pieces of technology, but can they really do anything useful besides computing Pi to a few billion more digits…
Processor Instruction Sets
• In general, a computer needs a few different kinds of instructions:
• mathematical and logical operations
• data movement (access memory)
• jumping to new places in memory if the right conditions hold
• I/O (sometimes treated as data movement)
• All these instructions use registers to keep data as close as possible to the CPU
• e.g., $t0, $s0 in MIPS or %eax, %ebx in x86
a = (b + c) - (d + e);   # with a→$s0, b→$s1, c→$s2, d→$s3, e→$s4
add $t0, $s1, $s2   # t0 = b + c
add $t1, $s3, $s4   # t1 = d + e
sub $s0, $t0, $t1   # a = t0 - t1
lw destreg, const(addrreg)   — “Load Word”
• destreg: name of the register to put the value in
• const: a number (the offset)
• addrreg: name of the register to get the base address from
• effective address = (contents of addrreg) + const
Array Example: a = b + c[8];   # with a→$s0, b→$s1, base address of c→$s2
lw  $t0, 32($s2)    # $t0 = c[8] — byte offset = 8 words × 4 bytes (the slide’s 8($s2) was the “not quite right” part)
add $s0, $s1, $t0   # $s0 = $s1 + $t0
sw srcreg, const(addrreg)   — “Store Word”
• srcreg: name of the register to get the value from
• const: a number (the offset)
• addrreg: name of the register to get the base address from
• effective address = (contents of addrreg) + const
How are instructions processed?
• In the simple case (sketched in C below)…
• Fetch the instruction from memory
• Decode it (read the op code, and use the registers the op code calls for)
• Execute the instruction
• Write back any results to registers or memory
• Complex case…
• Pipeline – overlap instruction processing…
• Superscalar – multi-instruction issue per clock cycle…
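A minimal sketch in C of that simple four-step cycle, using a made-up two-instruction ISA (the struct layout, op codes, and memory sizes here are illustrative assumptions, not the actual MIPS encoding):

#include <stdint.h>

#define NREGS 32

/* Made-up 2-instruction ISA, for illustration only:
   op 0: add rd, rs, rt        op 1: lw rd, off(rs)  */
typedef struct { uint8_t op, rd, rs, rt; int16_t off; } Insn;

static uint32_t reg[NREGS];   /* register file                       */
static uint32_t mem[1024];    /* data memory, stored as 32-bit words */

void run(const Insn *prog, int nins)
{
    for (int pc = 0; pc < nins; pc++) {
        Insn in = prog[pc];             /* 1. fetch the instruction          */
        switch (in.op) {                /* 2. decode on the op code          */
        case 0:                         /* 3. execute: ALU add               */
            reg[in.rd] = reg[in.rs] + reg[in.rt];        /* 4. write back    */
            break;
        case 1:                         /* 3. execute: effective address     */
            reg[in.rd] = mem[(reg[in.rs] + in.off) / 4]; /* byte addr / 4 =  */
            break;                      /*    word index; 4. write back      */
        }
    }
}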
Simple (relative term) CPU: Multicycle Datapath & Control
Simple (yeah right!) Instruction Processing FSM!
Pipeline Processing w/ Laundry
• While the first load is drying, put the second load in the washing machine.
• When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.
• NOTE: unrealistic scenario for CS students, as most only own 1 load of clothes…
PPC 2012 - Intro, Syllabus & Prelims 1 6 P M 7 8 9 1 0 1 1 1 2 2 A M T i m e T a s k o r d e r A B C D 1 6 P M 7 8 9 1 0 1 1 1 2 2 A M T i m e T a s k o r d e r A B C D
Pipelined DP w/ signals
Pipelined Instructions… But wait, we’ve got dependencies!
Pipeline w/ Forwarding Values
Where Forwarding Fails… Must Stall
How Stalls Are Inserted
What about those crazy branches?
Problem: if the branch is taken, the PC goes to address 72, but we don’t know that until after 3 other instructions have already entered the pipeline.
Dynamic Branch Prediction
• From the phrase “there is no such thing as a typical program,” it follows that programs will branch in different ways, so there is no “one size fits all” branch algorithm.
• Alt approach: keep a history (1 bit) on each branch instruction recording whether it was last taken or not.
• Implementation: branch prediction buffer or branch history table.
• Indexed by the lower bits of the branch address
• A single bit indicates whether the branch at that address was last taken or not (1 or 0)
• But single-bit predictors tend to lack sufficient history…
Solution: 2-bit Branch Predictor
• Must be wrong twice before changing its prediction
• Learns whether the branch is more biased towards “taken” or “not taken”
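A minimal sketch of such a predictor in C (the table size, names, and index function are illustrative assumptions; real hardware implements this as a small SRAM, not software):

#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 12                       /* 4096-entry history table   */
#define TABLE_SIZE (1u << TABLE_BITS)

/* 2-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken. */
static uint8_t bht[TABLE_SIZE];

static inline uint32_t index_of(uint32_t branch_addr) {
    return (branch_addr >> 2) & (TABLE_SIZE - 1);  /* drop byte offset,  */
}                                                  /* keep low addr bits */

bool predict(uint32_t branch_addr) {
    return bht[index_of(branch_addr)] >= 2;        /* top bit = guess    */
}

void update(uint32_t branch_addr, bool taken) {
    uint8_t *c = &bht[index_of(branch_addr)];
    if (taken  && *c < 3) (*c)++;   /* saturate at “strongly taken”      */
    if (!taken && *c > 0) (*c)--;   /* saturate at “strongly not taken”  */
}

The two strongly biased end states are what give the “must be wrong twice” behavior: a loop-closing branch that mispredicts once on exit stays in a taken state rather than flipping.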
Even More Performance…
• Ultimately we want greater and greater Instruction-Level Parallelism (ILP)
• How? Multiple instruction issue.
• Results in CPIs of less than one.
• Here, instructions are grouped into “issue slots.”
• So we usually talk about IPC (instructions per cycle) instead.
• Static: uses the compiler to assist with grouping instructions and hazard resolution. The compiler MUST remove ALL hazards.
• Dynamic (i.e., superscalar): hardware creates the instruction schedule based on dynamically detected hazards
Example Static 2-Issue Datapath
Additions:
• 32 more bits from instruction memory (fetch two instructions at once)
• Two more read ports and one more write port on the register file
• 1 more ALU (the top one handles address calculation)
Ex. 2-Issue Code Schedule
The loop:
Loop: lw   $t0, 0($s1)        # t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

Scheduled into issue slots:
Cycle   ALU/branch slot          Data transfer slot
1                                lw  $t0, 0($s1)
2       addi $s1, $s1, -4
3       addu $t0, $t0, $s2
4       bne  $s1, $zero, Loop    sw  $t0, 4($s1)

It takes 4 clock cycles for 5 instructions, an IPC of 1.25.
More Performance: Loop Unrolling
• Technique where multiple copies of the loop body are made.
• Makes more ILP available by removing dependencies.
• How? The compiler introduces additional registers via “register renaming.”
• This removes “name” or “anti” dependences, where an instruction ordering is purely a consequence of the reuse of a register and not a real data dependence.
• No data values flow between one pair and the next pair.
• Let’s assume we unroll a block of 4 iterations of the loop (see the C-level sketch below)…
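A C-level sketch of what the compiler does under the hood, unrolling 4 iterations of the slide’s array-update loop (the temporaries t0..t3 stand in for the renamed registers; assuming n is a multiple of 4 keeps the sketch short):

/* Original rolled loop (the slide’s MIPS loop in C form):
   for (i = n - 1; i >= 0; i--) a[i] += s;                        */
void add_scalar_unrolled(int *a, int n, int s)
{
    /* simplifying assumption: n is a multiple of 4 */
    for (int i = n - 4; i >= 0; i -= 4) {
        int t0 = a[i + 3];   /* four independent loads into four    */
        int t1 = a[i + 2];   /* renamed temporaries: removes the    */
        int t2 = a[i + 1];   /* anti-dependence of reusing a single */
        int t3 = a[i];       /* register, so all four can overlap   */
        a[i + 3] = t0 + s;
        a[i + 2] = t1 + s;
        a[i + 1] = t2 + s;
        a[i]     = t3 + s;
    }   /* only one decrement and one branch per 4 iterations */
}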
Dynamically Scheduled Pipeline
Intel P4 Dynamic Pipeline – looks like a cluster… just much, much smaller…
Summary of Pipeline Technology
We’ve exhausted this!! IPC just won’t go much higher… Why??
More Speed til it Hertz!
• So, if no more ILP is available, why not increase the clock frequency? E.g., why don’t we have 100 GHz processors today?
• ANSWER: POWER & HEAT!!
• With current CMOS technology, power grows polynomially with a linear increase in clock speed.
• Power leads to heat, which will ultimately turn your CPU into a heap of melted silicon!
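The standard first-order CMOS dynamic-power model (a textbook approximation, not a figure from the slides) shows why:

\[ P_{\text{dyn}} \approx C\,V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3} \]

so pushing the clock 10× harder demands on the order of 1000× the power — which is exactly the heat wall described above.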
CPU Power Consumption…
Typically, 100 watts is the limit…
Where do we go from here? (actually, we’ve arrived @ “here”!)
• Current industry trend: multi-core CPUs
• Typically lower clock rates (i.e., < 3 GHz)
• 2, 4 and now 8 cores in a single “socket” package
• Smaller VLSI design processes (e.g., < 45 nm) reduce power & heat…
• Potential for large, lucrative contracts in turning old, dusty sequential codes multi-core capable
• Salesman: here’s your new $200 CPU, & oh, BTW, you’ll need this million-$ consulting contract to port your code to take advantage of those extra cores!
• Best business model since the mainframe!
• More cores require greater and greater exploitation of the available parallelism in an application, which gets harder and harder as you scale to more processors…
• Due to cost, this will force in-house development of a talent pool…
• You could be that talent pool…
Examples: Multicore CPUs
• Brief listing of the recently released 45 nm processors, based on Intel’s site (Processor Model – Cache – Clock Speed – Front Side Bus):
• Desktop Dual Core:
• E8500 – 6 MB L2 – 3.16 GHz – 1333 MHz
• E8400 – 6 MB L2 – 3.00 GHz – 1333 MHz
• E8300 – 6 MB L2 – 2.66 GHz – 1333 MHz
• Laptop Dual Core:
• T9500 – 6 MB L2 – 2.60 GHz – 800 MHz
• T9300 – 6 MB L2 – 2.50 GHz – 800 MHz
• T8300 – 3 MB L2 – 2.40 GHz – 800 MHz
• T8100 – 3 MB L2 – 2.10 GHz – 800 MHz
• Desktop Quad Core:
• Q9550 – 12 MB L2 – 2.83 GHz – 1333 MHz
• Q9450 – 12 MB L2 – 2.66 GHz – 1333 MHz
• Q9300 – 6 MB L2 – 2.50 GHz – 1333 MHz
• Desktop Extreme Series:
• QX9650 – 12 MB L2 – 3 GHz – 1333 MHz
• Note: Intel’s 45 nm Penryn-based Core 2 Duo and Core 2 Extreme processors were released on January 6, 2008. The new processors launch within a 35 W thermal envelope.
These are becoming the building blocks of today’s SCs. Getting large amounts of speed requires lots of processors…
Amdahl’s Law
• The anti-matter to Moore’s Law (CPU performance doubles every 24 months)
• Actually, Moore’s Law is more about transistor counts…
• The law states that given N processors, where F is the fraction of execution time that can be made parallel, the performance improvement (i.e., speedup) is:
• speedup = 1 / ((1 – F) + F/N)
• Note: as N → ∞, speedup is limited by 1/(1 – F)
So, then, what about supercomputers…
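A quick worked example (numbers chosen purely for illustration): suppose 95% of a code’s execution time parallelizes, i.e., F = 0.95. Then

\[ \text{speedup}(N) = \frac{1}{(1-F) + F/N}, \qquad \text{speedup}(1024) = \frac{1}{0.05 + 0.95/1024} \approx 19.6, \qquad \lim_{N\to\infty}\text{speedup} = \frac{1}{1-F} = 20 \]

so 1024 processors buy less than 20× — the serial fraction, not the machine, sets the ceiling.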
K Computer – #1 on Top500 @ 8.1 PF
• ~548K cores over 672 racks
• Consumes 9.89 MW of power
• Efficiency: 0.825 Gflops/watt
• #6 on the Green 500 list
• 1 PB of RAM
• Located at RIKEN in Japan
NSF MRI “Balanced” Cyberinstrument @ CCNI
• Blue Gene/Q
• 104 Tflops @ 2+ GF/watt
• #1 on the Green 500 list
• 10 PF and 20 PF systems by 2013
• 32K threads / 8K cores
• 8 TB RAM
• RAM Storage Accelerator
• 4 TB @ 40+ GB/sec
• 32 servers @ 128 GB each
• Disk storage
• 32 servers @ 24 TB disk
• 4 meta-data servers w/ SSD
• Bandwidth: 5 to 24 GB/sec
• Viz systems
• CCNI: 16 servers w/ dual GPUs
• EMPAC: display wall + servers
Disruptive Exascale Challenges…
• 1 billion-way parallelism
• Cost budget of O($200M) and O(20 MW)
• Note: 1 MW for a year costs about $1 million (US)
• Power
• 1K–2K pJ/op today (according to Bill Harrod @ DOE)
• Really ~500 pJ/op using Blue Gene/Q data
• Need 20 pJ/op (~50 GF/watt) to meet the 20 MW power ceiling (see the arithmetic below)
• Dominated by data movement & overhead
• Programmability
• Writing an efficient parallel program is hard!
• Locality is required for efficiency
• System complexity is a BARRIER to programmability
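The 20 pJ/op target is just the power ceiling divided by the op rate — simple arithmetic consistent with the slide’s numbers:

\[ 10^{18}\ \tfrac{\text{ops}}{\text{s}} \times 20\times10^{-12}\ \tfrac{\text{J}}{\text{op}} = 2\times10^{7}\ \text{W} = 20\ \text{MW}, \qquad \frac{10^{18}\ \text{flops}}{2\times10^{7}\ \text{W}} = 50\ \tfrac{\text{Gflops}}{\text{W}} \]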
Power Drives Radically New Hardware and Software
• Compute is FREE; the cost is moving data
• All software will have to be radically redesigned to be locality-aware
• Bill Dally – all CS complexity theory will need to be re-done!
• Note: IBM Blue Gene/Q today is @ 45 nm!!
Reliable Exabyte Storage is HARD!
• Current data:
• Intrepid: 478 Tflops but 60 GB/sec storage
• Jaguar: 2.2 Pflops but ~200 GB/sec storage
• Storage BW per flop is shrinking!!
• In practice…
• 1/3 of app execution time is consumed by I/O
• Checkpointing & the downward spiral of I/O (see the arithmetic below)
• Kernel panic @ 600K files…
The gap between computation and I/O performance continues to increase.
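To see why checkpointing becomes a downward spiral, a back-of-the-envelope estimate (illustrative numbers: 1 PB of RAM as on the K Computer, ~200 GB/sec of storage bandwidth as on Jaguar):

\[ t_{\text{ckpt}} = \frac{M}{B} = \frac{10^{15}\ \text{bytes}}{2\times10^{11}\ \text{bytes/s}} = 5000\ \text{s} \approx 83\ \text{min} \]

per full-memory checkpoint — and the fraction of the run spent writing state only grows as memory capacity outpaces I/O bandwidth.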
What are SCs used for??
• Can you say “fever for the flavor”?
• Yes, Pringles used an SC to model the airflow of chips as they entered “The Can”…
• Improved the overall yield of “good” chips in “The Can,” with fewer chips on the floor…
• P&G has also used SCs to improve other products like Tide, Pampers, Dawn, Downy and Mr. Clean
Patient-Specific Vascular Surgical Planning
• Virtual flow facility for patient-specific surgical planning
• High-quality patient-specific flow simulations needed quickly
• Simulation on massively parallel computers
• Cost only $600 on a 32K Blue Gene/L vs. $50K for a repeat open-heart surgery…
• At exascale this will cost more like $6
Disruptive Opportunities @ 2^60
• A radical new way to think about science and engineering
• Extreme time compression on very large-scale complex applications
• Materials, Drug Discovery, Finance, Defense, and Disaster Planning & Recovery…
• Technology enabler for…
• Smartphone “supercomputers” w/ 25 GFlops and 100s of GB of RAM
• A petascale “supercomputer” in every major university @ $200K
• IBM Watson “desk-side” edition
• Home users with 100 GB networks and terascale+ “home” supercomputers…
By 2020, we will have unprecedented access to vast amounts of data and the potential for ubiquitous, disruptive-scale computing power to use that data in our everyday lives.