Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
It starts with the raw material (Moore's Law)
• The first microprocessor (Intel 4004), 1971
  • 2,300 transistors
  • 106 kHz
• The Pentium chip, 1993
  • 3.1 million transistors
  • 66 MHz
• Today
  • More than one billion transistors
  • Frequencies in excess of 5 GHz
• Tomorrow?
Too many people do not realize: parallelism did not start with multi-core
• Pipelining
• Out-of-order execution
• Multiple operations in a single microinstruction
• VLIW (horizontal microcode exposed to the software)
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
One thousand Mickey Mouse cores
• Why not a million? Why not ten million?
• Let's start with 16
• What if we could replace 4 with one more powerful core?
• ...and we learned:
  • One more powerful core is not enough
  • Sometimes we need several
• MorphCore was born
• BUT not all MorphCore (fixed function vs. flexibility)
The Asymmetric Chip Multiprocessor (ACMP)
[Figure: three chip floorplans compared — the "Tile-Large" approach (a few large cores), the "Niagara" approach (many small, Niagara-like cores), and the ACMP approach (a large core plus many Niagara-like cores).]
Large core vs. Small core
• Large core
  • Out-of-order
  • Wide fetch (e.g., 4-wide)
  • Deeper pipeline
  • Aggressive branch predictor (e.g., hybrid)
  • Many functional units
  • Trace cache
  • Memory dependence speculation
• Small core
  • In-order
  • Narrow fetch (e.g., 2-wide)
  • Shallow pipeline
  • Simple branch predictor (e.g., gshare)
  • Few functional units
Server throughput
• The good news: not a software problem
  • Each core runs its own problem
• The bad news: how many pins?
  • Memory bandwidth (see the sketch below)
• More bad news: how much energy?
  • Each core runs its own problem
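A back-of-envelope sketch of the pin problem. The core count, per-core bandwidth demand, pin count, and per-pin signaling rate below are invented, plausible numbers, not figures from the talk; the point is only that aggregate demand grows with cores while pins do not.

```python
# Toy estimate: aggregate memory bandwidth demanded by N independent cores
# versus what a pin-limited memory interface can supply.

CORES = 256                # assumed core count
BW_PER_CORE_GBS = 2.0      # assumed sustained demand per core (GB/s)
DATA_PINS = 256            # assumed pins available for memory data
GBITS_PER_PIN = 6.4        # assumed signaling rate per pin (Gbit/s)

demand = CORES * BW_PER_CORE_GBS            # GB/s the cores want
supply = DATA_PINS * GBITS_PER_PIN / 8      # GB/s the pins can deliver

fed = min(CORES, int(supply // BW_PER_CORE_GBS))
print(f"demand {demand:.1f} GB/s, supply {supply:.1f} GB/s")
print(f"cores actually fed: {fed} of {CORES}")
# With these made-up numbers: demand 512 GB/s, supply ~205 GB/s,
# so only ~102 of the 256 cores can be kept busy.
```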
What about GPUs and databases?
• In theory, absolutely!
• GPUs (SMT + SIMD + predication)
  • Provided there are no conditional branches (divergence — see the sketch below)
  • Provided memory accesses line up nicely (coalescing)
• Databases
  • Provided there are no critical sections
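A toy illustration of why divergence hurts. This is not GPU code, just a simulation of one SIMD warp under predication; the warp width and the branch condition are made up.

```python
# Simulate one SIMD "warp": every lane steps through both sides of a branch,
# and a predication mask decides which lanes actually commit their work.

WARP = 32
data = list(range(WARP))               # one element per lane
taken = [x % 2 == 0 for x in data]     # made-up branch condition

useful = wasted = 0
# "then" side: executed by all lanes, committed only where the branch is taken
for lane in range(WARP):
    if taken[lane]:
        useful += 1
    else:
        wasted += 1
# "else" side: executed by all lanes, committed only where it is not taken
for lane in range(WARP):
    if not taken[lane]:
        useful += 1
    else:
        wasted += 1

print(f"lane-cycles spent: {useful + wasted}, useful: {useful}, wasted: {wasted}")
# With this divergent branch, half the lane-cycles are thrown away; with a
# uniform branch (all lanes agree) the hardware could skip one side entirely.
```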
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
Dark silicon
• Too many transistors: we cannot power them all
  • All those cores powered down
  • All that parallelism wasted
• Not really: the refrigerator! (a.k.a. accelerators — see the sketch below)
  • Fork (in parallel)
  • Although not all at the same time!
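One way to read the refrigerator analogy in code: many specialized units exist on the die, but a run-time only powers up the few that fit the moment's power budget. A minimal sketch; the accelerator names, power numbers, and greedy policy are all invented for illustration.

```python
# Toy "dark silicon" manager: enable accelerators only while the active set
# stays within the chip's power budget; the rest remain dark.

ACCELERATORS = {            # watts each block would draw if powered (made up)
    "matrix_engine": 8.0,
    "video_decode": 3.0,
    "crypto": 1.5,
    "dsp": 2.5,
}
POWER_BUDGET_W = 10.0

def power_up(requested):
    """Greedily enable requested accelerators until the budget is spent."""
    active, used = [], 0.0
    for name in requested:
        cost = ACCELERATORS[name]
        if used + cost <= POWER_BUDGET_W:
            active.append(name)
            used += cost
    return active, used

active, used = power_up(["matrix_engine", "video_decode", "crypto", "dsp"])
print(active, f"{used:.1f} W of {POWER_BUDGET_W} W")
# -> ['matrix_engine', 'crypto'] 9.5 W of 10.0 W; the others stay dark
#    until something else powers down.
```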
Amdahl's Law
• The serial bottleneck always limits performance
• Heterogeneous cores AND control over them can minimize the effect (see the sketch below)
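The arithmetic behind both bullets, as a small sketch. The first function is the standard Amdahl formula; the asymmetric variant (one fast core handles the serial fraction) follows the Hill-and-Marty style of analysis, and the specific fractions and core counts are assumptions.

```python
def amdahl(serial_frac, n_cores):
    """Classic Amdahl speedup: the serial part runs on one baseline core."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_cores)

def amdahl_asymmetric(serial_frac, n_small, big_speedup):
    """Serial part runs on one big core that is big_speedup x a small core."""
    return 1.0 / (serial_frac / big_speedup + (1.0 - serial_frac) / n_small)

f = 0.05                                 # assumed 5% serial fraction
print(amdahl(f, 64))                     # ~15.4x: 5% serial code caps the gain
print(amdahl_asymmetric(f, 60, 4.0))     # ~35.3x: a big core shrinks the bottleneck
```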
The Cloud
• It is behind the curtain; how do we manage it?
• Answer: the on-chip run-time system
• Answer: pragmas beyond the Cloud
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
The fundamental concept: Synchronization
The transformation hierarchy:
Problem → Algorithm → Program → ISA (Instruction Set Architecture) → Microarchitecture → Circuits → Electrons
At every layer we synchronize
• Algorithm: task dependencies
• ISA: sequential control flow (implicit)
• Microarchitecture: ready bits
• Circuit: clock cycle (implicit)
Who understands this?
• Should this be part of students' parallelism education?
• Where should it come in the curriculum?
• Can students even understand these different layers?
Parallel to sequential to parallel
• Guri says: think sequential, execute parallel
  • i.e., don't throw away 60 years of computing experience
• The original HPS model of out-of-order execution
  • Synchronization is obvious: restricted data flow (sketched below)
• At the higher level, parallel at larger granularity
  • Pragmas in Java? Who would have thought!
  • Dave Kuck's CEDAR project, vintage 1985
  • Synchronization is necessary: coarse-grain data flow
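A toy sketch of "think sequential, execute parallel": the program is written in sequential order, but each instruction fires as soon as its source values are ready, which is the essence of restricted data flow. The instruction format and the example program here are invented for illustration.

```python
# Toy dataflow scheduler: a sequentially-written program whose instructions
# issue out of order, each firing as soon as its operands are ready.

# (dest, op, src1, src2), listed in sequential program order
program = [
    ("a", "add", "x", "y"),
    ("b", "add", "a", "y"),    # depends on a
    ("c", "add", "x", "x"),    # independent of a and b
    ("d", "add", "b", "c"),    # depends on b and c
]

values = {"x": 1, "y": 2}      # registers that are ready at the start
pending = list(program)
cycle = 0

while pending:
    cycle += 1
    # everything whose sources were ready at the start of the cycle fires now
    firing = [ins for ins in pending if ins[2] in values and ins[3] in values]
    for dest, op, s1, s2 in firing:
        values[dest] = values[s1] + values[s2]
        pending.remove((dest, op, s1, s2))
    print(f"cycle {cycle}: fired {[ins[0] for ins in firing]}")

# cycle 1 fires a and c in parallel, cycle 2 fires b, cycle 3 fires d:
# sequential semantics preserved, parallel execution extracted by ready bits.
```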
Can we do more?
• The run-time system, part of the chip design
  • The chip knows the chip's resources
  • On-chip monitoring can supply information
  • The run-time system can direct the use of those resources
• The Cloud: the other extreme, and today's be-all
  • How do we harness its capability?
  • What is needed from the hierarchy to make it work?
My message
• Parallelism is a serious goal IF we want to solve the most challenging problems (cure cancer, predict tsunamis)
• Telling people to "think parallel" is nice, but often silly
• Examining the transformation hierarchy and seeing where we can leverage parallelism seems to me a sounder approach