Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
It starts with the raw material (Moore's Law)
• The first microprocessor (Intel 4004), 1971
  • 2,300 transistors
  • 106 kHz
• The Pentium chip, 1993
  • 3.1 million transistors
  • 66 MHz
• Today
  • More than one billion transistors
  • Frequencies in excess of 5 GHz
• Tomorrow?
Too many people do not realize: parallelism did not start with multi-core
• Pipelining
• Out-of-order execution
• Multiple operations in a single microinstruction
• VLIW (horizontal microcode exposed to the software)
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
One thousand Mickey Mouse cores
• Why not a million? Why not ten million?
• Let's start with 16
• What if we could replace 4 with one more powerful core?
• ...and we learned:
  • One more powerful core is not enough
  • Sometimes we need several
• MorphCore was born
• BUT not all MorphCore (fixed function vs. flexibility)
The Asymmetric Chip Multiprocessor (ACMP)
[Figure: three chip floorplans compared — the "Tile-Large" approach (a few large cores), the "Niagara" approach (many small, Niagara-like cores), and the ACMP approach (a large core plus many Niagara-like cores).]
Large core vs. Small core
• Large core
  • Out-of-order
  • Wide fetch (e.g., 4-wide)
  • Deeper pipeline
  • Aggressive branch predictor (e.g., hybrid)
  • Many functional units
  • Trace cache
  • Memory dependence speculation
• Small core
  • In-order
  • Narrow fetch (e.g., 2-wide)
  • Shallow pipeline
  • Simple branch predictor (e.g., gshare)
  • Few functional units
Server throughput
• The good news: not a software problem
  • Each core runs its own problem
• The bad news: how many pins?
  • Memory bandwidth (see the sketch below)
• More bad news: how much energy?
  • Each core runs its own problem
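A back-of-envelope sketch of the pin problem. The core count, per-core bandwidth demand, pin count, and per-pin signaling rate below are invented, plausible numbers, not figures from the talk; the point is only that aggregate demand grows with cores while pins do not.

```python
# Toy estimate: aggregate memory bandwidth demanded by N independent cores
# versus what a pin-limited memory interface can supply.

CORES = 256                # assumed core count
BW_PER_CORE_GBS = 2.0      # assumed sustained demand per core (GB/s)
DATA_PINS = 256            # assumed pins available for memory data
GBITS_PER_PIN = 6.4        # assumed signaling rate per pin (Gbit/s)

demand = CORES * BW_PER_CORE_GBS            # GB/s the cores want
supply = DATA_PINS * GBITS_PER_PIN / 8      # GB/s the pins can deliver

fed = min(CORES, int(supply // BW_PER_CORE_GBS))
print(f"demand {demand:.1f} GB/s, supply {supply:.1f} GB/s")
print(f"cores actually fed: {fed} of {CORES}")
# With these made-up numbers: demand 512 GB/s, supply ~205 GB/s,
# so only ~102 of the 256 cores can be kept busy.
```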
What about GPUs and databases?
• In theory, absolutely!
• GPUs (SMT + SIMD + predication)
  • Provided there are no conditional branches (divergence — see the sketch below)
  • Provided memory accesses line up nicely (coalescing)
• Databases
  • Provided there are no critical sections
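A toy illustration of why divergence hurts. This is not GPU code, just a simulation of one SIMD warp under predication; the warp width and the branch condition are made up.

```python
# Simulate one SIMD "warp": every lane steps through both sides of a branch,
# and a predication mask decides which lanes actually commit their work.

WARP = 32
data = list(range(WARP))               # one element per lane
taken = [x % 2 == 0 for x in data]     # made-up branch condition

useful = wasted = 0
# "then" side: executed by all lanes, committed only where the branch is taken
for lane in range(WARP):
    if taken[lane]:
        useful += 1
    else:
        wasted += 1
# "else" side: executed by all lanes, committed only where it is not taken
for lane in range(WARP):
    if not taken[lane]:
        useful += 1
    else:
        wasted += 1

print(f"lane-cycles spent: {useful + wasted}, useful: {useful}, wasted: {wasted}")
# With this divergent branch, half the lane-cycles are thrown away; with a
# uniform branch (all lanes agree) the hardware could skip one side entirely.
```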
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
Dark silicon
• Too many transistors: we cannot power them all
  • All those cores powered down
  • All that parallelism wasted
• Not really: the refrigerator! (a.k.a. accelerators — see the sketch below)
  • Fork (in parallel)
  • Although not all at the same time!
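One way to read the refrigerator analogy in code: many specialized units exist on the die, but a run-time only powers up the few that fit the moment's power budget. A minimal sketch; the accelerator names, power numbers, and greedy policy are all invented for illustration.

```python
# Toy "dark silicon" manager: enable accelerators only while the active set
# stays within the chip's power budget; the rest remain dark.

ACCELERATORS = {            # watts each block would draw if powered (made up)
    "matrix_engine": 8.0,
    "video_decode": 3.0,
    "crypto": 1.5,
    "dsp": 2.5,
}
POWER_BUDGET_W = 10.0

def power_up(requested):
    """Greedily enable requested accelerators until the budget is spent."""
    active, used = [], 0.0
    for name in requested:
        cost = ACCELERATORS[name]
        if used + cost <= POWER_BUDGET_W:
            active.append(name)
            used += cost
    return active, used

active, used = power_up(["matrix_engine", "video_decode", "crypto", "dsp"])
print(active, f"{used:.1f} W of {POWER_BUDGET_W} W")
# -> ['matrix_engine', 'crypto'] 9.5 W of 10.0 W; the others stay dark
#    until something else powers down.
```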
Amdahl's Law
• The serial bottleneck always limits performance
• Heterogeneous cores AND control over them can minimize the effect (see the sketch below)
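The arithmetic behind both bullets, as a small sketch. The first function is the standard Amdahl formula; the asymmetric variant (one fast core handles the serial fraction) follows the Hill-and-Marty style of analysis, and the specific fractions and core counts are assumptions.

```python
def amdahl(serial_frac, n_cores):
    """Classic Amdahl speedup: the serial part runs on one baseline core."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_cores)

def amdahl_asymmetric(serial_frac, n_small, big_speedup):
    """Serial part runs on one big core that is big_speedup x a small core."""
    return 1.0 / (serial_frac / big_speedup + (1.0 - serial_frac) / n_small)

f = 0.05                                 # assumed 5% serial fraction
print(amdahl(f, 64))                     # ~15.4x: 5% serial code caps the gain
print(amdahl_asymmetric(f, 60, 4.0))     # ~35.3x: a big core shrinks the bottleneck
```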
The Cloud
• It is behind the curtain; how do we manage it?
• Answer: the on-chip run-time system
• Answer: pragmas beyond the Cloud
Random thoughts on Parallelism
• Why the sudden preoccupation with parallelism?
• The Silliness (or what I call Meganonsense)
  • Break the problem, use half the energy
  • 1000 Mickey Mouse cores
  • Hardware is sequential
  • Server throughput (how many pins?)
  • What about GPUs and databases?
• Current obstacles to exploiting parallelism (or are they?)
  • Dark silicon
  • Amdahl's Law
  • The Cloud
• The answer
  • The fundamental concept vis-à-vis parallelism
  • What it means re: the transformation hierarchy
The fundamental concept: Synchronization
The transformation hierarchy:
Problem → Algorithm → Program → ISA (Instruction Set Architecture) → Microarchitecture → Circuits → Electrons
At every layer we synchronize
• Algorithm: task dependencies
• ISA: sequential control flow (implicit)
• Microarchitecture: ready bits
• Circuit: clock cycle (implicit)
Who understands this?
• Should this be part of students' parallelism education?
• Where should it come in the curriculum?
• Can students even understand these different layers?
Parallel to sequential to parallel
• Guri says: think sequential, execute parallel
  • i.e., don't throw away 60 years of computing experience
• The original HPS model of out-of-order execution
  • Synchronization is obvious: restricted data flow (sketched below)
• At the higher level, parallel at larger granularity
  • Pragmas in Java? Who would have thought!
  • Dave Kuck's CEDAR project, vintage 1985
  • Synchronization is necessary: coarse-grain data flow
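A toy sketch of "think sequential, execute parallel": the program is written in sequential order, but each instruction fires as soon as its source values are ready, which is the essence of restricted data flow. The instruction format and the example program here are invented for illustration.

```python
# Toy dataflow scheduler: a sequentially-written program whose instructions
# issue out of order, each firing as soon as its operands are ready.

# (dest, op, src1, src2), listed in sequential program order
program = [
    ("a", "add", "x", "y"),
    ("b", "add", "a", "y"),    # depends on a
    ("c", "add", "x", "x"),    # independent of a and b
    ("d", "add", "b", "c"),    # depends on b and c
]

values = {"x": 1, "y": 2}      # registers that are ready at the start
pending = list(program)
cycle = 0

while pending:
    cycle += 1
    # everything whose sources were ready at the start of the cycle fires now
    firing = [ins for ins in pending if ins[2] in values and ins[3] in values]
    for dest, op, s1, s2 in firing:
        values[dest] = values[s1] + values[s2]
        pending.remove((dest, op, s1, s2))
    print(f"cycle {cycle}: fired {[ins[0] for ins in firing]}")

# cycle 1 fires a and c in parallel, cycle 2 fires b, cycle 3 fires d:
# sequential semantics preserved, parallel execution extracted by ready bits.
```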
Can we do more?
• The run-time system, part of the chip design
  • The chip knows the chip's resources
  • On-chip monitoring can supply information
  • The run-time system can direct the use of those resources
• The Cloud: the other extreme, and today's be-all
  • How do we harness its capability?
  • What is needed from the hierarchy to make it work?
My message
• Parallelism is a serious goal IF we want to solve the most challenging problems (cure cancer, predict tsunamis)
• Telling people to "think parallel" is nice, but often silly
• Examining the transformation hierarchy and seeing where we can leverage parallelism seems to me a sounder approach