Example 1. Two threads incrementing different counters. In the first program, the counters are on the same cache line; in the second program, they are on different cache lines.
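The preview above is the only place this extract spells out Example 1 in text, so here is a minimal sketch of the experiment it describes. It is written for this page rather than taken from the slides; the 64-byte alignment, the iteration count, and the use of std::atomic with relaxed ordering are all assumptions.

// False-sharing sketch (illustrative; not the program measured in the slides).
// Two threads each increment their own counter. In the "shared" layout both
// counters sit on the same cache line, so the line ping-pongs between cores;
// in the "padded" layout each counter gets its own line.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine {            // both counters likely on one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedLines {           // force the counters onto different cache lines
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double run(Counters& c, long iters) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const long iters = 100'000'000;
    SharedLine s;
    PaddedLines p;
    std::printf("same cache line:       %.2fs\n", run(s, iters));
    std::printf("different cache lines: %.2fs\n", run(p, iters));
}

On most current machines the padded layout is noticeably faster, because the two cores no longer fight over ownership of a single cache line; the exact gap depends on the hardware.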
1. Multi-core computing Tim Harris
3. Summary The performance of one thread is highly dependent on what other threads are doing
It's important to understand the h/w of the machine and the techniques it uses
For modest numbers of cores, it may be possible to identify large granularity tasks, with good data locality
Understand the different resource requirements of a program; computation, communication, locality
Consider how data accesses will interact with the memory system; will the computation done on additional cores pay for the data to be brought to them?
Focus on the longest running parts of the program first; be realistic about possible speedups
Aim to tackle larger workloads in constant time (cf. Gustafson's law)
Exploit asymmetric designs; larger cores for sequential perf, smaller cores where high degrees of parallelism are available
Multiple parts of an algorithm may need to be parallelised with different techniques
With modest numbers of cores, beware of overheads that must be recovered via parallelism
4. Merge sort
5. Merge sort, dual core
7. T7300 dual-core laptop
8. AMD Phenom 3-core
9. Summary (same key points as slide 3)
10. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
12. A simple microprocessor model ~ 1985 Single h/w thread
Instructions execute one after the other
Memory access time ~ clock cycle time
13. Pipelined design
14. Superscalar design
15. Realistic memory accesses... Dynamic out-of-order
Pipelined memory accesses
Speculation
16. The power wall
17. The power wall Moore's law: the area of a transistor is about 50% smaller each generation
A given chip area accommodates twice as many transistors
Pdyn = a · f · V²
Shrinking process technology (constant field scaling) allows reducing V to partially counteract increasing f
V cannot be reduced arbitrarily
Halving V more than halves max f (for a given transistor)
Physical limits
Pleak = V(Isub + Iox)
Reduce Isub (sub-threshold leakage): turn off the component; increase the threshold voltage (reduces max f)
Reduce Iox (gate-oxide leakage): increase oxide thickness (but it needs to decrease with process scale)
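To make the argument behind these formulas concrete, here is a back-of-the-envelope calculation of my own (not a slide from the deck), using the dynamic-power relation above and the idealised assumption that V can be scaled down in step with f, which the bullets above note is only partially true in practice:

\begin{align*}
\text{one core at } (f, V):\qquad P_{\mathrm{dyn}} &= a\, f\, V^2 \\
\text{two cores at } (f/2,\ V/2):\qquad P_{\mathrm{dyn}} &= 2 \cdot a \cdot \tfrac{f}{2} \cdot \left(\tfrac{V}{2}\right)^2 = \tfrac{1}{4}\, a\, f\, V^2
\end{align*}

The same nominal instruction throughput for roughly a quarter of the dynamic power is the basic case for spending a growing transistor budget on more, slower cores rather than on one faster core.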
18. The memory wall
19. The memory wall
20. The ILP wall ILP = Instruction level parallelism
Implicit parallelism between instructions in a single thread
Identified by the hardware
Speculate past memory accesses
Speculate past control transfer
Diminishing returns
21. Power wall + ILP wall + memory wall = brick wall Power wall means we can't just clock processors faster any longer
Memory wall means that many workloads' performance is dominated by memory access times
ILP wall means we can't find extra work to keep functional units busy while waiting for memory accesses
22. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
23. Multi-threaded h/w Multiple threads in a workload with:
Poor spatial locality
Frequent memory accesses
24. Multi-threaded h/w Multiple threads with synergistic resource needs
25–40. Multi-core h/w common L2
41. Multi-core h/w separate L2
42. Multi-core h/w additional L3
43. Multi-threaded multi-core h/w
44. SMP multiprocessor
45. NUMA multiprocessor
46. Three kinds of parallel hardware Multi-threaded cores
Increase utilization of a core or memory b/w
Peak ops/cycle fixed
Multiple cores
Increase ops/cycle
Don't necessarily scale caches and off-chip resources proportionately
Multi-processor machines
Increase ops/cycle
Often scale cache & memory capacities and b/w proportionately
47. Summary (same key points as slide 3)
48. AMD Phenom
49. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
50. Amdahl's law Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a 4x speed-up on the overall algorithm?
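As a worked check (my arithmetic, not an extra slide), plug f = 0.7 into the usual statement of Amdahl's law:

\[
  S(n) \;=\; \frac{1}{(1-f) + f/n} \;=\; \frac{1}{0.3 + 0.7/n} \;\le\; \frac{1}{0.3} \approx 3.33
\]

Requiring S(n) = 4 would need 0.3 + 0.7/n = 0.25, which no positive n can satisfy, so the 4x overall speed-up is unreachable no matter how many cores are used, which is presumably what the following f = 70% plots show.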
51–53. Amdahl's law, f=70%
54. Amdahl's law, f=10%
55. Amdahl's law, f=98%
56. Summary (same key points as slide 3)
57. Amdahl's law & multi-core
58. Perf of big & small cores
59. Amdahl's law, f=98%
60. Amdahl's law, f=75%
61. Amdahl's law, f=5%
62. Asymmetric chips
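The asymmetric curves that follow are the kind produced by the model in Hill & Marty's "Amdahl's law in the multicore era" (listed under Further reading); I reproduce its speedup formula here for reference, under that paper's usual assumption that a large core built from r base-core-equivalents has sequential performance perf(r) ≈ √r:

\[
  \mathrm{Speedup}_{\mathrm{asym}}(f, n, r) \;=\;
  \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]

Here n is the chip budget in base-core-equivalents: the sequential fraction runs on the one large core at perf(r), while the parallel fraction runs on that core plus the remaining n − r small cores.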
63. Amdahl's law, f=75%
64. Amdahl's law, f=5%
65–67. Amdahl's law, f=98%
68. Summary (same key points as slide 3)
69. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
70. Merge sort (2-core)
71. Merge sort (4-core)
72. Merge sort (8-core)
73. T∞ (span): critical path length
74. T1 (work): time to run sequentially
75. Tserial: optimized sequential code
76. A good multi-core parallel algorithm T1/Tserial is low
What we lose on sequential performance we must make up through parallelism
Resource availability may limit the ability to do that
T∞ grows slowly with the problem size
We tackle bigger problems by using more cores, not by running for longer (see the sketch below)
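As a concrete (if simplified) illustration of these two goals, here is a task-parallel merge-sort sketch of my own; it is not the code behind the measurements on slides 70–72, and the cutoff constant, the use of std::async for forking, and std::inplace_merge for merging are all choices made for the sketch.

#include <algorithm>
#include <future>
#include <vector>

constexpr std::size_t CUTOFF = 1 << 14;   // below this, fall back to sequential sort

// Sort v[lo, hi) by recursive splitting; the left half is forked as a task.
void merge_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo <= CUTOFF) {
        // Sequential cutoff: keeps T1 close to Tserial instead of paying
        // task overhead on tiny ranges.
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, merge_sort, std::ref(v), lo, mid);
    merge_sort(v, mid, hi);   // right half in the current thread
    left.get();
    // Sequential merge of the two sorted halves; this step dominates the span.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Usage: merge_sort(data, 0, data.size());

Forking a raw std::async task per split is cruder than a work-stealing runtime such as Cilk, but it keeps the sketch self-contained: the cutoff is what limits the T1/Tserial overhead, while the recursive splitting is what exposes work to additional cores.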
77–79. Quick-sort on good input
80. Summary (same key points as slide 3)
81. Scheduling from a DAG
82. Scheduling from a DAG
83. In CILK: Tp ≤ T1/p + c · T∞
84. In CILK: Tp ≤ T1/p + c · T∞
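To see what this bound buys, take some illustrative numbers of my own (not from the deck): suppose profiling gives T1 = 80 ms of total work and T∞ = 2 ms of critical path.

\[
  T_p \;\le\; \frac{T_1}{p} + c\, T_\infty
  \quad\Longrightarrow\quad
  T_8 \;\lesssim\; \frac{80}{8} + 2c \;=\; 10 + 2c \ \text{ms}
\]

With a modest constant c the eight-core run stays close to the ideal 10 ms; speedup only starts to flatten once p approaches the parallelism T1/T∞ = 40.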
85. Summary (same key points as slide 3)
86. Example: thunk dependencies
87. Potential parallelism
88. Limit study: parallelism vs granularity
89. Measured performance
90. Further reading Computer Architecture: A Quantitative Approach, Hennessy & Patterson
The Landscape of Parallel Computing Research: A View from Berkeley, Asanovic et al. (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf)
Amdahl's Law in the Multicore Era, Hill & Marty (http://www.cs.wisc.edu/multifacet/papers/tr1593_amdahl_multicore.pdf)
Parallel Thinking, Blelloch (http://www.cs.cmu.edu/~blelloch/papers/PPoPP09.pdf)