Example 1. Two threads incrementing different counters. In the first program, the counters are on the same cache line; in the second program, they are on different cache lines.
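The preview above is the only place this extract spells out Example 1 in text, so here is a minimal sketch of the experiment it describes. It is written for this page rather than taken from the slides; the 64-byte alignment, the iteration count, and the use of std::atomic with relaxed ordering are all assumptions.

// False-sharing sketch (illustrative; not the program measured in the slides).
// Two threads each increment their own counter. In the "shared" layout both
// counters sit on the same cache line, so the line ping-pongs between cores;
// in the "padded" layout each counter gets its own line.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine {            // both counters likely on one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedLines {           // force the counters onto different cache lines
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double run(Counters& c, long iters) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const long iters = 100'000'000;
    SharedLine s;
    PaddedLines p;
    std::printf("same cache line:       %.2fs\n", run(s, iters));
    std::printf("different cache lines: %.2fs\n", run(p, iters));
}

On most current machines the padded layout is noticeably faster, because the two cores no longer fight over ownership of a single cache line; the exact gap depends on the hardware.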
1. Multi-core computing Tim Harris
3. Summary The performance of one thread is highly dependent on what other threads are doing
It's important to understand the h/w of the machine and the techniques it uses
For modest numbers of cores, it may be possible to identify large granularity tasks, with good data locality
Understand the different resource requirements of a program; computation, communication, locality
Consider how data accesses will interact with the memory system; will the computation done on additional cores pay for the data to be brought to them?
Focus on the longest running parts of the program first; be realistic about possible speedups
Aim to tackle larger workloads in constant time (cf. Gustafson's law)
Exploit asymmetric designs; larger cores for sequential perf, smaller cores where high degrees of parallelism are available
Multiple parts of an algorithm may need to be parallelised with different techniques
With modest numbers of cores, beware of overheads that must be recovered via parallelism
4. Merge sort
5. Merge sort, dual core
7. T7300 dual-core laptop
8. AMD Phenom 3-core
9. Summary (same key points as slide 3)
10. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
12. A simple microprocessor model ~ 1985 Single h/w thread
Instructions execute one after the other
Memory access time ~ clock cycle time
13. Pipelined design
14. Superscalar design
15. Realistic memory accesses... Dynamic out-of-order
Pipelined memory accesses
Speculation
16. The power wall
17. The power wall Moore's law: the area of a transistor is about 50% smaller each generation
A given chip area accommodates twice as many transistors
Pdyn = a · f · V²
Shrinking process technology (constant field scaling) allows reducing V to partially counteract increasing f
V cannot be reduced arbitrarily
Halving V more than halves max f (for a given transistor)
Physical limits
Pleak = V(Isub + Iox)
Reduce Isub (sub-threshold leakage): turn off the component; increase the threshold voltage (reduces max f)
Reduce Iox (gate-oxide leakage): increase oxide thickness (but it needs to decrease with process scale)
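To make the argument behind these formulas concrete, here is a back-of-the-envelope calculation of my own (not a slide from the deck), using the dynamic-power relation above and the idealised assumption that V can be scaled down in step with f, which the bullets above note is only partially true in practice:

\begin{align*}
\text{one core at } (f, V):\qquad P_{\mathrm{dyn}} &= a\, f\, V^2 \\
\text{two cores at } (f/2,\ V/2):\qquad P_{\mathrm{dyn}} &= 2 \cdot a \cdot \tfrac{f}{2} \cdot \left(\tfrac{V}{2}\right)^2 = \tfrac{1}{4}\, a\, f\, V^2
\end{align*}

The same nominal instruction throughput for roughly a quarter of the dynamic power is the basic case for spending a growing transistor budget on more, slower cores rather than on one faster core.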
18. The memory wall
19. The memory wall
20. The ILP wall ILP = Instruction level parallelism
Implicit parallelism between instructions in a single thread
Identified by the hardware
Speculate past memory accesses
Speculate past control transfer
Diminishing returns
21. Power wall + ILP wall + memory wall = brick wall Power wall means we can't just clock processors faster any longer
Memory wall means that many workloads' performance is dominated by memory access times
ILP wall means we can't find extra work to keep functional units busy while waiting for memory accesses
22. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
23. Multi-threaded h/w Multiple threads in a workload with:
Poor spatial locality
Frequent memory accesses
24. Multi-threaded h/w Multiple threads with synergistic resource needs
25–40. Multi-core h/w common L2
41. Multi-core h/w separate L2
42. Multi-core h/w additional L3
43. Multi-threaded multi-core h/w
44. SMP multiprocessor
45. NUMA multiprocessor
46. Three kinds of parallel hardware Multi-threaded cores
Increase utilization of a core or memory b/w
Peak ops/cycle fixed
Multiple cores
Increase ops/cycle
Don't necessarily scale caches and off-chip resources proportionately
Multi-processor machines
Increase ops/cycle
Often scale cache & memory capacities and b/w proportionately
47. Summary (same key points as slide 3)
48. AMD Phenom
49. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
50. Amdahl's law Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a 4x speed-up on the overall algorithm?
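As a worked check (my arithmetic, not an extra slide), plug f = 0.7 into the usual statement of Amdahl's law:

\[
  S(n) \;=\; \frac{1}{(1-f) + f/n} \;=\; \frac{1}{0.3 + 0.7/n} \;\le\; \frac{1}{0.3} \approx 3.33
\]

Requiring S(n) = 4 would need 0.3 + 0.7/n = 0.25, which no positive n can satisfy, so the 4x overall speed-up is unreachable no matter how many cores are used, which is presumably what the following f = 70% plots show.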
51–53. Amdahl's law, f=70%
54. Amdahl's law, f=10%
55. Amdahl's law, f=98%
56. Summary (same key points as slide 3)
57. Amdahl's law & multi-core
58. Perf of big & small cores
59. Amdahl's law, f=98%
60. Amdahl's law, f=75%
61. Amdahl's law, f=5%
62. Asymmetric chips
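The asymmetric curves that follow are the kind produced by the model in Hill & Marty's "Amdahl's law in the multicore era" (listed under Further reading); I reproduce its speedup formula here for reference, under that paper's usual assumption that a large core built from r base-core-equivalents has sequential performance perf(r) ≈ √r:

\[
  \mathrm{Speedup}_{\mathrm{asym}}(f, n, r) \;=\;
  \frac{1}{\dfrac{1-f}{\mathrm{perf}(r)} + \dfrac{f}{\mathrm{perf}(r) + n - r}}
\]

Here n is the chip budget in base-core-equivalents: the sequential fraction runs on the one large core at perf(r), while the parallel fraction runs on that core plus the remaining n − r small cores.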
63. Amdahl's law, f=75%
64. Amdahl's law, f=5%
65–67. Amdahl's law, f=98%
68. Summary (same key points as slide 3)
69. Introduction · Why parallelism? · Parallel hardware · Amdahl's law · Parallel algorithms: T1 and T∞
70. Merge sort (2-core)
71. Merge sort (4-core)
72. Merge sort (8-core)
73. T∞ (span): critical path length
74. T1 (work): time to run sequentially
75. Tserial: optimized sequential code
76. A good multi-core parallel algorithm T1/Tserial is low
What we lose on sequential performance we must make up through parallelism
Resource availability may limit the ability to do that
T∞ grows slowly with the problem size
We tackle bigger problems by using more cores, not by running for longer (see the sketch below)
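As a concrete (if simplified) illustration of these two goals, here is a task-parallel merge-sort sketch of my own; it is not the code behind the measurements on slides 70–72, and the cutoff constant, the use of std::async for forking, and std::inplace_merge for merging are all choices made for the sketch.

#include <algorithm>
#include <future>
#include <vector>

constexpr std::size_t CUTOFF = 1 << 14;   // below this, fall back to sequential sort

// Sort v[lo, hi) by recursive splitting; the left half is forked as a task.
void merge_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo <= CUTOFF) {
        // Sequential cutoff: keeps T1 close to Tserial instead of paying
        // task overhead on tiny ranges.
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, merge_sort, std::ref(v), lo, mid);
    merge_sort(v, mid, hi);   // right half in the current thread
    left.get();
    // Sequential merge of the two sorted halves; this step dominates the span.
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Usage: merge_sort(data, 0, data.size());

Forking a raw std::async task per split is cruder than a work-stealing runtime such as Cilk, but it keeps the sketch self-contained: the cutoff is what limits the T1/Tserial overhead, while the recursive splitting is what exposes work to additional cores.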
77–79. Quick-sort on good input
80. Summary (same key points as slide 3)
81. Scheduling from a DAG
82. Scheduling from a DAG
83. In CILK: Tp ≤ T1/p + c · T∞
84. In CILK: Tp ≤ T1/p + c · T∞
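To see what this bound buys, take some illustrative numbers of my own (not from the deck): suppose profiling gives T1 = 80 ms of total work and T∞ = 2 ms of critical path.

\[
  T_p \;\le\; \frac{T_1}{p} + c\, T_\infty
  \quad\Longrightarrow\quad
  T_8 \;\lesssim\; \frac{80}{8} + 2c \;=\; 10 + 2c \ \text{ms}
\]

With a modest constant c the eight-core run stays close to the ideal 10 ms; speedup only starts to flatten once p approaches the parallelism T1/T∞ = 40.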
85. Summary (same key points as slide 3)
86. Example: thunk dependencies
87. Potential parallelism
88. Limit study: parallelism vs granularity
89. Measured performance
90. Further reading Computer Architecture: A Quantitative Approach, Hennessy & Patterson
The Landscape of Parallel Computing Research: A View from Berkeley, Asanovic et al. (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf)
Amdahl's Law in the Multicore Era, Hill & Marty (http://www.cs.wisc.edu/multifacet/papers/tr1593_amdahl_multicore.pdf)
Parallel Thinking, Blelloch (http://www.cs.cmu.edu/~blelloch/papers/PPoPP09.pdf)