A Roadmap to Restoring Computing's Former Glory
David I. August
Princeton University
(Not speaking for Parakinetics, Inc.)
Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters

10-Core Intel Xeon: “Unparalleled Performance”

[Chart: SPEC CINT performance (log scale) vs. year, 1992–2012, spanning the CPU92, CPU95, CPU2000, and CPU2006 suites. The golden era of computer architecture, with recent results running ~3 years behind the historical trend.]
P6 Superscalar Architecture (circa 1994):
• Automatic speculation and commit
• Automatic pipelining
• Automatic allocation/scheduling
• Parallel resources
Multicore Architecture (circa 2010):
• Automatic speculation and commit
• Automatic pipelining
• Automatic allocation/scheduling
• Parallel resources
[Figure: threads vs. time for parallel library calls, compared with the realizable parallelism of the whole program. Credit: Jack Dongarra.]
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

Implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization: a roadmap to restoring computing’s former glory.

[Diagram: the roadmap spans Parallel Programming, Computer Architecture, Parallel Libraries, and Automatic Parallelization.]
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
[Figure: Spec-PS-DSWP execution schedule on four cores, drawn by analogy to the P6 superscalar architecture: load (LD), work (W), and commit (C) operations for iterations 1–5 overlap across Cores 1–4 over time.]
Example:

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Figure: the Program Dependence Graph (PDG) for this loop, with control- and data-dependence edges among A, B, C, and D, beside a three-core schedule in which iterations (A1 B1 C1 D1, A2 B2 C2 D2, ...) execute over time.]
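The example loop runs as ordinary sequential C once we fill in minimal stand-ins. Everything below other than the four labeled statements is ours: work(), write_res(), and the list builder are hypothetical, and work() tolerates the NULL that the slide's B-before-C ordering passes it on the final iteration:

```c
#include <stddef.h>

typedef struct node { int val; struct node *next; } node_t;

static int results[16];
static int nresults;

/* Hypothetical stand-ins for the slide's work() and write(). */
static int work(node_t *n) { return n ? n->val * 2 : 0; }
static void write_res(int r) { results[nresults++] = r; }

void run_loop(node_t *node) {
    while (node) {               /* A: controls B, C, D                     */
        node = node->next;       /* B: loop-carried data dep; feeds A and C */
        int res = work(node);    /* C: data dep on B                        */
        write_res(res);          /* D: data dep on C                        */
    }
}

/* A three-node list for exercising the loop. */
node_t *make_list3(void) {
    static node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    return &n1;
}
```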
Spec-DOALL

(Same example loop and PDG as above.)

[Figures: speculating away the loop-carried dependences lets whole iterations run in parallel; iterations 1–3 execute concurrently, one per core.]
Spec-DOALL: speculating the loop exit

A: while (true) {        /* speculated from: while (node) */
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Figure: iterations issue speculatively across three cores; on 197.parser, this Spec-DOALL version produces a slowdown.]
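A single-threaded emulation can make the exit speculation concrete: while (node) becomes while (true), each iteration buffers its write, and a commit phase makes only validated iterations architectural. All names here are ours; in this sequential sketch the misspeculation check degenerates to the original loop test, and real Spec-DOALL would run the iterations on separate cores:

```c
#include <stddef.h>

typedef struct node { int val; struct node *next; } node_t;

#define MAX_ITERS 64

static int out[MAX_ITERS];
static int nout;

static int work(node_t *n) { return n ? n->val * 2 : 0; }

/* Returns the number of committed iterations. */
int spec_doall(node_t *node) {
    int buffered[MAX_ITERS];
    int it = 0;
    while (1) {                       /* speculated: was "while (node)"  */
        if (node == NULL || it == MAX_ITERS)
            break;                    /* misspeculation: squash and stop */
        node = node->next;            /* B */
        buffered[it++] = work(node);  /* C: write buffered, not committed */
    }
    for (int i = 0; i < it; i++)      /* commit phase: D for valid iters */
        out[nout++] = buffered[i];
    return it;
}

node_t *make_list3(void) {
    static node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    return &n1;
}
```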
Spec-DOACROSS vs. Spec-DSWP

[Figure: schedules on three cores. Both achieve a throughput of 1 iteration/cycle: Spec-DOACROSS rotates whole iterations around the cores, while Spec-DSWP assigns each pipeline stage (B, C, D) to its own core.]
Comparison: Spec-DOACROSS and Spec-DSWP

• Comm. latency = 1: Spec-DOACROSS 1 iter/cycle; Spec-DSWP 1 iter/cycle.
• Comm. latency = 2: Spec-DOACROSS 0.5 iter/cycle; Spec-DSWP still 1 iter/cycle (after the pipeline fill time).

[Figure: three-core schedules illustrating the latency problem. DOACROSS puts communication latency on the critical path between iterations, while DSWP pays it only once, as pipeline fill time.]
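The pipelining idea behind (Spec-)DSWP can be sketched single-threaded: stage 1 runs only the pointer chase (B) and pushes nodes into a FIFO; stage 2 pops them and runs the work and write (C, D). On real hardware the stages occupy different cores and the queue hides communication latency. The names and the in-order simulation below are ours, and stage 1 here pushes each node before advancing, unlike the slide's B-before-C ordering:

```c
#include <stddef.h>

typedef struct node { int val; struct node *next; } node_t;

#define QCAP 64
static node_t *queue[QCAP];     /* FIFO between the pipeline stages */
static int q_head, q_tail;

static int out[QCAP];
static int nout;

static int work(node_t *n) { return n->val * 2; }

/* Stage 1: the sequential pointer chase (statement B only). */
void stage1_traverse(node_t *node) {
    while (node && q_tail < QCAP) {
        queue[q_tail++] = node;   /* produce: send node to stage 2 */
        node = node->next;
    }
}

/* Stage 2: the parallelizable work and output (statements C and D). */
void stage2_compute(void) {
    while (q_head < q_tail)
        out[nout++] = work(queue[q_head++]);  /* consume */
}

node_t *make_list3(void) {
    static node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    return &n1;
}
```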
TLS vs. Spec-DSWP [MICRO 2010]: geomean of 11 benchmarks on the same cluster.
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
Execution Plan

char *memory;
void *alloc(int size);

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

[Figure: on three cores, calls alloc1–alloc6 serialize over time, since each call depends on the previous call's update of memory.]
Execution Plan

char *memory;
void *alloc(int size);

@Commutative
void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

[Figure: with the @Commutative annotation, calls alloc1–alloc6 spread across the three cores and run concurrently.]
Execution Plan

(Same @Commutative alloc as above.)

Easily understood non-determinism!

[Figure: the concurrent three-core schedule again; the only non-determinism is which region of memory each call receives.]
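A concrete reading of the @Commutative contract, with a lock standing in for what the runtime would enforce: calls may interleave in any order (the easily understood non-determinism is which region each caller receives), yet every call still gets a distinct, non-overlapping region. The pool size and the explicit lock are our additions:

```c
#include <pthread.h>

static char pool[1024];
static char *memory = pool;
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* @Commutative: callers accept any execution order of alloc() calls. */
void *alloc(int size) {
    pthread_mutex_lock(&alloc_lock);
    void *ptr = memory;          /* bump-pointer allocation */
    memory = memory + size;
    pthread_mutex_unlock(&alloc_lock);
    return ptr;
}
```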
~50 of ½ million LOC modified in SPECint 2000. Mods also include the Non-Deterministic Branch. [MICRO ’07, Top Picks ’08; automatic: PLDI ’11]
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]

[Figure: applying Rotate, Unroll, and Sum Reduction in different orders yields speedups of 0.90×, 0.10×, 30.0×, 1.5×, 1.1×, and 0.8×; the ordering determines whether the transformations help or hurt.]
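The search loop in iterative compilation is simple; what is expensive is evaluating each candidate sequence. A toy sketch with a made-up performance model, in which Sum Reduction only pays off right after Rotate, echoing the order sensitivity in the figure (all numbers and names here are invented):

```c
typedef enum { ROTATE, UNROLL, SUMRED } xform_t;

/* Invented model: Sum Reduction helps only right after Rotate. */
static double eval_sequence(const xform_t s[3]) {
    double speedup = 1.0;
    for (int i = 0; i < 3; i++) {
        if (s[i] == ROTATE) speedup *= 1.2;
        if (s[i] == UNROLL) speedup *= 1.1;
        if (s[i] == SUMRED)
            speedup *= (i > 0 && s[i-1] == ROTATE) ? 8.0 : 0.5;
    }
    return speedup;
}

/* Iterative compilation: try every ordering, keep the best. */
double best_speedup(void) {
    static const xform_t perms[6][3] = {
        {ROTATE, UNROLL, SUMRED}, {ROTATE, SUMRED, UNROLL},
        {UNROLL, ROTATE, SUMRED}, {UNROLL, SUMRED, ROTATE},
        {SUMRED, ROTATE, UNROLL}, {SUMRED, UNROLL, ROTATE}};
    double best = 0.0;
    for (int i = 0; i < 6; i++) {
        double sp = eval_sequence(perms[i]);
        if (sp > best) best = sp;
    }
    return best;
}
```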
PS-DSWP Complainer
PS-DSWP Complainer: Who can help me? → Programmer Annotation

(Transformations considered: Sum Reduction, Unroll, Rotate.)

• Red edges: dependences between malloc() and free()
• Blue edges: dependences between rand() calls
• Green edges: flow dependences inside the inner loop
• Orange edges: dependences between function calls
PS-DSWP Complainer: Sum Reduction
PS-DSWP Complainer: Sum Reduction, Programmer Commutative
PS-DSWP Complainer: Sum Reduction, Programmer Commutative, Library Commutative
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

Compiler Technology ↔ Architecture/Devices
• Era of DIY: Multicore, Reconfigurable, GPUs, Clusters

A compiler-technology-inspired class of architectures?