
A Roadmap to Restoring Computing's Former Glory


Presentation Transcript


  1. A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)

  2. Era of DIY: • Multicore • Reconfigurable • GPUs • Clusters. "10 Cores!" 10-Core Intel Xeon, "Unparalleled Performance." [Chart: SPEC CINT performance (log scale) by year, 1992–2012, across the CPU92, CPU95, CPU2000, and CPU2006 suites; the golden era of computer architecture gives way to performance running roughly 3 years behind the historical trend.]

  3. P6 Superscalar Architecture (circa 1994). [Diagram: the hardware provides automatic speculation, automatic pipelining, automatic allocation/scheduling, and commit over its parallel resources.]

  4. Multicore Architecture (circa 2010). [Diagram: the same labels over a multicore chip; the parallel resources remain, but speculation, pipelining, allocation/scheduling, and commit are no longer automatic.]

  5. [Chart, credit Jack Dongarra: threads over time for a program parallelized only through parallel library calls, versus the realizable parallelism; the sequential stretches between library calls cap the achievable speedup.]

  6. “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

  7. Multicore Needs:
      • Automatic resource allocation/scheduling, speculation/commit, and pipelining.
      • Low-overhead access to programmer insight.
      • Code reuse; ideally, this includes support of legacy codes as well as new codes.
      • Intelligent automatic parallelization.
      A roadmap to restoring computing's former glory: implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization. [Diagram: the roadmap spans Parallel Programming, Computer Architecture, Parallel Libraries, and Automatic Parallelization.]

  8. Multicore Needs (as on the previous slide). One implementation: [Diagram: new or existing sequential code and new or existing libraries, each carrying insight annotations, flow through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer component feeds critiques back to the programmer.]

  9. Spec-PS-DSWP. [Diagram: an execution schedule on Cores 1–4 echoing the P6 superscalar pipeline; load (LD:1–LD:5), work (W:1–W:4), and commit (C:1–C:3) operations from successive iterations overlap across the cores over time.]

  10. Program Dependence Graph (PDG). Example:
      A: while (node) {
      B:   node = node->next;
      C:   res = work(node);
      D:   write(res);
         }
      [Diagram: the PDG of this loop, showing control dependences and data dependences among statements A–D, alongside a schedule of iterations A1/B1/C1/D1 and A2/B2/C2/D2 on Cores 1–3 over time.]

  11. Spec-DOALL. Same example loop and PDG as slide 10. [Diagram: the schedule before speculation; the loop-carried dependences keep the iterations effectively serialized across Cores 1–3.]

  12. Spec-DOALL (continued). [Diagram: with the blocking dependences speculated away, iterations 1–3 run concurrently, one per core, on Cores 1–3.]

  13. Spec-DOALL performance. To enable DOALL scheduling, the loop exit is speculated: "A: while (node)" becomes "while (true)", and the actual end of the list is handled as misspeculation. [Diagram: iterations spread across Cores 1–3; on 197.parser, however, Spec-DOALL yields a slowdown.]

  14. Spec-DOACROSS vs. Spec-DSWP. [Diagram: schedules for both on Cores 1–3, each sustaining a throughput of 1 iteration/cycle. In Spec-DOACROSS, whole iterations (B, C, D) rotate around the cores; in Spec-DSWP, each core runs one pipeline stage (all Bs on Core 1, all Cs on Core 2, all Ds on Core 3) for every iteration.]
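To make the pipelining concrete, here is a minimal DSWP-style sketch in C of the example loop from slide 10, split into two stages that communicate through a software queue. This is an illustrative sketch, not the talk's runtime system: node_t, work(), and the spin-wait single-producer/single-consumer queue are simplified stand-ins chosen for brevity.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct node { struct node *next; int val; } node_t;

    /* Simplified single-producer/single-consumer queue. */
    #define QSIZE 1024
    static node_t *queue[QSIZE];
    static volatile unsigned head, tail;

    static void produce(node_t *n) {
        while (tail - head == QSIZE) ;      /* spin while full */
        queue[tail % QSIZE] = n;
        __sync_synchronize();               /* publish entry before advancing tail */
        tail++;
    }

    static node_t *consume(void) {
        while (tail - head == 0) ;          /* spin while empty */
        node_t *n = queue[head % QSIZE];
        __sync_synchronize();
        head++;
        return n;
    }

    static int work(node_t *n) { return 2 * n->val; }   /* placeholder for work() */

    static void *stage1(void *list) {       /* statements A and B: pointer chase */
        for (node_t *n = list; n; n = n->next)
            produce(n);
        produce(NULL);                      /* end-of-stream marker */
        return NULL;
    }

    static void *stage2(void *ignored) {    /* statements C and D: the real work */
        node_t *n;
        while ((n = consume()) != NULL)
            printf("%d\n", work(n));        /* write(res) */
        return NULL;
    }

    int main(void) {
        node_t c = { NULL, 3 }, b = { &c, 2 }, a = { &b, 1 };
        pthread_t t1, t2;
        pthread_create(&t1, NULL, stage1, &a);
        pthread_create(&t2, NULL, stage2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

The structural point: the loop-carried pointer chase stays on one core, the work runs downstream on another, and inter-core communication flows only forward, which is what keeps it off the loop's critical path in the comparison that follows.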

  15. Comparison: Spec-DOACROSS and Spec-DSWP, the communication latency problem. [Diagram: with communication latency 1, both sustain 1 iteration/cycle. With latency 2, Spec-DOACROSS drops to 0.5 iteration/cycle, while Spec-DSWP pays only a longer pipeline fill time and still sustains 1 iteration/cycle.] The asymmetry comes from where communication sits: in DOACROSS the loop-carried dependence crosses cores on every iteration, so each added cycle of latency lengthens the interval between iteration starts, whereas in DSWP each stage keeps its loop-carried dependences on a single core and communicates only forward, so latency delays the first result but not the steady-state rate.

  16. TLS vs. Spec-DSWP [MICRO 2010]: geomean of 11 benchmarks on the same cluster.

  17. Multicore Needs:
      ✓ Automatic resource allocation/scheduling, speculation/commit, and pipelining.
      • Low-overhead access to programmer insight.
      • Code reuse; ideally, this includes support of legacy codes as well as new codes.
      • Intelligent automatic parallelization.
      [Diagram: the implementation pipeline from slide 8.]

  18. Execution plan.
      char *memory;
      void *alloc(int size);

      void *alloc(int size) {
          void *ptr = memory;
          memory = memory + size;
          return ptr;
      }
      [Diagram: calls alloc1 through alloc6 execute one after another across Cores 1–3; the dependence through memory serializes them.]

  19. Execution plan with annotation.
      char *memory;
      void *alloc(int size);

      @Commutative
      void *alloc(int size) {
          void *ptr = memory;
          memory = memory + size;
          return ptr;
      }
      [Diagram: with alloc() marked @Commutative, alloc1 through alloc6 are free to run concurrently on Cores 1–3.]

  20. Execution plan with annotation (continued). [Diagram: the same schedule; the calls may complete in any order, changing which addresses are returned but not the program's meaning. Easily understood non-determinism!]
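As a sketch of what the annotation licenses (assuming the talk's semantics for @Commutative, which is an annotation consumed by the parallelizer, not a standard C attribute), the allocator below is made safe for out-of-order invocation. The lock supplies atomicity; the annotation asserts that call order does not matter:

    #include <pthread.h>

    static char heap[1 << 20];                 /* illustrative backing store */
    static char *memory = heap;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* @Commutative: any two calls may execute in either order; both
       orders hand out disjoint chunks, merely at different addresses. */
    void *alloc(int size) {
        pthread_mutex_lock(&m);                /* atomicity, not ordering */
        void *ptr = memory;
        memory = memory + size;
        pthread_mutex_unlock(&m);
        return ptr;
    }

The observable outcome, the set of chunks handed out, is the same under every interleaving; only the addresses vary. That is exactly the "easily understood non-determinism" of the slide.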

  21. Roughly 50 of ½ million lines of code modified in SPEC CINT 2000. Modifications also include the Non-Deterministic Branch annotation. [MICRO '07, Top Picks '08; automatic: PLDI '11]

  22. Multicore Needs:
      ✓ Automatic resource allocation/scheduling, speculation/commit, and pipelining.
      ✓ Low-overhead access to programmer insight.
      ✓ Code reuse; ideally, this includes support of legacy codes as well as new codes.
      • Intelligent automatic parallelization.
      [Diagram: the implementation pipeline from slide 8.]

  23. Iterative Compilation [Cooper '05; Almagor '04; Triantafyllis '05]. [Diagram: a search tree over orderings of Rotate, Unroll, and Sum Reduction; different sequences yield speedups ranging from 0.10x through 0.8x, 0.90x, 1.1x, and 1.5x up to 30.0x, so the profitable ordering must be searched for rather than fixed in advance.]
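A minimal sketch of the iterative part in C. Everything tool-facing here is hypothetical: the -fpass-order= flag, the pass names, and loop.c are invented for illustration, since the slide names no concrete driver. The shape is what matters: build each candidate ordering, time it, keep the best.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Build and time one candidate pass ordering; returns wall-clock
       seconds or -1.0 on failure. The -fpass-order= flag is invented
       for this sketch; a real driver would use its compiler's actual
       pass-ordering interface. */
    static double time_candidate(const char *order) {
        char cmd[256];
        snprintf(cmd, sizeof cmd, "cc -O2 -fpass-order=%s loop.c -o cand", order);
        if (system(cmd) != 0) return -1.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("./cand > /dev/null") != 0) return -1.0;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void) {
        /* All six orderings of the three passes from the slide. */
        const char *orders[] = {
            "rotate,unroll,sumred", "rotate,sumred,unroll",
            "unroll,rotate,sumred", "unroll,sumred,rotate",
            "sumred,rotate,unroll", "sumred,unroll,rotate",
        };
        const char *best = NULL;
        double best_t = 1e30;
        for (int i = 0; i < 6; i++) {
            double t = time_candidate(orders[i]);
            if (t >= 0.0 && t < best_t) { best_t = t; best = orders[i]; }
        }
        if (best)
            printf("best ordering: %s (%.3f s)\n", best, best_t);
        return 0;
    }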

  24. PS-DSWP Complainer

  25. PS-DSWP Complainer: "Who can help me?" Candidate fixes: programmer annotation, Sum Reduction, Unroll, Rotate (Sum Reduction is sketched below).
      • Red edges: dependences between malloc() and free().
      • Blue edges: dependences between rand() calls.
      • Green edges: flow dependences inside the inner loop.
      • Orange edges: dependences between function calls.
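For readers unfamiliar with the transform the Complainer keeps proposing: a sum reduction breaks the loop-carried dependence on an accumulator by privatizing it per thread and combining the partial sums afterward. A minimal hand-written sketch in C with pthreads (array contents, sizes, and thread count are illustrative; in the talk this is a compiler transformation, not hand-written code):

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define T 4
    static double a[N];
    static double partial[T];          /* one privatized accumulator per thread */

    static void *chunk(void *arg) {
        long t = (long)arg;
        double s = 0.0;
        for (long i = t; i < N; i += T)
            s += a[i];
        partial[t] = s;
        return NULL;
    }

    int main(void) {
        pthread_t th[T];
        for (long i = 0; i < N; i++) a[i] = 1.0;
        for (long t = 0; t < T; t++)
            pthread_create(&th[t], NULL, chunk, (void *)t);
        double sum = 0.0;
        for (long t = 0; t < T; t++) {
            pthread_join(th[t], NULL);
            sum += partial[t];         /* combine partial sums after the loop */
        }
        printf("%f\n", sum);
        return 0;
    }

One caveat worth noting: floating-point addition is not exactly associative, so reordering the sum can perturb the result, which is one reason a tool might ask the programmer to approve the transform rather than apply it silently.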

  26. PS-DSWP Complainer. [Diagram: the Sum Reduction fix applied to the offending dependence.]

  27. PS-DSWP Complainer. [Diagram: Sum Reduction plus a programmer-supplied Commutative annotation.]

  28. PS-DSWP Complainer. [Diagram: Sum Reduction, the programmer's Commutative annotation, and a library-supplied Commutative annotation.]

  29. PS-DSWP Complainer. [Diagram: the final state, with Sum Reduction and both Commutative annotations in place.]

  30. Multicore Needs:
      ✓ Automatic resource allocation/scheduling, speculation/commit, and pipelining.
      ✓ Low-overhead access to programmer insight.
      ✓ Code reuse; ideally, this includes support of legacy codes as well as new codes.
      ✓ Intelligent automatic parallelization.
      [Diagram: the implementation pipeline from slide 8.]

  31. Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].

  32. Restoration of Trend

  33. "Compiler Advances Double Computing Power Every 18 Years!" – Proebsting's Law. Compiler technology meets architecture/devices in the era of DIY: • Multicore • Reconfigurable • GPUs • Clusters. A compiler-technology-inspired class of architectures?

  34. The End
