A Roadmap to Restoring Computing's Former Glory
David I. August
Princeton University
(Not speaking for Parakinetics, Inc.)
Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters

10-Core Intel Xeon: “Unparalleled Performance”

[Chart: SPEC CINT performance (log scale) vs. year, 1992–2012, spanning the CPU92, CPU95, CPU2000, and CPU2006 suites. The golden era of computer architecture, with recent results running ~3 years behind the historical trend.]
P6 Superscalar Architecture (circa 1994):
• Automatic speculation and commit
• Automatic pipelining
• Automatic allocation/scheduling
• Parallel resources
Multicore Architecture (circa 2010):
• Automatic speculation and commit
• Automatic pipelining
• Automatic allocation/scheduling
• Parallel resources
[Figure: threads vs. time for parallel library calls, compared with the realizable parallelism of the whole program. Credit: Jack Dongarra.]
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

Implicitly parallel programming with critique-based, iterative, occasionally interactive, speculatively pipelined automatic parallelization: a roadmap to restoring computing’s former glory.

[Diagram: the roadmap spans Parallel Programming, Computer Architecture, Parallel Libraries, and Automatic Parallelization.]
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
[Figure: Spec-PS-DSWP execution schedule on four cores, drawn by analogy to the P6 superscalar architecture: load (LD), work (W), and commit (C) operations for iterations 1–5 overlap across Cores 1–4 over time.]
Example:

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Figure: the Program Dependence Graph (PDG) for this loop, with control- and data-dependence edges among A, B, C, and D, beside a three-core schedule in which iterations (A1 B1 C1 D1, A2 B2 C2 D2, ...) execute over time.]
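The example loop runs as ordinary sequential C once we fill in minimal stand-ins. Everything below other than the four labeled statements is ours: work(), write_res(), and the list builder are hypothetical, and work() tolerates the NULL that the slide's B-before-C ordering passes it on the final iteration:

```c
#include <stddef.h>

typedef struct node { int val; struct node *next; } node_t;

static int results[16];
static int nresults;

/* Hypothetical stand-ins for the slide's work() and write(). */
static int work(node_t *n) { return n ? n->val * 2 : 0; }
static void write_res(int r) { results[nresults++] = r; }

void run_loop(node_t *node) {
    while (node) {               /* A: controls B, C, D                     */
        node = node->next;       /* B: loop-carried data dep; feeds A and C */
        int res = work(node);    /* C: data dep on B                        */
        write_res(res);          /* D: data dep on C                        */
    }
}

/* A three-node list for exercising the loop. */
node_t *make_list3(void) {
    static node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    return &n1;
}
```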
Spec-DOALL

(Same example loop and PDG as above.)

[Figures: speculating away the loop-carried dependences lets whole iterations run in parallel; iterations 1–3 execute concurrently, one per core.]
Spec-DOALL: speculating the loop exit

A: while (true) {        /* speculated from: while (node) */
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Figure: iterations issue speculatively across three cores; on 197.parser, this Spec-DOALL version produces a slowdown.]
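A single-threaded emulation can make the exit speculation concrete: while (node) becomes while (true), each iteration buffers its write, and a commit phase makes only validated iterations architectural. All names here are ours; in this sequential sketch the misspeculation check degenerates to the original loop test, and real Spec-DOALL would run the iterations on separate cores:

```c
#include <stddef.h>

typedef struct node { int val; struct node *next; } node_t;

#define MAX_ITERS 64

static int out[MAX_ITERS];
static int nout;

static int work(node_t *n) { return n ? n->val * 2 : 0; }

/* Returns the number of committed iterations. */
int spec_doall(node_t *node) {
    int buffered[MAX_ITERS];
    int it = 0;
    while (1) {                       /* speculated: was "while (node)"  */
        if (node == NULL || it == MAX_ITERS)
            break;                    /* misspeculation: squash and stop */
        node = node->next;            /* B */
        buffered[it++] = work(node);  /* C: write buffered, not committed */
    }
    for (int i = 0; i < it; i++)      /* commit phase: D for valid iters */
        out[nout++] = buffered[i];
    return it;
}

node_t *make_list3(void) {
    static node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    return &n1;
}
```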
Spec-DOACROSS vs. Spec-DSWP

[Figure: schedules on three cores. Both achieve a throughput of 1 iteration/cycle: Spec-DOACROSS rotates whole iterations around the cores, while Spec-DSWP assigns each pipeline stage (B, C, D) to its own core.]
Comparison: Spec-DOACROSS and Spec-DSWP

• Comm. latency = 1: Spec-DOACROSS 1 iter/cycle; Spec-DSWP 1 iter/cycle.
• Comm. latency = 2: Spec-DOACROSS 0.5 iter/cycle; Spec-DSWP still 1 iter/cycle (after the pipeline fill time).

[Figure: three-core schedules illustrating the latency problem. DOACROSS puts communication latency on the critical path between iterations, while DSWP pays it only once, as pipeline fill time.]
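The pipelining idea behind (Spec-)DSWP can be sketched single-threaded: stage 1 runs only the pointer chase (B) and pushes nodes into a FIFO; stage 2 pops them and runs the work and write (C, D). On real hardware the stages occupy different cores and the queue hides communication latency. The names and the in-order simulation below are ours, and stage 1 here pushes each node before advancing, unlike the slide's B-before-C ordering:

```c
#include <stddef.h>

typedef struct node { int val; struct node *next; } node_t;

#define QCAP 64
static node_t *queue[QCAP];     /* FIFO between the pipeline stages */
static int q_head, q_tail;

static int out[QCAP];
static int nout;

static int work(node_t *n) { return n->val * 2; }

/* Stage 1: the sequential pointer chase (statement B only). */
void stage1_traverse(node_t *node) {
    while (node && q_tail < QCAP) {
        queue[q_tail++] = node;   /* produce: send node to stage 2 */
        node = node->next;
    }
}

/* Stage 2: the parallelizable work and output (statements C and D). */
void stage2_compute(void) {
    while (q_head < q_tail)
        out[nout++] = work(queue[q_head++]);  /* consume */
}

node_t *make_list3(void) {
    static node_t n3 = {3, NULL}, n2 = {2, &n3}, n1 = {1, &n2};
    return &n1;
}
```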
TLS vs. Spec-DSWP [MICRO 2010]: geomean of 11 benchmarks on the same cluster.
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
Execution Plan

char *memory;
void *alloc(int size);

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

[Figure: on three cores, calls alloc1–alloc6 serialize over time, since each call depends on the previous call's update of memory.]
Execution Plan

char *memory;
void *alloc(int size);

@Commutative
void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

[Figure: with the @Commutative annotation, calls alloc1–alloc6 spread across the three cores and run concurrently.]
Execution Plan

(Same @Commutative alloc as above.)

Easily understood non-determinism!

[Figure: the concurrent three-core schedule again; the only non-determinism is which region of memory each call receives.]
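A concrete reading of the @Commutative contract, with a lock standing in for what the runtime would enforce: calls may interleave in any order (the easily understood non-determinism is which region each caller receives), yet every call still gets a distinct, non-overlapping region. The pool size and the explicit lock are our additions:

```c
#include <pthread.h>

static char pool[1024];
static char *memory = pool;
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* @Commutative: callers accept any execution order of alloc() calls. */
void *alloc(int size) {
    pthread_mutex_lock(&alloc_lock);
    void *ptr = memory;          /* bump-pointer allocation */
    memory = memory + size;
    pthread_mutex_unlock(&alloc_lock);
    return ptr;
}
```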
~50 of ½ million LOC modified in SPECint 2000. Mods also include the Non-Deterministic Branch. [MICRO ’07, Top Picks ’08; automatic: PLDI ’11]
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]

[Figure: applying Rotate, Unroll, and Sum Reduction in different orders yields speedups of 0.90×, 0.10×, 30.0×, 1.5×, 1.1×, and 0.8×; the ordering determines whether the transformations help or hurt.]
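The search loop in iterative compilation is simple; what is expensive is evaluating each candidate sequence. A toy sketch with a made-up performance model, in which Sum Reduction only pays off right after Rotate, echoing the order sensitivity in the figure (all numbers and names here are invented):

```c
typedef enum { ROTATE, UNROLL, SUMRED } xform_t;

/* Invented model: Sum Reduction helps only right after Rotate. */
static double eval_sequence(const xform_t s[3]) {
    double speedup = 1.0;
    for (int i = 0; i < 3; i++) {
        if (s[i] == ROTATE) speedup *= 1.2;
        if (s[i] == UNROLL) speedup *= 1.1;
        if (s[i] == SUMRED)
            speedup *= (i > 0 && s[i-1] == ROTATE) ? 8.0 : 0.5;
    }
    return speedup;
}

/* Iterative compilation: try every ordering, keep the best. */
double best_speedup(void) {
    static const xform_t perms[6][3] = {
        {ROTATE, UNROLL, SUMRED}, {ROTATE, SUMRED, UNROLL},
        {UNROLL, ROTATE, SUMRED}, {UNROLL, SUMRED, ROTATE},
        {SUMRED, ROTATE, UNROLL}, {SUMRED, UNROLL, ROTATE}};
    double best = 0.0;
    for (int i = 0; i < 6; i++) {
        double sp = eval_sequence(perms[i]);
        if (sp > best) best = sp;
    }
    return best;
}
```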
PS-DSWP Complainer
PS-DSWP Complainer: Who can help me? → Programmer Annotation

(Transformations considered: Sum Reduction, Unroll, Rotate.)

• Red edges: dependences between malloc() and free()
• Blue edges: dependences between rand() calls
• Green edges: flow dependences inside the inner loop
• Orange edges: dependences between function calls
PS-DSWP Complainer: Sum Reduction
PS-DSWP Complainer: Sum Reduction, Programmer Commutative
PS-DSWP Complainer: Sum Reduction, Programmer Commutative, Library Commutative
Multicore Needs:
• Automatic resource allocation/scheduling, speculation/commit, and pipelining.
• Low-overhead access to programmer insight.
• Code reuse, ideally including support for legacy codes as well as new codes.
• Intelligent automatic parallelization.

One Implementation:
[Diagram: new or existing sequential code and libraries, annotated with programmer insight, pass through DSWP-family optimizations, speculative optimizations, other optimizations, and machine-specific performance primitives to yield parallelized code; a Complainer/Fixer reports back to the programmer.]
Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

Compiler Technology ↔ Architecture/Devices
• Era of DIY: Multicore, Reconfigurable, GPUs, Clusters

A compiler-technology-inspired class of architectures?