WaveScalar: everyday dataflow
Steven Swanson, Ken Michelson, Andrew Schwerin, Mark Oskin
University of Washington
Sponsored by NSF and Intel

We should all be going to SIGCOMM
Things to keep you up at night (~2016)
• Opportunities
  • 8 billion transistors; 28 GHz
  • 4 GB per DRAM chip
  • 120 P4s OR 200,000 RISC-1 per die
• Challenges
  • Communication
  • Defects
  • Complexity
  • Performance

Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
• Communication: broadcast networks
• Defect tolerance: 1 flaw -> paperweight
• Complexity: 40-60% of design is validation
• Performance: deeper pipes unlikely (ISCA02)

Decentralized Processors
• Communication
• Defect tolerance
• Complexity
• Performance?
But how do you execute?

Von Neumann is Centralized
• PC-driven fetch is the problem
• One program counter
• Dataflow is the solution

Dataflow has been done before...
• Operations fire when data is available
• No program counter
• Convert true control dependences to data dependences
• Exposes massive parallelism
• But...

...it had issues
• Scalability
• Dataflow never executed mainstream code
• No total load-store ordering
• Special languages
• Different memory semantics
• No mutable data structures (mostly)
• Functional (mostly)

The WaveScalar ISA
• WaveScalar is memory-centric dataflow
• Compared to von Neumann
  • There is no fetch
• Compared to traditional dataflow
  • Memory ordering is a first-class citizen
  • Normal memory semantics
  • No need for special languages
• We can execute conventional languages, like C

WaveScalar example
A[j + i*i] = i;
b = A[i*j];
[Figure: dataflow graph for the example. Inputs i, A, and j feed two multiplies (*) and three adds (+), which compute the addresses for a Store and a Load; the Load produces b.]

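The example above can be approximated in software. The following is a minimal Python sketch, not the WaveScalar toolchain: the node names, the `GRAPH` table, and `run_example` are hypothetical, and memory is modelled as a plain list. Arithmetic instructions fire as soon as all of their operands are available, with no program counter. The Store and the Load are issued in program order by hand here; providing that ordering in hardware is the job of wave-ordered memory.

```python
import operator

# Hypothetical encoding of the example's dataflow graph:
# node name -> (operation, names of the tokens it consumes).
GRAPH = {
    "i*i":        (operator.mul, ("i", "i")),
    "i*j":        (operator.mul, ("i", "j")),
    "j+i*i":      (operator.add, ("j", "i*i")),
    "store_addr": (operator.add, ("A", "j+i*i")),
    "load_addr":  (operator.add, ("A", "i*j")),
}


def run_example(i, j, base, memory):
    """Execute A[j + i*i] = i; b = A[i*j]; by dataflow firing."""
    tokens = {"i": i, "j": j, "A": base}   # values available so far
    waiting = dict(GRAPH)                  # instructions that have not fired
    fired = []                             # firing order, for inspection
    while waiting:
        # An instruction is ready when every operand token has arrived.
        ready = [name for name, (_, inputs) in waiting.items()
                 if all(x in tokens for x in inputs)]
        if not ready:
            raise RuntimeError("deadlock: no instruction has all its operands")
        for name in ready:
            op, inputs = waiting.pop(name)
            tokens[name] = op(*(tokens[x] for x in inputs))
            fired.append(name)
    # Assumption: memory operations are applied in program order
    # (Store first, then Load), standing in for wave-ordered memory.
    memory[tokens["store_addr"]] = tokens["i"]
    b = memory[tokens["load_addr"]]
    return b, fired


# With i=2 and j=4 both addresses are A+8, so the Load must see the Store.
memory = [0] * 16
b, fired = run_example(i=2, j=4, base=0, memory=memory)
```

The inputs i=2, j=4 are chosen so that j + i*i and i*j coincide, which makes the load-after-store dependence visible: if the Load were allowed to run first, b would read the stale value.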
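Wave-ordered memory
• Compiler annotates memory operations
• Send memory requests in any order
• Hardware reconstructs the correct order
• Annotations: Sequence #, Successor, Predecessor
[Figure: graph of Load and Store operations, each annotated with three numbers; "?" marks a value unknown at compile time. Annotations in extraction order: Load 2 3 4; Store 3 4 ?; Store 4 5 6; Load 4 7 8; Load 5 6 8; Store ? 8 9.]
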
Wave-ordering Example
[Figure: the annotated Load/Store graph from the previous slide, with its memory requests arriving at a store buffer.]

Wave-ordered Memory
• Waves are loop-free sections of the dataflow graph
• Each dynamic wave has a wave number
• Wave-ordered memory
  • Wave numbers
  • Sequence numbers

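One way to picture how hardware could reconstruct the order is sketched below in Python. This is an illustration under stated assumptions, not the actual store-buffer design: the `MemOp`, `StoreBuffer`, and `reorder` names are hypothetical, each annotation is read as (predecessor, sequence #, successor) with `None` standing for "?", the pairing of Load/Store with each triple is inferred from the slide figure, and only a single wave is modelled (wave numbers are ignored). A request is released when the last released operation named it as successor, or when it names the last released operation as predecessor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemOp:
    kind: str              # "Load" or "Store"
    pred: Optional[int]    # predecessor sequence #, None if unknown ("?")
    seq: int               # this operation's sequence #
    succ: Optional[int]    # successor sequence #, None if unknown ("?")


class StoreBuffer:
    """Releases memory requests of one wave in program order (sketch)."""

    def __init__(self, start_seq, start_succ):
        # Assumption: an earlier operation start_seq has already issued.
        self.last_seq = start_seq
        self.last_succ = start_succ
        self.pending = {}
        self.issued = []

    def _ready(self, op):
        # Either the previous op named this one as its successor,
        # or this op names the previous one as its predecessor.
        return (op.seq == self.last_succ
                or (op.pred is not None and op.pred == self.last_seq))

    def receive(self, op):
        self.pending[op.seq] = op
        progress = True
        while progress:                      # drain everything now unblocked
            progress = False
            for cand in list(self.pending.values()):
                if self._ready(cand):
                    del self.pending[cand.seq]
                    self.issued.append(cand)
                    self.last_seq, self.last_succ = cand.seq, cand.succ
                    progress = True
                    break


def reorder(arrivals, start_seq, start_succ):
    buf = StoreBuffer(start_seq, start_succ)
    for op in arrivals:
        buf.receive(op)
    return [op.seq for op in buf.issued]


# Requests for the path through operations 5 and 6, arriving out of order.
arrivals = [
    MemOp("Store", None, 8, 9),
    MemOp("Store", 4, 5, 6),
    MemOp("Load", 2, 3, 4),
    MemOp("Load", 5, 6, 8),
    MemOp("Store", 3, 4, None),
]
order = reorder(arrivals, start_seq=2, start_succ=3)
```

The two "?" entries are what make both links necessary: operation 4 cannot name its successor because a branch follows it, so operation 5 (or 7) is matched through its predecessor field instead, and operation 8 is matched through the successor field of whichever operation preceded it.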
The Ideal WaveScalar Machine
• An ALU at every static instruction
• No processor core
• Instructions communicate directly
[Figure: the example dataflow graph (i, A, j, *, *, +, +, +, Load, Store, b) with each instruction placed on its own ALU.]

The WaveCache
The I-Cache is the processor.

Domain
[Figure]

Cluster
[Figure]

The WaveCache
• Long distance communication
  • Dynamic routing
  • Grid-based network
  • 1 cycle/cluster
• Traditional cache coherence
• Normal memory hierarchy
• 16K instructions

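As a rough illustration of the grid-based network, the Python sketch below estimates long-distance latency between two clusters. It rests on assumptions the slide does not state: clusters are addressed by hypothetical (x, y) grid coordinates, a message takes a shortest path, "1 cycle/cluster" is read as one cycle per cluster-to-cluster hop, and contention is ignored.

```python
def cluster_hops(src, dst):
    """Number of cluster-to-cluster hops on a 2D grid (Manhattan distance)."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def network_latency_cycles(src, dst, cycles_per_cluster=1):
    """Estimated inter-cluster latency, assuming a fixed cost per hop."""
    return cluster_hops(src, dst) * cycles_per_cluster


# Example: from cluster (0, 0) to cluster (3, 2) on the assumed grid.
latency = network_latency_cycles((0, 0), (3, 2))
```

Under these assumptions latency grows with distance on the grid, which is why where instructions are placed matters for performance.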
Demo!

Performance
• Binary translator from Alpha to WaveScalar
• Baseline
  • ~2000 processing elements
  • No speculation
• Compare to a very aggressive superscalar
  • 15-stage, 16-wide
  • 1024 registers, 1024-entry issue queue
• Measure performance in Alpha-equivalent instructions per cycle

Decentralized Processing
• Communication
• Defect tolerance
• Complexity
• Performance

Future work
• Beyond von Neumann emulation
• Compiler
• Instruction placement
• Operating system
• Fault tolerance
• System integration/code migration?

Conclusions
• Decentralized computing will let you rest easy in 2016
• WaveScalar: Dataflow with normal memory!!!
• WaveCache
• “The I-Cache is the processor.”
• Outperforms an OOO superscalar by 2.8x
• Enormous opportunities for future research
• Download at: http://wavescalar.cs.washington.edu