
WaveScalar: everyday dataflow






Presentation Transcript


  1. WaveScalar: everyday dataflow. Steven Swanson, Ken Michelson, Andrew Schwerin, Mark Oskin. University of Washington. Sponsored by NSF and Intel.

  2. We should all be going to SIGCOMM. Things to keep you up at night, ~2016:
     • Opportunities: 8 billion transistors; 28 GHz clocks; 4 GB per DRAM chip; 120 P4s or 200,000 RISC-1 cores per die
     • Challenges: communication, defects, complexity, performance

  3. Monolithic von Neumann Processors: a phenomenal success today. But in 2016?
     • Communication: broadcast networks
     • Defect tolerance: 1 flaw -> paperweight
     • Complexity: 40-60% of design effort is validation
     • Performance: deeper pipelines unlikely to help (ISCA '02)

  4. Decentralized Processors
     • Communication
     • Defect tolerance
     • Complexity
     • Performance?
     But how do you execute?

  5. Von Neumann is Centralized
     • PC-driven fetch is the problem: one program counter
     • Dataflow is the solution

  6. Dataflow has been done before...
     • Operations fire when their data is available
     • No program counter
     • Converts control dependences to data dependences
     • Exposes massive parallelism
     • But... (a firing-rule sketch follows below)
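To make the firing rule concrete, here is a minimal C sketch of a single two-input dataflow node that fires only once both operand tokens have arrived. This is not WaveScalar code; all names and the token representation are illustrative assumptions.

```c
#include <stdio.h>

static int add(int a, int b) { return a + b; }

/* One two-input dataflow node.  It holds at most one token per input
 * and fires (applies its operation) only when both tokens are present. */
typedef struct {
    int has_left, has_right;      /* which operand tokens have arrived */
    int left, right;              /* the token values                  */
    int (*op)(int, int);          /* the operation this node performs  */
} Node;

/* Deliver one operand token; returns 1 and writes *result if the node fires. */
static int node_receive(Node *n, int which, int value, int *result) {
    if (which == 0) { n->left  = value; n->has_left  = 1; }
    else            { n->right = value; n->has_right = 1; }
    if (n->has_left && n->has_right) {      /* firing rule: all inputs ready */
        *result = n->op(n->left, n->right);
        n->has_left = n->has_right = 0;     /* tokens are consumed           */
        return 1;
    }
    return 0;                               /* still waiting for data        */
}

int main(void) {
    Node plus = { 0, 0, 0, 0, add };
    int out;
    if (!node_receive(&plus, 0, 3, &out)) printf("left token arrived; node waits\n");
    if (node_receive(&plus, 1, 4, &out))  printf("node fired: 3 + 4 = %d\n", out);
    return 0;
}
```

Note there is no program counter anywhere: whether the node runs is decided entirely by the arrival of its operands.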

  7. ...it had issues
     • Scalability
     • Dataflow never executed mainstream code: no total load-store ordering
     • Special languages: different memory semantics; no mutable data structures (mostly); functional (mostly)

  8. The WaveScalar ISA
     • WaveScalar is memory-centric dataflow
     • Compared to von Neumann: there is no fetch
     • Compared to traditional dataflow: memory ordering is a first-class citizen
     • Normal memory semantics; no need for special languages
     • We can execute conventional languages, like C

  9-16. WaveScalar example (slides 9-16 step token-by-token through the same dataflow graph)
     Source:  A[j + i*i] = i;  b = A[i*j];
     Graph: inputs i, A, and j feed two multiplies (i*i and i*j); adds compute the two
     addresses (A plus j + i*i, and A plus i*j); a Store writes i to the first address,
     and a Load reads the second address to produce b.
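For readers without the animation, the same example can be written one operation per statement so that each line corresponds to one node in the graph. This is ordinary C, just reorganized; the array size and input values are made up so the snippet runs on its own.

```c
#include <stdio.h>

int main(void) {
    int A[16] = { 0, 10, 20, 30, 40, 50, 60, 70 };
    int i = 2, j = 3;

    /* Left half of the graph: compute the store address and store i. */
    int t1 = i * i;          /* node: *                   */
    int t2 = j + t1;         /* node: +                   */
    A[t2] = i;               /* nodes: + (address), Store */

    /* Right half of the graph: compute the load address and load b. */
    int t3 = i * j;          /* node: *                   */
    int b  = A[t3];          /* nodes: + (address), Load  */

    printf("b = %d\n", b);   /* b is whatever A[i*j] held */
    return 0;
}
```

The dataflow question the next slides answer is how the hardware knows the Store must complete before the Load when the two addresses might alias.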

  17. Wave-ordered memory
     • The compiler annotates each memory operation with its own sequence number plus the
       sequence numbers of its predecessor and successor ('?' when a link is statically unknown)
     • [Figure: a chain of Loads and Stores carrying annotations such as 2-3-4, 3-4-?, 4-5-6,
       4-7-8, 5-6-8, and ?-8-9]
     • Memory requests can be sent in any order
     • Hardware reconstructs the correct order

  18. Wave-ordering example
     [Figure: annotated Loads and Stores arrive at the store buffer out of program order;
      the buffer uses the predecessor/sequence/successor annotations to issue them in order.]
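A simplified software model of that reordering might look like the sketch below. It is illustrative only, not the real WaveScalar hardware or its annotation encoding: it assumes each request carries predecessor/sequence/successor numbers and lets a request issue once either link connects it to the last issued request, so no earlier operation can still be missing.

```c
#include <stdio.h>

#define UNKNOWN (-1)     /* the '?' annotation: link not known statically */
#define MAX_OPS 16

/* Hypothetical shape of one wave-ordered memory request. */
typedef struct {
    int pred;            /* sequence # of the previous memory op, or UNKNOWN */
    int seq;             /* this op's sequence # within the wave             */
    int succ;            /* sequence # of the next memory op, or UNKNOWN     */
    const char *kind;    /* "load" or "store", for the printout only         */
} MemOp;

static MemOp buffered[MAX_OPS];
static int   n_buffered = 0;
static int   last_seq   = 0;   /* seq of the last issued op (0 = wave entry) */
static int   last_succ  = 1;   /* its successor annotation                   */

/* Issue every buffered op whose links now connect it to the last issued op. */
static void try_issue(void) {
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n_buffered; i++) {
            MemOp op = buffered[i];
            if (op.pred == last_seq || last_succ == op.seq) {
                printf("issue  %-5s seq=%d\n", op.kind, op.seq);
                last_seq  = op.seq;
                last_succ = op.succ;
                buffered[i] = buffered[--n_buffered];
                progress = 1;
                break;
            }
        }
    }
}

static void arrive(MemOp op) {
    printf("arrive %-5s seq=%d\n", op.kind, op.seq);
    buffered[n_buffered++] = op;
    try_issue();
}

int main(void) {
    /* Requests reach the buffer out of program order...                 */
    arrive((MemOp){ 1, 2, 3,       "store" });
    arrive((MemOp){ 0, 1, 2,       "load"  });
    arrive((MemOp){ 2, 3, UNKNOWN, "load"  });
    /* ...but are issued in program order: load 1, store 2, load 3.      */
    return 0;
}
```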

  19. Wave-ordered Memory
     • Waves are loop-free sections of the dataflow graph
     • Each dynamic wave has a wave number
     • Wave-ordered memory orders operations by wave number, then by sequence number within the wave
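In other words, a total order over dynamic memory operations falls out of the pair (wave number, sequence number). A tiny hypothetical comparator makes the idea explicit; the struct and names are illustrative, not part of the ISA.

```c
#include <stdio.h>

/* Illustrative only: order two dynamic memory operations first by wave
 * number (which dynamic wave they belong to), then by sequence number
 * (their position within that wave). */
typedef struct { int wave; int seq; } MemTag;

static int mem_order(MemTag a, MemTag b) {
    if (a.wave != b.wave) return a.wave < b.wave ? -1 : 1;
    if (a.seq  != b.seq)  return a.seq  < b.seq  ? -1 : 1;
    return 0;   /* the same dynamic operation */
}

int main(void) {
    MemTag last_of_wave_1  = { 1, 8 };
    MemTag first_of_wave_2 = { 2, 1 };
    /* Prints -1: everything in wave 1 precedes everything in wave 2. */
    printf("%d\n", mem_order(last_of_wave_1, first_of_wave_2));
    return 0;
}
```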

  20. The Ideal WaveScalar Machine
     • An ALU at every static instruction
     • No processor core
     • Instructions communicate directly
     [Figure: the example dataflow graph (i, A, j; multiplies, adds, Load, Store, b) mapped onto ALUs.]

  21. The WaveCache. "The I-Cache is the processor."

  22. Processing Element

  23. Domain

  24. Cluster

  25. The WaveCache
     • Long-distance communication: dynamic routing on a grid-based network, 1 cycle/cluster
     • Traditional cache coherence; normal memory hierarchy
     • Holds 16K instructions

  26. Demo!

  27. Performance
     • Binary translator from Alpha to WaveScalar
     • Baseline: ~2000 processing elements, no speculation
     • Compared to a very aggressive superscalar: 15-stage, 16-wide, 1024 registers, 1024-entry issue queue
     • Performance measured in Alpha-equivalent instructions per cycle

  28. WaveCache Performance

  29. Decentralized Processing
     • Communication
     • Defect tolerance
     • Complexity
     • Performance

  30. Dataflow vs. von Neumann

  31. Dataflow vs. von Neumann

  32. Future work
     • Beyond von Neumann emulation
     • Compiler
     • Instruction placement
     • Operating system
     • Fault tolerance
     • System integration / code migration?

  33. Conclusions
     • Decentralized computing will let you rest easy in 2016
     • WaveScalar: dataflow with normal memory!
     • WaveCache: "The I-Cache is the processor." Outperforms an OOO superscalar by 2.8x
     • Enormous opportunities for future research
     • Download at: http://wavescalar.cs.washington.edu
