WaveScalar and the WaveCache Steven Swanson Ken Michelson Mark Oskin Tom Anderson Susan Eggers University of Washington CSE P548
Worries to Keep You up at Night • In 2016 • 200,000 RISC-1 processors will fit on a die. • It will take 36 cycles to cross the die. • Still a lack of ILP. • Memory latency is still a problem. • For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).
WaveScalar’s Solution: Utilize Die Capability • A sea of simple, RISC-like processors • in-order, single-issue • takes advantage of billions of transistors without exacerbating the other problems • short design & implementation time • operates at a short cycle • does not need lots of ILP • fewer defects
WaveScalar Processing Element
WaveScalar’s Solution: Short Wires • Dataflow execution model • each processor executes when its operands have arrived • same principle as out-of-order execution but applies to the processor & includes fetching • no single program counter • short wires: • no long control lines • no centralized hardware data structures • no need for sequential & individual instruction fetches
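The firing rule on this slide (an instruction executes once its operands have arrived, with no program counter) can be illustrated with a small software simulation. This is only a sketch under stated assumptions: the `Instruction` class, the `run` function, and the token format are hypothetical names chosen for illustration and do not describe the WaveScalar hardware.

```python
from collections import deque


class Instruction:
    """One dataflow node: it fires once all of its input operands have arrived."""

    def __init__(self, op, num_inputs, consumers):
        self.op = op                  # function applied to the operands
        self.num_inputs = num_inputs  # how many operands the node waits for
        self.consumers = consumers    # list of (instruction name, input port)
        self.operands = {}            # input port -> value received so far


def run(instructions, initial_tokens):
    """Deliver tokens until none remain; return each fired node's result."""
    pending = deque(initial_tokens)   # tokens are (instruction name, port, value)
    results = {}
    while pending:
        name, port, value = pending.popleft()
        inst = instructions[name]
        inst.operands[port] = value
        # Firing rule: execute only when every operand is present.
        if len(inst.operands) == inst.num_inputs:
            args = [inst.operands[p] for p in range(inst.num_inputs)]
            out = inst.op(*args)
            inst.operands.clear()
            results[name] = out
            # Send the result straight to the consumers; no program counter.
            for consumer, consumer_port in inst.consumers:
                pending.append((consumer, consumer_port, out))
    return results


def example():
    # Dataflow graph for (a + b) * (a - b) with a = 7 and b = 3.
    graph = {
        "add": Instruction(lambda x, y: x + y, 2, [("mul", 0)]),
        "sub": Instruction(lambda x, y: x - y, 2, [("mul", 1)]),
        "mul": Instruction(lambda x, y: x * y, 2, []),
    }
    # Operands arrive in an arbitrary order; execution order follows the data.
    tokens = [("sub", 1, 3), ("add", 0, 7), ("sub", 0, 7), ("add", 1, 3)]
    return run(graph, tokens)
```

In this run `sub` happens to fire before `add` because its operands are complete first, which is the point of the model: arrival of data, not fetch order, decides what executes.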
WaveScalar’s Solution: Short Wires • Dataflow execution model, cont’d. • differs from original dataflow computers • distributed tag management (matching between renamed producer-consumer registers) • special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution • all instructions in a “wave” execute on data with the same wave number
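The idea that operands are matched by wave number can be sketched as a small matching table for a two-input instruction. This is a rough illustration only: `MatchingTable` and its `arrive` method are hypothetical names, and the real tag-management hardware is not described on the slide.

```python
class MatchingTable:
    """Pairs up the operands of a two-input instruction by wave number."""

    def __init__(self):
        self.waiting = {}  # wave number -> {input port: value}

    def arrive(self, wave, port, value):
        """Record an operand; return (wave, left, right) once both are present."""
        slots = self.waiting.setdefault(wave, {})
        slots[port] = value
        if len(slots) == 2:
            del self.waiting[wave]
            return (wave, slots[0], slots[1])
        return None


def example():
    table = MatchingTable()
    fired = []
    # Operands from waves 0 and 1 arrive interleaved (say, two loop iterations).
    for wave, port, value in [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]:
        match = table.arrive(wave, port, value)
        if match is not None:
            fired.append(match)
    return fired
```

Operands from different waves never combine, even though they target the same instruction and arrive interleaved; here wave 1 completes before wave 0.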
WaveScalar’s Solution: Short Wires • Dataflow execution model • differs from original dataflow computers • explicit wave-ordered memory • compiler assigns sequence number to each memory operation in a breadth-first manner • sequence number for an operation, its predecessor & successor all sent with produced data • wave & sequence numbers provide a total order on memory operations through any traversal of a wave + normal memory semantics + no need for special dataflow languages; C & C++ programs execute just fine
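A minimal sketch of how predecessor and successor sequence numbers could let a memory system rebuild program order from operations that arrive out of order, within a single wave. The names `MemOp` and `issue_in_order` are hypothetical, and using `None` for a predecessor or successor that is unknown at compile time (because of a branch) is an assumption made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemOp:
    pred: Optional[int]  # sequence number of the previous op, None if unknown
    seq: int             # this operation's sequence number
    succ: Optional[int]  # sequence number of the next op, None if unknown
    name: str


def _next_ready(waiting, last, first_seq):
    """Return the buffered op that must issue next, or None if it is missing."""
    if last is None:
        return waiting.get(first_seq)
    if last.succ is not None:
        return waiting.get(last.succ)
    # Successor unknown (a branch follows): find the op that names `last`
    # as its predecessor.
    for op in waiting.values():
        if op.pred == last.seq:
            return op
    return None


def issue_in_order(arrivals, first_seq):
    """Buffer ops arriving in any order and release them in wave order."""
    waiting = {}
    issued = []
    last = None
    for op in arrivals:
        waiting[op.seq] = op
        while True:
            nxt = _next_ready(waiting, last, first_seq)
            if nxt is None:
                break
            issued.append(nxt.name)
            del waiting[nxt.seq]
            last = nxt
    return issued


def example():
    # Op 1 is followed by a branch to op 2 or op 3; both paths rejoin at op 4.
    # The path taken here is 1 -> 3 -> 4, and the ops arrive in reverse order.
    load_a = MemOp(pred=None, seq=1, succ=None, name="load A")
    store_b = MemOp(pred=1, seq=3, succ=4, name="store B")
    load_c = MemOp(pred=None, seq=4, succ=None, name="load C")
    return issue_in_order([load_c, store_b, load_a], first_seq=1)
```

Each link in the chain is established from either the earlier operation's successor field or the later operation's predecessor field, so the order is recoverable whichever side of the branch executes. End-of-wave handling is left out of this sketch.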
WaveScalar’s Solution: Short Wires • Nearest-neighbor communication • code placement to locate consumers near their producers • short, fast node-to-node links rather than slow broadcast networks • exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this) • instructions can dynamically migrate toward their neighbors during execution
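To make the placement goal concrete, the sketch below scores a placement by the total hop count between producers and consumers on a 2D grid. Manhattan distance as the cost metric, the function name `communication_cost`, and the example coordinates are all assumptions for illustration; the slides do not specify a cost model.

```python
def communication_cost(edges, placement):
    """Sum the grid hops between each producer and its consumer."""
    total = 0
    for producer, consumer in edges:
        (px, py), (cx, cy) = placement[producer], placement[consumer]
        total += abs(px - cx) + abs(py - cy)  # Manhattan distance
    return total


def example():
    # A four-instruction dependence chain: a -> b -> c -> d.
    edges = [("a", "b"), ("b", "c"), ("c", "d")]
    scattered = {"a": (0, 0), "b": (3, 3), "c": (0, 3), "d": (3, 0)}
    clustered = {"a": (0, 0), "b": (0, 1), "c": (1, 1), "d": (1, 0)}
    return communication_cost(edges, scattered), communication_cost(edges, clustered)
```

With these coordinates the scattered placement costs 15 hops and the clustered one 3, which is the gap that placing consumers next to their producers aims to close.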
Dynamic Optimization (figure: Branch, Common Case, Rare Case, Join) • The common case has higher costs, and the branch can detect this…
Dynamic Optimization (figure: Branch, Common Case, Rare Case, Join) • …and fix it, by moving. The join can do the same.
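The branch example can be sketched as a greedy local search: an instruction compares its usage-weighted distance to its consumers at its current grid cell and at the four neighboring cells, then moves one hop if that lowers the cost. The policy, the function names, and the usage counts are assumptions for illustration; the slides do not say how the hardware decides to move, and this sketch ignores whether the target cell is already occupied.

```python
def weighted_cost(pos, consumers):
    """Hop distance to each consumer, weighted by how often it is used."""
    x, y = pos
    return sum(count * (abs(x - cx) + abs(y - cy)) for (cx, cy), count in consumers)


def migrate_step(pos, consumers):
    """Move one hop to the cheapest neighboring cell; stay put if none is cheaper."""
    x, y = pos
    # The current position comes first, so ties keep the instruction in place.
    candidates = [pos, (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return min(candidates, key=lambda p: weighted_cost(p, consumers))


def settle(pos, consumers):
    """Repeat single-hop moves until the instruction stops moving."""
    while True:
        nxt = migrate_step(pos, consumers)
        if nxt == pos:
            return pos
        pos = nxt


def example():
    # A branch at (0, 0): the common-case target is 3 hops away and taken
    # 90 times; the rare-case target is adjacent and taken 10 times.
    consumers = [((3, 0), 90), ((0, 1), 10)]
    return migrate_step((0, 0), consumers), settle((0, 0), consumers)
```

Every move strictly lowers the weighted cost, so `settle` terminates. In this example the branch ends up in the common-case consumer's cell, trading a longer link on the rare path for a shorter one on the common path.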
WaveScalar’s Solution: Short Wires (figure: PE, Domain)
WaveScalar’s Solution: Short Wires (figure: Cluster)
WaveScalar’s Solution: Creative Use of Untapped Parallelism • Expand the window for exploiting ILP • no in-order fetch using only one PC (sucking through a straw) • place instructions with the processing elements • out-of-order execution on a grand scale • Allow multiple threads to execute concurrently • OS & applications • multiple applications, parallel threads
WaveScalar’s Solution: The I-Cache is the Processor • Model is processor-in-memory (PIM) • processing element associated with each instruction • WaveScalar version • processing elements placed in the I-cache to reduce latency
WaveScalar’s Solution: Design to Compensate for Circuit Unreliability (figure: route around processors with flaws) • Fewer design & implementation errors from the grid's simple, uniform design • decentralized control • dynamic instruction migration
Research Agenda: Architecture • WaveScalar ISA • Microarchitecture design • node design • domain size • cache-coherence across clusters • cluster arrangement • Control & memory speculation • WaveScalar instruction management • hardware for instruction placement & replacement • hardware for dynamic, self-optimizing placement
Research Agenda: Architecture • Multithreaded WaveScalar • Design of the network & routing issues • Power management • Static & dynamic fault detection & recovery (rerouting instructions) • System-level design • Application to non-silicon designs
Research Agenda: Compilers • Instruction placement • Revisit classic optimizations • code savings vs. communication costs • cache pollution vs. loop parallelism • New opportunities for optimization • a match between compiler & execution models • WaveScalar-specific instructions
Research Agenda: OS & Networking • Tension between facilitating short routines & poor instruction locality • The software side of thread management • A bunch of stuff I don’t know about • optimizing the OS interface • new thread protection policies • memory management issues • security • lazy context switching • utilizing virtual machines
Putting It All Together • Grid of hundreds (maybe thousands) of simple, dataflow processing nodes • no centralized control; scalable • few design errors; increase in yield • Processing nodes embedded in the I-cache • Instructions execute in place • Send results directly to the consumers • short, point-to-point links • Instructions can dynamically migrate • reduce latency to hot consumers • map around defects • 3X performance without any prediction mechanisms • more with them