Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures • Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal • Presented by: Sarah Lynn Bird
Scalar Operand Networks • “A set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph” • Two components: a physical interconnection network and an operation-operand matching system
Example Scalar Operand Networks • Register file • Raw microprocessor
Design Issues • Delay Scalability • Intra-component delay • Inter-component delay • Managing latency • Bandwidth Scalability • Deadlock and Starvation • Efficient Operation-Operand Matching • Handling Exceptional Events
Operation-Operand Matching • 5-tuple of costs <SO, SL, NHL, RL, RO> • SO: Send Occupancy • The number of cycles that the ALU wastes in sending • SL: Send Latency • The number of cycles of delay for the message on the send side of the network • NHL: Network Hop Latency • The number of cycles of delay per network hop • RL: Receive Latency • The number of cycles of delay between when the final input arrives and when the consuming instruction is issued • RO: Receive Occupancy • The number of cycles that the ALU wastes in employing a remote value • (Combined into an end-to-end delivery cost in the sketch below)
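To make the 5-tuple concrete, here is a minimal Python sketch that combines the five components into an end-to-end operand delivery cost. It assumes the components simply add and that network distance is counted in hops; the `SonCosts` type, the function name, and the example tuple values are illustrative assumptions, not definitions from the paper.

```python
# Minimal sketch of the <SO, SL, NHL, RL, RO> cost model.
# Assumption: the five components add, with hop latency paid once per hop.

from dataclasses import dataclass

@dataclass
class SonCosts:
    so: int   # send occupancy: ALU cycles lost on the send side
    sl: int   # send latency: cycles before the message leaves the sender
    nhl: int  # network hop latency: cycles per hop traversed
    rl: int   # receive latency: cycles from final-operand arrival to issue
    ro: int   # receive occupancy: ALU cycles lost consuming the remote value

def operand_delivery_cost(c: SonCosts, hops: int) -> int:
    """End-to-end cycles to move one scalar operand between two ALUs."""
    return c.so + c.sl + c.nhl * hops + c.rl + c.ro

# Illustrative example: a <0, 1, 1, 1, 0> network over 3 hops costs 5 cycles.
print(operand_delivery_cost(SonCosts(0, 1, 1, 1, 0), hops=3))
```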
Raw Design • 2 static networks • Instructions from a 64KB cache • Point-to-point operand transport • 2 dynamic networks • Memory traffic, interrupts, user-level messages • 8-stage in-order single-issue pipeline • 4-stage pipelined FPU • 32KB data cache • 32KB instruction cache • 16 tiles on a chip
Experiments • Beetle: a cycle-accurate simulator • Two configurations: the actual scalar operand network, and a parameterized scalar operand network without contention • Data cache misses are modeled; instruction cache misses are assumed not to occur • Memory Model • Compiler maps memory to tiles • Each location has one home site (mapping sketched below) • Benchmarks • From SPEC92, SPEC95, and the Raw benchmark suite • Dense matrix codes and one secure hash algorithm
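As an illustration of the home-site idea, the Python sketch below maps each address to a single owning tile. The cache-line size and the interleaving function are assumptions chosen for illustration; they are not the actual mapping produced by the Raw compiler.

```python
# Hypothetical illustration of a "home site" memory model: every address
# is owned by exactly one tile.

CACHE_LINE_BYTES = 32   # assumed line size, for illustration only
NUM_TILES = 16          # a 4 x 4 Raw array

def home_tile(address: int) -> int:
    """Map an address to the single tile that serves as its home site."""
    return (address // CACHE_LINE_BYTES) % NUM_TILES

# Two addresses in the same line share a home site; the next line moves on.
assert home_tile(0x1000) == home_tile(0x1010)
assert home_tile(0x1000) != home_tile(0x1020)
```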
Benchmark Scaling • Speedup of each benchmark on N tiles relative to its performance on a single tile
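A quick sketch of the scaling metric itself; the cycle counts in the example are made-up placeholders, not measured results from the paper.

```python
# Speedup on N tiles relative to the single-tile run of the same benchmark.

def speedup(cycles_one_tile: int, cycles_n_tiles: int) -> float:
    return cycles_one_tile / cycles_n_tiles

# Placeholder numbers purely for illustration.
print(speedup(cycles_one_tile=1_000_000, cycles_n_tiles=80_000))  # 12.5
```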
Effect of Send & Receive Occupancy • 64 tiles • Parameterized network without contention • <n, 1, 1, 1, 0> and <0, 1, 1, 1, n>
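The sweep below mirrors this experiment's tuples in a self-contained sketch, again assuming the 5-tuple components simply add; the single-hop distance and the chosen values of n are illustrative, not the paper's measurement setup.

```python
# Sweep n in <n, 1, 1, 1, 0> (send occupancy) and <0, 1, 1, 1, n>
# (receive occupancy). Occupancy is paid by the ALU on every remote
# operand, so per-operand overhead grows linearly with n.

def cost(so: int, sl: int, nhl: int, rl: int, ro: int, hops: int = 1) -> int:
    return so + sl + nhl * hops + rl + ro

for n in (0, 1, 2, 4, 8):
    print(f"n={n}: <n,1,1,1,0> -> {cost(n, 1, 1, 1, 0)} cycles, "
          f"<0,1,1,1,n> -> {cost(0, 1, 1, 1, n)} cycles")
```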
Effect of Send or Receive Latencies • Applications with coarser-grained parallelism are less sensitive to send/receive latencies • Overall, applications are less sensitive to send/receive latencies than to send/receive occupancies, since exposed latency can often be overlapped with independent work while occupancy consumes ALU cycles on every message
Other Experiments • Increasing Hop Latency • Removing Contention • Comparing with Other Networks
Conclusions • Many difficult issues with designing scalar operand networks • Send and receive occupancies have the biggest impact on performance • Network contention, multicast, and send/receive latencies have a smaller impact