Amalgam: a Reconfigurable Processor for Future Fabrication Processes
Nicholas P. Carter
University of Illinois at Urbana-Champaign

Performance = f(architecture, implementation)
[Figure: timeline (axis: Time) contrasting a sequence of LD, ADD, MUL, and ST instructions with four 1-D IDCT blocks]

Efficient Implementation
• Everything you give up in clock rate you have to make back in architectural efficiency
• Wire delay is the big limiting factor in system architectures today
• Wires get slower relative to transistors as fab. process improves
• Programmable processors moving to deeper pipelines
• Not good enough to just prevent wires from making reconf. logic slower
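
The first bullet can be made concrete with the usual relation time = cycles / clock rate. The sketch below is illustrative only: the function names and any numbers passed to them are assumptions, not figures from the talk.

```python
def required_cycle_reduction(base_clock_hz: float, reconf_clock_hz: float) -> float:
    """Factor by which the cycle count must shrink for a slower-clocked
    design to break even with the faster-clocked baseline."""
    if base_clock_hz <= 0 or reconf_clock_hz <= 0:
        raise ValueError("clock rates must be positive")
    return base_clock_hz / reconf_clock_hz


def speedup(base_cycles: float, base_clock_hz: float,
            reconf_cycles: float, reconf_clock_hz: float) -> float:
    """Speedup of the reconfigurable version over the baseline (>1 is faster)."""
    base_time = base_cycles / base_clock_hz        # seconds on the baseline
    reconf_time = reconf_cycles / reconf_clock_hz  # seconds on reconfigurable logic
    return base_time / reconf_time
```

For example, a unit clocked four times slower than the baseline has to finish the same work in a quarter of the cycles just to match it; any real speedup has to come on top of that.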

Amalgam
[Figure: block diagram showing DRAM, a multi-banked cache, the network, four PClusters, and four RClusters]

Reconfigurable Cluster Design
• 4 Register banks
  • 8 registers/bank
• 4 Reconfigurable logic segments
  • 8 Rows x 32 LBs per segment
• Array control unit
• Network interface
• Counter-clockwise flow of computation through cluster
[Figure: cluster layout showing the network interface, ACU, four segments, and four banks]
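
The cluster parameters listed on this slide can be tallied with a small record type. This is a sketch, not the authors' model: the class and field names are hypothetical, and it assumes "8 Rows x 32 LBs" means 8 rows of 32 logic blocks in each segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconfigurableCluster:
    # Defaults are the values quoted on the slide; names are illustrative.
    register_banks: int = 4
    registers_per_bank: int = 8
    segments: int = 4
    rows_per_segment: int = 8
    logic_blocks_per_row: int = 32  # assumed reading of "8 Rows x 32 LBs"

    @property
    def total_registers(self) -> int:
        return self.register_banks * self.registers_per_bank

    @property
    def logic_blocks_per_segment(self) -> int:
        return self.rows_per_segment * self.logic_blocks_per_row

    @property
    def total_logic_blocks(self) -> int:
        return self.segments * self.logic_blocks_per_segment
```

Under that reading, one cluster holds 32 registers and 1024 logic blocks.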

Reconfigurable Clock Rates

Unpipelined Critical Path
• Latches in logic blocks only resource for pipelining
• Vertical and horizontal wires carry data between logic blocks
• Wires have heavy loads, making them slower than their length would indicate
• Effect on clock rate varies significantly with fabrication process
[Figure: unpipelined critical path through LB, FF, HWIRE, VWIRE, Bank, HWIRE, VWIRE, LB, FF]

Supporting Pipelining
• Goal: make logic block delay the limiting factor on clock rate
• Add configurable latches at each wire intersection
• Problem: different paths may have different latencies
• Add retiming buffers at logic block inputs/outputs
• Add network queues to reduce synchronization overhead

Pipelined Critical Path
• Delay of individual wires < logic block delay in all processes studied
• Add configurable pipeline latches at junctions between wires
• Pipeline latches also added on carry chains within rows
[Figure: pipelined critical path through LB, FF, FF, HWIRE, VWIRE, Bank, FF, HWIRE, VWIRE, FF, LB, FF, FF]
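
The difference between the unpipelined and pipelined critical paths can be sketched as a sum versus a maximum over the path's elements. The delay values in the usage note are placeholders chosen only so that each wire is faster than the logic block; they are not measurements from the talk, and the latch overhead parameter is an assumption.

```python
def unpipelined_cycle_time(stage_delays_ps):
    """Without latches on the wires, one cycle must cover the whole path."""
    return sum(stage_delays_ps)


def pipelined_cycle_time(stage_delays_ps, latch_overhead_ps=0.0):
    """With a latch after every element, the slowest element sets the cycle."""
    return max(stage_delays_ps) + latch_overhead_ps


# Hypothetical path: logic block, then two horizontal/vertical wire pairs.
example_path_ps = [400.0, 250.0, 300.0, 250.0, 300.0]
```

With these placeholder numbers the unpipelined cycle is 1500 ps while the pipelined cycle is 400 ps, the logic block delay, which is the stated goal of the design.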

Retiming Buffers
• 5-deep chain of latches added to each logic block input
• Similar structure added to LB output
• Can “borrow” up to two cycles of additional delay from adjacent input
• Total pipeline register overhead = 17%
[Figure: chains of FF latches forming the retiming buffers]
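
A minimal sketch of what the retiming buffers have to do: delay the early inputs of a logic block until the latest one arrives. The function name is hypothetical, arrival times are assumed to be whole cycles, and the "borrow" mechanism from the adjacent input is deliberately left out, so the only limit modeled is the 5-deep chain.

```python
RETIMING_DEPTH = 5  # latches per logic block input, as stated on the slide


def retiming_delays(arrival_cycles, depth=RETIMING_DEPTH):
    """Return the extra cycles each input must wait so all inputs line up.

    Raises ValueError if any input would need more than `depth` cycles
    (borrowing from an adjacent input is not modeled here).
    """
    latest = max(arrival_cycles)
    delays = [latest - arrival for arrival in arrival_cycles]
    too_deep = [i for i, d in enumerate(delays) if d > depth]
    if too_deep:
        raise ValueError(f"inputs {too_deep} need more than {depth} cycles of delay")
    return delays
```

For inputs arriving at cycles 3, 5, and 1, the buffers would be configured for 2, 0, and 4 cycles of delay respectively.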

Register Queues
[Figure labels: Original Architecture; Network; WRITE R8, Val1; WRITE R8, Val2; Sync. Message; Register Queue; EMPTY R8; Register File]
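
One way to picture a register queue is as a small FIFO in front of a register, so that two writes such as WRITE R8, Val1 and WRITE R8, Val2 can arrive back to back without a synchronization message in between. The sketch below is an assumption-laden model, not the Amalgam hardware: the class name, the capacity, and the behavior when the queue is full or empty are all hypothetical.

```python
from collections import deque


class RegisterQueue:
    """FIFO placed in front of one register (hypothetical model)."""

    def __init__(self, capacity: int = 4):  # capacity is an assumed value
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._values = deque()

    def write(self, value) -> bool:
        """Buffer an incoming network write; False means the queue is full."""
        if len(self._values) >= self._capacity:
            return False
        self._values.append(value)
        return True

    def read(self):
        """Return the oldest buffered value, preserving arrival order."""
        if not self._values:
            raise IndexError("register queue is empty")
        return self._values.popleft()
```

The point of the model is the decoupling: the sender only stalls when the queue fills, rather than after every write.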

Implementing Pipelined Apps.
• Logical vs. Physical pipelining
  • Logical: Program-visible, uses array and registers
  • Physical: Only visible to ACU, uses pipeline registers on wires, retiming buffers
• Take advantage of decoupling provided by queues
• Applications use same reconfigurable logic configurations in different fab. processes
  • Only FSM in ACU changes
  • Applications to portability, managing intra-die variation

Experimental Methodology
• Programs simulated using Amalsim
  • Set each cluster’s clock rate independently
• Benchmarks: IDCT, Rijndael, DNA comparison
  • Fine-grained version of each benchmark does one computation
  • Medium-grained version performs four independent computations
• Programmable cluster clock rates based on ITRS
  • Limit stages to 7 FO4 delay, slightly more aggressive than ITRS
• Logic block latencies, wire lengths taken from circuit-level design of reconf. cluster in 180nm CMOS
  • Convert logic block delay to FO4, scale by FO4 delay of each fabrication process
  • Scale wire length based on fabrication process, simulate wire delay in SPICE
  • Pipeline such that reconf. cluster cycle time is determined by logic block delay
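
The FO4 scaling step described above can be sketched as follows. The constant of 360 ps per micron of feature size is a commonly quoted rule of thumb and is an assumption here; the talk does not say which FO4 values were used, and the function names are illustrative.

```python
FO4_PS_PER_UM = 360.0  # assumed rule of thumb, not taken from the talk


def fo4_delay_ps(feature_size_nm: float) -> float:
    """Approximate FO4 inverter delay for a given process."""
    return FO4_PS_PER_UM * feature_size_nm / 1000.0


def scale_delay_ps(delay_ps_at_ref: float, ref_nm: float, target_nm: float) -> float:
    """Express a delay in FO4 at the reference process, then rescale it
    by the FO4 delay of the target process."""
    delay_in_fo4 = delay_ps_at_ref / fo4_delay_ps(ref_nm)
    return delay_in_fo4 * fo4_delay_ps(target_nm)


def clock_rate_ghz(stage_fo4: float, feature_size_nm: float) -> float:
    """Clock rate when each pipeline stage is limited to `stage_fo4` FO4 delays."""
    period_ps = stage_fo4 * fo4_delay_ps(feature_size_nm)
    return 1000.0 / period_ps
```

Calling `clock_rate_ghz(7, 180)` gives the clock rate implied by a 7 FO4 stage at 180nm under this rule of thumb, and `scale_delay_ps` carries a logic block delay measured at 180nm to another process. Wire delay is not covered: the methodology simulates it in SPICE instead of scaling it by FO4.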

Pipelined Clock Rates

Fine-Grained Benchmark Perf.
• Reconfigurable version maintains about 20% perf. improvement over programmable in all fab. processes
• Pipelining only small benefit
• Majority of speedup comes from reduction in memory references

Medium-Grain Benchmark Perf.
• Pipelined architecture sees 2.6x perf improvement over programmable
• Unpipelined architecture only minor improvement over programmable
• Greater parallelism means more ability to tolerate memory delays

Limit Studies
• Believe that memory operations are much of the benefit for small tasks
  • Study limit where memory latency = 1
  • Also test theory that streaming benchmarks have enough parallelism to cover latency
• Understand how much clock rate of reconfigurable unit affects performance
  • Model reconfigurable unit at same clock rate as programmable clusters
  • Completely unreasonable for unpipelined
  • Might be indicator of what industry could do with pipelined

Unpipelined Fine-Grained
• Removing memory latencies makes programmable performance similar to reconfigurable
• Latency of reconfig. clusters has large impact on performance -- no parallelism to cover latency

Pipelined Fine-Grained
• Results similar to unpipelined
• Benefit still mostly from memory reduction

Unpipelined Medium-Grain
• Eliminating memory latencies really helps programmable
• Latency of reconf. logic an even bigger problem
• Programmable clusters can exploit parallelism through pipelines

Pipelined Medium-Grain
• Impact of memory system on reconfigurable performance very small
• Less benefit from increasing reconfigurable cluster clock rate
• With even small amounts of parallelism, throughput becomes more important than latency.
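
The last bullet can be illustrated with a toy cycle-count model. It is a simplification under stated assumptions: the computations are fully independent, a pipelined unit can start a new one every `initiation_interval` cycles, and memory effects are ignored. The function names and the latency value used in the note are hypothetical.

```python
def unpipelined_cycles(n_tasks: int, latency_cycles: int) -> int:
    """Each computation must finish before the next one starts."""
    return n_tasks * latency_cycles


def pipelined_cycles(n_tasks: int, latency_cycles: int,
                     initiation_interval: int = 1) -> int:
    """The first result takes the full latency; later ones follow at the
    initiation interval."""
    if n_tasks == 0:
        return 0
    return latency_cycles + (n_tasks - 1) * initiation_interval
```

With one computation (the fine-grained case) the two functions agree, so latency is all that matters. With four independent computations (the medium-grained case) and a hypothetical 20-cycle latency, the pipelined count is 23 cycles against 80, and the total is dominated by how often work can be started, not by how long each item takes.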

Future Directions
• ASIC-like performance with programmable systems
  • ASICs typically get 100x better performance per unit area than microprocessors
• Application-specific memory systems in a programmable chip
  • Transform memory references into communication
  • Create natural division of programs into regular and irregular blocks

Conclusion
• Reconfigurable computing must provide both speedup from custom logic and high clock rates to succeed
• Amalgam does this by limiting and tolerating wire delay at multiple levels
  • Clustered architecture
  • Segmented reconfigurable unit
  • Pipeline wire delays
• Result: 2.6x speedup over 8-way CMP in current and future fabrication processes