CDC/CRA CHiPs Mentoring Workshop
High Performance Interconnects
Timothy M. Pinkston, Professor, USC
July 25-27, 2009
My Background
• Education:
  • BSEE (minor in CS): The Ohio State Univ., '85
  • MSEE (Computer Engineering): Stanford U., '86
  • PhD EE (Computer Engineering, Comp Arch): Stanford U., '93
• Experience:
  • Industry: AT&T Bell Labs, '85-'86; IBM Intern, '89-'90 (summers); Hughes Research Labs (HRL) Doctoral Fellow, '90-'93
  • Academia: University of Southern California, '93-present
  • Government: NSF, Jan. '06 - Dec. '08
• Research Interests:
  • Computer systems architecture: interconnection networks, on-chip networks for multicore and multiprocessor systems
• Recent Activities:
  • "Interconnection Networks" with Jose Duato, book chapter in Computer Architecture: A Quantitative Approach, 4th edition, J. L. Hennessy and D. A. Patterson (2006)
  • Lead Program Director for Expeditions in Computing program: NSF CISE, $40M award portfolio in inaugural year (2008)
Interconnection Networks
• The subsystem that connects individual devices together into a community of communicating devices
• Device (End Node):
  • Component in a computer
  • A computer
  • System of computers
• Interconnection Network:
  • Interfaces and Links
  • Communication protocol
  • Routers (switches)
• Goal: Transfer the maximum amount of data reliably in the least amount of time (and energy, cost) so as not to bottleneck overall system performance
[Figure: end nodes, each with a device behind SW and HW interfaces, connected by links to routers forming the interconnection network; internetworking joins networks together]
Different Networks for Different Scales
• Wide-Area Networks (WANs): distances on the order of 5 x 10^6 meters
• Local-Area Networks (LANs): distances on the order of 5 x 10^3 meters
• System-Area Networks (SANs): distances on the order of 5 x 10^0 meters
• On-Chip Networks (OCNs): distances on the order of 5 x 10^-3 meters
[Figure: distance (meters) versus number of devices interconnected, from 1 to >100,000]
Increasing Parallelism on Chips
[Figure: minimum feature size (um) versus year of technology availability, from 2um down through 1um, 0.35um, 0.18um, 0.09um, and 0.045um; successive chip sketches show on-chip parallelism rising from Level 1 through Level 4, from a single CPU + FPU with L1/L2 and memory controller, through the multicore era (multiple CPUs with per-core L1/L2 and shared L3 + MC), to many-core chips at 0.045um. Adapted from Nhon Quach, Intel.]
Increasing Parallelism in Systems
• IBM Blue Gene (www.ibm.com)
• Blue Gene/L 3D Torus Network: a 3-dimensional (XYZ) torus interconnection network connecting tens to hundreds of thousands of devices
Defects, Faults, Chip Yield and Lifetime
[Figure: trends in chip (system) failure rate over time, the classic bathtub curve with an infant mortality period, a useful lifetime period, and an aging period; technology scaling (ITRS'04) pushes the curve up, while fault-resilient designs pull it down]
• Technology scaling adversely impacts chip yield and chip/system failure rate (manufacturing defects, soft and hard faults, wear-out lifetime)
• Adaptive, self-correcting, self-repairable architectures are needed to combat decreasing chip reliability with successive technology generations
• Intel predicts at least 5-10% of chip resources will be used for ensuring reliability (Source: "Platform 2015", www.intel.com/go/platform2015)
Transporting Packets within a Network
• Goal: Transfer the maximum amount of data reliably in the least amount of time (and energy, cost) so as not to bottleneck overall system performance
• Network structure and functions for transporting data packets:
  • Topology: What network paths are possible for packets?
  • Routing: Which of the possible paths are allowable for packets? (a sketch of one common routing function follows below)
  • Flow Control & Arbitration: When are paths available for packets?
  • Switching: How are paths allocated to packets?
  • Router Microarchitecture: Implementation of router internal paths
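To make the routing function concrete, here is a minimal sketch of dimension-order (XY) routing on a 2-D mesh, the routing scheme the later slides assume. The function name, coordinate scheme, and port labels are illustrative, not from the slides.

```python
# Dimension-order (XY) routing: resolve the X offset completely before
# turning into Y, which keeps routes deadlock-free on a mesh.

def xy_route(cur, dst):
    """Return the output port a packet takes at node `cur` toward `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # still off in X: move along X first
        return "X+" if dx > cx else "X-"
    if cy != dy:                      # X resolved: now move along Y
        return "Y+" if dy > cy else "Y-"
    return "EJECT"                    # arrived: deliver to the local node

# Example: a packet at (1, 1) headed for (3, 0) exits X+ until X matches.
assert xy_route((1, 1), (3, 0)) == "X+"
assert xy_route((3, 1), (3, 0)) == "Y-"
assert xy_route((3, 0), (3, 0)) == "EJECT"
```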
Flow Control of Data Packets
• Poor flow control can reduce link efficiency
• "Handshaking" flow control (a small simulation sketch follows below)
  • Receiver transmits a handshake when ready for the next packet
  • Sender can transmit only after receiving the handshake signal
  • Transfer is non-pipelined: simple, but low throughput and high latency
[Figure: sender and receiver routers joined by a data link and a control link; packets queue in the receiver's buffer queue, the queue is not serviced, and a Handshake signal returns on the control link]
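A minimal sketch of why handshaking is non-pipelined, assuming a symmetric one-way link delay; the function name and cycle accounting are illustrative, not from the slides.

```python
# Handshaking flow control: the sender may transmit only after the
# receiver's handshake arrives, so the link idles for a full round
# trip per packet (non-pipelined transfer).

def handshake_cycles(num_packets, link_delay):
    """Cycles to deliver `num_packets` with one handshake per packet."""
    cycles = 0
    for _ in range(num_packets):
        cycles += link_delay      # packet travels to the receiver
        cycles += link_delay      # handshake travels back to the sender
    return cycles

# Example: 10 packets over a 5-cycle link cost a 10-cycle round trip
# each, so throughput is at best one packet per round trip.
print(handshake_cycles(10, 5))    # -> 100
```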
Flow Control of Data Packets
• Poor flow control can reduce link efficiency
• "Stop & Go" flow control (see the sketch below)
  • The receiver's control bit tells the sender Stop or Go; a packet is injected only while the bit is "Go"
  • When the Stop threshold is reached, a Stop notification is signaled; while in Stop, the sender cannot inject packets
  • When the Go threshold is reached, a "Go" notification is sent and the sender resumes
  • Transfer is pipelined: improved throughput and latency, if buffer queues are large enough
[Figure: sender and receiver routers with data and control links; the receiver buffer queue is marked with Stop and Go thresholds]
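A minimal sketch of the Stop & Go mechanism. The queue depth and the two thresholds are illustrative values, not from the slides, and signaling delay is ignored for brevity.

```python
# Stop & Go flow control: the receiver flips a control bit between
# "Go" and "Stop" as its queue occupancy crosses two thresholds.

class StopGoReceiver:
    def __init__(self, depth=10, stop_at=8, go_at=4):
        self.depth, self.stop_at, self.go_at = depth, stop_at, go_at
        self.occupancy = 0
        self.signal = "Go"            # control bit seen by the sender

    def accept(self, pkt):
        self.occupancy += 1
        if self.occupancy >= self.stop_at:
            self.signal = "Stop"      # notify sender: stop injecting

    def drain(self):
        if self.occupancy:
            self.occupancy -= 1
        if self.occupancy <= self.go_at:
            self.signal = "Go"        # notify sender: resume injecting

# The sender injects one packet per cycle only while it sees "Go";
# the hysteresis between the two thresholds absorbs signaling delay.
rx = StopGoReceiver()
for pkt in range(12):
    if rx.signal == "Go":
        rx.accept(pkt)
print(rx.signal, rx.occupancy)        # -> Stop 8
```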
Flow Control of Data Packets
• Poor flow control can reduce link efficiency
• "Credit-based" flow control (see the sketch below)
  • Sender sends packets whenever its credit counter is not zero
  • Receiver returns credits as buffer slots become available; the sender resumes injecting when the credit counter > 0
  • Transfer is pipelined: improved throughput and latency with smaller buffer queues
[Figure: sender and receiver routers with data and control links; the sender's credit counter counts down from 10 as packets are injected and back up (+5) as credits return]
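A minimal sketch of credit-based flow control: the sender holds one credit per free receiver buffer slot, spends a credit per packet, and regains a credit as each packet drains. Class and method names are illustrative, not from the slides.

```python
# Credit-based flow control on a single link.

class CreditLink:
    def __init__(self, buffer_slots=10):
        self.credits = buffer_slots   # credit counter at the sender
        self.queue = []               # receiver buffer queue

    def send(self, pkt):
        if self.credits == 0:
            return False              # counter is zero: sender must wait
        self.credits -= 1
        self.queue.append(pkt)
        return True

    def drain(self):
        if self.queue:
            self.queue.pop(0)
            self.credits += 1         # credit flows back to the sender

link = CreditLink(buffer_slots=3)
sent = [link.send(p) for p in range(4)]
print(sent)                           # -> [True, True, True, False]
link.drain()                          # one slot frees, one credit returns
print(link.send(99))                  # -> True
```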
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• A router with a single buffer queue per input port
[Figure: a K x K router; input port i feeds one buffer queue holding packets destined for output ports X+, X-, Y+, and Y- intermixed]
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• Head-of-line (HoL) blocking in a router with a single queue per input port: a blocked packet at the head of the queue stalls every packet behind it, even packets bound for free output ports
[Figure: 2-dimensional mesh network with dimension-order routing, one physical channel per link; a blocked head packet stalls the queue]
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• A router with two queues per input port: two virtual channels
[Figure: a K x K router; a DEMUX at input port i steers arriving packets into a split buffer queue (two virtual channels), feeding output ports X+, X-, Y+, and Y-]
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• HoL blocking is reduced in a router with two queues per input port
[Figure: 2-dimensional mesh network with dimension-order routing, two virtual channels per physical channel (virtual channel 0 and virtual channel 1)]
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• HoL blocking is not eliminated in a router with virtual channels: when no VCs are available, packets still block (see the sketch below)
[Figure: 2-dimensional mesh network with dimension-order routing, two virtual channels per physical channel; a packet blocks when no VCs are available]
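A minimal sketch of an input port split into two virtual channels. Packet fields, names, and the busy-output test are illustrative assumptions; the point is that a blocked head packet in one VC no longer stalls the other VC, though with only two VCs blocking can still occur.

```python
from collections import deque

class InputPort:
    def __init__(self, num_vcs=2):
        self.vcs = [deque() for _ in range(num_vcs)]

    def enqueue(self, pkt, vc):
        self.vcs[vc].append(pkt)      # DEMUX: steer packet into a VC

    def schedulable(self, output_busy):
        """Pop and return head packets whose requested output is free."""
        return [vc.popleft() for vc in self.vcs
                if vc and not output_busy(vc[0])]

port = InputPort()
port.enqueue({"out": "X+"}, vc=0)     # head of VC0 wants the busy X+ port
port.enqueue({"out": "Y-"}, vc=0)     # stuck behind it in VC0
port.enqueue({"out": "Y+"}, vc=1)     # VC1 can still make progress
busy = lambda pkt: pkt["out"] == "X+"
print(port.schedulable(busy))         # -> [{'out': 'Y+'}]
```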
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• A router with virtual output queuing (VOQ): each input port keeps one queue per output port, so a k x k router requires k queues per input port
[Figure: a K x K router; a DEMUX at input port i steers packets into split buffer queues by destination output port X+, X-, Y+, Y-]
Flow Control of Data Packets
• Improving flow control with split buffer organizations
• HoL blocking is eliminated at a router with VOQ, but not at its neighbor: HoL blocking can still occur at the neighboring router!! (see the sketch below)
[Figure: 2-dimensional mesh network with dimension-order routing; the VOQ router forwards freely while packets back up at the neighboring router]
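A minimal sketch of virtual output queuing at one input port. The port and field names are illustrative assumptions. Sorting packets by destination output port means a packet bound for a busy output never sits in front of packets bound for free outputs within this router; as the slide notes, congestion at a downstream router can still back traffic up.

```python
from collections import deque

OUTPUTS = ["X+", "X-", "Y+", "Y-"]    # k queues for a k-output router

class VOQInputPort:
    def __init__(self):
        self.queues = {out: deque() for out in OUTPUTS}

    def enqueue(self, pkt):
        self.queues[pkt["out"]].append(pkt)   # sort by destination port

    def heads(self):
        """One candidate per output; the arbiter picks among them."""
        return {out: q[0] for out, q in self.queues.items() if q}

port = VOQInputPort()
port.enqueue({"out": "X+", "id": 1})
port.enqueue({"out": "Y-", "id": 2})
print(sorted(port.heads()))           # -> ['X+', 'Y-']
```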
Resilient Interconnection Networks
• Reduce chip-kill in the presence of permanent faults with dynamic reconfiguration of on-chip networks
• Start from a 2-D mesh network with XY routing (deadlock-free)
• If a core's router & link is faulty → causes five failed links
• The network can be dynamically reconfigured to up*/down* (u*/d*) routing, remaining deadlock-free! (a sketch of the u*/d* legality check follows below)
• Later, if the u*/d* root fails → causes four more links to fail
• Only the up*/down* link directions within the skyline region of the new root are affected by the fault
• Reconfigured again to regain connectivity: no chip-kill!!
Many such fascinating problems are in need of innovative solutions!
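A minimal sketch of the up*/down* routing rule the reconfiguration relies on. Each link is assigned an "up" direction toward the root, and a route is legal only if it never takes an up-hop after a down-hop; forbidding that one transition breaks all cyclic channel dependencies, which is why the reconfigured network stays deadlock-free. The function and labeling are illustrative, not from the slides.

```python
def legal_udstar(path_directions):
    """path_directions: sequence of 'up'/'down' hops along a route."""
    seen_down = False
    for d in path_directions:
        if d == "down":
            seen_down = True
        elif seen_down:               # an up-hop after a down-hop
            return False              # forbidden down->up transition
    return True

print(legal_udstar(["up", "up", "down", "down"]))   # -> True
print(legal_udstar(["up", "down", "up"]))           # -> False
```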
In Conclusion
• Interconnection networks are key to exploiting parallelism
  • on-chip networks between cores within a chip
  • off-chip networks between chips and boards across a system
• Many open research questions remain:
  • network topology, routing, arbitration, switching, and flow control designs that maximize throughput and minimize latency
  • innovative resource management techniques that enable adaptive, power-aware, fault-resilient, reliable interprocessor communication
  • the list goes on ...
• High performance interconnection network design is an exciting area of computer systems architecture research
• The future awaits!
Interconnect Media & Form Factors
• OCNs (~0.01 m): on-chip metal layers
• SANs (~1-10 m): printed circuit boards, InfiniBand and Myrinet connectors
• LANs (~100 m): Cat5E twisted pair, coaxial cables (Ethernet)
• WANs (>1,000 m): fiber optics
[Figure: media types versus distance (meters), from 0.01 to >1,000]
Flow Control of Data Packets
• Comparison of "Stop & Go" with "Credit-based" (a worked buffer-sizing example follows below)
• Stop & Go timeline: Stop signal returned by receiver → sender stops transmission → last packet reaches receiver buffer → packets in buffer get processed → Go signal returned to sender → sender resumes transmission → first packet reaches buffer
• Credit-based timeline: sender uses last credit → last packet reaches receiver buffer → packets get processed and credits returned → sender transmits packets → first packet reaches buffer
• Flow control latency observed by the receiver buffer: the gap between draining the buffer and the first new packet arriving, which is shorter in the credit-based case
[Figure: the two timelines side by side, annotated with Stop/Go transitions and credit returns]
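A minimal worked example of what the comparison implies for buffer sizing, under assumed numbers (a 5-cycle one-way link delay, one packet injected per cycle). The reasoning, not the specific values, comes from the slides: both schemes must cover a link round trip of in-flight packets to keep the link busy.

```python
link_delay = 5        # cycles, one way (illustrative assumption)
pkt_per_cycle = 1     # injection rate when the link is not throttled

# Credit-based: full throughput once credits cover one round trip,
# since a credit spent now returns one round trip later.
min_credit_buffer = 2 * link_delay * pkt_per_cycle
print("credit-based buffer slots:", min_credit_buffer)    # -> 10

# Stop & Go: after Stop is signaled, up to a round trip of packets is
# already in flight, so the queue needs that much headroom above the
# Stop threshold (and similar slack above empty at the Go threshold)
# to avoid overflow and underrun while notifications are in transit.
stop_headroom = 2 * link_delay * pkt_per_cycle
print("extra slots above Stop threshold:", stop_headroom)  # -> 10
```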