High-Bandwidth Packet Switching on the Raw General-Purpose Architecture
Gleb Chuvpilo, Saman Amarasinghe
MIT LCS Computer Architecture Group
September 19, 2002
Talk at a Glance
• Motivation
• Architecture of Internet Routers
• Raw Processor Overview
• Raw Router Architecture
• Switch Fabric Design
• Distributed Scheduling Algorithm
• Results and Analysis
• Future Work and Conclusion
Motivation
• Build a fast IP router on a general-purpose architecture
Why?
• Flexibility: new protocols and services
• Price: economies of scale
Architecture of Internet Routers
[Figure: block diagram showing a Network Processor, four Forwarding Engines, four Interfaces, and a Switch Fabric]
Raw Processor Overview
• 16 MIPS-like tiles on a single die
• 2 Megabytes of SRAM on-chip
• Over a thousand signal I/O pins
• Over 200 Gbps of external chip bandwidth
• Scalable to thousands of tiles!
Raw Communication Mechanisms
• Two static networks
• Two dynamic networks
Raw Static Networks
• Destinations known at compile time
• Message size known at compile time
• Cycle-by-cycle switch schedule
• Three-cycle nearest-neighbor send-to-use latency
• No processing overhead
Raw Dynamic Networks
• Unpredictable events
• External asynchronous interrupts
• Cache misses
• 15- to 30-cycle nearest-neighbor send-to-use latency (message header processing overhead)
Given: Four Networks…
[Figure: four networks, numbered 1 to 4]
Problem: Mapping?
[Figure: how to map Dynamic Communication onto a Static Interconnect?]
Solution: Rotating Crossbar
[Figure: crossbar with inputs In 0 to In 3 and outputs Out 0 to Out 3]
Rotating Crossbar Highlights
• Borrows the idea of a Token Ring network: absolute fairness
• Algorithm uses the two static networks; the dynamic networks are idle
• All deadlock-free configurations are scheduled at compile time
• Four headers and the token location define a global configuration
• Global configuration is computed in a distributed manner at run time
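The fairness argument behind the token can be illustrated with a small Python sketch. This is not the router's actual scheduling algorithm: it assumes a simplified model in which each of the four inputs requests at most one output, priority is decided purely by distance from the token holder, and the token advances by one position per round. The names `arbitrate`, `run_rounds`, and `NUM_PORTS` are hypothetical.

```python
NUM_PORTS = 4  # four input ports, as in the talk


def arbitrate(requests, token):
    """Grant outputs to inputs, visiting inputs in order from the token holder.

    requests[i] is the output requested by input i, or None if input i is idle.
    Returns a dict {input: output} of granted connections.
    Assumption: conflicts are resolved only by distance from the token;
    routing constraints of the real crossbar are ignored here.
    """
    granted = {}
    taken = set()
    for step in range(NUM_PORTS):
        i = (token + step) % NUM_PORTS
        out = requests[i]
        if out is not None and out not in taken:
            granted[i] = out
            taken.add(out)
    return granted


def run_rounds(requests, rounds):
    """Repeat the same request pattern, rotating the token every round.

    Returns how many times each input was granted its request.
    """
    token = 0
    wins = [0] * NUM_PORTS
    for _ in range(rounds):
        for i in arbitrate(requests, token):
            wins[i] += 1
        token = (token + 1) % NUM_PORTS
    return wins


# Worst case for fairness: every input wants output 0.
print(run_rounds([0, 0, 0, 0], 8))  # [2, 2, 2, 2]
```

Because the token visits every input equally often, each input wins the contested output the same number of times over a full rotation, which is the sense of "absolute fairness" in this simplified model.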
Phases of the Algorithm
[Figure: phases divided between the TILE PROCESSOR and the SWITCH PROCESSOR: headers_request, headers, send_prev_config, choose_new_config, route_body, confirm, update_token]
Configuration Space
• Let’s count the configurations:
  SPACE = |Hdr0| × … × |Hdr3| × |Token|,
  where |Hdr0| = … = |Hdr3| = 5 and |Token| = 4,
  therefore SPACE = 5^4 × 4 = 2,500 distinct configurations
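The count can be checked by brute-force enumeration. The Python sketch below treats each header as one of five abstract values and the token as one of four positions; the slide does not spell out what the five header values mean, so the integer encodings here are placeholders, and `enumerate_configurations` is a hypothetical name.

```python
from itertools import product

HEADER_VALUES = range(5)    # |Hdr_i| = 5 possible values per header
TOKEN_POSITIONS = range(4)  # |Token| = 4 possible token locations


def enumerate_configurations():
    """Yield every (hdr0, hdr1, hdr2, hdr3, token) tuple."""
    for headers in product(HEADER_VALUES, repeat=4):
        for token in TOKEN_POSITIONS:
            yield (*headers, token)


space = sum(1 for _ in enumerate_configurations())
print(space)  # 2500
```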
So What?...
• Each tile has 8,192 words of instruction memory; the switch has the same
• 8,192 / 2,500 ≈ 3.3 instructions per configuration: not enough!
• Would need to use off-chip memory: slow!
• Need to minimize SPACE
Minimization
[Figure: diagram labeled in, out, cwprev, cwnext, ccwprev, ccwnext]
Outcome of Minimization
• We cut the number of configurations by a factor of 78: now there are only 32 entries!
• The program can fit in the local instruction memory!
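The arithmetic behind these numbers can be reproduced in a few lines of Python. The constants come from the slides; treating the instruction memory as evenly divided among configurations is a simplifying assumption, and `words_per_configuration` is a hypothetical helper.

```python
INSTRUCTION_WORDS = 8192   # per-tile instruction memory, from the talk
SPACE_BEFORE = 5**4 * 4    # 2,500 configurations before minimization
SPACE_AFTER = 32           # entries after minimization, from the talk


def words_per_configuration(memory_words, configurations):
    """Average instruction words available per configuration (even split)."""
    return memory_words / configurations


reduction = SPACE_BEFORE / SPACE_AFTER
before = words_per_configuration(INSTRUCTION_WORDS, SPACE_BEFORE)
after = words_per_configuration(INSTRUCTION_WORDS, SPACE_AFTER)

print(f"reduction: {reduction:.1f}x")        # reduction: 78.1x
print(f"before: {before:.1f} words each")    # before: 3.3 words each
print(f"after: {after:.0f} words each")      # after: 256 words each
```

Under this even-split assumption, the budget grows from about 3 instruction words per configuration to 256, which is why the minimized table fits in local instruction memory.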
Implementation
• Raw Router was tested in a cycle-accurate simulator of the Raw processor
• Raw prototype clock speed is assumed to be 250 MHz
• The focus of research is on switch fabric, NOT on route lookup, etc.
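With the assumed 250 MHz clock, the cycle counts quoted earlier for the static and dynamic networks can be converted to wall-clock time. The Python sketch below does only that unit conversion; `cycles_to_ns` is a hypothetical helper, and the results are derived from the slide figures, not measured.

```python
CLOCK_HZ = 250e6  # assumed Raw prototype clock speed, from the talk


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1e9 / clock_hz


static_ns = cycles_to_ns(3)                          # static network latency
dynamic_ns = (cycles_to_ns(15), cycles_to_ns(30))    # dynamic network range

print(f"static:  {static_ns:.0f} ns")                              # static:  12 ns
print(f"dynamic: {dynamic_ns[0]:.0f} to {dynamic_ns[1]:.0f} ns")   # dynamic: 60 to 120 ns
```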
Future Work
• Take advantage of dynamic networks
• Implement IP route lookup
• Add computation on data (encryption)
• Add support for multicast traffic
• Implement Quality of Service
• Add virtual output queueing
• Explore larger router configurations
Conclusion
• Implemented a gigabit switch on Raw
• Mapped dynamic communication to static interconnect
• Can intermix switch fabric with computation
• High-bandwidth I/O allows performance of custom ASIC processors