Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks Kapre et al. RC Reading Group – 3/29/2006 Presenter: Ilya Tabakh
Agenda • Introduction • Background • Topology • Packet Switched • Time Multiplexed • Application • Methodology • Results • Conclusions • Wrap-up • Questions
Introduction • Dedicated spatial interconnect links on a configured FPGA network can be inefficient for sparse communication patterns • Overlaying virtual networks on top of the physical networks can help address this issue
Time-Multiplexed Pros • Can take advantage of global route information Cons • Offline computation can be compute intensive • Must allocate resources for communication schedule and all possible communication between operators
Packet-Switched Pros • No offline setup and resources for storing communication schedule • Routes are made for operators that are actually communicating Cons • Switches more complex • Routes can be less efficient
Novel Contributions of work • Demonstration of efficient and scalable static and dynamic FPGA overlay networks • Quantification of difference between offline scheduling and online routing • Quantification of performance impacts due to balancing interconnects and computing • Characterization of area and performance tradeoffs between time-multiplexed and packet-switched • Quantification of performance difference between time-multiplexed and packet-switched under varying application communication loads.
NoC • Early days – on-chip buses • Later it became necessary to investigate scalable, high-performance, low-overhead on-chip networks • Networks are required since buses scale poorly • As the number of PEs increases, communication increases and more bandwidth is needed
Communication Patterns • Need to know in order to choose network to use • Configured switching is inefficient for apps that underutilize links • Circuit switching is efficient for larger messages on shorter networks • Need to know characteristics in order to make appropriate choice
Packet Switched How they improve on past work in FPGA-based overlay networks • Allow arbitrary topologies • Use real applications and realistic PE architectures to generate traffic payloads • Network runs much faster, at 166 MHz, compared to 25-50 MHz for most prior networks
Time Multiplexed • Use a greedy router similar to the one used in the Virtual Wires project • Virtual Wires overcame pin limitation by time sharing each physical wire among logical wires and pipelining • This paper attempts to explore the entire design space as opposed to one system size or config
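A greedy offline router of the kind mentioned above can be sketched as follows. This is a generic illustration, not the Virtual Wires algorithm or the paper's router: the function name, the `routes` format, and the one-hop-per-cycle pipelining model are all assumptions made for the example.

```python
def greedy_schedule(routes):
    """Assign each message the earliest start cycle with no link conflicts.

    `routes` maps a message name to the ordered list of links it uses; a
    message occupies its i-th link on cycle start + i (one hop per cycle).
    Messages are considered in the order given, with no backtracking.
    """
    busy = set()      # (link, cycle) pairs already reserved
    schedule = {}
    for msg, links in routes.items():
        start = 0
        # Slide the whole route later in time until every hop is free.
        while any((link, start + i) in busy for i, link in enumerate(links)):
            start += 1
        busy.update((link, start + i) for i, link in enumerate(links))
        schedule[msg] = start
    return schedule


if __name__ == "__main__":
    demo = {"m0": ["a", "b"], "m1": ["a", "c"], "m2": ["d", "b"]}
    print(greedy_schedule(demo))
```

Because each physical link is reserved per cycle, several logical wires can share it over time, which is the time-sharing idea the slide attributes to Virtual Wires.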
Performance Analysis Several important quantities of the network have to be defined • PE Input Serialization: a bound on the cycle count for input • PE Output Serialization: a bound on the cycle count for output • Network Bisection: maximum number of messages that can cross the network on a given cycle • Network Latency: number of cycles required to cross the network
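The four quantities above can be combined into a rough lower bound on the cycles needed to deliver a batch of messages. The sketch below is an illustration under stated assumptions, not the paper's formula: the function name, the parameters, and the one-message-per-PE-per-cycle model are hypothetical.

```python
from collections import Counter
from math import ceil


def cycle_lower_bound(messages, bisection_width, latency, crosses_bisection):
    """Rough lower bound on cycles to deliver `messages` (list of (src, dst)).

    Assumptions of this sketch: each PE sends or receives at most one
    message per cycle, and at most `bisection_width` messages can cross
    the bisection per cycle. `crosses_bisection(src, dst)` returns True
    when a message must cross the bisection.
    """
    if not messages:
        return 0
    # Output serialization: the busiest sender needs one cycle per message.
    out_serial = max(Counter(src for src, _ in messages).values())
    # Input serialization: the busiest receiver needs one cycle per message.
    in_serial = max(Counter(dst for _, dst in messages).values())
    # Bisection: crossing messages share a limited number of links.
    crossing = sum(1 for s, d in messages if crosses_bisection(s, d))
    bisection_cycles = ceil(crossing / bisection_width)
    # No schedule can beat the largest of the individual bounds.
    return max(out_serial, in_serial, bisection_cycles, latency)
```

Whichever term is largest tells you what limits the network for a given workload: PE serialization, bisection bandwidth, or raw latency.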
Butterfly Fat Trees • Most FPGA NoCs have focused on meshes • BFTs achieve higher performance at equivalent chip size • Routing functions programmed in the split primitives determine path • Single address bit is used to make a routing decision at each switch • Time-multiplexed merge contains a context memory which stores computed routing
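To illustrate how a single address bit can drive the routing decision at each switch, here is a minimal sketch on a plain binary tree with PEs at the leaves. This is a simplification: a real BFT also offers a choice of parent switches on the way up, which is not modeled here. The function name and the hop labels are hypothetical.

```python
def route(src, dst, levels):
    """Return the hops for a message between leaf PEs `src` and `dst`.

    Assumes a plain binary tree with 2**levels leaves (a simplification
    of a BFT, which has multiple parents per switch on the way up).
    """
    size = 1 << levels
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("PE address out of range")
    # Climb until src and dst fall under the same switch.
    height = 0
    while (src >> height) != (dst >> height):
        height += 1
    hops = ["up"] * height
    # Descend: one destination address bit per switch picks the output.
    for level in range(height - 1, -1, -1):
        bit = (dst >> level) & 1
        hops.append("right" if bit else "left")
    return hops


if __name__ == "__main__":
    print(route(0, 3, 2))
```

Neighbouring PEs meet at a low switch and take short paths, while PEs in different halves of the tree climb all the way to the root.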
Packet Switched • Primitives have input queues • Split primitives compute the routing decision in a single cycle based on the destination address • Arbitration is done by selecting packets based on input queue occupancies • Network with floorplanned and pipelined primitives can operate as high as 180 MHz
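Occupancy-based arbitration at a merge primitive might look like the following sketch. The class name, the two-input shape, and the tie-break rule are assumptions for illustration; the slides only state that arbitration looks at input queue occupancies.

```python
from collections import deque


class MergePrimitive:
    """Two-input merge that forwards at most one packet per cycle."""

    def __init__(self):
        self.queues = [deque(), deque()]

    def push(self, port, packet):
        """Enqueue a packet arriving on input `port` (0 or 1)."""
        self.queues[port].append(packet)

    def step(self):
        """Forward one packet from the fuller queue, or None if both are empty."""
        # Ties go to port 0; that tie-break is an assumption of this sketch.
        port = 0 if len(self.queues[0]) >= len(self.queues[1]) else 1
        if not self.queues[port]:
            return None
        return self.queues[port].popleft()


if __name__ == "__main__":
    m = MergePrimitive()
    m.push(0, "a")
    m.push(1, "b")
    m.push(1, "c")
    print([m.step() for _ in range(4)])
```

Serving the fuller queue first drains the more congested input, at the cost of extra comparison logic in every switch.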
Time Multiplexed • Statically scheduled prior to runtime • Switching primitives contain context memory • Context memory requires 1 bit of storage per cycle • Network capable of operating at 166 MHz • Greedy routing algorithm used
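The role of the context memory can be sketched as a merge whose input selection is read from a precomputed table, one bit per cycle. The class and method names are hypothetical, and the model ignores queues entirely, since a static schedule guarantees that the selected input holds valid data on its cycle.

```python
class TimeMultiplexedMerge:
    """Two-input merge driven by a precomputed context memory."""

    def __init__(self, context):
        # context[c] is the input port (0 or 1) to forward on cycle c:
        # one bit of storage per scheduled cycle.
        self.context = list(context)

    def run(self, input0, input1):
        """Replay the schedule; input0/input1 hold each port's value per cycle."""
        out = []
        for cycle, select in enumerate(self.context):
            out.append(input1[cycle] if select else input0[cycle])
        return out


if __name__ == "__main__":
    merge = TimeMultiplexedMerge([0, 1, 1, 0])
    print(merge.run(list("abcd"), list("wxyz")))
```

No arbitration logic is needed at runtime, but the table grows with the length of the schedule, which is why longer schedules cost more area.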
Application • A real-life application was mapped onto the networks • ConceptNet – common-sense reasoning knowledge base represented as a graph • Start with an initial set of nodes, send activation from each node to its neighbors along weighted edges • Time-multiplexed network is run at 100% activity; packet-switched network is run at activity levels between 1% and 100% • Limitations • Nodes limited to 128 edges of fanout or fanin • Can only process a single edge per cycle
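The spreading-activation step described above can be sketched in a few lines. This is a toy model, not the ConceptNet PE implementation: the data layout, the function name, and the rule that a node's new activation is the sum of its weighted inputs are all assumptions.

```python
def spread_activation(edges, active, steps=1):
    """Propagate activation along weighted edges for `steps` rounds.

    `edges` maps node -> list of (neighbor, weight); `active` maps
    node -> activation. Returns the final activations and the number of
    messages sent (one message per edge traversal).
    """
    messages = 0
    for _ in range(steps):
        incoming = {}
        for node, value in active.items():
            for neighbor, weight in edges.get(node, []):
                # Each traversal is one message on the overlay network.
                incoming[neighbor] = incoming.get(neighbor, 0.0) + value * weight
                messages += 1
        active = incoming
    return active, messages


if __name__ == "__main__":
    graph = {"a": [("b", 0.5), ("c", 0.25)], "b": [("c", 1.0)]}
    print(spread_activation(graph, {"a": 1.0}, steps=2))
```

Only edges leaving currently active nodes generate traffic, so the fraction of edges used in a round varies with the starting set. That is presumably what the activity level on the slide refers to.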
Methodology • Java-based infrastructure • simulates the packet-switched network • computes schedules for time-multiplexed network • Used smallest set of ConceptNet predicates • Java infrastructure generates VHDL netlist • Hand-coded VHDL for ConceptNet PEs • Created custom multipliers instead of using on-board multipliers, for speed
Methodology (cont) • Synthesis and place and route using Synplicity Compiler v8.0 • Xilinx ISE v8.1i used to obtain operating frequency and slice count • Long wires that constrain performance are further pipelined based on post place-and-route timing analysis • Lots of manual intervention to prepare the system
Results Three quantitative comparisons are provided to characterize the tradeoffs between packet switched and time multiplexed networks • Routing of identical topologies • Impact of area with identical area constraints • Examine performance while varying activity level (Activity Factors)
Routing identical topologies • Small numbers of PEs induce a light communication load • As the number of PEs increases, communication increases and offline routing starts to outperform online routing • Online routing requires up to 63% more cycles than offline routing for larger networks
Impact of Area • A couple of things to consider when talking about area • PE vs. Interconnect Tradeoff • Area-Time Tradeoff
PE vs. Interconnect Tradeoff Sometimes the network performs better with fewer PEs but more capacity in the network.
Area-Time Tradeoff • Packet switched and time multiplexed networks may use significantly different amounts of area due to differences in switch sizes • At smaller areas time multiplexing requires more cycles • At higher cycle counts time multiplexing requires more area for context • Performance is limited by 128 edge fanin or fanout limit
Activity Factors • Packet-switching takes 8x as many cycles to route • At some activity factors less than 100% packet-switching should be able to outperform time-multiplexing for same area
Conclusions • Demonstrated implementations of packet-switched and time-multiplexed FPGA overlay networks operating at 166 MHz • Offline scheduling offers up to a 63% performance increase over online scheduling for equivalent topologies • Packet-switching is up to 2x faster for small areas • Time-multiplexing is up to 8x faster for large areas
Conclusions (cont.) Packet switching offers better performance for activity factors below 30% at 32K slices and below 5% at 100K slices
Future Work • Mapping larger communication graphs with smaller fanout limitations to fully test networks • Compress context memory for time-multiplexing • Improve efficiency of packet switching • Extend work to multiple-chip networks
Wrap-up • Paper takes a look at trade-offs involved in FPGA networks • Thought it was a good look at design decisions and gave actual guidance to the designer • Describes interesting alternative to mesh network (BFTs)