Trace-Driven Optimization of Networks-on-Chip Configurations
Andrew B. Kahng†‡, Bill Lin‡, Kambiz Samadi‡, Rohit Sunkam Ramanujam‡
University of California, San Diego, CSE† and ECE‡ Departments
June 16, 2010
Outline • Motivation • Trace-Driven Problem Formulation • Greedy Addition VC Allocation • Greedy Deletion VC Allocation • Runtime Analysis • Experimental Setup • Evaluation • Experimental Results • Power Impact • Conclusions 2
Motivation
• NoCs are needed to interconnect many-core chips
• Scalable on-chip communication fabric
• An emerging interconnection paradigm for building complex VLSI systems
• NoCs can be used to interconnect general-purpose chip multiprocessors (CMPs) or application-specific multiprocessor systems-on-chip (MPSoCs)
[Figure labels: Processing Element, Router]
CMPs vs. MPSoCs
• Traditional application domains
  • MPSoCs target embedded domains
  • CMPs target general-purpose computing
• Common
  • Need for high memory bandwidth
  • Power efficiency, system control, etc.
• Different
  • CMPs must run a wide range of applications
  • MPSoCs have more irregularities
  • MPSoCs have tighter cost and time-to-market constraints
• Conclusion: application-specific optimization is required for MPSoCs
Trace-Driven vs. Average-Rate Driven
[Figure: actual traffic behavior of two PARSEC benchmark traces]
• Actual traffic tends to be very bursty, with substantial fluctuations over time
• Average-rate driven approaches are misled by the average traffic characteristics → poor design choices
• Our approach: trace-driven NoC configuration optimization
Head-of-Line (HOL) Blocking Problem
• HOL blocking happens in input-buffered routers
• Flits are blocked if the head flit is blocked → significantly increases latency and reduces throughput
• Virtual channels overcome this problem by multiplexing the input buffers
[Figure: two router diagrams with Outputs 1, 2, and 3; one input is marked “Blocked!”]
Average-Rate Driven Shortcoming
• 3 packets with the following (source, destination) pairs: (A, G), (B, E), (F, E)
• Suppose all 3 packets are 10 flits in size, and all are injected at t = 0
• Channels 2 and 3 each carry two packets, (A, G) and (B, E), and Channel 4 also carries two packets, (B, E) and (F, E)
• Average-rate analysis concludes that adding a VC to Channels 2 and 3 is as good as adding a VC to Channel 4, since all 3 channels have the same “load”
• Average-rate driven approaches lead to poor design choices
[Figure: example network with nodes A through G and Channels 1 through 6, showing packets (F, E), (B, E), and (A, G)]
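The equal-load observation above can be checked with a few lines of Python. This is only a sketch: the per-packet channel lists in `ROUTES` are assumptions inferred from the example text (the slide figure itself is not available), and `channel_loads` is a hypothetical helper name.

```python
from collections import Counter

# Assumed routes (lists of channel IDs) inferred from the example text;
# the exact paths are an illustration, not read off the original figure.
ROUTES = {
    ("A", "G"): [1, 2, 3, 6],
    ("B", "E"): [2, 3, 4],
    ("F", "E"): [5, 4],
}
PACKET_FLITS = 10  # all three packets are 10 flits long


def channel_loads(routes, flits_per_packet):
    """Total flits carried by each channel, i.e. the average-rate view."""
    loads = Counter()
    for path in routes.values():
        for channel in path:
            loads[channel] += flits_per_packet
    return loads


loads = channel_loads(ROUTES, PACKET_FLITS)
for channel in sorted(loads):
    print(f"Channel {channel}: {loads[channel]} flits")
```

Under these assumed routes, Channels 2, 3, and 4 all carry 20 flits, so a purely load-based metric cannot tell them apart even though the timing of the packets differs.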
Wormhole Configuration
• At t = 1, each packet holds the color-coded channels shown in the figure, assuming a single VC per channel (i.e., wormhole routing)
• At t = 2, Packet (A, G) is blocked from proceeding because Channel 2 is already held by Packet (B, E)
• At t = 12 (= 3 + 9), Packet (B, E) can proceed to Channel 4 since it has already been released by Packet (F, E)
• At t = 20, Packet (A, G) acquires Channel 3
• At t = 21, Packet (A, G) acquires Channel 6 as well, and Packet (B, E) completes
• Packet (A, G) will complete at t = 35
[Figure: example network with nodes A through G and Channels 1 through 6]
Latency Reduction via VC Allocation
• Now assume Channels 2 and 3 each have 2 VCs
• In this case, Packet (A, G) can “bypass” Packet (B, E) while Packet (B, E) is blocked by Packet (F, E) at Channel 4
• At t = 12, Packet (F, E) completes, and Packet (B, E) can proceed on Channel 4
• At t = 13, the last flit of Packet (A, G) is at Channel 6
• At t = 22, the last flit of Packet (B, E) is at Channel 4, and Packet (A, G) has already completed
• With 2 VCs at Channels 2 and 3, completion time is 23 cycles vs. 35 cycles without these VCs
• The main reason for the improvement is that we prevented Channels 2 and 3 from being “idle”
[Figure: example network with nodes A through G and Channels 1 through 6]
Problem Formulation
• Given:
  • Application communication trace, Ctrace
  • Network topology, T(P,L)
  • Deterministic routing algorithm, R
  • Target latency, Dtarget
• Determine:
  • A mapping nVC from the set of links L to the set of positive integers, i.e., nVC : L → Z+, where for any l ∈ L, nVC(l) gives the number of VCs associated with link l
• Objective:
  • Minimize $\sum_{l \in L} nVC(l)$
• Subject to:
  • $\sum_{l \in L} nVC(l) \le budgetVC$
  • $D(nVC, R) \le Dtarget$
Greedy Addition VC Allocation Heuristic (1)
• Inputs:
  • Communication traffic trace, Ctrace
  • Network topology, T(P,L)
  • Routing algorithm, R
  • Target latency, Dtarget
• Output:
  • Vector nVC, which contains the number of VCs associated with each link
[Figure: 4x4 mesh with nodes labeled (0,0) through (3,3)]
Example communication trace:
  time   src     dest    packet size
  1      (1,0)   (0,2)   4
  2      (2,2)   (3,1)   4
  5      (1,3)   (3,1)   4
  7      (2,1)   (3,2)   4
  …
Greedy Addition VC Allocation Heuristic (2)
• The algorithm initializes every link with one VC
• The algorithm proceeds in greedy fashion
• In each iteration, the performance of all VC perturbations is evaluated
• Each perturbation consists of adding exactly one VC to one link
• The average packet latency (APL) of each perturbed VC configuration is evaluated, and the configuration with the smallest APL is chosen for the next iteration
• The algorithm stops if either (1) the total number of allocated VCs exceeds the VC budget, or (2) a configuration whose APL meets the target latency is achieved
Greedy Addition VC Allocation Heuristic (3)
    // initialize to the wormhole configuration (one VC per link)
    for l = 1 to NL
        nVCcurrent(l) = 1;
    end for
    nVCbest = nVCcurrent;
    NVC = NL;
    // check the VC budget
    while (NVC <= budgetVC)
        // VC perturbations are evaluated in parallel in each iteration
        for l = 1 to NL
            nVCnew = nVCcurrent;
            nVCnew(l) = nVCcurrent(l) + 1;
            run trace simulation on nVCnew and record D(nVCnew,R)
        end for
        // find the best configuration of the current iteration
        find nVCbest;
        nVCcurrent = nVCbest;
        if (D(nVCbest,R) <= Dtarget)
            break;
        end if
        NVC++;
    end while
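The greedy addition loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `simulate` is a hypothetical stand-in for a trace simulation that returns the APL of a configuration, `toy_apl` and its `WEIGHTS` are made-up data, and the sketch checks the target before the first addition and never exceeds the budget, which differs slightly from the loop test on the slide.

```python
from typing import Callable, Dict, List, Tuple

VCConfig = Dict[str, int]  # link name -> number of VCs


def greedy_addition(
    links: List[str],
    simulate: Callable[[VCConfig], float],  # hypothetical trace-simulation hook
    d_target: float,
    budget_vc: int,
) -> Tuple[VCConfig, float]:
    """Add one VC at a time to the link that yields the lowest APL."""
    current = {link: 1 for link in links}  # wormhole configuration
    total_vcs = len(links)
    apl = simulate(current)
    while total_vcs < budget_vc and apl > d_target:
        trials = []
        for link in links:  # each perturbation adds exactly one VC to one link
            trial = dict(current)
            trial[link] += 1
            trials.append((simulate(trial), link))
        apl, best_link = min(trials, key=lambda t: t[0])
        current[best_link] += 1
        total_vcs += 1
    return current, apl


# Toy latency model (an assumption for demonstration only): each link
# contributes weight / number_of_VCs to the APL.
WEIGHTS = {"a": 12.0, "b": 6.0, "c": 2.0}


def toy_apl(cfg: VCConfig) -> float:
    return sum(WEIGHTS[link] / n for link, n in cfg.items())


config, apl = greedy_addition(list(WEIGHTS), toy_apl, d_target=12.0, budget_vc=6)
print(config, apl)  # {'a': 2, 'b': 2, 'c': 1} 11.0
```

With the toy model the search stops after two additions, before the budget of 6 VCs is reached, because the target latency is already met.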
Greedy Addition VC Allocation Heuristic Drawback
• Packets (A, F) and (A, E) share links A→B and B→C, both of which have only one VC
• (A, F) turns west and (A, E) turns east at Node C → adding a VC to either link A→B or link B→C alone may not have a significant impact on APL
• If VCs are added to both links A→B and B→C, the APL may be significantly reduced
• The greedy VC addition approach may fail to realize the benefits of these combined additions and may not pick either of the links
[Figure: nodes A through F with the routes of packets (A, E) and (A, F)]
Greedy Deletion VC Allocation Heuristic
    // start with a given VC configuration
    nVCcurrent = nVCinitial;
    nVCbest = nVCcurrent;
    NVC = sum of nVCcurrent(l) over all links l;
    while (NVC >= budgetVC)
        for l = 1 to NL
            nVCnew = nVCcurrent;
            // each link should have at least 1 VC, i.e., wormhole configuration
            if (nVCcurrent(l) > 1)
                nVCnew(l) = nVCcurrent(l) - 1;
                run trace simulation on nVCnew and record D(nVCnew,R)
            end if
        end for
        // find the best configuration of the current iteration,
        // i.e., the one with the least degradation in APL
        find nVCbest;
        nVCcurrent = nVCbest;
        if (D(nVCbest,R) <= Dtarget)
            break;
        end if
        NVC--;
    end while
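A matching Python sketch of greedy deletion is shown below. It rests on several assumptions: `simulate` is again a hypothetical trace-simulation hook, the toy latency model is invented, and the exit test is one possible reading (stop before a deletion that would push the APL above the target) rather than the exact test on the slide.

```python
from typing import Callable, Dict, Tuple

VCConfig = Dict[str, int]  # link name -> number of VCs


def greedy_deletion(
    initial: VCConfig,
    simulate: Callable[[VCConfig], float],  # hypothetical trace-simulation hook
    d_target: float,
    budget_vc: int,
) -> Tuple[VCConfig, float]:
    """Remove one VC at a time from the link whose loss hurts APL the least."""
    current = dict(initial)
    total_vcs = sum(current.values())
    apl = simulate(current)
    while total_vcs > budget_vc:
        trials = []
        for link, n_vc in current.items():
            if n_vc > 1:  # every link keeps at least one VC (wormhole)
                trial = dict(current)
                trial[link] = n_vc - 1
                trials.append((simulate(trial), link))
        if not trials:
            break  # already at the wormhole configuration
        trial_apl, best_link = min(trials, key=lambda t: t[0])
        if trial_apl > d_target:
            break  # assumption: never accept a deletion that misses the target
        current[best_link] -= 1
        total_vcs -= 1
        apl = trial_apl
    return current, apl


# Toy latency model (an assumption for demonstration only).
WEIGHTS = {"a": 12.0, "b": 6.0, "c": 2.0}


def toy_apl(cfg: VCConfig) -> float:
    return sum(WEIGHTS[link] / n for link, n in cfg.items())


start = {"a": 2, "b": 2, "c": 2}  # uniform-2VC starting point
config, apl = greedy_deletion(start, toy_apl, d_target=12.0, budget_vc=3)
print(config, apl)  # {'a': 2, 'b': 2, 'c': 1} 11.0
```

Starting from a uniform configuration, the sketch removes only the VC whose loss is cheapest and then stops, since any further deletion would violate the latency target.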
Addition and Deletion Heuristics Comparison • APL decreases as VCs are increased (addition heuristic) • APL increases as VCs are removed (deletion heuristic) • Adding a single VC to a link may not have a significant impact on APL • APL change is much smoother in deletion heuristic 17
Runtime Analysis
• Let m be the number of VCs added to (deleted from) an initial VC configuration
• Theuristic = m × NL × T(trace simulation)
  • T(trace simulation) is the average time to run trace simulations on all VC configurations explored by the algorithm
• Our heuristics can easily be parallelized
  • Evaluating all VC configurations in parallel
  • Theuristic = m × T(trace simulation)max
  • T(trace simulation)max represents the average of the maximum runtimes of trace simulation at each iteration
• For larger networks, we need O(L) processing nodes to maintain a reasonable runtime
  • Trace compression
  • Other metrics to more efficiently capture the impact of VC on APL
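The two runtime expressions can be compared numerically. The sketch below simply evaluates both formulas; all input values (10 added VCs, 48 links, 30 s and 40 s per simulation) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def serial_runtime(m: int, n_links: int, t_sim: float) -> float:
    """Theuristic = m x NL x T(trace simulation), one simulation at a time."""
    return m * n_links * t_sim


def parallel_runtime(m: int, t_sim_max: float) -> float:
    """Theuristic = m x T(trace simulation)max, all NL perturbations at once."""
    return m * t_sim_max


# Hypothetical numbers for illustration only.
m = 10            # VCs added (or deleted)
n_links = 48      # NL
t_sim = 30.0      # average seconds per trace simulation
t_sim_max = 40.0  # average of the per-iteration maximum runtimes, in seconds

serial = serial_runtime(m, n_links, t_sim)
parallel = parallel_runtime(m, t_sim_max)
print(serial, parallel, serial / parallel)  # 14400.0 400.0 36.0
```

The parallel estimate assumes one processing node per link, which is why the number of nodes must grow with the number of links.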
Experimental Setup (1)
• We use Popnet for trace simulation
• Popnet models a typical four-stage router pipeline
• The head flit of a packet traverses all four stages, while body flits bypass the first stage
• The number of VCs at each input port can be individually configured to allow nonuniform VC configuration at a router
• The latency of a packet is measured as the delay between the time the head flit is injected into the network and the time the tail flit is consumed at the destination
• The reported APL value is the average latency over all packets in the input traffic trace
[Figure labels: communication trace, GEMS, Simics, workload, network configuration]
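The latency and APL definitions above translate directly into code. The following is a minimal sketch; `PacketRecord` and its field names are hypothetical and do not reflect Popnet's actual output format, and the three records are made-up values.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable


@dataclass
class PacketRecord:
    head_injected: int  # cycle at which the head flit enters the network
    tail_consumed: int  # cycle at which the tail flit is consumed at the destination


def packet_latency(packet: PacketRecord) -> int:
    """Delay from head-flit injection to tail-flit consumption."""
    return packet.tail_consumed - packet.head_injected


def average_packet_latency(packets: Iterable[PacketRecord]) -> float:
    """APL: mean latency over all packets in the trace."""
    return mean(packet_latency(p) for p in packets)


# Made-up records for illustration: latencies are 12, 20, and 25 cycles.
records = [PacketRecord(0, 12), PacketRecord(3, 23), PacketRecord(5, 30)]
apl = average_packet_latency(records)
print(apl)
```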
Experimental Setup (2)
• To evaluate our VC allocation heuristics, we use seven different applications from the PARSEC benchmark suite
• Network traffic traces are generated by running these applications on Virtutech Simics
• The GEMS toolset is used for accurate timing simulation
• We simulate a 16-core, 4x4:
Comparison vs. Uniform-2VC
• The average-rate driven method is outperformed by uniform VC allocation
• Our addition and deletion heuristics achieve up to 36% and 34% reduction in number of VCs, respectively (w.r.t. the uniform-2VC configuration)
• On average, both of our heuristics reduce the number of VCs by around 21% across all traces (w.r.t. the uniform-2VC configuration)
Comparison vs. Uniform-3VC
• Our addition and deletion heuristics achieve up to 48% and 51% reduction in number of VCs, respectively
• On average, our addition and deletion heuristics achieve 31% and 41% reduction in number of VCs across all traces, respectively
• We observe up to 35% reduction in number of VCs compared against an existing average-rate driven approach
Latency and #VC Reductions
• With #VC = 128, our greedy deletion heuristic improves the APL by 32% and 74% for the fluidanimate and vips traces, respectively, compared with the uniform-2VC configuration
• Our deletion heuristic also achieves 50% and 42% reduction in number of VCs, respectively, compared with the uniform-4VC configuration
• Our proposed trace-driven approach can potentially be used to (1) improve performance within a given power constraint, and (2) reduce power within a given performance constraint
[Figure: fluidanimate trace and vips trace panels, each annotated with latency reduction and VC reduction]
Impact on Power
• We use ORION 2.0 to assess the impact of our approach on power consumption
• ORION 2.0 assumes the same number of VCs at every port in the router
• We need to compute the router power for nonuniform VC configurations
  • Estimate the power overhead of adding a single VC to all router ports
  • Estimate the power overhead of adding a single VC to just one port
  • A similar approach is used to estimate the area overhead of adding a single VC to one router port
• We observe that our proposed approach achieves up to 7% and 14% reduction in power compared against uniform-2VC and uniform-3VC configurations (without any performance degradation), respectively
• Similarly, we observe up to 9% and 16% reduction in area compared against uniform-2VC and uniform-3VC configurations, respectively
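One way to turn uniform-VC power estimates into a nonuniform estimate is sketched below. The linear per-port model is an assumption made for illustration (the slides do not give the exact formula), the function names are hypothetical, the power values are made up, and nothing here calls ORION 2.0 itself.

```python
from typing import List


def per_port_vc_overhead(p_uniform_1vc: float, p_uniform_2vc: float, n_ports: int) -> float:
    """Assumed model: the cost of adding one VC to all ports, split evenly per port."""
    return (p_uniform_2vc - p_uniform_1vc) / n_ports


def nonuniform_router_power(p_uniform_1vc: float, overhead: float, vcs_per_port: List[int]) -> float:
    """Base (1 VC everywhere) plus one overhead unit for every extra VC on any port."""
    extra_vcs = sum(n - 1 for n in vcs_per_port)
    return p_uniform_1vc + overhead * extra_vcs


# Made-up router power values (in mW) standing in for uniform-VC estimates.
P_1VC, P_2VC, PORTS = 10.0, 12.5, 5

overhead = per_port_vc_overhead(P_1VC, P_2VC, PORTS)
power = nonuniform_router_power(P_1VC, overhead, [1, 2, 3, 1, 1])
print(overhead, power)  # 0.5 11.5
```

The same structure would apply to area if the power inputs were replaced by area estimates.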
Conclusions
• Proposed a trace-driven method for optimizing NoC configurations
• Considered the problem of application-specific VC allocation
• Showed that existing average-rate driven VC allocation approaches fail to capture the application-specific characteristics needed to further improve performance and reduce power
• In comparison with uniform VC allocation, our approaches achieve up to 51% and 74% reduction in number of VCs and average packet latency, respectively
• In comparison with an existing average-rate driven approach, we observe up to 35% reduction in number of VCs
• Ongoing work
  • New metrics to more efficiently capture the impact of VC allocation on average packet latency
  • New metaheuristics to further increase our performance and VC-reduction gains
Thank You 29