Profile-Driven Energy Reduction in Network-on-Chips 8383 – 2nd Presentation Ranya Alawadhi
Source • Li, F., Chen, G., Kandemir, M., and Kolcu, I. 2007. Profile-driven energy reduction in network-on-chips. SIGPLAN Not. 42, 6 (Jun. 2007), 394-404
Agenda • Motivation • Contribution • Introduction • The Technique • Results • Conclusion
Motivation • Increasing on-chip power consumption demands power-aware designs • Recent research shows that voltage/frequency scaling on communication links and shutting down idle links can significantly reduce NoC power consumption • These techniques work best when communication links have long idle periods
Contribution • A profile-driven compiler optimization for increasing the length of idle periods of communication links for a two-dimensional, on-chip, mesh network by maximizing communication link reuse
Introduction • Targeted applications: array/loop-intensive embedded programs • Targeted NoC: two-dimensional mesh used by a single application at a time • Benchmarks: 12 data-intensive embedded applications • Results: reduces leakage energy by more than 35% on average (compared to a pure hardware-based link power management scheme)
Architectural model: Network Abstraction • [Figure: mesh of nodes, each with a switch (S), network interface (NI), CPU, and memory; switch detail with North, South, East, and West ports, input buffers, cross-bar, output buffers, and an interface to the local processing unit]
Architectural model: Hardware support for Compiler-Directed Message Routing • The compiler attaches routing information to each message-send operation in the code • The switch design was extended to handle two routing schemes: • Default X-Y routing • Compiler-directed routing
Cont. • Packet header • Flag: indicates which routing mechanism to use • Flag = 0: X-Y routing; header fields: Flag, Destination
Cont. • Flag (1 bit) = 1: compiler-directed routing; header fields: Flag, Counter, Orientation, Routing command sequence • Counter (4 bits): number of hops along the path • Orientation (2 bits): used along with the routing command sequence • Routing command sequence (13 bits): tells the switch which output port to forward the packet to
Optimizing Link Reuse • [Figure: tool flow. Parallel Code → Profiler → Communication Graph and Link Signature → Link Reuse Optimizer → Modified Communication Graph and Optimized Link Signature → Code Rewriter → Optimized Parallel Code]
Network State and Link Signature • A parallel program consists of n parallel threads P1, P2, … , Pn • Pp is scheduled to run on the pth mesh node • Communication command (CC): a send operation • Cp = {M1,p, M2,p, ... , Mk,p, ... , Mq,p}: the set of CCs in the program code of Pp • Mk,p: the kth CC in the code of Pp • q: total number of CCs in the program code of Pp • Network state: the set of messages under transmission • Si = {Mk,p | a message sent by Mk,p is in transmission over the mesh} • S0: the state in which no message is in transmission
Cont. • Link utilization vector (LUV): a vector whose jth element gives the number of packets sent by Mk,p that are transferred through the jth communication link of the mesh • Link signature (LS): represents the link utilization at a network state Si • Θ( ): a function that returns the set of links used by an LS or LUV
Example • [Figure: 2x2 mesh with nodes P0 to P3 connected by the links l0,1, l1,0, l0,2, l2,0, l1,3, l3,1, l2,3, l3,2 and carrying messages m1,0, m1,1, m1,2] • S1 = {M1,0, M1,1, M1,2}
Communication Graph • The network transitions from a state Si to another state Sj in two situations: • A new message is sent by Mk,p: Sj = Si ∪ {Mk,p} • A message sent by Mk,p arrives at its destination node: Sj = Si − {Mk,p}
Cont. • Communication graph (CG): • Captures the communication behavior of a program • Undirected graph • Vertex: network state • Edge (Si, Sj): transition between Si and Sj • Weight (Wi,j): number of transitions taking place between Si and Sj • Built through profiling • [Figure: example communication graph with states S1 to S5 and edge weights 100, 200, 300, 300, 400, 500]
Profiling • The profiler keeps track of the current network state Si • The program notifies the profiler each time a node sends a message or a message arrives at its destination • When a notification is received, the profiler computes the new state, Sj, and increments Wi,j
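The profiling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' profiler: the class name `Profiler`, the methods `on_send` and `on_arrive`, and the representation of a CC as a `(k, p)` tuple are all hypothetical. States are frozensets of CCs, and an undirected edge is stored as a frozenset of its two states.

```python
from collections import defaultdict

class Profiler:
    """Hypothetical profiler: tracks the network state and counts transitions."""

    def __init__(self):
        self.state = frozenset()         # S0: no message in transmission
        self.weights = defaultdict(int)  # undirected edge {Si, Sj} -> W_ij

    def _move_to(self, new_state):
        # The graph is undirected, so the edge key ignores direction.
        self.weights[frozenset((self.state, new_state))] += 1
        self.state = new_state

    def on_send(self, cc):
        """A new message is sent by CC `cc`: Sj = Si U {cc}."""
        self._move_to(self.state | {cc})

    def on_arrive(self, cc):
        """The message sent by `cc` reached its destination: Sj = Si - {cc}."""
        self._move_to(self.state - {cc})

# Usage: two sends, two arrivals, then the first CC sends again.
profiler = Profiler()
a, b = (1, 0), (1, 1)   # CCs M1,0 and M1,1
profiler.on_send(a)
profiler.on_send(b)
profiler.on_arrive(a)
profiler.on_arrive(b)
profiler.on_send(a)
```

After this run the edge between S0 and {M1,0} has weight 2, because it was crossed once at the start and once at the end; the other three edges have weight 1.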
Restating the Problem • When going from one state to another at runtime, the goal is to reuse the same set of links as much as possible • Each vertex in a CG has a default link signature (obtained using the default X-Y routing) • The compiler’s task is to re-assign link signatures to vertices
Traversing a Communication Graph • Traversing network states to assign them new link signatures: • 1. Start with the edge with the largest weight • 2. Perform the signature re-assignment for the associated vertices • 3. Select the next edge: • Scheme I: the one with the largest weight among the edges incident on the already selected vertices • Scheme II: the one with the largest weight among all the remaining edges • 4. Perform the signature re-assignment • 5. Repeat steps 3 and 4 until all vertices are processed
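The difference between the two edge-selection schemes is easiest to see on a small graph. The sketch below is a simplified illustration, not the authors' implementation: the function name `edge_order`, the tie-breaking (first edge in dictionary order wins), the fallback for disconnected graphs, and the example weights are all assumptions. It returns only the order in which edges would be picked; the signature re-assignment itself is left out.

```python
def edge_order(edges, scheme):
    """Return the order in which edges are picked for re-assignment.

    edges:  dict mapping (u, v) vertex pairs to weights (undirected graph)
    scheme: 1 -> heaviest edge incident on an already selected vertex
            2 -> heaviest edge among all remaining edges
    """
    remaining = dict(edges)
    all_vertices = {v for edge in edges for v in edge}
    done, order = set(), []
    while remaining and done != all_vertices:
        pool = remaining
        if scheme == 1 and done:
            incident = {e: w for e, w in remaining.items() if done & set(e)}
            pool = incident or remaining   # assumed fallback if disconnected
        edge = max(pool, key=pool.get)     # largest weight in the pool
        del remaining[edge]
        order.append(edge)
        done |= set(edge)                  # both end states are now processed
    return order

# Hypothetical five-state graph; the weights are made up for illustration.
graph = {("S1", "S2"): 50, ("S2", "S3"): 10, ("S3", "S4"): 40,
         ("S4", "S5"): 30, ("S1", "S5"): 20}
order_scheme_1 = edge_order(graph, 1)
order_scheme_2 = edge_order(graph, 2)
```

On this graph Scheme I grows outward from the heaviest edge (S1-S2, then S1-S5, S4-S5, S3-S4), while Scheme II jumps straight to the next heaviest edge anywhere in the graph (S1-S2, then S3-S4, S4-S5).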
Example • [Figure: traversal of the example communication graph with states S1 to S5 and edge weights 100, 200, 300, 300, 400, 500]
Routing Flexibility • Only the shortest paths are considered for re-routing messages • Number of possible unique shortest paths = $\binom{m+n}{m} = \frac{(m+n)!}{m!\,n!}$ • Source (xs, ys), destination (xd, yd) • m = |xd − xs| • n = |yd − ys| • Alternate link utilization vectors (ALUV): the set Ai,p of all alternate (shortest) paths available to a message sent by Mi,p • Re-routing: replacing the current LUV of Mi,p with a new LUV selected from the corresponding ALUV set • Routing flexibility = |Ai,p| (i.e. the number of alternate link utilization vectors in the ALUV set)
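A shortest path between two mesh nodes takes m hops along x and n hops along y in some order, so the number of such paths is the binomial coefficient C(m+n, m). The sketch below enumerates them; the function name `shortest_paths` and the representation of a link as a pair of node coordinates are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def shortest_paths(src, dst):
    """Enumerate every shortest mesh path from src to dst.

    Nodes are (x, y) coordinates; a path is a list of links, and a link is a
    pair of adjacent nodes (from_node, to_node).
    """
    (xs, ys), (xd, yd) = src, dst
    m, n = abs(xd - xs), abs(yd - ys)
    dx = 1 if xd >= xs else -1
    dy = 1 if yd >= ys else -1
    paths = []
    # Choose which of the m + n hops move along x; the rest move along y.
    for x_hops in combinations(range(m + n), m):
        x, y, links = xs, ys, []
        for hop in range(m + n):
            nxt = (x + dx, y) if hop in x_hops else (x, y + dy)
            links.append(((x, y), nxt))
            x, y = nxt
        paths.append(links)
    return paths

# Usage: from (0, 0) to (2, 1) there are C(3, 2) = 3 shortest paths.
paths = shortest_paths((0, 0), (2, 1))
flexibility = len(paths)
```

The first path produced takes all x hops before any y hop, which corresponds to the default X-Y route; the remaining paths are the alternates.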
Problem formulation • The routing associated with a CC can be changed only once • Selecting the new utilization vectors should not degrade the performance of the default routing scheme • Selecting alternate re-routings can increase network contention • A performance constraint is therefore imposed on re-routing: avoid increasing the value of the largest entry in any original link signature • For example: • Default LS: (10, 40, 10, 10, 0, 0, 0, 0) • Undesirable alternative: (10, 50, 0, 10, 0, 0, 0, 0), where the largest entry rises from 40 to 50 • Accepted alternative: (40, 20, 10, 0, 0, 0, 0, 0), where the largest entry stays at 40
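The performance constraint reduces to comparing the largest entries of two link signatures. A minimal sketch, with the hypothetical function name `satisfies_constraint`, applied to the three signatures from the slide:

```python
def satisfies_constraint(default_ls, candidate_ls):
    """True if the candidate does not raise the largest link-signature entry."""
    return max(candidate_ls) <= max(default_ls)

default_ls  = (10, 40, 10, 10, 0, 0, 0, 0)
undesirable = (10, 50, 0, 10, 0, 0, 0, 0)   # largest entry grows to 50
accepted    = (40, 20, 10, 0, 0, 0, 0, 0)   # largest entry stays at 40

rejected_ok = not satisfies_constraint(default_ls, undesirable)
accepted_ok = satisfies_constraint(default_ls, accepted)
```

Note that the accepted alternative moves the heaviest load to a different link; the constraint only bounds the peak value, not which link carries it.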
Heuristic • For each Mi,p in network states Sa, Sb that has not yet been assigned a new routing: • Calculate its LUV and ALUV • Calculate the LS of Sa and Sb • Compute num_links (total number of links used in Sa and Sb); the goal is to reduce this value as much as possible under the performance constraint • Sort the CCs in Sa and Sb in ascending order of routing flexibility • Start with the CC that has the lowest routing flexibility and assign a proper route to it • Assign appropriate routes to the remaining CCs, one by one, until all commands in Sa and Sb have been processed
Cont. • Choosing a route for Mi,p (selecting a new LUV for Mi,p from the re-routing options captured in Ai,p): • For each alternate re-routing, check whether the performance constraint is satisfied with respect to state Sa • If the performance constraint is met, compute the new link signature for the state that the CC belongs to • Recalculate num_links • Select the alternative that leads to the minimum num_links value • Once a CC is given a new LUV, it is not considered again when processing other vertex pairs
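The greedy heuristic for one vertex pair can be sketched as follows. This is a simplified reading of the slides under several assumptions, not the authors' code: a link signature is taken to be the per-link sum of the LUVs of the CCs in a state, LUVs are dictionaries from link name to packet count, the constraint is checked against every state of the pair that contains the CC, and all function and variable names are hypothetical.

```python
def signature(state, luv):
    """Assumed link signature: per-link packet totals over the CCs in a state."""
    ls = {}
    for cc in state:
        for link, packets in luv[cc].items():
            ls[link] = ls.get(link, 0) + packets
    return ls

def peak(state, luv):
    """Largest entry of the state's link signature (0 for the empty state)."""
    return max(signature(state, luv).values(), default=0)

def num_links(sa, sb, luv):
    """Number of distinct links used in states sa and sb together."""
    return len(set(signature(sa, luv)) | set(signature(sb, luv)))

def reassign(sa, sb, luv, default_luv, aluv, assigned):
    """Greedily re-route the unassigned CCs of one vertex pair (sa, sb)."""
    # Least flexible CCs first, as in the heuristic.
    pending = sorted((sa | sb) - assigned, key=lambda cc: (len(aluv[cc]), cc))
    for cc in pending:
        best, best_links = luv[cc], num_links(sa, sb, luv)
        for alt in aluv[cc]:
            trial = {**luv, cc: alt}
            # Performance constraint: no state's peak may exceed its default.
            ok = all(peak(s, trial) <= peak(s, default_luv)
                     for s in (sa, sb) if cc in s)
            links = num_links(sa, sb, trial)
            if ok and links < best_links:
                best, best_links = alt, links
        luv[cc] = best
        assigned.add(cc)   # a CC is never re-routed a second time
    return luv

# Hypothetical 2x2 mesh example: CC "a" can go via l0,1/l1,3 or l0,2/l2,3.
default_luv = {"a": {"l0,1": 5, "l1,3": 5}, "b": {"l0,2": 3}, "d": {"l3,2": 20}}
aluv = {"a": [{"l0,1": 5, "l1,3": 5}, {"l0,2": 5, "l2,3": 5}],
        "b": [{"l0,2": 3}], "d": [{"l3,2": 20}]}
sa, sb = frozenset({"b", "d"}), frozenset({"a", "b", "d"})
luv = reassign(sa, sb, dict(default_luv), default_luv, aluv, set())
```

In this example "a" is moved onto the path that shares l0,2 with "b", which reduces the number of links in use from 4 to 3 while the peak entry of the signature stays at 20.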
Complexity • Computational complexity: O(N · K · F) • N: number of network states • K: number of send operations • F: largest routing flexibility in an m×n mesh
Code Rewriter • Responsible for providing a version of the message send operation that incorporates the compiler-determined routing information • Message header for send1,3: 1 0110 11 0001110000000 • Message header for send1,7: 1 0100 11 1010000000000
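The two example headers follow the field widths given for compiler-directed routing: flag (1 bit), counter (4 bits), orientation (2 bits), routing command sequence (13 bits). The sketch below packs and parses such a header as a bit string. The function names are hypothetical, the field order is inferred from the grouping of the example headers, and the command sequence is treated as an opaque bit string because the slides do not spell out its encoding.

```python
def pack_header(counter, orientation, commands):
    """Build a 20-bit compiler-directed routing header as a bit string."""
    if not 0 <= counter < 16:
        raise ValueError("counter must fit in 4 bits")
    if not 0 <= orientation < 4:
        raise ValueError("orientation must fit in 2 bits")
    if len(commands) != 13 or set(commands) - {"0", "1"}:
        raise ValueError("commands must be a string of 13 bits")
    # Flag is 1 for compiler-directed routing.
    return f"1{counter:04b}{orientation:02b}{commands}"

def parse_header(bits):
    """Split a compiler-directed header (spaces allowed) into its fields."""
    bits = bits.replace(" ", "")
    if len(bits) != 20 or bits[0] != "1":
        raise ValueError("not a 20-bit compiler-directed header")
    return {"flag": 1,
            "counter": int(bits[1:5], 2),       # number of hops along the path
            "orientation": int(bits[5:7], 2),
            "commands": bits[7:]}

# Usage with the header shown for send1,3.
fields = parse_header("1 0110 11 0001110000000")
rebuilt = pack_header(fields["counter"], fields["orientation"], fields["commands"])
```

Parsing the send1,3 header gives a hop counter of 6, and the send1,7 header gives 4.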
Handling Deadlocks • Re-routing changes the behavior of the default X-Y routing scheme • An acyclic channel dependency graph is the necessary and sufficient condition for avoiding deadlocks • A deadlock handling routine that breaks cycles within the channel dependency graph is incorporated: • It reduces the probability of experiencing a deadlock at runtime • It cannot completely eliminate deadlocks, so the dynamic, hardware-supported deadlock avoidance rule employed by the Alpha 21364 network architecture is also used • Handling deadlocks when they occur costs both extra latency and extra power
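Checking whether a channel dependency graph is acyclic is a standard depth-first search. The sketch below is a generic cycle check, not the deadlock handling routine from the paper; the function name `has_cycle`, the adjacency-dictionary representation, and the link names in the example are assumptions.

```python
def has_cycle(deps):
    """Return True if the directed graph `deps` contains a cycle.

    deps maps each channel to the channels it depends on, i.e. the channels a
    packet holding it may request next.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the DFS stack, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in deps.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:         # back edge: a dependency cycle
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in deps)

# Hypothetical 2x2 mesh: four channels that wait on each other in a ring.
ring = {"l0,1": ["l1,3"], "l1,3": ["l3,2"], "l3,2": ["l2,0"], "l2,0": ["l0,1"]}
# Removing one dependency breaks the cycle.
broken = {"l0,1": ["l1,3"], "l1,3": ["l3,2"], "l3,2": ["l2,0"], "l2,0": []}
ring_has_cycle = has_cycle(ring)
broken_has_cycle = has_cycle(broken)
```

The recursion is fine for the small graphs of an on-chip mesh; a very large graph would call for an explicit stack.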
Experiments 1) Simulation Environment and Benchmarks • A flit-level on-chip interconnection network simulator was implemented • Network: 5x5 configuration • Link speed: 1 Gb/sec • Switch input port buffer size: 64 flits • Flit size: 128 bits • Packet size: 16 flits • The communication links can be shut down independently, using a time-out based mechanism • Time-out counter threshold: 1.5 μsec (based on preliminary analysis) • Time to switch a link to the active state: 1 μsec • Energy overhead of switching: 140 μJ
Cont. • Experiments were performed with three different versions of each benchmark: • Default routing • Scheme I • Scheme II • All versions use the underlying hardware-based link shutdown scheme
Cont. • Code sizes: 63 to 8,612 lines of C • Dataset sizes: 68.9KB to 1,866.4KB • Increase in compilation time (including profiling): 89% (3Step-log) to 236% (Lame) • No deadlock was observed
Cont. • [Figure: Link utilization] • [Figure: Percentage reductions in leakage energy consumption]
Cont. • [Figure: Percentage increases in network cycles and overall execution time]
Cont. • [Figure: Sensitivity to the number of nodes (Scheme I); the results with Scheme II are similar] • [Figure: Sensitivity to the input size (Scheme I); the results with Scheme II are similar]
Conclusion • The proposed approach confines link usage to a small set of links to increase the idle periods of the remaining links • Hardware schemes are more effective when used with the proposed technique