1 / 24

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology. Line Card (LC). Arbitration. Packet Processor. Control Processor. SRAM (lookup table). Switch. Queue Manager. IP Lookup. Exit functions. LC. LC.

doria
Download Presentation

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology http://csg.csail.mit.edu/6.375

  2. Line Card (LC) Arbitration Packet Processor Control Processor SRAM (lookup table) Switch Queue Manager IP Lookup Exit functions LC LC LC IP Lookup block in a router A packet is routed based on the “Longest Prefix Match” (LPM) of it’s IP address with entries in a routing table Line rate and the order of arrival must be maintained line rate  15Mpps for 10GE http://csg.csail.mit.edu/6.375

  3. 7.14.*.* A 7 B … … … … … … … … … … … … … … … C A A A C A F F F F F F F F F 7.14.7.3 B 10.18.200.* C 10.18.200.5 D 5.*.*.* E * F 5 D A 4 Sparse tree representation 0 14 3 5 E F 7 10 18 255 200 2 3 In this lecture: Level 1: 16 bits Level 2: 8 bits Level 3: 8 bits  1 to 3 memory accesses 1 4 http://csg.csail.mit.edu/6.375

  4. 0 0 0 … … … 28 -1 28 -1 … 216 -1 “C” version of LPM int lpm (IPA ipa) /* 3 memory lookups */ { int p; /* Level 1: 16 bits */ p = RAM [ipa[31:16]]; if (isLeaf(p)) return value(p); /* Level 2: 8 bits */ p = RAM [ptr(p) + ipa [15:8]]; if (isLeaf(p)) return value(p); /* Level 3: 8 bits */ p = RAM [ptr(p) + ipa [7:0]]; return value(p); /* must be a leaf */ } Not obvious from the C code how to deal with - memory latency - pipelining Memory latency ~30ns to 40ns Must process a packet every 1/15 ms or 67 ns Must sustain 3 memory dependent lookups in 67 ns http://csg.csail.mit.edu/6.375

  5. Rigid pipeline Linear pipeline Inefficient memory usage but simple design Efficient memory usage through memory port replicator 1 2 3 Longest Prefix Match for IP lookup:3 possible implementation architectures Circular pipeline Efficient memory with most complex control Designer’s Ranking: Which is “best”? http://csg.csail.mit.edu/6.375 Arvind, Nikhil, Rosenband & Dave [ICCAD 2004]

  6. IP-Lookup module: Circular pipeline getToken cbuf getResult • Completion buffer ensures that departures take place in order even if lookups complete out-of-order • Since cbuf has finite capacity it gives out tokens to control the entry into the circular pipeline • The fifo must also hold the “token” while the memory access is in progress: Tuple2#(Token,Bit#(16)) enter put done? RAM no fifo remainingIP http://csg.csail.mit.edu/6.375

  7. Completion buffer: Interface cbuf getToken getResult put (result & token) interfaceCBuffer#(type t); methodActionValue#(Token) getToken; methodAction put(Token tok, t d); methodActionValue#(t) getResult; endinterface typedef Bit#(TLog#(n)) TokenN#(numeric type n); typedef TokenN#(16) Token; http://csg.csail.mit.edu/6.375

  8. ctr++ Ready (ctr > 0) Ack ctr deq deq enq Enable Data Ready Synch Mem Latency N req ctr-- peek Data Addr Request-Response Interface for Synchronous Memory interface Mem#(type addrT, type dataT); method Action req(addrT x); method Action deq; method dataTpeek; endinterface Making a synchronous component latency- insensitive http://csg.csail.mit.edu/6.375

  9. IP-Lookup module: Interface methods getToken cbuf getResult enter put done? RAM no fifo module mkIPLookup(IPLookup); rule recirculate… ; method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2(tok,ip[15:0])); endmethod method ActionValue#(Msg)getResult(); let result <- cbuf.getResult; return result; endmethod endmodule When can enter fire? cbuf,ram & fifo each has space http://csg.csail.mit.edu/6.375

  10. Circular Pipeline Rules: enter? done? RAM done? Is the same as isLeaf fifo When can recirculate fire? rulerecirculate; match{.tok,.rip} = fifo.first; fifo.deq; ram.deq; if(isLeaf(ram.peek)) cbuf.put(tok, ram.peek); else begin fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek+ rip[15:8]); end ram & fifo each has an element and ramand fifo, or cbuf has space Requires simultaneous enq and deq in the same rule! Is this possible? http://csg.csail.mit.edu/6.375

  11. Dead Cycles enter? done? RAM method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2(tok,ip[15:0])); endmethod fifo Can a new request enter the system when an old one is leaving? rulerecirculate; match{.tok,.rip} = fifo.first; fifo.deq; ram.deq; if(isLeaf(ram.peek)) cbuf.put(tok, ram.peek); else begin fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek+ rip[15:8]); end No Is this worth worrying about? http://csg.csail.mit.edu/6.375

  12. The Effect of Dead Cycles yes enter Circular Pipeline • RAM takes several cycles to respond to a request • Each IP request generates 1-3 RAM requests • FIFO entries hold base pointer for next lookup and unprocessed part of the IP address done? RAM no fifo What is the performance loss if “exit” and “enter” don’t ever happen in the same cycle? >33% slowdown! Unacceptable http://csg.csail.mit.edu/6.375

  13. So is there a dead cycle? enter? done? RAM method Action enter (IP ip); Token tok <- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2(tok,ip[15:0])); endmethod fifo rulerecirculate; match{.tok,.rip} = fifo.first; fifo.deq; ram.deq; if(isLeaf(ram.peek)) cbuf.put(tok, ram.peek); else begin fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek+ rip[15:8]); end In general these two rules conflict but when isLeaf(p) is true there is no apparent conflict! Assuming cbuf.getToken and cbuf.put are CF http://csg.csail.mit.edu/6.375

  14. Rule Spliting rule fooT (p); r1 <= 5; endrule rule fooF (!p); r2 <= 7; endrule rule foo (True); if (p) r1 <= 5; else r2 <= 7; endrule  rule fooT and fooF can be scheduled independently with some other rule http://csg.csail.mit.edu/6.375

  15. Splitting the recirculate rule This rule is a valid only if enqand deq can be enable simultaneously and execute concurrently rulerecirculate(!isLeaf(ram.peek)); match{.tok,.rip} = fifo.first; fifo.enq(tuple2(tok,(rip << 8))); ram.req(ram.peek+ rip[15:8]); fifo.deq; ram.deq; endrule method Actionenter (IP ip); Token tok<- cbuf.getToken; ram.req(ip[31:16]); fifo.enq(tuple2 (tok,ip[15:0])); endmethod rule exit (isLeaf(ram.peek)); match{.tok,.rip} = fifo.first; cbuf.put(tok, ram.peek); fifo.deq; ram.deq; endrule Rule exit and method entercan execute concurrently, if cbuf.putand cbuf.getTokencan execute concurrently http://csg.csail.mit.edu/6.375

  16. Concurrent FIFO methodspipelined FIFO rule foo (True); f.enq (5) ; f.deq; endrule make implicit conditions explicit  rule foo (f.notFull && f.notEmpty); f.enq (5) ; f.deq; endrule Can foo be enabled? f.notFull can be calculated only after knowing if f.deq fires or not, i.e. there is a combinational path from enable of f.deq to f.notFull Firing condition for rule foo has to be independent of the body http://csg.csail.mit.edu/6.375

  17. Concurrent FIFO methodsCF FIFO rule foo (True); f.enq (5) ; f.deq; endrule make implicit conditions explicit  rule foo (f.notFull && f.notEmpty); f.enq (5) ; f.deq; endrule Can foo be enabled? The firing condition for rule foo is independent of the body The FIFO in the IP lookup must therefore be CF http://csg.csail.mit.edu/6.375

  18. Completion buffer: Interface cbuf getResult getToken put (result & token) interfaceCBuffer#(type t); method ActionValue#(Token) getToken; method Action put(Token tok, t d); methodActionValue#(t) getResult; endinterface For no dead cycles cbuf.getTokenand cbuf.put and cbuf.getResultmust be able to execute concurrently http://csg.csail.mit.edu/6.375

  19. Completion buffer:Concurrency requirements cbuf getResult getToken put (result & token) For no dead cycles cbuf.getTokenand cbuf.put and cbuf.getResultmust be able to execute concurrently If we make these methods CF then every thing will work concurrently, i.e. (enter CF exit), (enter CF getResult) and (exit CF getResult) However CF methods are hard to design. Suppose (getToken< put), (getToken< getResult) and(put< getResult)then(enter <exit), (enter < getResult) and (exit < getResult) In fact, any ordering will work http://csg.csail.mit.edu/6.375

  20. I I V I V I iidx ridx cnt buf Completion buffer: Implementation A circular buffer with two pointers iidx and ridx, and a counter cnt Elements are of Maybe type modulemkCompletionBuffer(CompletionBuffer#(size)); Vector#(size, EHR#(Maybe#(t))) cb <- replicateM(mkEHR(Invalid)); Reg#(Bit#(TAdd#(TLog#(size),1))) iidx <- mkReg(0); Reg#(Bit#(TAdd#(TLog#(size),1))) ridx <- mkReg(0); EHR#(Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR(0); Integervsize = valueOf(size); Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize); rules and methods... endmodule http://csg.csail.mit.edu/6.375

  21. Completion Buffercont method ActionValue#(t) getToken() if(cnt[0]!==sz); cb[iidx][0] <= Invalid; iidx <= iidx==sz-1 ? 0 : iidx + 1; cnt[0] <= cnt[0] + 1; returniidx; endmethod methodAction put(Token idx, t data); cb[idx][1] <= Valid data; endmethod methodActionValue#(t) getResult() if(cnt[1] !== 0 &&&(cb[ridx][2] matchestagged (Valid .x)); cb[ridx][2] <= Invalid; ridx <= ridx==sz-1 ? 0 : ridx + 1; cnt[1] <= cnt[1] – 1; returnx; endmethod getToken < put put < getResult getToken < getResult Concurrency properties? http://csg.csail.mit.edu/6.375

  22. Rigid pipeline Linear pipeline Inefficient memory usage but simple design Efficient memory usage through memory port replicator Longest Prefix Match for IP lookup:3 possible implementation architectures Circular pipeline Efficient memory with most complex control Which is “best”? http://csg.csail.mit.edu/6.375 Arvind, Nikhil, Rosenband & Dave [ICCAD 2004]

  23. IP addr Replicated: BEST: MUX result IP addr MUX / De-MUX result FSM FSM FSM FSM FSM Counter MUX / De-MUX RAM RAM Implementations of Static pipelines Two designers, two results Each packet is processed by one FSM Shared FSM http://csg.csail.mit.edu/6.375

  24. Synthesis results Synthesis: TSMC 0.18 µm lib - Bluespec results can match carefully coded Verilog - Micro-architecture has a dramatic impact on performance - Architecture differences are much more important than language differences in determining QoR http://csg.csail.mit.edu/6.375 V = Verilog; BSV = Bluespec System Verilog

More Related