
Block Design Review: Queue Manager and Scheduler


Presentation Transcript


  1. Block Design Review: Queue Manager and Scheduler. Amy M. Freestone, Sailesh Kumar

  2. QM/Schd Overview
     Blocks shown in the pipeline diagram: Phy Int Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Switch Tx, Switch
     • QM/Scheduler function:
       • Enqueue to and dequeue from queues
       • Scheduling algorithm (5 ports, N queues per port, WDRR across queues)
       • Drop policy
       • RR port scheduling, rate controlled
     • Memory accesses (SRAM):
       • Q-Array reads and writes
       • Scheduling data structure reads and writes
       • QLength data structure reads and writes
       • Queue weight, discard threshold, and port rate reads
       • Packet length reads from the buffer descriptor
     • Message formats shown in the diagram (V: valid bit):
       • One word: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b)
       • Three words: Buffer Handle (32b); Rsv (4b), Port (4b), QID (20b), Rsv (4b); Frame Length (16b), Stats Index (16b)
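The two message formats on this slide can be modelled with plain shifts and masks. The sketch below is illustrative only: it assumes the fields are laid out most-significant-first in the order the slide lists them, and the function names are hypothetical rather than taken from the QM sources.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the one-word format, most significant field first:
 *   V (1b) | Rsv (3b) | Port (4b) | Buffer Handle (24b)
 * The bit positions are an assumption; the slide only gives field widths. */
static inline uint32_t pack_one_word(unsigned valid, unsigned port,
                                     uint32_t buf_handle)
{
    assert(valid <= 1u && port <= 0xFu && buf_handle <= 0xFFFFFFu);
    return ((uint32_t)valid << 31) | ((uint32_t)port << 24) | buf_handle;
}

static inline unsigned one_word_valid(uint32_t w)  { return (unsigned)(w >> 31); }
static inline unsigned one_word_port(uint32_t w)   { return (unsigned)((w >> 24) & 0xFu); }
static inline uint32_t one_word_handle(uint32_t w) { return w & 0xFFFFFFu; }

/* Assumed layout of the middle word of the three-word format:
 *   Rsv (4b) | Port (4b) | QID (20b) | Rsv (4b) */
static inline uint32_t pack_queue_word(unsigned port, uint32_t qid)
{
    assert(port <= 0xFu && qid <= 0xFFFFFu);
    return ((uint32_t)port << 24) | (qid << 4);
}

static inline unsigned queue_word_port(uint32_t w) { return (unsigned)((w >> 24) & 0xFu); }
static inline uint32_t queue_word_qid(uint32_t w)  { return (w >> 4) & 0xFFFFFu; }
```

Reserved bits are written as zero here; whether the real code relies on that is not stated in the deck.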

  3. Data Structures
     • SRAM:
       • Q params (per queue): queue length, discard threshold, weight quantum
       • Q descriptor (per queue): head, tail, count
       • Buffer descriptor: LW0-1 xxx; LW2 Pkt_Size (16b), xxx; LW3-7 xxx
     • High-level cache architecture, shared by the enqueuer and the dequeuers:
       • CAM (16 entries): queue id (20b)
       • SRAM Q-array (16 entries): queue head/tail/count
       • Local memory (16 entries): queue length, discard threshold, weight quantum
       • Valid flags: Qlen valid, head valid, tail valid
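As a rough software model of the 16-entry CAM in front of the Q-array and the local-memory parameter cache, the sketch below looks up a queue id and, on a miss, picks a victim (an unused entry first, otherwise the least recently used one). The real microengine CAM tracks LRU in hardware; the structure and function names here are hypothetical, and the write-back and SRAM reload that the deck describes are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAM_ENTRIES 16

typedef struct {
    uint32_t tag[CAM_ENTRIES];      /* 20-bit queue ids */
    bool     valid[CAM_ENTRIES];
    uint32_t last_use[CAM_ENTRIES]; /* software stand-in for hardware LRU state */
    uint32_t tick;
} cam_model;

typedef struct {
    bool     hit;
    int      entry;        /* entry now holding the queue id */
    bool     evicted;      /* true if a valid entry was replaced */
    uint32_t evicted_qid;  /* meaningful only when evicted is true */
} cam_result;

static inline cam_result cam_lookup_or_replace(cam_model *cam, uint32_t qid)
{
    assert(cam != NULL && qid <= 0xFFFFFu);
    cam_result r = { false, 0, false, 0 };
    int victim = 0;

    cam->tick++;
    for (int i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == qid) {
            cam->last_use[i] = cam->tick;   /* refresh LRU position on a hit */
            r.hit = true;
            r.entry = i;
            return r;
        }
        /* Victim choice: first unused entry, else the least recently used. */
        if (cam->valid[victim] &&
            (!cam->valid[i] || cam->last_use[i] < cam->last_use[victim]))
            victim = i;
    }

    /* Miss: the caller is expected to write back the evicted queue's params
     * and descriptor, then load the new queue's state from SRAM. */
    r.entry = victim;
    r.evicted = cam->valid[victim];
    r.evicted_qid = cam->tag[victim];
    cam->tag[victim] = qid;
    cam->valid[victim] = true;
    cam->last_use[victim] = cam->tick;
    return r;
}
```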

  4. QM/Schd Interface
     • Scratch ring interface
       • For both ingress and egress
     • Threads used: 7
       • Thread 0: free list maintenance and initialization
       • Threads 1-5: dequeue for ports 0-4
       • Thread 6: enqueue for all 5 ports
     • Threads are synchronized after each round
       • A round enqueues up to 5 packets
       • and dequeues up to 5 packets, one for each port

  5. Thread Synchronization
     Note: the enqueue thread does not use signal A. The signal is instead implemented with a register that is set by thread 0 and reset by the enqueuer.

  6. Resource Usage
     • Local memory: 1512 bytes
       • #define PAR_CACHE_LM_BASE 0x0
       • #define PORT_DATA_LM_BASE 0x100
       • #define BBUF_FL_LM_BASE 0x1a8
       • #define BBUF_LM_BASE 0x1fc
       • #define FL_LM_BASE 0x598
     • SRAM
       • Queue descriptors (16B per queue)
       • Queue parameters (16B per queue)
       • Port rates (4B per port)
       • Free lists
       • Batch buffers
     • Enqueue: 15 signals, 16 RD xfer, 10 WR xfer
     • Dequeue: 9 signals; QM uses 4 RD xfer, 1 WR xfer; SCH uses more xfers

  7. Local Memory Map (JDD, 4/1/08)
     • 0x000: PAR Cache
     • 0x100-0x1A7: Port Data
     • 0x1A8-0x1FB: Batch Buf FL
     • 0x1FC-0x597: Batch Buffers (21 * 44 bytes)
     • 0x598: Free List (>= 40 * 4 bytes)
     • 0x680: Port Rate Control Data (residualResult written here)
     • 0x690-0x9FF: Unallocated
     Port data structure (word index):
     • 0: Old Tail LM
     • 1: Old Tail SRAM
     • 2: head SRAM
     • 3: tail SRAM
     • 4: tail offset (first empty slot)
     • 5: nexthead LM
     • 6: LM (head|tail)
     • 7: unused
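The region sizes in the map can be cross-checked against the base addresses from the Resource Usage slide. The sketch below does this with compile-time assertions. The base-address macros are copied from the deck; `RATE_CTRL_LM_BASE`, the size macros, and `region_bytes` are hypothetical names added for the check, and the 4-byte free-list slot per batch buffer is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Base addresses as listed on the Resource Usage slide. */
#define PAR_CACHE_LM_BASE 0x0
#define PORT_DATA_LM_BASE 0x100
#define BBUF_FL_LM_BASE   0x1a8
#define BBUF_LM_BASE      0x1fc
#define FL_LM_BASE        0x598

/* Hypothetical names for values read off the memory map slide. */
#define RATE_CTRL_LM_BASE 0x680
#define NUM_BATCH_BUFS    21
#define BATCH_BUF_BYTES   44
#define CACHE_ENTRIES     16
#define Q_PARAM_BYTES     16

/* 16 cached queues * 16B of queue parameters fill the PAR cache exactly. */
_Static_assert(PORT_DATA_LM_BASE - PAR_CACHE_LM_BASE ==
               CACHE_ENTRIES * Q_PARAM_BYTES, "PAR cache size");
/* One 4-byte free-list slot per batch buffer (assumed slot size). */
_Static_assert(BBUF_LM_BASE - BBUF_FL_LM_BASE == NUM_BATCH_BUFS * 4,
               "batch buffer free list size");
/* 21 batch buffers of 44 bytes each. */
_Static_assert(FL_LM_BASE - BBUF_LM_BASE == NUM_BATCH_BUFS * BATCH_BUF_BYTES,
               "batch buffer region size");
/* The free list must hold at least 40 four-byte entries. */
_Static_assert(RATE_CTRL_LM_BASE - FL_LM_BASE >= 40 * 4, "free list size");

/* Run-time helper for inspecting a region's size in bytes. */
static inline size_t region_bytes(size_t base, size_t next_base)
{
    assert(next_base >= base);
    return next_base - base;
}
```

With these addresses the batch buffer region is 924 bytes and the free list region is 232 bytes.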

  8. Data Consistency Precautions
     • Only one thread (dequeue or enqueue) reads in the queue parameters of a queue
       • Flags ensure that while thread x is reading in the Q params, thread y doesn't read them
       • Thread y also waits until thread x has stored the data it read into the cache
     • Flags are stored in local memory
       • Three flags are used (head valid, tail valid, and Q param valid)
       • Head valid implies the dequeue thread has cached the Q descriptor
       • Tail valid implies the enqueue thread has cached the Q descriptor
       • Both valid means both head and tail are cached
     • Before a thread swaps out
       • Move relevant register contents (flags, queue length) into local memory
     • After a thread resumes
       • Move relevant local memory data back to registers
     • Cache contents are refreshed every 4k iterations
     • Port rates held in registers are refreshed every 4k iterations

  9. Initialization
     • Thread 0 initializes all shared data structures ???
       • CAM and Q-array (cam_clear and Q-array empty)
       • Memory controller variables
         • Set SRAM Channel CSR to ignore cellcount and eop bit in the buffer handle
       • Local memory
         • Queue parameter cache (all zeroes)
         • Scheduling data structures (set by scheduler)
       • SRAM
         • Queue parameters (length, weight quantum, discard threshold)
         • Queue descriptors (all zeroes)
         • Port rates (as per token bucket)
         • Free list (set by free list macro)
         • Scheduling data structure (set by scheduler)

  10. Enqueue Thread
      • Operates in batch mode (5 packets at a time)
      • Read 5 requests from the scratch ring
      • Check CAM for the 5 queue ids read
        • If miss
          • Evict LRU entry (write back queue params and descriptor)
          • Read queue params from SRAM into cache
          • Read queue descriptor into Q-array
          • Update CAM
      • Check for discard
        • If discard, call dl_drop_buf
        • If admit
          • Send enqueue command to Q-array
          • Check if queue was already active
            • If not, call add_queue_to_tail
          • Update the queue length in cache
          • Write back queue length (in future may want to do this less often)
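The admit/discard step for a single request, once its queue parameters are cached, can be sketched as below. Several points are assumptions rather than facts from the deck: queue length is taken to be in bytes, a packet is discarded when it would push the length past the discard threshold, and a queue counts as "already active" when its length was non-zero before the enqueue. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t length;             /* assumed to be tracked in bytes */
    uint32_t discard_threshold;  /* same unit as length */
} queue_params;

typedef enum {
    ENQ_DISCARD,         /* caller would invoke dl_drop_buf */
    ENQ_ADMIT,           /* queue already scheduled; only the length changes */
    ENQ_ADMIT_ACTIVATE   /* queue was idle; caller would call add_queue_to_tail */
} enq_action;

static inline enq_action enqueue_decide(queue_params *q, uint32_t frame_len)
{
    assert(q != NULL);

    /* Overflow-safe form of: length + frame_len > discard_threshold. */
    if (frame_len > q->discard_threshold ||
        q->length > q->discard_threshold - frame_len)
        return ENQ_DISCARD;

    int was_active = (q->length > 0);   /* assumed definition of "active" */
    q->length += frame_len;             /* cached copy; written back by the caller */
    return was_active ? ENQ_ADMIT : ENQ_ADMIT_ACTIVATE;
}
```

Under this rule a discard threshold of 0 rejects every packet, which matches the "set length at 0" discard test described on the validation slide.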

  11. Dequeue Thread (per port)
      • One thread handles one port
      • Done for the round if port rate $$tx_q_flow_control is set, the port is inactive (port_active macro), or the tokens are used up
      • If the current batch is done, call the get_head macro
      • If the batch buffer is non-empty, consider the first queue_id
      • Check CAM for the queue_id
        • If miss
          • Evict LRU entry (write back queue params and descriptor)
          • Read queue params from SRAM into cache
          • Read queue descriptor into Q-array and update CAM
        • If hit, or after the data is ready
          • Send dequeue command to Q-array
          • Call dl_sink_1ME_SCR_1words
          • Read the pkt_length from the buffer descriptor
          • Update queue length (and write back) and the credit
          • If credit <= 0 and queue_length > 0 then add_queue_to_tail
          • If queue_length <= 0 OR credit <= 0 then increment batch_index
          • If batch_index = 5 OR queue_id = 0 then call advance_head
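The credit bookkeeping at the end of the dequeue path can be sketched as follows. The two conditions are taken directly from the slide. The replenishment rule (leftover credit plus the weight quantum when a queue is put back on the tail) is the usual deficit-round-robin convention and is an assumption here, as are all the names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t length;   /* queue length, same unit as pkt_len */
    int32_t credit;   /* remaining WDRR credit for this visit */
} sched_queue;

typedef struct {
    bool    requeue;         /* credit <= 0 and queue_length > 0: add_queue_to_tail */
    bool    next_queue;      /* queue_length <= 0 or credit <= 0: increment batch_index */
    int32_t requeue_credit;  /* assumed: leftover credit + weight quantum */
} deq_outcome;

static inline deq_outcome after_dequeue(sched_queue *q, int32_t pkt_len,
                                        int32_t weight_quantum)
{
    assert(q != NULL && pkt_len > 0);
    deq_outcome out;

    q->length -= pkt_len;
    q->credit -= pkt_len;

    out.requeue = (q->credit <= 0 && q->length > 0);
    out.next_queue = (q->length <= 0 || q->credit <= 0);
    /* Standard DRR carry-over; the deck does not spell this rule out. */
    out.requeue_credit = out.requeue ? q->credit + weight_quantum : 0;
    return out;
}
```

A queue that empties is simply dropped from the schedule in this model; it would be re-added by the enqueue thread when it next becomes active.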

  12. Enqueue Thread (instruction counts)
      • Read 15 words from scratch: 28 inst.
      • For 5 q_ids, check CAM hit; if miss, write back LRU and read queue param/descriptor: 40/31 inst. per Q (miss/hit), 202/157 inst. total
        • 2x5 writes (1, 3 words); 2x5 reads (3, 2 words)
      • For all 5 requests: Admit?
        • If discard: dl_drop_buf(), 41 inst. per packet
        • If admit: enqueue / update Q params, 62 inst. + add_q_2_tail per packet
        • Active? If not, add_queue_to_tail() (x inst., x = 18-49)
        • Write back the queue length
        • 2x5 writes (1, 1 word); SCH reads
        • Total: 205 / 310+5x inst., + 6 inst. for signals
      • Loop around
      • Totals: worst case 545+5x; all discard 395; all accept/hit 500+5x

  13. Dequeue Thread (per port, instruction counts)
      • Rate_control: 27 inst.; 1 read (once / 16K cycles)
      • If curr_queue = 0, get_head(): 27 inst.
      • Check CAM, evict, load: 32/44 inst.; 2 writes (1, 3 words), 2 reads (3, 2 words)
      • Update cache, dequeue: 24+ inst.; 1 read
      • Send tx_msg, read pkt_len: 34 inst.; 1 read
      • Update credit/q_len, write q_len: 13 inst.; 1 write
      • add_queue_to_tail(): 18-49 inst.
      • advance_head(): 35-63 inst.
      • Overheads: 13 inst.
      • Write_old_tail and loop around
      • Totals: worst case 320; best case 170

  14. Dequeue Rate Control (Updated by JDD)
      • Token bucket
        • The unit of port_rate is bytes per 4096 clocks (ME clock/16 MHz).
        • curr_time counts units of 16 clocks (ME clock/16 MHz).
        • last_time is the time when the last packet was sent.
      • Algorithm:
        • IF PORT IS INACTIVE THEN tokens = 4095
        • ELSE IF (tokens = 4095)
          • SEND PACKET
          • last_time := curr_time
          • tokens = tokens - pkt_length
        • ELSE
          • result = ((curr_time - last_time) x port_rate) + residualResult // 16 x 16 multiply
          • residualResult = (result << 22) >> 22 // save bits shifted out to add back in next time
          • tokens = min [ 4095, tokens + (result >> 10) ]
          • IF (tokens > 0)
            • SEND PACKET
            • last_time := curr_time
            • tokens = tokens - pkt_length
      • Port rates
        • Must be specified in the 16 LSBs
        • 1 unit = 683 Kbps
        • Max port rate = 64K units = 44.8 Gbps
        • Word format: Reserved (16b), Port rate (16b)
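A minimal sketch of the token-bucket update follows, under stated assumptions. Time is a 16-bit counter that ticks every 16 ME clocks, the low 10 bits of the product are carried as the residual, and the names are hypothetical. One deliberate deviation from the slide's pseudocode: refilled tokens and the residual are committed only when a packet is actually sent, so that time elapsed since last_time is not counted twice when a send is refused. The deck does not say how the real code handles that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TOKENS    4095
#define RATE_SHIFT    10u
#define RESIDUAL_MASK ((1u << RATE_SHIFT) - 1u)

typedef struct {
    int32_t  tokens;     /* may go negative after a large packet */
    uint16_t last_time;  /* time of last send, in units of 16 ME clocks */
    uint32_t residual;   /* low 10 bits carried into the next update */
    uint16_t port_rate;  /* 16-bit rate from the port rate word */
} port_rate_state;

/* Returns true if the port may send a packet of pkt_length bytes now. */
static inline bool rate_control_send(port_rate_state *p, bool port_active,
                                     uint16_t curr_time, int32_t pkt_length)
{
    assert(p != NULL && pkt_length > 0);
    if (!port_active) {
        p->tokens = MAX_TOKENS;
        return false;
    }

    int32_t  tokens = p->tokens;
    uint32_t residual = p->residual;
    if (tokens != MAX_TOKENS) {
        /* 16 x 16 multiply; the uint16_t cast handles counter wrap-around. */
        uint16_t elapsed = (uint16_t)(curr_time - p->last_time);
        uint32_t result = (uint32_t)elapsed * p->port_rate + residual;
        residual = result & RESIDUAL_MASK;
        int64_t refilled = (int64_t)tokens + (int64_t)(result >> RATE_SHIFT);
        tokens = refilled > MAX_TOKENS ? MAX_TOKENS : (int32_t)refilled;
        if (tokens <= 0)
            return false;   /* nothing committed; retry on a later round */
    }

    p->tokens = tokens - pkt_length;
    p->residual = residual;
    p->last_time = curr_time;
    return true;
}

/* Bits per second represented by one port_rate unit, assuming one byte per
 * 1024 timer ticks of 16 ME clocks each (16384 clocks). */
static inline double port_rate_unit_bps(double me_clock_hz)
{
    return 8.0 * me_clock_hz / (16.0 * 1024.0);
}
```

With a 1.4 GHz ME clock, `port_rate_unit_bps` gives about 683.6 Kbps, which agrees with the "1 unit = 683 Kbps" figure. This reading of the unit as one byte per 16384 clocks is an inference from the shift by 10 and the 16-clock timer; the slide itself quotes the unit as bytes per 4096 clocks.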

  15. Performance Analysis
      • Dequeue thread runs much longer than the enqueue thread
      • Dequeue
        • 1273 cycles in case of a cache miss, add_queue_to_tail() and advance_head()
        • 867 cycles in case of a cache hit and no scheduler calls
      • Enqueue
        • 876 cycles in case of all 5 cache misses
        • 342 cycles in case of a single enqueue and cache hit
      • Dequeue takes more time due to memory accesses
        • Read Queue_param: 110 cycles
        • Dequeue: 120 cycles
        • Read pkt_len: 110 cycles
      • There are a few idle cycles at present
        • They can be removed by giving higher priority to dequeue threads

  16. File locations (in …/IPv4_MR/)
      • Code
        • src/qm/PL/common_macros.uc
        • src/qm/PL/dequeue.uc
        • src/qm/PL/enqueue.uc
        • src/qm/PL/fl_macros.uc
        • src/qm/PL/qm.h
        • src/qm/PL/qm.uc
        • src/qm/PL/sched_macros.h
      • Includes
        • ../dispatch_loop/dl_source_WU.uc
          • dl_buf_drop() and dl_sink_1ME_SCR_1words() functions
        • Also uses local memory read and write macros (localmem.uc)

  17. Queue Manager Validation
      • Tested
        • Threshold length discards (set length at 0, and tested if packets are enqueued)
        • Enqueue
          • Single port, single queue active
          • Multiple ports/queues active
          • Cache hit/miss (not all scenarios are tested)
        • Dequeue
          • Rate control partially tested (set the port rate at 0, and see if packets are dequeued)
          • Partial fairness test (set quantum at 0, and see if packets are dequeued)
          • Multiple active ports/queues
        • Both queue managers enabled
      • There is one bug concerning Q-array contention

  18. Cycle Budget
      • 76B packet
      • 1.4 GHz clock rate (1.4 Gcycle/sec)
      • 5 Gbps => 170 cycles per packet
      • Dequeue worst case = 320 inst. (best case 170 inst.)
      • Enqueue worst case = 545 + 5x inst. for 5 packets
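The per-packet budget is just packet time multiplied by clock rate. The helper below is a hypothetical illustration of that arithmetic. Plugging in a 76-byte packet, a 1.4 GHz clock, and a 5 Gbps line rate reproduces the slide's figure of roughly 170 cycles per packet; the 5 Gbps value is inferred from that match.

```c
#include <assert.h>

/* Clock cycles available per packet at a given line rate:
 * cycles = (packet bits / line rate) * clock frequency. */
static inline double cycles_per_packet(double pkt_bytes, double line_rate_bps,
                                       double clock_hz)
{
    assert(pkt_bytes > 0.0 && line_rate_bps > 0.0 && clock_hz > 0.0);
    return pkt_bytes * 8.0 * clock_hz / line_rate_bps;
}
```

For example, `cycles_per_packet(76.0, 5e9, 1.4e9)` evaluates to about 170.2.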

  19. Scheduling Structure Overview
      [Figure: scheduling structure diagram. Labels: Head, Next Head, Tail; Port 0 ... Port 4; Batch Buffer (Queue 0 ... Queue 4, Credits 0 ... Credits 4, SRAM Next Pointer); Batch Buffers in SRAM; Free List (for SRAM Batch Buffers); Batch Buffer Free List (for LM Batch Buffers); Stack in SRAM; Stack in Local Memory]

  20. Scheduling Structure Interface
      • Scheduling structure macros contained in \src\qm\PL\sched_macros.uc
        • add_queue_to_tail(queue, credits, port)
        • get_head(port, head_ptr)
        • advance_head(port, sig_a, sig_b)
        • port_active(port, label)
        • write_old_tail(port, sig_a, sig_b)
      • Free list macro contained in \src\qm\PL\fl_macros.uc
        • maintain_fl()
