APIC Tutorial --- Architecture and Hardware John DeHart Washington University jdd@arl.wustl.edu http://www.arl.wustl.edu/~jdd
Coverage • APIC is a complicated device • No way we can cover everything today. • in the original workshop we spent one whole day on the APIC architecture and hardware and a second day on the software • Lots more details in Zubin’s slides from the original workshop: • http://www.arl.wustl.edu/gigabitkits/kits.html • go to “Course Slides & Papers” in left margin • Also, papers and documentation from web site.
Our Original Goals for the APIC • Build a high speed ATM host interface • Single Chip • Low cost • High Bandwidth • Gigabit all the way to the application • Low Latency • Zero copy • Support for Quality of Service
APIC Features Overview • 32 bit and 64 bit PCI at 33MHz • All of our cards are 32 bit. • Point-to-Point, Multipoint and Loopback VCs • AAL5 Segmentation and Reassembly • AAL0: Raw ATM (RATM) • Support for multiple traffic types • Batching of cells in PCI Transaction • Control via PCI bus and remotely via control cells • Multiple DMA modes • Interrupts and Notification List for efficient interrupt handling • Flow Control: UTOPIA and ATM GFC field
APIC Internal Design [Figure: APIC block diagram showing data paths and control paths. Blocks: Utopia input ports 0 to 2 (Input Port, Input Sync), VC Translation Table (VCXT), Cell Store, Utopia output ports 0 to 2 (Output Sync, Output Port), Tx Sync, Rx Sync, Requestor, Pacer, Register Manager, DataPath, Bus Interface, Interrupt/Notification Manager, PCI-32/64 Bus]
APIC Internal Design: 6 Clock Regions • A, B, C, D: Link Clocks (typically 62.5 MHz) • E: Bus Clock (PCI: 33 MHz) • F: Internal Clock (85 MHz) [Figure: APIC block diagram with clock regions A to F marked]
APIC Transit Path: ATM Port to ATM Port [Figure: APIC block diagram with this path highlighted]
APIC Receive Path: ATM Port to Memory [Figure: APIC block diagram with this path highlighted]
APIC Transmit Path: Memory to ATM Port [Figure: APIC block diagram with this path highlighted]
APIC Multipoint Receive Path: ATM Port to * [Figure: APIC block diagram with this path highlighted]
APIC Multipoint Transmit Path: Memory to * [Figure: APIC block diagram with this path highlighted]
APIC Loopback Path: Memory to Memory [Figure: APIC block diagram with this path highlighted]
APIC Multipoint Loopback Path: Memory to * [Figure: APIC block diagram with this path highlighted]
APIC Control and Response Cell Path [Figure: APIC block diagram with this path highlighted]
APIC and AALs • AAL5 • Frames up to 65535 bytes. • Used for IP Packets • Format on next slide • AAL0 • Host can send and receive individual ATM Cells • Used for: • communication with raw ATM devices • sending specially formatted control cells • APIC uses 56 byte cell format shown on a future slide.
AAL5 Frames
AAL5 frame layout (the whole frame is a multiple of 48 bytes):
  Field          Length (bytes)
  Packet data    1 to 65535
  Padding        0 to 47
  User-to-User   1
  Reserved       1
  Length         2
  CRC            4
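The padding size follows directly from the field sizes on this slide: the packet data plus the 8 trailer bytes must be rounded up to a multiple of 48. A minimal Python sketch of that arithmetic; the function and constant names are illustrative and not part of any APIC driver.

```python
CELL_PAYLOAD = 48  # bytes of AAL5 data carried in each ATM cell
TRAILER = 8        # User-to-User (1) + Reserved (1) + Length (2) + CRC (4)

def aal5_layout(payload_len: int) -> tuple[int, int]:
    """Return (padding_bytes, cell_count) for an AAL5 frame."""
    if not 1 <= payload_len <= 65535:
        raise ValueError("AAL5 packet data must be 1 to 65535 bytes")
    # Pad so that data + padding + trailer is a multiple of 48 bytes.
    padding = -(payload_len + TRAILER) % CELL_PAYLOAD
    cells = (payload_len + padding + TRAILER) // CELL_PAYLOAD
    return padding, cells
```

For example, a 1500 byte IP packet needs 28 bytes of padding and occupies 32 cells, while a 41 byte packet needs the maximum 47 bytes of padding.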
AAL0 Frames • One cell: 56 bytes internally; when it goes out onto the ATM link it is, of course, 53 bytes • AAL0 frame layout: APIC header (4 bytes), ATM header (4 bytes), payload (48 bytes) • APIC AAL0 header (32 bits, listed from bit 31 down to bit 0): pIn, pOut, C, L, ChanId • pIn: Port In • pOut: Port Out • C: Control Cell • L: Low Delay • ChanId: Channel Id
APIC Traffic Types • Transmit • Low Delay • highest priority • transmitted at link rate (APIC Global Pacing Rate) • Paced • transmitted at rate configured for channel • rates independently configurable for each channel • Best Effort • lowest priority • can use whatever bandwidth is left after low delay and paced channels • Receive • Low Delay • Strictly higher priority than Normal Delay • Normal Delay • Only serviced when all Low Delay queues are empty
APIC Descriptors and Buffers [Figure: descriptor chain with full buffers, a partially filled buffer, and empty buffers; the Current Descriptor is marked] • A Buffer Descriptor points to a buffer that is queued either as a source of data to send or as a destination for received data • Buffer Descriptor contains: • Address of buffer • physical address: PCI bus operates on physical not virtual memory • Buffer Length • Link to next descriptor • Flags
Buffer Details • Receive Buffers: • 8-byte aligned and a multiple of 8 bytes in length • CAVEAT: RX Sync Bug • AAL0 buffers should be a multiple of 56 bytes in length • AAL5 buffers should be a multiple of 48 bytes in length • Single AAL5 frame can span multiple buffers • No buffer can contain data from more than one AAL5 frame • EndOfFrame bit (E) set in buffer containing the last 8 bytes of the AAL5 frame. • with caveat above, this expands to be the last cell of the AAL5 frame • Multiple AAL0 frames can occupy the same buffer • Single AAL0 frame can span multiple buffers • BUT because of caveat above, this won’t happen. • Buffers for AAL0 will be completely filled
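The receive buffer sizing rule above can be sketched as a small helper that rounds a requested size up to the next safe length. This is an illustrative Python sketch, not driver code; `rx_buffer_size` is a hypothetical name. Because both 56 and 48 are multiples of 8, the result also satisfies the "multiple of 8 bytes" rule.

```python
# Safe receive-buffer granularity per AAL type, from the RX Sync caveat:
# 56 bytes for AAL0 (APIC internal cell size), 48 bytes for AAL5.
RX_UNIT = {0: 56, 5: 48}

def rx_buffer_size(min_bytes: int, aal: int) -> int:
    """Round min_bytes up to the next multiple of the safe unit."""
    if min_bytes <= 0:
        raise ValueError("buffer size must be positive")
    unit = RX_UNIT[aal]
    # Ceiling division, then scale back up to a byte count.
    return -(-min_bytes // unit) * unit
```

A request for at least 2048 bytes on an AAL5 channel would come back as 2064 bytes (43 cells of payload).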
Buffer Details • Transmit Buffers: • Need not be aligned on word boundaries • But our drivers always do… • Can be of any length • Single AAL5 frame can span multiple buffers • No buffer can contain data from more than one AAL5 frame • EndOfFrame bit (E) set in buffer containing first byte of the last cell for the AAL5 frame. • Multiple AAL0 frames can occupy the same buffer • A single AAL0 frame can span multiple buffers • All buffers will be completely transmitted unless there is an error
Descriptor Details • All descriptors MUST reside in a block of contiguous physical memory, 1MB or less • All descriptors MUST be 16-byte aligned • The APIC global descriptor area pointer register must contain the address of this block of memory • Think of the descriptor area as an array of descriptors • nextDescOfs field in the descriptors is an index into the descriptor array • 16 bit index, so 65536 descriptors are possible • 65536 descriptors * 16 bytes per descriptor = 1MB
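Treating the descriptor area as an array makes the address arithmetic simple. A minimal Python sketch, assuming the descriptor area base itself is 16-byte aligned; `desc_phys_addr` and the constant names are illustrative only.

```python
DESC_SIZE = 16       # bytes per descriptor
MAX_DESCS = 1 << 16  # a 16 bit index gives 65536 descriptors

def desc_phys_addr(area_base: int, index: int) -> int:
    """Physical address of descriptor `index` in the descriptor area."""
    if area_base % DESC_SIZE:
        raise ValueError("descriptor area must be 16-byte aligned")
    if not 0 <= index < MAX_DESCS:
        raise ValueError("descriptor index must fit in 16 bits")
    return area_base + index * DESC_SIZE
```

The largest possible area is `MAX_DESCS * DESC_SIZE`, which is exactly 1MB.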
APIC Receive Descriptor • Fields, in the order shown: Match/TCP_Checksum; flag bits V, I, S, O, E, C, L, X, T, Y; BufLen; NextDescOfs; BufAddrLo (physical address); BufAddrHi (physical address) • We’ll look at the Y field … • For more details, see Zubin’s original workshop slides
APIC Transmit Descriptor • Fields, in the order shown: Match; TCRC; flag bits V, I, S, O, E, T, Y; BufLen; NextDescOfs; BufAddrLo (physical address); BufAddrHi (physical address) • We’ll look at the Y field next … • For more details, see Zubin’s original workshop slides
Sync Bits (Y Field) of APIC Descriptor • Sync (Y) Bits: Implement Ready/Done • 0 DONE_VALIDLINK • APIC is finished with this descriptor and its link to the next descriptor is valid • 1 DONE_INVALIDLINK • APIC is done with this descriptor BUT its link to the next descriptor is not valid! • Be Careful of this one • 2 NOT_READY • Not ready for the APIC to use • The last descriptor in a chain is always marked NOT_READY by the driver • 3 READY • Ready for the APIC to use • Set in Receive Descriptors in a chain for APIC to use • Set in Transmit Descriptors that are ready for the APIC to send
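The four Y values can be written down as an enumeration. This is a Python sketch of the Ready/Done convention described above; the numeric values and state names come from the slide, while the helper functions are hypothetical illustrations rather than driver routines.

```python
from enum import IntEnum

class Sync(IntEnum):
    DONE_VALIDLINK = 0    # APIC finished; link to next descriptor is valid
    DONE_INVALIDLINK = 1  # APIC finished; link to next descriptor is NOT valid
    NOT_READY = 2         # not ready for the APIC to use
    READY = 3             # ready for the APIC to use

def apic_is_done(y: Sync) -> bool:
    """True once the APIC has finished with the descriptor."""
    return y in (Sync.DONE_VALIDLINK, Sync.DONE_INVALIDLINK)

def may_follow_link(y: Sync) -> bool:
    """The next-descriptor link is only trustworthy in DONE_VALIDLINK."""
    return y == Sync.DONE_VALIDLINK

def chain_sync_values(n: int) -> list[Sync]:
    """Y values for a freshly built chain of n descriptors:
    every descriptor READY except the last, which stays NOT_READY."""
    if n < 1:
        raise ValueError("a chain needs at least one descriptor")
    return [Sync.READY] * (n - 1) + [Sync.NOT_READY]
```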
APIC DMA Modes • Simple DMA • Separate queue of buffer descriptors for each connection • works well for transmit • Inefficient for receive • no sharing of receive buffers and descriptors • Pool DMA • multiple connections share a pool of buffer descriptors • works well for receive • caveat: one connection can use up all the buffer descriptors • obviously, does not work for transmit • Protected DMA • queueing operations executed by user-space driver • pair of descriptors associated with each buffer: • kernel descriptor • user descriptor • See details in Zubin’s original workshop slides.
APIC Interrupts and Notifications • Interrupts used to report an asynchronous event: • completion of transmission/reception of a frame • error condition • Interrupts can be enabled/disabled per channel • Notification List contains list of channels that have had events. • APIC issues an interrupt and disables further interrupts until processor re-enables. • subsequent events will just set an entry in notification list. • This reduces frequency of interrupts • This can also help reduce overhead of interrupt processing.
APIC Register Addresses • 27 bit address space • On PCI Bus, high order 5 bits are device select • These are programmed into the APIC PCI Configuration space at boot time by the BIOS
Global Registers (i.e. not per channel):
  Width (bits):  2  | 14             | 9     | 2
  Value:         00 | 00000000000000 | RegID | 00
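Given the global register layout (2, 14, 9 and 2 bit fields, with everything except RegID zero), a global register address is just RegID shifted left by two. A small Python sketch under that reading; the function names are hypothetical, and treating a value such as 0x208 as an offset within the 27 bit space is an assumption.

```python
REGID_BITS = 9
ADDR_BITS = 27

def global_reg_addr(reg_id: int) -> int:
    """27 bit address of a global register: RegID in bits 10..2, rest zero."""
    if not 0 <= reg_id < (1 << REGID_BITS):
        raise ValueError("RegID must fit in 9 bits")
    return reg_id << 2

def global_reg_id(addr: int) -> int:
    """Inverse of global_reg_addr; rejects non-global or unaligned addresses."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address must fit in 27 bits")
    if addr & 0x3 or addr >> (REGID_BITS + 2):
        raise ValueError("not a global register address")
    return addr >> 2
```

Under this reading an address of 0x208 corresponds to RegID 0x82.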
APIC Register Addresses (continued) • t=0 Rx Channel, t=1 Tx Channel • CID: Channel Index or VCI
Kernel Access Per-channel Registers:
  10 | t | CID | 00000000 | RegID | 00
User Access Per-channel Registers:
  11 | t | CID | 00000000 | RegID | 00
Field widths (bits), as given: 2 8 8 9 2
APIC Pacing: General Stuff • Pacing is for Transmit Channels only • Cells are NOT Paced out onto the wire • Not Exactly • Pacing is done on the PCI bus • Pacing is not a Guarantee, it is just a Restriction • Pacing Calculations include the ATM headers • But not the APIC header
APIC Pacing: General Stuff • Two pacer controls: • Global Pacing • APIC Pacing Parameter register (Global, 0x208) • Per VC Pacing • TX Channel Pacing Parameter Register (TX, 0x500XX68) • XX is the Channel ID • Three types of Channels: • Low Delay (Highest Priority) • Paced • Best Effort (Lowest Priority) • All channels are paced by the Global Pacing • Paced Channels also use Per VC Pacing
APIC Data Transfers • APIC pulls data from memory across the PCI bus in Batches of cells. • The number of cells in a Batch is controlled by a register • The Pacer identifies when it is time to transmit data and which connection should transmit • Pacer “wakes up” every 14 PCI Bus clock ticks • checks to see if it is time to transmit • Controlled by the Global APIC Pacing Parameter (APP) • If it is time to transmit, it takes the first connection off the previously sorted list of keys and transmits its data. • The gory details about keys and heap storage of connections are not included here. Read Rex’s documentation and/or read the VHDL if you want that level of detail
Global Pacing Parameter • Pacing parameters are 24 bits • 16 bits of Integer • 8 bits of fractional part • Global APIC Pacing Parameter (APP):
APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
[Items in formula explained on next slide]
Explanation of Expression
APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
• 256: shifts left by 8 bits to set “decimal point” • BatchSz: How many cells per transfer • 53*8: Translate cells/second into bits/second • 8192, InternalClockMhz (85MHz), ClockEstimate • APIC counts how many of its internal 85MHz clock ticks take place during the time it takes for 8192 PCI bus clock ticks. This value is the ClockEstimate. • PCI Bus Clock Rate in MHz = (8192 * 85)/ClockEstimate • 14: # of PCI Bus Ticks in a Pacer Period • LinkRateMbps: Our target rate [Example on next 2 slides]
Example: Units in the APP Formula
APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
APP = (256 * Cells * Bytes/Cell * Bits/Byte * 8192 * M/sec) / (14 * 1 * MBits/sec)
Example: APP for 1Gb/s Link Rate
APP = (256 * BatchSz * 53 * 8 * 8192 * InternalClockMhz) / (14 * ClockEstimate * LinkRateMbps)
• BatchSz = 8 • 53*8: Translate cells/second into bits/second • InternalClockMhz = 85MHz • ClockEstimate = 20954 (typical value) • LinkRateMbps: 1000 (1000 Mb/s == 1Gb/s)
APP = (256 * 8 * 53 * 8 * 8192 * 85) / (14 * 20954 * 1000) = 2061.15
APP = 2061 = 0x80D
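The APP calculation is easy to check numerically. The following Python sketch evaluates the formula as written on the slides; the function names are illustrative, and truncating the result to an integer register value is an assumption (the slide only shows 2061.15 becoming 2061).

```python
def pci_clock_mhz(clock_estimate: int, internal_clock_mhz: float = 85.0) -> float:
    """PCI bus clock rate in MHz = (8192 * InternalClockMhz) / ClockEstimate."""
    return 8192 * internal_clock_mhz / clock_estimate

def apic_pacing_parameter(batch_sz: int, clock_estimate: int,
                          link_rate_mbps: float,
                          internal_clock_mhz: float = 85.0) -> float:
    """Global APIC Pacing Parameter (APP), already scaled by 256
    for the 8 fractional bits."""
    numerator = 256 * batch_sz * 53 * 8 * 8192 * internal_clock_mhz
    denominator = 14 * clock_estimate * link_rate_mbps
    return numerator / denominator

app = apic_pacing_parameter(batch_sz=8, clock_estimate=20954, link_rate_mbps=1000)
print(f"APP = {app:.2f} -> register value {int(app)} = {int(app):#x}")
print(f"PCI clock = {pci_clock_mhz(20954):.2f} MHz")
```

With the typical ClockEstimate of 20954 this gives a register value of 2061 (0x80d) and a PCI clock of about 33.23 MHz.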
Example: APP for 1Gb/s Link Rate • APP = 2061 = 0x80D • This means that every 14*8 = 112 PCI Bus clock ticks the APIC will be able to pull 8 Cells worth of data across the PCI Bus. • (8 Cells)/(112 * 30ns) = (3392 bits)/(3360ns) ~= 1Gb/s
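The rate check on this slide can be reproduced in a few lines. A Python sketch using the slide's round figure of 30ns per PCI clock tick; `batch_rate_gbps` is a hypothetical helper name.

```python
def batch_rate_gbps(batch_cells: int, pacer_periods: int,
                    pci_tick_ns: float = 30.0) -> float:
    """Rate achieved when one batch is pulled every `pacer_periods`
    pacer periods of 14 PCI clock ticks each. Bits per ns equals Gb/s."""
    bits = batch_cells * 53 * 8            # 53 byte cells on the link
    elapsed_ns = pacer_periods * 14 * pci_tick_ns
    return bits / elapsed_ns

# 8 cells every 14 * 8 = 112 PCI ticks: 3392 bits in 3360 ns.
print(f"{batch_rate_gbps(8, 8):.4f} Gb/s")
```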
Per VC Pacing • Per VC Pacing Parameter • What portion of the full link rate can be used • e.g. an integer value of 2 means that this channel can use half the link rate • Conceptually like this: [Figure: cascade of counters driven by the 33 MHz PCI Bus Clock: Count to 14, then Count to APP, then Count to TX Channel Pacing Parameter, after which this Tx Channel is ready to transmit BATCH cells]
Per VC Pacing [Figure: timeline divided into APIC pacing periods, with the current pacedTime marked and expired connections shown as X; example with vcPacingParameter ~ 10] • newExpirationTime = oldExpirationTime + vcPacingParameter
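A rough Python sketch of the per VC bookkeeping, under two assumptions that the slides do not spell out: that the per VC parameter uses the same 16.8 fixed-point layout given for pacing parameters in general, and that expiration times are kept in the same units. All names are illustrative.

```python
FRACTION_BITS = 8   # 8 bits of fractional part
PARAM_BITS = 24     # 16 integer bits + 8 fractional bits

def encode_pacing(value: float) -> int:
    """Encode a pacing value as 16.8 fixed point (assumed layout)."""
    raw = round(value * (1 << FRACTION_BITS))
    if not 0 <= raw < (1 << PARAM_BITS):
        raise ValueError("pacing value does not fit in 24 bits")
    return raw

def vc_pacing_for_fraction(link_fraction: float) -> int:
    """A value of N lets the channel use 1/N of the link rate."""
    if not 0 < link_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return encode_pacing(1.0 / link_fraction)

def next_expiration(old_expiration: int, vc_pacing_raw: int) -> int:
    """newExpirationTime = oldExpirationTime + vcPacingParameter."""
    return old_expiration + vc_pacing_raw
```

Under these assumptions a channel limited to half the link rate would be programmed with 0x200 (integer part 2, fractional part 0).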
pacedTime • pacedTime is incremented every global pacing cycle in which a non-LowDelay connection wins contention • Example with two connections: • (L) Low Delay at 1/24th of the global rate • (P) Paced at 1/6th of the global rate (.1666667) [Figure: timeline from 0 to 84 in steps of 6, marking where the L and P connections fire]
pacedTime (continued) • We might expect the Paced channel to miss its exact turn and fire on the next global pacing interval but keep its next expiration on the (0,6,12,18,…) boundaries. • But… [Figure: the same timeline from 0 to 84 in steps of 6, marking where the L and P connections fire]
pacedTime (continued) [Figure: timeline of L and P firings plotted against both time scales]
  “Real” time (t+):  0  6  12  18  24  30  36  42  48  54  60  66  72  78  84
  pacedTime (t+):    0  5  11  17  22  28  34  40  45  51  57  63  68  74  80
• Actual rate for Paced connection: • (GlobalRate) * (3*(1/6) + 1*(1/7))/4 • (GlobalRate) * (.1607) • For a Global Rate of 24Mb/s (DQ test example) • 24 * .1607 = 3.8568
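The rate estimate on this slide is plain arithmetic and can be compared against the nominal 1/6 share. A small Python sketch that only evaluates the slide's expression; it does not model the pacer itself, and the function names are illustrative.

```python
def nominal_paced_rate(global_rate_mbps: float) -> float:
    """Rate the Paced channel would get at exactly 1/6 of the global rate."""
    return global_rate_mbps / 6

def estimated_paced_rate(global_rate_mbps: float) -> float:
    """Slide's estimate: three firings spaced 6 apart and one spaced 7 apart."""
    return global_rate_mbps * (3 * (1 / 6) + 1 * (1 / 7)) / 4

print(f"nominal   {nominal_paced_rate(24):.4f} Mb/s")
print(f"estimated {estimated_paced_rate(24):.4f} Mb/s")
```

Without rounding the factor to .1607 first, the estimate for a 24Mb/s global rate comes out at about 3.857 Mb/s, against a nominal 4 Mb/s.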
Example of a Pacing Oddity • Suppose we are sending single cell packets on a channel at a rate of 2 cells every pacing period for that channel, and the BATCH size is 1 cell, so the channel should only send 1 cell during each pacing period. [Figure: timeline of D events] • You would expect the connection to build up a backlog, but it doesn’t……
Example of a Pacing Oddity (cont’d) • Turns out the Driver does a RESUME each time it puts data in an empty transmit queue to restart it. • A RESUME causes the ExpireTime to be set to the current PacedTime. • This causes the channel to be expired at the very next Pacer Period. • Thus the channel transmits at twice its expected rate [Figure: timeline of alternating D and T events, with an R event at each D]
APIC Bugs and Caveats: RxSync • RxSync lockup when buffers are too short • APIC is receiving data for a connection. • APIC runs out of buffers when there is still data left • If this happens repeatedly, under certain conditions the APIC’s Rx-Sync module can lock up. • Example: if we have three 16-byte buffers set up to receive one 56-byte AAL0 cell (remember that the APIC AAL0 cell size is 56 bytes), then each time we receive a cell with these buffers we will have 8 bytes left over that the APIC SHOULD throw away. After the eighth time we use this chain of buffers to receive a cell, the APIC locks up. • A similar problem exists for AAL5. • Bug has not been identified in VHDL • Work-arounds: • For AAL0, always allocate buffers in multiples of 56 bytes. • For AAL5, always allocate buffers in multiples of 48 bytes.
APIC Bugs and Caveats: Word Swap • APIC swaps contiguous 32bit words when receiving data into host memory. • Exists in APIC when used in Intel architectures • Exists only in 32bit PCI mode • Bug has been identified in VHDL but we aren’t going to respin the chip… • Work-arounds: • Driver performs a word swap on all data received. • painful and costly data touch
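The driver-side work-around amounts to exchanging each pair of adjacent 32 bit words in the received data. A Python sketch of that data touch, assuming the swap happens within each aligned 8 byte unit (which fits the rule that receive buffers are a multiple of 8 bytes long); `swap_word_pairs` is a hypothetical name, not the driver's routine.

```python
def swap_word_pairs(data: bytes) -> bytes:
    """Exchange the two 32 bit words inside every 8 byte unit."""
    if len(data) % 8:
        raise ValueError("length must be a multiple of 8 bytes")
    out = bytearray(len(data))
    for i in range(0, len(data), 8):
        out[i:i + 4] = data[i + 4:i + 8]   # second word moves first
        out[i + 4:i + 8] = data[i:i + 4]   # first word moves second
    return bytes(out)
```

Applying the function twice returns the original data, so the same routine undoes the swap introduced by the hardware. Every received byte is copied once, which is the "painful and costly data touch" the slide refers to.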
APIC Bugs and Caveats: ILR • Bug in APIC decode of Interrupt Line Register address on writes • ILR is at 0x3C • BIOS writes IRQ value to ILR register and then reads it back to see if this is a functioning PCI device. If it doesn’t read back properly, it “removes” this device from the PCI bus • BIOS write to 0x3C enters APIC as write to 0x7C • reads of 0x3C are ok. • Bug has been identified in VHDL. • Work-around implemented on NICs and SPCs • you should never have to worry about this one…