iSCSI: Past, Present, Future


    1. iSCSI: Past, Present, Future
        - Robert Russell
        - Computer Science Department and IOL, University of New Hampshire

    2. Very Brief History
        - Late 1997: idea of storage over IP (Julian Satran, IBM Research)
        - Late 1999: IBM and Cisco start joint work on a proposal for a standard
        - Early 2000: IETF creates IP storage working group
        - November 2000: IETF draft 0 posted

    3. Rest of Brief History
        - Jan 2001: SNIA creates IP storage forum
        - July 2001: first UNH IOL iSCSI Plugfest
            - 28 companies attended
            - Tested drafts 0 and 6
        - Feb 2003: IETF approves draft 20
        - June 2003: Microsoft Server with iSCSI
        - April 2004: IETF publishes RFC 3720

    4. Today (December 2005)
        - iSCSI products offered by all storage and platform vendors
        - Many small vendors in the market
        - iSCSI now well accepted at low and middle performance ranges
        - 1 Gig wire-speed HBAs available
        - 10 Gig iSCSI products starting to appear

    5. Other SAN Technologies
        - Enterprise data centers still based on Fibre Channel (1 Gig, 2 Gig, soon 4 Gig)
        - Renewed interest in iFCP and FCIP
        - Will Fibre Channel equipment prices be lower in the near future?
        - InfiniBand

    6. SCSI Transport Protocol for TCP
        - Based on widely used, off-the-shelf technology
            - SCSI, TCP, IP, IPsec, Ethernet
            - Familiar, already installed infrastructure
            - Commodity components, inexpensive
        - Permits all-software implementations
            - Encourages experimentation, early feedback
            - Many freely distributed implementations

    7. iSCSI Design Principles
        - Target controls data transfer
            - To enable fair sharing of resources
            - To manage limited memory resources
            - To improve disk performance
        - Messages (PDUs) in both directions are sequenced and acknowledged
            - In addition to TCP sequencing and acknowledging
            - To maintain SCSI command ordering
            - To detect errors
            - To control data flow

    8. iSCSI: Text Negotiation
        - Stylistic departure from FC, TCP, etc.
            - Key=value
        - Used for multiple purposes
            - Login, authentication, discovery, renegotiation
        - Easy to use, understand, debug
        - Slower to process, bigger messages
            - Used mostly in Login; sessions are long-lived
            - Linux initiator now split between kernel/user
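As a rough illustration of the key=value style, the sketch below builds and parses a text data segment. It assumes the RFC 3720 wire format, in which each key=value pair is terminated by a NUL byte; the helper names `build_text_keys` and `parse_text_keys` and the sample values are illustrative only and do not come from the presentation.

```python
def build_text_keys(pairs):
    """Encode a dict as NUL-terminated key=value pairs (assumed RFC 3720 text format)."""
    return b"".join(f"{k}={v}".encode("utf-8") + b"\x00" for k, v in pairs.items())


def parse_text_keys(segment):
    """Decode a NUL-delimited text segment back into a dict of strings."""
    pairs = {}
    for item in segment.split(b"\x00"):
        if not item:  # skip the trailing terminator and any zero padding
            continue
        key, sep, value = item.decode("utf-8").partition("=")
        if not sep:
            raise ValueError(f"malformed text key: {item!r}")
        pairs[key] = value
    return pairs


# Hypothetical offer from an initiator during Login.
offer = {"HeaderDigest": "CRC32C,None", "MaxRecvDataSegmentLength": "8192"}
segment = build_text_keys(offer)
print(len(segment), "bytes on the wire")
print(parse_text_keys(segment))
```

The example also hints at the cost the slide mentions: a numeric value that would fit in a few binary bytes occupies a full text string and must be scanned and converted on receipt.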

    9. iSCSI: Designed-in Extensibility
        - Text keys and values
            - Can carry info in both directions (slow)
        - Additional header segments (AHS)
            - Can carry info from initiator to target (fast)
        - Asynchronous messages
            - Can carry info from target to initiator (fast)

    10. iSCSI: Error Handling
        - End-to-end CRCs (digests) useful because TCP has a weak checksum
            - Stone and Partridge, ACM SIGCOMM 2000, pp. 309-319
            - TCP checksum observed to catch an error in 1 of every 1100 to 32000 segments
            - An error gets through the TCP checksum to the application in 1 of every 16 million to 10 billion segments
        - Markers embedded in stream: little used
        - 3 levels of recovery to deal with CRC errors and connection loss
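The digests referred to above are CRC-32C (Castagnoli) checksums in RFC 3720. A minimal, deliberately slow bit-at-a-time sketch follows; real implementations use lookup tables or hardware support, and the sample payload is made up for illustration.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C using the reflected Castagnoli polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard CRC-32C check value for the ASCII string "123456789".
assert crc32c(b"123456789") == 0xE3069283

payload = b"hypothetical iSCSI data segment"
digest = crc32c(payload)

# Flip a single bit, as a link or memory error might.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
print(crc32c(corrupted) != digest)  # True: a CRC always detects a single-bit error
```

The nested loop over every bit of every byte is also a reminder of why a software CRC becomes expensive as PDUs grow.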

    11. iSCSI: Error Recovery
        - Complex, many choices of action to take
        - Poorly tested, may hide bugs
        - Why so complex?
            - SCSI error recovery slow, crude
            - Some applications require absolute accuracy
            - Compromise after long discussion
        - Philosophy repudiated by iSER/iWARP

    12. Draft Implementers Guide
        - 3 clarifications: no change to existing code
            - Over/underflow, reserved ITT, format errors
        - 2 corrections: minor changes to existing code
            - Interaction between R2Ts on same connection
            - Handling data digest errors on Reject, Async messages

    13. Draft Implementers Guide
        - 2 new additions: minor changes to existing code
            - Task management affecting multiple I_T Nexi
        - New proposal now under discussion
            - Reinstating unnamed discovery sessions
            - To avoid interference with normal sessions
            - To permit independent discovery sessions based on target addresses

    14. Relatively Unused Features
        - Error recovery level 2
        - Out-of-order PDUs and/or PDU sequences
        - Multiple connections (scheduling policies?)
        - Use with IPsec (management)
        - Bidirectional commands (only 1 in SBC-2)
        - Additional header segments (AHS)
        - Markers

    15. iSCSI: RFC 3720 Document
        - Long, informal English prose
        - Ambiguous, can be misinterpreted
        - Testing is long, has many combinations
        - Need for use of formal methods for specification, verification, testing
            - Bishop et al., ACM SIGCOMM 2005, pp. 265-276

    16. iSCSI: Performance Factors
        - Workload characteristics
            - Sequential streaming vs random access
            - Read/write, large/small transfers
        - Network characteristics
            - Speed (100, 1000, 10000 Mbps)
            - Distance (LAN, MAN, WAN)
            - Error rates
            - Congestion

    17. iSCSI: Performance Metrics
        - Bandwidth utilization: high is desirable
        - CPU utilization: low is desirable
        - Latency: low is desirable
        - Transaction rate: high is desirable

    18. iSCSI: Performance
        - Numerous studies done, many more to do
        - Many, many tunable parameters at all levels
            - SCSI
            - iSCSI
            - TCP
            - Ethernet
        - Interactions/tradeoffs within/between levels
        - Dynamic parameter adjustment

    19. SCSI Initiator Parameters
        - Maximum no. of outstanding commands
            - Big enough to keep network pipeline full
        - Maximum no. of sectors per command
            - Big to allow multi-sector requests
        - Maximum no. of I/O vectors per command
            - Big to allow scatter/gather operations
        - Coalescing contiguous blocks
            - In order to reduce need for I/O vectors

    20. iSCSI: Tunable Parameters
        - PDU size: declared on initiator and target
            - Usage determined independently by sender
            - Big enough to keep pipeline full
        - Out-of-order PDUs: negotiate on/off
            - Usage determined independently by sender
            - May be useful when target sends DataIn PDUs
            - May be bad when initiator sends DataOut PDUs

    21. iSCSI: Tunable Parameters
        - Header/data digests: negotiate on/off
            - Used by both sides
            - Catches errors that get through TCP checksum
        - Error recovery level: negotiate 0, 1 or 2
            - Used by both sides
            - Higher levels give faster, smoother recovery
        - Markers: negotiate on/off and interval
            - Used by each side independently
            - Recovers PDU alignment in TCP stream

    22. iSCSI: Tunable Parameters
        - Immediate/unsolicited data: negotiate on/off and maximum
            - Usage determined by initiator on writes only
            - May reduce latency on small writes
            - May increase buffering on target (extra copy)
        - Multiple connections: negotiate maximum
            - Creation and usage determined by initiator
            - Scheduling algorithms not yet explored

    23. iSCSI: Tunable Parameters
        - Burst sizes: negotiate max
            - Usage determined independently by target
            - Big enough to keep pipeline full
        - Number of outstanding R2Ts: negotiate max
            - Usage determined independently by target
            - Big enough to keep pipeline full

    24. iSCSI: Tunable Parameters
        - Phase collapse: internal to target
            - Eliminates extra response PDU from target
        - Command window: internal to target
            - Controls load and buffer usage on target
            - Big enough to keep pipeline full

    25. iSCSI: Tunable Parameters
        - A-bit, DataAck SNACK: negotiate ERL > 0
            - Usage determined independently by target on reads only
            - Reduces buffering on target
        - Out-of-order sequences: negotiate on/off
            - Usage determined independently by target
            - Reduces latency and buffering on target

    26. TCP: Tunable Parameters
        - Maximum window sizes
            - Bigger generally better
        - Options for timestamps, window scaling, etc.
        - Delayed, selective acknowledgements
        - Nagle algorithm to coalesce small packets
            - Turn off except when streaming small PDUs
        - Dynamic packet coalescing
            - Better control than Nagle on/off
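Two of these TCP knobs are reachable from user space through standard socket options. The sketch below is an assumption about how an all-software initiator or target might apply them; `tune_socket` and the 1 MiB buffer request are hypothetical, and the operating system is free to clamp or round the buffer sizes it actually grants.

```python
import socket


def tune_socket(sock, bufsize=1 << 20, nodelay=True):
    """Disable Nagle and request larger socket buffers; return what the OS reports."""
    # TCP_NODELAY = 1 turns the Nagle algorithm off.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
    # The buffer sizes bound the TCP window the stack can offer or fill.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    return {
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    print(tune_socket(s))
```

Reading the values back matters because the granted sizes often differ from the request; system-wide limits usually have to be raised separately.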

    27. Ethernet: Tunable Parameters
        - Jumbo frames
            - Improves bandwidth utilization
            - Decreases CPU overhead
            - Not supported on all NICs, HBAs, switches
        - Driver DMA input queue length
            - Bigger to smooth out traffic bursts
        - Interrupt coalescing
            - Trades response time against CPU overhead

    28. Tradeoff Example: iSCSI CRC
        - Use of TOE without iSCSI CRC off-loaded
            - Reduces performance due to memory access
        - Use of TOE with iSCSI CRC off-loaded
            - Reduces protection due to bus crossing
        - Use of TCP copy and iSCSI CRC in software
            - Expensive, but performance better for small PDUs

    29. iSCSI CRC in Software
        - 2% reduction for PDUs less than 2 KB
        - 31% reduction for PDUs bigger than 8 KB

    30. Graph of Parameter Interaction

    31. iSCSI: Parameter Relationship
        - Let N = number of outstanding R2Ts
        - Let M = MaxBurstLength (MRDSL) in KB
        - Then at top of the knee in the graph, N x M = 64
            - The pipeline size at this latency
        - Target controls N to keep pipeline full
        - Formula needs additional factor for latency
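The relationship N x M = 64 can be turned around to estimate how many outstanding R2Ts a target would need for a given burst size. The sketch below is only an illustration: the 64 KB pipeline size is the value observed at the latency of the measured setup, the function name is hypothetical, and, as the slide notes, a general formula would need an additional factor for latency.

```python
import math

# Pipeline size in KB at the knee of the graph, valid only for the measured latency.
PIPELINE_KB = 64


def outstanding_r2ts_needed(max_burst_kb, pipeline_kb=PIPELINE_KB):
    """Smallest N such that N x M covers the pipeline, where M is the burst size in KB."""
    if max_burst_kb <= 0:
        raise ValueError("burst size must be positive")
    return max(1, math.ceil(pipeline_kb / max_burst_kb))


for m in (4, 8, 16, 32, 64):
    n = outstanding_r2ts_needed(m)
    print(f"M = {m:2d} KB -> N = {n:2d} (N x M = {n * m} KB)")
```

Smaller bursts therefore need proportionally more outstanding R2Ts to keep the same pipeline full.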

    32. Equation for Write Throughput

    33. Calculated Coefficients
        - A = 1.82 msec / R2T PDU
        - B = 0.011 msec / DataOut PDU
        - C = 115.29 msec / immediate MB
        - D = 120.79 msec / unsolicited MB
        - E = 87.72 msec / solicited MB
        - F = 0 msec

    34. Write From UNH Initiator

    35. Write From Windows Initiator

    36. Memory a Critical iSCSI Resource
        - Initiator paging to an iSCSI disk
            - VM system MUST NOT block for memory
            - Without care, standard TCP stack will block for memory (buffers and control structures)
            - Without care, iSCSI data path will block for memory
        - Target memory starvation
            - May get multiple commands at once
            - Must hold memory until receipt acknowledged
            - Acknowledgement may be delayed indefinitely
            - Target must send NopIn or set A-bit on last DataIn

    37. iSCSI: CPU Load
        - CPU utilization is not negligible
        - Biggest percent from TCP/IP, not iSCSI or SCSI
        - Standard TOE off-loading helps output
        - iSCSI HBA off-loading helps input and output
        - Software iSCSI CRC is expensive for large PDUs

    38. CPU Overhead without HBA
        - Interrupt rate
            - 1500 byte frame every 12 microsecs on 1 GE
            - 9000 byte frame every 5 microsecs on 10 GE
        - Frequent cache flushing
        - Extra copying
        - TOEs help mainly on output
            - Input requires intermediate TCP buffers or costly memory mapping

    39. iWARP
        - IETF Remote Direct Data Placement WG
        - Suite of protocols
            - RDMAP to control DDP coherently
            - DDP to segment and place data directly
            - MPA to align frames in TCP stream
            - SCTP to bypass MPA/TCP, map DDP onto IP
        - Implemented in RNIC (RDMA-aware NIC)
        - For general use, not just iSCSI/iSER

    40. iSCSI: Stack With iSER/iWARP

    41. iWARP: RNIC Concepts
        - Manages large transfers without host CPU interaction
        - Fragments large transfers into TCP segments, each with extra headers
        - Avoids copying at both ends of the wire
        - Adds end-to-end CRC checking
        - Adds markers to handle out-of-order frames

    42. iWARP: RNIC Benefits
        - Substantially reduces host overhead
        - Fewer host interrupts
            - Once per transfer, not once per frame
        - Fewer host cache flushes
        - Less use of host memory space
            - No network buffers in host memory
        - Less use of host memory bus
            - One direct transfer between wire and memory
        - Better use of network bandwidth
        - Lower network latency

    43. iWARP Details
        - Untagged buffers for control frames
            - 20-byte header plus 4-byte CRC
        - Tagged buffers for data frames
            - 16-byte header plus 4-byte CRC
        - Uses IPsec for transmission security
        - Many other security requirements for RNIC
        - Error handling philosophy: terminate the connection!

    44. iSER: iSCSI Extensions for RDMA
        - Interface between iSCSI and RDMA
        - iSER adds 12-byte header to control PDUs
        - Makes iSCSI independent of any protocol
            - RDMAP/DDP/MPA/TCP/IP
            - RDMAP/DDP/SCTP/IP
            - InfiniBand
            - Others? (Myrinet?, Quadrics?)

    45. iSER: Concepts
        - Target controls data flow
            - iSCSI read = target RDMA write
            - iSCSI write = target RDMA read
        - 4 new keys
            - Old keys for digests, markers are irrelevant
        - Handling of iSCSI PDUs
            - R2T, DataOut PDUs replaced by RDMA read
            - DataIn PDUs replaced by RDMA write
            - All other PDUs carried by RDMA send

    46. iSCSI/iSER/iWARP Error Handling
        - Guaranteed reliable, in-order delivery
        - iSER/iWARP error terminates connection!
        - All iSCSI error recovery levels possible
            - Level 1 reduced to almost nothing
            - Digest and sequence errors now impossible
            - PDU retransmission timeouts discouraged
            - SNACK must no longer be sent

    47. iSCSI: Sharing a Target Device
        - Multiple hosts easily access common target
        - Efficient block transport directly to disk
        - No notion of files, directories, data, or metadata
        - No contention detection or resolution
        - No allocation or management of blocks

    48. Object-based Storage System
        - Idea: push more intelligence onto disk unit
            - Target manages block allocation
            - Target defines objects and maps their blocks
            - Target manages object metadata
        - Enhancements to SCSI command set
        - Must rewrite file systems to use objects

    49. ANSI Project T10/1355-D
        - SCSI Object-Based Storage Device Commands
        - Final Revision 10, 30 July 2004
        - To provide efficient operation of I/O logical units that manage the allocation, placement and accessing of variable-size data-storage containers called objects.

    50. iSCSI: Object-store
        - All OSD commands are bi-directional
        - 200-byte CDB requires use of iSCSI AHS
        - Reading a PDU header requires 2 steps:
            - Read 48-byte Basic header, extract AHSLength
            - Read following AHSLength bytes
        - Header digest (CRC) is problematic
            - Use AHSLength value to read AHS headers and CRC
            - Use CRC to check complete header after read done
            - If AHSLength has an error, input may block!
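The two-step header read can be sketched as follows. The field offsets are taken from the RFC 3720 Basic Header Segment layout (TotalAHSLength in byte 4, counted in 4-byte words; DataSegmentLength in bytes 5 to 7); the function names and the in-memory test stream are illustrative assumptions, not code from the presentation.

```python
import io

BHS_LEN = 48  # Basic Header Segment size in bytes


def read_exact(stream, n):
    """Read exactly n bytes; on a real socket this is where a bad length would block."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError(f"wanted {n} bytes, got {len(buf)}")
        buf += chunk
    return buf


def read_pdu_header(stream, header_digest=False):
    # Step 1: read the fixed 48-byte header and extract the lengths.
    bhs = read_exact(stream, BHS_LEN)
    ahs_len = bhs[4] * 4  # TotalAHSLength is in 4-byte words
    data_len = int.from_bytes(bhs[5:8], "big")
    # Step 2: read the AHS, plus the 4-byte header digest if one was negotiated.
    # The digest can only be checked now, after trusting the unverified ahs_len.
    rest = read_exact(stream, ahs_len + (4 if header_digest else 0))
    return bhs, rest[:ahs_len], rest[ahs_len:], data_len


# Hypothetical PDU header: 8 bytes of AHS, a 512-byte data segment, a dummy digest.
bhs = bytearray(BHS_LEN)
bhs[4] = 2
bhs[5:8] = (512).to_bytes(3, "big")
stream = io.BytesIO(bytes(bhs) + b"A" * 8 + b"\x01\x02\x03\x04")
_, ahs, digest, data_len = read_pdu_header(stream, header_digest=True)
print(len(ahs), digest.hex(), data_len)
```

With an in-memory stream a corrupted length simply raises `EOFError`; on a live TCP connection the same read would wait for bytes that never arrive, which is the blocking hazard the slide warns about.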

    51. Tracking SCSI Standards
        - iSCSI version 0 (RFC 3720) based on:
            - SAM-2: final revision 24, 11-September-2002
            - SBC: final revision 8, 13-November-1997
        - Upgrade to SAM-3 (project T10/1561-D)?
            - Final revision 14, 21-September-2004
        - SAM-4 (project T10/1683-D)
            - Current draft 3, 20-September-2005
        - Upgrade to SBC-2 (project T10/1417-D)?
            - Final revision 16, 13-November-2004

    52. SCSI: SAM-3 Changes
        - Task management command changes
        - Async event notification removed
        - Contingent allegiance removed
        - Untagged tasks removed
        - Task priority added

    53. iSCSI: Research Areas
        - Dynamic parameter adjustment
            - Respond to changes in application load
            - Respond to changes in network conditions
        - Using parameters between levels
            - E.g., let iSCSI use TCP's RTT to keep pipe full
        - Negotiate limits but operate at other values
            - Target controls burst sizes, outstanding R2Ts
            - Initiator controls connections, unsolicited data

    54. iSCSI: Research Areas
        - Scheduling multiple connections in session
            - What criteria to use?
            - Let different connections carry different info?
        - Reinstating sessions in order to renegotiate
            - Only limited differences between connections
        - New file systems
        - New caching schemes

    55. iSCSI: Novel Uses
        - Take advantage of extensibility features
            - Use AHS to carry extra information with commands from Initiator to Target
            - Use Async messages to carry extra information from Target to Initiator
            - Use new text keys to exchange metadata
            - Use multiple connections to carry different information
