1. iSCSI: Past, Present, Future
Robert Russell
Computer Science Department and IOL
University of New Hampshire
2. Very Brief History
Late 1997: idea of storage over IP (Julian Satran, IBM Research)
Late 1999: IBM and Cisco start joint work on proposal for standard
Early 2000: IETF creates IP storage working group
November 2000: IETF draft 0 posted
3. Rest of Brief History
Jan 2001: SNIA creates IP storage forum
July 2001: first UNH IOL iSCSI Plugfest
28 companies attended
Tested drafts 0 and 6
Feb 2003: IETF approves draft 20
June 2003: Microsoft Server with iSCSI
April 2004: IETF publishes RFC 3720
4. Today: December 2005
iSCSI products offered by all storage and platform vendors
Many small vendors in the market
iSCSI now well accepted at low and middle performance ranges
1 Gig wire-speed HBAs available
10 Gig iSCSI products starting to appear
5. Other SAN Technologies
Enterprise data centers still based on Fibre Channel (1 Gig, 2 Gig, soon 4 Gig)
Renewed interest in iFCP and FCIP
Will Fibre Channel equipment prices be lower in the near future??
InfiniBand
6. SCSI Transport Protocol for TCP
Based on widely used, off-the-shelf technology
SCSI, TCP, IP, IPsec, Ethernet
Familiar, already installed infrastructure
Commodity components, inexpensive
Permits all-software implementations
Encourages experimentation, early feedback
Many freely distributed implementations
7. iSCSI Design Principles
Target controls data transfer
To enable fair sharing of resources
To manage limited memory resources
To improve disk performance
Messages (PDUs) in both directions are sequenced and acknowledged
In addition to TCP sequencing and acknowledging
To maintain SCSI command ordering
To detect errors
To control data flow
8. iSCSI: Text Negotiation
Stylistic departure from FC, TCP, etc.
Key=value
Used for multiple purposes:
Login, authentication, discovery, renegotiation
Easy to use, understand, debug
Slower to process, bigger messages
Used mostly in Login; sessions are long-lived
Linux initiator now split between kernel/user
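A minimal sketch of the key=value format, assuming the RFC 3720 convention that each pair in a Login or Text data segment is terminated by a NUL byte. The helper names are hypothetical and the offered values are only illustrative.

```python
def encode_text_keys(pairs):
    """Encode a dict of negotiation keys as NUL-terminated key=value text."""
    return b"".join(f"{k}={v}".encode("utf-8") + b"\x00" for k, v in pairs.items())


def decode_text_keys(segment):
    """Decode a data segment of NUL-terminated key=value pairs into a dict."""
    pairs = {}
    for item in segment.split(b"\x00"):
        if not item:
            continue  # skip the empty tail after the final NUL (and any padding)
        key, sep, value = item.decode("utf-8").partition("=")
        if not sep:
            raise ValueError(f"malformed text key: {item!r}")
        pairs[key] = value
    return pairs


# Illustrative offer; the values are examples, not recommendations.
offer = {"HeaderDigest": "CRC32C,None", "MaxRecvDataSegmentLength": "8192"}
segment = encode_text_keys(offer)
print(segment)
print(decode_text_keys(segment))
```

The readable form makes traces easy to debug, at the cost of the string handling that a fixed binary layout would avoid.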
9. iSCSI: Designed-in Extensibility
Text keys and values
Can carry info in both directions - slow
Additional header segments (AHS)
Can carry info from initiator to target - fast
Asynchronous messages
Can carry info from target to initiator - fast
10. iSCSI: Error Handling
End-to-end CRCs (digests) useful because TCP has weak checksum
Stone and Partridge, ACM SIGCOMM 2000, pp. 309-319
TCP checksum observed to catch an error in between 1 in 1100 and 1 in 32000 segments
An error gets through the TCP checksum to the application in between 1 in 16 million and 1 in 10 billion segments
Markers embedded in stream: little used
3 levels of recovery to deal with CRC errors and connection loss
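RFC 3720 specifies CRC32C (the Castagnoli polynomial) for header and data digests. The bitwise routine below is a minimal, unoptimized sketch of that checksum; the function name and the sample payload are hypothetical, and production code would use a table-driven or hardware-assisted version.

```python
def crc32c(data, crc=0):
    """Bitwise CRC32C using the reflected Castagnoli polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial when a 1 falls off.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard check value for CRC32C over the ASCII digits 1 to 9.
assert crc32c(b"123456789") == 0xE3069283

payload = bytearray(b"hypothetical PDU payload")
good = crc32c(payload)
payload[3] ^= 0x01  # flip a single bit, as a corrupted segment might
print(hex(good), hex(crc32c(payload)), good != crc32c(payload))
```

A one-bit flip always changes the digest, so the receiver can detect this kind of damage even when the TCP checksum happens to miss it.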
11. iSCSI: Error Recovery
Complex, many choices of action to take
Poorly tested, may hide bugs
Why so complex?
SCSI error recovery slow, crude
Some applications require absolute accuracy
Compromise after long discussion
Philosophy repudiated by iSER/iWARP
12. Draft Implementers Guide
3 clarifications: no change to existing code
Over/underflow, reserved ITT, format errors
2 corrections: minor changes to existing code
Interaction between R2Ts on same connection
Handling data digest errors on Reject, Async messages
13. Draft Implementers Guide
2 new additions: minor changes to existing code
Task management affecting multiple I_T Nexi
New proposal now under discussion
Reinstating unnamed discovery sessions
To avoid interference with normal sessions
To permit independent discovery sessions based on target addresses
14. Relatively Unused Features
Error recovery level 2
Out of order PDUs and/or PDU sequences
Multiple connections (scheduling policies?)
Use with IPsec (management)
Bidirectional commands (only 1 in SBC-2)
Additional header segments (AHS)
Markers
15. iSCSI: RFC 3720 Document
Long, informal English prose
Ambiguous, can be misinterpreted
Testing is long, has many combinations
Need for use of formal methods for specification, verification, testing
Bishop et al., ACM SIGCOMM 2005, pp. 265-276
16. iSCSI: Performance Factors
Workload characteristics
Sequential streaming vs random access
Read/write, large/small transfers
Network characteristics
Speed (100, 1000, 10000 Mbps)
Distance (LAN, MAN, WAN)
Error rates
Congestion
17. iSCSI: Performance Metrics
Bandwidth utilization: high is desirable
CPU utilization: low is desirable
Latency: low is desirable
Transaction rate: high is desirable
18. iSCSI: Performance
Numerous studies done, many more to do
Many, many tunable parameters at all levels
SCSI
iSCSI
TCP
Ethernet
Interactions/tradeoffs within/between levels
Dynamic parameter adjustment
19. SCSI Initiator Parameters
Maximum no. of outstanding commands
Big enough to keep network pipeline full
Maximum no. of sectors per command
Big to allow multi-sector requests
Maximum no. of I/O vectors per command
Big to allow scatter/gather operations
Coalescing contiguous blocks
In order to reduce need for I/O vectors
20. iSCSI: Tunable Parameters
PDU size: declared on initiator and target
Usage determined independently by sender
Big enough to keep pipeline full
Out-of-order PDUs: negotiate on/off
Usage determined independently by sender
May be useful when target sends DataIn PDUs
May be bad when initiator sends DataOut PDUs
21. iSCSI: Tunable Parameters
Header/data digests: negotiate on/off
Used by both sides
Catches errors that get through TCP checksum
Error recovery level: negotiate 0, 1 or 2
Used by both sides
Higher levels give faster, smoother recovery
Markers: negotiate on/off and interval
Used by each side independently
Recovers PDU alignment in TCP stream
22. iSCSI: Tunable Parameters
Immediate/unsolicited data: negotiate on/off and maximum
Usage determined by initiator on writes only
May reduce latency on small writes
May increase buffering on target (extra copy)
Multiple connections: negotiate maximum
Creation and usage determined by initiator
Scheduling algorithms not yet explored
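A rough sketch of how the negotiated limits divide one write into immediate, unsolicited, and solicited data, following the RFC 3720 keys ImmediateData, InitialR2T, FirstBurstLength, MaxBurstLength and MaxRecvDataSegmentLength. The function name and the sample values are hypothetical, and the sketch assumes the initiator sends as much unsolicited data as the limits allow.

```python
def split_write(length, immediate_data, initial_r2t,
                first_burst, max_burst, mrdsl):
    """Return (immediate, unsolicited, solicited_bursts) byte counts for one write."""
    # Immediate data rides in the command PDU, so it is capped by one PDU
    # (the target's MaxRecvDataSegmentLength) and by FirstBurstLength.
    immediate = min(length, first_burst, mrdsl) if immediate_data else 0
    if initial_r2t:
        unsolicited = 0  # target must send an R2T before any Data-Out PDU
    else:
        # FirstBurstLength covers immediate plus unsolicited Data-Out data.
        unsolicited = min(length, first_burst) - immediate
    remaining = length - immediate - unsolicited
    bursts = []
    while remaining > 0:
        burst = min(remaining, max_burst)  # each R2T solicits at most MaxBurstLength
        bursts.append(burst)
        remaining -= burst
    return immediate, unsolicited, bursts


# Assumed values: 256 KiB write, 64 KiB first burst, 128 KiB max burst, 8 KiB PDUs.
print(split_write(262144, True, False, 65536, 131072, 8192))
# A small write that fits entirely in the command PDU needs no R2T at all.
print(split_write(4096, True, True, 65536, 131072, 8192))
```

The second call shows the latency benefit on small writes: the data arrives with the command, so no R2T round trip is needed.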
23. iSCSI: Tunable Parameters
Burst sizes: negotiate max
Usage determined independently by target
Big enough to keep pipeline full
Number of outstanding R2Ts: negotiate max
Usage determined independently by target
Big enough to keep pipeline full
24. iSCSI: Tunable Parameters
Phase collapse: internal to target
Eliminates extra response PDU from target
Command window: internal to target
Controls load and buffer usage on target
Big enough to keep pipeline full
25. iSCSI: Tunable Parameters
A-bit, DataAck SNACK: negotiate ERL > 0
Usage determined independently by target on reads only
Reduces buffering on target
Out-of-order Sequences: negotiate on/off
Usage determined independently by target
Reduces latency and buffering on target
26. TCP: Tunable Parameters
Maximum window sizes
Bigger generally better
Options for timestamps, window scaling, etc.
Delayed, selective acknowledgements
Nagle algorithm to coalesce small packets
Turn off except when streaming small PDUs
Dynamic packet coalescing
Better control than Nagle on/off
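Two of these TCP knobs can be set per socket through the standard sockets API. The sketch below turns off the Nagle algorithm and requests larger buffers; the helper name and the 1 MiB figure are assumptions for illustration, and the operating system may clamp or round the buffer sizes it actually grants.

```python
import socket


def tuned_socket(buffer_bytes=1 << 20):
    """Create a TCP socket with Nagle disabled and larger buffers requested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 disables Nagle coalescing of small segments.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Buffer sizes bound the usable window; the kernel may adjust the request.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


sock = tuned_socket()
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("Nagle disabled:", nodelay != 0, "send buffer granted:", sndbuf)
sock.close()
```

Reading the options back is worthwhile because the granted values, not the requested ones, determine behavior.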
27. Ethernet: Tunable Parameters
Jumbo frames
Improves bandwidth utilization
Decreases CPU overhead
Not supported on all NICs, HBAs, switches
Driver DMA input queue length
Bigger to smooth out traffic bursts
Interrupt coalescing
Trades response time against CPU overhead
28. Tradeoff Example: iSCSI CRC
Use of TOE without iSCSI CRC off-loaded
Reduces performance due to memory access
Use of TOE with iSCSI CRC off-loaded
Reduces protection due to bus crossing
Use of TCP copy and iSCSI CRC in software
Expensive, but performance better for small PDUs
29. iSCSI CRC in Software
2% reduction for PDUs less than 2 KB
31% reduction for PDUs bigger than 8 KB
30. Graph of Parameter Interaction
31. iSCSI: Parameter Relationship
Let N = number of outstanding R2Ts
Let M = MaxBurstLength (MRDSL) in KB
Then at top of the knee in the graph,
N x M = 64
The pipeline size at this latency
Target controls N to keep pipeline full
Formula needs additional factor for latency
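The relationship N x M = 64 can be turned around to ask how many outstanding R2Ts a target needs for a given burst size. The sketch below does that, and also computes a bandwidth-delay product as one plausible form of the missing latency factor. That second function is an assumption, not something taken from the slides, and all names and sample numbers are hypothetical.

```python
import math


def r2ts_needed(pipeline_kb, burst_kb):
    """Smallest N such that N * burst_kb covers the pipeline size."""
    return math.ceil(pipeline_kb / burst_kb)


def pipeline_kb(bandwidth_mbps, rtt_ms):
    """Bandwidth-delay product in KB (1 KB = 1024 bytes); an assumed model."""
    bytes_per_sec = bandwidth_mbps * 1e6 / 8
    return bytes_per_sec * (rtt_ms / 1e3) / 1024


# With the 64 KB pipeline from the slide, N falls as the burst size M grows.
for burst in (8, 16, 32, 64):
    print(f"M = {burst:2d} KB -> N = {r2ts_needed(64, burst)}")

# Assumed link: 1000 Mbps with a 0.5 ms round trip.
print(f"pipeline = {pipeline_kb(1000, 0.5):.1f} KB")
```

Under this assumed model a longer round trip enlarges the pipeline, so the target would need more outstanding R2Ts or larger bursts to keep it full.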
32. Equation for Write Throughput
33. Calculated Coefficients
A = 1.82 msec / R2T PDU
B = 0.011 msec / DataOut PDU
C = 115.29 msec / immediate MB
D = 120.79 msec / unsolicited MB
E = 87.72 msec / solicited MB
F = 0 msec
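The write-throughput equation itself did not survive in this transcript. Judging only from the units of the coefficients, one plausible reading is a linear cost model in which each coefficient multiplies the matching count or volume and F is a constant term. The sketch below assumes that form; it is a guess, not the author's equation, and the function names and workload numbers are hypothetical.

```python
# Coefficients from the slide, in msec per unit.
COEFF = {"A": 1.82, "B": 0.011, "C": 115.29, "D": 120.79, "E": 87.72, "F": 0.0}


def write_time_msec(r2t_pdus, dataout_pdus, immediate_mb,
                    unsolicited_mb, solicited_mb, c=COEFF):
    """Assumed linear model: total time for a write workload in msec."""
    return (c["A"] * r2t_pdus
            + c["B"] * dataout_pdus
            + c["C"] * immediate_mb
            + c["D"] * unsolicited_mb
            + c["E"] * solicited_mb
            + c["F"])


def throughput_mb_per_sec(total_mb, time_msec):
    """Convert a data volume and elapsed time into MB per second."""
    return total_mb / (time_msec / 1000.0)


# Hypothetical workload: 1 MB, all solicited, 16 R2Ts, 128 DataOut PDUs.
t = write_time_msec(r2t_pdus=16, dataout_pdus=128,
                    immediate_mb=0, unsolicited_mb=0, solicited_mb=1)
print(f"{t:.3f} msec, {throughput_mb_per_sec(1, t):.2f} MB/s")
```

If this reading is right, the per-R2T cost A dominates when bursts are small, which would fit the advice to keep bursts big enough to fill the pipeline.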
34. Write From UNH Initiator
35. Write From Windows Initiator
36. Memory: a Critical iSCSI Resource
Initiator paging to an iSCSI disk
VM system MUST NOT block for memory
Without care, standard TCP stack will block for memory (buffers and control structures)
Without care, iSCSI data path will block for memory
Target memory starvation
May get multiple commands at once
Must hold memory until receipt acknowledged
Acknowledgement may be delayed indefinitely
Target must send NopIn or set A-bit on last DataIn
37. iSCSI: CPU Load
CPU utilization is not negligible
Biggest percent from TCP/IP, not iSCSI or SCSI
Standard TOE off-loading helps output
iSCSI HBA off-loading helps input and output
Software iSCSI CRC is expensive for large PDUs
38. CPU Overhead without HBA
Interrupt rate
1500 byte frame every 12 microsecs on 1 GE
9000 byte frame every 7 microsecs on 10 GE
Frequent cache flushing
Extra copying
TOEs help mainly on output
Input requires intermediate TCP buffers
or costly memory mapping
39. iWARP
IETF Remote Direct Data Placement WG
Suite of protocols
RDMAP to control DDP coherently
DDP to segment and place data directly
MPA to align frames in TCP stream
SCTP to bypass MPA/TCP, map DDP onto IP
Implemented in RNIC (RDMA-aware NIC)
For general use, not just iSCSI/iSER
40. iSCSI: Stack With iSER/iWARP
41. iWARP: RNIC Concepts
Manages large transfers without host CPU interaction
Fragments large transfers into TCP segments, each with extra headers
Avoids copying at both ends of the wire
Adds end-to-end CRC checking
Adds markers to handle out-of-order frames
42. iWARP: RNIC Benefits
Substantially reduces host overhead
Fewer host interrupts
Once per transfer, not once per frame
Fewer host cache flushes
Less use of host memory space
No network buffers in host memory
Less use of host memory bus
One direct transfer between wire and memory
Better use of network bandwidth
Lower network latency
43. iWARP Details
Untagged buffers for control frames
20-byte header plus 4-byte CRC
Tagged buffers for data frames
16-byte header plus 4-byte CRC
Uses IPsec for transmission security
Many other security requirements for RNIC
Error handling philosophy: terminate the connection!
44. iSER: iSCSI Extensions for RDMA
Interface between iSCSI and RDMA
iSER adds 12-byte header to control PDUs
Makes iSCSI independent of any protocol
RDMAP/DDP/MPA/TCP/IP
RDMAP/DDP/SCTP/IP
InfiniBand
Others? (Myrinet?, Quadrics?)
45. iSER: Concepts
Target controls data flow
iSCSI read = target RDMA write
iSCSI write = target RDMA read
4 new keys
Old keys for digests, markers are irrelevant
Handling of iSCSI PDUs
R2T, DataOut PDUs replaced by RDMA read
DataIn PDUs replaced by RDMA write
All other PDUs carried by RDMA send
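The PDU handling above amounts to a small lookup table. The sketch below encodes the slide's simplified mapping; the string labels and the function name are hypothetical and are not identifiers from any iSER implementation.

```python
# Data-moving PDUs are replaced by RDMA operations issued by the target.
ISER_TRANSPORT = {
    "R2T": "RDMA Read",       # target pulls write data from the initiator
    "DataOut": "RDMA Read",
    "DataIn": "RDMA Write",   # target pushes read data to the initiator
}


def rdma_operation(pdu_type):
    """Return the RDMA operation that carries or replaces an iSCSI PDU type."""
    # Everything not listed above travels as an ordinary RDMA Send message.
    return ISER_TRANSPORT.get(pdu_type, "RDMA Send")


for pdu in ("SCSI Command", "R2T", "DataOut", "DataIn", "SCSI Response"):
    print(f"{pdu:13s} -> {rdma_operation(pdu)}")
```

Both data directions are driven from the target side, which is how iSER keeps the "target controls data flow" principle.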
46. iSCSI/iSER/iWARP Error Handling
Guaranteed reliable, in-order delivery
iSER/iWARP error terminates connection!
All iSCSI error recovery levels possible
Level 1 reduced to almost nothing
Digest and sequence errors now impossible
PDU retransmission timeouts discouraged
SNACK must no longer be sent
47. iSCSI: Sharing a Target Device
Multiple hosts easily access common target
Efficient block transport directly to disk
No notion of files, directories, data, or metadata
No contention detection or resolution
No allocation or management of blocks
48. Object-based Storage System
Idea: push more intelligence onto disk unit
Target manages block allocation
Target defines objects and maps their blocks
Target manages object metadata
Enhancements to SCSI command set
Must rewrite file systems to use objects
49. ANSI Project T10/1355-D
SCSI Object-Based Storage Device Commands
Final Revision 10, 30 July 2004
To provide efficient operation of I/O logical units that manage the allocation, placement and accessing of variable-size data-storage containers called objects.
50. iSCSI: Object-store
All OSD commands are bi-directional
200-byte CDB requires use of iSCSI AHS
Reading a PDU header requires 2 steps:
Read 48-byte Basic header, extract AHSLength
Read following AHSLength bytes
Header digest (CRC) is problematic
Use AHSLength value to read AHS headers and CRC
Use CRC to check complete header after read done
If AHSLength has an error, input may block!
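The two-step header read can be sketched as follows, assuming the RFC 3720 layout in which byte 4 of the 48-byte basic header holds TotalAHSLength counted in four-byte words. The function names are hypothetical, and an in-memory stream stands in for the TCP connection, so a short read raises an error here where a real socket would simply wait.

```python
import io

BHS_LEN = 48  # basic header segment size in bytes


def read_exact(stream, n):
    """Read exactly n bytes or raise; a blocking socket would stall instead."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def read_pdu_header(stream, header_digest=False):
    """Return (bhs, ahs, digest_bytes) for one PDU header."""
    bhs = read_exact(stream, BHS_LEN)      # step 1: fixed-size basic header
    ahs_len = bhs[4] * 4                   # TotalAHSLength is in 4-byte words
    ahs = read_exact(stream, ahs_len)      # step 2: variable-size AHS
    # The digest can only be checked after trusting the unverified length.
    digest = read_exact(stream, 4) if header_digest else None
    return bhs, ahs, digest


bhs = bytearray(BHS_LEN)
bhs[4] = 2  # two words, so 8 bytes of AHS follow
wire = bytes(bhs) + b"A" * 8 + b"\x01\x02\x03\x04"
print(read_pdu_header(io.BytesIO(wire), header_digest=True)[1:])

bhs[4] = 255  # corrupted length: the reader now expects 1020 bytes of AHS
try:
    read_pdu_header(io.BytesIO(bytes(bhs) + b"A" * 8), header_digest=True)
except EOFError as err:
    print("corrupted AHSLength:", err)
```

The second call shows the problem: the length field must be trusted before the CRC that protects it can be read, so a corrupted length leaves the receiver waiting for bytes that never arrive.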
51. Tracking SCSI Standards
iSCSI version 0 (RFC 3720) based on:
SAM-2 Final revision 24, 11-September-2002
SBC Final revision 8, 13-November-1997
Upgrade to SAM-3 project T10/1561-D?
Final revision 14, 21-September-2004
SAM-4 project T10/1683-D
Current draft 3, 20-September-2005
Upgrade to SBC-2 project T10/1417-D?
Final revision 16, 13-November-2004
52. SCSI: SAM-3 Changes
Task management command changes
Async event notification removed
Contingent allegiance removed
Untagged tasks removed
Task priority added
53. iSCSI: Research Areas
Dynamic parameter adjustment
Respond to changes in application load
Respond to changes in network conditions
Using parameters between levels
E.g., let iSCSI use TCP's RTT to keep pipe full
Negotiate limits but operate at other values
Target controls burst sizes, outstanding R2Ts
Initiator controls connections, unsolicited data
54. iSCSI: Research Areas
Scheduling multiple connections in session
What criteria to use?
Let different connections carry different info?
Reinstating sessions in order to renegotiate
Only limited differences between connections
New file systems
New caching schemes
55. iSCSI: Novel Uses
Take advantage of extensibility features
Use AHS to carry extra information with commands from Initiator to Target
Use Async messages to carry extra information from Target to Initiator
Use new text keys to exchange metadata
Use multiple connections to carry different information