File Transfer in Grids. Tiziana.Ferrari@cnaf.infn.it, INFN – CNAF. Master's Degree Course in Computer Science (Corso di Laurea specialistica in Informatica), Academic Year 2005/2006
Outline • PART I: FTP • PART II: GridFTP • References
FTP and GridFTP [Figure: the layered Grid architecture (Fabric, Connectivity, Resource, Collective, Application) side by side with the Internet protocol architecture (Link, Internet, Transport, Application), with the File Transfer Protocol marked in the layering.]
Network Enabled Services: the Protocol Stack • All services require protocols • Not all protocols are used to provide services in every context (e.g. IP, UDP, TLS, etc.) • Examples: FTP and Web servers [Figure: protocol stacks of an FTP server and a Web server, built from the FTP, Telnet, HTTP, TLS, TCP and IP protocols.]
Objectives of the Protocol • (From RFC 959, Oct 1985) The objectives of FTP are: • to promote sharing of files (computer programs and/or data), • to encourage indirect or implicit (via programs) use of remote computers, • to shield a user from variations in file storage systems among hosts (FTP, though usable directly by a user, is designed mainly for use by programs and consequently variations can be operated through FTP by programs), • to transfer data reliably and efficiently.
History • FTP has had a long evolution over the years; the main steps are: • 1971 (RFC 114): the first proposed file transfer mechanisms were developed for implementation on hosts at M.I.T. • 1972 (RFC 354): the File Transfer Protocol was now defined as a protocol for file transfer between HOSTs on the ARPANET, with the primary function of FTP defined as: • transferring files efficiently and reliably among hosts • allowing the convenient use of remote file storage capabilities. • 1973 (RFC 542): new “official” specification • 1980 (RFC 765): specification of FTP for use on TCP • 1985 (RFC 959): File Transfer Protocol, main specification • 2003: GridFTP: Protocol Extensions to FTP for the Grid • 2004: GridFTP v2 Protocol Description
Terminology • FTP commands and replies • Commands comprise the control information flowing from the user-FTP to the server-FTP process, e.g.: • CDUP - Change to Parent Directory / RMD - Remove Directory / MKD - Make Directory / PWD - Print Working Directory / and many others • Reply: an acknowledgment (positive or negative) sent from server to user via the control connection in response to FTP commands. The general form of a reply is a completion code (including error codes) followed by a text string. The codes are for use by programs and the text is usually intended for human users. • File: • an ordered set of computer data (including programs), of arbitrary length, uniquely identified by a pathname. • Mode: • The mode in which data is to be transferred via the data connection. The mode defines the data format during transfer, including EOR (End Of Record) and EOF (End Of File).
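The reply format described above (a completion code followed by a text string) can be illustrated with a minimal Python sketch. It handles single-line replies only; the function names and the sample reply texts are illustrative assumptions, while the meaning of the first digit follows RFC 959.

```python
def parse_reply(line: str) -> tuple[int, str]:
    """Split a single-line FTP reply into (completion code, text)."""
    line = line.rstrip("\r\n")
    code, text = line[:3], line[4:]
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an FTP reply: {line!r}")
    return int(code), text


def is_positive(code: int) -> bool:
    # RFC 959: first digit 1, 2 or 3 is a positive reply,
    # first digit 4 or 5 is a negative reply.
    return code // 100 in (1, 2, 3)


# Hypothetical replies, for illustration only.
ok = parse_reply("226 Transfer complete\r\n")
err = parse_reply("550 No such file\r\n")
```

The numeric code is what a program branches on; the text is passed on to the human user unchanged.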
Terminology: PI and DTP • FTP consists of: • a protocol interpreter (PI) and • a data transfer process (DTP). • Protocol interpreter (PI): • the user and server sides of the protocol have distinct roles implemented in a user-PI and a server-PI • Data Transfer Process (DTP): • 1. opens and 2. manages the data connection. It can be either passive or active: • active: the data transfer process initiates a connection on the data port • passive: the data transfer process listens for the initiation of a connection. • DTP: • It sets up parameters for transfer and storage, • It transfers data on command from its PI.
Terminology: connections • Control connection: • the communication path between user and server for the exchange of commands and replies. This connection follows the Telnet Protocol. • Data connection: • a full-duplex connection (it can be used in both directions, also simultaneously) over which data is transferred, in a specified mode and type. • The data transferred may be a part of a file, an entire file or a number of files. • The path may be between: • server and user, • server and server • It does not need to exist all of the time. • The data port need not be in the same host that initiates the FTP commands via the control connection, but the user or the user-FTP process must ensure a "listen" on the specified data port • Ports: both the user-DTP and the server-DTP have a default data port. • User-DTP default data port = control connection port; • Server-DTP default data port = control connection port - 1. • Non-default data ports can be negotiated
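As a sketch of how a non-default data port is communicated: in RFC 959 the PORT command carries the host address and port as six comma-separated decimal byte values (h1,h2,h3,h4,p1,p2). The helper name and the example address below are assumptions made for illustration.

```python
def port_argument(host: str, port: int) -> str:
    """Encode an IPv4 address and TCP port as the argument of PORT."""
    octets = host.split(".")
    if len(octets) != 4 or not 0 < port < 65536:
        raise ValueError("expected a dotted IPv4 address and a 16-bit port")
    # The port is sent as its high-order byte followed by its low-order byte.
    return ",".join(octets + [str(port // 256), str(port % 256)])


# 192.0.2.10 is a documentation address, used here only as an example.
arg = port_argument("192.0.2.10", 5001)
```

The user-PI would send `PORT` followed by this argument on the control connection, after making sure that something is listening on that port.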
Data Representation • Data is transferred from a storage device in the sending host to a storage device in the receiving host. Often it is necessary to perform certain transformations on the data because data storage representations in the two systems are different. • Network Virtual File System: a concept which defines a standard network file system with standard commands and pathname conventions. • For example, the Network Virtual Terminal (NVT) ASCII characters used in the Telnet protocol have different storage representations on different systems, such as: • five 7-bit ASCII characters, left-justified in a 36-bit word • four 9-bit characters in a 36-bit word. • It is desirable to convert characters into the standard NVT-ASCII representation when transmitting text between dissimilar systems. The sending and receiving sites would have to perform the necessary transformations between the standard representation and their internal representations. • There are two byte sizes of interest in FTP: • the logical byte size of the file: the byte size in which data is to be stored, • the transfer byte size used for the transmission of the data. The transfer byte size is always 8 bits and is not necessarily the byte size in which data is to be stored in a system.
Structures • FILE STRUCTURE • File structure is the default to be assumed if the STRUcture command has not been used. In file-structure there is no internal structure and the file is considered to be a continuous sequence of data bytes. • RECORD STRUCTURE • Record structures must be accepted for "text" files (i.e., files with TYPE ASCII or EBCDIC) by all FTP implementations. In record-structure the file is made up of sequential records. • PAGE STRUCTURE • To transmit files that are discontinuous, made of independent parts (pages), FTP defines a page structure. Files of this type are sometimes known as "random access files". In these files there is sometimes other information associated with the file as a whole (e.g., via a file descriptor), or with a section of the file (e.g., page access controls), or both. In FTP, the sections of the file are called pages. To provide for various page sizes and associated information, each page is sent with a page header.
Terminology: connections (1/2) • The control connection is used for the transfer of commands, which describe the functions to be performed, and the replies to these commands. Data transfer commands include: MODE, STRU, TYPE, etc. • Parameters: • data port • transfer mode: the MODE command specifies how the bits of the data are to be transmitted: • STREAM: data is transmitted as a stream of bytes; end of file is indicated by the sending host closing the data connection. • BLOCK: the file is transmitted as a series of data blocks, each preceded by one or more header bytes. The header bytes contain: • count field (2 bytes): the length of the data block in bytes, thus marking the beginning of the next data block • descriptor code (1 byte): last block in the file (EOF), last block in the record (EOR), restart marker, suspect data, etc. • COMPRESSED
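The BLOCK mode header described above can be sketched in Python as follows. The descriptor values (128, 64, 32, 16) and the field order (descriptor first, then count) are taken from RFC 959; the big-endian byte order of the count field and the function names are assumptions of this sketch.

```python
import struct

# Descriptor code bits as listed in RFC 959.
EOR, EOF, SUSPECT, RESTART = 128, 64, 32, 16


def make_block(data: bytes, descriptor: int = 0) -> bytes:
    """Prefix data with a 3-byte header: 1 descriptor byte, 2 count bytes."""
    if len(data) > 0xFFFF:
        raise ValueError("a block carries at most 65535 data bytes")
    return struct.pack(">BH", descriptor, len(data)) + data


def read_block(buf: bytes) -> tuple[int, bytes, bytes]:
    """Return (descriptor, data, rest of the buffer after this block)."""
    descriptor, count = struct.unpack(">BH", buf[:3])
    return descriptor, buf[3:3 + count], buf[3 + count:]


stream = make_block(b"abc") + make_block(b"de", EOF)
first = read_block(stream)
second = read_block(first[2])
```

Because the count field gives the block length, the receiver knows where the next header starts without the sender having to close the connection, which is what makes restart markers possible in this mode.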
Terminology: connections (2/2) • representation type: • ASCII - default, • EBCDIC, • IMAGE – binary files, • LOCAL: logical bytes of the size specified by the obligatory second parameter “Byte size”, etc. • structure (FILE/RECORD/PAGE) • file system operation: store, retrieve, append, delete, etc. • TYPE and STRUcture commands: used to define the way in which the data are to be represented. • TYPE: the data representation type used for 1. data transfer and 2. storage. Type implies certain transformations between the time of data storage and data transfer.
Terminology: Server • Server-FTP process: • A process or set of processes which perform the function of file transfer in cooperation with a user-FTP process and, possibly, another server. • The functions consist of a protocol interpreter (PI) and a data transfer process (DTP). • Server-PI: • "listens" on Port L for a connection from a user-PI • establishes a control communication connection, • receives standard FTP commands from the user-PI, sends replies, • and governs the server-DTP. • Server-DTP: • The data transfer process, in its normal "active" state, establishes the data connection with the "listening" data port. • It sets up parameters for transfer and storage, and transfers data on command from its PI. • The DTP can be placed in a "passive" state to listen for, rather than initiate, a connection on the data port. [Figure: message sequence between Server-FTP and User-FTP: the Server-PI listens and the Client-PI connects; the control connection is opened; the Client-PI sends FTP commands and the Server-PI replies; the Client-DTP listens and the active Server-DTP opens the data connection; data is transferred between Server-DTP and Client-DTP.]
Terminology: User • user-FTP process: a set of functions which together perform the function of file transfer in cooperation with one or more server-FTP processes: • a protocol interpreter, • a data transfer process • a user interface: allows a local language to be used in the command-reply dialogue with the user. • user-PI: the user protocol interpreter • initiates the control connection from its port U to the server-FTP process, • initiates FTP commands, • governs the user-DTP if that process is part of the file transfer • user-DTP: "listens" on the data port for a connection from a server-FTP process. If two servers are transferring data between them, the user-DTP is inactive.
The FTP Model 1: User-to-Server A user wishes to transfer files between two hosts, of which one is a local host. [Figure: the User drives the User-FTP through the User Interface; (1) FTP commands and replies flow between the User Protocol Interpreter and the Server Protocol Interpreter; (2) the data connection runs between the Server Data Transfer Process and the User Data Transfer Process, each attached to its own file system.]
The FTP Model 2: Server-to-Server A user might wish to transfer files between two hosts, neither of which is a local host. The user sets up control connections to the two servers and then arranges for a data connection between them. Control information is passed to the user-PI but data is transferred between the server data transfer processes. [Figure: User-FTP and User-PI “C” exchanges FTP commands and replies with Server-FTP “A” and Server-FTP “B” over two control connections (steps 1 and 2); the data connection (step 3) runs directly between the two servers.]
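The server-to-server arrangement can be sketched as the sequence of commands the user-PI issues, following the example in RFC 959: server A is put in passive mode, the address it returns in its 227 reply is handed to server B with PORT, then A is told to store and B to retrieve. The function names, paths and the sample reply text are illustrative assumptions.

```python
import re


def parse_pasv(reply: str) -> str:
    """Extract 'h1,h2,h3,h4,p1,p2' from a 227 (Entering Passive Mode) reply."""
    match = re.search(r"(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)", reply)
    if not reply.startswith("227") or match is None:
        raise ValueError(f"unexpected reply to PASV: {reply!r}")
    return ",".join(match.groups())


def third_party_commands(pasv_reply_from_a: str, path_on_b: str,
                         path_on_a: str) -> list[tuple[str, str]]:
    """Commands sent by user-PI C after it has sent PASV to server A."""
    address = parse_pasv(pasv_reply_from_a)
    return [
        ("B", f"PORT {address}"),    # B will connect to A's listening port
        ("A", f"STOR {path_on_a}"),  # A stores what arrives on the data connection
        ("B", f"RETR {path_on_b}"),  # B sends the file over the data connection
    ]


plan = third_party_commands(
    "227 Entering Passive Mode (192,0,2,20,19,137)", "src.dat", "dst.dat")
```

Only commands and replies pass through C; the file itself travels directly from B to A.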
Restart procedure • It protects users from gross system failures: • failures of a host, • an FTP-process, • the underlying network, etc. • The restart procedure is defined only for the block and compressed modes of data transfer. It requires the sender of data to insert a special marker code in the data stream with some marker information. • The marker information has meaning only to the sender, but must consist of printable characters in the default or negotiated language of the control connection (ASCII or EBCDIC). • The marker could represent a bit-count, a record-count, or any other information by which a system may identify a data checkpoint. • The receiver of data, if it implements the restart procedure, would then: • mark the corresponding position of this marker in the receiving system, • return this information to the user. • In the event of a system failure, the user can restart the data transfer by identifying the marker point.
Commands: examples • Access control commands: • USER NAME (USER) • PASSWORD (PASS) • CHANGE WORKING DIRECTORY (CWD) • … • Transfer parameters commands • DATA PORT (PORT) • REPRESENTATION TYPE (TYPE) • FILE STRUCTURE (STRU) • TRANSFER MODE (MODE) • … • FTP service commands • RETRIEVE (RETR) • STORE (STOR) • ALLOCATE (ALLO) • RESTART (REST) • …
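As an illustration of how these commands appear on the control connection, the sketch below formats a typical retrieval dialogue as CRLF-terminated lines (FTP commands are Telnet strings ending with the Telnet end-of-line code). The helper names, the user name, password and file name are hypothetical.

```python
def command(verb: str, *args: str) -> bytes:
    """Format one FTP command line as sent on the control connection."""
    return (" ".join((verb.upper(),) + args) + "\r\n").encode("ascii")


def retrieve_session(user: str, password: str, port_arg: str,
                     path: str) -> list[bytes]:
    """Command lines a user-PI might send to retrieve one binary file."""
    return [
        command("USER", user),      # access control
        command("PASS", password),
        command("TYPE", "I"),       # transfer parameter: IMAGE representation
        command("PORT", port_arg),  # transfer parameter: data port
        command("RETR", path),      # service command
        command("QUIT"),
    ]


lines = retrieve_session("anonymous", "guest", "192,0,2,10,19,137", "data.bin")
```

In a real session the user-PI waits for the server's reply after each command before sending the next one.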
Server and User-FTP Server-FTP: a process or set of processes which perform the function of file transfer in cooperation with a user-FTP process and, possibly, another server. The functions consist of a protocol interpreter (PI) and a data transfer process (DTP). User-FTP: a set of functions including a protocol interpreter, a data transfer process and a user interface which together perform the function of file transfer in cooperation with one or more server-FTP processes. The user interface allows a local language to be used in the command-reply dialogue with the user. [Figure: FTP model diagram (User-FTP, Server-FTP, control and data connections).]
Data Transfer Process Data Transfer Process (DTP): establishes and manages the data connection; it can be passive (if it waits for incoming connections) or active (if it requests the opening of a connection). Data Port: the passive data transfer process "listens" on the data port for a connection from the active transfer process in order to open the data connection. [Figure: FTP model diagram (User-FTP, Server-FTP, control and data connections).]
Control and Data Connections • Control connection: the communication path between the USER-PI and SERVER-PI for the exchange of commands and replies. This connection follows the Telnet Protocol. • It is based on the TCP protocol. • Data connection: a full-duplex connection over which data is transferred, in a specified mode and type. The data transferred may be: • a part of a file, • an entire file or • a number of files. • The path may be between a server-DTP and a user-DTP, or between two server-DTPs. • It is based on the TCP protocol. [Figure: FTP model diagram (User-FTP, Server-FTP, control and data connections).]
Commands and Replies FTP Commands: a set of commands that comprise the control information flowing from the user-FTP to the server-FTP process. Reply: an acknowledgment (positive or negative) sent from server to user via the control connection in response to FTP commands. The general form of a reply is a completion code (including error codes) followed by a text string. The codes are for use by programs and the text is usually intended for human users. [Figure: FTP model diagram (User-FTP, Server-FTP, control and data connections).]
Set-up of an FTP Session (1/2) 1. In this model, the user-protocol interpreter initiates the control connection. The control connection follows the Telnet protocol. 2. The FTP commands specify the parameters for the data connection (data port, transfer mode, representation type, and structure) and the nature of file system operation (store, retrieve, append, delete, etc.). [Figure: FTP model diagram (User-FTP, Server-FTP, control and data connections).]
Set-up of an FTP Session (2/2) 3. The user-DTP or its designate should "listen" on the specified data port, and the server initiates the data connection and data transfer in accordance with the specified parameters. The data port need not be in the same host that initiates the FTP commands via the control connection, but the user or the user-FTP process must ensure a "listen" on the specified data port. [Figure: FTP model diagram (User-FTP, Server-FTP, control and data connections).]
PART II: GridFTP
Grid Data Needs • Transfer of large amounts of data (petabytes or terabytes) between storage systems • Striping across multiple servers to improve performance • Network traffic load balancing • Access to large amounts of data (terabytes or gigabytes) by many geographically distributed applications and users for analysis, visualization, etc. Issues: • Lack of a common protocol to access data (only multiple incompatible APIs are available) • Authentication and authorization • Management of consistency between different replicas of the same file • Location of multiple file replicas • Selection of best file replica
Requirements • Grid Security Infrastructure (GSI) and Kerberos support • Third-party control of data transfer (e.g. data exchange driven by schedulers) • Parallel data transfer: multiple TCP streams between two given end-points • Striped data transfer • Partial file transfer • Automatic negotiation of TCP buffer/window size • Support for reliable/recoverable data transfer • GridFTP extends the FTP standards with: additions to the security extensions, partial file transfer, parallel/striped transfer, TCP buffer/window size tuning
Grid Security Infrastructure and Kerberos • Authentication, integrity and confidentiality features are critical when transferring or accessing files. • RFC 2228 establishes a way to use security on the control channel, but not on the data channel. An extension has been added to allow authentication on the data channel as well, to prevent data from being hijacked. • GridFTP supports both GSI and Kerberos authentication. • User-controlled setting of various levels of data integrity and/or confidentiality on the data channel: • No authentication • Self authentication (for file transfer between two servers): • the identity of the remote data connection must equal the identity of the user who authenticated to the control connection • Subject-name authentication (the identity of the remote data connection must match the supplied subject-name)
Third-party control of data transfer and TCP buffer tuning • Third-party control: • in order to manage large data sets, it is necessary to provide third-party control of transfers between storage servers: GridFTP provides this capability by adding security to the existing third-party transfer capability defined in the FTP standard • Control of TCP buffer size: • In order to achieve optimal bandwidth with TCP/IP, the protocol needs support for both manual setting and automatic tuning of the buffer size • Manual: a specific command “SBUF” (set buffer size) is introduced • Automatic: the “Autonegotiate buffer size” (ABUF) command allows the invocation of an algorithm to determine and set the TCP buffer size (any algorithm can be chosen)
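To make the buffer-size discussion concrete, here is a minimal sketch of one way a value could be chosen and applied locally. The bandwidth-delay product rule used here is a common rule of thumb and an assumption of this sketch, not something prescribed by the slides or by the SBUF/ABUF commands; the function names are hypothetical.

```python
import socket


def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: (bits/s * s) / 8."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request send/receive buffers of the given size.

    The operating system may round or cap the requested value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


# Example: a 1 Gbit/s path with a 100 ms round-trip time.
size = bdp_bytes(1_000_000_000, 100)
```

With default buffers of a few tens of kilobytes, a single stream on such a path would use only a small fraction of the available bandwidth, which is why the protocol exposes buffer control.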
Parallel data transfer • On wide-area links, using multiple TCP streams can improve aggregate bandwidth over using a single TCP stream. • This is required both between a single client and a single server, and between two servers. • GridFTP supports this through extensions to the commands and the data channel. In particular: • The number of parallel data connections that may be established to each destination data node can be controlled: fixed level and variable level (where the number of connections varies according to the network performance) • Starting-parallelism: the number of data connections initially enabled by the server-FTP • Minimum-parallelism: the number of open data connections is not reduced below the specified minimum • Maximum-parallelism: the number of open data connections is not increased beyond the specified maximum [Figure: parallel data connections between source S1 and destination D1.]
Striped Data Transfer • Data is partitioned across multiple servers, in order to improve aggregate bandwidth. There are one or more TCP streams between M network end-points on the sending side and N network end-points on the receiving side. • The end-point is called “data node”. • Layout option issued by the source data node: • Partitioned: a partitioned data layout is one where the data is distributed evenly on the destination data nodes. Only one contiguous section of data is stored on each data node. • Blocked: the data is distributed in round-robin fashion over the destination data nodes [Figure: M source data nodes (S1, S2) connected to N destination data nodes (D1, D2, D3).]
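The two layouts can be sketched as functions mapping byte ranges of a file to destination data nodes. This is an illustrative model only: the function names, the (offset, length) representation and the way a remainder is spread over the first nodes are assumptions, not part of the protocol description above.

```python
def partitioned(file_size: int, nodes: int) -> list[tuple[int, int]]:
    """One contiguous (offset, length) section per destination data node."""
    base, extra = divmod(file_size, nodes)
    layout, offset = [], 0
    for index in range(nodes):
        length = base + (1 if index < extra else 0)
        layout.append((offset, length))
        offset += length
    return layout


def blocked(file_size: int, nodes: int,
            block_size: int) -> list[list[tuple[int, int]]]:
    """Blocks of block_size bytes dealt round-robin over the data nodes."""
    layout = [[] for _ in range(nodes)]
    for index, offset in enumerate(range(0, file_size, block_size)):
        length = min(block_size, file_size - offset)
        layout[index % nodes].append((offset, length))
    return layout


p = partitioned(10, 3)
b = blocked(10, 2, 4)
```

In the partitioned layout each node ends up with a single section; in the blocked layout each node holds every N-th block of the file.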
Striped Data Transfer (cont) • New commands: • SPAS: Striped Passive • allows an array of host/port connections to be RETURNED • Multiple end-points (multihomed hosts or multiple hosts) participate in the transfer • SPOR: Striped Port • allows an array of host/port connections to be SENT
Partial file transfer and reliable data transfer • Partial file transfer: • Standard FTP requires the application to: • transfer the entire file, • or the remainder of a file starting at a particular offset. • GridFTP introduces new FTP commands to support transfers of regions of a file. • Reliable data transfer: • Fault recovery methods for handling transient network failures, server outages, etc. • The FTP standard includes basic features for restarting failed transfers, but they are not widely implemented. • GridFTP adds to these features, for example by supporting: • Automatic RETRY • RE-SCHEDULING of a transfer for a later time • Switching to ALTERNATE SOURCES by means of a replica catalog
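The retry and alternate-source ideas can be illustrated with a small client-side sketch. Everything here is hypothetical: `fetch` stands for whatever function actually performs a transfer, the source names would come from a replica catalog, and treating `OSError` as the transient failure is an assumption of the sketch.

```python
from typing import Callable, Iterable


def fetch_with_recovery(sources: Iterable[str],
                        fetch: Callable[[str], bytes],
                        retries: int = 2) -> bytes:
    """Try each source in turn, retrying transient failures before moving on."""
    last_error = None
    for source in sources:
        for _ in range(retries + 1):
            try:
                return fetch(source)
            except OSError as exc:  # treated as a transient failure
                last_error = exc
    raise RuntimeError("all sources failed") from last_error


# Stand-in transfer function used only to exercise the sketch.
attempts = []

def fake_fetch(source: str) -> bytes:
    attempts.append(source)
    if source == "replica-a":
        raise OSError("connection reset")
    return b"payload"


result = fetch_with_recovery(["replica-a", "replica-b"], fake_fetch, retries=1)
```

Re-scheduling for a later time would sit one level above this loop, for example in the scheduler that invokes it.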
Integrated instrumentation (1/2) • The protocol calls for restart and performance markers to be sent back. It does not specify how often, but it would be important to be able to set this parameter. • Restart: the command is issued by the client and indicates the byte ranges whose transfer needs to be restarted. • Byte ranges which have been successfully stored to disk are recorded and notified to the client; • Complete restart marker: a concatenation of all the ranges that the client has received from the data server on the control channel. It is computed by the client by aggregating contiguous ranges. • The client requests a restart by specifying for which byte ranges the transfer needs to be restarted. [Figure: (1) the data server sends byte ranges to the client; (2) the client sends Restart (complete restart marker) to the data server.]
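The aggregation of contiguous ranges performed by the client can be sketched as follows. Ranges are modelled as half-open (start, end) byte offsets; that representation and the function names are assumptions of this sketch rather than the protocol's wire format.

```python
def merge_ranges(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Aggregate overlapping or contiguous (start, end) ranges."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def missing_ranges(merged: list[tuple[int, int]],
                   file_size: int) -> list[tuple[int, int]]:
    """Byte ranges not yet stored, i.e. what a restart has to cover."""
    gaps, position = [], 0
    for start, end in merged:
        if start > position:
            gaps.append((position, start))
        position = max(position, end)
    if position < file_size:
        gaps.append((position, file_size))
    return gaps


received = [(0, 100), (200, 300), (100, 200), (400, 500)]
marker = merge_ranges(received)
todo = missing_ranges(marker, 600)
```

The merged list plays the role of the complete restart marker; its complement is what the restarted transfer still has to deliver.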
Integrated instrumentation (2/2) • The performance response of a server is extended by adding the following pieces of information: • Timestamp: the time at which the server computed the performance information; • Stripe index: the stripe that the marker pertains to (for monitoring of striped FTP); • Stripe bytes transferred: the number of bytes which have been received on this stripe (for striped FTP monitoring); • Total stripe count: the total number of stripes participating in this transfer; • Transfer start time.
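As an illustration of how a client might use these fields, the sketch below computes an aggregate transfer rate from the most recent marker of each stripe. The `PerfMarker` class and the function are hypothetical; they model the listed fields and do not reproduce the protocol's message format.

```python
from dataclasses import dataclass


@dataclass
class PerfMarker:
    timestamp: float       # when the server computed the information (seconds)
    stripe_index: int      # stripe this marker pertains to
    stripe_bytes: int      # bytes received so far on this stripe
    total_stripes: int     # stripes participating in the transfer


def aggregate_rate(markers: list[PerfMarker], start_time: float) -> float:
    """Bytes per second over all stripes, using each stripe's latest marker."""
    latest: dict[int, PerfMarker] = {}
    for marker in markers:
        seen = latest.get(marker.stripe_index)
        if seen is None or marker.timestamp > seen.timestamp:
            latest[marker.stripe_index] = marker
    if not latest:
        return 0.0
    elapsed = max(m.timestamp for m in latest.values()) - start_time
    if elapsed <= 0:
        return 0.0
    return sum(m.stripe_bytes for m in latest.values()) / elapsed


rate = aggregate_rate(
    [PerfMarker(10.0, 0, 1000, 2),
     PerfMarker(12.0, 0, 3000, 2),
     PerfMarker(12.0, 1, 5000, 2)],
    start_time=2.0,
)
```

Comparing the number of stripes seen with the total stripe count tells the client whether it has heard from every stripe yet.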
References • File Transfer Protocol, RFC 959. • GridFTP: Protocol Extensions to FTP for the Grid; B. Allcock et al.; GGF Recommendation, Apr 2003.