
Scalable File I/O for Large-Scale Grid Applications

This presentation discusses the challenges of and approaches to scalable, parallel file I/O in large-scale grid applications. It proposes a conceptual partition model for file I/O and presents the I/O partition as a scalable nexus between the compute partition and external file systems. It also covers the current implementation, ongoing improvements, and future directions for the file I/O system.


Presentation Transcript


  1. Cplant I/O. Pang Chen and Lee Ward, Scalable Computing Systems, Sandia National Laboratories. Fifth NASA/DOE Joint PC Cluster Computing Conference, October 6-8, 1999.

  2. Conceptual Partition Model (diagram; labels: Compute, File I/O, Service, Net I/O, Users, /home).

  3. File I/O Model
     • Support large-scale unstructured grid applications.
     • Manipulate single file per application, not per processor.
     • Support collective I/O libraries.
     • Require fast concurrent writes to a single file.
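
To make the "single file per application" model concrete, here is a minimal sketch assuming only a POSIX-like interface: each process of the application writes its own block of one shared file at a disjoint, rank-derived offset, so no append ordering or locking is needed. The file name, block size, and rank-from-argv convention are illustrative, not part of Cplant.

    /* Minimal sketch: every process writes a disjoint region of one shared
     * file.  Rank, block size, and file name are hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE (1 << 20)            /* 1 MB per process (illustrative) */

    int main(int argc, char **argv)
    {
        int rank = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g., the MPI rank */
        char *buf = malloc(BLOCK_SIZE);
        if (buf == NULL) return 1;
        memset(buf, rank & 0xff, BLOCK_SIZE);

        /* All processes open the same file; O_APPEND is deliberately avoided. */
        int fd = open("grid_dump.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Non-overlapping region, derived from the rank. */
        off_t offset = (off_t)rank * BLOCK_SIZE;
        if (pwrite(fd, buf, BLOCK_SIZE, offset) != BLOCK_SIZE) {
            perror("pwrite");
            return 1;
        }

        close(fd);
        free(buf);
        return 0;
    }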

  4. Problems
     • Need a file system NOW!
     • Need scalable, parallel I/O.
     • Need file management infrastructure.
     • Need to present the I/O subsystem as a single parallel file system both internally and externally.
     • Need production-quality code.

  5. Approaches
     • Provide independent access to file systems on each I/O node.
       • Can’t stripe across multiple I/O nodes to get better performance.
     • Add a file management layer to “glue” the independent file systems so as to present a single file view.
       • Require users (both on and off Cplant) to differentiate between this “special” file system and other “normal” file systems.
       • Lots of special utilities are required.
     • Build our own parallel file system from scratch.
       • A lot of work just to reinvent the wheel, let alone the right wheel.
     • Port other parallel file systems into Cplant.
       • Also a lot of work with no immediate payoff.

  6. Current Approach
     • Build our I/O partition as a scalable nexus between Cplant and external file systems.
     • Leverage off existing and future parallel file systems.
     • Allow immediate payoff with Cplant accessing existing file systems.
     • Reduce data storage, copies, and management.
     • Expect lower performance with non-local file systems.
     • Waste external bandwidth when accessing scratch files.

  7. Building the Nexus
     • Semantics
       • How can and should the compute partition use this service?
     • Architecture
       • What are the components and protocols between them?
     • Implementation
       • What do we have now, and what do we hope to achieve in the future?

  8. Compute Partition Semantics
     • POSIX-like.
     • Allow users to be in a familiar environment.
     • No support for ordered operations (e.g., no O_APPEND).
     • No support for data locking.
     • Enable fast non-overlapping concurrent writes to a single file.
     • Prevent a job from slowing down the entire system for others.
     • Additional call to invalidate buffer cache.
     • Allow file views to synchronize when required.
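
The slides name the extra cache-invalidation call but not its interface, so the sketch below is an assumption: a reader that wants to see data written by another compute node (possibly through a different I/O node) drops its cached pages first. The function cplant_cache_invalidate() and its no-op stub are hypothetical stand-ins.

    /* Hedged sketch of synchronizing file views with the extra
     * buffer-cache-invalidation call.  cplant_cache_invalidate() is a
     * hypothetical name; here it is stubbed as a no-op so the sketch compiles. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder for the Cplant-specific invalidation call (assumption). */
    static int cplant_cache_invalidate(int fd) { (void)fd; return 0; }

    ssize_t read_partner_block(const char *path, off_t offset, char *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        /* With no locking and no ordered operations, cached pages may be
         * stale; invalidate before reading data another node just wrote. */
        if (cplant_cache_invalidate(fd) < 0)
            fprintf(stderr, "cache invalidate failed; data may be stale\n");

        ssize_t n = pread(fd, buf, len, offset);
        close(fd);
        return n;
    }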

  9. (Diagram: Cplant reaching Enterprise Storage Services through multiple I/O nodes.)

  10. Architecture
     • I/O nodes present a symmetric view.
       • Every I/O node behaves the same (except for the cache).
       • Without any control, a compute node may open a file with one I/O node and write that file via another I/O node.
     • The I/O partition is fault-tolerant and scalable.
       • Any I/O node can go down without the system losing jobs.
       • An appropriate number of I/O nodes can be added to scale with the compute partition.
     • The I/O partition is the nexus for all file I/O.
       • It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition.
     • Links/protocols to external storage servers are server dependent.
       • The external implementation is hidden from the compute partition.

  11. Compute -- I/O Node Protocol
     • Base protocol is NFS version 2.
       • Stateless protocols allow us to repair faulty I/O nodes without aborting applications.
       • Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here.
     • Extensions/modifications:
       • Larger I/O requests.
       • Propagation of a call to invalidate cache on the I/O node.
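
The slides give the base protocol (NFS version 2) and the two extensions, but not a wire format, so the following is a sketch under assumptions: the structures, the 1 MB request limit, and the 64-bit offset are illustrative, not the actual Cplant protocol. (Stock NFSv2 uses 32-byte opaque file handles, 32-bit offsets, and caps transfers at 8 KB.)

    /* Hedged sketch of the compute-to-I/O-node protocol: NFSv2 as the base,
     * extended with larger I/O requests and a cache-invalidation call.
     * All names and sizes below are illustrative assumptions. */
    #include <stdint.h>

    #define NFS2_FHSIZE   32           /* opaque file handle size in NFSv2 */
    #define CPLANT_MAX_IO (1 << 20)    /* assumed 1 MB request limit (vs. 8 KB in stock NFSv2) */

    /* Hypothetical extended write request. */
    struct cplant_write_args {
        uint8_t  fh[NFS2_FHSIZE];   /* stateless file handle: any I/O node can serve it */
        uint64_t offset;            /* widened from NFSv2's 32-bit offset (assumption) */
        uint32_t count;             /* up to CPLANT_MAX_IO bytes */
        /* followed by 'count' bytes of data on the wire */
    };

    /* Hypothetical extra procedure: ask the I/O node to drop cached pages
     * for a file so views can be synchronized on demand. */
    struct cplant_invalidate_args {
        uint8_t fh[NFS2_FHSIZE];
    };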

  12. Current Implementation
     • Basic implementation of the I/O nodes:
       • Straight NFS inside Linux with the ability to invalidate cache.
       • I/O nodes have no cache.
       • I/O nodes are dumb proxies knowing only about one server.
       • Credentials are rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes.
     • I/O nodes are attached via 100BaseT links to a Gigabit Ethernet, with an SGI O2K as the (XFS) file server on the other end.
       • No jumbo packets.
       • Bandwidth is about 30 MB/s with 18 clients driving 3 I/O nodes, each using about 15% of CPU.
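
As a rough sanity check on those numbers (assuming 30 MB/s is the aggregate): 30 MB/s across 3 I/O nodes is about 10 MB/s per node, while a 100BaseT link tops out near 12.5 MB/s raw, so each I/O node is running close to its Fast Ethernet line rate; the 18 clients average roughly 1.7 MB/s apiece.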

  13. Current Improvements
     • Put a VFS infrastructure into I/O node daemon.
     • Allow access to multiple servers.
     • Allow a Linux /proc interface to tune individual I/O nodes quickly and easily.
     • Allow vnode identification to associate buffer cache with files.
     • Experiment with a multi-node server (SGI/CXFS).
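
The deck does not show the daemon's internals, so here is a minimal sketch of what a VFS-style layer inside a user-level I/O node daemon could look like: a per-server operations table dispatched through a vnode, which is what makes multiple back-end servers and per-file (vnode-identified) buffer caching possible. Every name in the sketch is illustrative.

    /* Hedged sketch of a "VFS infrastructure" for a user-level I/O daemon.
     * Nothing below is from the Cplant sources. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;                     /* per-open-file object on the I/O node */

    /* Per-server operations table, analogous to kernel VFS vnode ops. */
    struct vnode_ops {
        int     (*open)(struct vnode *vn, const char *path, int flags);
        ssize_t (*read)(struct vnode *vn, void *buf, size_t len, off_t off);
        ssize_t (*write)(struct vnode *vn, const void *buf, size_t len, off_t off);
        int     (*invalidate)(struct vnode *vn);   /* drop cached pages for this file */
        int     (*close)(struct vnode *vn);
    };

    struct vnode {
        const struct vnode_ops *ops;  /* which back-end server implementation to use */
        unsigned long           id;   /* vnode identification: ties buffer cache to a file */
        void                   *server_private;
    };

    /* The daemon dispatches each request through the vnode, so NFS-, CXFS-,
     * or local-disk-backed servers can coexist behind one interface. */
    static inline ssize_t vn_write(struct vnode *vn, const void *buf,
                                   size_t len, off_t off)
    {
        return vn->ops->write(vn, buf, len, off);
    }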

  14. Future Improvements
     • Stop retries from going out of network.
     • Put in jumbo packets.
     • Put in read cache.
     • Put in write cache.
     • Port over Portals 3.0.
     • Put in bulk data services.
     • Allow dynamic compute-node-to-I/O-node mapping.
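
The deck does not say how the dynamic compute-node-to-I/O-node mapping would be chosen; one simple, purely illustrative scheme is to hash the stateless file handle so every compute node independently picks the same I/O node for a given file. The hash choice and function names below are assumptions.

    /* Illustrative sketch of dynamic compute-node-to-I/O-node mapping:
     * hash the opaque file handle to spread files across the I/O partition
     * without any coordination.  This scheme is an assumption. */
    #include <stddef.h>
    #include <stdint.h>

    /* FNV-1a hash over the opaque file handle. */
    static uint32_t fh_hash(const uint8_t *fh, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= fh[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Map a file handle to one of n_io_nodes I/O nodes. */
    unsigned pick_io_node(const uint8_t *fh, size_t len, unsigned n_io_nodes)
    {
        return fh_hash(fh, len) % n_io_nodes;
    }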

  15. Looking for Collaborations
     • Lee Ward, 505-844-9545, lward@sandia.gov
     • Pang Chen, 510-796-9605, pchen@cs.sandia.gov
