
Scalable File I/O for Large-Scale Grid Applications

This presentation discusses the challenges of and approaches to scalable, parallel file I/O in large-scale grid applications. It proposes a conceptual partition model for file I/O and presents the I/O partition as a scalable nexus between the compute partition and external file systems. It also covers the current implementation, ongoing improvements, and future directions for the file I/O system.


Presentation Transcript


  1. Cplant I/O. Pang Chen and Lee Ward, Scalable Computing Systems, Sandia National Laboratories. Fifth NASA/DOE Joint PC Cluster Computing Conference, October 6-8, 1999.

  2. Conceptual Partition Model (diagram; labels: Compute, File I/O, Service, Net I/O, Users, /home).

  3. File I/O Model
     • Support large-scale unstructured grid applications.
     • Manipulate single file per application, not per processor.
     • Support collective I/O libraries.
     • Require fast concurrent writes to a single file.
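
To make the "single file per application" model concrete, here is a minimal sketch assuming only a POSIX-like interface: each process of the application writes its own block of one shared file at a disjoint, rank-derived offset, so no append ordering or locking is needed. The file name, block size, and rank-from-argv convention are illustrative, not part of Cplant.

    /* Minimal sketch: every process writes a disjoint region of one shared
     * file.  Rank, block size, and file name are hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE (1 << 20)            /* 1 MB per process (illustrative) */

    int main(int argc, char **argv)
    {
        int rank = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g., the MPI rank */
        char *buf = malloc(BLOCK_SIZE);
        if (buf == NULL) return 1;
        memset(buf, rank & 0xff, BLOCK_SIZE);

        /* All processes open the same file; O_APPEND is deliberately avoided. */
        int fd = open("grid_dump.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Non-overlapping region, derived from the rank. */
        off_t offset = (off_t)rank * BLOCK_SIZE;
        if (pwrite(fd, buf, BLOCK_SIZE, offset) != BLOCK_SIZE) {
            perror("pwrite");
            return 1;
        }

        close(fd);
        free(buf);
        return 0;
    }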

  4. Problems
     • Need a file system NOW!
     • Need scalable, parallel I/O.
     • Need file management infrastructure.
     • Need to present the I/O subsystem as a single parallel file system both internally and externally.
     • Need production-quality code.

  5. Approaches
     • Provide independent access to file systems on each I/O node.
       • Can’t stripe across multiple I/O nodes to get better performance.
     • Add a file management layer to “glue” the independent file systems so as to present a single file view.
       • Require users (both on and off Cplant) to differentiate between this “special” file system and other “normal” file systems.
       • Lots of special utilities are required.
     • Build our own parallel file system from scratch.
       • A lot of work just to reinvent the wheel, let alone the right wheel.
     • Port other parallel file systems into Cplant.
       • Also a lot of work with no immediate payoff.

  6. Current Approach
     • Build our I/O partition as a scalable nexus between Cplant and external file systems.
     • Leverage off existing and future parallel file systems.
     • Allow immediate payoff with Cplant accessing existing file systems.
     • Reduce data storage, copies, and management.
     • Expect lower performance with non-local file systems.
     • Waste external bandwidth when accessing scratch files.

  7. Building the Nexus
     • Semantics
       • How can and should the compute partition use this service?
     • Architecture
       • What are the components and protocols between them?
     • Implementation
       • What do we have now, and what do we hope to achieve in the future?

  8. Compute Partition Semantics
     • POSIX-like.
     • Allow users to be in a familiar environment.
     • No support for ordered operations (e.g., no O_APPEND).
     • No support for data locking.
     • Enable fast non-overlapping concurrent writes to a single file.
     • Prevent a job from slowing down the entire system for others.
     • Additional call to invalidate buffer cache.
     • Allow file views to synchronize when required.
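
The slides name the extra cache-invalidation call but not its interface, so the sketch below is an assumption: a reader that wants to see data written by another compute node (possibly through a different I/O node) drops its cached pages first. The function cplant_cache_invalidate() and its no-op stub are hypothetical stand-ins.

    /* Hedged sketch of synchronizing file views with the extra
     * buffer-cache-invalidation call.  cplant_cache_invalidate() is a
     * hypothetical name; here it is stubbed as a no-op so the sketch compiles. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder for the Cplant-specific invalidation call (assumption). */
    static int cplant_cache_invalidate(int fd) { (void)fd; return 0; }

    ssize_t read_partner_block(const char *path, off_t offset, char *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        /* With no locking and no ordered operations, cached pages may be
         * stale; invalidate before reading data another node just wrote. */
        if (cplant_cache_invalidate(fd) < 0)
            fprintf(stderr, "cache invalidate failed; data may be stale\n");

        ssize_t n = pread(fd, buf, len, offset);
        close(fd);
        return n;
    }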

  9. (Diagram: Cplant reaching Enterprise Storage Services through multiple I/O nodes.)

  10. Architecture
     • I/O nodes present a symmetric view.
       • Every I/O node behaves the same (except for the cache).
       • Without any control, a compute node may open a file with one I/O node and write that file via another I/O node.
     • The I/O partition is fault-tolerant and scalable.
       • Any I/O node can go down without the system losing jobs.
       • An appropriate number of I/O nodes can be added to scale with the compute partition.
     • The I/O partition is the nexus for all file I/O.
       • It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition.
     • Links/protocols to external storage servers are server dependent.
       • The external implementation is hidden from the compute partition.

  11. Compute -- I/O Node Protocol
     • Base protocol is NFS version 2.
       • Stateless protocols allow us to repair faulty I/O nodes without aborting applications.
       • Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here.
     • Extensions/modifications:
       • Larger I/O requests.
       • Propagation of a call to invalidate cache on the I/O node.
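
The slides give the base protocol (NFS version 2) and the two extensions, but not a wire format, so the following is a sketch under assumptions: the structures, the 1 MB request limit, and the 64-bit offset are illustrative, not the actual Cplant protocol. (Stock NFSv2 uses 32-byte opaque file handles, 32-bit offsets, and caps transfers at 8 KB.)

    /* Hedged sketch of the compute-to-I/O-node protocol: NFSv2 as the base,
     * extended with larger I/O requests and a cache-invalidation call.
     * All names and sizes below are illustrative assumptions. */
    #include <stdint.h>

    #define NFS2_FHSIZE   32           /* opaque file handle size in NFSv2 */
    #define CPLANT_MAX_IO (1 << 20)    /* assumed 1 MB request limit (vs. 8 KB in stock NFSv2) */

    /* Hypothetical extended write request. */
    struct cplant_write_args {
        uint8_t  fh[NFS2_FHSIZE];   /* stateless file handle: any I/O node can serve it */
        uint64_t offset;            /* widened from NFSv2's 32-bit offset (assumption) */
        uint32_t count;             /* up to CPLANT_MAX_IO bytes */
        /* followed by 'count' bytes of data on the wire */
    };

    /* Hypothetical extra procedure: ask the I/O node to drop cached pages
     * for a file so views can be synchronized on demand. */
    struct cplant_invalidate_args {
        uint8_t fh[NFS2_FHSIZE];
    };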

  12. Current Implementation
     • Basic implementation of the I/O nodes:
       • Straight NFS inside Linux with the ability to invalidate cache.
       • I/O nodes have no cache.
       • I/O nodes are dumb proxies knowing only about one server.
       • Credentials are rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes.
     • I/O nodes are attached via 100BaseT links to a Gigabit Ethernet, with an SGI O2K as the (XFS) file server on the other end.
       • No jumbo packets.
       • Bandwidth is about 30 MB/s with 18 clients driving 3 I/O nodes, each using about 15% of CPU.
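
As a rough sanity check on those numbers (assuming 30 MB/s is the aggregate): 30 MB/s across 3 I/O nodes is about 10 MB/s per node, while a 100BaseT link tops out near 12.5 MB/s raw, so each I/O node is running close to its Fast Ethernet line rate; the 18 clients average roughly 1.7 MB/s apiece.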

  13. Current Improvements
     • Put a VFS infrastructure into I/O node daemon.
     • Allow access to multiple servers.
     • Allow a Linux /proc interface to tune individual I/O nodes quickly and easily.
     • Allow vnode identification to associate buffer cache with files.
     • Experiment with a multi-node server (SGI/CXFS).
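
The deck does not show the daemon's internals, so here is a minimal sketch of what a VFS-style layer inside a user-level I/O node daemon could look like: a per-server operations table dispatched through a vnode, which is what makes multiple back-end servers and per-file (vnode-identified) buffer caching possible. Every name in the sketch is illustrative.

    /* Hedged sketch of a "VFS infrastructure" for a user-level I/O daemon.
     * Nothing below is from the Cplant sources. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vnode;                     /* per-open-file object on the I/O node */

    /* Per-server operations table, analogous to kernel VFS vnode ops. */
    struct vnode_ops {
        int     (*open)(struct vnode *vn, const char *path, int flags);
        ssize_t (*read)(struct vnode *vn, void *buf, size_t len, off_t off);
        ssize_t (*write)(struct vnode *vn, const void *buf, size_t len, off_t off);
        int     (*invalidate)(struct vnode *vn);   /* drop cached pages for this file */
        int     (*close)(struct vnode *vn);
    };

    struct vnode {
        const struct vnode_ops *ops;  /* which back-end server implementation to use */
        unsigned long           id;   /* vnode identification: ties buffer cache to a file */
        void                   *server_private;
    };

    /* The daemon dispatches each request through the vnode, so NFS-, CXFS-,
     * or local-disk-backed servers can coexist behind one interface. */
    static inline ssize_t vn_write(struct vnode *vn, const void *buf,
                                   size_t len, off_t off)
    {
        return vn->ops->write(vn, buf, len, off);
    }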

  14. Future Improvements
     • Stop retries from going out of network.
     • Put in jumbo packets.
     • Put in read cache.
     • Put in write cache.
     • Port over Portals 3.0.
     • Put in bulk data services.
     • Allow dynamic compute-node-to-I/O-node mapping.
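
The deck does not say how the dynamic compute-node-to-I/O-node mapping would be chosen; one simple, purely illustrative scheme is to hash the stateless file handle so every compute node independently picks the same I/O node for a given file. The hash choice and function names below are assumptions.

    /* Illustrative sketch of dynamic compute-node-to-I/O-node mapping:
     * hash the opaque file handle to spread files across the I/O partition
     * without any coordination.  This scheme is an assumption. */
    #include <stddef.h>
    #include <stdint.h>

    /* FNV-1a hash over the opaque file handle. */
    static uint32_t fh_hash(const uint8_t *fh, size_t len)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; i++) {
            h ^= fh[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Map a file handle to one of n_io_nodes I/O nodes. */
    unsigned pick_io_node(const uint8_t *fh, size_t len, unsigned n_io_nodes)
    {
        return fh_hash(fh, len) % n_io_nodes;
    }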

  15. Looking for Collaborations
     • Lee Ward, 505-844-9545, lward@sandia.gov
     • Pang Chen, 510-796-9605, pchen@cs.sandia.gov
