GPFS, Glamour and Panache: Towards a Global Scalable File Service
Renu Tewari, IBM Almaden
World-wide Sharing of Data
• Example: grid computing workflows (Ingest, Compute, Visualize)
• Massive amounts of data gathered at remote locations
• Distributed compute resources process input data and store their results
• Results are analyzed or visualized by people at many locations
What is the function of a filesystem?
• Namespace
• Consistency
• Security and access control
• Fault tolerance
• Performance
How do we get to a Scalable Global FS?
• Scale up
  • Enterprise-class storage systems
  • Large parallel machines
• Scale out
  • Internet companies and "cloud computing" are clearly embracing scale-out hardware architectures
  • Large clusters of commodity servers connected by high-bandwidth Ethernet
  • For many applications, scale-out has proven to be cheaper than scale-up
  • The trade-off is largely a question of network bandwidth versus disk bandwidth
System Scaling Up: BlueGene/L
• Chip: 2 processors, 5.6 GF/s, 4 MB
• Compute card: 2 chips, 4 processors, 11.2 GF/s, 1.0 GB
• Node card: 16 compute cards (plus 0-2 I/O cards), 64 processors, 180 GF/s, 16 GB
• Rack: 32 node cards, 2,048 processors, 5.6 TF/s, 512 GB, 20 kW, 1 m² footprint
• System: 64 racks (64x32x32), 131,072 processors, 360 TF/s, 32 TB, 1.3 MW
Coming soon: BlueGene/P, BlueGene/C, BlueGene/Q
Technology keeps growing …
• Technology blurs the distinction among SAN, LAN, and WAN:
  • "Why can't I use whatever fabric I have for storage?"
  • "Why do I need separate fabrics for storage and communication?"
  • "Why can't I share files across my entire SAN like I can with Ethernet?"
  • "Why can't I access storage over long distances?"
Outline
• GPFS: scaling the "local" filesystem for a large data store
• pNFS: scaling the pipe for remote data access
• WAN caching: bringing data closer to where it is consumed
• Federated filesystems: separating the physical location from the logical namespace
Parallel File Systems 101
• Manage data sharing in large data stores
• Two architectures:
  • Asymmetric (dedicated metadata servers), e.g., PVFS2, Lustre, HighRoad
  • Symmetric (every node handles data and metadata), e.g., GPFS, GFS, PolyServe
GPFS: parallel file access
Parallel cluster file system based on the shared-disk (SAN) model:
• Cluster: fabric-interconnected nodes (IP, SAN, …)
• Shared disk: all data and metadata on fabric-attached disk (SAN-attached or network block device)
• Parallel: data and metadata flow from all of the nodes to all of the disks in parallel, under control of a distributed lock manager
(Diagram: GPFS file system nodes connected through a switching fabric (system or storage area network) to the shared disks.)
What GPFS is NOT
• Not a client-server file system like NFS, DFS, or AFS: no single-server bottleneck, no protocol overhead for data transfer
• Not like some SAN file systems (e.g., Veritas, CXFS): no distinct metadata server (which is a potential bottleneck)
GPFS Features
• High capacity: large number of disks in a single file system
• High-bandwidth access to a single file
  • Large block size, full-stride I/O to RAID
  • Wide striping: one file spread over all disks (sketched below)
  • Small files spread uniformly
  • Multiple nodes read/write in parallel
• High availability
  • Nodes: log recovery restores consistency after a node failure
  • Data: RAID or internal replication
  • On-line management (add/remove disks or nodes without unmounting)
  • Rolling software upgrade
• Single-system image, standard POSIX interface, with distributed locking for read/write semantics
• Advanced file system features: memory-mapped files, direct I/O, replication, snapshots, quotas, extended attributes
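As a rough illustration of the wide-striping idea above, here is a minimal Python sketch that maps a file offset to a (disk, block) pair with round-robin striping. The block size and disk count are invented parameters, and GPFS's real allocator (segmented allocation maps, failure groups, replication) is far more elaborate; treat this purely as a sketch.

    # Round-robin wide striping: map a file byte offset to (disk, disk block).
    BLOCK_SIZE = 1 << 20      # assume 1 MiB file system blocks
    NUM_DISKS = 120           # assume 120 SAN-attached disks

    def locate(file_offset: int, first_disk: int = 0) -> tuple[int, int]:
        """Return (disk index, block index on that disk) for a byte offset."""
        file_block = file_offset // BLOCK_SIZE
        disk = (first_disk + file_block) % NUM_DISKS   # every disk in turn
        disk_block = file_block // NUM_DISKS           # position on that disk
        return disk, disk_block

    # A large sequential read touches every disk in turn, so all spindles
    # contribute bandwidth to a single file:
    print([locate(i * BLOCK_SIZE) for i in range(5)])
    # [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]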
Distributed Locking
• GPFS architecture: clients share data and metadata, synchronized by means of distributed locking
• Distributed locking is essential to:
  • synchronize file system operations for POSIX semantics,
  • synchronize updates to on-disk file system metadata to prevent corruption,
  • maintain cache consistency of data and metadata cached on different nodes.
• Synchronization requires communication
  • Problem: sending a lock message for every operation will not scale.
  • Solution: a token-based lock manager allows "lock caching".
Token-based Lock Manager
• The token server grants tokens (see the sketch below).
• A token represents the right to read, cache, and/or update a particular piece of data or metadata.
• A single message to the token server allows repeated access to the same object.
• A conflicting operation on another node will revoke the token.
• Force-on-steal: dirty data and metadata are flushed to disk when a token is stolen.
• Numerous optimizations: large byte ranges, lookup/open.
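The Python sketch below illustrates the token mechanism described above: one grant is cached by the node and reused until another node's conflicting request forces a revoke (force-on-steal). The class and method names (TokenServer, acquire, revoke) are invented for illustration and are not GPFS interfaces.

    from collections import defaultdict

    READ, WRITE = "read", "write"        # simplified token modes

    def conflicts(held: str, wanted: str) -> bool:
        return WRITE in (held, wanted)   # a write conflicts with anything

    class TokenServer:
        """Grants tokens; revokes conflicting tokens held by other nodes."""
        def __init__(self):
            self.holders = defaultdict(dict)     # object -> {node: mode}

        def acquire(self, node, obj, mode):
            for other, held in list(self.holders[obj].items()):
                if other is not node and conflicts(held, mode):
                    other.revoke(obj)            # force-on-steal
                    del self.holders[obj][other]
            self.holders[obj][node] = mode
            return mode

    class Node:
        """Caches granted tokens so repeated access needs no server message."""
        def __init__(self, name, server):
            self.name, self.server, self.tokens = name, server, {}

        def access(self, obj, mode):
            held = self.tokens.get(obj)
            if not (held == WRITE or held == mode):          # token not cached
                self.tokens[obj] = self.server.acquire(self, obj, mode)
            # ...perform the read or write locally under the cached token...

        def revoke(self, obj):
            # A real node would flush dirty data and metadata to disk here.
            print(f"{self.name}: token for {obj} stolen, flushing and dropping")
            self.tokens.pop(obj, None)

    server = TokenServer()
    n1, n2 = Node("node1", server), Node("node2", server)
    n1.access("inode 42", READ)    # one message to the token server
    n1.access("inode 42", READ)    # satisfied from the local token cache
    n2.access("inode 42", WRITE)   # conflicting: node1's token is revoked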
Consistency control: Locks and Tokens
(Diagram: applications on client systems take local locks on GPFS file structures (Foo.1, Foo.2, …) and data buffers (Blk.02, Blk.237, …); the corresponding tokens are cached capabilities granted by the token servers, giving local consistency within a node and global consistency across the cluster.)
Token servers:
• Request / release
• Revoke
• Multiple modes
• Distributed via hash
• Recoverable service
Distributed locking: optimizations
• Byte-range locking
  • Opportunistic locking: small "required" range and large "requested" range (see the sketch below)
  • Minimizes lock traffic for piecewise sharing
• Metadata token optimizations
  • During lookup, the client speculatively acquires tokens to open the file
  • Inode reuse: a node deleting a file caches its inode for a subsequent create
  • Directory scans: readdir() "prefetches" inode tokens for subsequent stat() calls
  • A "metanode" is dynamically assigned to handle metadata updates when write-sharing occurs
• Block allocation
  • A segmented allocation map allows each node to allocate space on all disks for data striping
• New optimizations for multi-cluster scaling
  • Token manager split across multiple nodes (hashed tokens)
  • Metanode dynamically migrated to minimize communication latency
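To make the byte-range optimization concrete, here is a small sketch of opportunistic granting: the client names a small required range and a large requested range, and the server grants as much of the requested range as does not overlap ranges held by other nodes, so later writes in the same region need no further messages. The function and its half-open range convention are invented for illustration, not GPFS internals.

    def grant(required, requested, held_by_others):
        """Grant a byte range covering `required` and as much of `requested`
        as possible without overlapping any range in `held_by_others`.
        Ranges are half-open (start, end) tuples."""
        req_lo, req_hi = required
        lo, hi = requested
        for o_lo, o_hi in held_by_others:
            if o_hi <= req_lo:          # conflicting holder entirely to the left
                lo = max(lo, o_hi)
            elif o_lo >= req_hi:        # conflicting holder entirely to the right
                hi = min(hi, o_lo)
            else:
                raise RuntimeError("required range conflicts; revoke first")
        return (lo, hi)

    # Node A must write bytes [0, 1 MiB) but asks for the whole file;
    # node B already holds [64 MiB, 128 MiB), so A is granted [0, 64 MiB)
    # and can keep writing below 64 MiB with no further token traffic.
    MiB = 1 << 20
    print(grant((0, 1 * MiB), (0, 2**63), [(64 * MiB, 128 * MiB)]))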
Performance
• 130+ GB/s parallel read/write achieved on ASC Purple
• 12 GB/s parallel read / 8 GB/s write on a 40-node IA64 Linux cluster with 120 TB of SAN-attached FAStT 600 storage
• 7.8 GB/s streaming read, 7.3 GB/s write to a single file on p690
• 15.8 GB/s read / 14.5 GB/s write aggregate to multiple files
• 1000+ file creates/sec/node (3,051 files/sec on 3 Linux nodes with 50 drives on one FAStT controller; 256M files in 24 hours 26 minutes)
GSA Experience
(Chart slide; as of August 2006, 142 TB total space.)
GPFS on ASC Purple/C Supercomputer
• 1536-node, 100 TF pSeries cluster at Lawrence Livermore National Laboratory
• 2 PB GPFS file system (one mount point)
• 500 RAID controller pairs, 11,000 disk drives
• 130 GB/s parallel I/O measured to a single file (135 GB/s to multiple files)
GPFS Directions
• Petascale computing work (HPCS)
  • File systems to a trillion files
  • Space to exabytes
  • Clusters to 50K nodes
  • Improved robustness with increasing numbers of components
• Data integrity
  • Tolerating more than two failures
  • Covering the entire data path (disks, controllers, interconnect)
• Storage management
  • Extensions for content managers, intelligent data store facilities
• File serving
  • pNFS (parallel NFSv4)
  • Metadata-intensive workload improvements
  • Large cache exploitation
• Distributed file systems
  • Disk-based caching at remote sites
  • Independent, disconnected operation (even writes)
  • Multi-update reconciliation
  • Multi-site replication for availability
• Automated operations improvements
  • Improved installation process
  • Automatic setting of most performance control knobs
  • GUI for management overview
GPFS beyond the Cluster
• Problem: in a large data center, multiple clusters and other nodes need to share data over the SAN
• Solution: eliminate the notion of a fixed cluster
  • "Control" nodes for administration, lock management, recovery, …
  • File access from "client" nodes
  • Client nodes authenticate to control nodes to mount a file system
  • Client nodes are trusted to enforce access control
  • Clients still access disk data and metadata directly
Multi-cluster GPFS and Grid Storage
• Multi-cluster supported in GPFS
  • Nodes of one cluster mount file system(s) of another cluster
  • Tight consistency, identical to that of a single cluster
• User IDs mapped across clusters (a toy mapping sketch follows below)
  • Server and remote client clusters can have different user ID spaces
  • File user IDs on disk may not match credentials on the remote cluster
  • Pluggable infrastructure allows user IDs to be mapped across clusters
• Multi-cluster works within a site or across a WAN
  • Lawrence Berkeley Labs (NERSC): multiple supercomputer clusters share large GPFS file systems
  • DEISA (European computing grid): RZG, CINECA, IDRIS, CSC, CSCS, UPC, IBM; pSeries and other clusters interconnected with a multi-gigabit WAN; multi-cluster GPFS in "friendly-user" production 4/2005
  • TeraGrid: SDSC, NCSA, ANL, PSC, CalTech, IU, UT, …; sites linked via a 30 Gb/s dedicated WAN; 500 TB GPFS file system at SDSC shared across sites, 1500 nodes; multi-cluster GPFS in production 10/2005
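As a rough sketch of what the pluggable user-ID mapping layer has to do (lookup tables and cluster names here are invented, and GPFS's actual mapping interface differs):

    # Map (remote cluster, remote uid) pairs to local uids so that on-disk
    # ownership in the home cluster can be checked against credentials
    # presented from a remote cluster. Entirely illustrative data.
    ID_MAP = {
        ("sdsc.teragrid", 5012): 880123,
        ("ncsa.teragrid", 771):  880124,
    }
    NOBODY = 65534                      # fall back to an unprivileged id

    def map_remote_uid(cluster: str, remote_uid: int) -> int:
        return ID_MAP.get((cluster, remote_uid), NOBODY)

    print(map_remote_uid("sdsc.teragrid", 5012))   # 880123
    print(map_remote_uid("sdsc.teragrid", 9999))   # 65534 (unmapped)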
Remote Data Access
Problem: a gap between users and parallel file systems.
• Q: Use an existing parallel file system? A: Not likely.
• Q: Build "yet another" parallel file system? A: See above.
Why not vanilla NFS?
• NFS is frequently used for remote data access
• Pros
  • Simple, reliable, easy to manage
  • Practically every OS comes with a free NFS client
  • Lots of vendors make NFS servers
• Cons
  • NFS "loose consistency" is not suitable for some applications
  • Capacity and bandwidth don't scale
  • Attempts to make NFS scale make it unreliable and difficult to manage
    • Single-server bottleneck (capacity and throughput limitations)
    • To overcome the bottleneck, the user has to explicitly partition the namespace across servers
    • Partitioning and re-partitioning is tedious and hard to get right
  • Various attempts to scale conventional NFS, each with limitations
    • Clustered NFS servers (NFS on top of a cluster file system)
    • Each server handles all data, but each client only mounts from one server
NFSv4 Optimized for the WAN
• Compound operations
• Redirection support
• Delegations
Parallel NFS (pNFS): extending NFS to HPC
• pNFS is included in NFSv4.1 to support parallel access
  • Minor version extension of the NFSv4 protocol
  • An NFSv4 client mounts from a pNFS-enabled NFSv4 metadata server
  • A non-pNFS-enabled client gets data directly from the metadata server
  • A pNFS-enabled client is redirected to the (multiple) data servers that store the file
• Benefits:
  • An individual high-performance client can benefit from striping across multiple data servers, e.g., a 10 GbE client with 1 GbE data servers (or servers with few disks)
  • Multiple clients can use the data servers like a parallel file system
  • Still NFS semantics, vs. most parallel file systems, which support POSIX semantics
• Three variants of the protocol
  • File: data servers are NFSv4 file servers
  • Object: data servers are ANSI T10 OSD servers
  • Block: data servers are block devices (Fibre Channel FCP, iSCSI)
pNFS Timeline (2003-2008)
(Timeline figure; recoverable milestones: NEPS workshop @ CITI (Dec. 2003); pNFS published [MSST05]; NFSv4.1 draft and block-layout draft (Dec.); object-layout draft (June); pNFS small writes [ICS06]; pNFS/PVFS2 prototype @ SNL (May-Aug) and prototype demonstration at SC06; PDSI established (Sept.); Direct-pNFS [HPDC07]; Connectathon/Bakeathon interoperability testing; planned Linux kernel acceptance.)
pNFS protocol extensions in NFSv4.1
• The basic pNFS architecture (client-side flow sketched below):
  • Clients call the metadata server to create/delete/open/close
  • Clients get a layout (a map of the data) from the metadata server
  • The layout directs clients to the data servers for read/write operations
• pNFS protocol operations:
  • GETDEVICELIST: retrieve the list of data-server devices
  • GETDEVICEINFO: retrieve information for a single device (data server identity and attributes)
  • LAYOUTGET(file_handle, type, byte_range): asks the server for a "layout" that tells the client how to map file byte ranges to data servers; the layout is valid until returned by the client or recalled by the server
  • LAYOUTRETURN(file_handle, byte_range): tells the server that the client is no longer accessing the byte range, so the server can release state
  • LAYOUTCOMMIT(file_handle, byte_range, attributes, layout): tells the server to update metadata (e.g., end-of-file and block status) so the new data is visible to other clients
  • CB_LAYOUTRECALL: the server tells the client to stop using a layout
  • CB_RECALLABLE_OBJ_AVAIL: tells the client that a delegation is now available for a file for which it was not previously available
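The toy Python simulation below mirrors the client-side flow these operations imply: get a layout from the metadata server, write stripes directly to the data servers, then commit the layout so the write becomes visible. Class names, the striping scheme, and the tiny stripe size are invented for readability; this is not the NFSv4.1 wire protocol or any real pNFS client.

    STRIPE = 4   # bytes per stripe unit (tiny, to keep the example readable)

    class DataServer:
        def __init__(self, name):
            self.name, self.blocks = name, {}
        def write(self, fh, offset, chunk):           # direct client-to-DS I/O
            self.blocks[(fh, offset)] = chunk

    class Layout:
        """Maps file byte ranges onto data servers (round-robin striping)."""
        def __init__(self, servers):
            self.servers = servers
        def stripe(self, offset, data):
            for i in range(0, len(data), STRIPE):
                unit = (offset + i) // STRIPE
                yield self.servers[unit % len(self.servers)], offset + i, data[i:i + STRIPE]

    class MetadataServer:
        def __init__(self, data_servers):
            self.data_servers, self.files = data_servers, {}
        def open(self, path):                         # create/open stays at the MDS
            return self.files.setdefault(path, {"size": 0})
        def layoutget(self, fh, offset, length):      # LAYOUTGET
            return Layout(self.data_servers)
        def layoutcommit(self, fh, offset, length):   # LAYOUTCOMMIT: update eof
            fh["size"] = max(fh["size"], offset + length)

    def pnfs_write(mds, path, offset, data):
        fh = mds.open(path)
        layout = mds.layoutget(fh, offset, len(data))
        for ds, off, chunk in layout.stripe(offset, data):
            ds.write(path, off, chunk)                # bypasses the metadata server
        mds.layoutcommit(fh, offset, len(data))       # make the new data visible

    servers = [DataServer(f"ds{i}") for i in range(3)]
    mds = MetadataServer(servers)
    pnfs_write(mds, "/export/big.dat", 0, b"stripes go to all data servers")
    print([(ds.name, sorted(ds.blocks)) for ds in servers])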
pNFS Architecture
(Diagram: heterogeneous NFSv4.1 clients (SGI, Linux, AIX) send NFSv4.1 metadata operations to an NFSv4.1 server that acts as the parallel file system metadata server, and perform parallel FS I/O directly against the parallel FS storage nodes; the metadata server and storage nodes communicate via the parallel FS management protocol.)
[MSST05] pNFS Internal Architecture
pNFS Discussion
• Heterogeneity
  • High layout-driver development cost
  • Reduce costs with standard storage protocols
• Transparency
  • File system semantics are a mix of NFSv4 and the underlying file system
  • E.g., the pNFS/PVFS2 prototype caches only metadata
• Security
  • Storage-protocol dependent
  • May restrict WAN capability
• Validation of the opaque layout design
  • Negligible layout-retrieval overhead
  • Maintains file system independence
  • A pNFS client can use multiple storage protocols simultaneously
  • The underlying file system must manage layout information
• The layout driver enables scalability and performance
  • Avoids the indirection penalty and single-server bottleneck
  • Escapes NFSv4 block-size restrictions
pNFS Small Write Performance [ICS06]
• Scientific I/O workloads
  • 90% of requests transfer 10% of the data
  • Large gap between small and large I/O requests
• File system characteristics
  • Distributed file systems: asynchronous client requests, client write-back cache, server write gathering
  • Parallel file systems: large transfer buffers limit asynchrony; write-through or no cache
• Approach: use a mixture of distributed and parallel storage protocols to increase overall write performance (see the sketch below)
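A minimal sketch of that mixture under an assumed size threshold: small writes go through the NFSv4 server path (which can gather them and write back asynchronously), while large writes take the parallel path straight to the data servers. The 64 KiB cutoff and the function names are illustrative assumptions, not values from the ICS06 work.

    SMALL_WRITE_THRESHOLD = 64 * 1024    # assumed cutoff, for illustration only

    def adaptive_write(nfs_write, pnfs_write, fh, offset, data):
        """Send small writes via the NFSv4 server (write gathering, write-back
        cache); send large writes via pNFS directly to the data servers."""
        if len(data) < SMALL_WRITE_THRESHOLD:
            nfs_write(fh, offset, data)    # distributed-FS path
        else:
            pnfs_write(fh, offset, data)   # parallel-FS path

    # Usage with stand-in callables (real code would call the two protocol stacks):
    log = []
    nfs = lambda fh, off, data: log.append(("nfs", len(data)))
    pnfs = lambda fh, off, data: log.append(("pnfs", len(data)))
    adaptive_write(nfs, pnfs, "file", 0, b"x" * 1024)        # routed to nfs
    adaptive_write(nfs, pnfs, "file", 0, b"x" * (1 << 20))   # routed to pnfs
    print(log)   # [('nfs', 1024), ('pnfs', 1048576)]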
pNFS performance scaling
• Look at pNFS performance in two scenarios:
  • A high-performance (e.g., 10 GbE) client driving a server cluster
  • Parallel/cluster clients driving a server cluster
• Test runs on a small GPFS cluster at CITI
  • Single 10 GbE client accessing a server cluster of varying size:
    iozone -aec -i 0 -i 1 -+n -r 256K -s 1G -f /mnt/nfs4/io1 -U /mnt/nfs4
  • Varying number of 1 GbE clients accessing a 1 GbE server cluster:
    mpirun -np {1-3} -machinefile /tmp/gpfs.nodes /nas/test/IOR -b 1g -t 32k -g -e -F -o /mnt/ior/t1
Accessing Remote Data: Approaches
• Admin-controlled replication
  • Requires grouping data a priori and then triggering replication
  • Replicas need to be kept in sync with the master copy
  • Replica updates done periodically or admin-driven
• Explicit file transfer, e.g., GridFTP
  • Requires keeping track of multiple copies of the data
  • Requires planning ahead (all data must be copied in advance)
  • New data has to be migrated back
• File access over the WAN, e.g., GPFS multi-cluster over TeraGrid
  • True shared file system: one copy of the data, updates propagated and visible instantly
  • Requires a high-bandwidth, low-latency WAN
  • Requires a constant connection between sites, e.g., for the duration of a parallel job
WAN Caching: The Wish List
• Scalable cache for high-performance applications
• Single namespace (not absolutely necessary, but convenient)
• Files moved transparently upon access (or pre-staged, e.g., with "touch")
• Files can be stored on disk at the cache site for as long as they are needed
• Files automatically moved back as they age out of the cache or are needed elsewhere
• Possible (and desirable) to support disconnected operation
• Cache "browsable" by the client: a local 'ls' should show the same list of files as the remote site
• Cache consistent both locally across nodes and with the remote system
Panache: Key Design Points
• A clustered cache embedded in GPFS that caches data from a remote cluster
• Applications
  • Grid computing: allowing data to move transparently during grid workflows
  • Content distribution for global enterprises
  • "Follow-the-sun" engineering teams
• Key features
  • Cluster-wide cache designed for scalability and performance
  • Integrated with GPFS for consistent and concurrent data access from all nodes of the cluster
  • No special WAN protocol or hardware
  • Uses the soon-to-be industry standard pNFS (NFSv4.1) protocol for parallel access
  • No additional hardware at either end, as required by WAFS vendors
Panache Architecture
• Completely transparent to GPFS clients and applications
  • An application running on the local cache sees the same view as if it were running on the remote cluster
  • Any client can browse the cache as a traditional filesystem
  • The cache presents a POSIX-compliant view of the cached data
  • Cache data can be re-exported over NFS
• pNFS for parallel data transfer over the WAN
• Cache consistency is "tunable"
  • NFSv4 file/directory delegations for consistency across the WAN
• Supports disconnected operation
  • Resync after reconnect, with policy-driven conflict resolution
• Supports policies for pre-fetching and pre-populating the cache along with on-demand access (sketched below)
• Supports whole-file or partial-file caching
• Planned: P2P-style sharing across caches
(Diagram: application nodes at the local GPFS cache site act as pNFS clients, fetching data over the WAN from the pNFS data servers and metadata server of the remote cluster.)
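The sketch below illustrates only the on-demand side of such a cache: fetch on miss, revalidate stale entries against the remote cluster, and serve everything else from local disk. The class, the callbacks, and the revalidation interval are invented for illustration; Panache's actual consistency machinery (delegations, pNFS transfers, disconnected operation) is much richer.

    import time

    REVALIDATE_AFTER = 30.0    # seconds; an assumed consistency knob

    class WanCache:
        """On-demand WAN cache: fetch on miss, revalidate on expiry."""
        def __init__(self, remote_fetch, remote_mtime):
            self.remote_fetch = remote_fetch    # callable(path) -> bytes
            self.remote_mtime = remote_mtime    # callable(path) -> float
            self.entries = {}                   # path -> (data, mtime, checked_at)

        def read(self, path):
            now = time.time()
            entry = self.entries.get(path)
            if entry is None:                            # miss: pull over the WAN
                return self._refresh(path, now)
            data, mtime, checked_at = entry
            if now - checked_at > REVALIDATE_AFTER:      # stale: check remote mtime
                if self.remote_mtime(path) != mtime:
                    return self._refresh(path, now)      # remote changed: re-fetch
                self.entries[path] = (data, mtime, now)  # unchanged: extend validity
            return data                                  # hit, no WAN traffic

        def _refresh(self, path, now):
            data = self.remote_fetch(path)
            self.entries[path] = (data, self.remote_mtime(path), now)
            return data

    # Usage with a toy "remote cluster":
    remote = {"/proj/results.dat": (b"remote bytes", 1.0)}
    cache = WanCache(lambda p: remote[p][0], lambda p: remote[p][1])
    print(cache.read("/proj/results.dat"))   # fetched over the WAN
    print(cache.read("/proj/results.dat"))   # served from the local cache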
Project Glamour
• An architecture for filesystem virtualization
  • Decouples the file from the physical fileserver/filesystem
• A unified "file-based" view of storage across an enterprise
  • Consolidates the "fragmented view" of heterogeneous, independent fileservers
  • Can be distributed across multiple geographical locations
• Simplifies management
  • Single point of access for setting policies, control, capacity management
Approaches to Filesystem Virtualization
• Layer-7 gateway/proxy appliance (e.g., NeoPath, Acopia)
  • In the data path, which can limit scalability
  • Uses standard protocols (including NFSv3)
  • Works with existing file servers and NAS appliances
• Server redirection (e.g., Spinnaker, Panasas)
  • Requests forwarded among backend servers using proprietary protocols
  • Vendor lock-in; does not work with existing/heterogeneous systems
  • Can be optimized for performance
• Client redirection (e.g., MS-DFS, Glamour, NuView)
  • Redirects clients to the server holding the required data
  • Needs protocol support (CIFS / NFSv4 only); does not work with v3 clients
  • Works with standard NFSv4 and CIFS clients
  • Works well for distributed clients and servers; can be very scalable
Glamour: Common Namespace
(Diagram: a single namespace rooted at / with branches such as /users/home, /storage/proj, /Tools, and /generic (CMVC, SVN), assembled from exports at multiple sites. For example, Watson exports its local /home/bob as /users/home/bob, Almaden exports its local /home/alice as /users/home/alice and its local /usr/local/glamor as /storage/proj/latest/glamor, and another site replicates Almaden:/usr/local/glamor as /storage/proj/backup/glamor.)
Glamour Architecture
(Diagram: admin tools (CLI, GUI, etc.) drive the Glamour admin component, which stores the namespace in a central LDAP server (OpenLDAP slapd with the Glamour schema as the namespace DB, via an LDB_LDAP plugin). On the client interface side, NFSv3 clients use autofs with the NIS map schema and NFSv4 clients follow RNFS referrals; on the data server interface side, a local Glamour agent runs on Glamour-enabled NFSv4 servers and GPFS.)
Namespace Features
• Common Glamour schema
  • Converted to the desired schema for NFSv3, NFSv4, or GPFS clients
  • Understood by the agent in a Glamour-enabled fileserver
• Glamour schema example: fileset and location LDIF entries (a resolution sketch follows below)

    Fileset LDIF entry:
    dn: DmDir=/dir1,DmNamespace=/,ou=locations,dc=company,dc=com
    objectclass: DmDir
    DmDir: /fileset11
    DmParent: /dir1
    DmNamespace: /
    DmType: Leaf

    Location LDIF entry:
    dn: DmLocation=server:/path/at/server,ou=locations,dc=company,dc=com
    objectclass: DmLocation
    DmLocation: server:/path/at/server
    DmDir: /fileset11
    DmParent: /dir1
    DmNamespace: /
    DmSite: us
    DmNfsType: 3
    DmHost:
    DmDomain:
    DmPath:
    DmVersion:
    DmStatus:

• Multi-protocol support
  • NFSv3 clients: LDAP option in /etc/auto.* with the NIS map schema

    dn: cn=dir1,nisMapName=/foo,dc=company,dc=com
    cn: dir1
    nisMapEntry: -fstype=autofs ldap://server,nisMapName=/foo/dir1,dc=…
    objectclass: nisObject

  • Linux NFSv4 server: LDAP option in /etc/exports with the RNFS schema

    dn: cn=dir1,ou=locations,dc=company,dc=com
    cn: dir1
    fslocations: server1:/path/at/server1
    fslocations: server2:/path/at/server2
    referral: TRUE
    objectclass: rnfs

  • CIFS support in progress
• Multi-system support: AIX, Linux, …
• Nested mounts
  • Currently supported only for NFSv4
  • No cycles
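To show how a client-side agent might consume such entries, here is a small Python sketch that resolves a logical path to its physical locations. The dictionaries stand in for LDAP search results over the DmDir/DmLocation schema above, and the server names are hypothetical; a real agent would query the LDAP server and hand the result to autofs or to the NFSv4 referral code.

    # Logical directory -> fileset name (stand-in for DmDir entries).
    FILESETS = {
        "/storage/proj/latest/glamor": "/fileset11",
    }
    # Fileset name -> list of (server:/path, NFS version) (DmLocation entries).
    LOCATIONS = {
        "/fileset11": [("almaden.example.com:/usr/local/glamor", 4),
                       ("watson.example.com:/backup/glamor", 4)],
    }

    def resolve(logical_path):
        """Walk up the logical path until a fileset junction is found, then
        return its locations together with the remaining path below it."""
        path, rest = logical_path, ""
        while path and path not in FILESETS:
            path, _, tail = path.rpartition("/")
            rest = f"{tail}/{rest}".rstrip("/")
        if not path:
            raise LookupError(f"no fileset covers {logical_path}")
        return [(loc, rest) for loc, _ver in LOCATIONS[FILESETS[path]]]

    # An NFSv4 client asking for a file below the junction would be referred
    # to either replica, with the leftover path appended at the chosen server:
    print(resolve("/storage/proj/latest/glamor/src/main.c"))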
Towards a Federated Filesystems Standard
• IBM is continuing the Glamour effort to incorporate NFSv4, NFSv3, CIFS, and even local file systems such as GPFS
• Prototype on Linux as the reference implementation
• Collaboration with NetApp and the extended NFS community on an IETF standard for administering federated filesystems
Questions?
www.almaden.ibm.com/StorageSystems/areas/filesys
tewarir at us ibm com