GPFS & StoRM Jon Wakelin University of Bristol
Pre-Amble • GPFS Basics • What it is & what it does • GPFS Concepts • More in-depth technical concepts • GPFS Topologies • HPC Facilities at Bristol • How we are using GPFS • Creating a “mock-up”/staging-service for GridPP • StoRM • Recap & References
GPFS Basics • IBM’s General Parallel File System • “Scalable high-performance parallel file system” • Numerous HA features • Life-cycle management tools • Provides POSIX and “extended” interfaces to data • Available for AIX and Linux • Only supported on AIX, RHEL and SuSE • Installed successfully on SL3.x (ask me if you are interested) • GPFS can run on a mix of these OSs • Pricing - per processor • Free version available through IBM’s Scholars program • Currently developing a new licensing model
GPFS Basics • Provides high-performance I/O • Divides files into blocks and stripes the blocks across disks (on multiple storage devices) • Reads/writes the blocks in parallel • Tuneable block sizes (depends on your data) • Block-level locking mechanism • Multiple applications can access the same file concurrently • “multiple editors can work on different parts of a single file simultaneously. This eliminates the additional storage, merging and management overhead typically required to maintain multiple copies” • Client-side data caching • Where is data cached? • Multi-Cluster Configuration • Join GPFS clusters together • Encrypted data and authentication, or just authentication • openssl and keys • Different security contexts (root squash à la NFS) – see the sketch below
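A minimal sketch of how a multi-cluster mount might be set up, assuming two clusters named siteA (owning file system gpfs1) and siteB (mounting it remotely); the cluster names, contact nodes, key paths and mount point are illustrative, and the exact steps should be checked against the GPFS multi-cluster documentation:

# On the owning cluster (siteA): generate a key pair and require authentication only
mmauth genkey new
mmauth update . -l AUTHONLY
# Register the remote cluster's public key and grant it access to gpfs1
mmauth add siteB -k /tmp/siteB_id_rsa.pub
mmauth grant siteB -f gpfs1

# On the accessing cluster (siteB): define the remote cluster and file system, then mount
mmauth genkey new
mmremotecluster add siteA -n nodeA1,nodeA2 -k /tmp/siteA_id_rsa.pub
mmremotefs add rgpfs1 -f gpfs1 -C siteA -T /gpfs/siteA
mmmount rgpfs1 -a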
GPFS Basics • Information Life-cycle Management • Tiered storage • Create groups of disks within a file system • Based on reliability, performance, location, etc. • Policy-driven automation • Automatically move, delete or replicate files - based on filename, username, or fileset • e.g. Keep the newest files on the fastest hardware, migrate them to older hardware over time • e.g. Direct files to the appropriate resource upon creation (see the policy sketch below) • Other notable points • Can specify user, group and fileset quotas • POSIX and NFS v4 ACL support • Can specify different IPs for GPFS and non-GPFS traffic • Maximum limit of 268 million disks (2048 is the default max)
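As an illustration of the policy-driven automation, a minimal sketch of a placement plus migration policy, assuming two storage pools named 'fast' and 'slow'; the pool names, file system name, paths and 30-day threshold are illustrative, and the policy grammar should be checked against the GPFS documentation for your release:

# Write a simple policy file: place new files on 'fast', migrate files
# not accessed for 30 days to 'slow'
cat > /tmp/policy.txt <<'EOF'
RULE 'placement' SET POOL 'fast'
RULE 'age-out' MIGRATE FROM POOL 'fast' TO POOL 'slow'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
EOF
# Install the placement rules for file system gpfs1
mmchpolicy gpfs1 /tmp/policy.txt
# Evaluate the migration rules as a dry run first, then for real
mmapplypolicy gpfs1 -P /tmp/policy.txt -I test
mmapplypolicy gpfs1 -P /tmp/policy.txt -I yes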
GPFS Topologies • SAN-Attached • All nodes are physically attached to all NSDs • High performance but expensive!
GPFS Topologies • Network Shared Disk (NSD) Server • A subset of nodes is physically attached to the NSDs • Other nodes forward their I/O requests to the NSD servers, which perform the I/O and pass the data back (see the sketch below)
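A minimal sketch of how NSDs with primary and backup NSD servers might be defined and turned into a file system, assuming the colon-separated disk descriptor format of the GPFS 3.x command set; the device names, server names, block size and mount point are illustrative:

# Disk descriptors: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
cat > /tmp/disks.desc <<'EOF'
/dev/sdb:nsd-srv1:nsd-srv2:dataAndMetadata:1
/dev/sdc:nsd-srv2:nsd-srv1:dataAndMetadata:2
EOF
# Turn the raw devices into NSDs, then build and mount a file system on them
mmcrnsd -F /tmp/disks.desc
mmcrfs /gpfs gpfs1 -F /tmp/disks.desc -B 1M -A yes
mmmount gpfs1 -a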
GPFS Topologies • In practice, often have a mixed NSD + SAN environment • Nodes use the SAN if they can, and the NSD servers if they can’t • If SAN connectivity fails, a SAN-attached node can fall back to using the remaining NSD servers • [Diagram: application/Linux/GPFS nodes and an NSD server connected over a Local Area Network]
GPFS Redundancy & HA • Non-GPFS • Redundant power supplies • Redundant hot-swap fans • … • RAID with hot-swappable disks (multiple IBM DS4700s) • FC with redundant paths (GPFS knows how to use these) • HA Features in GPFS • Primary and secondary Configuration Servers • Primary and secondary NSD Servers for each disk • Replicate metadata • Replicate data • Failure Groups • Specify which machines share a single point of failure • GPFS uses this info to ensure that replicas of the same data are placed in different failure groups (see the sketch below)
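Building on the NSD sketch above, a hedged example of how data and metadata replication across failure groups might be requested at file system creation time; the flag values are illustrative (-m/-r set the default metadata/data replica counts, -M/-R the maxima):

# Descriptors as before, but giving disks behind different controllers
# different failure-group numbers (last field):
#   /dev/sdb:nsd-srv1:nsd-srv2:dataAndMetadata:1
#   /dev/sdc:nsd-srv2:nsd-srv1:dataAndMetadata:2
# Create the file system with two copies of data and metadata; GPFS keeps
# the two copies of each block in different failure groups
mmcrfs /gpfs gpfs1 -F /tmp/disks.desc -m 2 -M 2 -r 2 -R 2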
GPFS Quorum • Quorum • A “majority” of the nodes must be present before access to the shared disks is allowed • Prevents subgroups from making conflicting decisions • In the event of a failure, the nodes in the minority partition suspend access while those in the majority continue • Quorum Nodes • These nodes are counted to determine whether the system is quorate • If the system is no longer quorate • GPFS unmounts the filesystem … • … waits until quorum is re-established … • … and then recovers the FS • Quorum Nodes with Tie-Breaker Disks (see the sketch below)
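A minimal sketch of how quorum nodes and tie-breaker disks might be configured for a small cluster; the node names reuse the bf* hosts from the performance slides purely for illustration, and the exact procedure (GPFS stopped cluster-wide while tiebreakerDisks is changed) should be checked against the documentation:

# Node descriptor file: hostname:designation
cat > /tmp/nodes.txt <<'EOF'
bf39:quorum-manager
bf40:quorum
bf41:client
EOF
mmcrcluster -N /tmp/nodes.txt -p bf39 -s bf40 -r /usr/bin/ssh -R /usr/bin/scp
# With only a few quorum nodes, nominate tie-breaker disks instead
# (GPFS must be down across the cluster while this is changed)
mmshutdown -a
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"
mmstartup -a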
GPFS Performance • Preliminary results using time dd if=/dev/zero of=testfile bs=1k count=2000000 • Multiple write processes on the same node: 1 process 90 MB/s, 2 processes 51 MB/s, 4 processes 18 MB/s • Multiple write processes from different nodes: 1 process 90 MB/s, 2 processes 58 MB/s, 4 processes 28 MB/s, 5 processes 23 MB/s
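A sketch of how the multiple-writer test on a single node might be scripted; the writer count N, target directory and file sizes are illustrative:

# Launch N parallel dd writers into the GPFS file system and time them all
N=4
time (
  for i in $(seq 1 $N); do
    dd if=/dev/zero of=/gpfs/testfile_$i bs=1k count=2000000 &
  done
  wait
)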
GPFS Performance • In a hybrid environment (SAN-attached and NSD Server nodes) • Read/Writes from SAN-attached nodes place little load on the NSD servers • Read/Writes from other nodes place a high load on the NSD servers • SAN-attached:
[root@bf39 gpfs]# time dd if=/dev/zero of=file_zero count=2048 bs=1024k
real 0m31.773s
[root@bf40 GPFS]# top -p 26651
26651 root 0 -20 1155m 73m 7064 S 0 1.5 0:10.78 mmfsd
• Via NSD Server:
[root@bfa-se /]# time dd if=/dev/zero of=/gpfs/file_zero count=2048 bs=1024k
real 0m31.381s
[root@bf40 GPFS]# top -p 26651
26651 root 0 -20 1155m 73m 7064 S 34 1.5 0:10.78 mmfsd
Bristol HPC Facilities • Bristol, IBM, ClearSpeed and ClusterVision • BabyBlue - installed Apr 2007 • Currently undergoing acceptance trials • BlueCrystal ~Dec 2007 • Testing • A number of “pump-priming” projects have been identified • The majority of users will develop, or port, code directly on the HPC system • Only make changes at the application level • GridPP • Requires system-level changes • Pool accounts, world-addressable slaves, NAT, running services and daemons • Instead we will build a testing/staging system for GridPP • In-house and loan equipment from IBM • A reasonable analogue of the HPC facilities • No InfiniBand (but you wouldn’t use it anyway)
Bristol HPC Facilities • BabyBlue • Torque/Maui, SL 4 “Worker Node”, RHEL4 (maybe AIX) on head-nodes • IBM 3455 • 96 dual-core, dual-socket 2.6GHz AMD Opterons • 4? ClearSpeed Accelerator boards • 8GB RAM per node (2GB per core) • IBM DS4700 + EXP810, 15TB transient storage • SAN/FC network running GPFS • BlueCrystal – c. Dec 2007 • Torque/Moab • 512 dual-core, dual-socket nodes (or quad-core depending on timing) • 8GB RAM per node (1GB or 2GB per core) • 50TB storage, SAN/FC network running GPFS • Server Room • 48 water-cooled APC racks – 18 will be occupied by HPC; Physics servers may be co-located • 3 × 270kW chillers (space for 3 more)
GPFS MiniBlue • [Diagram: test cluster with primary and secondary configuration servers (p-Config, s-Config), primary and secondary NSD servers (p-NSD, s-NSD) and quorum nodes, backed by an IBM DS4500 configured with hot spares]
StoRM • StoRM is a storage resource manager for disk-based storage systems • Implements the SRM interface version 2.2 • StoRM is designed to support guaranteed space reservation and direct access (using native POSIX I/O calls) • StoRM takes advantage of high-performance parallel file systems • GPFS, XFS and Lustre(?) • Standard POSIX file systems are also supported • Direct access to files from “Worker Nodes” (see the sketch below) • Compare with Castor, dCache and DPM
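A hedged illustration of the direct-access model: if a worker node mounts the same GPFS file system, the SURL can be resolved to a file:// TURL and the file read with ordinary POSIX I/O. The endpoint, port and path below are invented for illustration, and lcg-gt is just one client that can request a TURL for a given protocol:

# Ask the SRM for a TURL using the "file" protocol (endpoint/path illustrative)
lcg-gt 'srm://storm.example.ac.uk:8444/srm/managerv2?SFN=/gpfs/dteam/testfile' file
# The returned TURL (e.g. file:///gpfs/dteam/testfile) can then be read
# directly with POSIX calls from the worker node
cat /gpfs/dteam/testfile > /dev/null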
StoRM architecture • Front end (FE): • Exposes the web service interface • Manages user authentication • Sends requests to the BE • Database (DB): • Stores SRM requests and their status • Stores file and space information • Back end (BE): • Binds to the underlying file systems • Enforces authorization policy on files • Manages SRM file and space metadata
StoRM miscellaneous • Scalability and high availability • FE, DB and BE can be deployed on different machines • StoRM is designed to be configured with n FEs and m BEs sharing a common DB • Installation (relatively straightforward) • RPM & YAIM (FE, BE and DB all on one server) – see the sketch below • Additional manual configuration steps • e.g. namespace.xml, Information Providers • Not completely documented yet • Mailing list • CNAF x2 and Bristol • Basic tests - http://lxdev25.cern.ch/s2test/basic/history/ • Use Case tests - http://lxdev25.cern.ch/s2test/usecase/history/ • Currently still differences between Bristol and CNAF installations
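For completeness, a hedged sketch of the RPM & YAIM install path with FE, BE and DB on one host; the StoRM node-type names vary between releases and are assumptions here, so take them (and the required site-info variables) from the StoRM/YAIM release notes rather than from this slide:

# Configure FE, BE and DB on a single host with YAIM
# (node-type names below are assumptions - check the StoRM release notes)
/opt/glite/yaim/bin/yaim -c -s site-info.def -n se_storm_backend -n se_storm_frontend
# Then complete the manual steps noted above, e.g. namespace.xml and the
# dynamic information providers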
Summary • GPFS • Scalable high-performance file system • Highly available, built on redundant components • Tiered storage or multi-cluster configuration for GridPP work • HPC • University-wide facility – not just for PP • GridPP requirements rather different from those of general/traditional HPC users • Build an “analogue” of the HPC system for GridPP • StoRM • Better performance because StoRM builds on the underlying parallel file system (GPFS) • Also, a more appropriate data transfer model – POSIX and the “file” protocol
References • GPFS • http://www-03.ibm.com/systems/clusters/software/gpfs.pdf • http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf • http://www-03.ibm.com/systems/clusters/software/whitepapers/gpfs_intro.pdf • StoRM • http://hst.home.cern.ch/hst/publications/storm_chep06.pdf • http://agenda.cnaf.infn.it/getFile.py/access?contribId=10&resId=1&materialId=slides&confId=0