First experiences with large SAN storage in a Linux cluster
Jos van Wezel, Institute for Scientific Computing, Karlsruhe, Germany (jvw@iwr.fzk.de)
Overview
• The GridKa center
• SAN and parallel storage
  • hardware
  • software
• Performance test results
• NFS load balancing
GridKa in a nutshell
• Test environment for LHC (ALICE, ATLAS, CMS, LHCb)
• Test and development environment for CrossGrid, LCG, ...
• Production platform for BaBar, CDF, D0, Compass
• Tier 1 for LHC after 2007
• 2003: 500 CPUs, 120 TB disk storage, 350 TB tape storage, 2 Gb network feed
• 2007 (est.): 4000 CPUs, 1200 TB disk storage, 3500 TB tape storage, ? Gb network feed
SAN components
• Disk racks
  • 5 x 140 x 146 GB FC disks
  • 5 x 4-port LSI controllers
  • 70 x 1 TB LUNs
• File servers
  • 2 x 2.4 GHz Xeon
  • 1.5 GB memory, 18 GB disk
  • QLogic 2312 HBA
  • 2 x Broadcom 1 Gb Ethernet
• Fibre Channel switch
  • 128 x 2 Gb ports
  • non-blocking fabric
Storage Area Network
Advantages
• Easier administration: all storage is seen on all file servers
• Expansion or exchange during production: less downtime
• Load balancing / redundancy: several paths to the same storage
• Very fast: 2 Gb/s (200 MB/s)
• Scalable: add switches and controllers
• Small overhead and CPU load: the HBA handles the protocol
Disadvantages
• Expensive: $1000 per HBA or port
  • prices will drop with iSCSI
  • a (S)ATA + IP solution would need more (host) controllers and a separate LAN
• No standardized (fabric) management
With an estimated 4000 disks in 2007 there is no manageable alternative. Expect a (SCSI) disk to fail every 2 weeks.
GPFS features
• Parallel file system
  • read/write to the same file from more than one node (sketched below)
• Highly scalable
  • scales with the number of disks
  • scales with the number of nodes
  • distributed locking
• UNIX/POSIX I/O semantics
• Large volume sizes
  • currently 18 TB (Linux ext2 max: 1.9 TB)
• VFS layer: can be exported via e.g. NFS
• Extensive fault tolerance and recovery possibilities
  • survives failed nodes
  • mirroring
  • fsck
• On-line file system expansion and disk replacement
• Proprietary IBM product
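A minimal sketch (not from the slides) of what the POSIX semantics of a parallel file system allow: several nodes open the same file on the shared mount and each writes its own disjoint region. The mount point /gpfs/data, the NODE_RANK environment variable, and the region size are assumptions for illustration only.

```python
import os

MOUNT = "/gpfs/data"                     # assumed GPFS mount point
RANK = int(os.environ.get("NODE_RANK", "0"))   # assumed per-node rank, e.g. from the batch system
REGION = 64 * 1024 * 1024                # 64 MB region written by each node

path = os.path.join(MOUNT, "shared.dat")
buf = b"\0" * REGION

# Every node opens the same file; the parallel file system's distributed
# locking keeps concurrent writes to disjoint regions consistent.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
try:
    os.pwrite(fd, buf, RANK * REGION)    # write this node's region at its own offset
finally:
    os.close(fd)
```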
[Diagram] GPFS functional diagram: nodes each run applications on top of GPFS and the FC driver, are interconnected by an IP network, and reach a shared disk collection through file servers and a Fibre Channel switch fabric.
GridKa scalable IO design . . . n Compute nodes IP/TCP/NFS Expansion (disks, servers) Servers SAN/SCSI Fibre Channel RAID storage Jos van Wezel / ACAT03
Environment and throughput tests
• Setup
  • 10 file systems / mount points
  • each file system comprises 5 RAID5 (7+P) groups
  • kernel 2.4.20-18 on servers
  • kernel 2.4.18-27 on clients
  • NFS v3 on server and clients (over UDP)
• Random I/O
  • multiple threads on a GPFS node
• Sequential I/O
  • write/read on a GPFS node
  • write/read over NFS to GPFS
(A sketch of this kind of threaded test follows below.)
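A rough sketch of a multi-threaded sequential/random write test of the kind described above, not the actual benchmark used at GridKa: each thread writes fixed-size blocks to its own file on one mount point and the aggregate rate is reported. The mount point, block size, and file size are assumptions.

```python
import os
import random
import threading
import time

MOUNT = "/gpfs/fs01"          # assumed GPFS mount point
BLOCK = 1024 * 1024           # 1 MB blocks
BLOCKS_PER_THREAD = 512       # 512 MB written per thread
RANDOM_IO = False             # True: write blocks at random offsets instead of sequentially

def worker(tid):
    buf = os.urandom(BLOCK)
    fd = os.open(os.path.join(MOUNT, f"bench.{tid}"),
                 os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        for i in range(BLOCKS_PER_THREAD):
            off = (random.randrange(BLOCKS_PER_THREAD) if RANDOM_IO else i) * BLOCK
            os.pwrite(fd, buf, off)
        os.fsync(fd)          # flush so the timing covers real disk I/O
    finally:
        os.close(fd)

def run(threads):
    start = time.time()
    ts = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    mb = threads * BLOCKS_PER_THREAD * BLOCK / 1e6
    print(f"{threads} threads: {mb / (time.time() - start):.1f} MB/s")

for n in (1, 2, 4, 8):
    run(n)
```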
[Plot] Read and write throughput (MB/s) vs. number of threads on one GPFS node.
[Plot] Accumulated read and write throughput (MB/s) as a function of the number of nodes / RAID arrays.
[Plot] Accumulated read and write throughput (MB/s) as a function of the number of NFS clients.
NFS load balancing (components)
• Uses the automounter with a program map
• Transparent for existing NIS maps
• DNS is used to add or remove servers via multiple PTR records
• Program map algorithm, for a given key (sketched below):
  • retrieve the existing NIS entry
  • find the DNS host(s)
  • select a host randomly if there is more than one
  • check availability with nfsping
  • return the NIS entry with the hostname replaced where necessary
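A hedged sketch of the program-map logic outlined above, written in Python rather than whatever script GridKa actually used. The autofs program map is called with the mount key as its argument and must print the resulting map entry on stdout. The NIS map name auto.data, the load-balanced DNS name nfs-lb.gridka.example, and the use of rpcinfo as a stand-in for the nfsping check are all assumptions.

```python
#!/usr/bin/env python3
# Automounter program map: autofs calls this with the key as argv[1] and
# expects the map entry (options + location) on stdout.
import random
import socket
import subprocess
import sys

NIS_MAP = "auto.data"                  # assumed NIS map name
LB_NAME = "nfs-lb.gridka.example"      # assumed DNS name carrying the server records

def nis_entry(key):
    # Retrieve the existing NIS entry, e.g. "-rw,hard server1:/export/data"
    return subprocess.check_output(["ypmatch", key, NIS_MAP], text=True).strip()

def candidate_hosts():
    # All servers currently published behind the load-balanced DNS name
    _, _, addrs = socket.gethostbyname_ex(LB_NAME)
    return [socket.gethostbyaddr(a)[0] for a in addrs]

def alive(host):
    # The slides mention "nfsping"; an rpcinfo query of the NFS service stands in here.
    return subprocess.call(["rpcinfo", "-u", host, "nfs"],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def main():
    key = sys.argv[1]
    entry = nis_entry(key)
    parts = entry.rsplit(None, 1)
    options = parts[0] if len(parts) == 2 else ""
    host_path = parts[-1]
    _, path = host_path.split(":", 1)
    hosts = [h for h in candidate_hosts() if alive(h)]
    if hosts:                          # replace the hostname only if a live server was found
        host_path = f"{random.choice(hosts)}:{path}"
    print(f"{options} {host_path}".strip())

if __name__ == "__main__":
    main()
```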
NFS load balancing (result)
[Chart] Number of mounts per server over recent weeks.
Conclusions
• SAN and Linux go together very well
• NFS on recent Linux kernels is a huge improvement
• The GPFS/NFS combination is a viable cluster storage solution
• The relationship between local server throughput and NFS throughput is still unclear
Work to do
• Improve the ratio of local server throughput to NFS throughput
• Improve write behavior
• Connection to background storage (tape via dCache)
Thank you: Manfred Alef, Ruediger Berlich, Michael Gehle, Marcus Hardt, Bruno Hoeft, Axel Jaeger, Melanie Knoch, Marcel Kunze, Holger Marten, Klaus-Peter Mickel, Doris Ressmann, Ulrich Schwickerath, Bernhard Verstege, Ingrid Schaeffner