ADIC / CASPUR / CERN / DataDirect / ENEA / IBM / RZ Garching / SGI
New results from CASPUR Storage Lab
Andrei Maslennikov, CASPUR Consortium, May 2004
Participated: • ADIC Software : E.Eastman • CASPUR : A.Maslennikov(*), M.Mililotti, G.Palumbo • CERN : C.Curran, J.Garcia Reyero, M.Gug, A.Horvath, J.Iven, • P.Kelemen, G.Lee, I.Makhlyueva, B.Panzer-Steindel, • R.Többicke, L.Vidak • DataDirect Networks : L.Thiers • ENEA : G.Bracco, S.Pecoraro • IBM : F.Conti, S.De Santis, S.Fini • RZ Garching : H.Reuter • SGI : L.Bagnaschi, P.Barbieri, A.Mattioli • (*) Project Coordinator A.Maslennikov - May 2004 - SLAB update
Sponsors for these test sessions: • ACAL Storage Networking : Loaned a 16-port Brocade switch • ADIC Software : Provided the StorNext file system product, actively participated in tests • DataDirect Networks : Loaned an S2A 8000 disk system, actively participated in tests • E4 Computer Engineering : Loaned 10 assembled biprocessor nodes • Emulex Corporation : Loaned 16 fibre channel HBAs • IBM : Loaned a FAStT900 disk system and the SANFS product complete with 2 MDS units, actively participated in tests • Infortrend-Europe : Sold 4 EonStor disk systems at a discount price • INTEL : Donated 10 motherboards and 20 CPUs • SGI : Loaned the CXFS product • Storcase : Loaned an InfoStation disk system A.Maslennikov - May 2004 - SLAB update
Contents • Goals • Components under test • Measurements: • - SATA/FC systems • - SAN File Systems • - AFS Speedup • - Lustre (preliminary) • - LTO2 • Final remarks A.Maslennikov - May 2004 - SLAB update
Goals for these test series • Performance of low-cost SATA/FC disk systems • Performance of SAN File Systems • AFS Speedup options • Lustre • Performance of LTO-2 tape drive A.Maslennikov - May 2004 - SLAB update
Components • Disk systems: • 4x Infortrend EonStor A16F-G1A2 16-bay SATA-to-FC arrays: • Maxtor Maxline Plus II 250 GB SATA disks (7200 rpm) • Dual Fibre Channel outlet at 2 Gbit • Cache: 1 GB • 2x IBM FAStT900 dual controller arrays with SATA expansion units: • 4x EXP100 expansion units with 14 Maxtor SATA disks of the same type • Dual Fibre Channel outlet at 2 Gbit • Cache: 1 GB • 1x StorCase InfoStation 12-bay array: • same Maxtor SATA disks • Dual Fibre Channel outlet at 2 Gbit • Cache: 256 MB • 1x DataDirect S2A 8000 system: • 2 controllers with 74 FC disks of 146 GB • 8 Fibre Channel outlets at 2 Gbit • Cache: 2.56 GB A.Maslennikov - May 2004 - SLAB update
Infortrend EonStor A16F-G1A2 • - Two 2Gbps Fibre Host Channels • - RAID levels supported: RAID 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD • - Multiple arrays configurable with dedicated or global hot spares • - Automatic background rebuild • - Configurable stripe size and write policy per array • - Up to 1024 LUNs supported • - 3.5", 1" high 1.5Gbps SATA disk drives • - Variable stripe size per logical drive • - Up to 64TB per LD • - Up to 1GB SDRAM
FAStT900 Storage Server • - 2 Gbps SFP host ports • - Expansion units: EXP700 (FC) / EXP100 (SATA) • - Four switched-fabric (FC-SW) or eight direct (FC-AL) host connections • - Four (redundant) 2 Gbps drive channels • - Capacity: min 250 GB – max 56 TB (14 disks x EXP100 SATA); min 32 GB – max 32 TB (14 disks x EXP700 FC) • - Dual-active controllers • - Cache: 2 GB • - RAID support: 0, 1, 3, 5, 10
StorCase Fibre-to-SATA • - SATA and Ultra ATA/133 drive interface • - 12 hot-swappable drives • - Switched or FC-AL host connections • - RAID levels: 0, 1, 0+1, 3, 5, 30, 50 and JBOD • - Dual Fibre 2 Gbps host ports • - Supports up to 8 arrays and 128 LUNs • - Up to 1 GB PC200 DDR cache memory
DataDirect S²A8000 • - Single 2U S2A8000 with Four 2Gb/s Ports or Dual 4U • with Eight 2Gb/s Ports • - Up to 1120 Disk Drives; 8192 LUNs supported • - 5TB to 130TB with FC Disks, 20TB to 250TB with SATA disks • - Sustained Performance well over 1GB/s (1.6 GB/s theoretical) • - Full Fibre-Channel Duplex Performance on every port • - PowerLUN™ 1 GB/s+ individual LUNs without host-based striping • - Up to 20GB of Cache, LUN-in-Cache Solid State Disk functionality • - Real time Any to Any Virtualization • - Very fast rebuild rate
Components • High-end Linux units for both servers and clients • Biprocessor Pentium IV Xeon 2.4+ GHz, 1GB RAM • Qlogic QLA2300 2Gbit or Emulex LP9xxx Fibre Channel HBAs • Network • 2x Dell 5224 GigE switches • SAN • Brocade 3800 switch – 16 ports (test series 1) • Qlogic Sanbox 5200 – 32 ports (test series 2) • Tapes • 2x IBM Ultrium LTO2 (3580-TD2, Rev: 36U3 ) A.Maslennikov - May 2004 - SLAB update
Qlogic SANbox 5200 Stackable Switch • - 8, 12 or 16 auto-detecting 2Gb/1Gb device ports with 4-port incremental upgrades • - Stacking of up to 4 units for 64 available user ports • - Interoperable with all FC-SW-2 compliant Fibre Channel switches • - Full-fabric, public-loop or switch-to-switch connectivity on 2Gb or 1Gb front ports • - "No-Wait" routing: guaranteed maximum performance independent of data traffic • - Supports traffic between switches, servers and storage at up to 10Gb/s • - Low cost: the 5200/16p is less than half the price of the Brocade 3800/16p • - May be upgraded in 8-port steps
IBM LTO Ultrium 2 Tape Drive Features • - 200 GB native capacity (400 GB compressed) • - 35 MB/s native (70 MB/s compressed) • - Native 2Gb FC interface • - Backward read/write compatibility with Ultrium 1 cartridges • - 64 MB buffer (vs 32 MB in Ultrium 1) • - Speed matching, channel calibration • - 512 tracks vs. 384 tracks in Ultrium 1 • - Faster load/unload time, data access time and rewind time
SATA / FC Systems A.Maslennikov - May 2004 - SLAB update
SATA / FC Systems – hw details • Typical array features: • - single or dual (active-active) controller • - up to 1 GB of RAID cache • - battery backup to preserve the cache contents during power cuts • - 8 to 16 drive slots • - cost: 4-6 KUSD per 12/16-bay unit (Infortrend, Storcase) • Case and backplane design directly impact the disks' lifetime: • - protection against inrush currents • - protection against rotational vibration • - orientation (horizontal better than vertical – remark by A.Sansum) • Infortrend EonStor: well engineered (removable controller module, lower vibration, horizontal orientation) • Storcase: special protection against inrush currents ("soft-start" drive power circuitry), low vibration A.Maslennikov - May 2004 - SLAB update
SATA / FC Systems – hw details • High-capacity ATA/SATA disk drives: • - 250 GB (Maxtor, IBM), 400 GB (Hitachi) • - RPM: 7200 • - improved quality: 3-year warranty, 5-year component design lifetime • CASPUR experience with Maxtor drives: • - In 1.5 years we lost 5 drives out of ~100, 2 of them due to power cuts • - Factory quality of the recent Maxtor Maxline Plus II 250 GB disks: out of 66 disks purchased, 4 had to be replaced shortly after delivery. The others stand up to the stress very well • Learned during this meeting: • - RAL's annual failure rate is 21 out of 920 Maxtor Maxline drives A.Maslennikov - May 2004 - SLAB update
SATA / FC Systems – test setup • [Setup: 16 nodes (2x 2.4+ GHz, Qlogic 2310F HBA), 2x Qlogic SANbox 5200 FC switches, Dell 5224 GigE switch; disk systems: 4x IFT A16F-G1A2, 4x IBM FAStT900, StorCase InfoStation] • Parameters to select / tune: • - stripe size for RAID-5 • - SCSI queue depth on the controller and on the Qlogic HBAs • - number of disks per logical drive • In the end we were working with RAID-5 LUNs composed of 8 HDs each • Stripe size: 128K (and 256K in some tests) A.Maslennikov - May 2004 - SLAB update
SATA / FC tests – kernel and fs details • Kernel settings: • - Kernels: 2.4.20-30.9smp, 2.4.20-20.9.XFS1.3.1smp • - vm.bdflush: "2 500 0 0 500 1000 20 10 0" • - vm.max(min)-readahead: 256(127) for large streaming writes, 4(3) for random reads with small block sizes • File systems: • - EXT3 (128k RAID-5 stripe size): • fs options: "-m 0 -j -J size=128 -R stride=32 -T largefile4" • mount options: "data=writeback" • - XFS 1.3.1 (128k RAID-5 stripe size): • fs options: "-i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k" • mount options: "logbsize=262144,logbufs=8" • (a minimal command sketch follows below) A.Maslennikov - May 2004 - SLAB update
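For reference, a minimal shell sketch of how these settings could be applied; the device names (/dev/sdb1, /dev/sdc1) and mount points are illustrative placeholders, not the lab's actual configuration:

  #!/bin/bash
  # VM tuning used for the streaming-write runs (2.4 Red Hat kernels)
  sysctl -w vm.bdflush="2 500 0 0 500 1000 20 10 0"
  sysctl -w vm.max-readahead=256
  sysctl -w vm.min-readahead=127     # use 4/3 instead for the random-read tests

  # EXT3 on a RAID-5 LUN with 128k stripes (8 disks: 7 data + 1 parity)
  mke2fs -m 0 -j -J size=128 -R stride=32 -T largefile4 /dev/sdb1
  mount -t ext3 -o data=writeback /dev/sdb1 /fs/ext3

  # XFS 1.3.1 on an identical LUN, stripe-aligned allocation groups and log
  mkfs.xfs -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k /dev/sdc1
  mount -t xfs -o logbsize=262144,logbufs=8 /dev/sdc1 /fs/xfs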
SATA / FC tests – benchmarks used • Large serial writes and reads: • - "lmdd" from the "lmbench" suite: http://sourceforge.net/projects/lmbench • typical invocation: lmdd of=/fs/file bs=1000k count=8000 fsync=1 • Random reads: • - Pileup benchmark (Rainer.Toebbicke@cern.ch), designed to emulate the disk activity of multiple data analysis jobs: • 1) a series of 2 GB files is created in the destination directory • 2) these files are then read at random offsets by many concurrent threads • (a rough shell approximation of this access pattern is sketched below) A.Maslennikov - May 2004 - SLAB update
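To make the access pattern concrete, here is a rough shell approximation of what Pileup does. This is not the original benchmark; the file count, chunk size and thread count are illustrative only:

  #!/bin/bash
  FS=/fs/pileup; NFILES=8; THREADS=16
  mkdir -p $FS
  # phase 1: create a series of 2 GB files
  for i in $(seq 1 $NFILES); do
      lmdd of=$FS/file$i bs=1000k count=2000 fsync=1
  done
  # phase 2: many concurrent readers, each fetching random 1 MB chunks
  for t in $(seq 1 $THREADS); do
      ( for n in $(seq 1 200); do
            f=$FS/file$(( (RANDOM % NFILES) + 1 ))
            dd if=$f of=/dev/null bs=64k count=16 skip=$(( RANDOM % 30000 )) 2>/dev/null
        done ) &
  done
  wait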
SATA / FC results • EXT3 results – filling 1.7 TB with 8 GB files: • IFT systems show anomalous behaviour with the EXT3 file system: performance varies along the file system, and the effect visibly depends on the RAID-5 stripe size [plots: 32K, 128K, 256K stripes]. • The problem was reproduced and understood by Infortrend; new firmware is due in July. A.Maslennikov - May 2004 - SLAB update
SATA / FC results • IBM FAStT and Storcase behave in a more predictable manner with EXT3. • Both systems may nevertheless lose up to 20% in performance along the file system. A.Maslennikov - May 2004 - SLAB update
SATA / FC results • XFS results – filling 1.7 TB with 8 GB files: • The situation changes radically with this file system: the curves become almost flat and everything is much faster than with EXT3 [plots: IBM, StorCase, Infortrend]. • Infortrend and Storcase show comparable write speeds of about 135-140 MB/sec; IBM is much slower on writes (below 100 MB/sec). • Read speeds are visibly higher thanks to the controllers' read-ahead function (the IBM and IFT systems had 1 GB of RAID cache, Storcase only 256 MB). A.Maslennikov - May 2004 - SLAB update
SATA / FC results • Pileup tests: • These tests were done only on the IFT and Storcase systems. The results depend to a large extent on the number of threads that access the previously prepared files (beyond a certain number of threads, performance may drop because the test machine struggles to handle that many threads at a time). • The best result was obtained with the Infortrend array and the XFS file system. A.Maslennikov - May 2004 - SLAB update
SATA / FC results • Operation in degraded mode: • We tried it on a single Infortrend LUN of 5 HDs with EXT3. One of the disks was removed and the rebuild process was started. • The write speed went down from 105 to 91 MB/sec. • The read speed went down from 105 to 28 MB/sec, and at times even less. A.Maslennikov - May 2004 - SLAB update
SATA / FC results - conclusions • 1) The recent low-cost SATA-to-FC disk arrays (Infortrend, Storcase) operate very well and are able to deliver excellent I/O speeds, far exceeding that of Gigabit Ethernet. The cost of such systems may be as low as 2.5 USD/raw GB. Their quality is dominated by the quality of the SATA disks. • 2) The choice of local file system is fundamental: XFS easily outperforms EXT3. On one occasion we observed an XFS hang under very heavy load; "xfs_repair" was run and the error never reappeared. We are now planning to investigate this in depth. CASPUR AFS and NFS servers are all XFS-based, and there has been only one XFS-related problem since we put XFS into production 1.5 years ago. But perhaps we were simply lucky. A.Maslennikov - May 2004 - SLAB update
SAN File Systems A.Maslennikov - May 2004 - SLAB update
SAN FS Placement • These advanced distributed file systems allow clients to operate directly on block devices (block-level file access); metadata traffic goes via GigE. A Storage Area Network is required. • The current cost of a single fibre channel connection is > 1000 USD: • switch port, min ~500 USD including GBIC • host bus adapter, min ~800 USD • Special discounts for massive purchases are not impossible, but it is very hard to imagine the cost per connection dropping below 600-700 USD in the near future. • A SAN FS with native fibre channel connection is therefore still not an option for large farms. A SAN FS with an iSCSI connection may be re-evaluated in combination with the new iSCSI-SATA disk arrays. A.Maslennikov - May 2004 - SLAB update
SAN File Systems • Where SAN File Systems with FC connection may be used: • 1) High Performance Computing – fast parallel I/O, faster sequential I/O • 2) Hybrid SAN / NAS systems: relatively small number of SAN clients • acting as (also redundant) NAS servers • 3) HA Clusters with file locking : Mail (shared pool), Web etc A.Maslennikov - May 2004 - SLAB update
SAN File Systems • So far we have tried these products: • 0) Sistina GFS (see our 2002 and 2003 reports) • 1) ADIC StorNext File System • 2) IBM SANFS (StorTank) (preliminary; we continue looking into it) • 3) SGI CXFS (work in progress) A.Maslennikov - May 2004 - SLAB update
SAN File Systems • [Setup: 16 nodes (2x 2.4+ GHz, Qlogic 2310F HBA), 2x Qlogic SANbox 5200, Dell 5224 GigE switch, 4x IFT A16F-G1A2, 4x IBM FAStT900; IA32 IBM StorTank MDS; Origin 200 CXFS MDS] • What was measured (StorNext and StorTank): • 1) Aggregate write and read speeds on 1, 7 and 14 clients • 2) Aggregate Pileup speed on 1, 7 and 14 clients accessing: A) different sets of files, B) the same set of files • During these tests we used 4 LUNs of 13 HDs each, as recommended by IBM • For each SAN FS we tried both the IFT and FAStT disk systems • (a sketch of how such an aggregate run can be driven follows below) A.Maslennikov - May 2004 - SLAB update
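Purely for illustration, a minimal sketch of how an aggregate-throughput run over several clients could be driven; the host names, mount point and file size are placeholders, and this is not the harness actually used in the lab:

  #!/bin/bash
  # Start one lmdd writer per client, then sum the reported MB/s figures.
  CLIENTS="node01 node02 node03 node04 node05 node06 node07"   # 1, 7 or 14 hosts
  MNT=/sanfs/bench
  for h in $CLIENTS; do
      # each client writes its own 8 GB file on the shared SAN file system
      ssh $h "lmdd of=$MNT/$h.dat bs=1000k count=8000 fsync=1" &
  done
  wait    # aggregate speed = sum of the per-client MB/s printed by lmdd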
SAN File Systems • Large sequential files: • StorNext and StorTank behave in a similar manner on writes; StorNext does better on reads. The IBM disk systems perform better than IFT on reads with multiple clients. [Tables: IBM StorTank, ADIC StorNext; all numbers in MB/sec] A.Maslennikov - May 2004 - SLAB update
SAN File Systems • Pileup tests: • StorTank clearly outperforms StorNext in this type of benchmark. The results are very interesting: it turns out that peak Pileup speeds with StorTank on a single client may reach GigE speed (case of the IFT disks). [Tables: IBM StorTank (unstable for IFT with more than 1 client), ADIC StorNext; all numbers in MB/sec] A.Maslennikov - May 2004 - SLAB update
SAN File Systems • CXFS experience: • MDS: SGI Origin 200 with 1 GB of RAM (IRIX 6.5.22), 4 IFT arrays. • The first numbers were not bad, but with 4 or more clients the system becomes unstable (when all of them are used at the same time, one client hangs). That is what we have observed so far. • We are currently investigating the problem together with SGI. A.Maslennikov - May 2004 - SLAB update
SAN File Systems • StorNext on the DataDirect system • [Setup: 2x S2A8000 with 8 FC outlets, 2x Brocade 3800, 16 nodes (2x 2.4+ GHz, Emulex LP9xxx HBAs), Dell 5224 GigE switch] • - The S2A 8000 came with FC disks, although we had asked for SATA • - Quite easy to configure, extremely flexible • - Multiple levels of redundancy, small declared performance degradation during rebuilds • - We ran only large serial write and read 8 GB lmdd tests using all the available power. A.Maslennikov - May 2004 - SLAB update
SAN File Systems – some remarks • - The performance of a SAN file system is quite close to that of the disk hardware it is built upon (case of native FC connection). • - StorNext is the easiest to configure; it does not require a standalone MDS. It works smoothly with all kinds of disk systems, FC switches, etc. We were able to export it via NFS (sketched below), but with a loss of 50% of the available bandwidth. iSCSI=? • - StorTank is probably the most solid implementation of a SAN FS, and it has a lot of useful options. It delivers the best numbers for random reads and may be considered a good candidate for relatively small clusters with native FC connections dedicated to express data analysis. It may have issues with 3rd-party disks. Supports iSCSI. • - CXFS uses the very performant XFS base and hence should have good potential, although the 2 TB file system size limit on Linux/32bit is a real limitation (the same is true for GFS). Some functions, like MDS fencing, require particular hardware. iSCSI=? • - MDS load: small for StorNext and CXFS, quite high for StorTank. A.Maslennikov - May 2004 - SLAB update
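For completeness, a minimal sketch of how a SAN FS mount can be re-exported over NFS from one of the SAN clients; the paths and network below are illustrative, not our actual configuration, and depending on the kernel NFS implementation a cluster file system may additionally need an explicit fsid= export option:

  # On the SAN client acting as a NAS head, add a line like this to /etc/exports:
  #   /stornext/data  192.168.1.0/24(rw,sync,no_root_squash)
  exportfs -ra                              # publish the export
  # On a farm node without an FC connection:
  mount -t nfs nashead:/stornext/data /data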
AFS Speedup A.Maslennikov - May 2004 - SLAB update
AFS speedup options • - AFS performance for large files is quite poor (max 35-40 MB/sec even on very performant hardware). To a large extent this is due to the limitations of the Rx RPC protocol and to a less-than-optimal implementation of the file server. • - One possible workaround is to replace the Rx protocol with an alternative one in all cases where it is used for file serving. We evaluated two such experimental implementations: • 1) AFS with OSD support (Rainer Toebbicke). Rainer stores AFS data inside Object-based Storage Devices (OSDs), which need not reside inside the AFS file servers. The OSD performs basic space management and access control and is implemented as a Linux daemon in user space on an EXT2 file system. The AFS file server acts only as an MDS. • 2) Reuter's Fast AFS (Hartmut Reuter). In this approach, AFS partitions (/vicepXX) are made visible on the clients with a fast SAN or NAS mechanism. As in case 1), the AFS file server acts as an MDS and directs the clients to the right files inside /vicepXX for faster data access. A.Maslennikov - May 2004 - SLAB update
AFS speedup options • Both methods worked! • The AFS/OSD scheme was tested during the Fall 2003 test session; the tests were done with DataDirect's S2A 8000 system. In one particular test we were able to achieve a 425 MB/sec write speed with both the native EXT2 and the AFS/OSD configurations. • Reuter's AFS was evaluated during the Spring 2004 session. The StorNext SAN file system was used to distribute a /vicepX partition among several clients. As in the previous case, AFS/Reuter performance was practically equal to the native performance of StorNext for large files. • To learn more about the DataDirect system and the Fall 2003 session, please visit: http://afs.caspur.it/slab2003b A.Maslennikov - May 2004 - SLAB update
Lustre! A.Maslennikov - May 2004 - SLAB update
Lustre – preliminary results • - Lustre 1.0.4 • - We used 4 Object Storage Targets on 4 Infortrend arrays, no striping • - Very interesting numbers for sequential I/O (8 GB files; numbers in MB/sec). • - These numbers may be directly compared with the SAN FS results obtained with the same disk arrays. A.Maslennikov - May 2004 - SLAB update
LTO-2 Tape Drive A.Maslennikov - May 2004 - SLAB update
LTO-2 tape drive • The drive is a "factor 2" evolution of its predecessor, LTO-1. According to the specs it should be able to deliver up to 35 MB/sec native I/O speed and 200 GB of native capacity. • We were mainly interested in checking the following (see next page): • - write speed as a function of block size • - time to write a tape mark • - positioning times • The overall judgement: quite positive. The drive fits well for backup applications and is acceptable for staging systems. Its strong point is definitely its relatively low cost (10-11 KUSD), which makes it quite competitive (compare with ~30 KUSD for an STK 9940B). A.Maslennikov - May 2004 - SLAB update
LTO-2 • Write speed as a function of block size: > 31 MB/sec native for large blocks, very stable. • Tape mark writing is rather slow: 1.4-1.5 sec/TM. • Positioning: it may take up to 1.5 minutes to fsf to the needed file (average: 1 minute). • (a minimal measurement sketch follows below) A.Maslennikov - May 2004 - SLAB update
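As an illustration, a minimal sketch of how such figures can be obtained with standard tools (dd, mt). This is not the exact procedure we used; /dev/nst0 is assumed to be the no-rewind device of the LTO-2 drive, and for native-speed figures the data should be incompressible or drive compression should be disabled (/dev/zero compresses trivially):

  #!/bin/bash
  TAPE=/dev/nst0
  mt -f $TAPE rewind
  # write speed vs block size: time ~2 GB of data per block size
  for kb in 32 64 128 256 512 1024; do
      count=$(( 2000 * 1024 / kb ))
      echo "block size ${kb}k:"
      time dd if=/dev/zero of=$TAPE bs=${kb}k count=$count
  done
  # time to write a single tape mark
  time mt -f $TAPE weof 1
  # positioning: time a forward skip over 3 files (they must exist on the tape)
  mt -f $TAPE rewind
  time mt -f $TAPE fsf 3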
Final remarks • Our immediate plans include: • - Further investigation of StorTank, CXFS and yet another • SAN file system (Veritas) including NFS export • - Evaluation of iSCSI-enabled SATA RAID arrays • in combination with SAN file systems • - Further Lustre testing on IFT and IBM hardware • (new version: 1.2, striping, other benchmarks) • Feel free to join us at any moment ! A.Maslennikov - May 2004 - SLAB update