SSD – Applications, Usage Examples
Gordon Summer Institute, August 8-11, 2011
Mahidhar Tatineni, San Diego Supercomputer Center
Overview
• Introduction to flash hardware and benefits
• Flash usage scenarios
• Examples of applications tested on Dash and Trestles compute nodes and on Dash/Gordon I/O nodes
• Flash access/remote mounts on Dash, Trestles, and Gordon
Gordon Architecture Bridges the Latency Gap
[Figure: latency (lower is better) plotted against data capacity in GB (higher is better). Labeled tiers: L1 Cache (KB), L2 Cache (KB), L3 Cache (MB), DDR3 Memory (10's of GB), Quick Path Interconnect (10's of GB), QDR InfiniBand Interconnect (100's of GB), I/O to flash node FS (64 I/O nodes, 300 TB Intel SSD), I/O to traditional HPC FS (Data Oasis Lustre 4PB PFS). The application space spans the gap between memory and the parallel file system.]
Flash Drives are a Good Fit for Data Intensive Computing
• Apart from the differences between HDD and SSD, it is uncommon to find local storage “close” to the compute nodes. We have found this attractive on our Trestles cluster, which has local flash on the compute nodes but is used for traditional HPC applications (not high-IOPS workloads).
• Dash uses Intel X25-E SLC drives and Trestles has X25-M MLC drives.
• The performance specs of the Intel flash drives to be deployed in Gordon are similar to those of the X25-M, except that they will have higher endurance.
Flash Usage Scenarios
• Node-local scratch for I/O during a run
  • Very little or no code changes required
  • Ideal if there are several threads doing I/O simultaneously and often
  • Examples: Gaussian, Abaqus, QCHEM
• Caching of partial or complete dataset in analysis, search, and visualization tasks
• Loading entire database into flash
  • Use flash via a filesystem
  • Use raw device [DB2]
Flash as Local Scratch
• Applications which do a lot of local scratch I/O during computations. Examples: Gaussian, Abaqus, QCHEM
• Using flash is very straightforward. For example, on Trestles, where local SSDs are available:
  • Gaussian: GAUSS_SCRDIR=/scratch/$USER/$PBS_JOBID
  • Abaqus: scratch=/scratch/$USER/$PBS_JOBID
• When a lot of cores (up to 32 on Trestles) are doing I/O and reading/writing constantly, the SSDs can make a significant difference.
• Parallel filesystems are not ideal for such I/O.
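A minimal sketch of how a job script might set this up. The `/scratch/$USER/$PBS_JOBID` layout and the `GAUSS_SCRDIR` variable come from the slide; the `SCRATCH_ROOT` override and the temporary-directory fallback are illustrative assumptions, added only so the snippet can also be tried outside a PBS job.

```shell
#!/bin/bash
# Sketch: point Gaussian's scratch directory at node-local flash.
# SCRATCH_ROOT is a hypothetical override; /scratch is the path on the slide.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"
JOB_SCRATCH="$SCRATCH_ROOT/${USER:-$(id -un)}/${PBS_JOBID:-interactive.$$}"

# Assumption: fall back to a temporary directory when the flash path
# cannot be created (for example, when run on a machine without /scratch).
if ! mkdir -p "$JOB_SCRATCH" 2>/dev/null; then
    JOB_SCRATCH="$(mktemp -d)"
fi

export GAUSS_SCRDIR="$JOB_SCRATCH"
echo "Gaussian scratch: $GAUSS_SCRDIR"
```

For Abaqus, the same directory would be passed through its `scratch=` setting instead of an environment variable.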
Flash as local scratch space provides 1.5x-1.8x speedup over local disk for Abaqus
• Standard Abaqus test cases (S2A1, S4B) were run on Dash with 8 cores to compare performance between local hard disk and SSDs. Benchmark performance was as follows:
Reverse-Time-Migration Application
• Acoustic imaging application
  • Used to create images of sub-surface structures
  • Oil and gas companies use RTM to plan drilling investments
  • This is computational research sponsored by a commercial user
• Correlation between source data and recorded data
  • Forward-propagated seismic waves
  • Backward-propagated seismic waves
  • Correlation between seismic waves illuminates reflection/diffraction points
• Temporary storage requirements
  • Snapshots stored for correlation
  • Example: 400^3 max grid points, 20000 msec, ~60GB temporary storage used
[Figure: Example computation-I/O profile]
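To get a feel for the temporary storage figure, here is a rough back-of-the-envelope calculation. It reads the example grid as 400^3 points and assumes each snapshot stores one 4-byte single-precision value per grid point; both are assumptions for illustration, not details given on the slide.

```shell
#!/bin/bash
# Assumptions (not from the slides): a 400^3 grid, one 4-byte value
# per grid point per snapshot.
GRID_EDGE=400
BYTES_PER_POINT=4
TEMP_STORAGE_BYTES=$((60 * 10**9))   # ~60 GB quoted in the example

points=$((GRID_EDGE ** 3))
snapshot_bytes=$((points * BYTES_PER_POINT))
snapshots=$((TEMP_STORAGE_BYTES / snapshot_bytes))

echo "grid points per snapshot:    $points"
echo "bytes per snapshot:          $snapshot_bytes"
echo "snapshots fitting in ~60 GB: $snapshots"
```

Under these assumptions a snapshot is about 256 MB, so ~60 GB corresponds to a couple of hundred snapshots written and read back during one run, which is the kind of sustained scratch I/O that local flash helps with.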
Reverse-Time-Migration on Flash*
• Storage comparison on batch nodes
  • Spinning disk (HDD), flash drives (flash), parallel file system (GPFS)
  • Local flash drive outperforms the other storage options
  • Avg 7.2x I/O speedup vs HDD
  • Avg 3.9x I/O speedup vs GPFS
• I/O-node RAID'd flash
  • Comparison with RAID'd drives: 16 Intel drives, 4 Fusion-io cards
  • RAID'd flash achieves 2.2x speedup compared to a single drive
* Done by Pietro Cicotti, SDSC
Local SSD to Cache Partial/Full Dataset
• Load partial/full dataset into flash.
• Typically needs application modification to write data into flash and do all subsequent reads from flash.
• Example: Munagala-Ranade Breadth First Search (MR-BFS) code:
  • Generation phase -> puts the data in flash.
  • Multiple MR-BFS runs read and process data.
  • Multiple threads reading, benefits from low latency of SSDs.
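The write-once, read-many pattern described above can be sketched at the shell level as a small staging helper. This is only an illustration of the idea; `stage_to_flash` and the demo paths are hypothetical names, and a temporary directory stands in for the flash mount. It is not how the MR-BFS code itself manages its data.

```shell
#!/bin/bash
# stage_to_flash SRC CACHE_DIR
# Copies SRC into CACHE_DIR once and prints the cached path. Later calls
# find the existing copy and skip the copy step, so repeated runs read
# from flash instead of the original location.
stage_to_flash() {
    local src="$1" cache_dir="$2"
    local dest
    dest="$cache_dir/$(basename "$src")"
    mkdir -p "$cache_dir" || return 1
    if [ ! -e "$dest" ]; then
        cp -r "$src" "$dest" || return 1
    fi
    printf '%s\n' "$dest"
}

# Demo with stand-in paths (hypothetical): a temp dir plays the flash mount.
demo_src="$(mktemp)"
echo "graph data" > "$demo_src"
demo_flash="$(mktemp -d)"
cached="$(stage_to_flash "$demo_src" "$demo_flash")"
echo "first run staged:  $cached"
echo "second run reuses: $(stage_to_flash "$demo_src" "$demo_flash")"
```

A production version would also need to guard against a partially copied file being mistaken for a complete one, for example by copying to a temporary name and renaming on success.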
Flash case study – Breadth First Search*
• Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
• Benchmark problem: BFS on a graph containing 134 million nodes
• Use of flash drives reduced I/O time by a factor of 6.5
• As expected, no measurable impact on non-I/O operations
• Problem converted from I/O bound to compute bound
* Done by Sandeep Gupta, SDSC
Flash for caching: Case study – Parallel Streamline Visualization
Camp et al., accepted to IEEE Symp. on Large-Scale Data Analysis and Visualization (LDAV 2011)
Databases on Flash
• Database performance benefits from the low-latency I/O of flash
• Two options for setting up a database:
  • Load the database on a flash-based filesystem, already tested on Dash I/O nodes.
  • DB2 with direct native access to flash memory (coming soon!).
LIDAR Data Lifecycle
[Figure: LIDAR data lifecycle, with stages labeled Waveform Data, Point Cloud Dataset, Bare earth DEM, Full-featured DEM, and Portal. Image credit: D. Harding, NASA]
OpenTopography is a “cloud” for topography data and tools
LIDAR benchmarking* and experiments on a Dash I/O node
Experiments with LIDAR point cloud data, with data sizes ranging from 1GB to 1TB, using DB2. Experiments to be performed include:
1. Load times: time to load each dataset
2. Single-user selection times: for selecting 6%, 12%, 50% of data
3. Single-user processing times: for DEM generation on selected data
4. Multiuser: for a fixed dataset size (either 100GB or 1TB), run selections and processing for multiple concurrent users, e.g. 2, 4, 8, 16 concurrent users
5. Logical nodes testing: for a fixed dataset size (100GB or 1TB), DB2 has the option of creating multiple “logical nodes” on a given system (“physical node”). Test what the optimal number of logical nodes is on an SSD node
* Chaitan Baru’s group at SDSC.
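The multiuser experiment can be approximated with a small timing harness that launches N copies of a query command at once and reports the elapsed wall-clock time. This is a generic sketch, not the benchmark actually used; `run_concurrent` is a hypothetical helper, and `sleep 1` is only a placeholder for the real per-user query command.

```shell
#!/bin/bash
# run_concurrent USERS CMD [ARGS...]
# Starts USERS copies of CMD in the background, waits for all of them,
# and prints the elapsed wall-clock time in whole seconds.
run_concurrent() {
    local users="$1"; shift
    local start end i pids=()
    start=$(date +%s)
    for ((i = 0; i < users; i++)); do
        "$@" &
        pids+=("$!")
    done
    wait "${pids[@]}"
    end=$(date +%s)
    echo "$users users: $((end - start)) s"
}

# Placeholder workload: replace `sleep 1` with the real query command.
for n in 2 4 8 16; do
    run_concurrent "$n" sleep 1
done
```

Running the same loop against an HDD-backed and a flash-backed database instance would give the kind of side-by-side comparison the experiment plan calls for.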
Flash case study – LIDAR
• Remote sensing technology used to map geographic features with high resolution
• Benchmark problem: load 100 GB of data into a single table, then count rows
• DB2 database instance
• Flash drives 1.5x (load) to 2.4x (count) faster than hard disks
Flash case study – LIDAR (concurrent queries)
• Comparison of runtimes for concurrent LIDAR queries obtained with flash drives (SSD) and hard drives (HDD), using the Alaska Denali-Totschunda data collection
• The impact of SSDs was modest, but became significant when multiple queries were executed simultaneously
PDB – protein interaction query
• The first step in the analysis involves reducing a 150 million row database table to one million rows. Use of flash drives reduced the query time to 3 minutes, a 10x speedup over hard disk.
• Dash I/O node configuration
  • Four 320 GB Fusion-io ioDrives configured as a 1.2 TB RAID 0 device running an XFS file system
  • Two quad-core Intel Xeon E5530 2.40 GHz processors and 48 GB of DDR3-1066 memory
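For readers unfamiliar with this kind of setup, the sketch below shows one common way to stripe four devices into a RAID 0 array and put XFS on it, using Linux software RAID (`mdadm`). The slides do not say which tool was used on Dash, so the choice of `mdadm`, the device names, and the mount point are all assumptions. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# run CMD...: print the command in dry-run mode (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical device and mount names; adjust for the actual system.
FLASH_DEVICES=(/dev/fioa /dev/fiob /dev/fioc /dev/fiod)
RAID_DEVICE=/dev/md0
MOUNT_POINT=/mnt/flash

build_flash_raid() {
    # Stripe the four flash devices into one RAID 0 array.
    run mdadm --create "$RAID_DEVICE" --level=0 \
        --raid-devices="${#FLASH_DEVICES[@]}" "${FLASH_DEVICES[@]}"
    # Create an XFS file system on the array and mount it.
    run mkfs.xfs "$RAID_DEVICE"
    run mkdir -p "$MOUNT_POINT"
    run mount -t xfs "$RAID_DEVICE" "$MOUNT_POINT"
}

build_flash_raid
```

RAID 0 provides no redundancy, which is acceptable here because the flash holds scratch or reloadable data rather than the only copy of anything.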
Sample Script on Dash
#!/bin/bash
#PBS -N PBStest
#PBS -l nodes=1:ppn=8
#PBS -l walltime=01:00:00
#PBS -o test-normal.out
#PBS -e test-normal.err
#PBS -m e
#PBS -M mahidhar@sdsc.edu
#PBS -V
#PBS -q batch
# Work in the job's scratch directory
cd /scratch/mahidhar/$PBS_JOBID
# Stage the input from the home directory into scratch
cp -r /home/mahidhar/COPYBK/input /scratch/mahidhar/$PBS_JOBID
# Run the application on 8 cores
mpirun_rsh -hostfile $PBS_NODEFILE -np 8 test.exe
# Copy the output back to the home directory
cp out.txt /home/mahidhar/COPYBK/
Dash Prototype vs. Gordon
When considering benchmark results and scalability, keep in mind that nearly every major feature of Gordon will be an improvement over Dash.
Accessing Flash on Gordon
• The majority of the flash disk will be in the 64 Gordon I/O nodes. Each I/O node will have ~4.8TB of flash.
• Flash from I/O nodes will be made available to non-vSMP compute nodes via the IB network and iSER implementations. Two options will be available:
  • XFS filesystem mounted locally on each node.
  • Oracle Cluster Filesystem (OCFS)
• vSMP software will aggregate the flash from the I/O node(s) included in the vSMP nodes. The aggregated flash filesystem will be available as local scratch on the node.
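As a rough illustration of what "XFS via iSER" looks like from a compute node, the sketch below uses the open-iscsi `iscsiadm` tool with its `iser` interface to attach a remote target and then mounts the resulting block device. The portal address, target name, device name, and mount point are all hypothetical, and the actual Gordon configuration may differ. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/bash
# run CMD...: print the command in dry-run mode (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical values; none of these come from the slides.
PORTAL="10.0.0.1:3260"
TARGET="iqn.2011-08.edu.example:flash0"
BLOCK_DEVICE=/dev/sdb
MOUNT_POINT=/scratch/flash

attach_remote_flash() {
    # Discover targets on the I/O node using the iSER (RDMA) transport.
    run iscsiadm -m discovery -t sendtargets -p "$PORTAL" -I iser
    # Log in to the target; it then appears as a local block device.
    run iscsiadm -m node -T "$TARGET" -p "$PORTAL" -I iser --login
    # Mount the XFS file system that lives on the exported device.
    run mkdir -p "$MOUNT_POINT"
    run mount -t xfs "$BLOCK_DEVICE" "$MOUNT_POINT"
}

attach_remote_flash
```

XFS is not a cluster file system, so a given exported device should be mounted read-write on only one node at a time; sharing one device across nodes is the case OCFS addresses.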
Flash performance needs to be freed from the I/O nodes
[Figure: diagram marking where the application runs ("Application is here") and where the flash resides ("Flash is here").]
Alphabet Soup of networking protocols and file systems
• SRP - SCSI over RDMA
• iSER - iSCSI over RDMA
• NFS over RDMA
• NFS/IP over IB
• XFS - via iSER devices
• Lustre
• OCFS - via iSER devices
• PVFS
• OrangeFS
• Others…
In our effort to maximize flash performance we have tested most of these.
BTW: Very few people are doing this!
Flash performance – parallel file system
• Performance of Intel Postville Refresh SSDs (16 drives, RAID 0) with OCFS (Oracle Cluster File System)
• I/O done simultaneously from 1, 2, or 4 compute nodes
• MT = multi-threaded, EP = embarrassingly parallel
Flash performance – serial file system
• Performance of Intel Postville Refresh SSDs (16 drives, RAID 0) with XFS
• I/O done simultaneously from 1, 2, or 4 compute nodes
• MT = multi-threaded, EP = embarrassingly parallel
Summary
• The early hardware has allowed us to test applications, protocols, and file systems.
• I/O profiling tools and running different application flash usage scenarios have helped optimize application I/O performance.
• Performance test results point to iSER, OCFS, and XFS as the right solutions for exporting flash.
• Further work is required to integrate into user documentation, systems scripts, and the SLURM resource manager.
Discussion • Attendee I/O access pattern/method.
Thank you!
For more information:
http://gordon.sdsc.edu
gordoninfo@sdsc.edu
Mahidhar Tatineni, mahidhar@sdsc.edu