Research @ Northeastern University • I/O storage modeling and performance • David Kaeli • Soft error modeling and mitigation • Mehdi B. Tahoori April 2005
I/O Storage Research at Northeastern University David Kaeli Yijian Wang Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu
Outline • Motivation to study file-based I/O • Profile-driven partitioning for parallel file I/O • I/O Qualification Laboratory @ NU • Areas for future work
Important File-based I/O Workloads • Many subsurface sensing and imaging workloads involve file-based I/O • Cellular biology – in-vitro fertilization with NU biologists • Medical imaging – cancer therapy with MGH • Underwater mapping – multi-sensor fusion with Woods Hole Oceanographic Institution • Ground-penetrating radar – toxic waste tracking with Idaho National Labs
The Impact of Profile-guided Parallelization on SSI Applications • Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster • Hot-path parallelization • Data restructuring • Reduced the runtime of a Monte Carlo scattered-light simulation by 98% on a 16-node Silicon Graphics Origin 2000 • Matlab-to-C compilation • Hot-path parallelization • Obtained superlinear speedup of an Ellipsoid Algorithm run on a 16-node IBM SP2 • Matlab-to-C compilation • Hot-path parallelization
Limits of Parallelization • For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers • Middleware layers (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems • Multiple clusters can be combined using Grid middleware (e.g., the Globus Toolkit) • For file-based I/O-bound workloads, however, Beowulf clusters and Grid systems are presently ill-suited to exploiting the available I/O parallelism
Outline • Motivation to study file-based I/O • Profile-driven partitioning for parallel file I/O • I/O Qualification Laboratory @ NU • Areas for future work
Parallel I/O Acceleration • The I/O bottleneck • The growing gap between the speed of processors, networks and underlying I/O devices • Many imaging and scientific applications access disks very frequently • I/O intensive applications • Out-of-core applications • Work on large datasets that cannot fit in main memory • File-intensive applications • Access file-based datasets frequently • Large number of file operations
Introduction • Storage architectures • Direct Attached Storage (DAS) • Storage device is directly attached to the computer • Network Attached Storage (NAS) • Storage subsystem is attached to a network of servers and file requests are passed through a parallel filesystem to the centralized storage device • Storage Area Network (SAN) • A dedicated network to provide an any-to-any connection between processors and disks
[Figure: an I/O-intensive application mapped three ways: data partitioning across multiple processes (i.e., MPI-IO), data striping across multiple disks (i.e., RAID), and I/O partitioning across both processes and disks]
I/O Partitioning • I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning) • Ideally, every process will only access files on its local disk (though this is typically not possible due to data sharing) • How do we recognize the access patterns? • Profile-guided approach (see the sketch below)
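To make the two levels concrete, here is a minimal mpi4py sketch (not from the original slides; mpi4py, NumPy, and the file names are assumptions): all ranks first share one file through MPI-IO, then each rank opens a private per-rank partition file so its accesses stay on the local disk.

```python
# Minimal sketch of application-level vs. disk-level I/O parallelism.
# Assumes mpi4py and NumPy; file names are hypothetical.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
CHUNK = 2040  # bytes per contiguous access, as in the NPB2.4/BT workload

# Shared-file MPI-IO: every rank opens the same file and reads at its own offset.
fh = MPI.File.Open(comm, "dataset.bin", MPI.MODE_RDONLY)
buf = np.empty(CHUNK, dtype=np.uint8)
fh.Read_at(rank * CHUNK, buf)  # explicit-offset read; no shared file pointer
fh.Close()

# File-partitioned I/O: each rank opens its own partition file on local disk.
fh = MPI.File.Open(MPI.COMM_SELF, f"partition_{rank}.bin", MPI.MODE_RDONLY)
fh.Read_at(0, buf)
fh.Close()
```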
Profile Generation • Run the application • Capture I/O execution profiles • Apply our partitioning algorithm • Rerun the tuned application
I/O traces and partitioning • For every process, for every contiguous file access, we capture the following I/O profile information: • Process ID • File ID • Address • Chunk size • I/O operation (read/write) • Timestamp • Generate a partition for every process • Optimal partitioning is NP-complete, so we develop a greedy algorithm • We have found we can use partial profiles to guide partitioning
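As an illustration only (the slides do not show the trace format), the six profiled fields map naturally onto a small record type, with a logging helper that the I/O wrappers would call:

```python
import time
from dataclasses import dataclass

@dataclass
class IORecord:
    pid: int          # ID (MPI rank) of the issuing process
    file_id: int      # identifier of the accessed file
    address: int      # starting byte offset of the contiguous access
    chunk_size: int   # number of bytes transferred
    op: str           # "R" for read, "W" for write
    timestamp: float  # wall-clock time of the access

trace: list[IORecord] = []

def log_access(pid, file_id, address, chunk_size, op):
    """Record one contiguous file access; called from each I/O wrapper."""
    trace.append(IORecord(pid, file_id, address, chunk_size, op, time.time()))
```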
Greedy File Partitioning Algorithm

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the # of read/write accesses on a per-process basis;
    if the chunk is accessed by only one process
        assign the chunk to that process's partition;
    else if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where it is read;
    else if the chunk is written by one process, but later read by multiple processes
        assign the chunk to all partitions where it is read,
        and broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition
    sort chunks by the earliest timestamp of each chunk;
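A compact executable rendering of the greedy pass, reusing the hypothetical IORecord above (a sketch of the published heuristic, not the authors' code):

```python
from collections import defaultdict

def greedy_partition(trace, nprocs):
    """Assign each contiguous chunk, keyed by (file_id, address), to partitions."""
    readers, writers, first_seen = defaultdict(set), defaultdict(set), {}
    for r in sorted(trace, key=lambda r: r.timestamp):
        key = (r.file_id, r.address)
        (readers if r.op == "R" else writers)[key].add(r.pid)
        first_seen.setdefault(key, r.timestamp)

    partitions = {p: [] for p in range(nprocs)}
    shared = []
    # Visit chunks by earliest timestamp so each partition ends up sorted.
    for key in sorted(first_seen, key=first_seen.get):
        rd, wr = readers[key], writers[key]
        if len(rd | wr) == 1:            # accessed by a single process
            partitions[(rd | wr).pop()].append(key)
        elif not wr:                     # read-shared, never written: replicate
            for p in rd:
                partitions[p].append(key)
        elif len(wr) == 1:               # one writer, many readers: replicate,
            for p in rd | wr:            # broadcasting updates on writes
                partitions[p].append(key)
        else:                            # multiple writers: shared partition
            shared.append(key)
    return partitions, shared
```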
Parallel I/O Workloads • NAS Parallel Benchmarks (NPB 2.4)/BT • Computational fluid dynamics • Generates a file (~1.6 GB) dynamically and then reads it back • Writes/reads sequentially in chunk sizes of 2040 bytes • SPEChpc96/seismic • Seismic processing • Generates a file (~1.5 GB) dynamically and then reads it back • Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB • Tile-IO • Parallel Benchmarking Consortium • Tile access to a two-dimensional matrix (~1 GB) with overlap • Writes/reads sequential chunks of 32 KB, with 2 KB of overlap • Perf • Parallel I/O test program within MPICH • Writes a 1 MB chunk at a location determined by rank, no overlap • Mandelbrot • An image-processing application that includes visualization • Chunk size is dependent on the number of processes
[Figure: Beowulf cluster testbed: P2-350MHz compute nodes, each with a local PCI-IDE disk, connected through a 10/100Mb Ethernet switch to a RAID node]
Hardware Specifics • DAS configuration: Linux box, Western Digital WD800BB (IDE), 80 GB, 7200 RPM • Beowulf cluster (base configuration): Fast Ethernet, 100 Mbits/sec • Network-attached RAID: Morstor TF200 with 6-9GB Seagate SCSI drives, 7200 RPM, RAID-5 • Locally attached IDE disks: IBM UltraATA-350840, 5400 RPM • Fibre Channel disks: Seagate Cheetah X15 ST-336752FC, 15000 RPM
[Figure: write/read bandwidth for NPB2.4/BT and SPEChpc96/seismic]
[Figure: write/read bandwidth for MPI-Tile-IO, Perf, and Mandelbrot]
Profile training sensitivity analysis • We have found that I/O access patterns are independent of file-based data values • When we increase the problem size or reduce the number of processes, either: • the number of I/Os increases, but access patterns and chunk size remain the same (SPEChpc96, Mandelbrot), or • the number of I/Os and I/O access patterns remain the same, but the chunk size increases (NPB/BT, Tile-IO, Perf) • Re-profiling can therefore be avoided
Execution-driven Parallel I/O Modeling • Growing need to process large, complex datasets in high performance parallel computing applications • Efficient implementation of storage architectures can significantly improve system performance • An accurate simulation environment for users to test and evaluate different storage architectures and applications
Execution-driven I/O Modeling • Target applications: parallel scientific programs (MPI) • Target machine/host machine: Beowulf clusters • Use DiskSim as the underlying disk-drive simulator • Direct execution to model CPU and network communication • We execute the real parallel I/O accesses while calculating the simulated I/O response times (see the sketch below)
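The coupling might look like the following sketch; DiskSimInterface and service_time are hypothetical stand-ins for the real DiskSim bindings, which the slides do not detail:

```python
class DiskSimInterface:
    """Hypothetical wrapper around a DiskSim-like disk model."""
    def service_time(self, op: str, address: int, nbytes: int) -> float:
        raise NotImplementedError  # would hand the request to the simulator

def replay(trace, sim: DiskSimInterface) -> float:
    """Issue each real access, but account its cost with the simulated disk."""
    simulated = 0.0
    for r in sorted(trace, key=lambda r: r.timestamp):
        # ...perform the real read/write here so the program still runs...
        simulated += sim.service_time(r.op, r.address, r.chunk_size)
    return simulated
```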
Validation – Synthetic I/O Workload on DAS
Simulation Framework – NAS [Figure: compute nodes produce local I/O traces; logical file access addresses flow through the network file system over the LAN/WAN to the RAID controller, where filesystem metadata maps I/O requests into DiskSim]
Simulation Framework – SAN direct • A variant of SAN where disks are distributed across the network and each server is directly connected to a single device • File partitioning: utilize I/O profiling and data partitioning heuristics to distribute portions of files to disks close to the processing nodes [Figure: each node runs a filesystem and a DiskSim instance, with I/O traces exchanged over the LAN/WAN]
Hardware Specifications
Publications • “Profile-guided File Partitioning on Beowulf Clusters,” Journal of Cluster Computing, Special Issue on Parallel I/O, to appear 2005. • “Execution-Driven Simulation of Network Storage Systems,” Proceedings of the 12th ACM/IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, pp. 604-611. • “Profile-Guided I/O Partitioning,” Proceedings of the 17th ACM International Symposium on Supercomputing, June 2003, pp. 252-260. • “Source Level Transformations to Apply I/O Data Partitioning,” Proceedings of the IEEE Workshop on Storage Network Architecture and Parallel I/O, October 2003, pp. 12-21. • “Profile-Based Characterization and Tuning for Subsurface Sensing and Imaging Applications,” International Journal of Systems, Science and Technology, September 2002, pp. 40-55.
Summary of Cluster-based Work • Many imaging applications are dominated by file-based I/O • Parallel systems can only be effectively utilized if I/O is also parallelized • Developed a profile-guided approach to I/O data partitioning • Impacting clinical trials at MGH • Reduced overall execution time by 27-82% over MPI-IO • Execution-driven I/O model is highly accurate and provides significant modeling flexibility
Outline • Motivation to study file-based I/O • Profile-driven partitioning for parallel file I/O • I/O Qualification Laboratory @ NU • Areas for future work
I/O Qualification Laboratory • Working with the Enterprise Strategy Group • Develop a state-of-the-art facility to provide independent performance qualification of enterprise storage (ES) systems • Provide a quarterly report to the ES customer base on the status of current ES offerings • Work with leading ES vendors to provide them with custom early performance evaluation of their beta products
I/O Qualification Laboratory • Contacted by IOIntegrity and SANGATE for product qualification • Developed potential partners that are leaders in the ES field • Initial proposals already reviewed by IBM, Hitachi and other ES vendors • Looking for initial endorsement from industry
I/O Qualification Laboratory • Why @ NU • Track record with industry (EMC, IBM, Sun) • Experience with benchmarking and I/O characterization • Interesting set of applications (medical, environmental, etc.) • Great opportunity to work within the cooperative education model
Outline • Motivation to study file-based I/O • Profile-driven partitioning for parallel file I/O • I/O Qualification Laboratory @ NU • Areas for future work
Areas for Future Work • Designing a peer-to-peer storage system on a Grid system by partitioning datasets across geographically distributed storage devices [Figure: two clusters, joulian.hpcl.neu.edu (head node + 31 sub-nodes) and keys.ece.neu.edu (head node + 8 sub-nodes + RAID), linked over the Internet at 100Mbit/s and 1Gbit/s]
Areas for Future Work • Reduce simulation time by identifying characteristic “phases” in I/O workloads • Apply machine learning algorithms to identify clusters of representative I/O behavior • Utilize K-Means and Multinomial clustering to obtain high fidelity in simulation runs utilizing sampled I/O behavior (see the sketch below) • “A Multinomial Clustering Model for Fast Simulation of Architecture Designs,” submitted to the 2005 ACM KDD Conference.
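As one possible rendering of the clustering step (scikit-learn and the three-feature interval summary are assumptions, reusing the IORecord sketch from earlier): summarize each fixed-length interval of a trace as a feature vector, cluster the intervals, and simulate only one representative per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

def interval_features(trace, interval=1.0):
    """Summarize each time interval as (#reads, #writes, mean chunk size)."""
    start = min(r.timestamp for r in trace)
    n = int((max(r.timestamp for r in trace) - start) / interval) + 1
    reads, writes = np.zeros(n), np.zeros(n)
    size, count = np.zeros(n), np.zeros(n)
    for r in trace:
        i = int((r.timestamp - start) / interval)
        (reads if r.op == "R" else writes)[i] += 1
        size[i] += r.chunk_size
        count[i] += 1
    mean = np.divide(size, count, out=np.zeros(n), where=count > 0)
    return np.column_stack([reads, writes, mean])

X = interval_features(trace)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
# Simulate one representative interval per cluster, weighted by cluster size.
```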