Improved Performance in Data-Intensive Computational Turbulence with I/O Streaming

I/O Streaming Evaluation of Batch Queries for Data-Intensive Computational Turbulence • Kalin Kanov, Eric Perlman, Randal Burns, Yanif Ahmad, and Alexander Szalay • Johns Hopkins University

I/O Streaming For Batch Queries • Based on partial sums • Allows access to the underlying data in any order and in parts • Data streamed from disk in a single pass • Eliminates redundant I/O • Over an order of magnitude improvement in performance over direct evaluation of queries

Introduction • Data-intensive computing breakthroughs have allowed for new ways of interacting with scientific numerical simulations • Formerly, analysis was performed during the computation • No data was stored for subsequent examination • Turbulence Database Cluster • Stores the entire space-time evolution of the simulation • Two datasets totaling 70 TB; part of the 1.1 PB GrayWulf cluster • Provides public access to world-class simulation • Implements the “immersive turbulence”* approach
*E. Perlman, R. Burns, Y. Li, and C. Meneveau. Data exploration of turbulence simulations using a database cluster. In Supercomputing, 2007.

Turbulence Database Cluster

Motivation • Without I/O streaming: • Heavy DB usage slows down the service by a factor of 10 to 20 • Query evaluation techniques adapted from simulation code do not access data coherently • Substantial storage overhead (~42%) incurred to localize each computation • Turbulence queries: • 95% of queries perform Lagrange Polynomial interpolation • Can be evaluated in parts

Processing a Batch Query • [Figure: 4×4 grid of data atoms numbered 0 to 15 (rows from top: 10 11 14 15 / 8 9 12 13 / 2 3 6 7 / 0 1 4 5), shown next to their linear order 0 to 15]
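The 4×4 numbering on this slide is consistent with a z-order (Morton) layout, which matches the later statement that keys are z-index values. The sketch below is a 2D illustration only: the function name `z_index_2d` and the choice of placing x in the even bit positions are assumptions made to reproduce the slide's grid, not the database's actual indexing code, which works on 3D data.

```python
def z_index_2d(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y, with x in the even bit positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to position 2i + 1
    return z


# Build the grid with rows ordered top (y = 3) to bottom (y = 0), as drawn on the slide.
grid = [[z_index_2d(x, y) for x in range(4)] for y in reversed(range(4))]

for row in grid:
    print(row)
```

Run as written, this prints the rows `[10, 11, 14, 15]`, `[8, 9, 12, 13]`, `[2, 3, 6, 7]`, `[0, 1, 4, 5]`, the same arrangement as the figure. Atoms that are close in space tend to get nearby keys, which is what makes a scan in key order a mostly sequential read.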

Processing a Batch Query • Redundant I/O • Multiple disk seeks • [Figure: queries 1, 2, and 3 overlaid on the 4×4 atom grid and on the linear order 0 to 15] • Atoms read per query: q1: 0 1 2 3 4 6 8 9 12 • q2: 9 11 12 14 • q3: 4 5 6 7

Streaming Evaluation Method • Linear data requirements of the computation allow for: • Incremental evaluation • Streaming over the data • Concurrent evaluation of batch queries

Processing a Batch Query • Sequential I/O • Single pass • [Figure: queries 1, 2, and 3 overlaid on the 4×4 atom grid and on the linear order 0 to 15] • I/O Streaming reads atoms 0 1 2 3 4 5 6 7 8 9 11 12 14 once each, serving: 0: q1 • 1: q1 • 2: q1 • 3: q1 • 4: q1, q3 • 5: q3 • 6: q1, q3 • 7: q3 • 8: q1 • 9: q1, q2 • 11: q2 • 12: q1, q2 • 14: q2
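The saving in this example can be counted directly. The sketch below uses the atom lists read off the slides for the three queries (q1: 0 1 2 3 4 6 8 9 12; q2: 9 11 12 14; q3: 4 5 6 7); the variable names are illustrative, and counting one read per atom is a simplification that ignores caching and seek costs.

```python
# Atoms touched by each query, as listed on the slides.
queries = {
    "q1": [0, 1, 2, 3, 4, 6, 8, 9, 12],
    "q2": [9, 11, 12, 14],
    "q3": [4, 5, 6, 7],
}

# Direct evaluation: every query fetches its own atoms, so shared atoms are read again.
direct_reads = sum(len(atoms) for atoms in queries.values())

# Streaming: each needed atom is read once, in ascending z-index order.
stream_order = sorted({atom for atoms in queries.values() for atom in atoms})
streaming_reads = len(stream_order)

print(direct_reads, streaming_reads)
```

This prints `17 13`: four of the direct reads (atoms 4, 6, 9, and 12) are repeats that the single pass avoids. The gap grows as more queries in a batch share atoms.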

Lagrange Polynomial Interpolation • [Equation figure with its terms labeled “Lagrange coefficients” and “Data”]
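Because the interpolated value is a weighted sum of data values, it can be accumulated in parts and in any order. The following is a minimal 1D sketch of that property, not the service's 3D kernel: the helper `lagrange_weights`, the four nodes, and the sample function x^3 are all assumptions chosen for illustration.

```python
def lagrange_weights(nodes, x):
    """Return the Lagrange basis weights l_i(x) for 1D nodes."""
    weights = []
    for i, xi in enumerate(nodes):
        w = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                w *= (x - xj) / (xi - xj)
        weights.append(w)
    return weights


nodes = [0.0, 1.0, 2.0, 3.0]
values = [n ** 3 for n in nodes]  # samples of f(x) = x^3
x = 1.5
w = lagrange_weights(nodes, x)

# Full evaluation in one go: sum of coefficient times data value.
full = sum(wi * vi for wi, vi in zip(w, values))

# Same sum as two partial sums, as if the samples lived in two separate atoms
# that happen to arrive in the opposite order.
part_b = sum(wi * vi for wi, vi in zip(w[2:], values[2:]))
part_a = sum(wi * vi for wi, vi in zip(w[:2], values[:2]))
in_parts = part_b + part_a

print(full, in_parts)
```

Both values equal 3.375, which is 1.5 cubed, since a four-point Lagrange interpolant reproduces a cubic exactly. The weights depend only on the query position, so they can be computed up front and applied whenever the matching data arrives.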

Processing a Batch Query • Input queries pre-processed into a key-value dictionary • Keys are z-index values of data atoms stored in DB • Entries are lists of queries • Temp table is created out of dictionary keys • Execute a join between temp table and data table • When data atom is read-in all queries that need data from it are processed and their partial sums updated
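The steps above can be sketched in a few lines. This is an in-memory stand-in under stated assumptions, not the SQL Server implementation: `stream_batch`, `read_atoms`, and `data_table` are hypothetical names, each atom is reduced to a single number, the sorted key list plays the role of the temp table joined against the data table, and the uniform weights are invented so that each query simply averages its atoms.

```python
from collections import defaultdict


def stream_batch(requests, read_atoms):
    """requests: iterable of (query_id, z_index, weight) triples.
    read_atoms: callable that takes sorted keys and yields (z_index, value)."""
    # Pre-process the batch into a dictionary: atom key -> queries needing it.
    by_atom = defaultdict(list)
    for query_id, z, weight in requests:
        by_atom[z].append((query_id, weight))

    # Sorted dictionary keys stand in for the temp table used in the join.
    keys = sorted(by_atom)

    # One pass over the data: update every dependent partial sum per atom.
    partial = defaultdict(float)
    for z, value in read_atoms(keys):
        for query_id, weight in by_atom[z]:
            partial[query_id] += weight * value
    return dict(partial)


# Toy data table: the value stored in each atom is just its own key.
data_table = {z: float(z) for z in range(16)}
reads = []  # records which atoms were fetched, and in what order


def read_atoms(keys):
    for z in keys:
        reads.append(z)
        yield z, data_table[z]


query_atoms = {
    "q1": [0, 1, 2, 3, 4, 6, 8, 9, 12],
    "q2": [9, 11, 12, 14],
    "q3": [4, 5, 6, 7],
}
requests = [(q, z, 1.0 / len(atoms)) for q, atoms in query_atoms.items() for z in atoms]

results = stream_batch(requests, read_atoms)
print(results, len(reads))
```

With these toy inputs each result is the mean of the query's atom keys (about 5.0 for q1, 11.5 for q2, 5.5 for q3), and `reads` holds 13 keys in ascending order with no repeats. Keying the dictionary by atom rather than by query is what lets one read serve every query that needs that atom.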

Experimental Evaluation • Random workloads: • across the entire cube space • a 128^3 subset of the entire space • Workload from the usage log of the Turbulence cluster • Compare with direct methods of evaluation: • Direct • Sorting • Join/Order By

3D Workload • Used for generating global statistics

128 Workload • Used for: • Examining ROI • Creating visualizations


Setup • Experimental version of the MHD database • ~300 timesteps of the velocity fields of the MHD simulation • Two 2.33 GHz dual quad-core Windows 2003 servers with SQL Server 2008 and 8GB of memory • Part of the 1.1PB GrayWulf cluster with aggregate low-level throughput of 70 GB/sec • Data tables striped across 7 disks per node

3D Workload • I/O Streaming • Each atom is read only once • Effective cache usage • Join/Order By executes the entire batch as a join • Sorting leads to more sequential access • Over an order of magnitude improvement

128 Workload • Less I/O • More data sharing

I/O Streaming alleviates I/O bottleneck • Computation emerges as the more costly operation

128 Workload

Future Work • Extend I/O streaming technique to other decomposable kernel computations: • Differentiation • Temporal interpolation • Filtering • Multi-job batch scheduling: • Integrate into a batch scheduling framework such as JAWS*
*X. Wang, E. Perlman, R. Burns, T. Malik, T. Budavari, C. Meneveau, and A. Szalay. Jaws: Job-aware workload scheduling for the exploration of turbulence simulations. In Supercomputing, 2010.

Summary • I/O Streaming method for data-intensive batch queries • Single pass over the data by means of partial sums • Effective exploitation of data sharing • Improved cache locality • Over an order of magnitude improvement in performance

Questions Images courtesy of Kai Buerger (buerger@tum.de)
