Big Data Vs. (Traditional) HPC • Gagan Agrawal • Ohio State • ICPP Big Data Panel (09/12/2012)
Big Data Vs. (Traditional) HPC
• They will clearly co-exist
• Fine-grained simulations will prompt more `big-data' problems
• The ability to analyze data will prompt finer-grained simulations
• Even instrument data can prompt more simulations
• These are the Third and Fourth Pillars of Scientific Research
• Critical need: the HPC community must get deeply engaged in `big-data'
Other Thoughts
• The onus is on the HPC community
• The database, cloud, and visualization communities have been active here for a while now
• Abstractions like MapReduce are neat!
• So are parallel and streaming visualization solutions
• Many existing solutions deliver very low performance
• Do people realize how slow Hadoop really is?
• And yet it is one of the most successful open-source software projects
• We are needed!
• The programming model design and implementation community hasn't even looked at `big-data' applications
• We must engage application scientists
• Who are often stuck in `I don't want to deal with the mess' mode
Impact on Leadership Class Systems
• Unlike HPC, the commercial sector has a lot of experience with `big-data'
• Facebook, Google
• They seem to do fine with large, fault-tolerant commodity clusters
• `Big-data' might create a push back against memory- and I/O-bound architecture trends
• It might make the journey to exascale harder, though
• `Big-data' problems should certainly be considered while addressing the fault-tolerance and power challenges
Open Questions
• How do we develop parallel data analysis solutions?
• Hadoop?
• MPI + file I/O calls?
• SciDB (array analytics)?
• Parallel R?
• Desiderata
• No reloading of data (rules out SciDB and Hadoop)
• Performance while implementing new algorithms (rules out parallel R)
• Transparency with respect to data layouts and parallel architectures
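Of the options above, "MPI + file I/O calls" is the lowest-level one, and it also illustrates the first desideratum: data is analyzed where it already sits, with no reloading into an external system. Below is a minimal sketch of that option, assuming mpi4py and h5py are available; the file name "simulation.h5" and dataset name "temperature" are hypothetical. Each rank reads only its own slice of the dataset and the global mean is computed with an allreduce.

```python
# Sketch of the "MPI + file I/O calls" option: each rank reads its own slice
# of an HDF5 dataset and a global mean is computed with an allreduce, so the
# data is never reloaded into a separate system.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("simulation.h5", "r") as f:   # hypothetical simulation output
    dset = f["temperature"]                  # hypothetical 1-D dataset
    n = dset.shape[0]
    # Partition the index range evenly across ranks.
    chunk = (n + size - 1) // size
    lo, hi = rank * chunk, min((rank + 1) * chunk, n)
    local = dset[lo:hi]                      # each rank reads only its slice

local_sum = float(np.sum(local))
local_count = local.shape[0]

# Combine the partial results across all ranks.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)
global_count = comm.allreduce(local_count, op=MPI.SUM)

if rank == 0:
    print("global mean:", global_sum / global_count)
```

The flip side, which the remaining desiderata point at, is that partitioning, I/O, and the reduction are all hand-written here; nothing is transparent with respect to data layout or parallel architecture.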
Our Ongoing Work: MATE++
• A very efficient MapReduce-like system for scientific data analytics
• Offers both a MapReduce API and an alternative reduction-based API
• Can plug and play with different data formats
• No reloading of data
• Flexibly uses different forms of parallelism
• GPUs, Fusion architecture, ...
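To make the "reduction-based API" idea concrete, the following is only a sketch of what a generalized-reduction style of processing can look like; it is not the actual MATE++ interface, and the class, method names, and use of a Python multiprocessing pool are invented for illustration. Each worker accumulates a partial reduction object over its chunk of the data, and the partial objects are then merged globally.

```python
# Sketch of a generalized-reduction style of processing (not the MATE++ API).
from multiprocessing import Pool
import numpy as np

class MeanVarReduction:
    """Reduction object: accumulates count, sum, and sum of squares."""
    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0

    def accumulate(self, x):          # called on each data chunk locally
        x = np.asarray(x, dtype=float)
        self.n += x.size
        self.s += float(x.sum())
        self.ss += float((x * x).sum())

    def merge(self, other):           # combines two partial reduction objects
        self.n += other.n
        self.s += other.s
        self.ss += other.ss
        return self

    def result(self):
        mean = self.s / self.n
        return mean, self.ss / self.n - mean * mean   # (mean, variance)

def local_reduce(chunk):
    r = MeanVarReduction()
    r.accumulate(chunk)
    return r

if __name__ == "__main__":
    data = np.random.rand(1_000_000)
    chunks = np.array_split(data, 8)                 # stand-in for partitioned input
    with Pool(4) as pool:
        partials = pool.map(local_reduce, chunks)    # local reductions in parallel
    total = MeanVarReduction()
    for p in partials:
        total.merge(p)                               # global combination step
    print("mean, variance:", total.result())
```

The design point this style captures is that the user supplies only accumulate() and merge(); partitioning and parallel execution can be handled by the runtime, which is what makes it possible to plug in different data formats and forms of parallelism.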
Data Management/Reduction Solutions
• Must provide server-side data sub-setting, aggregation, and sampling
• Without reloading data into a `system'
• Our approach: light-weight data management solutions
• Automatic data virtualization
• Supports a virtual (e.g., relational) view over NetCDF, HDF5, etc.
• Supports sub-setting and aggregation using a high-level language
• A new sampling approach based on bit vectors
• Creates lower-resolution representative datasets
• Measures loss of information with respect to key statistical measures
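The bit-vector sampling idea can be made concrete with a small sketch. The selection rule used here (keep every k-th element) and the statistics used to measure information loss (mean, standard deviation, 95th percentile) are illustrative assumptions, not the actual method; the point is only the overall shape: a bit vector marks which elements survive into the lower-resolution dataset, and that dataset is then checked against the full one on a few key statistical measures.

```python
# Sketch of bit-vector based sampling: a boolean "bit vector" marks the
# retained elements, and the information loss of the reduced dataset is
# measured against a few summary statistics of the full dataset.
import numpy as np

def sample_with_bitvector(data, k):
    """Keep every k-th element; returns (bit_vector, sampled_data)."""
    bits = np.zeros(data.size, dtype=bool)
    bits[::k] = True                   # the bit vector: True = element retained
    return bits, data[bits]

def information_loss(full, sampled):
    """Relative error of a few summary statistics on the sampled data."""
    stats = {"mean": np.mean, "std": np.std, "p95": lambda a: np.percentile(a, 95)}
    return {name: abs(f(sampled) - f(full)) / abs(f(full))
            for name, f in stats.items()}

if __name__ == "__main__":
    full = np.random.normal(loc=10.0, scale=2.0, size=1_000_000)
    for k in (2, 8, 32):
        bits, small = sample_with_bitvector(full, k)
        print(f"1/{k} resolution, loss:", information_loss(full, small))
```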