
Comparison of Big Data versus High-Performance Computing: Some Observations

This presentation compares High-Performance Computing (HPC) and Big Data processing frameworks, focusing on Apache Spark and HPC implementations of the DBSCAN clustering algorithm. It discusses their parallel processing capabilities, hardware configurations, a practical experiment, and benchmarked DBSCAN implementations.

Presentation Transcript


  1. Comparison of Big Data versus High-Performance Computing: Some Observations. Helmut Neukirchen, University of Iceland, helmut@hi.is. Research performed as visiting scientist at Jülich Supercomputing Centre (JSC). Computing time granted by JSC. Thanks to Morris Riedel.

  2. Use High-Performance Computing or Big Data processing? • Standard approach for computationally intensive problems: High-Performance Computing (HPC). • Low-level C/C++/Fortran parallel processing implementations, • Low-level send/receive communication (Message Passing Interface, MPI), • Fast interconnects (e.g. InfiniBand), • Fast central storage (RAID array attached via the interconnect). • Is the parallel processing offered by the Big Data framework Apache Spark a competitive alternative to HPC? • High-level Java/Scala/Python implementations, • Convenient high-level programming model (serial view, implicit communication), • Distributed file system (HDFS), slow Ethernet, • Processing is moved to where the data is locally stored (mitigates the slow communication).
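To make the programming-model contrast concrete, here is a minimal Spark sketch in Scala (the application name and input path are made up for illustration): the code reads like a serial program, Spark partitions the input and runs the operations in parallel on the executors, and communication stays implicit instead of being spelled out as MPI send/receive calls.

```scala
import org.apache.spark.sql.SparkSession

object PointCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PointCount").getOrCreate()

    // Looks serial, runs distributed: each executor parses the lines of
    // the HDFS blocks stored on its own node (data locality).
    val points = spark.sparkContext
      .textFile("hdfs:///data/points.csv")          // hypothetical input path
      .map(_.split(","))
      .map(a => (a(0).toDouble, a(1).toDouble))

    // count() triggers a parallel job; no explicit send/receive needed.
    println(s"read ${points.count()} points")
    spark.stop()
  }
}
```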

  3. Spark against HPC – a practical experiment: Hardware • Cluster JUDGE at Jülich Supercomputing Centre. • 39 executor nodes: • each 42 GB usable RAM, • shared by 24 hyperthreads • (2 Intel Xeon X5650: 6 cores = 12 hyperthreads each), • totalling 39*24 = 936 hyperthread cores. • Connected via InfiniBand (for HPC) and Gigabit Ethernet (for Spark). • For Spark: local hard disks, each 222.9 MB/s peak, totalling 8.7 GB/s in parallel; HDFS replication factor 2, 128 MB blocks. • For HPC: storage cluster JUST connected via InfiniBand: 160 GB/s peak.

  4. Spark against HPC – a practical experiment: Application • Density-based spatial clustering of applications with noise (DBSCAN). • Detects arbitrarily shaped clusters, • Detects and can filter noise, • No need to know the number of clusters in advance. • Two parameters: • Spatial search radius ε, • Point density minPts. • At least minPts elements are needed within the ε radius to form a cluster; • otherwise, points are considered noise. Ester, Kriegel, Sander, Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", Proc. Second Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1996.
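As a minimal sketch of the clustering criterion (the Point type and the naive O(n) neighbourhood scan are purely illustrative, not from the cited paper): a point is a core point of a cluster if at least minPts points lie within its ε radius.

```scala
case class Point(x: Double, y: Double)

// Naive core-point test: count all points within distance eps of p
// (p itself included) and compare against the density threshold minPts.
def isCorePoint(p: Point, data: Seq[Point], eps: Double, minPts: Int): Boolean =
  data.count(q => math.hypot(p.x - q.x, p.y - q.y) <= eps) >= minPts
```

A non-core point that lies within ε of a core point becomes a border point of that core point's cluster; all remaining points are classified as noise.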

  5. DBSCAN: Properties • Simple distance calculations (= more like big data), but still floating point (= more like HPC). • Compare each of the n points with each of the remaining n−1 points to see whether their distance is ≤ ε → O(n²). • Spatially sorted data structures (R-trees, R*-trees, kd-trees): compare each of the n points with spatially close points only → O(n log n). • No permanent intermediate result exchange (= more like big data), but still a strong relationship between data points (= more like HPC).
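The effect of spatial sorting can be illustrated with a uniform ε-grid instead of the tree structures named on the slide (a simplification; Point is reused from the sketch above): a neighbourhood query then inspects only the 3×3 block of cells around a point rather than all n points.

```scala
// Bucket points into eps-sized grid cells, keyed by integer cell coordinates.
def buildGrid(data: Seq[Point], eps: Double): Map[(Int, Int), Seq[Point]] =
  data.groupBy(p => ((p.x / eps).floor.toInt, (p.y / eps).floor.toInt))

// All points within eps of p: only the surrounding 3x3 cells can qualify.
def neighbours(p: Point, grid: Map[(Int, Int), Seq[Point]], eps: Double): Seq[Point] = {
  val (cx, cy) = ((p.x / eps).floor.toInt, (p.y / eps).floor.toInt)
  for {
    dx <- -1 to 1
    dy <- -1 to 1
    q  <- grid.getOrElse((cx + dx, cy + dy), Seq.empty)
    if math.hypot(p.x - q.x, p.y - q.y) <= eps
  } yield q
}
```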

  6. DBSCAN: Parallelisation • Originally formulated as a non-parallel ("serial") algorithm. • Clustering itself is possible independently in parallel: 1. Re-shuffle all data to have it spatially sorted (R-trees, R*-trees, kd-trees) to ease decomposition of the input domain into boxes, 2. Independent clustering of the boxes in parallel (ε overlap at box boundaries is needed to deal with points in the border area of neighbouring boxes: "ghost" or "halo" regions), 3. Final result exchange (between neighbouring boxes) to merge clusters spanning multiple boxes.
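Step 3 is essentially a label merge: whenever two boxes assign different local cluster labels to the same halo point, those labels denote one global cluster. A minimal union-find sketch of this idea follows (the pair values are invented for illustration; the benchmarked implementations realise this step differently in detail):

```scala
// Union-find over local cluster labels, with path compression.
class UnionFind(n: Int) {
  private val parent = Array.tabulate(n)(identity)
  def find(i: Int): Int =
    if (parent(i) == i) i else { parent(i) = find(parent(i)); parent(i) }
  def union(i: Int, j: Int): Unit = {
    val (ri, rj) = (find(i), find(j))
    if (ri != rj) parent(rj) = ri
  }
}

// Each pair says: these two local clusters share a halo point.
val haloPairs = Seq((0, 3), (3, 5), (1, 4))   // invented example values
val uf = new UnionFind(6)
haloPairs.foreach { case (a, b) => uf.union(a, b) }
val globalIds = (0 until 6).map(uf.find)      // labels 0, 3, 5 collapse into one
```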

  7. Spark against HPC – a practical experiment: Benchmarked DBSCAN Implementations • HPC (MPI and OpenMP): C++ • HPDBSCAN, arbitrary dimensions, O(n log n), https://bitbucket.org/markus.goetz/hpdbscan. • Spark: Scala/JVM, all 2D only: • Spark DBSCAN, O(n²), https://github.com/alitouka/spark_dbscan, • Spark_DBSCAN, O(n²), https://github.com/aizook/SparkAI, • DBSCAN On Spark, https://github.com/mraad/dbscan-spark • (in fact implements only an approximation of DBSCAN: square domain decomposition cells are used for the density estimate instead of the ε search radius and halos, yielding completely different (= wrong) clusters), • RDD-DBSCAN, O(n log n), https://github.com/irvingc/dbscan-on-spark. • Serial Java/JVM implementation for comparison: • ELKI 0.7.1, O(n log n) using an R*-tree, https://elki-project.github.io.

  8. HPDBSCAN: Domain Decomposition Highly Parallel DBSCAN: decompose the input data into ε-sized cells. Load balancing between processors based on comparison costs (= number of distance calculations): cost per cell := #points * #neighbours. A different number of cells can be assigned to each processor to achieve approximately the same number of comparisons per processor. ε halos at processor boundaries. Götz, Bodenstein, Riedel, "HPDBSCAN: highly parallel DBSCAN", Proc. Workshop on Machine Learning in High-Performance Computing Environments, in conjunction with Supercomputing 2015, ACM.
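Read literally, the cost formula can be computed per cell from the ε-grid (this reuses buildGrid from the earlier sketch; interpreting #neighbours as the points in the cell plus its adjacent cells is an assumption here, and HPDBSCAN's actual balancing scheme may differ in detail):

```scala
// Cost per cell := #points * #neighbours, i.e. roughly the number of
// distance calculations that clustering this cell will cause.
def cellCosts(grid: Map[(Int, Int), Seq[Point]]): Map[(Int, Int), Long] =
  grid.map { case ((cx, cy), pts) =>
    val neighbourPoints = (for {
      dx <- -1 to 1
      dy <- -1 to 1
    } yield grid.getOrElse((cx + dx, cy + dy), Seq.empty).size).sum
    ((cx, cy), pts.size.toLong * neighbourPoints)
  }
```

Cells can then be handed out so that each processor receives approximately the same total cost, which is where a skewed dataset profits over a split by area or by point count alone.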

  9. RDD-DBSCAN: Domain Decomposition • Load balancing based on the number of points (illustration: http://www.irvingc.com/visualizing-dbscan): 1. Initial data space. 2. Recursive horizontal or vertical split of the data space into boxes containing the same number of points; boxes do not get smaller than ε. 3. Grow boxes by ε on each of the 4 sides to achieve overlap. Cordova, Moh, "DBSCAN on Resilient Distributed Datasets", 2015 Int. Conf. on High Performance Computing & Simulation (HPCS), IEEE.
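A serial sketch of this splitting scheme (simplified relative to RDD-DBSCAN's actual code; the 2·ε stopping threshold, which keeps both halves at least ε wide, and the maxPts leaf size are assumptions, and Point is reused from the earlier sketch):

```scala
case class Box(xMin: Double, yMin: Double, xMax: Double, yMax: Double)

// Split along the longer axis at the median point, so both halves contain
// about the same number of points; stop when a box is small or sparse.
def split(pts: Seq[Point], box: Box, eps: Double, maxPts: Int): Seq[Box] =
  if (pts.size <= maxPts ||
      math.min(box.xMax - box.xMin, box.yMax - box.yMin) <= 2 * eps)
    Seq(box)   // leaf box; afterwards grown by eps on all 4 sides (step 3)
  else if (box.xMax - box.xMin >= box.yMax - box.yMin) {
    val (l, r) = pts.sortBy(_.x).splitAt(pts.size / 2)
    val m = r.head.x
    split(l, box.copy(xMax = m), eps, maxPts) ++
      split(r, box.copy(xMin = m), eps, maxPts)
  } else {
    val (l, r) = pts.sortBy(_.y).splitAt(pts.size / 2)
    val m = r.head.y
    split(l, box.copy(yMax = m), eps, maxPts) ++
      split(r, box.copy(yMin = m), eps, maxPts)
  }
```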

  10. Spark against HPC – a practical experiment: Dataset used in Benchmark • Geo-tagged tweets covering the United Kingdom. • Heavily skewed, e.g. most tweets are located in London. • Trivia: Twitter spam noise: geo-tagged spam with nonsense locations. • DBSCAN parameters: ε = 0.01, minPts = 40. • Returns as clusters the locations where people tweet a lot (e.g. tourist spots, cities, but also roads/train tracks, ferries across the Channel). • Size & file format: • 3 704 351 (longitude, latitude) floating-point data points. • Not really big data: • 57 MB in HDF5 binary HPC format, • Spark does not support binary formats very well, so the data had to be converted to CSV: 67 MB in CSV textual format for Spark (= fits into 1 HDFS block). • Dataset: http://hdl.handle.net/11304/6eacaa76-c275-11e4-ac7e-860aa0063d1f
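Loading the CSV variant in Spark is a short chain of operations (a sketch only: the HDFS path is hypothetical, the column layout "longitude,latitude" is assumed, and spark is the SparkSession from the first sketch):

```scala
// Parse "longitude,latitude" lines into coordinate pairs; at 67 MB the
// whole dataset fits into a single HDFS block and easily into memory.
val tweets = spark.sparkContext
  .textFile("hdfs:///data/twitter_uk.csv")
  .map(_.split(","))
  .map(a => (a(0).toDouble, a(1).toDouble))
tweets.cache()   // keep it in RAM for the subsequent clustering stages
```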

  11. Measurements: numbers • Neukirchen, "Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms", Technical Report VHI-01-2016, University of Iceland, November 2016.

  12. Measurements: charts • [Chart: scalability of HPDBSCAN; y-axis: seconds, x-axis: cores.]

  13. Interpretation of Measurements & Implementation • Benchmarking of the O(n²) implementations was aborted (far too slow). • The data is heavily skewed (high density in London): • The domain decomposition of RDD-DBSCAN cannot compete with HPDBSCAN: RDD partitions are not equally filled: • While almost all executors have finished their work, a few long-running tasks remain (processing those boxes that contain a lot of data points): 935 cores idle, but 1 core busy for 5 further minutes... In fact, that high-density box takes so long that 57 cores would be enough for all the rest. • C++ is ≈ 9 times faster than Java/JVM. • Spark Scala/JVM RDD-DBSCAN on 1 core is ≈ 7 times slower than the optimised serial Java/JVM ELKI.

  14. Conclusions • What matters for HPC still applies to Apache Spark: • Implementation complexity matters, • Domain decomposition/load balancing matters. • HPC is faster than Apache Spark: • Java/Scala is significantly slower than C/C++, • unfortunately, no C/C++ big data frameworks are available. • HPC I/O is typically faster, • even though non-local: fast RAID and fast interconnects. • Automated Spark parallel processing is not as good as handcrafted HPC code. • Binary data formats are not well supported by big data frameworks. • But: • HPC hardware is far more expensive, • Spark runs are fault tolerant! • Implementing low-level HPC code takes more effort than high-level Spark code!

  15. Thank you for your attention! • Any questions?

  16. Supercomputing / High-Performance Computing (HPC) Computationally intensive problems. Mainly: Floating Point Operations (FLOP). HPC algorithms are implemented rather low-level (= close to the hardware/fast): programming languages Fortran, C/C++; explicit intermediate result exchange (MPI). Input & output data processed by a node typically fit into its main memory (RAM).

  17. HPC hardware Compute nodes: fast CPUs. Nodes connected via fast interconnects (e.g. InfiniBand). Parallel file system storage: accessed by the compute nodes via the interconnect. Many hard disks in parallel (RAID): high aggregated bandwidth. Very expensive, but needed for the highest performance of the HPC processing model: read input once, compute & exchange intermediate results, write the final result.

  18. Big Data processing Typically simple operations instead of number crunching, e.g. a search engine crawling the web: index words & links on web pages. Algorithms require little intermediate result exchange. Input/Output (I/O) of data is the most time-consuming part; computation and communication are less critical. Big data algorithms can therefore be implemented rather high-level: programming languages Java, Scala, Python. Big data platform: Apache Spark (in the past: Apache Hadoop/MapReduce): automatically reads new data chunks, automatically executes the algorithm implementation in parallel, automatically exchanges intermediate results as needed.
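The indexing example maps directly onto the canonical Spark word count (a sketch; the paths are hypothetical and spark is the SparkSession from the first sketch). A handful of high-level operations replace what would be hand-written partitioning and message passing in MPI:

```scala
// Count word occurrences across crawled pages; the shuffle behind
// reduceByKey is the only communication, and Spark handles it implicitly.
val counts = spark.sparkContext
  .textFile("hdfs:///crawl/pages/*")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///crawl/wordcounts")
```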

  19. Big Data hardware Cheap standard PC nodes with local storage, Ethernet network. Distributed file system (HDFS): each node stores a part of the whole data locally. Hadoop/Spark move the processing of data to where the data is locally stored. A slow network connection is not critical. Cheap hardware is more likely to fail: Hadoop and Spark are fault tolerant. Processing model: read a chunk of local data, process the chunk locally, repeat; finally: combine and write the result.
