Explore the differences between supercomputing and big data processing, including their hardware, algorithms, and applications. Discover why both are essential in today's data-driven world.
Supercomputing versus Big Data processing — What's the difference? Helmut Neukirchen, helmut@hi.is, Professor for Computer Science and Software Engineering
The Big Data buzz • Google search requests 1/2004–10/2016, “Supercomputer” vs. “Big Data”. [Chart: Google Trends comparison of the two search terms]
Excursion: Moore’s Law • “The number of transistors in an integrated circuit doubles every two years.” • Clock speed and performance per clock cycle each used to double every two years as well. • This is not true anymore! http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/
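The doubling rule above can be put into a tiny back-of-the-envelope formula. This is a hypothetical illustration, not from the slides; the function name is my own:

```python
# Hypothetical sketch: a doubling every two years means that over n years
# the transistor count grows by a factor of 2 ** (n / 2).

def moores_law_growth(years: float) -> float:
    """Growth factor predicted by a doubling every two years."""
    return 2 ** (years / 2)

# Over one decade, the prediction is a 32-fold increase.
print(moores_law_growth(10))  # 32.0
```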
Consequences of hitting physical limits • Today, the only way to achieve further speed-up is parallel processing: • Many cores per CPU, • Many CPU nodes. • Both Big Data processing and supercomputing use this approach. • Let us investigate both to see their differences! https://hpc.postech.ac.kr/~jangwoo/research/research.html
Supercomputing / High-Performance Computing (HPC) • Computationally intensive problems. Mainly: • Floating-point operations (FLOP), • Numerical problems, e.g. weather forecast. • HPC algorithms are implemented rather low-level (= close to the hardware, i.e. fast): • Programming languages: Fortran, C/C++. • Explicit exchange of intermediate results (e.g. via the Message Passing Interface, MPI). • The input and output data processed by a node typically fit into its main memory (RAM). • Output of similar size as input. http://www.vedur.is/vedur/frodleikur/greinar/nr/3226 https://www.quora.com/topic/Message-Passing-Interface-MPI
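To make the HPC processing model concrete, here is a hypothetical pure-Python sketch (not the speaker's code) of the kind of numerical kernel an HPC job runs: a 1D heat-equation update, with the domain split across two simulated "nodes". Each time step requires an explicit exchange of the boundary (halo) values; in a real HPC code these would be MPI messages between ranks.

```python
# Hypothetical sketch: explicit exchange of intermediate results between
# two subdomains of a 1D heat-equation stencil. In real HPC code, the
# halo exchange below would be MPI_Send/MPI_Recv calls.

def step(local, left_halo, right_halo, alpha=0.1):
    """One explicit finite-difference update of a local subdomain."""
    padded = [left_halo] + local + [right_halo]
    return [padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Split a domain of 8 cells across two simulated nodes: hot half, cold half.
node0 = [100.0, 100.0, 100.0, 100.0]
node1 = [0.0, 0.0, 0.0, 0.0]

for _ in range(10):
    # Explicit halo exchange: each node sends its edge cell to its neighbour.
    halo_0_to_1 = node0[-1]
    halo_1_to_0 = node1[0]
    node0 = step(node0, node0[0], halo_1_to_0)   # mirror boundary on the left
    node1 = step(node1, halo_0_to_1, node1[-1])  # mirror boundary on the right

print(node0, node1)  # heat diffuses from node0 into node1
```

Note the contrast with the Big Data model discussed later: here, correctness depends on fast, frequent communication of intermediate results, which is exactly why HPC clusters need fast interconnects.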
HPC hardware • Compute nodes: fast CPUs. • Nodes connected via fast interconnects (e.g. InfiniBand). • Parallel file system storage: accessed by the compute nodes via the interconnect. • Many hard disks in parallel (RAID): high aggregated bandwidth. • Expensive, but needed for highest performance of the HPC processing model: • Read input once, compute & exchange intermediate results, write the final result. Supercomputer at Icelandic Meteorological Office, owned by Danish Meteorological Institute. Thor: 100 Tera FLOP/s, Freya: 100 Tera FLOP/s, Storage 1500 Tera Byte. For comparison: Garpur @ Reiknistofnun Háskóla Íslands: 37 Tera FLOP/s. http://www.dmi.dk/nyheder/arkiv/nyheder-2016/marts/ny-supercomputer-i-island-en-billedfortaelling/
Big Data • Data created in the age of the Internet: • Volume (amount of data): • Unlikely to fit into the main memory (RAM) of a cluster, • Need to process the data chunk by chunk, • Extract a condensed summary as output. • Variety (range of data types and sources), • Velocity (speed of data in and out). https://youtu.be/H7NLECdBnps
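The "process chunk by chunk, extract a condensed summary" idea can be sketched in a few lines of Python. This is a hypothetical illustration (the `summarise` function and the `StringIO` stand-in for a huge file are my own); only the small summary is ever held in memory, never the whole data set:

```python
import io
from collections import Counter

# Hypothetical sketch: read a stream far larger than RAM in fixed-size
# chunks and keep only a condensed summary (word counts) in memory.

def summarise(stream, chunk_size=64):
    counts = Counter()
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            counts.update(leftover.split())
            break
        chunk = leftover + chunk
        # Keep a possibly split trailing word for the next chunk.
        body, _, leftover = chunk.rpartition(" ")
        counts.update(body.split())
    return counts

# StringIO stands in for a file that would not fit into RAM.
data = io.StringIO("big data " * 100 + "supercomputer")
summary = summarise(data)
print(summary.most_common(2))
```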
Big Data processing • Typically simple operations instead of number crunching. • E.g. a search engine crawling the web: index words & links on web pages. • The algorithms require little exchange of intermediate results: • Input/output (I/O) of data is the most time-consuming part, • Computation and communication are less critical. • Big Data algorithms can be implemented rather high-level: • Programming languages: Java, Scala, Python. • Big Data platforms: Apache Hadoop, Apache Spark: • Automatically read new data chunks, • Automatically execute the algorithm implementation in parallel, • Automatically exchange intermediate results as needed. http://www.semantic-evolution.com
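The search-engine indexing job from the slide follows the map/shuffle/reduce pattern that Hadoop and Spark automate. Below is a hypothetical pure-Python sketch of that pattern (the page contents are made up); on a real cluster, the map step would run in parallel on many nodes and the platform would handle the shuffle:

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sketch of map/shuffle/reduce: count how often each word
# occurs across "web pages" -- the indexing job a search engine runs.

pages = {
    "page1.html": "big data needs big clusters",
    "page2.html": "supercomputer clusters need fast interconnects",
}

# Map: each page is processed independently (in parallel on a real cluster).
mapped = chain.from_iterable(
    ((word, 1) for word in text.split()) for text in pages.values()
)

# Shuffle + reduce: group by key and sum. This is the only step that would
# move (small) intermediate results between nodes.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(counts["big"])  # 2
```

In Hadoop or Spark, the programmer writes only the map and reduce logic at this high level; chunking, parallel execution, and the exchange of intermediate results happen automatically.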
Big Data hardware • Cheap standard PC nodes with local storage, Ethernet network. • Distributed file system: each node stores locally a part of the whole data. • Hadoop/Spark move the processing of data to where the data is locally stored. • A slow network connection is not critical. • Cheap hardware is more likely to fail: Hadoop and Spark are fault-tolerant. • Processing model: read a chunk of local data, process the chunk locally, repeat; finally: combine and write the result. https://www.flickr.com/photos/cmnit/2040385443
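The "move the computation to the data" idea can be illustrated with a hypothetical sketch (node names and data are made up): each simulated node processes only the partition it stores locally, and only tiny per-node summaries, not the raw data, are combined over the network.

```python
# Hypothetical sketch of data locality: the distributed file system keeps
# each partition on one node; processing runs where the data lives, and
# only small summaries cross the (slow, cheap) network.

local_storage = {
    "node-a": [3, 1, 4, 1, 5],
    "node-b": [9, 2, 6, 5, 3],
}

def process_locally(partition):
    """Runs on the node that stores the partition (code moves to data)."""
    return {"count": len(partition), "total": sum(partition)}

# Only these tiny summaries travel over the network.
summaries = [process_locally(p) for p in local_storage.values()]

result = {
    "count": sum(s["count"] for s in summaries),
    "mean": sum(s["total"] for s in summaries)
            / sum(s["count"] for s in summaries),
}
print(result)  # {'count': 10, 'mean': 3.9}
```

This is why a slow Ethernet network is acceptable for Big Data clusters: the bulk of the data never leaves the node that stores it.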
HPC vs. Big Data • We need both HPC and Big Data processing: • Do not run compute-/communication-intensive HPC jobs on a Big Data cluster: • Slower CPUs, • Slower communication, • Slower high-level implementations. • Do not run Big Data jobs on an HPC cluster: • Typically slower (fast local data access is missing), • Waste of money to use expensive HPC hardware.
HPC and Big Data @ HÍ • We research & teach both at the Computer Science department: • Guest Prof. Dr. Morris Riedel, Prof. Dr. Helmut Neukirchen: • HPC: REI101F High Performance Computing A. • Big Data: REI102F High Performance Computing B, TÖL503M/TÖL102F Distributed Systems. • By inventing clever algorithms, HPC/Big Data may not even be needed: • 15:45–16:00, Askja 131: Páll Melsted: “Kallisto: hvernig RNA greining sem tók hálfan dag tekur nú 5 mínútur” (“Kallisto: how an RNA analysis that took half a day now takes 5 minutes”). • Thank you for your attention! Any questions or comments?