
Supercomputing versus Big Data processing — What's the difference?

Explore the differences between supercomputing and big data processing, including their hardware, algorithms, and applications. Discover why both are essential in today's data-driven world.


Presentation Transcript


  1. Supercomputing versus Big Data processing — What's the difference? Helmut Neukirchen (helmut@hi.is), Professor for Computer Science and Software Engineering

  2. The Big Data buzz • Google search requests 1/2004–10/2016, “Supercomputer” vs. “Big Data”. [Chart: search-interest curves for “Big Data” and “Supercomputer”.]

  3. Excursion: Moore’s Law • “The number of transistors in an integrated circuit doubles every two years.” • Clock speed and performance per clock cycle each used to double every two years as well. • Not true anymore! http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/

  4. Consequences of hitting physical limits • Today, the only way to achieve more speed: • Parallel processing: • Many cores per CPU, • Many CPU nodes. • Both Big Data processing and Supercomputing use this approach (see the sketch below). • Investigate them to see their difference! https://hpc.postech.ac.kr/~jangwoo/research/research.html
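
As a minimal illustration of the many-cores approach (a Python sketch; the slides name no concrete example), a CPU-bound toy task can be split across all cores of a single node:

import multiprocessing as mp

def count_primes(bounds):
    """CPU-bound toy task: count primes in [lo, hi) by trial division."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Split the range 0..1_000_000 into one chunk per core.
    cores = mp.cpu_count()
    step = 1_000_000 // cores
    chunks = [(i * step, (i + 1) * step) for i in range(cores)]
    with mp.Pool(cores) as pool:
        # Each chunk is processed in parallel on its own core.
        print(sum(pool.map(count_primes, chunks)))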

  5. Supercomputing / High-Performance Computing (HPC) • Computationally intensive problems. Mainly: • Floating-point operations (FLOP), • Numerical problems, e.g. weather forecast. • HPC algorithms are implemented rather low-level (= close to hardware, i.e. fast): • Programming languages: Fortran, C/C++. • Explicit exchange of intermediate results (see the MPI sketch below). • Input & output data processed by a node typically fit into its main memory (RAM). • Output of similar size as input. http://www.vedur.is/vedur/frodleikur/greinar/nr/3226 https://www.quora.com/topic/Message-Passing-Interface-MPI
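
The explicit exchange of intermediate results is typically done with the Message Passing Interface (MPI). A minimal sketch using the mpi4py package, with a made-up numerical task (a distributed sum of squares):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank (process/node) works on its own slice of the problem ...
local = np.arange(rank * 1000, (rank + 1) * 1000, dtype=np.float64)
partial = float(np.sum(local * local))

# ... and the intermediate results are exchanged explicitly:
total = comm.allreduce(partial, op=MPI.SUM)

if rank == 0:
    print("global sum of squares:", total)

Run with, e.g., mpirun -np 4 python sumsq.py (assuming an MPI installation and mpi4py are available).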

  6. HPC hardware • Compute nodes: fast CPUs. • Nodes connected via fast interconnects (e.g. InfiniBand). • Parallel file system storage: accessed by compute nodes via the interconnect. • Many hard disks in parallel (RAID): high aggregated bandwidth. • Expensive, but needed for the highest performance of the HPC processing model: • Read input once, compute & exchange intermediate results, write the final result. Supercomputer at the Icelandic Meteorological Office, owned by the Danish Meteorological Institute: Thor: 100 Tera FLOP/s, Freya: 100 Tera FLOP/s, Storage: 1500 Tera Byte. For comparison: Garpur @ Reiknistofnun Háskóla Íslands: 37 Tera FLOP/s. http://www.dmi.dk/nyheder/arkiv/nyheder-2016/marts/ny-supercomputer-i-island-en-billedfortaelling/

  7. Big Data • Data created in the age of the Internet: • Volume (amount of data): • Unlikely to fit into the main memory (RAM) of a cluster. • Need to process data chunk by chunk (see the sketch below). • Extract a condensed summary as output. • Variety (range of data types and sources). • Velocity (speed of data in and out). https://youtu.be/H7NLECdBnps
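
A plain-Python sketch of this chunk-by-chunk model: stream input that is larger than RAM and keep only a condensed summary in memory (the input file name is hypothetical):

from collections import Counter

summary = Counter()
# Stream the input line by line instead of loading it into RAM;
# only the condensed summary (word frequencies) is kept in memory.
with open("weblog.txt") as f:  # hypothetical input, larger than RAM
    for line in f:
        summary.update(line.split())

# The output is far smaller than the input:
print(summary.most_common(10))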

  8. Big Data processing • Typically simple operations instead of number crunching. • E.g. a search engine crawling the web: index words & links on web pages. • Algorithms require little exchange of intermediate results. • Input/output (I/O) of data is the most time-consuming part. • Computation and communication are less critical. • Big Data algorithms can be implemented rather high-level: • Programming languages: Java, Scala, Python. • Big Data platforms: Apache Hadoop, Apache Spark: • Automatically read new data chunks, • Automatically execute the algorithm implementation in parallel, • Automatically exchange intermediate results as needed (see the sketch below). http://www.semantic-evolution.com
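
The classic example of such high-level Big Data code is a word count in Apache Spark's Python API: Spark splits the input into chunks (partitions), runs the supplied functions in parallel, and exchanges intermediate results in the reduceByKey step automatically. A minimal sketch; the HDFS paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# Spark splits the file into partitions and processes them in parallel.
counts = (sc.textFile("hdfs:///data/webpages.txt")   # placeholder path
            .flatMap(lambda line: line.split())      # words per line
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))        # automatic exchange

counts.saveAsTextFile("hdfs:///data/wordcounts")     # placeholder path
spark.stop()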

  9. Big Data hardware • Cheap standard PC nodes with local storage, Ethernet network. • Distributed file system: each node stores locally a part of the whole data. • Hadoop/Spark move the processing of data to where the data is locally stored. • A slow network connection is not critical. • Cheap hardware is more likely to fail: Hadoop and Spark are fault tolerant. • Processing model: read a chunk of local data, process the chunk locally, repeat; finally: combine and write the result. https://www.flickr.com/photos/cmnit/2040385443

  10. HPC vs. Big Data • We need both, HPC and Big Data processing: • Do not run compute/communication-intensive HPC jobs on a Big Data cluster: • Slower CPUs, • Slower communication, • Slower high-level implementations. • Do not run Big Data jobs on an HPC cluster: • Typically slower (fast local data access is missing), • A waste of money to use expensive HPC hardware.

  11. HPC and Big Data @ HÍ • Research & teaching on both at the Computer Science department: • Guest Prof. Dr. Morris Riedel, Prof. Dr. Helmut Neukirchen: • HPC: REI101F High Performance Computing A. • Big Data: REI102F High Performance Computing B, TÖL503M/TÖL102F Distributed Systems. • By inventing clever algorithms, HPC/Big Data may not even be needed: • 15:45–16:00, Askja 131: Páll Melsted: “Kallisto: how an RNA analysis that used to take half a day now takes 5 minutes”. • Thank you for your attention! Any questions or comments?
