Explore various parallel architectures and systems used in parallel and distributed computing, such as ARM11, Intel Pentium 4, Intel Skylake, Tianhe-2, and Google Cloud. Understand the limitations of sequential systems and the need for parallel processing. Learn about the challenges and advancements in reducing power consumption and increasing processing speed. Study concepts like pipeline processing, instruction-level parallelism, superscalar architectures, and Amdahl's law.
Big Data Technologies
Lecture 2: Architectures and systems for parallel and distributed computing
Assoc. Prof. Marc FRÎNCU, PhD Habil.
marc.frincu@e-uvt.ro
What is a parallel architecture?
• "A collection of processing elements that cooperate to solve large problems fast" – Almasi and Gottlieb, 1989
• ARM11 (2002-2005)
  • 90% of embedded systems are based on ARM processors
  • Raspberry Pi
  • 8-stage pipeline
  • 8 in-flight instructions (out-of-order)
• Intel Pentium 4 (2000-2008)
  • 124 in-flight instructions
  • 31-stage pipeline
  • Superscalar: in-processor instruction-level parallelism
• Intel Skylake (August 2015)
  • Quad core
  • 2 threads per core
  • GPU
• Tianhe-2 (2013)
  • 32,000 Intel Xeon CPUs with 12 cores each
  • 48,000 Intel Xeon Phi coprocessors with 57 cores each
  • 34 petaflops
• Sunway TaihuLight
  • First place in the world in November 2017
  • 40,960 SW26010 CPUs with 260 cores each
  • 125 petaflops
• Google Cloud
  • Cluster farms
  • 10 Tbps bandwidth (US-Japan)
Motivation
• The performance of sequential systems is limited
  • Computation/data transfer goes through logic gates and memory devices
  • Latency > 0
  • Even in an ideal environment we are still limited by the speed of light
• Many applications require performance
  • Nuclear reaction modelling
  • Climate modelling
  • Big Data
    • Google: 40,000 searches/min
    • Facebook: 1 billion users every day
• Technological breakthroughs
  • What do we do with so many transistors?
Examples
• Titan supercomputer
  • 299,000 AMD x86 cores
  • 18,688 NVIDIA GPUs
Transistors vs. speed
• More transistors per chip means more speed
  • Transistors are electronic devices that act as switches, built to form logic gates
  • The speed of each operation is bounded by the time a transistor needs to switch off/on without causing errors
  • A smaller transistor switches faster
• Example
  • 3 GHz = 3 billion ops/sec
• Increase density to increase speed
  • Intel P4: 170 million transistors
  • Intel 15-core Xeon Ivy Bridge: 4.3 billion
• On average, processing power has increased by about 60% per year
Transistors vs. speed
• Moore's law
  • Ideally: the number of transistors doubles each year (2x)
  • In reality: 5x every 5 years (≈1.35x per year)
Transistors vs. speed
• Why can't we increase speed forever?
  • Chip size = constant
  • But the density of transistors keeps increasing
    • 2018: 10 nm Intel Cannon Lake
• Dennard scaling
  • The power needed to operate the transistors stays constant even if the number of transistors per chip increases
  • But this is no longer true (as of 2006)!
    • Transistors are becoming so small that their integrity breaks down and they leak current
    • The faster we switch a transistor on/off, the more heat it generates
    • 8.5-9 GHz requires liquid nitrogen cooling!
How do we reduce power consumption?
• Frequency ∝ voltage
• Gate energy ∝ voltage²
• Executing the same number of cycles at a lower voltage and speed → power savings
• Example
  • Task with a deadline of 100 ms
  • Method #1: run 50 ms at full speed, then stay idle for 50 ms
  • Method #2: run 100 ms at frequency/2 and voltage/2
    • Energy requirement: energy/4 → 4x cheaper
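The 4x figure can be checked with a short derivation; a minimal sketch assuming dynamic (switching) power dominates and the switched capacitance stays constant (both assumptions are mine, not stated on the slide):

```latex
% Energy per cycle for dynamic power, constants dropped:
E_{\text{cycle}} \propto V^{2}
\\[4pt]
% Method #1: N cycles at voltage V;  Method #2: the same N cycles at V/2
E_{1} \propto N\,V^{2}, \qquad
E_{2} \propto N\left(\tfrac{V}{2}\right)^{2} = \tfrac{N\,V^{2}}{4}
\quad\Rightarrow\quad \frac{E_{1}}{E_{2}} = 4
```

Both methods execute the same number of cycles (half the frequency for twice the time), so only the voltage term changes.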
Sequential processing
• 6 hours to wash 4 loads of laundry
Pipelined sequential processing
• Pipeline = start processing the next task IMMEDIATELY
• Improves system throughput
• Multiple tasks operate in parallel
• Reduces the time to 3.5 hours
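The 6-hour and 3.5-hour figures are consistent with the classic laundry pipeline example; a sketch assuming 30 min washing, 40 min drying, and 20 min folding per load (the stage durations are my assumption, since the original figure is not reproduced here):

```latex
T_{\text{seq}}  = 4 \times (30 + 40 + 20)\ \text{min} = 360\ \text{min} = 6\ \text{h}
\\[4pt]
T_{\text{pipe}} = 30 + 4 \times 40 + 20\ \text{min} = 210\ \text{min} = 3.5\ \text{h}
```

Once the pipeline is full, throughput is limited by the slowest stage (the 40-minute dryer), not by the total per-load latency.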
Conditional pipeline (branching)
• What happens when we have dependencies between instructions?
  • Especially for if-else branching
• The processor must flush the instructions already fetched into the pipeline: reaching a branch means they were fetched from the wrong path, and the instructions of the chosen branch must be fetched instead
• AMD Ryzen (2017) uses neural networks to predict the execution path (see the C sketch after this slide)
[Pipeline diagram: fetch, decode, fetch operands, generate address, execute, store; instruction i+2 is a branch, so either instruction j or i+3 executes next]
https://courses.cs.washington.edu/courses/csep548/00sp/handouts/lilja.pdf
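A small illustrative C sketch (mine, not from the lecture) of why branch prediction matters: the same loop with the same data-dependent branch tends to run noticeably faster once the data is sorted, because the branch outcome becomes predictable. Compile with little or no optimization (e.g. `-O1`) so the compiler does not replace the branch with branchless or vectorized code.

```c
/* Branch-prediction sketch (illustrative): the same data-dependent branch
 * is cheap when it is predictable (sorted input) and expensive when it
 * is not (random input). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static long sum_above_threshold(const int *a, int n, int threshold) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] >= threshold)   /* data-dependent branch */
            sum += a[i];
    }
    return sum;
}

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

int main(void) {
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_above_threshold(a, N, 128);   /* random order: frequent mispredictions */
    clock_t t1 = clock();

    qsort(a, N, sizeof *a, cmp_int);            /* sorted: long taken/not-taken runs */
    clock_t t2 = clock();
    long s2 = sum_above_threshold(a, N, 128);
    clock_t t3 = clock();

    printf("unsorted: sum=%ld, %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   sum=%ld, %.3f s\n", s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```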
Superscalar architectures
• Execute more than one instruction per CPU cycle (a small ILP sketch follows this slide)
  • Issue multiple instructions to redundant functional units
• Hyperthreading (Intel)
  • Example: Intel P4
    • One instruction stream (thread) processes integers (integer ALU) while another processes floating-point numbers (floating-point ALU)
    • The OS thinks it deals with 2 processors
  • Accomplished by combining a series of shared, replicated, or partitioned resources:
    • Registers
    • Arithmetic units
    • Cache memory
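A hedged C sketch of instruction-level parallelism (my illustration, not the slide's example): splitting a reduction across independent accumulators gives a superscalar/out-of-order core separate dependency chains that it can issue in the same cycle, whereas a single accumulator serializes on its own result. (Results may differ in the last bits because the floating-point additions are reassociated.)

```c
/* ILP sketch (illustrative): two independent dependency chains can be
 * executed in parallel by a superscalar core, while a single chain
 * serializes on its own result. */
#include <stddef.h>

double dot_single_chain(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];           /* every add depends on the previous one */
    return acc;
}

double dot_two_chains(const double *x, const double *y, size_t n) {
    double acc0 = 0.0, acc1 = 0.0;    /* independent accumulators -> independent chains */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                        /* leftover element when n is odd */
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```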
Amdahl's law
• How much can we improve by parallelizing?
• Example
  • Floating-point instructions
    • Theoretical speedup: 2x
    • Percentage of the total #instructions: 10%
  • Overall speedup: ≈1.053
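The 1.053 figure follows directly from Amdahl's law with the slide's numbers (a fraction f = 0.1 of the instructions is sped up by a factor s = 2):

```latex
\text{Speedup} = \frac{1}{(1-f) + \frac{f}{s}}
               = \frac{1}{(1-0.1) + \frac{0.1}{2}}
               = \frac{1}{0.95}
               \approx 1.053
```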
End of the single core?
• Frequency stopped scaling
  • Dennard scaling no longer holds
• The memory wall
  • Data and instructions must be fetched into registers (through the cache)
  • Memory becomes the critical point
• The ILP wall
  • Dependencies between instructions limit the efficiency of instruction-level parallelism
Multi-core solution
• More cores on the same CPU
  • Better than hyperthreading
    • Real parallelism
• Example
  • Reducing speed (frequency) by 30% reduces power to roughly 35% of the original
    • Power ∝ frequency³ (or worse)
  • But performance is also reduced by 30%
  • Having 2 cores per chip at 70% speed → 140% of the original performance at 70% of the power
    • 40% increase in performance with a 30% saving in power
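A quick sketch of the arithmetic behind these numbers, assuming the cubic power relation stated on the slide:

```latex
P \propto f^{3}: \qquad P_{0.7f} = 0.7^{3}\,P_{f} \approx 0.34\,P_{f}
```

Two cores at 0.7f then deliver about 2 × 0.7 = 1.4 of the single-core performance while drawing about 2 × 0.34 ≈ 0.69 of the single-core power.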
IBM POWER4
• Introduced in 2001
Multicore
• Intel i7
  • 6 cores
  • 12 threads
    • Parallelism through hyperthreading
Different architectures
• Intel Nehalem (Atom, i3, i5, i7)
  • Cores are linked through the QuickPath Interconnect (QPI)
    • Mesh architecture (all-to-all)
    • 25.6 GB/s
• Older Intel versions (Penryn)
  • Cores are linked through the FSB (Front Side Bus)
    • Sequential (shared bus) architecture
    • 3.2 GB/s (P4) – 12.8 GB/s (Core 2 Extreme)
AMD Infinity Fabric
• HyperTransport protocol
  • 30-50 GB/s
  • 512 GB/s for the Vega GPU
• Mesh network
  • Network on a chip, clustering
• Links GPUs and the SoC
• CCIX standard: accelerators, FPGAs
Many-core
• Systems with tens or hundreds of cores
• Developed for parallel computing
  • High throughput and low energy consumption (sacrifices latency)
• Cache coherence, as used in multi-core systems (few cores), becomes problematic at this scale
  • Instead they use: message passing, DMA, PGAS (Partitioned Global Address Space)
• Not efficient for applications using just one thread
• Example
  • Xeon Phi with 59-72 cores
  • GPUs: Tesla K80 with 4,992 CUDA cores
CPU vs. GPU architecture • GPU has more transistors for computations
Parallel processing models: Flynn's classification
• SISD (Single Instruction, Single Data)
  • Uniprocessor
• MISD (Multiple Instruction, Single Data)
  • Multiple processors operate on a single data stream
  • Systolic processor
  • Stream processor
• SIMD (Single Instruction, Multiple Data) – see the SSE sketch after this slide
  • The same instruction is executed on multiple processors
  • Each processor has its own memory (different data)
  • Shared memory for control and instructions
  • Good for data-level parallelism
  • Vector processors
  • GPUs (partially)
• MIMD (Multiple Instruction, Multiple Data)
  • Each processor has its own data and instructions
  • Multiprocessors
  • Multithreaded processors
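A tiny C sketch of the SIMD idea using x86 SSE intrinsics (my illustration; it assumes an SSE-capable x86 CPU and a build flag such as `gcc -msse`): a single instruction performs four float additions at once.

```c
/* SIMD sketch (illustrative): one instruction operates on multiple data. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, FOUR additions */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```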
Centralized shared memory multiprocessors
• Symmetric MultiProcessor (SMP)
  • Processors are interconnected through a UMA (Uniform Memory Access) backbone
  • Does not scale well
Distributed shared memory multiprocessors
• SMP clusters
• NUMA (Non-Uniform Memory Access)
  • Physical memory local to each processor (but the address space is shared)
  • Avoids the starvation caused when a single memory can be accessed by only one processor at a time
• AMD processors implement the model through HyperTransport (2003), Intel through QPI (2007)
Distributed shared memory multiprocessors
• Enhanced scalability
  • Low latency for local accesses
  • Scalable memory bandwidth at low cost
• But
  • Higher inter-processor communication times
    • Network technology is vital!
  • More complex software model
Message passing multiprocessors
• Multicomputers
  • Communication is based on message passing between processors, not on shared memory access
  • Can call remote methods: Remote Procedure Call (RPC)
  • Libraries: MPI (Message Passing Interface) – a minimal example follows this slide
  • Synchronous communication
    • Causes process synchronization
  • The address space is split into private addresses for each distinct processor
• Example
  • Massively Parallel Processing (MPP)
    • IBM Blue Gene
  • Clusters
    • Built by linking computers in a LAN
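A minimal MPI sketch in C (my illustration of the message-passing model; it assumes an MPI implementation such as MPICH or Open MPI is installed): rank 1 explicitly sends a value to rank 0, and the blocking receive on rank 0 doubles as synchronization.

```c
/* Minimal message-passing sketch with MPI (illustrative).
 * Build: mpicc hello_mpi.c -o hello_mpi
 * Run:   mpirun -np 2 ./hello_mpi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us */

    if (rank == 1) {
        int value = 42;
        /* explicit communication: send one int to rank 0 */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int value = 0;
        /* blocking receive: also acts as natural synchronization */
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank 1 (out of %d processes)\n", value, size);
    }

    MPI_Finalize();
    return 0;
}
```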
Shared memory vs. message passing
• Shared memory
  • Easy programming (see the pthreads sketch after this slide)
  • Hides the network but does not hide its latency
  • Communication is handled in hardware
    • To reduce communication overhead
• Message passing
  • Explicit communication
    • Can be optimized
  • Natural synchronization
    • When sending messages
  • Programming is harder, since it must handle aspects that shared memory systems hide
    • Transport cost
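As a shared-memory counterpart to the MPI sketch above, a small POSIX-threads example (my illustration): both threads read and write the same variable, so communication is implicit, but the programmer must add the synchronization (here a mutex) that message passing gets for free.

```c
/* Shared-memory sketch with POSIX threads (illustrative).
 * Build: gcc shared_counter.c -o shared_counter -lpthread */
#include <stdio.h>
#include <pthread.h>

#define ITERATIONS 1000000

static long counter = 0;                                  /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* explicit synchronization */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);    /* without this, updates are lost (data race) */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected %d)\n", counter, 2 * ITERATIONS);
    return 0;
}
```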
Distributed systems
• "A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent. Distributed systems usually use some kind of client-server organization." – FOLDOC
• "A Distributed System comprises several single components on different computers, which normally do not operate using shared memory and as a consequence communicate via the exchange of messages. The various components involved cooperate to achieve a common objective such as the performing of a business process." – Schill & Springer
Parallel vs. distributed systems
• Parallelism
  • Executing multiple tasks at the same time
  • True parallelism requires as many cores as parallel tasks
• Concurrent execution
  • Thread-based computing
  • Can use hardware parallelism but usually derives from software requirements
    • Example: the effects of multiple system calls
  • Becomes parallelism when true parallelism is available
• Distributed computing
  • Refers to where the computation is performed
  • Computers are linked in a network
  • Memory is distributed
  • Distribution is usually part of the objective
    • If resources are distributed then we have a distributed system
  • Raises many problems from a programming point of view
    • No global clock, synchronization, unpredictable errors, variable latency, security, interoperability
Distributed system models
• Minicomputer
• Workstation
• Workstation-server
• Processor pool
• Cluster
• Grid
• Cloud
Minicomputer
• Extension of time-sharing systems
  • The user logs on to their home machine
  • Authenticates remotely through telnet
• Shared resources
  • Databases
  • HPC
[Diagram: several minicomputers connected through the ARPAnet]
Workstation
• Process migration
  • The user authenticates on their own machine
  • If networked resources are available, the process migrates there
• Issues
  • How do we identify available resources?
  • How do we migrate a process?
  • What happens if another user logs on to the available resource?
[Diagram: several workstations connected by a 100 Gbps LAN]
Workstation – server
• Client stations
  • No local hard disk
  • Interactive/graphical processes are executed locally
  • All files and computations are sent to the server
• Servers
  • Each machine is dedicated to a certain type of job
• Communication model
  • RPC (Remote Procedure Call)
    • C
  • RMI (Remote Method Invocation)
    • Java
  • A client process invokes a server process
  • There is no process migration between machines
[Diagram: workstations on a 100 Gbps LAN connected to minicomputer file, HTTP, and cycle servers]
Processor pool
• Client
  • The user authenticates on a remote machine
  • All services are delegated to servers
• Server
  • Allocates the required number of processors to each client
• Better utilization but less interactivity
[Diagram: clients and servers 1..N connected by a 100 Gbps LAN]
Cluster
• Client
  • Client-server model
• Server
  • Consists of many machines interconnected through a high-speed network
  • The aim is performance
    • Parallel processing
[Diagram: workstations on a 100 Gbps LAN access HTTP servers 1..N, which sit in front of a master node and slave nodes 1..N linked by a 1 Gbps SAN]
Grid
• Aim
  • Collect the processing power of many clusters or parallel systems and make it available to users
  • Similar in concept to the power grid
    • You buy only what you use
• HPC distributed computing
  • Large problems requiring many resources
• On demand
  • Remote resources are integrated with local ones
• Big Data
  • Data is distributed
• Shared computing
  • Communication between resources
[Diagram: supercomputers, clusters, minicomputers, and workstations connected through a high-speed information highway]
Cloud
• A distributed system where access to resources is virtualized and on demand, while the topology stays hidden
• Pay per use (per second, GB, query, etc.)
• Access levels
  • Infrastructure (IaaS)
  • Platform (PaaS)
  • Services (SaaS)
  • Data (DaaS)
• Examples
  • Amazon EC2
  • Google Compute Engine
  • Microsoft Azure
[Diagram: workstations access specific services, VMs, and a database over the Internet]
Summary
• Shared memory
  • Homogeneous resources
  • ns access times
• Message passing
  • Homogeneous/heterogeneous resources
  • μs access times
• Distributed systems
  • Heterogeneous resources
  • ms access times
Sources
• http://slideplayer.com/slide/5704113/
• https://www.comsol.com/blogs/havent-cpu-clock-speeds-increased-last-years/
• http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture01-intro.pdf
• https://www.slideshare.net/DilumBandara/11-performance-enhancements
• http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture15.pdf
• http://www.ni.com/white-paper/11266/en/
• https://www.cs.virginia.edu/~skadron/cs433_s09_processors/arm11.pdf
• http://www.csie.nuk.edu.tw/~wuch/course/eef011/4p/eef011-6.pdf
Next lecture
• Parallelizing algorithms
• APIs and platforms for parallel and distributed computing
  • OpenMP
  • MPI
  • Unified Parallel C
  • CUDA
  • Hadoop