Explore various parallel architectures and systems used in parallel and distributed computing, such as ARM11, Intel Pentium 4, Intel Skylake, Tianhe-2, and Google Cloud. Understand the limitations of sequential systems and the need for parallel processing. Learn about the challenges and advancements in reducing power consumption and increasing processing speed. Study concepts like pipeline processing, instruction-level parallelism, superscalar architectures, and Amdahl's law.
Big Data Technologies
Lecture 2: Architectures and systems for parallel and distributed computing
Assoc. Prof. Marc FRÎNCU, PhD Habil.
marc.frincu@e-uvt.ro
What is a parallel architecture?
• "A collection of processing elements that cooperate to solve large problems fast" – Almasi and Gottlieb, 1989
• ARM11 (2002-2005)
  • 90% of embedded systems are based on ARM processors
  • Raspberry Pi
  • 8-stage pipeline
  • 8 in-flight instructions (out-of-order)
• Intel Pentium 4 (2000-2008)
  • 124 in-flight instructions
  • 31-stage pipeline
  • Superscalar: in-processor instruction-level parallelism
• Intel Skylake (August 2015)
  • Quad core
  • 2 threads per core
  • GPU
• Tianhe-2 (2013)
  • 32,000 Intel Xeon CPUs with 12 cores each
  • 48,000 Intel Xeon Phi coprocessors with 57 cores each
  • 34 petaflops
• Sunway TaihuLight
  • First place in the world in November 2017
  • 40,960 SW26010 CPUs with 260 cores each
  • 125 petaflops
• Google Cloud
  • Cluster farms
  • 10 Tbps bandwidth (US-Japan)
Motivation
• The performance of sequential systems is limited
  • Computation/data transfer goes through logic gates and memory devices
  • Latency > 0
  • Even in an ideal environment we are still limited by the speed of light
• Many applications require performance
  • Nuclear reaction modelling
  • Climate modelling
  • Big Data
    • Google: 40,000 searches/min
    • Facebook: 1 billion users every day
• Technological breakthroughs
  • What do we do with so many transistors?
Examples
• Titan supercomputer
  • 299,000 AMD x86 cores
  • 18,688 NVIDIA GPUs
Transistors vs. speed
• More transistors per chip means more speed
  • Transistors are electronic devices that act as switches, built to form logic gates
  • The speed of each operation is bounded by the time a transistor needs to switch off/on without causing errors
  • A smaller transistor switches faster
• Example
  • 3 GHz = 3 billion ops/sec
• Increase density to increase speed
  • Intel P4: 170 million transistors
  • Intel 15-core Xeon Ivy Bridge: 4.3 billion
• On average, processing power has increased by about 60% per year
Transistors vs. speed
• Moore's law
  • Ideally: the number of transistors doubles each year (2x)
  • In reality: 5x every 5 years (≈1.35x per year)
Transistors vs. speed
• Why can't we increase speed forever?
  • Chip size = constant
  • But the density of transistors keeps increasing
    • 2018: 10 nm Intel Cannon Lake
• Dennard scaling
  • The power needed to operate the transistors stays constant even if the number of transistors per chip increases
  • But this is no longer true (as of 2006)!
    • Transistors are becoming so small that their integrity breaks down and they leak current
    • The faster we switch a transistor on/off, the more heat it generates
    • 8.5-9 GHz requires liquid nitrogen cooling!
How do we reduce power consumption?
• Frequency ∝ voltage
• Gate energy ∝ voltage²
• Executing the same number of cycles at a lower voltage and speed → power savings
• Example
  • Task with a deadline of 100 ms
  • Method #1: run 50 ms at full speed, then stay idle for 50 ms
  • Method #2: run 100 ms at frequency/2 and voltage/2
    • Energy requirement: energy/4 → 4x cheaper
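The 4x figure can be checked with a short derivation; a minimal sketch assuming dynamic (switching) power dominates and the switched capacitance stays constant (both assumptions are mine, not stated on the slide):

```latex
% Energy per cycle for dynamic power, constants dropped:
E_{\text{cycle}} \propto V^{2}
\\[4pt]
% Method #1: N cycles at voltage V;  Method #2: the same N cycles at V/2
E_{1} \propto N\,V^{2}, \qquad
E_{2} \propto N\left(\tfrac{V}{2}\right)^{2} = \tfrac{N\,V^{2}}{4}
\quad\Rightarrow\quad \frac{E_{1}}{E_{2}} = 4
```

Both methods execute the same number of cycles (half the frequency for twice the time), so only the voltage term changes.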
Sequential processing
• 6 hours to wash 4 loads of laundry
Pipelined sequential processing
• Pipeline = start processing the next task IMMEDIATELY
• Improves system throughput
• Multiple tasks operate in parallel
• Reduces the time to 3.5 hours
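The 6-hour and 3.5-hour figures are consistent with the classic laundry pipeline example; a sketch assuming 30 min washing, 40 min drying, and 20 min folding per load (the stage durations are my assumption, since the original figure is not reproduced here):

```latex
T_{\text{seq}}  = 4 \times (30 + 40 + 20)\ \text{min} = 360\ \text{min} = 6\ \text{h}
\\[4pt]
T_{\text{pipe}} = 30 + 4 \times 40 + 20\ \text{min} = 210\ \text{min} = 3.5\ \text{h}
```

Once the pipeline is full, throughput is limited by the slowest stage (the 40-minute dryer), not by the total per-load latency.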
Conditional pipeline (branching)
• What happens when we have dependencies between instructions?
  • Especially for if-else branching
• The processor must flush the instructions already fetched into the pipeline: reaching a branch means they were fetched from the wrong path, and the instructions of the chosen branch must be fetched instead
• AMD Ryzen (2017) uses neural networks to predict the execution path (see the C sketch after this slide)
[Pipeline diagram: fetch, decode, fetch operands, generate address, execute, store; instruction i+2 is a branch, so either instruction j or i+3 executes next]
https://courses.cs.washington.edu/courses/csep548/00sp/handouts/lilja.pdf
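A small illustrative C sketch (mine, not from the lecture) of why branch prediction matters: the same loop with the same data-dependent branch tends to run noticeably faster once the data is sorted, because the branch outcome becomes predictable. Compile with little or no optimization (e.g. `-O1`) so the compiler does not replace the branch with branchless or vectorized code.

```c
/* Branch-prediction sketch (illustrative): the same data-dependent branch
 * is cheap when it is predictable (sorted input) and expensive when it
 * is not (random input). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static long sum_above_threshold(const int *a, int n, int threshold) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] >= threshold)   /* data-dependent branch */
            sum += a[i];
    }
    return sum;
}

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

int main(void) {
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_above_threshold(a, N, 128);   /* random order: frequent mispredictions */
    clock_t t1 = clock();

    qsort(a, N, sizeof *a, cmp_int);            /* sorted: long taken/not-taken runs */
    clock_t t2 = clock();
    long s2 = sum_above_threshold(a, N, 128);
    clock_t t3 = clock();

    printf("unsorted: sum=%ld, %.3f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted:   sum=%ld, %.3f s\n", s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```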
Superscalar architectures
• Execute more than one instruction per CPU cycle (a small ILP sketch follows this slide)
  • Issue multiple instructions to redundant functional units
• Hyperthreading (Intel)
  • Example: Intel P4
    • One instruction stream (thread) processes integers (integer ALU) while another processes floating-point numbers (floating-point ALU)
    • The OS thinks it deals with 2 processors
  • Accomplished by combining a series of shared, replicated, or partitioned resources:
    • Registers
    • Arithmetic units
    • Cache memory
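A hedged C sketch of instruction-level parallelism (my illustration, not the slide's example): splitting a reduction across independent accumulators gives a superscalar/out-of-order core separate dependency chains that it can issue in the same cycle, whereas a single accumulator serializes on its own result. (Results may differ in the last bits because the floating-point additions are reassociated.)

```c
/* ILP sketch (illustrative): two independent dependency chains can be
 * executed in parallel by a superscalar core, while a single chain
 * serializes on its own result. */
#include <stddef.h>

double dot_single_chain(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];           /* every add depends on the previous one */
    return acc;
}

double dot_two_chains(const double *x, const double *y, size_t n) {
    double acc0 = 0.0, acc1 = 0.0;    /* independent accumulators -> independent chains */
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                        /* leftover element when n is odd */
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```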
Amdahl's law
• How much can we improve by parallelizing?
• Example
  • Floating-point instructions
    • Theoretical speedup: 2x
    • Percentage of the total #instructions: 10%
  • Overall speedup: ≈1.053
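The 1.053 figure follows directly from Amdahl's law with the slide's numbers (a fraction f = 0.1 of the instructions is sped up by a factor s = 2):

```latex
\text{Speedup} = \frac{1}{(1-f) + \frac{f}{s}}
               = \frac{1}{(1-0.1) + \frac{0.1}{2}}
               = \frac{1}{0.95}
               \approx 1.053
```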
End of the single core?
• Frequency stopped scaling
  • Dennard scaling no longer holds
• The memory wall
  • Data and instructions must be fetched into registers (through the cache)
  • Memory becomes the critical point
• The ILP wall
  • Dependencies between instructions limit the efficiency of instruction-level parallelism
Multi-core solution
• More cores on the same CPU
  • Better than hyperthreading
    • Real parallelism
• Example
  • Reducing speed (frequency) by 30% reduces power to roughly 35% of the original
    • Power ∝ frequency³ (or worse)
  • But performance is also reduced by 30%
  • Having 2 cores per chip at 70% speed → 140% of the original performance at 70% of the power
    • 40% increase in performance with a 30% saving in power
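A quick sketch of the arithmetic behind these numbers, assuming the cubic power relation stated on the slide:

```latex
P \propto f^{3}: \qquad P_{0.7f} = 0.7^{3}\,P_{f} \approx 0.34\,P_{f}
```

Two cores at 0.7f then deliver about 2 × 0.7 = 1.4 of the single-core performance while drawing about 2 × 0.34 ≈ 0.69 of the single-core power.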
IBM POWER4
• Introduced in 2001
Multicore
• Intel i7
  • 6 cores
  • 12 threads
    • Parallelism through hyperthreading
Different architectures
• Intel Nehalem (Atom, i3, i5, i7)
  • Cores are linked through the QuickPath Interconnect (QPI)
    • Mesh architecture (all-to-all)
    • 25.6 GB/s
• Older Intel versions (Penryn)
  • Cores are linked through the FSB (Front Side Bus)
    • Sequential (shared bus) architecture
    • 3.2 GB/s (P4) – 12.8 GB/s (Core 2 Extreme)
AMD Infinity Fabric
• HyperTransport protocol
  • 30-50 GB/s
  • 512 GB/s for the Vega GPU
• Mesh network
  • Network on a chip, clustering
• Links GPUs and the SoC
• CCIX standard: accelerators, FPGAs
Many-core
• Systems with tens or hundreds of cores
• Developed for parallel computing
  • High throughput and low energy consumption (sacrifices latency)
• Cache coherence, as used in multi-core systems (few cores), becomes problematic at this scale
  • Instead they use: message passing, DMA, PGAS (Partitioned Global Address Space)
• Not efficient for applications using just one thread
• Example
  • Xeon Phi with 59-72 cores
  • GPUs: Tesla K80 with 4,992 CUDA cores
CPU vs. GPU architecture • GPU has more transistors for computations
Parallel processing models: Flynn's classification
• SISD (Single Instruction, Single Data)
  • Uniprocessor
• MISD (Multiple Instruction, Single Data)
  • Multiple processors operate on a single data stream
  • Systolic processor
  • Stream processor
• SIMD (Single Instruction, Multiple Data) – see the SSE sketch after this slide
  • The same instruction is executed on multiple processors
  • Each processor has its own memory (different data)
  • Shared memory for control and instructions
  • Good for data-level parallelism
  • Vector processors
  • GPUs (partially)
• MIMD (Multiple Instruction, Multiple Data)
  • Each processor has its own data and instructions
  • Multiprocessors
  • Multithreaded processors
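A tiny C sketch of the SIMD idea using x86 SSE intrinsics (my illustration; it assumes an SSE-capable x86 CPU and a build flag such as `gcc -msse`): a single instruction performs four float additions at once.

```c
/* SIMD sketch (illustrative): one instruction operates on multiple data. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, FOUR additions */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```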
Centralized shared memory multiprocessors
• Symmetric MultiProcessor (SMP)
  • Processors are interconnected through a UMA (Uniform Memory Access) backbone
  • Does not scale well
Distributed shared memory multiprocessors
• SMP clusters
• NUMA (Non-Uniform Memory Access)
  • Physical memory local to each processor (but the address space is shared)
  • Avoids the starvation caused when a single memory can be accessed by only one processor at a time
• AMD processors implement the model through HyperTransport (2003), Intel through QPI (2007)
Distributed shared memory multiprocessors
• Enhanced scalability
  • Low latency for local accesses
  • Scalable memory bandwidth at low cost
• But
  • Higher inter-processor communication times
    • Network technology is vital!
  • More complex software model
Message passing multiprocessors
• Multicomputers
  • Communication is based on message passing between processors, not on shared memory access
  • Can call remote methods: Remote Procedure Call (RPC)
  • Libraries: MPI (Message Passing Interface) – a minimal example follows this slide
  • Synchronous communication
    • Causes process synchronization
  • The address space is split into private addresses for each distinct processor
• Example
  • Massively Parallel Processing (MPP)
    • IBM Blue Gene
  • Clusters
    • Built by linking computers in a LAN
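A minimal MPI sketch in C (my illustration of the message-passing model; it assumes an MPI implementation such as MPICH or Open MPI is installed): rank 1 explicitly sends a value to rank 0, and the blocking receive on rank 0 doubles as synchronization.

```c
/* Minimal message-passing sketch with MPI (illustrative).
 * Build: mpicc hello_mpi.c -o hello_mpi
 * Run:   mpirun -np 2 ./hello_mpi */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us */

    if (rank == 1) {
        int value = 42;
        /* explicit communication: send one int to rank 0 */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int value = 0;
        /* blocking receive: also acts as natural synchronization */
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank 1 (out of %d processes)\n", value, size);
    }

    MPI_Finalize();
    return 0;
}
```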
Shared memory vs. message passing
• Shared memory
  • Easy programming (see the pthreads sketch after this slide)
  • Hides the network but does not hide its latency
  • Communication is handled in hardware
    • To reduce communication overhead
• Message passing
  • Explicit communication
    • Can be optimized
  • Natural synchronization
    • When sending messages
  • Programming is harder, since it must handle aspects that shared memory systems hide
    • Transport cost
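As a shared-memory counterpart to the MPI sketch above, a small POSIX-threads example (my illustration): both threads read and write the same variable, so communication is implicit, but the programmer must add the synchronization (here a mutex) that message passing gets for free.

```c
/* Shared-memory sketch with POSIX threads (illustrative).
 * Build: gcc shared_counter.c -o shared_counter -lpthread */
#include <stdio.h>
#include <pthread.h>

#define ITERATIONS 1000000

static long counter = 0;                                  /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* explicit synchronization */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);    /* without this, updates are lost (data race) */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected %d)\n", counter, 2 * ITERATIONS);
    return 0;
}
```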
Distributed systems
• "A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent. Distributed systems usually use some kind of client-server organization." – FOLDOC
• "A Distributed System comprises several single components on different computers, which normally do not operate using shared memory and as a consequence communicate via the exchange of messages. The various components involved cooperate to achieve a common objective such as the performing of a business process." – Schill & Springer
Parallel vs. distributed systems
• Parallelism
  • Executing multiple tasks at the same time
  • True parallelism requires as many cores as parallel tasks
• Concurrent execution
  • Thread-based computing
  • Can use hardware parallelism but usually derives from software requirements
    • Example: the effects of multiple system calls
  • Becomes parallelism when true parallelism is available
• Distributed computing
  • Refers to where the computation is performed
  • Computers are linked in a network
  • Memory is distributed
  • Distribution is usually part of the objective
    • If resources are distributed then we have a distributed system
  • Raises many problems from a programming point of view
    • No global clock, synchronization, unpredictable errors, variable latency, security, interoperability
Distributed system models
• Minicomputer
• Workstation
• Workstation-server
• Processor pool
• Cluster
• Grid
• Cloud
Minicomputer
• Extension of time-sharing systems
  • The user logs on to their home machine
  • Authenticates remotely through telnet
• Shared resources
  • Databases
  • HPC
[Diagram: several minicomputers connected through the ARPAnet]
Workstation
• Process migration
  • The user authenticates on their own machine
  • If networked resources are available, the process migrates there
• Issues
  • How do we identify available resources?
  • How do we migrate a process?
  • What happens if another user logs on to the available resource?
[Diagram: several workstations connected by a 100 Gbps LAN]
Workstation – server
• Client stations
  • No local hard disk
  • Interactive/graphical processes are executed locally
  • All files and computations are sent to the server
• Servers
  • Each machine is dedicated to a certain type of job
• Communication model
  • RPC (Remote Procedure Call)
    • C
  • RMI (Remote Method Invocation)
    • Java
  • A client process invokes a server process
  • There is no process migration between machines
[Diagram: workstations on a 100 Gbps LAN connected to minicomputer file, HTTP, and cycle servers]
Processor pool
• Client
  • The user authenticates on a remote machine
  • All services are delegated to servers
• Server
  • Allocates the required number of processors to each client
• Better utilization but less interactivity
[Diagram: clients and servers 1..N connected by a 100 Gbps LAN]
Cluster
• Client
  • Client-server model
• Server
  • Consists of many machines interconnected through a high-speed network
  • The aim is performance
    • Parallel processing
[Diagram: workstations on a 100 Gbps LAN access HTTP servers 1..N, which sit in front of a master node and slave nodes 1..N linked by a 1 Gbps SAN]
Grid
• Aim
  • Collect the processing power of many clusters or parallel systems and make it available to users
  • Similar in concept to the power grid
    • You buy only what you use
• HPC distributed computing
  • Large problems requiring many resources
• On demand
  • Remote resources are integrated with local ones
• Big Data
  • Data is distributed
• Shared computing
  • Communication between resources
[Diagram: supercomputers, clusters, minicomputers, and workstations connected through a high-speed information highway]
Cloud
• A distributed system where access to resources is virtualized and on demand, while the topology stays hidden
• Pay per use (per second, GB, query, etc.)
• Access levels
  • Infrastructure (IaaS)
  • Platform (PaaS)
  • Services (SaaS)
  • Data (DaaS)
• Examples
  • Amazon EC2
  • Google Compute Engine
  • Microsoft Azure
[Diagram: workstations access specific services, VMs, and a database over the Internet]
Summary
• Shared memory
  • Homogeneous resources
  • ns access times
• Message passing
  • Homogeneous/heterogeneous resources
  • μs access times
• Distributed systems
  • Heterogeneous resources
  • ms access times
Sources
• http://slideplayer.com/slide/5704113/
• https://www.comsol.com/blogs/havent-cpu-clock-speeds-increased-last-years/
• http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture01-intro.pdf
• https://www.slideshare.net/DilumBandara/11-performance-enhancements
• http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture15.pdf
• http://www.ni.com/white-paper/11266/en/
• https://www.cs.virginia.edu/~skadron/cs433_s09_processors/arm11.pdf
• http://www.csie.nuk.edu.tw/~wuch/course/eef011/4p/eef011-6.pdf
Next lecture
• Parallelizing algorithms
• APIs and platforms for parallel and distributed computing
  • OpenMP
  • MPI
  • Unified Parallel C
  • CUDA
  • Hadoop