Multi-core systems • COMP25212 System Architecture • Dr. Javier Navaridas
From Wednesday • Explain the differences between Snoopy and directory-based cache coherence protocols • Global view through a shared medium vs local view + directory • Minimal info vs extra info for the directory and remote shared lines • Poor scalability vs better scalability • Explain the concept of false sharing • Pathological behaviour when two unrelated variables are stored in the same cache line • If they are frequently written by two different cores, they will generate lots of invalidate/update traffic (see the sketch below)
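To make false sharing concrete, here is a minimal sketch in C with pthreads (not from the lecture; the structure and names are invented for illustration). Two threads hammer two unrelated counters that are likely to share a 64-byte cache line, so every write invalidates the line in the other core's cache; giving each counter its own padded line would remove the slowdown.

```c
/* Illustrative false-sharing demo: two unrelated counters in the same
 * cache line are written by two different threads, generating constant
 * invalidate/update traffic between the cores' caches. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* a and b are adjacent, so they almost certainly share a 64-byte line. */
struct { volatile long a; volatile long b; } shared;

static void *inc_a(void *arg) { for (long i = 0; i < ITERS; i++) shared.a++; return NULL; }
static void *inc_b(void *arg) { for (long i = 0; i < ITERS; i++) shared.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc_a, NULL);
    pthread_create(&t2, NULL, inc_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared.a, shared.b);
    return 0;
}
```

Padding each counter out to its own cache line (or keeping per-thread counters and combining them at the end) avoids the ping-ponging and typically speeds this loop up dramatically.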
The Need for Networks • Any multi-core system must clearly contain the means for cores to communicate • With memory • With each other (coherence/synchronization) • There are many different options • Each has different characteristics and tradeoffs • Performance/energy/area/fault-tolerance/scalability • May provide different functionality • Can restrict the type of coherence mechanism
The Need for Networks • Most multi- and many-core applications require some sort of communication • Otherwise, why have so many cores? We rarely run that many independent applications at the same time • Multi-core systems need to provide a way for cores to communicate effectively • What ‘effectively’ means depends on the context
The Need for Networks: Shared-Memory Applications • Multi-cores need to ensure consistency and coherence • Memory consistency: ensure correct ordering of memory accesses • Synchronization within a core • Synchronization across cores – needs to send messages (see the sketch below) • Memory coherence: ensure changes are seen everywhere • Snooping: all the cores see what is going on – centralized • Directory: distributed communications; more traffic required, but higher parallelism achieved – interconnection network
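As a concrete illustration of the ordering requirement, here is a minimal sketch using C11 atomics and pthreads (an illustrative example, not material from the lecture): one thread publishes data and then raises a flag with release semantics, and the other waits on the flag with acquire semantics before reading the data.

```c
/* Minimal sketch of cross-core ordering: the producer writes the payload
 * and only then publishes it via a release store; the consumer spins on an
 * acquire load, so it is guaranteed to see the payload once it sees the flag. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int data = 0;
atomic_int ready = 0;

static void *producer(void *arg)
{
    data = 42;                                                   /* write the payload  */
    atomic_store_explicit(&ready, 1, memory_order_release);     /* then publish it    */
    return NULL;
}

static void *consumer(void *arg)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                                        /* wait for the flag  */
    printf("data = %d\n", data);                                 /* guaranteed to be 42 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```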
The Need for Networks: Distributed-Memory Applications • Independent processor/store pairs • Each core has its own memory, independent from the rest • No coherence is guaranteed at the processor level • Saves chip area • Communication/synchronization is introduced explicitly in the code – message passing (sketched below) • Needs to be handled efficiently to avoid becoming the bottleneck • The interconnection network becomes an important part of the design • E.g. Intel Single-chip Cloud Computer – SCC (2009) • Later replaced by the cache-coherent Xeon Phi (2012)
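A minimal sketch of explicit message passing in C with MPI (illustrative only; the lecture does not prescribe this code): rank 0 sends an integer to rank 1, making the communication visible in the program rather than relying on cache coherence.

```c
/* Explicit message passing: the only way data moves between the two
 * processes is through the MPI_Send/MPI_Recv pair. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);             /* explicit send    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                     /* explicit receive */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two ranks (e.g. `mpirun -np 2 ./a.out`).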
Evaluating Networks • Bandwidth: Amount of data that can be moved per unit of time • Latency: How long it takes a given piece of the message to traverse the network • Congestion: The effect on bandwidth and latency of using the network close to its peak capacity • Fault tolerance • Area • Power dissipation
Bandwidth vs. Latency Definitely not the same thing: • A truck carrying one million 256-Gbyte flash memory cards to London • Latency = 4 hours (14,400 s) • Bandwidth ≈ 142 Tbit/s (1.4 × 10^14 bit/s, worked out below) • A broadband internet connection • Latency = 100 microsec (10^-4 s) • Bandwidth = 100 Mbit/s (10^8 bit/s)
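The truck figure is just the definition of bandwidth applied to the numbers above; written out:

```latex
\[
\text{Bandwidth}_{\text{truck}}
  = \frac{10^{6} \times 256\,\text{GB} \times 8\,\tfrac{\text{bit}}{\text{byte}}}{14\,400\,\text{s}}
  = \frac{2.048 \times 10^{18}\,\text{bit}}{1.44 \times 10^{4}\,\text{s}}
  \approx 1.4 \times 10^{14}\,\tfrac{\text{bit}}{\text{s}}
  \approx 142\,\text{Tbit/s}
\]
```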
Important features of a NoC • Topology • How cores and networking elements are connected together • Routing • How traffic moves through the topology • Switching • How traffic moves from one component to the next
Bus • Common wire interconnection – broadcast medium • Only a single usage at any point in time • Controlled by a clock – divided into time slots • A sender must ‘grab’ a slot (via arbitration) to transmit • Often ‘split transaction’ • E.g. send the memory address in one slot • Data returned by memory in a later slot • Intervening slots are free for use by others • Main scalability issue is limited throughput • Bandwidth is divided by the number of cores (see below)
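In other words, a single shared medium splits its bandwidth across the cores; with illustrative numbers:

```latex
\[
B_{\text{core}} = \frac{B_{\text{bus}}}{N}
\qquad \text{e.g.} \quad \frac{64\ \text{GB/s}}{16\ \text{cores}} = 4\ \text{GB/s per core}
\]
```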
Crossbar • E.g. to connect N inputs to N outputs • Can achieve ‘any to any’ (disjoint) connections in parallel • Area and power scale quadratically with the number of nodes – not scalable
Tree • Variable bandwidth (depends on the depth in the tree) • Variable latency • Reliability?
Ring • Simple, but • Low bandwidth • Variable latency • E.g. Cell processor – PS3 (2006)
Mesh / Grid • Reasonable bandwidth • Variable latency • Convenient physical layout for very large systems • E.g. Tilera TILE64 processor (2007), Xeon Phi Knights Landing processor (2016)
Length of Routes • Minimal routing • Always selects a shortest path to the destination • Packets always move closer to their destination • Packets are more likely to be blocked • Non-minimal routing • Packets can be diverted • To avoid blocking, keeping the traffic moving • To move away from congested areas • Risk of livelock
Oblivious Routing • Unaware of network state • Deterministic routing: fixed path, e.g. XY routing (see the sketch below) • Non-deterministic routing: more complex strategies • Pros • Simpler router • Deadlock-free oblivious routing • Cons • Prone to contention
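A minimal sketch of deterministic XY (dimension-ordered) routing on a 2-D mesh, written in C for illustration (the port names and coordinates are invented, not from the lecture): the packet first corrects its X coordinate, then its Y coordinate.

```c
/* Dimension-ordered (XY) routing on a 2-D mesh: move along X until the
 * column matches the destination, then move along Y, then deliver locally. */
#include <stdio.h>

enum port { LOCAL, EAST, WEST, NORTH, SOUTH };

/* Output port for a packet currently at router (x, y), destined for (dst_x, dst_y). */
enum port xy_route(int x, int y, int dst_x, int dst_y)
{
    if (dst_x > x) return EAST;    /* correct the X coordinate first      */
    if (dst_x < x) return WEST;
    if (dst_y > y) return NORTH;   /* X matches: correct the Y coordinate */
    if (dst_y < y) return SOUTH;
    return LOCAL;                  /* arrived: eject to the local core    */
}

int main(void)
{
    /* A packet at (1,1) heading to (3,2) takes EAST, EAST, NORTH, LOCAL. */
    printf("next hop: %d\n", xy_route(1, 1, 3, 2));   /* prints 1 (EAST) */
    return 0;
}
```

Because every packet between a given source and destination follows the same fixed path, the router stays simple and deadlock freedom is easy to prove, at the cost of being prone to contention on popular links.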
Adaptive Routing • Aware of network state • Packets adapt their path to avoid contention • Pros • Higher performance • Cons • Router instrumentation is required • More complex, i.e. more area and power • Deadlock-prone – avoiding it needs even more hardware • Barely used in NoCs
Packet Switching • Data is split into small packets, and these into flits • Some extra info is added to each packet to identify the data and to perform routing • Allows time-multiplexing of network resources • Typically better performance, especially for short messages • Several packet-switching strategies • Store-and-forward, cut-through, wormhole • [Figure: packet format – head flit followed by data flits]
Store-and-Forward Switching • A packet is not forwarded until all of its phits have arrived at each intermediate node • Pros • On-the-fly failure detection • Cons • Low performance • Latency: distance × #phits • Large buffering required • Long, bursty transmissions • E.g. the Internet
Cut-Through / Wormhole Switching • A packet can be forwarded as soon as the head arrives at an intermediate node • Pros • Better performance • Latency: distance + #phits (compared numerically below) • Less buffering hardware required • Cons • Fault detection only possible at the destination
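Plugging illustrative numbers into the two latency expressions above shows why cut-through/wormhole wins for anything but single-phit packets:

```latex
% Illustrative numbers: a packet of L = 16 phits crossing d = 5 hops,
% assuming one phit per link per cycle.
\[
T_{\text{store-and-forward}} \approx d \times L = 5 \times 16 = 80 \ \text{cycles}
\qquad
T_{\text{cut-through}} \approx d + L = 5 + 16 = 21 \ \text{cycles}
\]
```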
Typical Multi-core Structure • [Figure: each core has private L1 instruction and data caches and an L2 cache; an L3 cache is shared on chip, and a memory controller connects to main memory (DRAM); QPI or HT links the chip to an input/output hub, which provides PCIe (e.g. for a graphics card) and an input/output controller for the motherboard I/O buses (PCIe, USB, Ethernet, SATA HD)]
Multiprocessor • Shared memory • [Figure: several multi-core chips on one motherboard, each with its own DRAM, interconnected and attached to input/output hubs via QPI or HT]
Multicomputer • Distributed memory • [Figure: many independent nodes connected through an interconnection network]
Amdahl’s Law • Estimates a parallel system’s maximum performance based on the available parallelism of an application (formula below) • It was intended to discourage parallel architectures • But it was later reformulated (Gustafson’s law) to show that S is normally constant while P depends on the size of the input data • If you want more parallelism, just increase your dataset • S = fraction of the code which is serial • P = fraction of the code which can be parallel • S + P = 1 • N = number of processors
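With these definitions, Amdahl's law gives the maximum speedup on N processors, and its limit as N grows:

```latex
\[
\text{Speedup}(N) = \frac{1}{S + \dfrac{P}{N}}, \qquad S + P = 1,
\qquad \lim_{N \to \infty} \text{Speedup}(N) = \frac{1}{S}
\]
```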
Clusters, Supercomputers and Datacentres • All terms are overloaded and misused • All have lots of CPUs on lots of motherboards • The distinction is becoming increasingly blurred • High Performance Computing • Run one large task as quickly as possible • Supercomputers and (to an extent) clusters • High Throughput Computing • Run as many tasks per unit of time as possible • Clusters/farms (compute) and datacentres (data) • Big Data Analytics • Analyse and extract patterns from large, complex data sets • Datacentres
Building a Cluster, Supercomputer or Datacentre • Large numbers of self-contained computers in a small form factor • Optimised for cooling and power efficiency • Racks house 1000s of cores • High redundancy for fault tolerance • They normally also contain separate units for networking and power distribution
Building a Cluster, Supercomputer or Datacentre • Join lots of compute racks • Add a network • Add power distribution • Add cooling • Add dedicated storage • Add some frontend node(s) • So that small user tasks (compiling, reading results, etc.) do not affect compute-node performance
Top 500 List of Supercomputers • A list of the most powerful supercomputers in the world, updated twice a year (Jun/Nov) (www.top500.org) • Theoretical peak performance (Rpeak) vs maximum performance running a computation-intensive application (Rmax) • Let’s peek at the latest Top 10 (Nov’17)
6. Sequoia – IBM BlueGene/Q • Water cooled • 5D network for fault tolerance • Custom made chips based on IBM PowerPC • 98K processors – 1.57M cores – 6.28M threads
BlueGene/Q Chip • 16-core processor (+1 service core, +1 spare core) • No cache coherence, but supports hardware transactional memory • Crossbar NoC • PowerPC A2 cores • 2-way issue, in-order pipeline • 4-way simultaneous multithreading • Performance: 200 Gflop/s • Power: 55 W • Efficiency: ~4 Gflop/s per Watt
5. Titan Supercomputer – Cray XK7 • Air cooled • 3D network • AMD Opteron Processors + NVidia Tesla GPUs • 20,000 processors – 320,000 cores - 320,000 Threads • 20,000 GPUs – 50,000,000 lightweight threads
General-Purpose Computing Using GPUs (GPGPU) • GPU architecture (SIMD) is based on arrays of cores executing the same instruction over different data • Several of these arrays within a GPU • The best raw performance and efficiency • Peak performance (>4 Tflop/s) • Relatively low power (235 W) • The most power-efficient (18 Gflop/s per Watt) • E.g. NVIDIA Kepler GK110 GPU
Limitations of GPGPU • Moving data from/to memory is very expensive • Large chunks of data need to be moved before execution can start • Recent PCIe versions alleviate this (up to 31 GB/s bandwidth) • The memory accessed by all the cores within an array needs to be contiguous • Indirections (pointers) make this difficult (illustrated in the sketch below) • All the cores execute the same instruction • Control operations (if) require executing both the if and the else parts in each core • The programming model differs significantly from the standard sequential models used on CPUs • CUDA, OpenCL, many others… • All these limitations significantly affect the performance we can extract from a GPU • Certain workloads cannot be ported efficiently to GPUs
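The access-pattern and branching limitations can be illustrated in plain C (a sketch under the assumption that GPU lanes behave like lock-step SIMD lanes; this is not CUDA code from the lecture): the first loop is contiguous and branch-free and maps well onto SIMD, while the second uses an indirection and a data-dependent branch, which forces scattered accesses and execution of both branch sides with masking.

```c
/* SIMD-friendly vs SIMD-hostile access patterns. */
#include <stddef.h>

void saxpy_contiguous(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];          /* contiguous, branch-free: vectorizes well */
}

void gather_with_branch(float *y, const float *x, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float v = x[idx[i]];             /* indirection: scattered memory accesses   */
        if (v > 0.0f)                    /* data-dependent branch: lanes diverge,    */
            y[i] = v;                    /* so SIMD hardware runs both sides and     */
        else                             /* masks out the unwanted results           */
            y[i] = -v;
    }
}

int main(void)
{
    float x[4] = {1.0f, -2.0f, 3.0f, -4.0f}, y[4] = {0};
    int idx[4] = {3, 2, 1, 0};
    saxpy_contiguous(y, x, 2.0f, 4);
    gather_with_branch(y, x, idx, 4);
    return 0;
}
```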
2. Tianhe-2 • Xeon E5-2692 12-core Processors + Xeon Phi • 32,000 processors – 384,000 cores – 768,000 Threads • 48,000 Xeon Phi – 2,736,000 cores – 10,944,000 Threads
Xeon Phi Coprocessor • A many-core accelerator • 57 4-way SMT x86 general-purpose cores (228 hardware threads) • 512-bit SIMD (vector) operations • Not as good raw performance or efficiency as a GPU • Peak performance 1.2 Tflop/s • Power dissipation 300 W • Overall efficiency 4 Gflop/s per Watt • And still limited by memory access • …but usually it can be used more effectively than a GPU • Programmed using well-known parallel programming models (sketched below) • MPI, OpenMP, autovectorization, …
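A minimal sketch of the kind of standard shared-memory code meant here, using OpenMP in C (illustrative; the array names and sizes are invented): the same source runs on an ordinary multi-core, with the work split across hardware threads and the inner loop left to the compiler to vectorize.

```c
/* Standard OpenMP parallel loop: nothing in the source is specific to the
 * accelerator, which is the point of using well-known programming models. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Iterations are divided among all available hardware threads;
     * each thread's chunk is a simple, vectorizable loop. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (up to %d threads)\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```

Compile with OpenMP enabled (e.g. `-fopenmp`); the thread count adapts to whatever core/SMT configuration the chip provides.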
Xeon Phi Architecture • All logic (cores, PCIe client and local memory controllers) is connected through a bidirectional ring • Communications can become a bottleneck • As we saw, newer versions (as in Cori) implement a mesh • Cache coherent through a distributed directory (TD) • Necessary to support large numbers of cores • Local memory needs to be updated with the host memory using PCIe
1. Sunway TaihuLight • Liquid cooling • Custom multistage network based on PCIe 3.0 • Custom-made RISC chips (Sunway SW26010) • 41K processors – 10M cores – 10M threads
Sunway TaihuLight Chip • 260-core processor • No cache coherency • Mesh + crossbar NoC • 8-way superscalar, out-of-order cores • No multithreading • Performance: 3.06 Tflop/s • Power: 500 W • Efficiency: ~6 Gflop/s per Watt
Trends in Supercomputing • Main limits to scalability are power dissipation and data movement • Top systems use accelerators (GPUs, Xeon Phi) to increase performance • High peak performance and efficiency • We still do not know how to exploit all the available performance • Many systems still rely on regular processors