COMP60621 Concurrent Programming for Numerical Applications
Lecture 6: Chronos – a Dell Multicore Computer

Presentation Transcript


  1. COMP60621 Concurrent Programming for Numerical Applications, Lecture 6: Chronos – a Dell Multicore Computer. Len Freeman, Graham Riley, Centre for Novel Computing, School of Computer Science, University of Manchester

  2. Overview • Processor • AMD Opteron quad-core processor (‘Shanghai’) • Chronos has four processors (i.e. 16 cores) • Cache structure • L1 and L2 cache per core • L3 cache shared between the four cores • Memory • 6GB (6 x 1GB memory modules) per processor (24GB total) • Interconnect • AMD ‘Direct Connect Architecture’ (Coherent HyperTransport Technology) • No ‘Front side bus’, as found in some Intel platforms • Performance issues • Further Information

  3. Processor: Quad-Core AMD Opteron Source: www.amd.com, Quad-Core AMD Opteron Product Brief

  4. Processor – AMD Opteron 8378 • ‘Shanghai’ 64 bit • 2.4GHz clock speed • Separate 64KB level 1 data and instruction caches per core • 2-way set associative, LRU replacement, exclusive • 512KB level 2 cache per core (exclusive, i.e. data held in L1 is not duplicated in L2) • unified (code and data) • 16-way set associative, pseudo LRU replacement • 6144KB (6MB) level 3 cache per processor (can be inclusive) • Shared by 4 cores • unified • 64-way set associative, pseudo LRU replacement • Cache line sizes are 64B (the ‘unit of coherency’)
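
A quick way to sanity-check these numbers on a Linux machine such as chronos is to ask the C library for them. The short C sketch below is an illustration written for these notes, not part of the course material; it assumes gcc/glibc, whose sysconf() accepts the non-standard _SC_LEVEL*_CACHE_* names (a result of 0 or -1 simply means the library could not determine that parameter).

    /* Print the cache parameters reported by glibc's sysconf(), for
     * comparison with the Opteron figures quoted on the slide above.
     * Assumption: Linux + glibc (the _SC_LEVEL* names are a GNU extension);
     * 0 or -1 means "unknown".  Compile with: gcc -O2 cacheinfo.c */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        printf("L1 data cache size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1 line size       : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L1 associativity   : %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
        printf("L2 cache size      : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 cache size      : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }

The same information is available per core under /sys/devices/system/cpu/cpu*/cache/, as noted on slide 12.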

  5. AMD Opteron cache behaviour • L1 and L2 are exclusive caches • data is never in both caches; L2 holds data evicted from L1 • On an L2 hit, the data is moved to L1 and removed from L2 • L2 evicts data to L3 • An access that misses in L3 brings the data straight into L1 • Only after eviction from L1 and L2 does data reach L3 (L2 and L3 are ‘victim’ caches) • When data is moved from L3 back into L1, L3 keeps a copy (inclusive behaviour) if the data is likely to be shared with other cores, but drops its copy if the data is unlikely to be shared (exclusive behaviour) • Cache behaviour on the Opteron is therefore ‘mostly exclusive’

  6. AMD Opteron latencies • Getting data into the registers • L1 access, 3 cycles then 1 cycle per load (~1.5ns) • L2 access, 9 cycles beyond L1 (~4ns) • L3 access, 29 cycles (at best) (~13ns) • Local memory (read access), ~140ns (not directly related to cpu cycles!) • An average benchmarked figure using, e.g. lmbench • On chronos, 1 cpu cycle is just under ~0.42ns • Memory access time is approximate… • Depends on how much work the memory system has to do to get the data and how ‘busy’ it is
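
These latencies can be observed with a simple pointer-chasing loop in the spirit of lmbench's memory-latency test. The sketch below is an illustration written for these notes (it is not lmbench and not part of the course code): every load depends on the previous one, so the average time per iteration approximates the load-to-use latency of whichever level of the hierarchy a working set of SIZE bytes fits in. Varying SIZE from a few KB to tens of MB walks through L1, L2, L3 and local memory.

    /* Dependent-load (pointer-chasing) latency sketch.  Assumptions:
     * gcc on Linux; older glibc may need -lrt for clock_gettime.
     * Compile with: gcc -O2 chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE (8UL * 1024 * 1024)          /* working set in bytes: vary this */
    #define N    (SIZE / sizeof(void *))      /* number of pointer-sized slots   */
    #define ITER 50000000L                    /* dependent loads to time         */

    int main(void)
    {
        void **buf = malloc(SIZE);
        size_t i;

        /* Link the slots into one long cycle.  The stride (4097 slots) is odd,
         * hence coprime with N, so the chain visits every slot; a real
         * benchmark would randomise the order to defeat prefetching entirely. */
        for (i = 0; i < N; i++)
            buf[i] = &buf[(i + 4097) % N];

        void **p = buf;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long j = 0; j < ITER; j++)
            p = (void **)*p;                  /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* printing p stops the compiler optimising the chase away */
        printf("%.2f ns per load (p = %p)\n", ns / ITER, (void *)p);
        free(buf);
        return 0;
    }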

  7. AMD Opteron 4P server architecture Source: www.amd.com, AMD 4P Server and Workstation Comparison

  8. AMD Quad-quad ccNUMA architecture • Each processor is directly connected to some memory • Each processor has a memory controller • Bandwidth, 12.8GB/s (aggregate over two channels) • Processors are connected to each other with: • Bi-directional Coherent HyperTransport Technology (HT) • Coherency unit is 64 Bytes (i.e. cache line size) • Up to 8.0GB/s per link (4GB/s in each direction) • 3 HT links per processor, usually 2 used to connect to other processors and 1 used for I/O (via PCI bridge) • Separate memory and I/O paths • Compare with Front side bus architecture used by, e.g., Intel
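
On Linux the four processors and their directly attached memory appear as four NUMA nodes, and memory can be placed on a particular node explicitly with libnuma. The sketch below is an illustration for these notes only (not part of the course software), assuming the libnuma development package is installed and the program is linked with -lnuma.

    /* Report the NUMA nodes the kernel sees and allocate a buffer bound to
     * one of them.  Assumption: libnuma installed.
     * Compile with: gcc -O2 nodes.c -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this kernel\n");
            return 1;
        }
        /* on a machine like chronos this should report 4 nodes, one per processor */
        printf("%d NUMA node(s) visible\n", numa_max_node() + 1);

        size_t bytes = 64UL * 1024 * 1024;
        double *buf = numa_alloc_onnode(bytes, 0);   /* bind the pages to node 0 */
        if (buf == NULL) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }
        buf[0] = 1.0;        /* touching a page now allocates it on node 0 */
        numa_free(buf, bytes);
        return 0;
    }

The command-line tool numactl --hardware reports the same node layout (and inter-node distances) without writing any code.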

  9. Performance issues • Cores on the same processor can directly access some of the system’s memory (local memory) through the cache hierarchy • They can communicate with each other via the shared L3 cache • Cores on different processors access remote memory via the cHT (coherent HyperTransport) links, which maintain coherency of the data in the L3 caches (and memory) • Access to remote memory may take 1 ‘hop’ (to memory on either of the two processors one cHT link away) or 2 ‘hops’ (to memory on the fourth processor, two cHT links away)
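
Because latency depends on which core a thread runs on relative to its data, it usually pays to pin threads to cores so that the operating system cannot migrate them away from their caches and local memory. The following sketch is an illustration for these notes (Linux-specific, using the GNU extension pthread_setaffinity_np); the logical cpu numbers are the ones listed in /proc/cpuinfo.

    /* Pin the calling thread to one logical cpu.  Assumptions: Linux with
     * glibc (pthread_setaffinity_np and sched_getcpu are GNU extensions).
     * Compile with: gcc -O2 pin.c -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* returns 0 on success, an error number otherwise */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int rc = pin_to_cpu(0);            /* e.g. stay on logical cpu 0 */
        if (rc != 0) {
            fprintf(stderr, "pin_to_cpu failed (error %d)\n", rc);
            return 1;
        }
        printf("now running on cpu %d\n", sched_getcpu());
        return 0;
    }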

  10. AMD Opteron Memory latencies • Local memory reads, =100% (base case) • Local memory writes, ~113% • 1 hop reads, ~108% • 2 hop reads, ~130% • 1 hop writes, ~128% • 2 hop writes, ~150% • Remember, data is placed in physical memory according to a ‘first touch’ policy: a page is allocated in the memory local to the thread that first touches it! • This is benchmarked data: 1 thread, idle machine
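
The practical consequence of first-touch placement is that initialisation loops matter: if one thread initialises a whole array, all of its pages end up on that thread's node and every other thread then pays a 1- or 2-hop penalty. Below is a minimal OpenMP sketch (an illustration for these notes, assuming gcc with -fopenmp) of initialising data with the same schedule that is later used to compute on it.

    /* First-touch placement with OpenMP.  The pages of x and y are placed on
     * the node of whichever thread first writes them, so using the same
     * static schedule for initialisation and computation keeps most accesses
     * local.  Compile with: gcc -O2 -fopenmp firsttouch.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000L

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        double *y = malloc(N * sizeof *y);

        /* first touch: each thread initialises (and thereby places) its chunk */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* same static schedule, so each thread mostly reads local pages */
        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++) sum += x[i] * y[i];

        printf("dot product = %.1f (max threads = %d)\n", sum, omp_get_max_threads());
        free(x); free(y);
        return 0;
    }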

  11. Further information • See www.amd.com. Follow: Products and Technologies -> Server Products -> Server Processors: • Product Brief • Key Architectural Features • Direct Connect Architecture • HyperTransport Technology • Quad-Core AMD Opteron Processor 4P Server and Workstation Comparison • Another useful, though slightly old, document is: • Performance Guidelines for AMD Athlon and Opteron ccNUMA Multiprocessor Systems. Available at: www.amd.com.cn/CHCN/assets/content_type/white_papers_and_tech_docs/40555.pdf

  12. Information on chronos • Look in files such as: • /proc/cpuinfo • /proc/meminfo • /sys/devices/system/cpu/cpu0/cache/index0 to index3 • From information in /proc/cpuinfo you can create a map of the logical processor ids (in the range [0-15], one per core) to physical processor ids [0-3] and (physical) core ids [0-3]. • You should do this!
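
A possible starting point for the mapping exercise above is sketched below (an illustration for these notes, not a provided solution). It scans /proc/cpuinfo and, at the end of each processor block, prints the logical processor id together with the 'physical id' and 'core id' fields; it assumes the usual x86 layout in which each block ends with a blank line.

    /* Build a logical-cpu -> (physical id, core id) map from /proc/cpuinfo.
     * Assumption: x86 Linux field names, one blank line after each block.
     * Compile with: gcc -O2 cpumap.c */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (f == NULL) { perror("/proc/cpuinfo"); return 1; }

        char line[256];
        int cpu = -1, phys = -1, core = -1;
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "processor : %d", &cpu) == 1)    continue;
            if (sscanf(line, "physical id : %d", &phys) == 1) continue;
            if (sscanf(line, "core id : %d", &core) == 1)     continue;
            if (line[0] == '\n' && cpu >= 0)   /* blank line ends a cpu block */
                printf("logical cpu %2d -> physical id %d, core id %d\n",
                       cpu, phys, core);
        }
        fclose(f);
        return 0;
    }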

  13. Results of vec.f on chronos: a plot of performance (Mflop/s) against log10 N (bytes), with the L1 = 64KB, L2 = 512KB and L3 = 6MB cache boundaries marked.
