
The Cray X1 Multiprocessor and Roadmap



Presentation Transcript


  1. The Cray X1 Multiprocessor and Roadmap

  2. Cray’s Computing Vision: Scalable High-Bandwidth Computing
  [Roadmap diagram spanning 2004–2010: a vector product line running from X1 and X1E through ‘Black Widow’ and ‘Black Widow 2’ toward ‘Cascade’ with sustained petaflops in 2010, and a Red Storm (RS) line running through ‘Strider 2’, ‘Strider 3’, and ‘Strider X’ over 2004–2006, converging through product integration.]

  3. Cray X1: Extreme scalability with high-bandwidth vector processors
  • Cray PVP heritage: powerful vector processors, very high memory bandwidth, non-unit-stride computation, special ISA features (modernized in the new ISA)
  • T3E heritage: extreme scalability, optimized communication, memory hierarchy, synchronization features (improved via vectors)

  4. Cray X1 Instruction Set Architecture
  New ISA:
  • Much larger register set (32x64 vector, 64+64 scalar)
  • All operations performed under mask
  • 64- and 32-bit memory and IEEE arithmetic
  • Integrated synchronization features
  Advantages of a vector ISA:
  • Compiler provides useful dependence information to hardware
  • Very high single-processor performance with low complexity: ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
  • Localized computation on the processor chip: large register state with very regular access patterns; registers and functional units grouped into local clusters (pipes) → excellent fit with future IC technology
  • Latency tolerance and pipelining to memory/network → very well suited for scalable systems
  • Easy extensibility to the next-generation implementation
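
  To make the “operations performed under mask” point concrete, here is a generic C loop (illustrative only, not taken from the slides) of the kind a vectorizing compiler can map onto masked vector instructions: the conditional becomes a per-element mask rather than a branch, so the loop still vectorizes.

      /* Each iteration is independent; the if() turns into a vector mask. */
      void scale_positive(double * restrict y, const double * restrict x,
                          double a, long n)
      {
          for (long i = 0; i < n; i++) {
              if (x[i] > 0.0)          /* predicate becomes the mask */
                  y[i] = a * x[i];     /* executed only where the mask is set */
          }
      }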

  5. Not Your Father’s Vector Machine
  • New instruction set, new system architecture, new processor microarchitecture
  • “Classic” vector machines were programmed differently:
    • Classic vector: optimize for loop length with little regard for locality
    • Scalable micro: optimize for locality with little regard for loop length
  • The Cray X1 is programmed like other parallel machines:
    • Rewards locality: register, cache, local memory, remote memory
    • Decoupled microarchitecture performs well on short loop nests
    • (however, it does require vectorizable code)
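
  A minimal sketch of the two programming styles contrasted above, written as generic C (the function names and the blocking factor BS are illustrative assumptions, not from the slides). Both variants compute y += A*x for an n x n row-major matrix; the second trades inner-loop length for reuse of a cache-resident chunk of x.

      #define BS 256   /* assumed blocking factor, chosen to keep a chunk of x resident */

      /* "Classic vector" style: maximize inner-loop length, ignore reuse. */
      void matvec_long(double *y, const double *A, const double *x, long n)
      {
          for (long i = 0; i < n; i++)
              for (long j = 0; j < n; j++)
                  y[i] += A[i*n + j] * x[j];
      }

      /* Locality-aware style: block over j so x[jb..jend) is reused across
         many rows, at the cost of shorter inner loops. */
      void matvec_blocked(double *y, const double *A, const double *x, long n)
      {
          for (long jb = 0; jb < n; jb += BS) {
              long jend = (jb + BS < n) ? jb + BS : n;
              for (long i = 0; i < n; i++)
                  for (long j = jb; j < jend; j++)
                      y[i] += A[i*n + j] * x[j];
          }
      }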

  6. Cray X1 Node: 51 Gflops, 200 GB/s
  [Node diagram: 16 P chips, each with a cache ($), connected to 16 M chips and their local memory banks, plus I/O.]
  • Four multistream processors (MSPs), each 12.8 Gflops
  • High-bandwidth local shared memory (128 Direct Rambus channels)
  • 32 network links and four I/O links per node
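
  A quick check of the headline node numbers (the per-channel rate is an assumption; the slide quotes only the total):

      4 MSPs/node x 12.8 Gflops/MSP = 51.2 Gflops/node   (quoted as 51 Gflops)
      128 channels x ~1.6 GB/s/channel ≈ 205 GB/s        (quoted as 200 GB/s)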

  7. NUMA: Scalable up to 1024 Nodes
  Interconnection network:
  • 16 parallel networks for bandwidth
  • Global shared memory across the machine

  8. Network Topology (16 CPUs)
  [Diagram: four 4-processor nodes (node 0–3); each node’s memory controllers M0–M15 connect to the corresponding network sections 0–15.]

  9. Network Topology (128 CPUs)
  [Diagram: eight routers (R) per network section, with 16 links.]

  10. Network Topology (512 CPUs)

  11. Designed for Scalability
  T3E heritage:
  • Distributed shared memory (DSM) architecture
  • Low-latency load/store access to the entire machine (tens of TBs)
  • Decoupled vector memory architecture for latency tolerance
  • Thousands of outstanding references, flexible addressing
  Very high performance network:
  • High bandwidth, fine-grained transfers
  • Same router as Origin 3000, but 32 parallel copies of the network
  Architectural features for scalability:
  • Remote address translation
  • Global coherence protocol optimized for distributed memory
  • Fast synchronization
  • Parallel I/O scales with system size

  12. Decoupled Microarchitecture
  • Decoupled access/execute and decoupled scalar/vector
  • Scalar unit runs ahead, doing addressing and control:
    • Scalar and vector loads issued early
    • Store addresses computed and saved for later use
    • Operations queued and executed later when data arrives
  • Hardware dynamically unrolls loops:
    • Scalar unit goes on to issue the next loop before the current loop has completed
    • Full scalar register renaming through 8-deep shadow registers
    • Vector loads renamed through load buffers
    • Special sync operations keep the pipeline full, even across barriers
  This is key to making the system perform well on short-vector-length (short-VL) code.
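
  For illustration, this is the kind of short-vector-length loop nest the slide refers to (the sizes are made up): the inner loop has only a handful of iterations, so per-loop startup cost would dominate on a classic vector machine, but a decoupled scalar unit can keep issuing successive inner loops before earlier ones complete.

      #define NCELLS 100000
      #define NVARS  8                            /* short inner trip count */

      void update(double state[NCELLS][NVARS],
                  const double delta[NCELLS][NVARS])
      {
          for (long c = 0; c < NCELLS; c++)
              for (int v = 0; v < NVARS; v++)     /* vector length of only 8 */
                  state[c][v] += delta[c][v];
      }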

  13. Address Translation
  • High translation bandwidth: one scalar plus four vector translations per cycle per P chip
  • Source translation using 256-entry TLBs with support for multiple page sizes
  • Remote (hierarchical) translation:
    • Allows each node to manage its own memory (eases memory management)
    • The TLB only needs to hold translations for one node, so translation scales with system size
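
  A sketch of the remote-translation idea in C, with a purely hypothetical address layout (the field widths and positions are assumptions, not the actual Cray X1 format): high-order bits name the owning node and are translated at the source, while the low-order offset is translated only by that node’s own TLB.

      #include <stdint.h>

      #define NODE_SHIFT 38                   /* assumed offset width */
      #define NODE_MASK  0x3FFull             /* assumed field for up to 1024 nodes */

      /* Which node owns this global address (resolved at the source node). */
      static inline unsigned owner_node(uint64_t gaddr)
      {
          return (unsigned)((gaddr >> NODE_SHIFT) & NODE_MASK);
      }

      /* Node-local offset; only the owning node's TLB needs a mapping for it. */
      static inline uint64_t node_offset(uint64_t gaddr)
      {
          return gaddr & ((1ull << NODE_SHIFT) - 1);
      }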

  14. Cache Coherence
  Global coherence, but only memory from the local node is cached (8–32 GB):
  • Supports SMP-style codes up to 4 MSPs (16 SSPs)
  • References outside this domain are converted to non-allocate
  • Scalable codes use explicit communication anyway
  • Keeps the directory entry and protocol simple
  Explicit cache allocation control:
  • Per-instruction hint for vector references
  • Per-page hint for scalar references
  • Use non-allocating references to avoid cache pollution
  Coherence directory stored on the M chips (rather than in DRAM):
  • Low latency and very high bandwidth to support vectors
  • Typical CC system: 1 directory update per processor per 100 or 200 ns
  • Cray X1: 3.2 directory updates per MSP per ns (a factor of several hundred)
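
  The “factor of several hundred” follows directly from the quoted rates:

      1 update per 100 ns = 0.01 updates/ns per processor
      3.2 / 0.01 = 320x   (or 640x against the 200 ns figure)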

  15. System Software

  16. Cray X1 I/O Subsystem
  [I/O diagram: X1 node I chips connect over SPC channels (2 x 1.2 GB/s) to I/O boards carrying IOCA bridges onto PCI-X buses (800 MB/s); PCI-X cards (200 MB/s) provide FC-AL links to RS200 storage (SB/CB modules) running XFS and XVM with ADIC SNFS; CNS gateways provide Gigabit Ethernet or HIPPI network connectivity, and CPES attaches via IP over fiber.]

  17. Cray X1 UNICOS/mp: Single System Image
  • A UNIX kernel executes on each application node (somewhat like the Chorus™ microkernel on UNICOS/mk), providing SSI, scaling, and resiliency
  • The UNICOS/mp global resource manager (Psched) schedules applications on Cray X1 nodes
  • Commands launch applications as on the T3E (mpprun)
  • System Service nodes provide file services, process management, and other basic UNIX functionality (like the /mk servers); user commands execute on System Service nodes
  [Diagram shows a 256-CPU Cray X1.]

  18. Programming Models
  • Shared memory:
    • OpenMP, pthreads
    • Single node (51 Gflops, 8–32 GB)
    • Fortran and C
  • Distributed memory:
    • MPI
    • shmem(), UPC, Co-array Fortran
  • Hierarchical models (DM over OpenMP)
  • Both MSP and SSP execution modes supported for all programming models
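
  As a minimal sketch of the hierarchical model listed above (distributed memory over OpenMP), here is generic MPI + OpenMP code in C; it is not Cray-specific, and the work inside the loop is an arbitrary placeholder.

      #include <mpi.h>
      #include <omp.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int provided, rank, nranks;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          /* OpenMP threads share memory within a node... */
          double local = 0.0;
          #pragma omp parallel for reduction(+:local)
          for (long i = 0; i < 1000000; i++)
              local += 1.0 / (double)(i + 1 + rank);    /* placeholder work */

          /* ...while MPI moves data between distributed-memory ranks. */
          double total = 0.0;
          MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0)
              printf("ranks=%d threads/rank=%d total=%f\n",
                     nranks, omp_get_max_threads(), total);

          MPI_Finalize();
          return 0;
      }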

  19. Mechanical Design

  20. Node Board
  [Photo callouts: field-replaceable memory daughter cards, spray-cooling caps, CPU MCM (8 chips), network interconnect, 17" x 22" PCB, air-cooling manifold.]

  21. Cray X1 Node Module

  22. Cray X1 Chassis

  23. 64-Processor Cray X1 System: ~820 Gflops

  24. Cray BlackWidow: The Next-Generation HPC System from Cray Inc.

  25. BlackWidow Highlights
  • Designed as a follow-on to the Cray X1:
    • Upward-compatible ISA
    • FCS in 2006
  • Significant improvement (>> Moore’s Law rate) in:
    • Single-thread scalar performance
    • Price/performance
  • Refine the X1 implementation:
    • Continue to support DSM with 4-way SMP nodes
    • Further improve sustained memory bandwidth
    • Improve scalability up and down
    • Improve reliability/fault tolerance
    • Enhance the instruction set
  • A few BlackWidow features:
    • Single-chip vector microprocessor
    • Hybrid electrical/optical interconnect
    • Configurable memory capacity, memory bandwidth, and network bandwidth

  26. Cascade Project: Cray Inc., Stanford, Cal Tech/JPL, Notre Dame

  27. High Productivity Computing Systems
  Goals:
  • Provide a new generation of economically viable, high-productivity computing systems for the national security and industrial user community (2007–2010)
  Impact:
  • Performance (efficiency): speed up critical national security applications by a factor of 10x to 40x
  • Productivity (time-to-solution)
  • Portability (transparency): insulate research and operational application software from the system
  • Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
  HPCS program focus areas:
  • Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
  Fill the critical technology and capability gap between today (late-80s HPC technology) and the future (quantum/bio computing).

  28. HPCS Planned Program, Phases 1–3
  [Timeline diagram, fiscal years 02–10: Phase 1 industry concept study, Phase 2 R&D, and Phase 3 full-scale development (planned), with critical program milestones including the industry BAA/RFP, concept reviews, Phase 2 and Phase 3 readiness reviews, system design review, and PDR and early prototype. Parallel tracks cover application analysis, requirements and metrics, performance assessment, industry-guided research and technology assessments, early academia benchmarks, metrics and benchmarks, software research pilot platforms, research prototypes and pilot systems, and HPCS capability or products (tools, platforms).]

  29. Technology Focus of Cascade
  • System design:
    • Network topology and router design
    • Shared memory for low-latency, low-overhead communication
  • Processor design:
    • Exploring clustered vectors, multithreading, and streams
    • Enhancing temporal locality via deeper memory hierarchies
    • Improving latency tolerance and memory/communication concurrency
  • Lightweight threads in memory:
    • For exploitation of spatial locality with little temporal locality
    • Move the thread to the data instead of the data to the thread
    • Exploring use of PIMs
  • Programming environments and compilers:
    • High-productivity languages
    • Compiler support for heavyweight/lightweight thread decomposition
    • Parallel programming performance analysis and debugging
  • Operating systems:
    • Scalability to tens of thousands of processors
    • Robustness and fault tolerance
  This lets us explore technologies we otherwise couldn’t, with a three-year head start on the typical development cycle.

  30. Cray’s Computing Vision: Scalable High-Bandwidth Computing
  [Roadmap diagram repeated from slide 2: the X1/X1E/‘Black Widow’/‘Black Widow 2’ vector line and the Red Storm ‘Strider’ line, converging through product integration toward ‘Cascade’ with sustained petaflops in 2010.]

  31. Questions? File Name: BWOpsReview081103.ppt
