The Alpha Roadmap: How it applies to Alpha clusters

Ray Hookway, Compaq Computer Corporation, Littleton, MA (Ray.Hookway@compaq.com)



  1. The Alpha Roadmap: How it applies to Alpha clusters • Ray Hookway, Compaq Computer Corporation, Littleton, MA • Ray.Hookway@compaq.com

  2. Map Features • Alpha Processor Roadmap • Alpha Systems • Alpha Clusters

  3. Processor Roadmap References: • Pete Bannon, “Alpha 21364: A Scalable Single-chip SMP”, Microprocessor Forum 1998, http://www.digital.com/alphaoem/microprocessorforum.htm • Joel Emer, “Simultaneous Multithreading: Multiplying Alpha Performance”, Microprocessor Forum 1999

  4. Alpha Roadmap (figure; timeline 1998 to 2003) • Higher Performance track: EV6 (21264), EV7, EV8, with process labels 0.35 µm, 0.18 µm, 0.125 µm • Lower Cost track: EV67, EV68, EV78, ..., with process labels 0.28 µm, 0.18 µm, 0.125 µm

  5. Alpha 21264 is performance leader • SPECfp95 (bar chart), values in chart order: 58.7, 20.6, 17.4, 12.6 • Systems in chart order: Compaq AlphaServer DS20, Sun UE250, HP D380, IBM F50

  6. Alpha 21264 Systems • AlphaServer 8400 with EV6/575 • Footnotes: * 37,541 tpmC at $79.4/tpmC for 8 CPUs, 16 GB, Sybase V11.9, available 12/98; ** estimated

  7. IA-64 vs. Alpha Philosophy • EPIC: smart compiler and a dumb machine • Compiler creates record of execution • Machine plays record • Stall when compiler is wrong • Focus on vector programs • Compiler transforms scalar to vector • What about: function calls, indirection; dynamic linking; C++, Java/JIT • ALPHA: smart compiler, smart machine, and a GREAT circuit design • Compiler creates record of execution • Machine exploits additional information available at runtime • Works across barriers to compile-time analysis • Focus on scalar programs • Add resources for vector • Amdahl’s law

  8. Alpha 21364 Goals • Improve: single processor performance, operating frequency, and memory system; SMP scaling; system performance density (computes/ft³); reliability and availability • Decrease: system cost; system complexity

  9. “It’s the Memory, Stupid” Dick Sites

  10. Estimated time for TPC-C (chart; bar labels: New core, Higher MHz, Higher integration)

  11. Alpha 21364 Features • Alpha 21264 core with enhancements • Integrated L2 Cache • Integrated memory controller • Integrated network interface • Support for lock-step operation to enable high-availability systems.

  12. 21364 Chip Block Diagram (figure) • 21264 core with 64K Icache and 64K Dcache • 16 L1 miss buffers • 16 L1 victim buffers • 16 L2 victim buffers • L2 cache • Memory controller (RAMBUS) • Network interface with N, S, E, W and I/O ports • Address In, Address Out

  13. 21364 Core (figure; pipeline stages 0 to 6, labeled FETCH, MAP, QUEUE, REG, EXEC, DCACHE) • Fetch: branch predictors, next-line address, L1 instruction cache 64KB 2-set, 4 instructions / cycle • Integer: Int Reg Map, Int Issue Queue (20), two Reg Files (80), Exec and Addr units • Floating point: FP Reg Map, FP Issue Queue (15), Reg File (72), FP ADD, Div/Sqrt, FP MUL • Memory: L1 data cache 64KB 2-set, L2 cache 1.5MB 6-set, victim buffer, miss address • 80 in-flight instructions plus 32 loads and 32 stores

  14. Integrated L2 Cache • 1.5 MB • 6-way set associative • 16 GB/s total read/write bandwidth • 16 victim buffers for L1 -> L2 • 16 victim buffers for L2 -> memory • ECC SECDED code • 12 ns load-to-use latency
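
As a rough illustration of the geometry these figures imply, the sketch below computes how many sets a 1.5 MB, 6-way set-associative cache would have. The 64-byte line size is an assumption (the slide does not give one), and the function and constant names are made up for the example.

```python
def cache_sets(capacity_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets = capacity / (ways * line size)."""
    bytes_per_set = ways * line_bytes
    if capacity_bytes % bytes_per_set:
        raise ValueError("capacity is not a whole number of sets")
    return capacity_bytes // bytes_per_set


L2_CAPACITY = 3 * 512 * 1024  # 1.5 MB, from the slide
L2_WAYS = 6                   # from the slide
ASSUMED_LINE_BYTES = 64       # assumption: not stated on the slide

sets = cache_sets(L2_CAPACITY, L2_WAYS, ASSUMED_LINE_BYTES)
print(sets)  # 4096
```

Under that assumption the set count is a power of two, which is one reason a 6-way organization fits naturally with a 1.5 MB capacity.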

  15. Integrated Memory Controller • Direct RAMbus • High data capacity per pin • 800 MHz operation • 30ns CAS latency pin to pin • 6 GB/sec read or write bandwidth • 100s of open pages • Directory based cache coherence • ECC SECDED

  16. Integrated Network Interface • Direct processor-to-processor interconnect • 10 GB/second per processor • 15ns processor-to-processor latency • Out-of-order network with adaptive routing • Asynchronous clocking between processors • 3 GB/second I/O interface per processor

  17. 21364 System Block Diagram (figure) • Twelve 21364 chips (labeled 364), each with its own memory (M) and I/O (IO)
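
The compass ports (N, S, E, W) on the chip and the 15 ns processor-to-processor figure suggest a simple back-of-the-envelope latency estimate. The sketch below is only that: it assumes a 2D torus, assumes the 12 chips are arranged 4 x 3, and assumes the 15 ns applies per hop. None of these three points is stated on the slides, and the function names are hypothetical.

```python
HOP_LATENCY_NS = 15  # slide figure, assumed here to apply per hop
COLS, ROWS = 4, 3    # assumed layout for the 12 chips in the diagram


def torus_hops(a, b, cols=COLS, rows=ROWS):
    """Minimum hop count between nodes a=(x, y) and b=(x, y) on a 2D torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # On a torus each dimension wraps around, so take the shorter direction.
    return min(dx, cols - dx) + min(dy, rows - dy)


def estimated_latency_ns(a, b):
    """Hop count times the assumed per-hop latency."""
    return torus_hops(a, b) * HOP_LATENCY_NS


print(estimated_latency_ns((0, 0), (1, 0)))  # direct neighbor: 15
print(estimated_latency_ns((0, 0), (2, 1)))  # three hops: 45
print(estimated_latency_ns((0, 0), (3, 0)))  # wraparound neighbor: 15
```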

  18. Alpha 21364 Technology • 0.18 µm CMOS • 1000+ MHz • 100 Watts @ 1.5 volts • 3.5 cm² • 6 Layer Metal • 100 million transistors (8 million logic, 92 million RAM)
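
A few derived quantities follow directly from the numbers on this slide. The sketch below simply does that arithmetic; the constants are the slide's values and the variable names are chosen for the example.

```python
POWER_W = 100.0          # from the slide
VDD_V = 1.5              # from the slide
DIE_AREA_CM2 = 3.5       # from the slide
LOGIC_MTRANSISTORS = 8   # millions, from the slide
RAM_MTRANSISTORS = 92    # millions, from the slide

supply_current_a = POWER_W / VDD_V               # I = P / V
power_density_w_cm2 = POWER_W / DIE_AREA_CM2     # watts per cm^2 of die
ram_fraction = RAM_MTRANSISTORS / (LOGIC_MTRANSISTORS + RAM_MTRANSISTORS)

print(f"{supply_current_a:.1f} A, "
      f"{power_density_w_cm2:.1f} W/cm^2, "
      f"{ram_fraction:.0%} RAM")
# 66.7 A, 28.6 W/cm^2, 92% RAM
```

The transistor split shows that the on-chip caches, not the core logic, account for the bulk of the device count.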

  19. Alpha 21364 Performance/Status • 70 SPECint95 (estimated) • 140 SPECfp95 (estimated) • RTL model running • Tapeout 4Q99

  20. 21364 Summary • The 21364 integrated L2 cache and memory controller provide outstanding single processor performance • The 21364 integrated network interface enables high performance multi-processor systems • The high level of integration directly supports systems containing a large number of processors

  21. 21464 Overview • Enhanced out-of-order execution • 8-wide superscalar • Large on-chip L2 cache • Direct RAMBUS interface • On-chip router for system interconnect • Glueless, directory-based, ccNUMA for up to 512-way SMP • 4-way simultaneous multithreading (SMT)

  22. Superscalar Instruction Issue (figure; axis: Time)

  23. Multi-Threading (figure; axis: Time)

  24. Simultaneous Multi-Threading (figure; axis: Time)
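
The three issue diagrams can be contrasted with a toy slot-counting model. The sketch below is a simplified illustration, not a model of any Alpha pipeline: the issue width of 4, the per-cycle "ready instruction" counts, and all function names are assumptions, and instructions that fail to issue in a cycle are not carried over to the next.

```python
ISSUE_WIDTH = 4  # toy value chosen for the example


def superscalar(ready, width=ISSUE_WIDTH):
    """One thread only: unused slots in a cycle are wasted."""
    return sum(min(r, width) for r in ready[0])


def multithreaded(ready, width=ISSUE_WIDTH):
    """One thread per cycle, round-robin: threads alternate cycles, never share one."""
    cycles = len(ready[0])
    return sum(min(ready[c % len(ready)][c], width) for c in range(cycles))


def smt(ready, width=ISSUE_WIDTH):
    """Any thread may use any free slot within the same cycle."""
    issued = 0
    for c in range(len(ready[0])):
        free = width
        for thread in ready:
            take = min(thread[c], free)
            issued += take
            free -= take
    return issued


# ready[t][c] = instructions thread t could issue in cycle c (made-up data)
ready = [[2, 0, 1, 3],
         [1, 2, 0, 2]]

print(superscalar(ready), multithreaded(ready), smt(ready))  # 6 7 10
```

With 16 issue slots available over the four cycles, the toy SMT case fills 10 of them, because a second thread can use slots the first leaves empty in the same cycle.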

  25. What Changed? • Multiple Program Counters • Choose among them • More Architectural Register Space • Mapper • Register Files • Distinguished Per Thread Instruction State • Register mapping • Instruction Retire • Store Buffers • Abort and Restart Information

  26. What Didn’t Change • Almost everything else • No basic functional changes in any stage • No partitioned instruction cache • No partitioned data caches • No partitioned off-chip caches • No extra register files • Little special branch prediction mechanism

  27. Multi-threaded Scaling (chart; speedup labels: 1.9x, 1.8x, 2.0x, 2.3x)

  28. AlphaServer Family Today • AlphaServer DS Series: uni and dual processor systems; offerings scale to 8 GB memory; up to 6 PCI slots • AlphaServer ES Series: 1-4 processors; up to 32 GB of memory; up to 10 PCI slots • AlphaServer GS Series: 1-64 processors; up to 128 GB of memory; up to 224 PCI slots • Switch-based system • 64-bit PCI I/O subsystems • Very large memory • Modular system packaging • Advanced systems management • Scalable clusters on DIGITAL UNIX, OpenVMS

  29. AlphaServer DS10 • Fast Memory Access: large total RAM, 128 MB up to 1 GB; high bandwidth access, 1.3 GB/s • Flexible Internal Storage: internal dual channel IDE storage included; optional SCSI adapter supported; 3 internal disk bays • Special Features: 4 PCI I/O slots (3 64-bit, 1 32-bit); 300 watt power supply; 3U small footprint, rack or desktop; dual embedded 10/100 Ethernet ports

  30. New AlphaServer DS Series • Solution for project environment • Fastest uniprocessor design in a 1U form factor • Fastest memory access with the highest bandwidth memory in its class • High speed I/O with 64-bit PCI • Sleek, compact and powerful package • Dual purpose solutions support • Rack and desktop-ready for space constrained environments

  31. AlphaServer DS Series • Fast Memory Access: large total RAM, 64 MB up to 1 GB; high bandwidth access, 1.3 GB/s • Flexible Internal Storage: internal dual channel IDE; wide range of PCI options supported; 2 disk bays, 27 GB IDE / 18 GB SCSI • Special Features: optional slimline CD-floppy combo; tool-less features, snap-out CD and disk; full 1 PCI I/O slot (64-bit); 150 watt power supply; 1U (1.75”) small footprint, rack or desktop; dual embedded 10/100 Ethernet ports • Performance and Management Features: remote management console; Serverworks and Compaq Insight Manager

  32. New AlphaServer DS Series

  33. Two ways to build an Alpha cluster • Sierra: scalable, robust HPC platform; maximum performance over broadest range of applications; outstanding system management and reliability features • Beowulf: complementary, low-cost, open source model; leadership performance over other Linux platforms; Tru64 UNIX compatibility with common SWD tools; support services through Compaq and partners

  34. Sierra Architecture • Tera-scale systems derived from ASCI PathForward • Very large Distributed Shared Memory systems • High speed, scalable interconnect (Quadrics) • Exploit EV6, EV7 & EV8 • Installed and administered as single system • System wide scheduler • High performance file systems (PFS, CFS, AdvFS) • Application availability

  35. Sierra – ASCI PathForward Project

  36. Alpha Beowulf Clusters • Compaq ships 64-bit Linux on Alpha systems • Myrinet and other popular interconnects are supported • ServerNet-II available in late 1999 • Compaq Tru64 UNIX (Digital UNIX) development tools ported in 1999 (!)

  37. Prepackaged Beowulf Cluster • CT-D10MJ-SR: starter DS10-based Beowulf cluster, including eight AlphaServer DS10 compute nodes, one AlphaServer DS10 management station with keyboard/trackball and display, Myrinet™ system area network, 73.1 GB JBOD UltraSCSI disk storage, Ethernet multiplexer for system management, and all Linux software required for basic Beowulf operation.

  38. ServerNet-II Interconnect • Scalable high-performance network. • 65,536 end nodes, 5 km range. • Multi-gigabit, low latency, low CPU, cheap. • VIA - Virtual Interface Architecture. • MPI - Message Passing Interface. • Open source Intel and Alpha Linux drivers. • NT, Tru64, NonStop Clusters, VxWorks.

  39. Virtual Interface Architecture (VIA) • Communication through “Virtual Interfaces (VI)” with associated “Completion Queues (CQ)”. • 65,536 VIs per node. • RDMA & send/recv. • Reliable reception. • < 2% CPU utilization. • Low latency/zero copy. • Thread-safe, protected. • Basis for COMPAQ’s “System I/O”. • Figure (layer diagram): Applications and DBMS apps; OS vendor API; VI primitive library (Open/Close/Map Memory, Send/Receive/Read/Write); CQ and several VIs; kernel support; VI kernel HW interface; SAN media interface (ServerNet, Ethernet, ...)
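
The relationship between a VI's work queues and its completion queue can be pictured with a small in-process toy. The sketch below is not the VIA programming interface: every class and method name is hypothetical, there is no hardware or kernel involved, and it only illustrates the pattern of pre-posting a receive buffer, sending into it, and then polling a shared completion queue.

```python
from collections import deque


class CompletionQueue:
    """Collects completion events from any VI attached to it (toy model)."""

    def __init__(self):
        self._events = deque()

    def push(self, vi_id, kind, length):
        self._events.append((vi_id, kind, length))

    def poll(self):
        """Return the oldest completion, or None if nothing has completed."""
        return self._events.popleft() if self._events else None


class VirtualInterface:
    """Toy endpoint with a queue of pre-posted receive buffers."""

    def __init__(self, vi_id, cq):
        self.vi_id = vi_id
        self.cq = cq
        self.peer = None
        self._recv_buffers = deque()

    def connect(self, other):
        self.peer, other.peer = other, self

    def post_recv(self, buffer: bytearray):
        self._recv_buffers.append(buffer)

    def post_send(self, payload: bytes):
        if not self.peer._recv_buffers:
            raise RuntimeError("peer has no receive buffer posted")
        buffer = self.peer._recv_buffers.popleft()
        if len(payload) > len(buffer):
            raise ValueError("payload larger than posted buffer")
        buffer[:len(payload)] = payload  # data lands directly in the posted buffer
        self.peer.cq.push(self.peer.vi_id, "recv", len(payload))
        self.cq.push(self.vi_id, "send", len(payload))


cq = CompletionQueue()
a, b = VirtualInterface(1, cq), VirtualInterface(2, cq)
a.connect(b)

buf = bytearray(16)
b.post_recv(buf)
a.post_send(b"hello")

print(cq.poll())       # (2, 'recv', 5)
print(cq.poll())       # (1, 'send', 5)
print(bytes(buf[:5]))  # b'hello'
```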

  40. ServerNet-II Components (figure; host label: Beowulf.loc1.Tandem.com)

  41. ServerNet-II Hardware Components • Dual-port PCI interface (NIC) • VIA in hardware, DCE • negligible CPU cost • 64 bit, 33 MHz & 66 MHz • 12 port crossbar switch • wormhole routed • < 300 nsec latency • “fat pipe” channel bonding • bridges to fibre channel, gigabit ethernet • Gigabit ethernet cables • copper or fibre optic • 5 meters to 5 km • Figure labels: Router II, IBC Logic, eight line cards, and two slots marked “FCAL bridge, dual line card, or LAN bridge”

  42. ServerNet-II Hardware Performance • 33 MHz PCI-64: 166 MB/s (single VI), 240 MB/s (multiple VIs) • 66 MHz PCI-64: 197 MB/s (single VI), 350 MB/s (multiple VIs) • 1.25+1.25 gigabit/s links 1999, doubles 2001. • < 300 nanosecond path formation per stage. • 1M end nodes, 5 km fibre optic links.
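
To put the measured throughput in context, it can be compared with the theoretical peak of a 64-bit PCI bus. The sketch below assumes nominal 33 MHz and 66 MHz clocks and decimal megabytes, which the slide does not spell out; the helper names are made up for the example.

```python
def pci_peak_mb_s(clock_mhz: float, width_bits: int = 64) -> float:
    """Peak transfer rate: one bus-width transfer per clock."""
    return clock_mhz * width_bits / 8


def bus_utilization(measured_mb_s: float, clock_mhz: float) -> float:
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_mb_s / pci_peak_mb_s(clock_mhz)


# Measured figures from the slide, in MB/s
measured = {
    33: {"single VI": 166, "multiple VIs": 240},
    66: {"single VI": 197, "multiple VIs": 350},
}

for clock, results in measured.items():
    peak = pci_peak_mb_s(clock)
    for label, mb_s in results.items():
        share = bus_utilization(mb_s, clock)
        print(f"{clock} MHz, {label}: {share:.0%} of {peak:.0f} MB/s peak")
```

Under these assumptions, multiple VIs come close to saturating the 33 MHz bus, while at 66 MHz the bus itself has headroom left over.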

  43. Reliability • Dual port NICs, dual network topologies. • Link level CRC, in-band control protocol. • Strong packet ordering guarantees. • Every packet is acknowledged by receiver. • Automatic retry on transmission failure. • Avoids deadlock & livelock.
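
The combination of a link-level CRC, per-packet acknowledgement, and automatic retry can be sketched generically. The code below is not the ServerNet-II protocol, whose details are not given on the slides: it uses CRC-32 from Python's `zlib` as a stand-in checksum, a simulated channel that corrupts the first few frames, and hypothetical function names.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


def send_reliably(payload: bytes, channel, max_retries: int = 3) -> int:
    """Resend until a clean frame arrives; return the number of attempts."""
    for attempt in range(1, max_retries + 2):
        received = check_frame(channel(make_frame(payload)))
        if received == payload:  # stands in for the receiver's acknowledgement
            return attempt
    raise RuntimeError("transmission failed after retries")


def flaky_channel(corrupt_first_n: int):
    """Simulated link that corrupts the first N frames it carries."""
    state = {"sent": 0}

    def channel(frame: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            return bytes([frame[0] ^ 0xFF]) + frame[1:]  # flip one byte
        return frame

    return channel


print(send_reliably(b"packet", flaky_channel(2)))  # 3
```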

  44. Linux Software Development Tools for AlphaServers • Boosted performance with Compaq Portable Math Library (CPML) for Linux on Alpha • Significantly increases the precision and speed of mathematical calculations, up to 10 times compared to other mathematical libraries currently available on Linux • Following the success of the Portable Math Library, now announcing plans for Compaq Extended Math Library to run on Linux AlphaServer systems • Compaq C compiler announced in April • Compaq Fortran compilers for Linux announced in April, available in July beta program • New Compaq C++ compiler • Makes it easy to support both Linux and Tru64 UNIX software • New Software Development Test-Drive capabilities • Test out the performance of your application over the Web • Get help from our leading Linux developers to optimize your application

  45. SPEC CPU Benchmark (chart; results not audited)

  46. Linpack (100x100) MFlops

  47. Summary • Alpha is the fastest processor available • Alpha is available in a full range of high performance systems • Sierra systems provide complete tera-scale solutions • Compaq wants to be involved in the Beowulf community
