Sun's Weak Points in UE10000
• DSD/DR is not used by customers
• Sun will not provide DSD reference sites [Giga].
• A regular system administrator cannot make DSD/DR changes; it takes a very skilled system administrator to handle them [Giga].
• Very few customers use DSD/DR in database-related production environments. DSD/DR is used more often in testing environments [Giga].
• Few customers use DSDs. Those who do say it works fine most of the time [Gartner].
• Quality Problems
• Terrible problems with the USII last year [unable to do root-cause analysis]. Some customers won't return to Sun, but will stay in the Sun fold with Fujitsu [Giga].
• The E-cache problem does not just bring down the affected domain; it brings the whole UE10K down.
• Sun has had great difficulty designing reliable enterprise-level servers. Because of its background as a workstation vendor, it is behind in "design for reliability" technology.
• The UltraSPARC II based systems did not have ECC in cache memory, with all the resulting reliability problems. The USIII now supports ECC in the level-2 cache, but Sun is still behind: it has no chip-kill technology or DMR.
• No Virtual Partitions
• No goal-based or multi-system workload management
Single Points of Failure (SPOF)
HP has the lowest SPOF failure rate: the SPOF failure rate between partitions in Superdome (called the "infrastructure failure rate") is lower than the infrastructure failure rate of S/390 LPARs and certainly much lower than that of Sun UE10K domains.
How can this be, when Sun claims that the UE10K has "complete hardware redundancy"?
Sun's definition of SPOF: looking carefully at the literature, "complete hardware redundancy" means that a fully redundant system will always recover from a system crash by using (booting from) standby hardware. Therefore, this "complete hardware redundancy" is really a collection of single points of failure by HP's definition (the one the customer cares about).
Source: Ken Pomaranski, Hardware HA Architect
Does Sun really understand reliability?
From the UE10K RAS manual:
• "Sun has made the time required for a module replacement much shorter [over time]. These enhancements, coupled with improved diagnostic capabilities, have reduced the cycle time on systems, simultaneously increasing reliability and availability." Isn't reliability about keeping systems running?
• There is currently no industry-adopted means to measure MTBF. Therefore, comparisons between vendors are of questionable use. How then does Sun track server reliability?
• "Each UE10K can be configured to have 100% HW redundancy." Shouldn't the UE10K then never fail?
Sun's Customers Understand!
• Topping their list of complaints are the frequency of server crashes caused by the problem [memory], fixes that don't work and Sun's tendency to initially blame the problem on other factors before acknowledging it - often only under a nondisclosure agreement. (Computer World, 9/04/2000)
• "They treated the whole thing like a cover-up," said one user at a large utility in the Western U.S. who asked not to be named. (Computer World, 9/04/2000)
• "The long-standing nature of the problem and Sun's handling of the issue raise troubling questions about the quality of Sun's hardware and support." (Gartner Group)
• Engineers have long known that memory chips can be disrupted by radiation and other environmental factors. That is why Hewlett-Packard and IBM use error-correcting code, or ECC, which detects cache errors and restores bits that were changed by mistake. (Forbes, 11/13/2000; a simple ECC illustration follows below.)
• Sun servers lack ECC protection. "Frankly, we just missed it. It's something we regret at this point," Shoemaker [Sun executive VP] says. (Forbes, 11/13/2000)
What else have they "missed"??
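Note on the ECC point above: the idea the Forbes article describes can be shown with a toy single-error-correcting Hamming(7,4) code. The C sketch below is only an illustration of the principle, not the SECDED logic any vendor actually puts in cache hardware; it encodes four data bits, flips one bit to simulate a radiation-induced error, and recovers the original data.

#include <stdio.h>
#include <stdint.h>

/* Encode 4 data bits into a 7-bit Hamming(7,4) codeword.
 * Codeword bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4. */
static uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;

    uint8_t p1 = d1 ^ d2 ^ d4;   /* parity over positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* parity over positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* parity over positions 4,5,6,7 */

    return (uint8_t)(p1 << 0 | p2 << 1 | d1 << 2 |
                     p3 << 3 | d2 << 4 | d3 << 5 | d4 << 6);
}

/* Decode a 7-bit codeword: correct a single flipped bit (if any)
 * and return the 4 data bits. */
static uint8_t hamming74_decode(uint8_t cw)
{
    uint8_t s1 = ((cw >> 0) ^ (cw >> 2) ^ (cw >> 4) ^ (cw >> 6)) & 1;
    uint8_t s2 = ((cw >> 1) ^ (cw >> 2) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t s3 = ((cw >> 3) ^ (cw >> 4) ^ (cw >> 5) ^ (cw >> 6)) & 1;
    uint8_t syndrome = (uint8_t)(s1 | s2 << 1 | s3 << 2); /* 1-based error position */

    if (syndrome)                        /* single-bit error: flip it back */
        cw ^= (uint8_t)(1u << (syndrome - 1));

    return (uint8_t)(((cw >> 2) & 1)      | ((cw >> 4) & 1) << 1 |
                     ((cw >> 5) & 1) << 2 | ((cw >> 6) & 1) << 3);
}

int main(void)
{
    uint8_t data = 0xB;                  /* 4 data bits: 1011 */
    uint8_t cw   = hamming74_encode(data);
    cw ^= 1u << 4;                       /* simulate a radiation-induced bit flip */
    printf("sent 0x%X, recovered 0x%X\n", data, hamming74_decode(cw));
    return 0;
}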
Sun's UE10K Dynamic Reconfiguration Weaknesses
Sun's UE10K implementation of DR is not quite as dynamic as Sun would have you believe. It's a marketing tale!!!
• Hot-swapping I/O requires that CPU and memory also be brought down.
• Any DR activity requires that the database be shut down, making applications unavailable during the process.
• DR cannot be used in combination with memory interleaving across system boards, which reduces maximum performance. Sun customers have to choose between good system performance and DR functionality, but cannot get both at the same time!
• DR is not supported in combination with SunCluster fail-over. Since the system halts during a DR operation, SunCluster considers the system to be failing and starts a fail-over procedure to another system. Sun customers have to choose between a true multi-system, high-availability solution and the use of DR, but cannot get both at the same time!
• DR conflicts with Intimate Shared Memory (ISM) used by demanding applications. To improve performance, most memory-intensive applications, such as databases, make use of the Intimate Shared Memory (ISM) capability in the E10000 (a minimal sketch follows this slide). Most applications using ISM do not allow dynamic addition or removal of their shared memory allocation. Using memory-intensive applications with ISM (like large databases) while making the most efficient use of partitions prevents the use of DR.
• Deactivating/moving a system board with full memory can take 15 minutes (to back up and rearrange memory contents). All activities in the affected partition(s) have to be paused during that time! (To compensate, Sun introduced TurboDR boards with just CPUs, no memory...)
Source: John Wiltschut, BSTO Marketing
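On the ISM limitation above: on Solaris, an application typically obtains Intimate Shared Memory by attaching a System V shared-memory segment with the SHM_SHARE_MMU flag. The minimal C sketch below shows that pattern; the 256 MB size is an arbitrary example, not a value from any product. Because the segment size is fixed at shmget() time and ISM pages are locked, such memory cannot simply be grown, shrunk or relocated later, which is consistent with the DR limitation described in the list.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_SHARE_MMU
#define SHM_SHARE_MMU 0  /* Solaris-specific flag; stubbed so the sketch compiles elsewhere */
#endif

int main(void)
{
    size_t size = 256UL * 1024 * 1024;           /* segment size is fixed here... */

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1) { perror("shmget"); return 1; }

    /* ...and SHM_SHARE_MMU asks Solaris for Intimate Shared Memory: the pages
     * are locked in RAM and the address translations are shared between
     * processes. Locked, fixed-size memory like this is what a DR "detach
     * board" operation cannot transparently move or shrink. */
    void *addr = shmat(id, NULL, SHM_SHARE_MMU);
    if (addr == (void *)-1) { perror("shmat"); return 1; }

    printf("ISM segment of %zu bytes attached at %p\n", size, addr);

    shmdt(addr);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}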
Sun blames HP and IBM for copying the E10000
The truth is:
• Superdome is more original than the E10000 has ever been: the E10K is an exact copy of the Cray CS6400.
• Sun is just playing catch-up: the E10000's performance, reliability and functionality are inferior.
• The E10000 is an end-of-line product based on old technology and without future expansion capabilities.
• Superdome is built as an advanced architecture based on the latest technology and with very strong growth potential.
• Sun has never developed a high-end server by itself. Heard of Superdome?
The E10000 is COPIED by Sun (from Cray)
• The CS6400 was developed by Cray and announced in 1993.
• It supported up to 64 SuperSPARC processors (60 MHz) and ran CRS-OS, based on Solaris but modified by Cray.
• Most CS6400s were used with fewer than 30 CPUs, as the machine did not scale very well.
• In 1996 Sun purchased this technology from Cray/SGI and introduced a copy in 1997 under the name E10000.
• All the basic technology was already present in the CS6400, and Sun has never added any breakthrough improvements.
64 SMP CPUs in a Single Cabinet
• HP Superdome supports 64 CPUs in a single system with SMP functionality.
• Superdome is built as an advanced architecture based on the latest technology and with very strong growth potential. The modular packaging allows you to use only half the cabinet size for up to 32 processors.
• Superdome has 3 base cabinet configurations. The E10K comes in full size, even with only a few CPUs.
• A 48-CPU Superdome delivers 71% more performance* in a system that is only 20% wider than a 64-CPU E10000.
Sun claims: supported with Solaris since 1993.
The reality:
• The Cray CS6400 (announced in 1993) was not developed by Sun, ran CRS-OS and had very limited scalability.
• The E10K is a copy of the CS6400 without significant breakthrough technology added by Sun.
* based on TPC benchmark with Oracle
Full Dynamic Partitioning
• HP is the first vendor to provide the full spectrum of partitioning: Hyperplex, nPartitions, virtual partitions and automatic resource partitioning. The different levels of partitioning can be combined as desired.
• nPartitions can be added and removed within an active Superdome.
• Virtual partitions are dynamic at the CPU level, not just the cell level.
Sun claims: supported with Solaris since 1997.
The reality:
• Sun still does not support "full" dynamic partitioning (it does not support dynamic control by applications). Dynamic System Domains (DSD) require operator intervention and usually a reboot.
• The use of DSD has many limitations: it cannot be combined with memory interleaving, SunCluster fail-over or Intimate Shared Memory*. Domains always have to be multiples of 4 CPUs.
* see whitepaper "DSD and DR -- the true story"
only hp offers the full spectrum of partitioning
From isolation to flexibility:
• hyperplex: hard partitions with multiple nodes; complete hardware and software isolation; multiple OS images
• nPartitions: hard partitions within a node; hardware isolation per cell; complete software isolation; multiple OS images
• virtual partitions: virtual partitions within hard partitions; software isolation; multiple OS images; dynamic resource allocation
• resource partitions: prm (Process Resource Mgr) and hp-ux wlm (Workload Manager); 1 OS image; automatic goal-based resource allocation via set slo's
Sun's equivalents fall short:
• suncluster: no high-speed interconnect; 8 node max.; doesn't work with sun's dr
• dynamic system domains (dsd): require reboot in most situations; difficult to modify configuration (sun experts are usually needed)
• solaris resource manager (srm): expensive; doesn't manage i/o; not goal-based like hp-ux wlm
No... Sun can't match.
Automated DR* / Hot-swap CPU + Memory
• HP-UX can dynamically deallocate processors and memory with DPR and DMR (dynamic processor and memory resilience) in case of failures. This is a fully automatic process.
• Cell boards can be added and removed in an active Superdome.
• HP has been using error checking and correcting in cache memory to prevent most processor and system failures. Sun did not in the US II.
Sun claims: supported with Solaris since 2000/1997.
The reality:
• Automated DR is nothing more than scripting of an otherwise manual cell board replacement process. Dynamic Reconfiguration (DR) has many limitations (similar to DSDs**); see the sketch below.
• If a processor fails, the domain crashes and a reboot is required. This is neither automatic nor dynamic.
* DR = Dynamic Reconfiguration
** see whitepaper "DSD and DR -- the true story"
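To give a flavour of the operator-driven side of processor reconfiguration on Solaris, the C sketch below uses the p_online(2) system call to query a CPU's state and take it offline by hand; the CPU id 3 is a made-up example (psrinfo lists the real ones on a given machine). This is only an illustration of manual deallocation, contrasted above with HP's automatic DPR/DMR; it is not Sun's DR implementation itself.

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/processor.h>   /* Solaris: p_online(), P_ONLINE, P_OFFLINE, P_STATUS */

int main(void)
{
    processorid_t cpu = 3;   /* example CPU id only; pick one reported by psrinfo */

    /* Query the current state first (P_STATUS changes nothing). */
    int state = p_online(cpu, P_STATUS);
    if (state == -1) {
        fprintf(stderr, "p_online(P_STATUS) failed: %s\n", strerror(errno));
        return 1;
    }
    printf("CPU %d current state: %s\n", (int)cpu,
           state == P_ONLINE ? "online" : "not online");

    /* Take the CPU offline; requires sufficient privilege, and the kernel
     * refuses if this is the last online CPU or if threads are bound to it. */
    if (p_online(cpu, P_OFFLINE) == -1) {
        fprintf(stderr, "p_online(P_OFFLINE) failed: %s\n", strerror(errno));
        return 1;
    }
    printf("CPU %d is now offline\n", (int)cpu);
    return 0;
}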
Interdomain Networking
• HP supports other high-speed communication links, such as Hyperfabric and Fibre Channel, and recommends against using IDN because of the lack of isolation between partitions.
Sun claims: supported with Solaris since 1999.
The reality:
• Interdomain networking (IDN) uses shared memory, and the connected domains are not isolated from failures in the other domains. As IDN violates hardware isolation (the main reason for partitioning), it increases the risk of downtime.
• Sun does not support a high-speed interconnect like Hyperfabric for high-bandwidth data transfer between nodes and partitions.
Clustered File Systems
• HP supports multiple file system options depending on customer needs. CIFS/9000 is a global file system supporting multi-platform, multi-OS environments.
• MC/ServiceGuard provides a superior, mature solution with support for up to 16 nodes and hundreds of applications, and has more than 45,000 installations. Hyperplex supports hundreds of clustered nodes.
Sun claims: supported with Solaris since 2000 (December).
The reality:
• This was promised for SunCluster 3.0 but was never delivered (confirmed during the press conference). Sun tries to get around it by using marketing terms like "cluster-aware file system" and "cluster file service".
• Sun's clustering solutions have always been behind, and customers have always preferred other solutions. Even now SunCluster 3.0 supports only 8 nodes and is focused on Solaris only.
Global Network Services
• HP's MC/ServiceGuard already provides flexible IP addresses so that applications can fail over to other nodes in a cluster without any problem.
• HP is focused on supporting multi-platform, multi-OS environments based on customer demand.
Sun claims: supported with Solaris since 2000 (December).
The reality:
• This is mainly about abstracting an IP service from a network interface, so that applications can be moved in a cluster (HA fail-over); see the sketch below. To speak in Sun terms: nothing new...
• Sun is focused on Solaris-only solutions with no support for multi-OS.
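As a rough illustration of "abstracting an IP service from a network interface": the C sketch below binds a TCP listener to a relocatable service address rather than to a node-specific one, so the same program can be restarted on whichever cluster node currently owns that address. The address 192.0.2.10 and port 5000 are placeholders, not values taken from MC/ServiceGuard or SunCluster.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    const char *service_ip   = "192.0.2.10"; /* floating address owned by the cluster, not the node */
    const int   service_port = 5000;         /* placeholder port */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) { perror("socket"); return 1; }

    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(service_port);
    if (inet_pton(AF_INET, service_ip, &addr.sin_addr) != 1) {
        fprintf(stderr, "bad address %s\n", service_ip);
        return 1;
    }

    /* Binding to the relocatable service address (instead of a fixed interface
     * address) is what lets fail-over software restart the listener unchanged
     * on whichever node currently holds the address. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1) { perror("bind"); close(fd); return 1; }
    if (listen(fd, 16) == -1) { perror("listen"); close(fd); return 1; }

    printf("service listening on %s:%d\n", service_ip, service_port);
    /* accept() loop would go here */
    close(fd);
    return 0;
}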
What Sun does not say...
• Reliability: Sun's current systems do not have error checking and correcting, dynamic processor and memory resilience, or chip-kill technology. Analysts and the press have reported serious problems with Sun E10000 systems at customer sites; see the Forbes and Gartner articles.
• Performance: the US II processor lacks performance compared to HP's current offerings, resulting in much lower system performance. Even the US III will barely meet current PA-RISC performance levels.
• I/O bandwidth: today's applications, such as broadband and data warehousing, require high I/O bandwidth, which Sun does not deliver.
• Investment protection: current Sun products are basically end-of-life. The US III requires new boxes and runs only the Solaris 8 OS.
• Multi-platform support: Sun's vision is limited to Solaris/SPARC only, not multi-platform environments.
Sun's systems are lagging in all of these areas.
Who is really playing Catch-Up?
leadership: performance, flexibility, availability
[Comparison chart, hp superdome vs. sun e10000: performance/scalability (CPU 64 vs. 64, memory 256 vs. 64/128, I/O 192 vs. 64, tpm 200K+ vs. 115K/156K); flexibility (hyperplex, nPartitions, virtual partitions, resource partitions, utility pricing/iCOD, IA-64 multi-OS); availability (multi-system, single system); investment protection. Ratings: leadership / limited / weakness.]
Sun's Dark Secret
"Sun Screen": Sun Microsystems' servers have been crashing for more than a year. Sun has kept the flaw secret, and hasn't yet fixed it. (Forbes, 11/13/2000)
Sun and HP Reliability Comparisons
Why HP can fulfill customer needs better than Sun
HP understands what available systems really mean. Availability is the BASE upon which all other features are built:
• Event management
• Multi-system HA
• Flexible compute management: virtual partitions, hard partitions
• High-quality / resilient hardware (hardware that keeps running). Nothing matters without this!
Reliability Comparison
[Comparison table, HP vs. Sun: CPU, memory, I/O and backplane reliability features.]
Reliability Comparison (2)
[Comparison table, HP vs. Sun: solution-level reliability features.]
HP projects that the above reliability "oversights" result in Sun systems with 2-4x greater failure rates than HP systems. This has been proven by field experience.*
* Rather than blame customers for quality problems, HP closely tracks field data and works PROACTIVELY to fix potential field quality problems.