The Challenge of Scale Dan Reed Dan_Reed@unc.edu Chancellor’s Eminent Professor Vice Chancellor for IT and CIO, University of North Carolina at Chapel Hill Director, Renaissance Computing Institute (RENCI): Duke University, North Carolina State University, University of North Carolina at Chapel Hill Renaissance Computing Institute
On Being the Right Size The most obvious differences between different animals are differences of size, but for some reason the zoologists have paid singularly little attention to them. … But yet it is easy to show that a hare could not be as large as a hippopotamus, or a whale as small as a herring. For every type of animal there is a most convenient size, and a large change in size inevitably carries with it a change of form. J. B. S. Haldane Renaissance Computing Institute
You Might Be A Big System Geek If … • You think a $2M cluster • is a nice, single user development platform • You need binoculars • to see the other end of your machine room • You order storage systems • and analysts issue “buy” orders for disk stocks • You measure system network connectivity • in hundreds of kilometers of cable/fiber • You dream about cooling systems • and wonder when fluorinert will make a comeback • You telephone the local nuclear power plant • before you boot your system Renaissance Computing Institute
How Big Is Big? • Every 10X brings new challenges • 64 processors was once considered large • it hasn’t been “large” for quite a while • 1024 processors is today’s “medium” size • 2048-8192 processors is today’s “large” • we’re struggling even here • 100K processor systems • are under construction • we have fundamental challenges … • … and no integrated research program Norman et al Renaissance Computing Institute
Large Systems Renaissance Computing Institute
Petascale Systems In 2004 • Component count for 1 PF peak • 200,000 5 GF processors • assuming 4-way nodes • 50,000 NICs and associated switches • enough fiber to wire a small country • (optional) 50,000 local scratch disks • assuming 1 GB/processor • 200 TB of DRAM • they’re called jellybeans for a reason! • Other sundry (but stay tuned for alternatives) • one unused football field (either flavor) • conveniently located electrical generating station • >20 MW (w/o cooling) at ~100 watts/processor • Power issues • 20 MW at $0.05/kWh is $1000/hour or $8.7M/year IBM Blue Gene/L Renaissance Computing Institute
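The power figures above are straightforward arithmetic; a minimal sketch using the slide's numbers (200,000 processors at ~100 W each, electricity at $0.05/kWh):

```python
# Back-of-the-envelope power cost for the 2004 petascale sketch.
processors = 200_000
watts_per_processor = 100
rate_per_kwh = 0.05

power_mw = processors * watts_per_processor / 1e6    # 20 MW (without cooling)
cost_per_hour = (power_mw * 1000) * rate_per_kwh     # 20,000 kW * $0.05 = $1,000/hour
cost_per_year = cost_per_hour * 24 * 365             # the slide rounds this to ~$8.7M/year

print(f"{power_mw:.0f} MW, ${cost_per_hour:,.0f}/hour, ${cost_per_year/1e6:.2f}M/year")
```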
Petascale Systems In 2008 • Technology trends • multicore processors • IBM Power4 and SUN UltraSPARC IV • Itanium “Montecito” in 2005 • quad-core and beyond are coming • reduced power consumption • laptop and mobile market drivers • increased I/O and memory interconnect integration • PCI Express, Infiniband, … • Let’s look forward a few years to 2008 • 8-way or 16-way chips (8 or 16 cores/chip) • ~10 GF cores and 4-way nodes (four 8-way chips/node) • 12x Infiniband-like interconnect • With 10 GF processors • 100K processors and ~3100 nodes (4 chips/node, 8 cores/chip) • 1-3 MW of power, at a minimum Renaissance Computing Institute
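The 2008 node count is the same kind of back-of-the-envelope exercise; a minimal sketch assuming the slide's 10 GF cores, 8 cores per chip, and 4 chips per node:

```python
# Rough sizing for the 2008 projection on the slide.
target_pf = 1.0            # 1 PF peak
gf_per_core = 10
cores_per_chip = 8
chips_per_node = 4

cores_needed = target_pf * 1e6 / gf_per_core                      # 100,000 cores ("processors")
nodes_needed = cores_needed / (cores_per_chip * chips_per_node)   # ~3,125 nodes

print(f"{cores_needed:,.0f} cores across {nodes_needed:,.0f} nodes")
```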
Power Consumption “The power problem is the No. 1 issue in the long-term for computing. It's time for us to stop making 6-mile-per-gallon gas guzzlers.” Greg Papadopoulos SUN Chief Technology Officer “This is a seminal shift in the industry. If all you do is make chips smaller, you will hit the power crisis. I'm optimistic that it is an opportunity to do more holistic design that takes into account everything . . . not just performance.” Bernie Meyerson IBM Chief Technology Officer “What matters most to the computer designers at Google is not speed, but power – low power because data centers can consume as much power as a city.” Eric Schmidt Google CEO Renaissance Computing Institute
Power Consumption • Power has many implications • cost, for sure • but also physical system size and reliability • Blue Gene/L uses low power processors • that’s no accident • Moore’s law isn’t a birthright • CMOS scaling issues are now a challenge • power, junction size, fab line costs, … • Scaling also affects power and RAS • at ~50 nm feature size • static power (leakage) is comparable to dynamic (switching) power • leakage increases dramatically with operating temperature • SRAM soft error rate (SER) increased 30X (Intel) • when moving from 0.25 to 0.18 micron geometry and from 2 to 1.6V • ECC does not catch all errors (Compaq) • perhaps 10 percent uncaught • worse, cheap systems have no ECC Renaissance Computing Institute
Node Failure Challenges • Domain decomposition • spreads vital data across all nodes • each spatial cell exists in one memory • except possible ghost or halo cells • Single node failure • causes blockage of the overall simulation • data is lost and must be recovered • “Bathtub” failure model operating regimes • infant mortality • normal mode • late failure mode • Simple checkpointing helps; the optimum interval is roughly τ ≈ √(2C(M + R)), where C is the time to complete a checkpoint, M is the time before failure, and R is the restart time due to lost work [Figure: bathtub curve of failure rate versus elapsed time, spanning burn-in, normal aging, and late-failure regimes] Renaissance Computing Institute
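A minimal sketch of that rule of thumb; the √(2C(M+R)) form is the standard first-order Young/Daly approximation, and the numbers plugged in below are invented for illustration:

```python
from math import sqrt

def optimal_checkpoint_interval(checkpoint_time, mttf, restart_time):
    """First-order Young/Daly-style estimate of the checkpoint interval that
    roughly balances checkpoint overhead against expected lost work."""
    return sqrt(2.0 * checkpoint_time * (mttf + restart_time))

# Illustrative numbers only: a 5-minute checkpoint, a 6-hour system MTTF,
# and a 10-minute restart, all expressed in hours.
C, M, R = 5 / 60, 6.0, 10 / 60
print(f"checkpoint every {optimal_checkpoint_interval(C, M, R):.2f} hours")
```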
ASCI Q Petascale Reliability • Facing the issues • ASCI Q boot time is ~8 hours • not far from the system MTTF • application checkpoint frequency • A few assumptions • assume independent component failures • an optimistic and not realistic assumption • N is the number of processors • r is the probability a component operates for 1 hour • R is the probability the system operates for 1 hour • Then R = r^N and MTTF ≈ 1/(1 − R), or MTTF ≈ 1/(N(1 − r)) for large N [Figure: 1-hour reliability and MTTF (hours) versus system size] Renaissance Computing Institute
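The scaling behind that plot is easy to reproduce; a sketch assuming an illustrative per-component one-hour reliability of 0.99999:

```python
# How one-hour system reliability and MTTF fall off with component count,
# assuming independent failures (r = per-component one-hour reliability).
def system_mttf_hours(n_components, r_component=0.99999):
    R = r_component ** n_components     # probability the whole system survives 1 hour
    return R, 1.0 / (1.0 - R)           # MTTF in hours; ~1/(N(1-r)) for large N

for n in (1_000, 10_000, 100_000):
    R, mttf = system_mttf_hours(n)
    print(f"N={n:>7,}: 1-hour reliability {R:.3f}, MTTF ~{mttf:.1f} hours")
```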
ASCI White Availability (LLNL) Hardware failures dominate Source: Mark Seager Renaissance Computing Institute
Experimental Fault Assessment • Memory • random single bit flips • text, data, heap and stack • regular and floating-point registers • Message passing • random bit flips (MPI level) • payload corruption • single bit flip or multiple bit flips (burst error) • Failure modes • application crash • MPI error detected via MPI error handler • application-detected error via assertion checks • other (e.g., segmentation fault) • application hang (no termination) • application execution completion • correct or incorrect output (fault not manifest) Renaissance Computing Institute
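The memory and message experiments come down to flipping bits in otherwise valid data. A toy sketch of the message-payload case (the function and parameters are illustrative, not the instrumentation used in the SC2004 study):

```python
import random

def inject_bit_flips(payload: bytes, n_flips: int = 1) -> bytes:
    """Return a copy of an MPI-style message payload with n random bit flips
    (n_flips=1 models a single-event upset, larger values a burst error)."""
    corrupted = bytearray(payload)
    for _ in range(n_flips):
        bit = random.randrange(len(corrupted) * 8)
        corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)

original = b"\x00" * 16
print(inject_bit_flips(original).hex())              # single bit flip
print(inject_bit_flips(original, n_flips=4).hex())   # burst-style corruption
```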
Fault Code Suite Characteristics See C-D. Lu and D. A. Reed, “Assessing Fault Sensitivity in MPI Applications,” SC2004, to appear Source: Charng-da Lu Renaissance Computing Institute
NAMD: Applications Can Help • Output is considered correct if each number matches the baseline run to 3 decimal places • NAMD has assertion and range checks • on certain messages and values (e.g., molecular velocities) Source: Charng-da Lu Renaissance Computing Institute
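That correctness criterion is simple to state as code; a small sketch (the value lists are hypothetical):

```python
def outputs_match(run_values, baseline_values, decimals=3):
    """Correctness criterion from the slide: every number must agree with the
    baseline run when rounded to `decimals` places."""
    return len(run_values) == len(baseline_values) and all(
        round(a, decimals) == round(b, decimals)
        for a, b in zip(run_values, baseline_values)
    )

print(outputs_match([1.2344, -0.00021], [1.2341, -0.00018]))  # True: agree to 3 places
print(outputs_match([1.2351], [1.2340]))                      # False: differ in the 3rd place
```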
NAMD Working Sets Renaissance Computing Institute
CAM: Or Maybe Not • CAM from CCSM • the Community Atmosphere Model Source: Charng-da Lu Renaissance Computing Institute
What is a Petascale System? • Embrace failure, complexity, and scale • a mindset change Renaissance Computing Institute
Autonomic Behavior • Learn from biology • a cold is not fatal • systems build resistance • Petascale implications • monitor internal state • even if substantial resources are required • respond to warnings, not just failures • develop adaptation strategies Renaissance Computing Institute
ACPI Power Control • ACPI • Advanced Configuration and Power Interface • HP, Intel, Microsoft, Phoenix, Toshiba, … • OS management of system power consumption • originally targeted at the laptop/mobile device market • ACPI defines the following • hardware registers on chip • BIOS interfaces • Thermal failure is a big issue for HPC systems • monitor and react in many ways • processor clock speed based on code • disk spin-down to conserve power/reduce heat Renaissance Computing Institute
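On Linux, ACPI thermal zones are exposed through sysfs, which makes the kind of monitor-and-react loop above easy to prototype. A minimal sketch (the 70 C threshold and the throttling decision are purely illustrative):

```python
from pathlib import Path

def read_thermal_zones():
    """Read ACPI thermal zone temperatures exposed by the Linux kernel
    under /sys/class/thermal (values are reported in millidegrees C)."""
    readings = {}
    for zone in sorted(Path("/sys/class/thermal").glob("thermal_zone*")):
        try:
            readings[zone.name] = int((zone / "temp").read_text()) / 1000.0
        except (OSError, ValueError):
            pass  # zone without a readable temp file
    return readings

# Illustrative policy only: flag any zone running hotter than 70 C.
for name, celsius in read_thermal_zones().items():
    status = "THROTTLE?" if celsius > 70.0 else "ok"
    print(f"{name}: {celsius:.1f} C {status}")
```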
SMART Disks • SMART • Self Monitoring, Analysis and Reporting Technology • on-disk monitoring and data analysis • ATA/IDE and SCSI support • Typical SMART capabilities • head flying height, data throughput, spin up time • reallocated sector count, seek error rate, seek time performance • spin retry count, drive calibration retry count, temperature • Drive spin up time (for example) • indicative of motor or bearing failure • By monitoring, one can identify • performance problems • failure probability Renaissance Computing Institute
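A common way to reach those SMART attributes is smartmontools; a sketch that shells out to smartctl and pulls the drive temperature (the device path and the column parsing assume a typical ATA attribute table):

```python
import subprocess

def smart_temperature(device="/dev/sda"):
    """Parse the temperature attribute from `smartctl -A` output
    (requires smartmontools and typically root privileges)."""
    try:
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
    except FileNotFoundError:
        return None  # smartctl not installed
    for line in out.splitlines():
        if "Temperature_Celsius" in line:
            return int(line.split()[9])  # RAW_VALUE column of the ATA attribute table
    return None

temp = smart_temperature()
print(f"drive temperature: {temp} C" if temp is not None else "no SMART temperature found")
```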
Power Consumption and Failures [Figure: Autopilot-style contract monitor: system sensors feed a fuzzifier, a fuzzy logic decision process with cluster centroids and tolerance rules, and a defuzzifier that drives actuators on the monitored tasks] • Intelligent monitoring • Autopilot tagged sensors • SMART (disk temperature) • ACPI (thermal zone) • active cooling policy and throttling • lm_tools • CPU and board temperature • accessible via Autopilot Manager • statistical sampling for monitoring • failure prediction based on history • Failure model applications • adaptive checkpointing • batch queue selection Renaissance Computing Institute
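The fuzzifier / rules / defuzzifier loop in that diagram can be sketched in a few lines; the membership functions, the single rule, and the throttle output below are a toy illustration, not the Autopilot implementation:

```python
def membership_hot(temp_c, lo=60.0, hi=80.0):
    """Fuzzifier: degree to which a temperature is 'hot' (0..1, linear ramp from lo to hi)."""
    return min(1.0, max(0.0, (temp_c - lo) / (hi - lo)))

def decide_throttle(cpu_temp_c, disk_temp_c):
    """One fuzzy rule: throttle in proportion to how 'hot' the hottest sensor is.
    The defuzzifier maps the rule strength to a crisp clock reduction (0-50%)."""
    hot = max(membership_hot(cpu_temp_c), membership_hot(disk_temp_c, lo=45.0, hi=60.0))
    return 0.5 * hot  # crisp actuator setting: fraction of clock speed to shed

print(f"throttle by {decide_throttle(cpu_temp_c=74.0, disk_temp_c=50.0):.0%}")
```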
Linpack Temperature Measurements • Configuration • x86 Linux cluster • Myrinet interconnect • Measurements • microprocessor (shown at left) • motherboard measured at six locations (next slide) [Figure: processor temperature in Celsius over the run, with the point where the computation terminated marked] Renaissance Computing Institute
Linpack Temperature Measurements • Why do thermal dynamics matter? • reliability and fault management • power and economics • Arrhenius equation temperature implications • mean time to catastrophic failure of commercial silicon • halves (2X) for every 10 C above 70 C [Figure: motherboard temperatures in Celsius over the run, with the point where the computation terminated marked] Renaissance Computing Institute
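That 2X-per-10 C rule of thumb translates directly into a derating calculation; the baseline MTTF below is purely illustrative:

```python
def thermal_mttf(baseline_mttf_hours, temp_c, reference_c=70.0, doubling_c=10.0):
    """Arrhenius-style rule of thumb from the slide: time to catastrophic
    failure halves for every 10 C above 70 C (and improves below it)."""
    return baseline_mttf_hours / (2.0 ** ((temp_c - reference_c) / doubling_c))

baseline = 1_000_000  # hypothetical component MTTF at 70 C, in hours
for t in (70, 80, 90, 100):
    print(f"{t} C: ~{thermal_mttf(baseline, t):,.0f} hours")
```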
Failures and Autonomic Recovery • 10^6 hours for component MTTF • sounds like a lot until you divide by 10^5! • It’s time to take RAS seriously • systems do provide warnings • soft bit errors – ECC memory recovery • disk read/write retries, packet loss and retransmission • status and health provide guidance • node temperature/fan duty cycles • Software and algorithmic responses • diagnostic-mediated checkpointing • algorithm-based fault tolerance • domain-specific fault tolerance • loosely synchronous algorithms • optimal system size for minimum execution time [Figure: LANL 10 TF Pink node temperature] Renaissance Computing Institute
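A hedged sketch of the last bullet, "optimal system size for minimum execution time," using a standard first-order checkpoint/restart model; the per-node MTTF, checkpoint and restart costs, and workload are all made-up numbers, and the model is the usual Young/Daly-style approximation rather than anything taken from the slide:

```python
from math import sqrt

def expected_runtime_hours(n_nodes, work_node_hours=500_000.0,
                           node_mttf=1e5, ckpt=0.1, restart=0.2):
    """First-order model: ideal time W/N, system MTTF m/N, checkpoint interval
    sqrt(2*C*M); waste per hour = checkpointing plus expected failure recovery."""
    ideal = work_node_hours / n_nodes
    M = node_mttf / n_nodes                      # system MTTF shrinks with size
    tau = sqrt(2.0 * ckpt * M)                   # near-optimal checkpoint interval
    waste_per_hour = ckpt / tau + (restart + tau / 2.0) / M
    if waste_per_hour >= 1.0:
        return float("inf")                      # failures arrive faster than progress
    return ideal / (1.0 - waste_per_hour)

best = min(range(1_000, 100_001, 1_000), key=expected_runtime_hours)
print(f"minimum expected runtime at ~{best:,} nodes: "
      f"{expected_runtime_hours(best):.1f} hours")
```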
Software Evolution and Faults • Cost dynamics • people costs are rising • hardware costs are falling • Two divergent software world views • parallel systems • life is good – deus ex machina • Internet • we’ll all die horribly – trust no one • What does this mean for software? • abandon the pre-industrial “craftsman model” • adopt an “automated evolution” model Renaissance Computing Institute
Evolutionary Software: A Revolution? [Figure: word cloud of related fields: artificial life, chaos theory, genetic algorithms, dynamical systems, neural networks, decentralized control, genetic programming] • Learn some biological lessons • environmental adaptation • homeostatic behavior and immunity • social structures and specialization • ants, termites, … • Evolve software components • adaptive software agents • interacting building blocks • challenges • define basic building blocks • specify evolutionary rules Renaissance Computing Institute
One Possible Model [Figure: block diagram in which software building blocks and candidate libraries feed a genetic programming engine; fault injection driven by fault models, system execution, and performance measurement produce performance data and failure indicators; a fuzzy logic assessment (fuzzifier, fault rules and fuzzy sets, defuzzifier) with fitness functions evaluates system behavior, and a fault monitor drives software controls] Renaissance Computing Institute
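As a toy illustration of the "genetic programming engine" box, here is a genetic-algorithm-style loop (not true genetic programming) that evolves a single tuning knob, a checkpoint interval, against a fitness function; every name and parameter is invented for illustration:

```python
import random

def fitness(interval_hours, mttf=6.0, ckpt_cost=0.1):
    """Toy fitness: negative of the time wasted per hour on checkpoints plus
    expected lost work per failure (less waste means a fitter individual)."""
    waste = ckpt_cost / interval_hours + (interval_hours / 2.0) / mttf
    return -waste

def evolve(generations=50, pop_size=20):
    population = [random.uniform(0.1, 5.0) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]           # selection
        children = [max(0.05, p + random.gauss(0, 0.1))   # mutation
                    for p in survivors]
        population = survivors + children
    return max(population, key=fitness)

print(f"evolved checkpoint interval: {evolve():.2f} hours")
```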
Renaissance Computing Institute • Vision • a multidisciplinary institute • academe, commerce and society • broad in scope and participation • from art to zoology • Objectives • enrich and empower human potential • communities at all levels • create multidisciplinary partnerships • science, engineering and computing • commerce, humanities and the arts • develop and deploy leading infrastructure • driven by collaborative opportunities • computing, communications and data management • visualization, collaboration and manufacturing • enable and sustain economic development Renaissance Computing Institute