NCAR’s Response to upcoming OCI Solicitations Richard Loft SCD Deputy Director for R&D
Outline • NSF Cyberinfrastructure Strategy (Track-1 & Track-2) • NCAR generic strategy for NSFXX-625’s (Track-2) • NCAR response to NSF05-625 • NSF Petascale Initiative Strategy • NCAR response to NSF Petascale Initiative
NSF’s Cyberinfrastructure Strategy • NSF’s HPC acquisition strategy through FY10 defines three tracks: • Track 1: High-end, O(1 PFLOPS sustained) • Track 2: Mid-level systems, O(100 TFLOPS) - the NSFXX-625 series • First instance (NSF05-625) submitted Feb 10, 2006 • Next instances due: • November 30, 2006 • November 30, 2007 • November 30, 2008 • Track 3: Typical university HPC, O(1-10 TFLOPS) • The purpose of the Track-1 system will be to achieve revolutionary advances and breakthroughs in science and engineering.
Solicitation NSF05-625:Towards a Petascale Computing Environment for Science and Engineering • Award: September 2006 • System in production by May 31, 2007 • $30,000,000 or $15,000,000. • Operating costs funded under separate action. • RP serves the broad science community - open access. • Allocations by LRAC/MRAC or “their successors” • Two 10 Gb/s TeraGrid links
NCAR’s Overall NSFXX-625 Strategy • Leverage NCAR/SCD expertise in production HPC. • Get a production system: • No white-box Linux solutions. • Stay on the path to usable petascale systems. • NCAR is a TeraGrid outsider and must address two areas: • Leverage its experience with general scientific users. • Build the grid consulting experience it currently lacks. • Emphasize, but don’t overemphasize, geosciences. • In proposing, NCAR has a facility problem: • Minimize costs - power, administrative staff, level of support. • Creative plan for remote user support and education.
NSF05-625 Partners • Facility Partner • End-to-End System Supplier • User Support Network - • NCAR Consulting Service Group • University partners
NSF05-625 Facility Partner • The NCAR Mesa Lab (ML) facility is FULL after ICESS. • Key points: • A new datacenter is needed whether or not NCAR wins the NSF05-625 solicitation. • Because of the short timeline, a new datacenter never factors into the strategy for NSFXX-625. • Identified a colocation facility. • Facility features: • Local (Denver-Boulder area) • State-of-the-art, high-availability center • Currently 4 x 2 MW generators of power available • Familiar with large-scale deployments • Dark fiber readily available (good connectivity)
NSF05-625 Supercomputer System Details • Two systems: capability + capacity • ~80 Tflops combined • Robotic tape storage system ~12PB
NCAR NSF05-625 User Support Plan • Largest potential differentiator in the proposal - let’s do something unique! • The system will be used by generic scientists, so the support plan must: • Be extensible to domains other than geoscience • Address grid user support • Strategy leverages the OSCER-led IGERT proposal: • Combine teaching of computational science with user support • Embed application support expertise in key institutions • Build education and training materials through university partnerships.
Track-1 System Background • Source of funds: Presidential Innovation Initiative announced in the SOTU. • Performance goal: 1 PFLOPS sustained on “interesting problems”. • Science goal: breakthroughs. • Use model: 12 research teams per year using the whole system for days or weeks at a time. • Capability system - large everything & fault tolerant. • Single system in one location. • Not a requirement that the machine be upgradable.
Track-1 Project Parameters • Funds: $200M over 4 years, starting FY07 • Single award • Money is for the end-to-end system (as in 625) • Not intended to fund a facility. • Release of funds tied to meeting hardware and software milestones. • Deployment stages: • Simulator • Prototype • Petascale system operates FY10-FY15 • Operations for FY10-15 funded separately.
Two Stage Award Process Timeline • Solicitation out: May, 2006 (???) • [ HPCS down-select: June, 2006 ] • Preliminary Proposal due: August, 2006 • Down selection (invitation to 3-4 to write Full Proposal) • Full Proposal due: January, 2007 • Site visits: Spring, 2007 • Award: Sep, 2007
NSF’s view of the problem • NSF recognizes the facility (power, cooling, space) challenge of this system. • Therefore NSF welcomes collaborative approaches: • University & Federal Lab • University & commercial data center • University & State Government • University consortium • NSF recognizes that applications will need significant modification to run on this system. • User support plan • Expects proposer to discuss needs in this area with experts in key applications areas.
The Cards in NCAR’s Hand • NCAR … • Is a leader in making the case that geoscience grand challenge problems need petascale computing. • Has many grand challenge problems to offer itself. • Has experience at large processor counts. • Has recently connected to the TeraGrid, and is moving towards becoming a full-fledged Resource Provider.
NCAR Response Options • Do Nothing • Focus on Petascale Geoscience Applications • Partner with a lead institution or consortium • Lead a Tier-1 proposal
The Relationship Between OCI’s Roadmap and NCAR’s Datacenter project Richard Loft SCD Deputy Director for R&D
Projected CCSM Computing Requirements Exceed Moore’s Law • Credit: Jeff Kiehl/Bill Collins
NSF’s Cyberinfrastructure Strategy • NSF’s HPC acquisition strategy through FY10 defines three tracks: • Track 1: High-end, O(1 PFLOPS sustained) • Track 2: Mid-level systems, O(100 TFLOPS) - the NSFXX-625 series • First instance (NSF05-625) submitted Feb 10, 2006 • Next instances due: • November 30, 2006 • November 30, 2007 • November 30, 2008 • Track 3: Typical university HPC, O(1-10 TFLOPS) • The purpose of the Track-1 system will be to achieve revolutionary advances and breakthroughs in science and engineering.
NCAR strategic goals: • NCAR will stay in the top echelon of geoscience computing centers. • NCAR’s immediate strategic goal is to be a Track-2 center. • To do this, NCAR must be integrated with NSF’s cyberinfrastructure plans. • This means both connecting and ultimately operating within the Teragrid framework. • The Teragrid is evolving, so this is a moving target.
NCAR New Facility • The NCAR Mesa Lab (ML) facility is FULL after ICESS. • Key points: • A new datacenter is needed whether or not NCAR wins the NSF05-625 solicitation. • Because of the short timeline, a new datacenter never factors into the strategy for NSFXX-625. • Right now, the current facility cannot absorb even a modest budget augmentation for computing.
Mesa Lab is full after the ICESS procurement • ICESS = Integrated Computing Environment for Scientific Simulation • We’re sitting at 980 kW right now. • Deinstall of bluesky will give us back 450 kW. • This leaves about 600 kW of headroom. • The ICESS procurement is expected to deliver a system with a maximum power requirement of 500-600 kW. • That is not enough to house $15M-$30M of equipment from NSF05-625, for example.
We’re fast running out of power… Max power at the Mesa Lab is 1.2 MW!
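The numbers on the two slides above reduce to simple arithmetic. A minimal back-of-the-envelope sketch, using the kW figures quoted there (the ICESS draw is taken at its quoted upper bound; the raw headroom works out a little above the slide’s rounded ~600 kW figure):

```python
# Back-of-the-envelope Mesa Lab power budget, using the figures from the slides above.
MESA_LAB_MAX_KW   = 1200  # facility ceiling: 1.2 MW
CURRENT_LOAD_KW   = 980   # load today
BLUESKY_RETURN_KW = 450   # recovered when bluesky is deinstalled
ICESS_MAX_DRAW_KW = 600   # upper bound of the expected ICESS system draw (500-600 kW)

load_after_deinstall = CURRENT_LOAD_KW - BLUESKY_RETURN_KW   # 530 kW
headroom_kw = MESA_LAB_MAX_KW - load_after_deinstall         # 670 kW (the slide rounds this to ~600)
remaining_after_icess = headroom_kw - ICESS_MAX_DRAW_KW      # ~70 kW at the high end of the ICESS draw

print(f"Load after bluesky deinstall: {load_after_deinstall} kW")
print(f"Headroom before ICESS:        {headroom_kw} kW")
print(f"Headroom after ICESS:         {remaining_after_icess} kW")
```

Even at the optimistic end, only roughly 70-170 kW would remain for anything new, which is why a $15M-$30M NSF05-625 system cannot be housed at the Mesa Lab.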
Preparing for the Petascale Richard Loft SCD Deputy Director for R&D
What to expect in HEC? • Much more parallelism. • A good deal of uncertainty regarding node architectures. • Many threads per node. • Continued ubiquity of Linux/Intel systems. • There will be vector systems. • Emergence of exotic architectures. • The largest (petascale) systems are likely to have special features: • Power-aware design (small memory?) • Fault-tolerant design features • Lightweight compute-node kernels • Custom networks
HEC in 2010 • Based on history, we should expect 4K-8K CPU systems to be commonplace by the end of the decade. • The largest systems on the Top500 list should be 1-10 PFLOPS. • Parallelism in the largest systems - an estimate for 2010 (see the sketch below): • Assuming a 5 GHz clock, a dual-FMA CPU delivers 20 GFLOPS peak. • 1 PFLOPS peak = 50K CPUs. • 10 PFLOPS peak = 500K CPUs. • Large vector systems (if they exist) will still be highly parallel. • To justify using the largest systems, applications must use a sizable fraction of the resource.
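The CPU-count estimate above is straightforward arithmetic. A minimal sketch, assuming the slide’s 5 GHz clock and a dual-FMA CPU (4 floating-point operations per cycle):

```python
# Parallelism estimate for 2010-era systems, using the assumptions on this slide.
CLOCK_GHZ       = 5.0  # assumed clock speed
FLOPS_PER_CYCLE = 4    # dual FMA units: 2 FMAs x 2 floating-point ops each

peak_per_cpu_gflops = CLOCK_GHZ * FLOPS_PER_CYCLE  # 20 GFLOPS peak per CPU

for target_pflops in (1, 10):
    gflops_needed = target_pflops * 1.0e6          # 1 PFLOPS = 1e6 GFLOPS
    cpus = gflops_needed / peak_per_cpu_gflops
    print(f"{target_pflops:>2} PFLOPS peak ~ {cpus:,.0f} CPUs")
# -> 1 PFLOPS ~ 50,000 CPUs; 10 PFLOPS ~ 500,000 CPUs
```

The same arithmetic is why “a sizable fraction of the resource” means decomposing a single job across tens to hundreds of thousands of CPUs.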
Range of Plausible Architectures: 2010 • Power issues will slow the rate of increase in clock frequency. • This will drive the trend toward massive parallelism. • All scalar systems will have multiple CPUs per socket (chip). • Currently 2 CPUs per socket; by 2008, 4 CPUs per socket will be commonplace. • 2010 scalar architectures will likely continue this trend; 8 CPUs per socket are possible - the Cell chip already has 8 synergistic processors. • The key unknown is which cluster-on-a-chip architecture will be most effective. • Vector systems will be around, but at what price? • Wildcards: • Impact of the DARPA HPCS program • Exotics: FPGAs, PIMs, GPUs.
How to make science staff aware of coming changes? • NCAR must develop a science-driven plan for exploiting petascale systems at the end of the decade. • Briefed the NCAR Director, DD, and the CISL and ESSL Directors. • Meetings (SEWG at CCSM Breckenridge). • Organizing NSF workshops on petascale geoscience benchmarking, scheduled in DC (June 1-2) and at NCAR (TBD). • Have initiated internal petascale discussions: • CGD-SCD joint meetings • Peta_ccsm mail list • Peta_ccsm Swiki site • Through activities like these, NSA should take a leadership role.
What must be done to secure resources to improve scalability? • We must help ourselves. • Invest judiciously in computational science where possible. • Leverage application development partnerships (SciDAC, etc.). • Write proposals. • Support for applications development for the Track-1 system can be built into an NCAR partnership deal. • NSF has indicated an independent funding track for applications; NCAR should aggressively pursue those funding sources. • New ideas can help - e.g., POP (next slides).
POP Space-Filling Curves: partition for 8 processors • Credit: John Dennis, SCD
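The idea behind the partition in the figure: order the model’s ocean points along a space-filling curve, drop land points (which do no work in POP), and cut the curve into equal-length pieces so each processor gets a balanced, spatially compact set of points. Below is a minimal illustrative sketch on a toy 8x8 grid with an invented land/ocean mask, using a Morton (Z-order) curve rather than the curve family used in POP itself:

```python
import numpy as np

def morton_index(x, y, bits=3):
    """Interleave the bits of x and y to get the Z-order (Morton) index."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)      # x bits go in the even positions
        idx |= ((y >> b) & 1) << (2 * b + 1)  # y bits go in the odd positions
    return idx

N, NPROCS = 8, 8
rng = np.random.default_rng(0)
mask = rng.random((N, N)) > 0.3               # toy mask: True = ocean, False = land

# Order only the ocean cells along the space-filling curve (land cells are dropped).
ocean = sorted(((x, y) for x in range(N) for y in range(N) if mask[y, x]),
               key=lambda cell: morton_index(*cell))

# Cut the curve into NPROCS nearly equal pieces: balanced work, spatially compact pieces.
owner = np.full((N, N), -1)
for rank in range(NPROCS):
    lo = rank * len(ocean) // NPROCS
    hi = (rank + 1) * len(ocean) // NPROCS
    for x, y in ocean[lo:hi]:
        owner[y, x] = rank

print(owner)   # -1 marks land; ocean cells are labeled with their owning processor
```

Because land cells never enter the list, no processor carries idle points, and points that are contiguous along the curve tend to be spatially clustered, which keeps halo exchanges local.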
POP 1/10-Degree Performance on BG/L: SFC Improvement
Top500 Processor Types: Intel Taking Over • Today Intel is inside 2/3 of the Top500 machines.
The commodity onslaught … • The Linux/Intel cluster is taking over the Top500. • Linux has not yet penetrated the major weather, ocean, and climate centers, for several reasons: • System maturity (SCD experience) • Scalability of the dominant commodity interconnects • Combinatorics (Linux flavor, processor, interconnect, compiler) • But it affects NCAR indirectly because… • Ubiquity = opportunity. • Universities are deploying them. • NCAR must rethink the services it provides to the universities. • Puts strain on all community software development activities.