DOE Perspective on Cyberinfrastructure - LBNL
Gary Jung, Manager, High Performance Computing Services
Lawrence Berkeley National Laboratory
Educause CCI Working Group Meeting
November 5, 2009
Midrange Computing
• DOE ASCR hosted a workshop in October 2008 to assess the role of midrange computing in the Office of Science; the workshop found that this class of computing plays an increasingly important role in enabling Office of Science research.
• Although it is not part of ASCR's mission, midrange computing and the associated data management play a vital and growing role in advancing science in disciplines where capacity is as important as capability.
• Demand for midrange computing services is:
  • growing rapidly at many sites (>30% growth annually at LBNL)
  • the direct expression of a broad scientific need
• Midrange computing is a necessary adjunct to leadership-class facilities.
Berkeley Lab Computing
• Gap between the desktop and the national centers
• Midrange Computing Working Group formed in 2001
• Cluster support program started in 2002
• Services for PI-owned clusters include: pre-purchase consulting, development of specs and the RFP, facilities planning, installation and configuration, ongoing cluster support, user services consulting, cybersecurity, and computer room colocation
• Currently 32 clusters in production, with over 1,400 nodes and 6,500 processor cores
• Funding: the institution pays for infrastructure costs and technical development; researchers pay for the cluster and the incremental cost of support
Cluster Support Phase II: Perceus Metacluster
• All clusters interconnected into a shared cluster infrastructure
• Permits sharing of resources and storage
• Global home file system
• One ‘super master’ node used to boot nodes across all clusters
  • multiple system images supported
• One master job scheduler submitting to all clusters
• Simplifies provisioning new systems and ongoing support
• Metacluster model made possible by Perceus software
  • successor to Warewulf (http://www.perceus.org)
  • can run jobs across clusters, recapturing stranded capacity (see the sketch below)
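To illustrate the metacluster idea of a single scheduler recapturing stranded capacity, here is a minimal Python sketch of routing a job to whichever member cluster has enough idle nodes. The cluster names and idle-node counts are hypothetical, and the actual Perceus and commercial-scheduler interfaces used at LBNL are not shown.

```python
# Conceptual sketch only: route a job to the member cluster with the most
# idle ("stranded") nodes that can still satisfy the request. Cluster names
# and node counts below are hypothetical illustrations.

def pick_cluster(idle_nodes, nodes_needed):
    """Return the cluster with the most idle nodes that can fit the job,
    or None if no single cluster can satisfy the request."""
    candidates = {name: idle for name, idle in idle_nodes.items()
                  if idle >= nodes_needed}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    # Hypothetical snapshot of idle nodes per member cluster.
    snapshot = {"clusterA": 12, "clusterB": 40, "clusterC": 3}
    print(pick_cluster(snapshot, nodes_needed=16))  # -> "clusterB"
```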
Laboratory-Wide Cluster - Drivers
“Computation lets us understand everything we do.” – LBNL Acting Lab Director Paul Alivisatos
• 38% of scientists depend on cluster computing for research.
• 69% of scientists are interested in cycles on a Lab-owned cluster.
  • early-career scientists are twice as likely to be ‘very interested’ as their later-career peers
• Why do scientists at LBNL need midrange computing resources?
  • ‘on ramp’ activities in preparation for running at supercomputing centers (development, debugging, benchmarking, optimization)
  • scientific inquiry not connected with ‘on ramp’ activities
Laboratory-Wide Cluster “Lawrencium”
• Overhead-funded program
  • Capital equipment dollars shifted from business computing
  • Overhead-funded staffing: 2 FTE
• In production in Fall 2008
• General-purpose Linux cluster suitable for a wide range of applications
  • 198 nodes, 1,584 cores, DDR InfiniBand interconnect
  • 40TB NFS home directory storage; 100TB Lustre parallel scratch
  • Commercial job scheduler and banking system
  • #500 on the November 2008 Top500
• Open to all LBNL PIs and collaborators on their projects
• Users are required to complete a survey when applying for accounts and later provide feedback on science results
• No user allocations at this time; this has been successful to date
Networking - LBLNet
• Peered with ESnet at 10GbE
• 10GbE at the core; moving to 10GbE to the buildings
• Goal is sustained high-speed data flows with cybersecurity
• Network-based IDS approach: traffic is innocent until proven guilty
  • Reactive firewall (see the sketch below)
  • Does not impede data flow; no stateful firewall
  • Bro cluster allows the IDS to scale to 10GbE
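A minimal Python sketch of the reactive, “innocent until proven guilty” model: traffic flows unimpeded, and a block is pushed only after the IDS flags a host. The detection handler and block command here are placeholders, not the actual Bro policies or blocking tooling used on LBLNet.

```python
# Conceptual sketch of a reactive firewall: nothing sits inline in the data
# path; a block rule is applied only after an IDS alert. The block command
# below is a harmless placeholder for the site's real mechanism.

import subprocess

def on_ids_alert(offending_ip, reason):
    """Handle a hypothetical IDS alert by pushing a block for that host."""
    print(f"IDS verdict for {offending_ip}: {reason} -- pushing block")
    # Placeholder: a real deployment would update a router ACL or similar.
    subprocess.run(["echo", "block", offending_ip], check=True)

if __name__ == "__main__":
    # Hypothetical alert, for illustration only.
    on_ids_alert("192.0.2.10", "scan detected")
```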
Communications and Governance
• General announcements at the IT Council
• Steering committees used for scientific computing
  • Small group of stakeholders, technical experts, and decision makers
  • Helps to validate and communicate decisions
  • Accountability
Challenges
• Funding (past)
  • Difficult for IT to shift funding from other areas of computing to support for science
  • Recharge can constrain adoption; full cost recovery definitely will
• New technology (ongoing)
• Facilities (current)
  • Computer room is approaching capacity despite upgrades
    • Environmental monitoring
    • Plenum in ceiling converted to a hot-air return
    • Tricks to boost underfloor pressure
    • Water-cooled doors
  • Underway
    • DCiE measurement in process (see the sketch below)
    • Tower and heat exchanger replacement
    • Data center container investigation
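For reference, DCiE (Data Center infrastructure Efficiency) is the standard ratio of IT equipment power to total facility power, expressed as a percentage; the sketch below applies that definition with purely hypothetical power figures.

```python
# DCiE = IT equipment power / total facility power, as a percentage.
# The kW figures below are hypothetical, for illustration only.

def dcie(it_power_kw, total_facility_power_kw):
    """Data Center infrastructure Efficiency as a percentage."""
    return 100.0 * it_power_kw / total_facility_power_kw

if __name__ == "__main__":
    print(f"DCiE = {dcie(it_power_kw=500, total_facility_power_kw=800):.1f}%")  # 62.5%
```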
Next Steps
• Opportunities presented by cloud computing
  • Amazon investigation earlier this year; others ongoing
  • Latency-sensitive applications ran poorly, as expected
  • Performance depends on the specific use case
  • Data migration: economics of storing vs. moving data (see the sketch below)
  • Certain LBNL factors favor the cost of building instead of buying
• Large storage and computation for data analysis
• GPU investigation
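The store-vs-move economics can be framed as a simple comparison of local storage cost against cloud transfer plus storage cost over the same period. The Python sketch below shows the shape of that calculation; all unit prices are hypothetical placeholders, not actual LBNL or 2009 cloud rates.

```python
# Toy build-vs-buy comparison for a dataset of a given size. Every unit cost
# here is a hypothetical placeholder used only to illustrate the comparison.

def local_cost(tb, local_cost_per_tb_month, months):
    """Cost of keeping the data on locally operated storage."""
    return tb * local_cost_per_tb_month * months

def cloud_cost(tb, transfer_cost_per_tb, cloud_cost_per_tb_month, months):
    """Cost of moving the data out once, then storing it in the cloud."""
    return tb * transfer_cost_per_tb + tb * cloud_cost_per_tb_month * months

if __name__ == "__main__":
    tb, months = 100, 12
    print("local:", local_cost(tb, local_cost_per_tb_month=5, months=months))
    print("cloud:", cloud_cost(tb, transfer_cost_per_tb=100,
                               cloud_cost_per_tb_month=15, months=months))
```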
Points of Collaboration
• UC Berkeley HPCC
  • Recent high-profile joint projects between UCB and LBNL encourage close collaboration
  • 25-30% of scientists have dual appointments
  • UC Berkeley’s proximity to LBNL facilitates the use of cluster services
• University of California Shared Research Computing Services pilot (SRCS)
  • LBNL and SDSC joint pilot for the ten UC campuses
  • Two 272-node clusters located at UC Berkeley and SDSC
  • Shared computing is more cost-effective
  • Dedicated CENIC L3 network connects the sites for integration
  • Pilot consists of 24 research projects