UC3 Shared Research Computing Service
Gary Jung, LBNL
UC3 Technical Architecture Team
UC Grid Summit, April 1, 2009
Background
• UC Guidance Council chartered to identify and recommend strategic directions to guide future IT investments and the academic information environment
• Sustaining and enhancing academic quality and competitiveness was a primary goal
• Recommendation to develop UC Research Cyberinfrastructure (CI) Services
  • High-performance research computing component
• UC CI Planning and Implementation Committee formed
• Proposed pilot
  • Phase 1: Deploy two moderately sized institutional clusters
  • Phase 2: Extend this infrastructure to other hosting models and refine the application support and service model. Connect interested UC campuses to the UC Grid pilot.
Phase 1: Pilot Proposals
• 32 proposals received from all campuses (except UCM), LANL, and LBNL
• 24 projects suitable for running on clusters
• Research areas represented in selected proposals include:
  • Astrophysics
  • Bioinformatics
  • Biology
  • Biophysics
  • Climate modeling
  • Computational chemistry
  • Computational methods
  • Genomics
  • Geosciences
  • Material sciences
  • Nanosciences
  • Oceanic modeling
UC3 Phase 1: Implementation
Architectural Principles
• Create a consistent user experience
  • Identical use policies, administrative practices, and help mechanisms
  • Minimize disorientation when moving between clusters
  • Not necessarily binary compatible
• Design with future requirements in mind
  • Shared filesystems, mutual disaster recovery
  • Future expansion of compute or storage, metascheduling capability, tighter integration
• Respect local practices
  • Operational practices at the two sites differ; that is acceptable as long as the differences are transparent to users
• Build a balanced system
  • Goal is a general-purpose resource suitable for broad scientific use
UC3 Phase 1: Implementation
Technical Architecture: Compute
• Two 256-node, dual-socket, quad-core Linux clusters
• Spec will call for Nehalem processors, but alternatives allowed
• RFQ will ask for pricing on multiple processor speeds so that the review team can consider price/performance trade-offs
• 16GB per node (2GB/core), with a 24GB option
• Fabric spec will be ConnectX IB or QLogic TrueScale IB, but vendors can additionally bid Myrinet 10G
• Single director-class switch to provide full bisection bandwidth
UC3 Phase 1: Implementation
Storage Architecture
• Two-tier storage solution
  • Stable, robust, enterprise NFS for the home filesystem
  • Parallel filesystem for scratch
• No use of local disks; the parallel filesystem needs to provide adequate performance to negate the need for local disk
• Will consider turnkey solutions because of minimal ongoing staffing
UC3 Phase 1: Implementation
Technical Architecture
• Home directory considerations
  • 24 PIs/projects with about 10 users per project
  • Assuming 10GB per user for home directory use
  • Assuming 1-2TB per project
  • Suggest a minimum size of 50TB for home directories and backups
    • 240 users x 10GB = 2.4TB for users
    • 1TB per project x 24 projects = 24TB
  • A low-maintenance NFS appliance is desired
• Parallel filesystem considerations
  • Lustre is the likely choice due to availability and cost
  • Terascala provides a turnkey Lustre solution, currently under evaluation at LBNL
  • Another consideration would be DIY Lustre; initial and ongoing support effort will be a factor in deciding
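A quick back-of-the-envelope check of the sizing above, written as a short script. The per-user and per-project figures are the planning assumptions from this slide; the backup/growth factor is an illustrative placeholder, not part of the spec.

```python
# Home-directory capacity estimate using the planning assumptions from this
# slide. The backup_factor is a hypothetical allowance, not from the spec.

users_per_project = 10
projects = 24
gb_per_user = 10        # assumed home directory usage per user
tb_per_project = 1      # low end of the 1-2TB per-project assumption

user_tb = projects * users_per_project * gb_per_user / 1000.0   # 2.4 TB
project_tb = projects * tb_per_project                           # 24 TB
raw_tb = user_tb + project_tb

backup_factor = 2       # hypothetical headroom for backups and growth
print(f"raw capacity:  {raw_tb:.1f} TB")
print(f"with backups:  {raw_tb * backup_factor:.1f} TB (slide suggests a 50TB minimum)")
```

Under these assumptions the raw estimate is about 26TB, which doubles to roughly 53TB once backup headroom is included, consistent with the 50TB minimum suggested above.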
UC3 Phase 1: Procurement
• High-profile opportunity for vendors
• Three major procurements: clusters, NFS storage, parallel filesystem storage
• Single procurement for each major component
• Scored and weighted evaluation criteria
• No acceptance criteria other than demonstrated compatibility/integration requirements as specified in the subcontract
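As an illustration of scored and weighted evaluation, a minimal sketch of how vendor responses might be tallied. The criteria names, weights, and scores are hypothetical examples, not taken from the actual RFQ documents.

```python
# Minimal weighted-scoring sketch for comparing vendor responses.
# Criteria, weights, and scores below are hypothetical examples only.

weights = {"price": 0.35, "performance": 0.30, "support": 0.20, "delivery": 0.15}

def weighted_score(scores):
    """scores: criterion -> raw score on a 0-10 scale."""
    return sum(weights[c] * scores.get(c, 0) for c in weights)

vendor_a = {"price": 8, "performance": 7, "support": 6, "delivery": 9}
vendor_b = {"price": 6, "performance": 9, "support": 8, "delivery": 7}

for name, scores in [("Vendor A", vendor_a), ("Vendor B", vendor_b)]:
    print(f"{name}: {weighted_score(scores):.2f}")
```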
UC3 Phase 1: Timeline
Schedule
• Feb: Develop specs for major components
• Mar: Finalize RFQs
• Apr: Issue cluster and NFS storage RFQ; vendor responses due late April
• Early May: Issue cluster and NFS storage awards
• Jun: Delivery and installation of cluster and NFS storage hardware
• Late Jun: Issue RFQ for parallel filesystem storage
• Jul: Available for early users
• Aug: Add parallel filesystem storage
UC3 Phase 1: User Services
User Experience
• Shared logins across systems
  • Agreement on a uniform UID space
  • Agreement on CentOS 5.3 operating system, OpenMPI, and the Moab scheduler
  • Still need to discuss filesystem layout
• NFS cross-mounting of home directories across the L2 network
• Need to work out help desk procedures
  • Ticket system
  • Web site
  • Documentation
  • How to get help
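One practical consequence of the uniform UID agreement is that the same usernames should resolve to the same UIDs on both clusters. A minimal sketch of such a check, assuming passwd dumps taken from each site; the input filenames are hypothetical.

```python
# Compare passwd dumps from the two sites to flag username/UID mismatches.
# Filenames are hypothetical; each file is a copy of /etc/passwd (or the
# output of "getent passwd") from one cluster.

def load_uids(path):
    uids = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(":")
            if len(fields) >= 3:
                uids[fields[0]] = fields[2]   # username -> uid
    return uids

site_a = load_uids("passwd.site_a")
site_b = load_uids("passwd.site_b")

for user in sorted(set(site_a) & set(site_b)):
    if site_a[user] != site_b[user]:
        print(f"UID mismatch for {user}: {site_a[user]} vs {site_b[user]}")
```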
UC3 Phase 1: Governance
• An oversight board consisting of stakeholders is to be established
• Scope will include
  • Policy and definition of metrics
  • Compute and storage allocations
  • System configuration details (e.g., scheduler priority)
• What happens after the two-year pilot?
  • Who gets access? Additional users
  • Strategy to make the service sustainable
  • Condo clusters
UC3 Phase 1: Open Issues
UC Grid Technical Issues
• How would we configure UC Grid for UC3?
  • OTP support
  • Moab scheduler support
  • Integration into the Gold banking system for allocations
Other Issues
• How might we implement a UC-wide distributed user services team?
• How do we build the customer relationships?