CSG Research Computing
Jim Pepin
USC CTO/Director HPCC
HPCC
• Provide common facilities and services for a large cross section of the university that requires leading-edge computational and networking resources.
• Leverage USC central resources with externally funded projects.
Overview
• Sponsored by ISD (Information Services Division of USC) and ISI (Information Sciences Institute)
• User community
  • ISI
  • LAS
  • Engineering
  • School of Medicine
  • IMSC
  • ICT
  • Others
Current Resources
• High Performance Computing Resources
  • Linux Cluster (~1,000 nodes / 2,000 CPUs, 2Gb/sec Myrinet)
    • 20TB shared disk; 18–40GB local disk per node.
    • Ranks in the top 10 among academic clusters.
    • The Myrinet switch supports 768 nodes.
    • Adding nodes funded by USC research groups.
  • Sun Core Servers (E15k shared memory)
    • 72 processors, 288GB memory, 30TB shared disk.
• Mass Storage Facilities (Unitree)
  • 18,000-tape capacity.
Funding Sources
• ISD (University) resources
  • $1.5M M/S and equipment budget
    • Software/maintenance: $0.4M
    • Generic capital: $1.0M
    • Other: $0.1M
  • 3 FTEs direct support
  • 2 FTEs system staff offset
• Los Nettos/LAAP
  • $2.0M
• Condo arrangements
  • $50K–$250K one-off capital purchases
Cluster Power Usage Math
• 42 nodes/cabinet.
• 200 watts/node.
• 8.4kW/cabinet.
• 1,000 nodes = 24 compute cabinets.
• 1 control cabinet per 3 compute cabinets = 8 control cabinets.
• 32 cabinets per 1,000 nodes.
• 268kW per 1,000 nodes.
• 100 tons of A/C per 1,000 nodes.
• Roughly 400kW total power use for 1,000 nodes.
• 1,500–2,000 sq. ft. of space.
• A worked version of this arithmetic follows below.
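For reference, the slide's arithmetic as a runnable sketch in Python. The node, cabinet, and wattage figures come from the slide; the 1-per-3 control cabinet ratio is inferred from the slide's totals, and the 3.517kW-per-ton figure is the standard cooling conversion.

```python
import math

# Back-of-the-envelope version of the cluster power math above.
NODES = 1000
NODES_PER_CABINET = 42
WATTS_PER_NODE = 200
COMPUTE_CABS_PER_CONTROL = 3   # inferred: 24 compute cabinets -> 8 control cabinets

compute_cabinets = math.ceil(NODES / NODES_PER_CABINET)                     # 24
control_cabinets = math.ceil(compute_cabinets / COMPUTE_CABS_PER_CONTROL)  # 8
total_cabinets = compute_cabinets + control_cabinets                       # 32

kw_per_cabinet = NODES_PER_CABINET * WATTS_PER_NODE / 1000.0               # 8.4 kW
it_load_kw = total_cabinets * kw_per_cabinet                               # ~268.8 kW

# Standard conversion: 1 ton of A/C removes ~3.517 kW of heat.
ac_tons = it_load_kw / 3.517            # ~76 tons; the slide budgets 100 for headroom

print(f"{total_cabinets} cabinets, {it_load_kw:.0f} kW IT load, "
      f"~{ac_tons:.0f} tons of cooling (slide budgets 100)")
```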
Current Software
• Cluster software from IBM (xCAT) is the core of the facility.
  • Stable production environment.
  • MPI is the basic message-passing layer (minimal example below).
• Globus/NMI work is proceeding with Carl's help in funding plus ISD resources.
  • Leverages the campus need for a global directory.
  • More later.
• Solaris and Unitree are the core of mass storage support.
  • We need to look at other mass storage opportunities.
• Issues
  • We need to be able to support faculty/researchers with tools and consulting that help them use large-scale resources effectively.
  • Many packages exist on HPCC resources, but there is no local support to help use them.
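Since MPI is the cluster's basic message-passing layer, a minimal example may help orient new users. This sketch uses the mpi4py Python binding purely as an illustration; production codes on the cluster would more typically call MPI from C or Fortran.

```python
# Minimal MPI example: every rank reports its host; rank 0 prints a summary.
# Launch with the cluster's MPI runner, e.g.: mpirun -np 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id, 0 .. size-1
size = comm.Get_size()   # total number of MPI processes

# gather() collects one value from every rank onto the root rank.
hosts = comm.gather(MPI.Get_processor_name(), root=0)

if rank == 0:
    print(f"{size} ranks on hosts: {sorted(set(hosts))}")
```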
“Middleware”
• Globus as the base, with the NMI architecture for the campus.
  • GT2, moving to GT3.
  • SCEC/ISI.
• Condor as a lightweight job manager in user rooms.
• PBS/Maui on the cluster and on the computation side of the E15k (illustrative submission sketch below).
• Issues
  • Kx509 bridge from Kerberos.
    • The USC PKI-lite CA is the base.
    • Only hosts and services.
    • NMI based.
  • Pubcookie (Kerberos back-end).
    • Uses host certs from the PKI-lite CA.
  • Shib for some prototype library apps (scholar's portal).
  • Campus GDS/PR using NMI schemes (eduPerson etc.).
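PBS/Maui job submission looks roughly like the following sketch. It is illustrative only: qsub and the #PBS directives are standard PBS, but the job name, resource values, and script paths here are hypothetical rather than HPCC specifics, and users would normally write the batch script by hand.

```python
# Illustrative PBS submission (job name and resource values are made up).
# qsub queues the script; the Maui scheduler decides when it runs.
import subprocess
import tempfile

job_script = """#!/bin/sh
#PBS -N mpi_hello
#PBS -l nodes=4:ppn=2
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
mpirun -np 8 python hello_mpi.py
"""

with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
    f.write(job_script)
    script_path = f.name

result = subprocess.run(["qsub", script_path], capture_output=True, text=True)
print("submitted job:", result.stdout.strip())   # qsub echoes the new job id
```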
HPCC Governance
• HPCC faculty advisory group
  • Meets 4–5 times a year.
  • Provides guidance to the DCIO and CTO.
  • “Final” decisions rest with ISD (CIO/DCIO); the usual mode is agreement.
• Time allocation
  • No recharge.
  • Large projects are reviewed by a faculty allocation group.
  • Some projects exceed 500K node-hours.
  • Condo users get dedicated nodes and cost sharing.
• Research leverage
  • Condo.
  • Cost sharing.
  • External funding.
  • Grid construction.
  • Next-generation network.
CTO/HPCC Projects
• Advanced Networking Projects
  • CalREN-2
    • 2×Gb service today.
    • 10Gb service in the next 2 years.
  • Fiber/wavelength services (CENIC/National LambdaRail)
    • Online for the west coast.
    • Look at L2 possibilities to build shared ‘spaces’.
    • Look to leverage for projects like the OptIPuter ITR.
  • 1 Wilshire colo facilities
    • See if we can use that space to facilitate the ETF proposal.
  • The OptIPuter ITR as a way to help network expansion.
CTO/HPCC Projects
• Leverage HPCC efforts at ISI with ISD resources.
• Clusters
  • Expand the cluster to ~2,000 centrally owned nodes.
  • Expand the cluster for other groups (condo model).
• Mass Storage
  • Look into large-scale storage for groups like the VHF project and other high-end storage needs (fractional petabytes).
• Globus/NMI
  • Provide campus leadership for global directory services and identity management (authentication and authorization).
• Networking Research
CTO/HPCC Projects
• Fiber is a major part of HPCC's ability to serve large-scale computational needs. The following slides show what we have today and how it can be used.
Fiber Facilities
• Lease dark fiber.
  • Started with dark fiber 3 years ago; a pioneer in this area.
  • DWP (Department of Water and Power).
    • USC franchise-area fiber for campus access.
  • Leverage new players (NLR/CENIC).
  • Use for USC, LAAP, and Los Nettos projects.
• Built out today using low-cost CWDM and 15540s.
  • 10Gbps Ethernet backbone in place Fall '02.
  • Built out fiber to Caltech/JPL/VHF (Shoah) and other Los Nettos sites.
Fiber Facilities
• Lease more dark fiber.
  • Harvey Mudd.
  • Build a second path to USC for disaster recovery.
• Install DWDM gear from the CENIC deal with Cisco.
  • 1Gb wavelengths in the first phase (Fall '04).
  • 10Gb wavelengths in Summer '04.
• Use to enable projects like OptIPuter and ETF.
• Experiment with optical switching hardware as a ‘fiber patch panel’ for developing shared ‘computer centers’.
[Diagram: Original USC Fiber Backbone — 4-strand SM DWP fiber linking 1 Wilshire, Downtown Clinic, HSC, UPC, ISI, and ICT (the original external fiber plant).]
[Diagram: Today's Fiber and GigaMAN Circuits — fiber and GigaMAN links among Caltech, JPL, HMC, 818, VHF, 1 Wilshire, Tustin, HSC, UPC, ISI, and ICT.]
Colo Facilities
• Acquired space in 1 Wilshire (original site) 3 years ago.
  • DWP fiber is the core.
• Use to connect to exchanges and other ISPs.
• Potentially extend to other ‘1 Wilshire’ buildings.
  • Use the new campus Level 3 fiber as the means.
• House routers and L2 equipment.
• Provide space on the USC campus for partners.
• Enables the Pacific Wave exchange point.
[Diagram: Exchange Point/Research — Foundry BigIrons with 802.1q VLANs at 1 Wilshire and 818 7th; 10Gb links to ISI, HSC, and UPC; Gb and 100Mb ports at each site.]
Experimental Networking
• Networking research community
  • California Institutes for Science and Innovation (CITRIS, CalIT2, NanoSystems, BioMedical)
  • San Diego Supercomputer Center
  • CACR
  • ISI
  • TeraGrid/Distributed Terascale Facility
  • UCSB/Dan Blumenthal optical labs
Future Resource Goals
• High Performance Computing Resources
  • Linux Cluster (2,048 nodes / 4,096 CPUs, 2Gb/sec Myrinet)
    • 60TB shared disk; 36–72GB local disk per node.
    • Rank in the top 5 among academic clusters.
    • Start adding 64-bit nodes in Summer '04.
    • The switch fabric will expand past 1,024 nodes, with the ability to condo other users.
    • Plan to add more nodes funded by USC research groups (condo); the goal is 3,000+ nodes total.
  • Sun Core Servers (E15k shared memory)
    • 72 processors, 288GB memory, 300TB disk.
    • Use this system for high-end data users (large-scale databases) and video users.
• Mass Storage Facilities (Unitree today)
  • 18,000-tape capacity.
  • A petabyte online as the goal within 3 years.
3-Year Strategy
• The next step after 32-bit Pentium.
  • Need to determine what will replace the Xeons. One answer is Opteron or IA-64, but we need to start developing clusters in this space and benchmarking.
  • Much of the code will need reworking at the user level.
• Find ways to cost-share with local cluster purchasers; “condo” housing of medium-to-large clusters will be important.
• Build “Grid-U”.
3-Year Strategy
• As clusters expand into the 2–4K node space, power and A/C become significant issues (along with floor space).
• We need to develop several major partners so that HPCC can be the central piece of joint USC proposals for initiatives such as ETF and future cyberinfrastructure programs.
  • An example is a shared submission for a Major Research Instrumentation grant.
3-Year Strategy
• Networking Futures
  • Expand the exchange point (R/E, Pacific Wave).
    • 10Gb at all sites.
  • Layer 1 facilities (OptIPuter-type connections).
  • Re-design/RFP for the campus network this month.
    • Design the network with ‘enclaves’ for research or academic support.
    • Much higher internal bandwidth (10Gb core-to-core; at least 1Gb to all buildings, 10Gb to major research centers).
    • How to provide comprehensive security without unacceptable friction.