Tokyo Institute of Technology Commodity Grid Computing Infrastructure (and other Commodity Grid resources in Japan)
Satoshi Matsuoka
Professor, GSIC & Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology
“Commodity Grid” Resources Starting April 2002 (under our control)
• 1. My lab machines: ~900 CPUs, ~2 TeraFlops, IA32
• 2. Titech (Campus) Grid resources: ~800 CPUs, 1.3 TeraFlops, IA32 “blades”
• 3. “Nationwide” Commodity Grid experiment: ~several hundred IA32 CPUs (2H2002)
• Total: 1700~2000 processors in 2002
• 256 processors for the APGrid Testbed
Titech GSIC Matsuoka Lab Grid Cluster Infrastructure (4Q2001): 1H2002 total of 6 clusters, 890 procs, ~2 TFlops (peak), >100 TeraBytes
Presto III Athlon Cluster (2001-2002): 1.6 TeraFlops, 100 Terabytes
• Full assistance from AMD, donation of Athlon CPUs
• Applications: Gfarm/LHC/ATLAS, bioinformatics apps
• Presto III(1), April 2001: 80-node Athlon 1.33 GHz, 206 GigaFlops peak; Top500 June 2001, 439th, the first-ever AMD-powered cluster on the Top500
• Presto III(2), Oct 2001: 256-proc dual AthlonMP 1.2 GHz, 614 GigaFlops peak; Top500 Nov 2001, 331.7 GigaFlops, 86th
• Presto III(3), April 2002: 512-proc dual AthlonMP 1900+, 1.6 TeraFlops peak, 1 TFlops Linpack, 100 Terabytes (for LHC/ATLAS)
Matsuoka Lab Grid Clustering Projects
• Dependable, fault-tolerant clustering for the Grid: Parakeet fault-tolerant MPI, fault-tolerant GridRPC
• Plug & Play clustering: extended Parakeet, Lucie dynamic cluster installer
• Heterogeneous clustering: heterogeneous Omni OpenMP, heterogeneous High Performance Linpack (HPL), multiple cluster coupling on the Grid
• Grid projects on clusters: Ninf-G GridRPC (see the client sketch below), SOAP/XML GridRPC prototype, Gfarm middleware for petascale data processing
• Grid performance benchmarking and monitoring: Bricks parallel Grid simulator, JiPANG Jini-based Grid portal
• Java for clusters and the Grid: OpenJIT flexible high-performance JIT compiler, JavaDSM secure and portable Java DSM system for clusters
• Titech Grid: campus Grid infrastructure
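To make the Ninf-G GridRPC item above concrete, here is a minimal client-side sketch in the GridRPC API style that Ninf-G implements (grpc_initialize, function handles, grpc_call). The configuration file name and the remote entry "lab/mmul" are illustrative assumptions rather than actual Titech deployments, and exact signatures may differ between Ninf-G releases.

```c
/* Minimal GridRPC client sketch (GridRPC-API style as implemented by Ninf-G).
 * The config file "client.conf" and the remote entry "lab/mmul" are hypothetical. */
#include <stdio.h>
#include <grpc.h>

#define N 100

int main(void)
{
    grpc_function_handle_t handle;
    static double A[N * N], B[N * N], C[N * N];   /* input/output matrices */

    if (grpc_initialize("client.conf") != GRPC_NO_ERROR) {  /* read client config */
        fprintf(stderr, "grpc_initialize failed\n");
        return 1;
    }

    /* Bind a handle to a remote executable registered on a default server. */
    grpc_function_handle_default(&handle, "lab/mmul");

    /* Synchronous remote call: arguments are marshalled, the job is started
     * on the server side (e.g., through Globus), and results return in C[]. */
    if (grpc_call(&handle, N, A, B, C) != GRPC_NO_ERROR)
        fprintf(stderr, "remote call failed\n");

    grpc_function_handle_destruct(&handle);
    grpc_finalize();
    return 0;
}
```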
2. Background for the Titech Campus Commodity Grid
• Titech GSIC operates 3 supercomputers: 16-proc SX-5, 256-proc Origin2K, 64-proc AlphaServer GS320, 400 GFlops total
• All are heavily utilized (99.9% for the SX-5)
• Annual rental budget: $5 million; 4 years left on a 6-year rental contract
• All may disappear from the Top500 in Heidelberg (June 2002)
• We have no large extra budget for new equipment or staff
• Chicken-and-egg problem for Grid adoption
• Most Japanese SC centers share the same problem
Titech Campus Grid - System Image
• The Titech Grid is a large-scale, campus-wide pilot commodity Grid deployment for next-generation E-Science application development within the campuses of the Tokyo Institute of Technology (Titech).
• High-density blade PC server systems totalling 800 high-end PC processors, installed at 13 locations throughout the Titech campuses and interconnected via the Super TITANET backbone (1-4 Gbps).
• The first campus-wide pilot Grid system deployment in Japan, providing a next-generation high-performance “virtual parallel computer” infrastructure for high-end computational E-Science.
• System configuration (diagram): NEC Express 5800 series blade servers; 24-processor satellite systems at each department (×12 systems) across the Oo-okayama and Suzukakedai campuses (30 km apart); GSIC main servers (256 processors) × 2 systems in just 5 cabinets; Grid-wide single system image via Grid middleware (Globus, Ninf-G, Condor, NWS, …); Super SINET (10 Gbps MOE national backbone network) connection to other Grids; 800-processor high-performance blade servers, >1.2 TeraFlops, over 25 Terabytes of storage.
Titech Grid Campus Sites
• 15 installation sites across the 2 campuses: Oo-okayama (10 sites), Suzukakedai (5 sites)
• 18 participating departments, selected via a university-wide solicitation and application process
• Each department lists its own apps: bioinformatics, CFD, nanotech, environmental science, etc.
How we implement the Field of Dreams
• Fact: departments lack space, power, air conditioning, maintenance expertise, etc.
• Technological solutions:
  • High-density, high-performance blade design: 2× the density of a 1U rack-server design
  • GSIC: 512 P3 1.4 GHz in just 5 19-inch racks
  • Department: 24 P3 1.4 GHz in a small desk-sized unit that can run off a wall plug (just 2 kW per cluster)
  • High operational temperature (33 degrees Celsius)
  • Remote server management technologies for low-level management; needs to be firewall friendly
  • And of course, all the cluster & Grid middleware: Globus, Condor, Sun GridEngine, NWS, MPICH-G, Cactus, Ninf-G, Gfarm, Lucie… (the application-side MPI view is sketched below)
• The first Titech Grid is just an operational prototype; proposal for a 60 TeraFlops, 1.6 Petabyte Campus Grid Federation with Univ. of Tsukuba
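Since MPICH-G appears in the middleware list above, the sketch below shows the application-side view: a plain MPI program. Assuming a standard MPICH-G(2) build with Globus-based startup (not a Titech-specific recipe), the same source could be launched across several departmental blade clusters without modification.

```c
/* Plain MPI program; with an MPICH-G(2) build, Globus handles the cross-site
 * startup, so the ranks may span multiple departmental clusters. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Each rank reports where it landed on the campus Grid. */
    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```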
3. Nationwide Commodity Grid
Objective: sustain 1 TeraFlop for a week at 1/100 of the cost
1) Commodity nationwide PC cluster testbed: >1000 processors, multi-TeraFlops, spanning Titech, Kyoto-U, AIST, and Tokushima-U over (Super)SINET, coupled via GridRPC
2) Highly reliable commodity cluster middleware: a) nonintrusive fault tolerance, b) dynamic Plug & Play, c) heterogeneity
3) Scalable and fault-tolerant extensions for GridRPC: a) >million-task parallelism, b) fault tolerance under various fault models (see the retry sketch below), c) high-level and generalized GridRPC API
4) Grid-enabled, terascale mathematical optimization libraries and apps:
  - Non-convex quadratic optimization using SCRM
  - Higher-order polynomial solving with homotopy methods (e.g., cyclic polynomial all-solutions: x1 + x2 + x3 = c1, x1x2 + x1x3 + x2x3 = c2, x1x2x3 = c3)
  - BMI optimization for control theory apps
  - Parallel GA for genome informatics apps
  - Structural optimization problems, protein NMR structural prediction
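Item 3b (fault tolerance under various fault models) can be illustrated, in its simplest client-side form, by retrying a failed call on another server. This is only a sketch of the idea, not the project's actual mechanism; the server names and the entry "opt/evaluate" are hypothetical, and the error handling follows the generic GridRPC API convention.

```c
/* Sketch: client-side retry of a GridRPC call over a list of servers.
 * Assumes grpc_initialize() has already been called; server names and the
 * remote entry "opt/evaluate" are hypothetical. */
#include <grpc.h>

static const char *servers[] = { "clusterA.example.ac.jp",
                                 "clusterB.example.ac.jp",
                                 "clusterC.example.ac.jp" };

/* Try each server in turn until one completes the call successfully. */
int call_with_retry(int n, double *in, double *out)
{
    size_t i;
    for (i = 0; i < sizeof(servers) / sizeof(servers[0]); i++) {
        grpc_function_handle_t h;
        if (grpc_function_handle_init(&h, (char *)servers[i],
                                      "opt/evaluate") != GRPC_NO_ERROR)
            continue;                        /* server unreachable: try the next one */
        if (grpc_call(&h, n, in, out) == GRPC_NO_ERROR) {
            grpc_function_handle_destruct(&h);
            return 0;                        /* success */
        }
        grpc_function_handle_destruct(&h);   /* call failed: fall through and retry */
    }
    return -1;                               /* every server failed */
}
```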
Ex.: TeraScale GA Optimization Challenges on the Grid
• Apply GAs to difficult problem domains where experts spend considerable time on trial and error to find optimal solutions
• Example 1: Lens design. Searching over curvature, spacing, material, etc. to achieve an optimal image takes an expert weeks; GA-based design takes a few minutes
• Example 2: NMR protein structure analysis. Optimize the protein structure to match the observed NMR signal (takes months to a year for an expert)
• Evaluating a candidate solution is extremely costly and requires massive parallelization: millions of GridRPC calls (sketched below)
• Plan: determine structures of proteins up to the 30,000 Da class, the limit of current NMR scans
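A hedged sketch of how the "millions of GridRPC calls" could be issued: each individual's fitness evaluation is farmed out with an asynchronous GridRPC call and the client waits for the whole batch. The entry name "ga/fitness", the population layout, and the handle setup are assumptions for illustration; the real GA codes and their Grid extensions are not shown here.

```c
/* Sketch: farming out GA fitness evaluations with asynchronous GridRPC calls.
 * Assumes grpc_initialize() was called and `h` is bound to a hypothetical
 * remote entry "ga/fitness"; sizes are illustrative. */
#include <grpc.h>

#define POP_SIZE   1024
#define GENOME_LEN 64

void evaluate_population(grpc_function_handle_t *h,
                         double genome[POP_SIZE][GENOME_LEN],
                         double fitness[POP_SIZE])
{
    grpc_sessionid_t sid[POP_SIZE];
    int i;

    /* Issue one non-blocking remote evaluation per individual. */
    for (i = 0; i < POP_SIZE; i++)
        grpc_call_async(h, &sid[i], GENOME_LEN, genome[i], &fitness[i]);

    /* Block until every outstanding evaluation has written its fitness. */
    grpc_wait_all();
}
```

On a real run this loop would be the inner step of each GA generation, with selection and crossover done locally on the client once the fitness array is filled.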
APGrid Issues
• The everlasting private-address issue
  • Globus CANNOT span across private addresses; VPNs are currently the only solution
  • Grid folks in the US (and EU) are too ignorant of this: private addresses are the norm in business computing
  • Should really move on to IPv6 and IPsec
• CA/RA/CP issue, and membership issues as well: do we support >1000 users?
• Resource brokering
• Stable testbed: can't just be something that goes up temporarily
• Systems testbed or applications testbed? Maybe not enough resources for semi-production