QMUL e-Science Research Cluster
  Introduction
  (New) Hardware
  Performance
  Software Infrastructure
  What still needs to be done
Background
  Formed an e-Science consortium within QMUL to bid for SRIF money etc. (there was no existing central resource).
  Received money in all 3 SRIF rounds so far.
  Led by EPP + Astro + Materials + Engineering.
  Started from scratch in 2002: new machine room, Gb networking.
  Now have 230 kW of A/C.
  Differing needs: other fields tend to need parallel processing support (MPI etc.).
  Support effort is a bit of a problem.
History of the High Throughput Cluster
  Already in its 4th year (3 installation phases).
  In addition, there is an Astro cluster of ~70 machines.
  280 + 4 dual dual-core 2 GHz Opteron nodes.
  40 + 4 with 8 GB of memory, the remainder with 4 GB.
  Each with 2 x 250 GB hard disks.
  3Com SuperStack 3 3870 network stack.
  Dedicated second network for MPI traffic.
  APC 7953 vertical PDUs.
  Total measured power usage seems to be ~1 A/machine, i.e. ~65-70 kW in total.
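The totals implied by these figures can be checked with a few lines of arithmetic. The sketch below reads "280 + 4" as 284 nodes and "40 + 4" as 44 nodes with 8 GB. The mains voltage range (230 to 240 V) is an assumption, not a figure from the slide, and power factor is ignored, so the result is only a rough consistency check against the quoted ~65-70 kW.

```python
# Figures taken from the slide.
NODES = 280 + 4            # read here as 284 nodes in total
SOCKETS_PER_NODE = 2       # "dual"
CORES_PER_SOCKET = 2       # "dual-core"
NODES_WITH_8GB = 40 + 4    # the remainder have 4 GB
DISKS_PER_NODE = 2
DISK_GB = 250
AMPS_PER_NODE = 1.0        # approximate measured current per machine

# Assumption (not from the slide): nominal UK mains voltage range.
MAINS_VOLTS = (230.0, 240.0)


def summarise():
    """Return aggregate cores, memory, disk and estimated power draw."""
    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    ram_gb = NODES_WITH_8GB * 8 + (NODES - NODES_WITH_8GB) * 4
    disk_tb = NODES * DISKS_PER_NODE * DISK_GB / 1000
    # Apparent power only: volts x amps, power factor ignored.
    power_kw = tuple(NODES * AMPS_PER_NODE * v / 1000 for v in MAINS_VOLTS)
    return {"cores": cores, "ram_gb": ram_gb,
            "disk_tb": disk_tb, "power_kw": power_kw}


if __name__ == "__main__":
    for key, value in summarise().items():
        print(key, value)
```

Under these assumptions the estimate comes out at roughly 65 to 68 kW, which sits inside the quoted range.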
Crosscheck: [figure]
  Ordered in the last week of March.
  1st batch of machines delivered within 2 weeks.
  5 further batches, 1 week apart.
  3 week delay for the proper PDUs.
  Cluster cabled up and powered 2 weeks ago.
  Currently all production boxes are running legacy SL3/x86.
  Issues with scalability of services (Torque, Ganglia).
  The shared experimental area is also an I/O bottleneck.
  The cluster has been fairly heavily used: ~40-45% utilisation on average.
Tier-2 Allocations
S/W Infrastructure
  MySQL database containing all static info about machines and other hardware, plus network and power configuration.
  Software configuration info is kept in Subversion: OS version and release tag.
  Automatic (re)installation and upgrades use a combination of both: tftp/kickstart pulls dynamically generated pages from the web server (Mason).
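The idea of generating a per-host kickstart page from the inventory database can be sketched as follows. This is only an illustration of the pattern, not the site's actual implementation: it uses Python and an in-memory SQLite database as a stand-in for Mason and MySQL, and the `hosts` table, its columns, the host `node001`, the install URL and the `/etc/cluster-release` path are all hypothetical. The output is a fragment, not a complete kickstart file.

```python
import sqlite3
from string import Template

# Minimal kickstart-style fragment; the directives shown are standard
# kickstart ones, but this is not a complete or tested install recipe.
KS_TEMPLATE = Template("""\
# generated for $name ($os_release, tag $svn_tag)
url --url=$install_url/$os_release
network --bootproto=static --ip=$ip --hostname=$name
%post
echo "release_tag=$svn_tag" > /etc/cluster-release
""")


def open_inventory():
    """Build a toy inventory (hypothetical schema and data)."""
    db = sqlite3.connect(":memory:")
    db.row_factory = sqlite3.Row
    db.execute(
        "CREATE TABLE hosts (name TEXT PRIMARY KEY, ip TEXT, "
        "os_release TEXT, svn_tag TEXT)"
    )
    db.execute(
        "INSERT INTO hosts VALUES (?, ?, ?, ?)",
        ("node001", "10.0.0.1", "sl4-x86_64", "r1234"),
    )
    return db


def render_kickstart(db, name, install_url="http://install.example.org"):
    """Look up one host and fill the template with its static info."""
    row = db.execute("SELECT * FROM hosts WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    return KS_TEMPLATE.substitute(dict(row), install_url=install_url)


if __name__ == "__main__":
    print(render_kickstart(open_inventory(), "node001"))
```

Keeping the static facts in the database and the release tag under version control means that changing one row or one tag is enough to change what a node receives at its next reinstall.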
http://www.esc.qmul.ac.uk/cluster/
Ongoing work
  Commission the SL4/x86_64 service (~30% speed improvement), assuming non-HEP usage initially. Able to migrate boxes on demand.
  Tune MPI performance for jobs up to ~160 CPUs (non-IP protocol?).
  Better integrated monitoring (Ganglia + PBS + opensmart? + existing db). Dump Nagios?
  Add 1-wire temperature + power sensors.
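As an illustration of how 1-wire temperature readings might be fed into monitoring, the sketch below parses the two-line text format produced by the Linux `w1_therm` kernel driver. Whether the cluster's sensors would be read this way is an assumption; the slide does not say which hardware or software is planned, and the sample reading is made up for the example.

```python
import re


def parse_w1_therm(text):
    """Return the temperature in degrees C, or None if the CRC check
    failed or the text is not in the expected two-line format."""
    lines = text.strip().splitlines()
    # The first line ends in YES when the sensor's CRC was valid.
    if len(lines) < 2 or not lines[0].rstrip().endswith("YES"):
        return None
    # The second line carries the reading in millidegrees, e.g. t=23125.
    match = re.search(r"t=(-?\d+)", lines[1])
    return int(match.group(1)) / 1000.0 if match else None


# Illustrative sample only, not a measurement from the cluster.
SAMPLE = (
    "72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n"
    "72 01 4b 46 7f ff 0e 10 57 t=23125\n"
)

if __name__ == "__main__":
    print(parse_w1_therm(SAMPLE))
```

Returning `None` for a failed CRC lets the caller skip a bad sample rather than record a spurious temperature.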
Ongoing work continued
  Learn how to use a large amount of distributed storage in an efficient and robust way.
  Need to provide a POSIX file system (probably by extending poolfs, or using something like Lustre).