
QMUL e-Science Research Cluster


Presentation Transcript


  1. QMUL e-Science Research Cluster. Outline: Introduction; (New) Hardware; Performance; Software Infrastructure; What still needs to be done.

  2. QMUL e-Science Research Cluster. Background: formed an e-Science consortium within QMUL to bid for SRIF money etc. (there was no existing central resource); received money in all 3 SRIF rounds so far. Led by EPP + Astro + Materials + Engineering. Started from scratch in 2002, with a new machine room and Gb networking; now have 230 kW of A/C. Differing needs: other fields tend to need parallel-processing support (MPI etc.). Support effort is a bit of a problem.

  3. QMUL e-Science Research Cluster. History of the High-Throughput Cluster: already in its 4th year (3 installation phases). In addition, there is an Astro cluster of ~70 machines.

  4. QMUL e-Science Research Cluster

  5. QMUL e-Science Research Cluster

  6. QMUL e-Science Research Cluster. 280 + 4 dual dual-core 2 GHz Opteron nodes (two dual-core CPUs per node); 40 + 4 with 8 GB of memory, the remainder with 4 GB; each with 2 x 250 GB hard disks. 3Com SuperStack 3 3870 network stack, with a dedicated second network for MPI traffic. APC 7953 vertical PDUs. Total measured power usage seems to be ~1 A/machine, ~65-70 kW total.
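Those figures are mutually consistent: at UK mains voltage, ~1 A per machine over 284 machines comes to roughly 65 kW. A minimal sketch of the arithmetic in Python (the 230 V mains figure is my assumption; the slide quotes only the current):

    # Sanity check of the quoted power figures (230 V mains is assumed).
    nodes = 280 + 4          # worker nodes + extra nodes, per the slide
    amps_per_node = 1.0      # measured draw per machine
    volts = 230.0            # assumed UK single-phase mains voltage

    total_kw = nodes * amps_per_node * volts / 1000.0
    print(f"estimated total draw: {total_kw:.1f} kW")  # ~65.3 kW, within the quoted 65-70 kW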

  7. QMUL e-Science Research Cluster Crosscheck:

  8. QMUL e-Science Research Cluster. Ordered in the last week of March; the 1st batch of machines was delivered in 2 weeks, with 5 further batches 1 week apart, and a 3-week delay for proper PDUs. The cluster was cabled up and powered 2 weeks ago. Currently all production boxes are running legacy SL3/x86. There are issues with the scalability of services (torque/ganglia); the shared experimental area is also an I/O bottleneck.

  9. QMUL e-Science Research Cluster

  10. QMUL e-Science Research Cluster. The cluster has been fairly heavily used: ~40-45% on average.

  11. QMUL e-Science Research Cluster Tier-2 Allocations

  12. QMUL e-Science Research Cluster. S/W Infrastructure: a MySQL database contains all static info about machines and other hardware, plus the network and power configuration. S/W configuration info (OS version and release tag) is kept in a Subversion repository. Automatic (re)installation and upgrades use a combination of both: tftp/kickstart pulls dynamic pages from the web (Mason).
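To make that division of labour concrete, here is a hypothetical sketch of rendering a per-host kickstart fragment from such a database. The driver (MySQLdb), the table and column names, and the URLs are all my assumptions; the talk says only that static host info lives in MySQL and the release tag in Subversion.

    # Hypothetical sketch: build a kickstart fragment from the hardware DB.
    # Schema, driver and URLs are assumptions, not details from the talk.
    import MySQLdb

    def kickstart_fragment(hostname):
        db = MySQLdb.connect(host="localhost", db="cluster", user="ro")
        try:
            cur = db.cursor()
            # assumed schema: one row of static info per machine
            cur.execute(
                "SELECT os_release, release_tag FROM hosts WHERE hostname = %s",
                (hostname,),
            )
            row = cur.fetchone()
        finally:
            db.close()
        if row is None:
            raise KeyError(hostname)
        os_release, release_tag = row
        # The release tag selects the matching config tree in Subversion.
        return (f"url --url http://install.example.org/{os_release}/\n"
                f"# config tag: {release_tag}\n")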

  13. QMUL e-Science Research Cluster http://www.esc.qmul.ac.uk/cluster/

  14. QMUL e-Science Research Cluster. Ongoing work: commission an SL4/x86_64 service (~30% speed improvement), assuming non-HEP usage initially, with the ability to migrate boxes on demand. Tune MPI performance for jobs up to ~160 CPUs (non-IP protocol?). Better integrated monitoring (ganglia + PBS + OpenSMART? + the existing DB); dump Nagios? Add 1-wire temperature and power sensors (see the sketch below).
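For the 1-wire temperature sensors, one common approach (my assumption, not necessarily what QMUL chose) is a DS18B20-style sensor exposed through the Linux w1 kernel drivers, which publish readings as files under /sys/bus/w1/devices:

    # Read DS18B20-style 1-wire temperature sensors via the Linux w1 sysfs
    # interface (assumes the w1-therm kernel module is loaded).
    import glob

    def read_temps_c():
        temps = {}
        # each sensor appears as a 28-* directory holding a w1_slave file
        for path in glob.glob("/sys/bus/w1/devices/28-*/w1_slave"):
            with open(path) as f:
                lines = f.read().splitlines()
            # first line ends in "YES" when the CRC check passed;
            # the second line ends in "t=<temperature in millidegrees C>"
            if len(lines) >= 2 and lines[0].endswith("YES"):
                sensor_id = path.split("/")[-2]
                temps[sensor_id] = int(lines[1].split("t=")[1]) / 1000.0
        return temps

    if __name__ == "__main__":
        for sensor, temp in sorted(read_temps_c().items()):
            print(f"{sensor}: {temp:.1f} C")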

  15. QMUL e-Science Research Cluster. Ongoing work, continued: learn how to use a large amount of distributed storage in an efficient and robust way. Need to provide a POSIX f/s (probably by extending poolfs, or something like Lustre).
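As a toy illustration of what a user-space POSIX layer involves, a filesystem can be prototyped with FUSE. The read-only passthrough below (written against the fusepy package, my choice rather than anything from the talk) simply forwards operations to a local directory; a poolfs/Lustre-style system would instead resolve each path to the right remote storage pool.

    # Toy read-only FUSE passthrough filesystem (assumes the fusepy package).
    # A real distributed POSIX f/s would map paths to remote pools instead.
    import errno, os, sys
    from fuse import FUSE, FuseOSError, Operations

    class Passthrough(Operations):
        def __init__(self, root):
            self.root = root

        def _full(self, path):
            # map the mounted path onto the backing directory
            return os.path.join(self.root, path.lstrip("/"))

        def getattr(self, path, fh=None):
            try:
                st = os.lstat(self._full(path))
            except OSError:
                raise FuseOSError(errno.ENOENT)
            return {key: getattr(st, key) for key in (
                "st_mode", "st_nlink", "st_size", "st_uid", "st_gid",
                "st_atime", "st_mtime", "st_ctime")}

        def readdir(self, path, fh):
            return [".", ".."] + os.listdir(self._full(path))

        def read(self, path, size, offset, fh):
            with open(self._full(path), "rb") as f:
                f.seek(offset)
                return f.read(size)

    if __name__ == "__main__":
        # usage: python passthrough.py <backing-dir> <mountpoint>
        FUSE(Passthrough(sys.argv[1]), sys.argv[2], foreground=True, ro=True)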
