
Presentation Transcript


  1. The SKA: the world's largest big-data project. Dr Paul Calleja, Director, HPCS

  2. HPCS activities & focus

  3. Square Kilometre Array - SKA • Next-generation radio telescope • 100x more sensitive • 1,000,000x faster • 5 square km of dish spread over 3,000 km • The next big science project • Currently the world's most ambitious IT project • Cambridge leads the computational design • HPC compute design • HPC storage design • HPC operations

  4. SKA location: a continental-sized radio telescope • Needs a radio-quiet site • Very low population density • Large amount of space • Two sites: Western Australia and the Karoo Desert, RSA

  5. What is radio astronomy? [Diagram: astronomical signal (EM wave) → detect & amplify → digitise & delay → correlate → integrate → image processing (calibrate, grid, FFT) → sky image]
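
  To make that signal chain concrete, here is a toy two-antenna sketch in Python (not SKA code; the sample rate, tone frequency and delay are invented for illustration): two voltage streams are digitised, a known delay is compensated, and the pair is multiplied and integrated into a single visibility, which the later imaging stage would calibrate, grid and FFT.

    # Toy two-antenna correlator: detect -> digitise & delay -> correlate -> integrate.
    # Purely illustrative; all numbers are made up for the sketch.
    import numpy as np

    fs = 1e6                      # sample rate (Hz), invented
    n = 4096                      # samples per integration
    t = np.arange(n) / fs

    # "Astronomical signal": a noisy tone arriving at two antennas,
    # the second with a small geometric delay.
    delay_samples = 3
    sky = np.sin(2 * np.pi * 50e3 * t) + 0.5 * np.random.randn(n)
    antenna_a = sky
    antenna_b = np.roll(sky, delay_samples) + 0.5 * np.random.randn(n)

    # Digitise (coarse quantisation) and compensate the known delay.
    quantise = lambda x: np.round(x * 8) / 8
    a = quantise(antenna_a)
    b = np.roll(quantise(antenna_b), -delay_samples)

    # Correlate and integrate: the accumulated product is one "visibility".
    visibility = np.mean(a * np.conj(b))
    print("integrated visibility:", visibility)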

  6. Why SKA – Key scientific drivers • Evolution of galaxies • Exploring the dark ages • Pulsar surveys & gravity waves • Cosmic magnetism • Are we alone?

  7. SKA timeline • 2019: Operations SKA1; 2024: Operations SKA2 • 2019-2023: Construction of full SKA, SKA2 (€1.5B) • 2016-2019: 10% SKA construction, SKA1 (€300M) • 2012: Site selection • 2012-2015: Pre-construction: 1 yr detailed design (€90M), PEP, 3 yr production readiness • 2008-2012: System design and refinement of specification • 2000-2007: Initial concepts stage • 1995-2000: Preliminary ideas and R&D

  8. SKA project structure [Organisation chart: SKA Board → Director General → Project Office (OSKAO); Advisory Committees (Science, Engineering, Finance, Funding …); locally funded Work Package Consortia 1 … n]

  9. Work package breakdown • System • Science • Maintenance and support / operations plan • Site preparation • Dishes • Aperture arrays • Signal transport • Data networks • Signal processing • Science Data Processor • Monitor and control • Power • Participants: SPO, UK (lead), AU (CSIRO…), NL (ASTRON…), South Africa SKA, Industry (Intel, IBM…)

  10. SKA data flow [Diagram: data rates along the chain: 20 Gb/s, 4 Pb/s, 16 Tb/s, 24 Tb/s, 1000 Tb/s, 20 Gb/s]

  11. Science data processor pipeline [Diagram] Imaging branch: corner turning → coarse delays → fine F-step / correlation → visibility steering → observation buffer → gridding visibilities → imaging → image storage. Non-imaging branch: corner turning → coarse delays → beamforming / de-dispersion → beam steering → observation buffer → time-series searching → search analysis → object/timing storage. Shared infrastructure: incoming data from collectors → switch → buffer store → correlator / beamformer → switch → buffer store → UV processor / image processor → bulk store → HPC science processing. Headline figures: per-stage rates and loads of 3,200 GB/s, 135 PB, 10-200 Pflop and 1-2.5 Eflop; SKA1 ~300 PB and ~1 Eflop; SKA2 ~3 EB, 128,000 GB/s and ~5.40 EB. Software complexity is the overarching challenge.
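
  As a flavour of the imaging branch, a minimal gridding-and-FFT sketch (schematic only, not the SDP design; the random visibilities and 256-pixel grid are invented): visibilities are dropped onto a regular UV grid and a 2-D inverse FFT yields a dirty image.

    # Schematic imaging step: grid visibilities on a UV plane, then FFT to an image.
    # Illustrative only; sizes and data are invented.
    import numpy as np

    npix = 256
    uv_grid = np.zeros((npix, npix), dtype=complex)

    # Fake visibilities: (u, v) coordinates in grid cells plus complex amplitudes.
    rng = np.random.default_rng(0)
    u = rng.integers(-npix // 4, npix // 4, size=1000)
    v = rng.integers(-npix // 4, npix // 4, size=1000)
    vis = rng.normal(size=1000) + 1j * rng.normal(size=1000)

    # Nearest-cell gridding (real pipelines use convolutional gridding kernels).
    np.add.at(uv_grid, (u % npix, v % npix), vis)

    # Inverse FFT of the gridded UV plane gives the dirty image.
    dirty_image = np.fft.fftshift(np.abs(np.fft.ifft2(uv_grid)))
    print("dirty image shape:", dirty_image.shape, "peak:", dirty_image.max())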

  12. SKA Exascale computing in the desert • The SKA SDP compute facility will, at the time of deployment, be one of the largest HPC systems in existence • Operational management of large HPC systems is challenging at the best of times - even when they are housed in well-established research centres with good IT logistics and experienced Linux HPC staff • The SKA SDP will be housed in a desert location with little surrounding IT infrastructure, poor IT logistics and little prior HPC history at the site • Potential SKA SDP exascale systems are likely to consist of ~100,000 nodes, occupy 800 cabinets and consume 30 MW - around 5 times the size of today's largest supercomputer, the Cray Titan at Oak Ridge National Laboratory • SKA SDP HPC operations will be very challenging
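
  A back-of-envelope check of those scale figures (Titan's commonly quoted numbers, ~18,688 compute nodes and ~8.2 MW, are used here as reference points and are not from the slide):

    # Rough scale comparison for the projected SKA SDP system.
    sdp_nodes, sdp_cabinets, sdp_power_mw = 100_000, 800, 30
    titan_nodes, titan_power_mw = 18_688, 8.2   # Titan (Oak Ridge) reference figures

    print("nodes per cabinet:", sdp_nodes / sdp_cabinets)                   # ~125
    print("power per cabinet (kW):", sdp_power_mw * 1000 / sdp_cabinets)    # ~37.5
    print("node-count ratio vs Titan:", round(sdp_nodes / titan_nodes, 1))  # ~5.4x
    print("power ratio vs Titan:", round(sdp_power_mw / titan_power_mw, 1)) # ~3.7x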

  13. The challenge is tractable • Although the operational aspects of the SKA SDP exascale facility are challenging, they are tractable if dealt with systematically and in collaboration with the HPC community.

  14. SKA HPC operations – functional elements • We can describe the operational aspects by functional element: • Machine room requirements ** • SDP data connectivity requirements • SDP workflow requirements • System service level requirements • System management software requirements ** • Commissioning & acceptance test procedures • System administration procedures • User access procedures • Security procedures • Maintenance & logistical procedures ** • Refresh procedures • System staffing & training procedures **

  15. Machine room requirements • Machine room infrastructure for exascale HPC facilities is challenging • 800 racks, 1,600 m² • 30 MW IT load • ~40 kW of heat per rack • Cooling efficiency and heat-density management is vital • Machine-room infrastructure at this scale is in the £150M bracket, with a design and implementation timescale of 2-3 years • The power cost alone, at today's prices, is £30M per year • A desert location presents particular problems for the data centre: • Hot ambient temperatures - difficult for compressor-less cooling • Lack of water - difficult for compressor-less cooling • Very dry air - difficult for humidification • Remote location - difficult for DC maintenance
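
  The figures above are easy to cross-check; the ~£0.11/kWh electricity price below is simply what the stated £30M/year implies, not a quoted tariff:

    # Cross-checking the machine-room numbers on slide 15.
    it_load_mw = 30
    racks = 800
    hours_per_year = 24 * 365

    energy_mwh = it_load_mw * hours_per_year              # ~262,800 MWh/year
    print("annual energy (GWh):", energy_mwh / 1000)      # ~262.8 GWh

    annual_cost_gbp = 30e6
    print("implied price (GBP/kWh):", round(annual_cost_gbp / (energy_mwh * 1000), 3))  # ~0.114

    print("average IT load per rack (kW):", it_load_mw * 1000 / racks)  # 37.5, consistent with ~40 kW/rack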

  16. System management software • System management software is the vital element in HPC operations • System management software today does not scale to exascale • A worldwide coordinated effort is under way to develop system management software for exascale • Elements of the system management software stack: • Power management ** • Network management • Storage management • Workflow management • OS • Runtime environment ** • Security management • System resilience ** • System monitoring ** • System data analytics ** • Development tools
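
  None of this stack yet exists at exascale. As a flavour of the "system monitoring" and "system resilience" elements, here is a deliberately simplified sketch of automated health aggregation across a large node count; check_node() is a hypothetical probe standing in for real telemetry (IPMI, fabric counters, job-scheduler state):

    # Minimal sketch of automated health aggregation across many nodes.
    # check_node() is a hypothetical stand-in for real telemetry sources.
    import random
    from collections import Counter

    NODES = 100_000

    def check_node(node_id: int) -> str:
        """Pretend probe: returns 'ok', 'degraded' or 'failed' for a node."""
        r = random.random()
        if r < 0.997:
            return "ok"
        return "degraded" if r < 0.999 else "failed"

    def sweep() -> Counter:
        """One monitoring sweep; at exascale this must be parallel and hierarchical."""
        return Counter(check_node(n) for n in range(NODES))

    status = sweep()
    print(status)
    # Nodes flagged here would be drained from the scheduler and queued for repair
    # automatically - manual triage does not scale to 100,000 nodes.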

  17. Maintenance logistics • Current HPC technology MTBF for hardware and system software results in failure rates of ~2 nodes per week on a cluster of ~600 nodes • SKA exascale systems are expected to contain ~100,000 nodes • Thus expected failure rates of ~300 nodes per week are realistic • During system commissioning this will be 3-4x higher • Fixing nodes quickly is vital, otherwise the system will soon degrade into a non-functional state • The manual engineering processes for fault detection and diagnosis on 600 nodes will not scale to 100,000 nodes; this needs to be automated in the system software layer • Scalable maintenance procedures need to be developed between HPC system administrators, system software and smart hands in the DC • Vendor hardware-replacement logistics need to cope with high turnaround rates
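
  Scaling the quoted failure rate linearly (a simplification that assumes the per-node failure rate is unchanged at exascale) reproduces the ~300 nodes/week figure:

    # Linear scaling of observed failure rates to the projected SDP node count.
    failures_per_week_observed = 2
    cluster_size_observed = 600
    sdp_nodes = 100_000

    rate_per_node_week = failures_per_week_observed / cluster_size_observed
    expected_failures = rate_per_node_week * sdp_nodes
    print("expected node failures per week:", round(expected_failures))   # ~333

    # The slide assumes 3-4x this rate during commissioning.
    print("commissioning estimate:", round(3 * expected_failures), "-", round(4 * expected_failures))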

  18. Staffing levels and training • Providing functional staffing levels and experience at a remote desert location will be challenging • It is hard enough finding good HPC staff to run small-scale HPC systems in Cambridge - finding orders of magnitude more staff to run much more complicated systems in a remote desert location will be very challenging • Operational procedures using a combination of remote system administration staff and DC smart hands will be needed • HPC training programmes need to be implemented to build skills well in advance • The HPCS, in partnership with the South African national HPC provider and the SKA organisation, is already in the process of building out pan-African HPC training activities

  19. Early Cambridge SKA solution - EDSAC 1
