
Progress on HPC’s (or building “HPC factory” at ANL)


Presentation Transcript


  1. Progress on HPC’s (or building “HPC factory” at ANL) Doug Benjamin, Duke University

  2. Introduction • People responsible for this effort: • Tom LeCompte (ANL – HEP division) • Tom Uram (ANL – ALCF – MCS division) • Doug Benjamin (Duke University) • Work supported by US ATLAS • Development activities performed at the Argonne Leadership Computing Facility (ALCF) • Have a director’s computing allocation for this work • US ATLAS members have allocations at other DOE Leadership Class Facilities (HPC sites) • NERSC, Oak Ridge, BNL

  3. Goals for this effort • Develop a simple and robust system • Scalable • Run on many different HPC sites • How else can we achieve “world domination”? • Seriously – running at many different sites maximizes the benefit to ATLAS • Transferable • Goal is to make this system deployable/usable by many different people, including HEP faculty – not just the enterprising students and computer professionals • Work with existing HEP workload management systems (i.e. PanDA) • Reuse the existing code base wherever possible • For example – use the existing PanDA pilot with small tweaks for HPC • Transparent • If you know how to use PanDA, then you know how to use HPC via BALSAM

  4. HPC Boundary conditions • There are many scientific HPC machines across the US and the world • Need to design a system that is general enough to work on many different machines • Each machine is independent of the others • The “grid” side of the equation must aggregate the information • There are several different machine architectures • ATLAS jobs will not run unchanged on many of the machines • Need to compile programs for each HPC machine • Memory per node (each with multiple cores) varies from machine to machine • The computational nodes typically do not have connectivity to the Internet • Connectivity is through a login node/edge machine • Pilot jobs typically cannot run directly on the computational nodes • TCP/IP stack is missing on many computational nodes

  5. Additional HPC issues • Each HPC machine has its own job management system • Each HPC machine has its own identity management system • Login/interactive nodes have mechanisms for fetching information and data files • HPC computational nodes are typically used through MPI • Can get a large number of nodes • The latency between job submission and completion can be variable (many other users) • We have to think about how we can adapt to the HPCs, not how the HPCs can adapt to us • This will give us more opportunities

  6. Work Flow • Some ATLAS simulation jobs can be broken up into three components • Preparatory phase – make the job ready for HPC • For example – generate the computational grid for Alpgen • Fetch database files for Simulation • Transfer input files to the HPC system • Computational phase – can be done on the HPC • Generate events • Simulate events • Post-computational phase (cleanup) • Collect output files (log files, data files) from HPC jobs • Verify output • Unweight (if needed) and merge files
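A minimal Python sketch of this three-phase split is given below. The function names and bodies are illustrative placeholders, not ATLAS or BALSAM code; they only mark where the preparatory, computational, and post-computational logic would plug in.

    # Hypothetical three-phase driver; all names below are assumptions.

    def preparatory_phase(job):
        # Grid/HTC side: e.g. generate the Alpgen computational grid,
        # fetch database files for Simulation, stage inputs to the HPC.
        print("preparing inputs for", job)
        return ["input.tar.gz"]

    def computational_phase(job, inputs):
        # HPC side: generate and simulate events on the compute nodes.
        print("running", job, "on", inputs)
        return ["events.root", "job.log"]

    def post_computational_phase(outputs):
        # Grid/HTC side: collect log/data files, verify the output,
        # unweight (if needed) and merge files.
        print("verifying and merging", outputs)

    if __name__ == "__main__":
        job = "alpgen_example"
        outputs = computational_phase(job, preparatory_phase(job))
        post_computational_phase(outputs)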

  7. Current Prototype – (“Grid side”) Infrastructure • APF pilot factory to submit pilots • PanDA queue – currently testing an ANALY queue • Local batch system • Web server provides steering XML files to the HPC domain • Message broker system to exchange information between the Grid domain and the HPC domain • GridFTP server to transfer files between the HTC domain and the HPC domain • Globus Online might be a good solution here (what are the costs?) • ATLAS DDM site – SRM and GridFTP server(s) • This is a working system
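Since file movement between the HTC and HPC domains uses GridFTP, one simple way to drive it from the grid side is a thin wrapper around the standard globus-url-copy client, as in the hedged sketch below. The hostnames, paths, and option usage are illustrative assumptions, not the prototype's actual configuration.

    # Hedged sketch: thin Python wrapper around the globus-url-copy client.
    import subprocess

    def gridftp_copy(src_url, dst_url):
        """Copy one file between the HTC and HPC GridFTP endpoints."""
        subprocess.check_call([
            "globus-url-copy",
            "-cd",       # create missing destination directories
            src_url,
            dst_url,
        ])

    # Example (hostnames/paths are made up): stage an input tarball
    # from the grid-side GridFTP server to the HPC side.
    gridftp_copy(
        "gsiftp://htc.example.org/data/input.tar.gz",
        "gsiftp://hpc-login.example.org/scratch/input.tar.gz",
    )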

  8. HPC code stack “BALSAM” • "Be not distressed, friend," said Don Quixote, "for I will now make the precious balsam with which we shall cure ourselves in the twinkling of an eye." • All credit goes to Tom Uram (ANL) • Work on the HPC side is performed by two components • Service: interacts with the message broker to retrieve job descriptions, saves jobs in a local database, notifies the message broker of job state changes • Daemon: stages input data from the HTC GridFTP server, submits the job to the queue, monitors the progress of the job, and stages output data to the HTC GridFTP server • Service and Daemon are built in Python, using the Django Object Relational Mapper (ORM) to communicate with the shared underlying database • Django is a stable, open-source project with an active community • Django supports several database backends • Current implementation relies on GridFTP for data transfer and the ALCF Cobalt scheduler • Modular design enables future extension to alternative data transfer mechanisms and schedulers
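As a rough illustration of the shared database behind the Service and Daemon, here is a minimal Django model sketch. The field names and state values are assumptions for illustration only, not the actual BALSAM schema.

    # Minimal sketch of a BALSAM-style job record (assumed fields/states).
    from django.db import models

    class HPCJob(models.Model):
        """One HPC job as seen by both the Service and the Daemon."""

        STATES = [
            ("CREATED", "received from the message broker"),
            ("STAGED_IN", "inputs fetched from the HTC GridFTP server"),
            ("QUEUED", "submitted to the local scheduler (e.g. Cobalt)"),
            ("RUNNING", "running on the compute nodes"),
            ("STAGED_OUT", "outputs pushed to the HTC GridFTP server"),
            ("FAILED", "error in any of the above steps"),
        ]

        panda_id = models.CharField(max_length=64, unique=True)
        executable = models.CharField(max_length=256)
        input_url = models.CharField(max_length=512)
        output_url = models.CharField(max_length=512)
        scheduler_id = models.CharField(max_length=64, blank=True)
        state = models.CharField(max_length=16, choices=STATES,
                                 default="CREATED")
        last_update = models.DateTimeField(auto_now=True)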

  9. Some BALSAM details • Code runs in user space – has a “daemon” mode run by the user • Written in Python, using virtualenv to encapsulate the Python environment • Requires Python 2.6 or later (not yet tested with Python 3) • Adding additional batch queues such as Condor requires some code refactoring by a non-expert (i.e. me) • Incorporating code and ideas from other projects – i.e. AutoPyFactory by John Hover and Jose Caballero • Some of their Condor bits and likely the proxy bits • Can run outside of an HPC – e.g. on my Mac (or Linux machines at work and home) • Useful for code development • Work proceeding on robust error handling and good documentation
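The "daemon mode run by the user" amounts to a plain user-space polling loop rather than a system daemon; the sketch below shows the general shape only. The job source is a stub so that the snippet also runs outside an HPC (e.g. on a laptop), as described above; none of it is the actual BALSAM code.

    # User-space polling loop sketch; fetch_new_jobs() is a stand-in for
    # the real ORM query against the local job database.
    import time

    def fetch_new_jobs():
        return []   # stub: BALSAM would query its Django database here

    def daemon_loop(poll_interval=30):
        while True:
            for job in fetch_new_jobs():
                # Real daemon: stage input data in, submit to the queue,
                # monitor progress, stage output data back out.
                print("would process job:", job)
            time.sleep(poll_interval)

    if __name__ == "__main__":
        daemon_loop(poll_interval=5)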

  10. Moving to the “HPC factory” • Why not have the HPC code start up the PanDA pilots? • PanDA pilots handle the communication between the PanDA system and HPC jobs by communicating with the BALSAM system • Standard ATLAS pilot code with some tweaks can be used in the HPC system • Modularity of the pilot helps here • The APF can be used as a guide

  11. Status of porting the pilot to HPC • Using the VESTA machine at ANL • Current-generation HPC (a one-rack equivalent of the recently dedicated MIRA HPC) • Python 2.6 running on the login node • Pilot code starts up and begins to run • … but … it terminates early due to missing links to the ATLAS code area • Straightforward to fix (need to predefine environment variables and create the needed files) – see the sketch below • Expect other types of issues • Need to agree on how the pilot will get information from the BALSAM system and vice versa
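The "straightforward fix" could look roughly like the following: predefine the environment variables and placeholder files the pilot expects before it starts on the login node. The variable names, file names, and paths shown are assumptions for illustration, and the actual pilot invocation options are omitted.

    # Illustrative pre-pilot setup; names, paths, and files are assumptions.
    import os
    import subprocess

    def setup_pilot_environment(workdir):
        if not os.path.isdir(workdir):
            os.makedirs(workdir)
        # Variables the pilot would normally inherit on a grid worker node
        # (assumed values; VESTA has no ATLAS code area to point at).
        os.environ.setdefault("VO_ATLAS_SW_DIR", "/soft/atlas")
        os.environ.setdefault("X509_USER_PROXY",
                              os.path.join(workdir, "x509_proxy"))

    def launch_pilot(workdir):
        setup_pilot_environment(workdir)
        # Real invocation needs the proper pilot options (queue, site, ...).
        subprocess.call(["python", "pilot.py"], cwd=workdir)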

  12. Other ATLAS code changes • Given that many of the Leadership-class machines in the US do not use x86-64 processors, we will need to modify the ATLAS transforms • New transform routines (“Gen-TF”) will be needed to run on these machines • Effort might exist at ANL to do some of this work • New ANL scientist hire • New US ATLAS Graduate Fellow has some time available for HPC code work • Transformation code work is part of a larger effort to be able to run on more heterogeneous architectures

  13. Where we could use help • The BALSAM code will make a good basis for the HPC factory; it is flexible enough to run on many different HPCs • Need help to make extensions to the existing pilot code • Need help understanding what information the pilot needs and where it gets it, so that we use the same pilot code as everyone else • Need help with writing the new transforms • We likely have the existing worker bees; we just need consultant help

  14. Open issues for a production system • Need federated identity management • The Grid identity system is not used in the HPC domain • Need to strictly regulate who can run on HPC machines • Security, security (need I say more?) • What is the proper scale for the front-end grid cluster? • How many nodes are needed? • How much data needs to be merged?

  15. Conclusions • Many ATLAS MC jobs can be divided into a Grid (HTC) component and an HPC component • We have demonstrated, using existing ATLAS tools, that we can design and build a system to send jobs from the grid to an HPC and back to the grid • Modular design of all components makes it easier to add new HPC sites and to clone the HTC side if needed for scaling reasons • A lightweight yet powerful system is being developed: the beginning of the “HPC factory”

  16. Extra slides

  17. Message Broker system • The system must have large community support beyond just HEP • The solution must be open source (keeps costs manageable) • The message broker system must have good documentation • Scalable • Robust • Secure • Easy to use • Must use a standard protocol (AMQP 0-9-1, for example) • Clients in multiple languages (e.g. Java/Python)

  18. RabbitMQ message broker • ActiveMQ and RabbitMQ were evaluated • Google analytics shows both are equally popular • Benchmark measurements show that the RabbitMQ server outperforms ActiveMQ • Found it easier to handle message routing and our workflow with RabbitMQ • The Pika Python client is easy to use

  19. Basic Message Broker design • Each HPC has multiple permanent durable queues • One queue per activity on the HPC • Grid jobs send messages to the HPC machines through these queues • Each HPC will consume messages from these queues • A routing string is used to direct each message to the proper place • Each grid job will have multiple durable queues • One queue per activity (step in the process) • The grid job creates the queues before sending any message to the HPC queues • On completion of the grid job, its queues are removed • Each HPC cluster publishes messages to these queues through an exchange • A routing string is used to direct each message to the proper place • Each grid job will consume messages only from its own queues • The Grid domains and HPC domains have independent polling loops • Message producer and client code needs to be tweaked for additional robustness (a minimal Pika sketch follows below)
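For illustration, a minimal Pika sketch of the pattern above (a durable queue per HPC activity, with messages directed by a routing string through an exchange) might look like the following. The broker host, exchange, and queue names are assumptions, not the production configuration, and the keyword arguments follow current Pika versions.

    # Minimal Pika sketch: the grid side publishes a persistent message to
    # a durable per-activity queue on the HPC side (names are made up).
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="broker.example.org"))
    channel = connection.channel()

    # Durable direct exchange plus one durable queue per HPC activity,
    # bound by the routing string used to direct messages.
    channel.exchange_declare(exchange="hpc", exchange_type="direct",
                             durable=True)
    channel.queue_declare(queue="vesta.submit", durable=True)
    channel.queue_bind(queue="vesta.submit", exchange="hpc",
                       routing_key="vesta.submit")

    # The grid job describes the work for the HPC side; delivery_mode=2
    # makes the message persistent so it survives a broker restart.
    channel.basic_publish(
        exchange="hpc",
        routing_key="vesta.submit",
        body='{"panda_id": 12345, "activity": "generate"}',
        properties=pika.BasicProperties(delivery_mode=2),
    )
    connection.close()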
