190 likes | 327 Views
Progress on HPC’s (or building “HPC factory” at ANL). Doug Benjamin Duke University. Introduction. People responsible for this effort: Tom LeCompte (ANL –HEP division) Tom Uram (ANL – ALCF – MCS division) Doug Benjamin (Duke University) Work supported by US ATLAS
E N D
Progress on HPC’s(or building “HPC factory” at ANL) Doug Benjamin Duke University
Introduction • People responsible for this effort: • Tom LeCompte (ANL –HEP division) • Tom Uram (ANL – ALCF – MCS division) • Doug Benjamin (Duke University) • Work supported by US ATLAS • Development activities performed at Argonne National Lab Leader Class Facilities (ALCF) • Have a director’s computing allocation for this work • US ATLAS members have allocations at other DOE Leadership Class Facilities (HPC sites) • NERSC, Oak Ridge, BNL
Goals for this effort • Develop a simple and robust system • Scalable • Run on many different HPC sites • How else can we achieve “world domination”? • Seriously – running at many different sites maximizes the benefit to ATLAS. • Transferable • Goal is to make this system deployable/usable by many different people including HEP faculty – not just the enterprising students and computer professionals • Work with existing HEP workload management systems (iePanDA) • Reuse existing code base where ever possible • For Example – use the existing PanDApilot with small tweeks for HPC • Transparent • If you know how to use PanDA then you know how to use HPC via Balsam
HPC Boundary conditions • There are many scientific HPC machines across the US and the world. • Need to design system that is general enough to work on many different machines • Each machine is independent of other • The “grid” side of the equation must aggregate the information • There several different machine architectures • ATLAS jobs will not run unchanged on many of the machines • Need to compile programs for each HPC machine • Memory per node (each with multiple cores) varies from machine to machine • The computational nodes typically do not have connectivity to the Internet • connectivity is through login node/edge machine • Pilot jobs typically can not run directly on the computational nodes • TCP/IP stack missing on many computational nodes
Additional HPC issues • Each HPC machine has its own job management system • Each HPC machine has its own identity management system • Login/Interactive nodes have mechanisms for fetching information and data files • HPC computational nodes are typically MPI • Can get a large number of nodes • The latency between job submission and completion can be variable. (Many other users) • We have think how we can adapt to the HPC’s and not how the HPC’s can adapt to us? • It will give us more opportunities. -
Work Flow • Some ATLAS simulation jobs can be broken up into 3 components • Preparatory phase - Make the job ready for HPC • For example - generate computational grid for Alpgen • Fetch Database files for Simulation • Transfer input files to HPC system • Computational phase – can be done on HPC • Generate events • Simulate events • Post Computational phase (Cleanup) • Collect output files (log files, data files) from HPC jobs • Verify output • Unweight (if needed) and merge files
Current Prototype - (“Grid side”) Infrastructure • APF Pilot factory to submit pilots • Panda queue – currently testing an ANALY QUEUE • Local batch system • Web server provides steering XML files to HPC domain • Message Broker system to exchange information between Grid Domain and HPC domain • Gridftp server to transfer files between HTC domain and HPC domain. • Globus Online might be a good solution here (what are the costs?) • ATLAS DDM Site - SRM and Gridftp server(s). This is a working system
HPC code stack“BALSAM” "Be not distressed, friend," said Don Quixote, "for I will now make the precious balsam with which we shall cure ourselves in the twinkling of an eye." • All credit goes to Tom Uram - ANL • Work on HPC side is performed by two components • Service: Interacts with message broker to retrieve job descriptions, saves jobs in local database, notifies message broker of job state changes • Daemon: Stages input data from HTC GridFTP server, submits job to queue, monitors progress of job, and stages output data to HTC GridFTP server • Service and Daemon are built in Python, using the Django Object Relational Mapper (ORM) to communicate with the shared underlying database • Django is a stable, open-source project with an active community • Django supports several database backends • Current implementation relies on GridFTP for data transfer and the ALCF Cobalt scheduler • Modular design enables future extension to alternative data transfer mechanisms and schedulers
Some BALSAM details • Code runs in user space – has a “daemon” mode run by user • Written in python using the virtualenv system to encapsulate the python environment • Requires python 2.6 or later (not tested with v3.0 yet) • Adding additional batch queues like condor requires some code factorization by a non – expert (ie me) • Incorporating code and ideas from other projects – ieautopyfactory by John Hover and Jose Cabellero • Some of their condor bits and likely the proxy bits • Can run outside of HPC – like on my MAC (or linux machines at work and home) • Useful for code development • Work proceeding on robust error handling and good documentation
Moving to “HPC factory” • Why not have the HPC code startup the Panda pilots? • Panda pilots handle the communication between Panda system and HPC jobs through communication with the BALSAM system. • Standard ATLAS pilot code with some tweaks can be used in HPC system. • Modularity of the pilot helps here • The APF can be used as a guide.
Status of porting pilot to HPC • Using the VESTA machine at ANL • Current generation HPC (on rack equivalent of the recently dedicated MIRA HPC) • Python 2.6 running on login node • Pilot code starts up and begins to run • … but… terminates early due to missing links to ATLAS code area • Straight forward to fix (need to predefine environmental variables and create needed files) • Expect other types of issues. • Need to agree on how the pilot will get information from the BALSAM system and vise-versa
Other ATLAS code changes • Given that many of the Leadership machines in US are not x86-64 based processors will need to modify the ATLAS transforms. • New transform routines “Gen-TF” will be needed to run on these machines. • Effort might exist at ANL to do some of this work • New ANL scientist hire • New US ATLAS Graduate Fellow some time available for HPC code work • Transformation code work part of a larger effort to be able to run on more heterogeneous architectures.
Where we could use help • BALSAM code will make good basis for HPC factory, flexible enough to run on many different HPC • Need help to make extensions to the existing pilot code • Need help understanding what information pilot needs and where it gets it – so we use the same pilot code as everyone else. • Need help with writing the new transforms. • We like have the existing worker bees just need consultant help
Open issues for a production system • Need a federated Identity management • Grid identify system is not used in HPC domain • Need to strictly regulate who can run on HPC machines • Security-Security (need I say more) • What is the proper scale for the Front-End grid cluster? • Now many nodes are needed? • How much data needs to be merged?
Conclusions • Many ATLAS MC jobs can be divided into a Grid (HTC) component and a HPC component • Have demonstrated that using existing ATLAS tools that we can design and build a system to send jobs from grid to HPC and back to Grid • Modular design of all components makes it easier to add new HPC sites and clone the HTC side if needed for scaling reasons. • A lightweight yet powerful system is being developed– the beginning of “HPC factory”
Message Broker system • System must have large community support beyond just HEP • Solution must be open source (Keeps Costs manageable) • Message Broker system must have good documentation • Scalable • Robust • Secure • Easy to use • Must use a standard protocol (AMQP 0-9-1 for example) • Clients in multiple languages (like JAVA/Python)
RabbitMQ message broker • ActiveMQ and RabbitMQ evaluated. • Google analytics shows both are equally popular • Bench mark measurements show that RabbitMQ server out performs ActiveMQ • Found it easier to handle message routing and our work flow • Pika python client easy to use.
Basic Message Broker design • Each HPC has multiple permanent durable queues. • One queue per activity on HPC • Grid jobs send messages to HPC machines through these queues • Each HPC will consume messages from these queues • Routing string is used to direct message to the proper place • Each Grid Job will have multiple durable queues • One queue per activity (Step in process) • Grid job creates the queues before sending any message to HPC queues • On completion of grid job job queues are removed • Each HPC cluster publishes message to these queues through an exchange • Routing string is used to direct message to the proper place • Grid jobs will consume messages the messages only on its queues. • Grid domains and HPC domains have independent polling loops • Message producer and Client code needs to be tweaked for additional robustness