
Progress on HPC’s (or building “HPC factory” at ANL)


Presentation Transcript


  1. Progress on HPC’s (or building “HPC factory” at ANL) Doug Benjamin, Duke University

  2. Introduction • People responsible for this effort: • Tom LeCompte (ANL – HEP division) • Tom Uram (ANL – ALCF – MCS division) • Doug Benjamin (Duke University) • Work supported by US ATLAS • Development activities performed at the Argonne Leadership Computing Facility (ALCF) • Have a director’s computing allocation for this work • US ATLAS members have allocations at other DOE Leadership Class Facilities (HPC sites) • NERSC, Oak Ridge, BNL

  3. Goals for this effort • Develop a simple and robust system • Scalable • Run on many different HPC sites • How else can we achieve “world domination”? • Seriously – running at many different sites maximizes the benefit to ATLAS • Transferable • Goal is to make this system deployable/usable by many different people, including HEP faculty – not just the enterprising students and computer professionals • Work with existing HEP workload management systems (i.e. PanDA) • Reuse the existing code base wherever possible • For example – use the existing PanDA pilot with small tweaks for HPC • Transparent • If you know how to use PanDA, then you know how to use HPC via BALSAM

  4. HPC Boundary conditions • There are many scientific HPC machines across the US and the world • Need to design a system that is general enough to work on many different machines • Each machine is independent of the others • The “grid” side of the equation must aggregate the information • There are several different machine architectures • ATLAS jobs will not run unchanged on many of the machines • Need to compile programs for each HPC machine • Memory per node (each with multiple cores) varies from machine to machine • The computational nodes typically do not have connectivity to the Internet • Connectivity is through a login node/edge machine • Pilot jobs typically cannot run directly on the computational nodes • TCP/IP stack is missing on many computational nodes

  5. Additional HPC issues • Each HPC machine has its own job management system • Each HPC machine has its own identity management system • Login/interactive nodes have mechanisms for fetching information and data files • HPC computational nodes are typically used through MPI • Can get a large number of nodes • The latency between job submission and completion can be variable (many other users) • We have to think about how we can adapt to the HPCs, not how the HPCs can adapt to us • This will give us more opportunities

  6. Work Flow • Some ATLAS simulation jobs can be broken up into three components • Preparatory phase – make the job ready for HPC • For example – generate the computational grid for Alpgen • Fetch database files for Simulation • Transfer input files to the HPC system • Computational phase – can be done on the HPC • Generate events • Simulate events • Post-computational phase (cleanup) • Collect output files (log files, data files) from HPC jobs • Verify output • Unweight (if needed) and merge files
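A minimal Python sketch of this three-phase split is given below. The function names and bodies are illustrative placeholders, not ATLAS or BALSAM code; they only mark where the preparatory, computational, and post-computational logic would plug in.

    # Hypothetical three-phase driver; all names below are assumptions.

    def preparatory_phase(job):
        # Grid/HTC side: e.g. generate the Alpgen computational grid,
        # fetch database files for Simulation, stage inputs to the HPC.
        print("preparing inputs for", job)
        return ["input.tar.gz"]

    def computational_phase(job, inputs):
        # HPC side: generate and simulate events on the compute nodes.
        print("running", job, "on", inputs)
        return ["events.root", "job.log"]

    def post_computational_phase(outputs):
        # Grid/HTC side: collect log/data files, verify the output,
        # unweight (if needed) and merge files.
        print("verifying and merging", outputs)

    if __name__ == "__main__":
        job = "alpgen_example"
        outputs = computational_phase(job, preparatory_phase(job))
        post_computational_phase(outputs)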

  7. Current Prototype – (“Grid side”) Infrastructure • APF pilot factory to submit pilots • PanDA queue – currently testing an ANALY queue • Local batch system • Web server provides steering XML files to the HPC domain • Message broker system to exchange information between the Grid domain and the HPC domain • GridFTP server to transfer files between the HTC domain and the HPC domain • Globus Online might be a good solution here (what are the costs?) • ATLAS DDM site – SRM and GridFTP server(s) • This is a working system
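Since file movement between the HTC and HPC domains uses GridFTP, one simple way to drive it from the grid side is a thin wrapper around the standard globus-url-copy client, as in the hedged sketch below. The hostnames, paths, and option usage are illustrative assumptions, not the prototype's actual configuration.

    # Hedged sketch: thin Python wrapper around the globus-url-copy client.
    import subprocess

    def gridftp_copy(src_url, dst_url):
        """Copy one file between the HTC and HPC GridFTP endpoints."""
        subprocess.check_call([
            "globus-url-copy",
            "-cd",       # create missing destination directories
            src_url,
            dst_url,
        ])

    # Example (hostnames/paths are made up): stage an input tarball
    # from the grid-side GridFTP server to the HPC side.
    gridftp_copy(
        "gsiftp://htc.example.org/data/input.tar.gz",
        "gsiftp://hpc-login.example.org/scratch/input.tar.gz",
    )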

  8. HPC code stack “BALSAM” • "Be not distressed, friend," said Don Quixote, "for I will now make the precious balsam with which we shall cure ourselves in the twinkling of an eye." • All credit goes to Tom Uram (ANL) • Work on the HPC side is performed by two components • Service: interacts with the message broker to retrieve job descriptions, saves jobs in a local database, notifies the message broker of job state changes • Daemon: stages input data from the HTC GridFTP server, submits the job to the queue, monitors the progress of the job, and stages output data to the HTC GridFTP server • Service and Daemon are built in Python, using the Django Object Relational Mapper (ORM) to communicate with the shared underlying database • Django is a stable, open-source project with an active community • Django supports several database backends • Current implementation relies on GridFTP for data transfer and the ALCF Cobalt scheduler • Modular design enables future extension to alternative data transfer mechanisms and schedulers
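As a rough illustration of the shared database behind the Service and Daemon, here is a minimal Django model sketch. The field names and state values are assumptions for illustration only, not the actual BALSAM schema.

    # Minimal sketch of a BALSAM-style job record (assumed fields/states).
    from django.db import models

    class HPCJob(models.Model):
        """One HPC job as seen by both the Service and the Daemon."""

        STATES = [
            ("CREATED", "received from the message broker"),
            ("STAGED_IN", "inputs fetched from the HTC GridFTP server"),
            ("QUEUED", "submitted to the local scheduler (e.g. Cobalt)"),
            ("RUNNING", "running on the compute nodes"),
            ("STAGED_OUT", "outputs pushed to the HTC GridFTP server"),
            ("FAILED", "error in any of the above steps"),
        ]

        panda_id = models.CharField(max_length=64, unique=True)
        executable = models.CharField(max_length=256)
        input_url = models.CharField(max_length=512)
        output_url = models.CharField(max_length=512)
        scheduler_id = models.CharField(max_length=64, blank=True)
        state = models.CharField(max_length=16, choices=STATES,
                                 default="CREATED")
        last_update = models.DateTimeField(auto_now=True)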

  9. Some BALSAM details • Code runs in user space – has a “daemon” mode run by the user • Written in Python, using virtualenv to encapsulate the Python environment • Requires Python 2.6 or later (not yet tested with Python 3) • Adding additional batch queues such as Condor requires some code refactoring by a non-expert (i.e. me) • Incorporating code and ideas from other projects – i.e. AutoPyFactory by John Hover and Jose Caballero • Some of their Condor bits and likely the proxy bits • Can run outside of an HPC – e.g. on my Mac (or Linux machines at work and home) • Useful for code development • Work proceeding on robust error handling and good documentation
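The "daemon mode run by the user" amounts to a plain user-space polling loop rather than a system daemon; the sketch below shows the general shape only. The job source is a stub so that the snippet also runs outside an HPC (e.g. on a laptop), as described above; none of it is the actual BALSAM code.

    # User-space polling loop sketch; fetch_new_jobs() is a stand-in for
    # the real ORM query against the local job database.
    import time

    def fetch_new_jobs():
        return []   # stub: BALSAM would query its Django database here

    def daemon_loop(poll_interval=30):
        while True:
            for job in fetch_new_jobs():
                # Real daemon: stage input data in, submit to the queue,
                # monitor progress, stage output data back out.
                print("would process job:", job)
            time.sleep(poll_interval)

    if __name__ == "__main__":
        daemon_loop(poll_interval=5)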

  10. Moving to the “HPC factory” • Why not have the HPC code start up the PanDA pilots? • PanDA pilots handle the communication between the PanDA system and HPC jobs by communicating with the BALSAM system • Standard ATLAS pilot code with some tweaks can be used in the HPC system • Modularity of the pilot helps here • The APF can be used as a guide

  11. Status of porting the pilot to HPC • Using the VESTA machine at ANL • Current-generation HPC (a one-rack equivalent of the recently dedicated MIRA HPC) • Python 2.6 running on the login node • Pilot code starts up and begins to run • … but … it terminates early due to missing links to the ATLAS code area • Straightforward to fix (need to predefine environment variables and create the needed files) – see the sketch below • Expect other types of issues • Need to agree on how the pilot will get information from the BALSAM system and vice versa
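The "straightforward fix" could look roughly like the following: predefine the environment variables and placeholder files the pilot expects before it starts on the login node. The variable names, file names, and paths shown are assumptions for illustration, and the actual pilot invocation options are omitted.

    # Illustrative pre-pilot setup; names, paths, and files are assumptions.
    import os
    import subprocess

    def setup_pilot_environment(workdir):
        if not os.path.isdir(workdir):
            os.makedirs(workdir)
        # Variables the pilot would normally inherit on a grid worker node
        # (assumed values; VESTA has no ATLAS code area to point at).
        os.environ.setdefault("VO_ATLAS_SW_DIR", "/soft/atlas")
        os.environ.setdefault("X509_USER_PROXY",
                              os.path.join(workdir, "x509_proxy"))

    def launch_pilot(workdir):
        setup_pilot_environment(workdir)
        # Real invocation needs the proper pilot options (queue, site, ...).
        subprocess.call(["python", "pilot.py"], cwd=workdir)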

  12. Other ATLAS code changes • Given that many of the Leadership-class machines in the US do not use x86-64 processors, we will need to modify the ATLAS transforms • New transform routines (“Gen-TF”) will be needed to run on these machines • Effort might exist at ANL to do some of this work • New ANL scientist hire • New US ATLAS Graduate Fellow has some time available for HPC code work • Transformation code work is part of a larger effort to be able to run on more heterogeneous architectures

  13. Where we could use help • The BALSAM code will make a good basis for the HPC factory; it is flexible enough to run on many different HPCs • Need help to make extensions to the existing pilot code • Need help understanding what information the pilot needs and where it gets it, so that we use the same pilot code as everyone else • Need help with writing the new transforms • We likely have the existing worker bees; we just need consultant help

  14. Open issues for a production system • Need federated identity management • The Grid identity system is not used in the HPC domain • Need to strictly regulate who can run on HPC machines • Security, security (need I say more?) • What is the proper scale for the front-end grid cluster? • How many nodes are needed? • How much data needs to be merged?

  15. Conclusions • Many ATLAS MC jobs can be divided into a Grid (HTC) component and an HPC component • We have demonstrated, using existing ATLAS tools, that we can design and build a system to send jobs from the grid to an HPC and back to the grid • Modular design of all components makes it easier to add new HPC sites and to clone the HTC side if needed for scaling reasons • A lightweight yet powerful system is being developed: the beginning of the “HPC factory”

  16. Extra slides

  17. Message Broker system • The system must have large community support beyond just HEP • The solution must be open source (keeps costs manageable) • The message broker system must have good documentation • Scalable • Robust • Secure • Easy to use • Must use a standard protocol (AMQP 0-9-1, for example) • Clients in multiple languages (e.g. Java/Python)

  18. RabbitMQ message broker • ActiveMQ and RabbitMQ were evaluated • Google analytics shows both are equally popular • Benchmark measurements show that the RabbitMQ server outperforms ActiveMQ • Found it easier to handle message routing and our workflow with RabbitMQ • The Pika Python client is easy to use

  19. Basic Message Broker design • Each HPC has multiple permanent durable queues • One queue per activity on the HPC • Grid jobs send messages to the HPC machines through these queues • Each HPC will consume messages from these queues • A routing string is used to direct each message to the proper place • Each grid job will have multiple durable queues • One queue per activity (step in the process) • The grid job creates the queues before sending any message to the HPC queues • On completion of the grid job, its queues are removed • Each HPC cluster publishes messages to these queues through an exchange • A routing string is used to direct each message to the proper place • Each grid job will consume messages only from its own queues • The Grid domains and HPC domains have independent polling loops • Message producer and client code needs to be tweaked for additional robustness (a minimal Pika sketch follows below)
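For illustration, a minimal Pika sketch of the pattern above (a durable queue per HPC activity, with messages directed by a routing string through an exchange) might look like the following. The broker host, exchange, and queue names are assumptions, not the production configuration, and the keyword arguments follow current Pika versions.

    # Minimal Pika sketch: the grid side publishes a persistent message to
    # a durable per-activity queue on the HPC side (names are made up).
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="broker.example.org"))
    channel = connection.channel()

    # Durable direct exchange plus one durable queue per HPC activity,
    # bound by the routing string used to direct messages.
    channel.exchange_declare(exchange="hpc", exchange_type="direct",
                             durable=True)
    channel.queue_declare(queue="vesta.submit", durable=True)
    channel.queue_bind(queue="vesta.submit", exchange="hpc",
                       routing_key="vesta.submit")

    # The grid job describes the work for the HPC side; delivery_mode=2
    # makes the message persistent so it survives a broker restart.
    channel.basic_publish(
        exchange="hpc",
        routing_key="vesta.submit",
        body='{"panda_id": 12345, "activity": "generate"}',
        properties=pika.BasicProperties(delivery_mode=2),
    )
    connection.close()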
