1 / 13

Overview Architecture Performance

The ATLAS Software Installation System v2 Alessandro De Salvo Mayuko Kataoka , Arturo Sanchez Pineda,Yuri Smirnov CHEP 2015. Overview Architecture Performance. LJSFi Overview. LJSFi is an acronym of Light Job Submission Framework

lchris
Download Presentation

Overview Architecture Performance

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The ATLAS Software Installation System v2Alessandro De SalvoMayukoKataoka, Arturo Sanchez Pineda,Yuri SmirnovCHEP 2015 Overview Architecture Performance

  2. LJSFiOverview • LJSFiis an acronym of Light Job Submission Framework • Developed in ATLAS since 2003 as a job submissionframework for the validation of software releases and other software installationrelatedtasks • Evolved with time to cope with the increasedload, the use of the WMS and Panda and for HA • Using a pluginarchitecture, in order to be able to pluganyotherbackend in the future • Multi-VO enabled • LJSFi can handle multiple VOs, even in the same set of servers • Web User Interface • The LJSFimaininterfaceisweb-based • Users can interact with the system in different ways, depending on theirrole • Anonymous usershavelimitedaccess, whileregisteredusers, identified by their personal certificate, have more deepaccess • Fast job turnaround, Scalability and High-Availability • LJSFiisable to cope with hundreds of resources and thousands of releases, with turnaround of the order of minutes in the submissionphase • Horizontalscalabilityisgranted by addingclasses of components to the system • HA isgranted by the DB infrastructure and the embeddedfacilities of the LJSFicomponents 2

  3. LJSFi Components • The main components of the LJSFi infrastructure are • The LJSFi Server • The Request Agents • The Installation agents • The LJSFi Server is built out of different sub-systems • The HA Installation DB • The InfoSys • The Web Interface • The monitoring facilities • The APIs • The Request Agents and Installation Agents are connected to the close Servers • To request and process the tasks • To cleanup and notify the admins • In this configuration the failure of a component is not fatal • Just the logfiles hosted on a failed server won’t be accessible 3

  4. The LJSFi v2 architecture Task Request Agent n Installation AGENT n PanDA … … Task Request Agent n Same or different sites RM Installation AGENT 2 Percona XtraDB Cluster Task Request Agent n InfoSys DDM Via redirect on server Installation AGENT 1 Via redirect on server Install DB LJSFi Server 1 LJSFi Server n … DB1 DB6 … PanDA and DDM (Rucio)clients from CVMFS memcached memcached Same or different sites HA Multi-Master Database Percona XtraDB cluster HA alias WAN Embedded HA facility Other site DB1 DBn … Users 4

  5. LJSFi HA Database • Based on PerconaXtraDB cluster • Extension of the MySQLengine, with WSREP (Write-Set Replication) patches • True multi-master, WAN-enabledengine • A cluster of 9 DB machines in Roma + 2 at CERN • The suggested minimum is 3 to have the quorum, and wewanted to be on the safe side! • More machinesmay be added, even in othersites, for betterredundancy • No powerful machine isneeded, butatleast 4GB of RAM, 100GB of HD and standard network connectivity (butlowerlatencies are better) • VMs are usedat CERN, wherewerun on the Agile Infrastructure, and no performance issuewasseen so far, including the WAN latency • Hosting the mainDBsused by LJSFi • The Installation DB: the source of release definition for CVMFS and full driver for the ATLAS installations • The InfoSys DB: the database used for resourcediscovery and matchmaking 5

  6. LJSFiInfoSys • Used for resourcediscovery and matchmaking • Connected to AGIS (ATLAS GridInformation System) and PanDA • Mirroring the needed AGIS data once every 2 hours (tunable) • Data freshnesschecks • No interactionispossible with the InfoSysif the data ageis > 4h (tunable) • May use more parameters in the matchmakingthan the onescurrentlypresent in AGIS • e.g. OS type/release/version (filled by the installation agents via callback) • Theseparameters can be sent to AGIS ifneeded, aswe do for the CVMFS attributes • Sites can be disabled from the internalmatchmakingifneeded • For example HPC and opportuniticresources (BOINC), whereweshouldnotrunautomatically the validationsassoonaswediscoverthem 6

  7. LJSFiAPIs • LJSFiprovidestwo ways to interact with the servers • The pythonAPIs • The REST APIs • The pythonAPIs are used by the LJSFi CLI • For the end-users • Used by the Installation Agents and Request Agents too • The REST APIsare usedfor a more broadspectrumof activities • Callbacks from runningjobs • Externalmonitoring • CLI commands / Installation Agents • Internal Server activities 7

  8. LJSFiRequest Agents • The LJSFi Request Agents and responsible of discovering new software releases and insert validation requestsinto the DB • Using the infosys and the matchmaker to discover resourcesnot currently offline • Handling the pre-requirements of the tasks, like • installation pre-requisites • OS type, architecture • maximum number of allowed concurrent jobs in the resources (Multicore resources) • … • The Request Agents periodically run on all the releases set in auto deployment mode • Currently the loop is set every 2 hours, but will be shortened as soon as we will bring the request agents to multi-threaded mode 8

  9. LJSFi Installation Agents [1] • Used to follow the whole job lifecycle • Processing Task Requests from the database • Collisions control among multiple agents isperformed by centrallocks on tasks • Pre-tagging site ashaving the given software beforesending the jobs • Only happening if the sites are using CVMFS • Almostall the sites are CVMFS-enabled, with a fewexceptionslike the HPC resources • Job submission • Job status check and output retrieval • Tag handling (AGIS based) • Tags are removed in case of failure of the validationjobs or added/checked in case of success • The installation agents are fully multi-threaded • Able to sendseveraljobs in parallel and follow the otheroperations • In case of problems, timeouts of the operations are providedeither from the embeddedcommandsused or by the generictimeoutfacility in the agents themselves 9

  10. LJSFi Installation Agents [2] • Severalinstallation agents can run in the same site or in differentsites • Each agent islinked to an LJSFi server, butwhenusing an HA alias it can be delocalized • Each server redirect the DB calls via haproxy to the close DB machines • Takingadvantage of the WAN Multi-Master HA properties of the DB cluster • Servingall the ATLAS grids (LCG/EGI, NorduGrid, OSG), the Cloudresources, the HPCs and opportunisticfacilities via Panda • The logfiles of every job iskept for about a month in the system, for debuggingpurposes • Logfiles are sent by the agent to theirconnectedservers • Each server knowswhere the logfiles are and can redirectseverylocallogfilerequest to the appropriate one 10

  11. LJSFi Web Interface • The LJSFi Web interfacehasbeendesigned for simplicity and clearness • https://atlas-install.roma1.infn.it/atlas_install • Most of the Input boxes are usinghintsratherthan combo boxes • Links to AGIS and Panda for the output resources • Friendly page navigation (HTML5) • Online Help • Each server have a separate Web Interface, but the interaction with the system are consistent, whatever server you are using 11

  12. Performance • The system can scale up to more thanseveralthousandsjobs per day • The horizontalscalingisgranted by adding more agents in parallel and increasing the Database cluster nodes • To improve performance a limit on the number of jobshandled by the currentlyrunning agents hasbeen set to 4000 • The systemprocesses new requestsbefore the others, to allow a fast turnaround of urgenttasks • Generallyonly a few minutes are neededbetween the task requests and the actual job submission by the agents • The systemisable to handle a large number of releases and sites • Wecurrentlyhave> 500 differentresources and > 1600 software releases or patcheshandled by the system 12

  13. Conclusions • LJSFiis in use by ATLAS since 2003 • Evolved in time from the WMS to Panda • Open System, multi-VO enabled • The infrastructure can be optimized to be used by severalVOs,evenhosted on the same server • Currentlyhandlingwellall the validationjobs in all the Grid/Cloud/HPC sites of ATLAS (> 500 resources and > 1600 software releases) • LCG/EGI • NorduGrid • OSG • Cloudsites / HPC sites / Opportunisticresources (Boinc) • Fullyfeaturedsystem, able to cope with a big load, scalable and high-available • No single point of failure 13

More Related