Roadmap to AliEn v2-20 A. Abramyan, L. Betev, D. Goyal, A. Grigoras, C. Grigoras, M. Litmaath, N. Manukyan, M. Martinez, J. Porter, P. Saiz, S. Sankar, S. Schreiner
What’s new • Plenty of new improvements • Catalogue simplification • Client UI • Extreme Job Brokering • Removal of PackMan • New JDL fields • Proxy renewal • Job Memory checkup • And baseline for new development
Catalogue Simplification • Up to now, catalogue divided into multiple DBs: • Simplifies scalability • Logic slightly more complicated • Changing username/userid • Smaller tables Thanks Dushyant, Subho
PackMan • Removing the PackMan/PackManMaster services • Functionality stays in client UI/JA • JA can install packages directly • Very powerful if combined with torrent • Speeds up most of the PackMan operations Thanks Narine, Armenuhi
New JDL fields • MaxWaitingTime: amount of time that job can stay in ‘WAITING’ • If time exceeded, job ends up in error • New state: ERROR_EW (Expired Waiting) • Retrial: • Number of times that a single job can be resubmitted • Resubmission done by central services • Reusing JobId in resubmission • Direct removal of KILLED jobs Thanks Miguel
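A minimal sketch of how the two new fields might interact. The names `Job`, `check_waiting` and `resubmit` are hypothetical, MaxWaitingTime is assumed to be in seconds, and it is assumed that any error state is eligible for Retrial; the slides do not show the actual JDL syntax or the AliEn implementation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    max_waiting_time: float   # assumption: seconds the job may stay in WAITING
    retrials_left: int        # assumption: remaining resubmissions (Retrial)
    status: str = "WAITING"
    waiting_since: float = 0.0


def check_waiting(job: Job, now: float) -> Job:
    """Move a job to ERROR_EW once it has waited longer than allowed."""
    if job.status == "WAITING" and now - job.waiting_since > job.max_waiting_time:
        job.status = "ERROR_EW"   # Expired Waiting
    return job


def resubmit(job: Job, now: float) -> bool:
    """Resubmission as the central services might do it, reusing the job id."""
    if not job.status.startswith("ERROR") or job.retrials_left <= 0:
        return False
    job.retrials_left -= 1
    job.status = "WAITING"        # same job_id, back in the queue
    job.waiting_since = now
    return True


job = Job(job_id=1, max_waiting_time=10, retrials_left=1)
check_waiting(job, now=11)        # waited too long
print(job.status)
print(resubmit(job, now=12), job.job_id, job.status)
```

Keeping the same `job_id` across resubmissions means users can follow one identifier for the whole life of the job.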
Extreme Brokering • Postpone splitting of job until last moment • Decide data to be analyzed based on current location of JA & files not analyzed yet • Can define Max/Min number of files to be analyzed • Even if the files are not local • Less subjobs: • Easier merging Thanks Pablo
Current situation • Works nicely if one replica per file • A bit more complex with 3 SE and 2 replicas • And a lot more with 50 SE and 3 replicas (Diagrams: Job Manager splitting the work into an increasing number of JOBs)
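One way to read the growth in complexity above is to count how many distinct sets of SEs can hold the replicas of a file, since each distinct set is a potential separate group when splitting by location. This is an interpretation, not something stated on the slide; the function name `replica_placements` is hypothetical.

```python
from math import comb


def replica_placements(num_se: int, replicas: int) -> int:
    """Number of distinct SE sets that can hold `replicas` copies of a file."""
    return comb(num_se, replicas)


# The three cases named on the slide (the first assumes 3 SEs as well).
for num_se, replicas in [(3, 1), (3, 2), (50, 3)]:
    print(f"{num_se} SE, {replicas} replica(s): "
          f"{replica_placements(num_se, replicas)} possible placements")
```

With 50 SEs and 3 replicas there are 19600 possible placements, compared with 3 for the small case.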
Example • Current schema: submit 4 jobs (diagram: File1, File2, File3, File 4, File 5) • Broker per file: submit 3 empty subjobs • When a job starts, analyze as much as possible (diagram: File1,2,4,5 and File 3) • If nothing left, just exit
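The late-splitting idea in the example can be sketched as follows. The replica locations, the site names and the function `pick_files` are all hypothetical and chosen only so that the outcome matches the example (one subjob takes File1,2,4,5, one takes File3, one exits); the real broker logic is not shown in the slides.

```python
# Hypothetical replica map: file -> sites holding a copy.
REPLICAS: dict[str, set[str]] = {
    "File1": {"A"},
    "File2": {"A", "B"},
    "File3": {"B"},
    "File4": {"A"},
    "File5": {"A", "C"},
}


def pick_files(site: str, remaining: set[str], replicas: dict[str, set[str]],
               min_files: int = 1, max_files: int = 10) -> list[str]:
    """Choose files for a subjob at the moment its job agent starts."""
    # Prefer files that are local and not analyzed yet, up to max_files.
    local = sorted(f for f in remaining if site in replicas[f])
    chosen = local[:max_files]
    # Top up with non-local files if fewer than min_files were found.
    if len(chosen) < min_files:
        remote = sorted(f for f in remaining if site not in replicas[f])
        chosen += remote[: min_files - len(chosen)]
    # Mark the chosen files as taken so later subjobs skip them.
    for f in chosen:
        remaining.discard(f)
    return chosen


remaining = set(REPLICAS)
for site in ["A", "B", "C"]:          # three empty subjobs start one by one
    files = pick_files(site, remaining, REPLICAS)
    print(site, files if files else "nothing left, exit")
```

Because the decision is taken when the job agent starts, a file with several replicas is simply analyzed by whichever site gets there first.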
Proxy renewal system • Replaces vobox-proxy-renewal service • Can receive ‘validity’ or proxies • Simplifies CREAM-CE job submission • No corruption of proxies • Can be started by non-root user • Already deployed at CERN • And for some CMS sites… • Can already be deployed Thanks Maarten
New development • More than 1 year since last major update • Some backward incompatible changes • Change of catalogue schema • What to do with new requests, bugs: • Debug current system? • Debug in new version? • Both!
AliEn deployment for ALICE • 80 sites: AliEn v2-19.(80-163) • Central Services, 8 machines: AliEn v2-19** • 12 machines: AliEn v2-19**, v2-17 • 3 machines (+1 slave, backups): AliEn v2-17 • 40.000 wn: AliEn v2-19.(80-163) (Diagram components: vobox, catalogue, aliensh, Api, TaskQueue, Transfers, ROOT, LDAP, BACKUP, JA)
How to test new versions… • Build system: • Multiple platforms • Integration & basic functionality tests • No API/access from ROOT tests • Similar to the AliRoot, ROOT build systems • Running the whole system on a single machine • http://alienbuild.cern.ch:8888
Already deployed for PANDA • Running since September • 12th PANDA Grid Workshop and 2nd AliEn Developers Week • Multiple sites, smaller load than ALICE • No API services • ‘Old’ v2.20 version Thanks PANDA
Previous major update • Stopping the whole system • 1 week to redeploy • 1 month ironing out details Not an option!
Second set of services (Diagram: two parallel sets of Central Services, each with CE, catalogue, aliensh, Api, TaskQueue, Transfers, ROOT, LDAP and JA)
Second set of services • Copy of the catalogue • 3 different central machines, 3 voboxes, same SE • What to do with output • Throw away (easiest) • Incorporate back (easy if output in a different directory)
Timeline (Mar, Apr, May) • Now: • 1 week: Investigate test system • 1 week: Test Catalogue migration • 1 week: Define New VO • 1 week: Verify quotas • 1 month: New hardware for CS • 2 days: Central deployment from backup • 3 days: First site working (CERN) • 2 weeks: At least 2 external sites (CCIN2P3, ?) • After that works, keep adding sites • 2 months: • 1 day: Switch VO • 1 day: Overall site upgrade
Summary • AliEn v2.20 ready for deployment • With plenty of new features and bug fixes • Minimize upgrade downtime • Create testing setup with several sites, and with all the SE • More effort on testing (also from site admins) • Deploy Test VO with ALICE sites • And say goodbye to v2-19 in two months Thank you!!
Job execution (Diagram: TaskQueue with Job Manager and JOBs; Job Broker; Sites A, B and C, each with CE, JA, MonALISA and xrootd; File catalogue with LFN, GUID and Meta data)