AliEn v2-20 A. Abramyan, L. Betev, D. Goyal, A. Grigoras, C. Grigoras, M. Litmaath, N. Manukyan, M. Martinez, J. Porter, P. Saiz, S. Sankar, S. Schreiner
Content • New features in v2-20 • TaskQueue • Catalogue • Service communication • Deployment • Summary
Database Layout • Single DB • InnoDB tables • Row locking • Foreign keys • Transactions • not used… • Lookup tables • 2 JDLs per job • JDL fields mapped to columns • Link to full graph
Brokering • Avoid ClassAd matching • Fewer fields to parse • Match in a single SQL statement • Two attempts at matching: • With packages already installed • With any packages • (Add a third attempt with remote data?)
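The two-attempt match described above can be sketched as follows. This is a minimal illustration, not the AliEn implementation: the `waiting_jobs` table, its columns (`ttl`, `disk_mb`, `package`), the package names and the `match_job` helper are all assumptions made for the example, and SQLite stands in for the production database.

```python
import sqlite3

# Hypothetical queue table: job requirements stored as plain columns,
# so a match is one SELECT instead of parsing and comparing ClassAds.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE waiting_jobs (
    job_id  INTEGER PRIMARY KEY,
    ttl     INTEGER NOT NULL,   -- seconds the job needs
    disk_mb INTEGER NOT NULL,   -- scratch space the job needs
    package TEXT    NOT NULL    -- software package the job needs
);
INSERT INTO waiting_jobs VALUES
    (1, 3600,  500, 'pkgA'),
    (2, 7200,  500, 'pkgB'),
    (3,  600, 9000, 'pkgB');
""")

def match_job(conn, ttl, disk_mb, installed):
    """Return (job_id, attempt) for the first job the worker can run."""
    base = "SELECT job_id FROM waiting_jobs WHERE ttl <= ? AND disk_mb <= ?"
    tail = " ORDER BY job_id LIMIT 1"
    # Attempt 1: only jobs whose package is already installed on the worker.
    if installed:
        marks = ",".join("?" for _ in installed)
        row = conn.execute(base + f" AND package IN ({marks})" + tail,
                           (ttl, disk_mb, *installed)).fetchone()
        if row:
            return row[0], "installed"
    # Attempt 2: any package (the worker would have to install it first).
    row = conn.execute(base + tail, (ttl, disk_mb)).fetchone()
    return (row[0], "any") if row else (None, None)

first = match_job(conn, 8000, 1000, ["pkgB"])   # job 2 needs pkgB, which is installed
second = match_job(conn, 4000, 1000, ["pkgB"])  # only job 1 fits, and it needs pkgA
third = match_job(conn, 100, 100, [])           # nothing fits
```

Each attempt is a single statement, so the database does the filtering and the broker never parses a full job description.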
File brokering • Current schema: • Submit 4 jobs (diagram labels: File1, File2, File3, File4, File5) • Broker per file: • Submit 3 empty subjobs • When a job starts, analyze as much as possible (diagram labels: File1,2,4,5 and File3) • If nothing left, just exit
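A rough sketch of the "broker per file" idea: subjobs are submitted empty and claim their input files only when they start. The `FilePool` class, the site names and the rule that a subjob claims every unclaimed file located at its own site are assumptions chosen to reproduce the grouping on the slide; the real assignment criteria are not stated here.

```python
from dataclasses import dataclass, field

@dataclass
class FilePool:
    # file name -> site holding it (hypothetical layout)
    location: dict
    claimed: set = field(default_factory=set)

    def claim_for(self, site):
        """Give the calling subjob every still-unclaimed file at `site`."""
        files = [f for f, s in self.location.items()
                 if s == site and f not in self.claimed]
        self.claimed.update(files)
        return files

def run_subjob(pool, site):
    """An empty subjob starts at `site` and takes as much as it can."""
    files = pool.claim_for(site)
    if not files:
        return "exit: nothing left"
    return "analyze " + ",".join(files)

pool = FilePool({"File1": "siteA", "File2": "siteA", "File3": "siteB",
                 "File4": "siteA", "File5": "siteA"})

# Three empty subjobs start, in this order, at siteA, siteB and siteA.
results = [run_subjob(pool, site) for site in ("siteA", "siteB", "siteA")]
```

With this layout the first subjob takes File1, File2, File4 and File5, the second takes File3, and the third finds nothing left and exits.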
More TaskQueue • MaxWaitingTime: maximum time that a job can stay in ‘WAITING’ • If the time is exceeded, the job ends up in error • New state: ERROR_EW (Expired Waiting) • Retrial: • Number of times that a single job can be resubmitted • Resubmission done by central services • Reusing JobId in resubmission • Direct removal of KILLED jobs
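The waiting-time limit and the retry counter can be pictured with a small state sketch. The `Job` fields and both helper functions are hypothetical. The sketch also assumes that any `ERROR*` state, including `ERROR_EW`, is eligible for resubmission, which the slide does not confirm.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    status: str = "WAITING"
    waiting_since: float = 0.0       # timestamp when the job entered WAITING
    max_waiting_time: float = 3600.0 # seconds allowed in WAITING
    retries_left: int = 0            # how many resubmissions remain

def expire_waiting(job, now):
    """Move a job to ERROR_EW once it has waited longer than allowed."""
    if job.status == "WAITING" and now - job.waiting_since > job.max_waiting_time:
        job.status = "ERROR_EW"
    return job.status

def resubmit(job, now):
    """Central-services side: put a failed job back in the queue,
    keeping the same job_id, while retries remain."""
    if job.status.startswith("ERROR") and job.retries_left > 0:
        job.retries_left -= 1
        job.status = "WAITING"
        job.waiting_since = now
        return True
    return False

job = Job(job_id=42, max_waiting_time=100.0, retries_left=1)
s1 = expire_waiting(job, now=50.0)    # still within the limit
s2 = expire_waiting(job, now=150.0)   # limit exceeded
r1 = resubmit(job, now=150.0)         # one retry available
s3 = expire_waiting(job, now=300.0)   # expires again
r2 = resubmit(job, now=300.0)         # no retries left
```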
Some results… • DB time to insert a job and apply 8 status changes • Time to process all 230M ALICE jobs: 4.8 days
Service communication • Replacing SOAP with JSON • Less overhead (no XML encoding) • Easier to interact with other clients • And even from a web browser • Backward incompatible change
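To give a feel for the "less overhead" point, the sketch below encodes the same call once as JSON and once as a SOAP-style XML envelope and compares the sizes. The function name `getJobStatus`, its arguments and the `arg0`/`arg1` element names are made up for the example and are not taken from the AliEn API. The envelope is a bare-bones illustration, not a complete SOAP message.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical remote call: a function name plus positional arguments.
call = {"function": "getJobStatus", "args": [12345, "alice"]}

# JSON encoding: one standard-library call in each direction.
json_payload = json.dumps(call).encode("utf-8")
decoded = json.loads(json_payload)

# SOAP-style encoding: the same call wrapped in an XML envelope.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
function = ET.SubElement(body, call["function"])
for i, arg in enumerate(call["args"]):
    ET.SubElement(function, f"arg{i}").text = str(arg)
soap_payload = ET.tostring(envelope, encoding="utf-8")

print(len(json_payload), "bytes as JSON")
print(len(soap_payload), "bytes as SOAP-style XML")
```

The JSON form also round-trips to native lists and numbers, whereas the XML form carries every argument as text that the receiver must convert back.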
SOAP vs JSON • Apache web server • 32 hosts for clients • 16 cores • 8000 calls per client
Catalogue • InnoDB tables • Row locking • Transactions • Foreign keys
Deployment • All the features already deployed on ALICE_TEST • Instead of one single big-bang release, divide it into three: • TaskQueue • JSON • Catalogue • Reduces the amount of downtime • Increases the complexity of deployment…
Central Services [diagram of the installation] • 80 sites: AliEn v2-19.(80-163) • 40,000 worker nodes (wn): AliEn v2-19.(80-163) • 8 machines: AliEn v2-19** • 12 machines: AliEn v2-19**, v2-17 • 3 machines (+1 slave, backups): AliEn v2-17 • Components shown: vobox, catalogue, aliensh, Api, TaskQueue, Transfers, ROOT, LDAP, BACKUP, JA
Deployment of TaskQueue • Only needed on the central services • Database migration of 1 hour (24 GB) • Already done! • Monday, 1st Oct • Downtime of 12 hours • Method: • Install new version • Stop services • Convert DB • Start services
Deployment of JSON • Full deployment • Once Central Services are updated, old installations won’t be able to connect • No database migration • Plan: • Install new version everywhere • Stop all services • Restart everything with new version • When: • ?
Deployment of catalogue • Only needed on central services • Very delicate operation • Database migration of 24 hours • 430 GB, 290 big tables • Plan: • Prepare a hybrid version • Install v2-20 and hybrid • Restart services with hybrid • Convert DB • Restart services with v2-20 • When: • ?
Summary • Parts of AliEn v2-20 already deployed! • TaskQueue speed improved drastically • Insertion 40 times faster • Resubmission 20 times faster • Improved concurrency • Need to schedule 2 more upgrades • JSON: improved service communication • New catalogue layout