DQ2 status & plans BNL workshop October 3, 2007
Outline Focusing mostly on site services • Releases • Observations from 0.3.x • 0.4.x • Future plans
Release status • 0.3.2 (‘stable’) in use on OSG [ I think :-) ] • 0.4.0 being progressively rolled out • and 0.4.0_rc8 in use on all LCG sites • 0.3.x was focused on moving the central catalogues to a better DB backend • the site services were essentially the same as in 0.2.12 • 0.4.x is about the site services only • in fact, the 0.4.x version number applies to the site services only; clients remain on 0.3.x
Observations from 0.3.x • Some problems were solved: • overloading of the central catalogues • big datasets • choice of the next files to transfer • … but this introduced another problem: • as we were more ‘successful’ at filling up the site services’ queue of files… • we observed the site services being overloaded with expensive ORDERing queries (e.g. to choose which files to transfer next - see the sketch below)
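To make the bottleneck concrete, here is a minimal sketch of the kind of ordering query involved. The schema and column names are hypothetical (not the actual DQ2 0.3.x tables), and sqlite3 is used only so the snippet is self-contained; DQ2 runs on MySQL. Every agent re-ordering the whole per-site queue on each pass is what becomes expensive once the queue holds hundreds of thousands of rows.

# Hypothetical schema, illustration only -- not the real DQ2 0.3.x tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_queue (
        file_id   TEXT PRIMARY KEY,
        dataset   TEXT,
        priority  INTEGER,
        queued_at REAL,
        attempts  INTEGER
    )
""")

def next_files_to_transfer(conn, limit=100):
    """Orders the full queue on every call -- the pattern that overloaded MySQL."""
    return conn.execute(
        """SELECT file_id FROM file_queue
           ORDER BY priority DESC, attempts ASC, queued_at ASC
           LIMIT ?""",
        (limit,),
    ).fetchall()

print(next_files_to_transfer(conn))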
Observations from 0.3.x • The ramp-up in simulation production followed by poor throughput created a large backlog • Also, the site services were (still are) insufficiently protected against requests that are impossible to fulfil: • “transfer this dataset, but keep polling for data to be created - or streamed from another site” … but the data never came / was never produced / was subscribed from a source that was never meant to have it…
Observations from 0.3.x • The most relevant changes to the site services were to handle large datasets • less important for PanDA production, but very relevant for LCG, where nearly all datasets were large
Implementation of 0.3.x • The implementation was also too simplistic for the load we observed: • the MySQL database servers were not extensively tuned, there was too much reliance on the database for IPC, and heavy overload came from ‘polling’ FTS and the LFCs (e.g. no GSI session reuse for LFC) • loads >10 were common (I/O+CPU), in particular with the MySQL database on the same machine (~10 processes doing GSI plus mysqld doing ORDER BYs)
Deployment of 0.3.x • FTS @ T1s usually runs on high-performing hardware, with the FTS agents split from the database • … but FTS usually holds only a few thousand files in the system at a typical T1 • DQ2 is usually deployed with the DB co-located with the site services • and its queue is hundreds of thousands of files at a typical T1 • DQ2 is the system throttling FTS • where the expensive brokering decisions are made and where the larger queue is maintained (see the sketch below)
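A minimal sketch of this throttling relationship, with hypothetical names (the function and the per-channel limit are illustrative, not the real DQ2/FTS interfaces): DQ2 keeps the large queue in its own database and only hands FTS a small batch per channel, so FTS itself never holds more than a few thousand files.

# Hypothetical names and limits, illustration only.
from collections import deque

MAX_FTS_FILES_PER_CHANNEL = 2000   # assumed cap, for illustration

def feed_channel(dq2_queue, fts_active):
    """Pop just enough files from DQ2's large queue to top up FTS on one channel."""
    free_slots = max(0, MAX_FTS_FILES_PER_CHANNEL - fts_active)
    batch = []
    while dq2_queue and len(batch) < free_slots:
        batch.append(dq2_queue.popleft())
    return batch   # these would become one FTS submission

# hundreds of thousands of files queued in DQ2, a few thousand at a time in FTS
queue = deque("file%d" % i for i in range(300000))
print(len(feed_channel(queue, fts_active=1500)))   # -> 500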
Evolution of 0.3.x • During 0.3.x, the solution ended up being to simplify 0.3.x, in particular by reducing its ORDERing • Work on 0.4.x had already started • to fix the database interactions while maintaining essentially the same logic
0.4.x • New DB schema, with more reliance on newer MySQL features (e.g. triggers - see the sketch below) and less on expensive features (e.g. foreign key constraints) • Able to sustain and order large queues (e.g. FZK/LYON are running with >1.5M files in their queues) • One instance of DQ2 0.4.x is used to serve ALL M4 + Tier-0 test data • One instance of DQ2 0.4.x is used to serve 26 sites (FZK + T2s AND LYON + T2s)
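As an illustration of the trigger-based approach, here is a minimal sketch with hypothetical table and column names (not the real 0.4.x schema): instead of enforcing foreign keys and re-counting with JOINs, a MySQL trigger keeps a per-subscription counter of remaining files up to date.

# Hypothetical table/column names, illustration only -- not the real 0.4.x schema.
FILE_DONE_TRIGGER = """
CREATE TRIGGER file_done AFTER UPDATE ON file_queue
FOR EACH ROW
BEGIN
  IF NEW.state = 'DONE' AND OLD.state <> 'DONE' THEN
    UPDATE subscription
       SET files_left = files_left - 1
     WHERE sub_id = NEW.sub_id;
  END IF;
END
"""

# would be installed once at deployment time through the MySQL API, e.g.:
# cursor.execute(FILE_DONE_TRIGGER)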
What remains to be done • Need another (smaller) iteration on channel allocation and transfer ordering • e.g. in-memory buffers to prevent I/O on the database, creating temporary queues in front of each channel with tentative ‘best files to transfer next’ (see the sketch below) • Some work remains to make the services resilient to failures • e.g. a dropped MySQL connection • Still need to tackle some ‘holes’ • e.g. queues of files for which we cannot find replicas may still grow forever • if a replica does appear for one of them, the system may take too long to consider that file for transfer • … but we have already introduced BROKEN subscriptions to release some load
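A minimal sketch of such a per-channel in-memory buffer, with hypothetical names and sizes: the expensive ordering query is run only when a channel's buffer runs dry, instead of once per file submission.

# Hypothetical names and buffer size, illustration only.
from collections import defaultdict

BUFFER_SIZE = 200   # assumed buffer size per channel

class ChannelBuffers:
    def __init__(self, fetch_ordered_batch):
        # fetch_ordered_batch(channel, n) -> list of file ids, ordered by the DB
        self._fetch = fetch_ordered_batch
        self._buffers = defaultdict(list)

    def next_file(self, channel):
        buf = self._buffers[channel]
        if not buf:                      # only hit the database when empty
            buf.extend(self._fetch(channel, BUFFER_SIZE))
        return buf.pop(0) if buf else None

# usage with a fake fetcher standing in for the database query:
buffers = ChannelBuffers(lambda ch, n: ["%s:file%d" % (ch, i) for i in range(n)])
print(buffers.next_file("CERN-BNL"))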
What remains to be done • More local monitoring of the site services • work nearly completed • will be deployed when we are confident it does not cause any harm to the database • we still observe deadlocks • next slides…
Expected patches to 0.4.x • The 0.4.x branch will continue to focus on the site services only • channel allocation, source replica lookup and submit queues + monitoring • Still, the DDM problem as a whole can only be solved by having LARGE files • while we need to sustain queues with MANY files, if we continue with the current file size the “per-event transfer throughput” will remain very inefficient (see the illustration below) • plus more aggressive policies on denying/forgetting about subscription requests
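As a rough illustration of the file-size point, a toy calculation (all numbers below are assumptions for illustration, not measurements): each file transfer pays a roughly fixed per-file overhead (SRM negotiation, GSI handshake, FTS bookkeeping), so for the same number of events, larger files waste far less time on overhead.

# Toy model, illustration only -- every number is an assumption.
PER_FILE_OVERHEAD_S = 30.0    # assumed fixed cost per file transfer
BANDWIDTH_MB_S = 50.0         # assumed channel throughput
EVENTS_TOTAL = 1000000
EVENT_SIZE_MB = 1.6           # assumed per-event size

def total_transfer_time_hours(file_size_mb):
    n_files = EVENTS_TOTAL * EVENT_SIZE_MB / file_size_mb
    seconds = n_files * (PER_FILE_OVERHEAD_S + file_size_mb / BANDWIDTH_MB_S)
    return seconds / 3600.0

for size_mb in (100, 1000, 5000):
    print("%5d MB files -> %6.1f hours" % (size_mb, total_transfer_time_hours(size_mb)))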
After 0.4.x • 0.5.x will include an extension to the central catalogues • location catalogue only • This change follows requests from user analysis and LCG production • The goal is to provide a central overview of incomplete datasets (which files are missing) • but also to handle dataset deletion (returning the list of files to delete at a site, coping with overlapping datasets - quite a hard problem! see the sketch below) • First integration efforts (the prototype is now complete) are expected to begin mid-Nov
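A minimal sketch of the core of the deletion problem, with a hypothetical interface: a file may only be deleted at a site if no other dataset still resident there contains it, which is what makes overlapping datasets awkward.

# Hypothetical interface, illustration only.
def files_to_delete(dataset_files, other_resident_datasets):
    """dataset_files: set of files in the dataset being deleted at the site.
    other_resident_datasets: iterable of file sets for datasets that stay."""
    others = list(other_resident_datasets)
    still_needed = set().union(*others) if others else set()
    return dataset_files - still_needed

# usage: file2 survives because another dataset at the site still contains it
print(files_to_delete({"file1", "file2"}, [{"file2", "file3"}]))   # -> {'file1'}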
After 0.5.x • Background work has begun on a central catalogue update • Important schema change: new timestamp-oriented unique identifiers for datasets (see the sketch below) • allowing partitioning of the backend DB transparently to the user and providing a more efficient storage schema • 2007 datasets on one instance, 2008 datasets on another… • Work has started, on a longer timescale, as the change will be fully backward compatible • old clients will continue to operate as today, to facilitate any 0.3.x -> 1.0 migration
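A minimal sketch of the idea, assuming details not stated in the slides: with the creation time embedded in the dataset identifier, the server can route a request to the right backend instance (e.g. per year) without the client knowing anything about the partitioning.

# Hypothetical identifier format and routing table, illustration only.
import time, uuid

def new_dataset_id():
    """UUID1 embeds a timestamp; prefixing the year makes routing trivial."""
    return "%d.%s" % (time.gmtime().tm_year, uuid.uuid1())

def backend_for(dataset_id):
    year = dataset_id.split(".", 1)[0]
    return {"2007": "db_instance_2007", "2008": "db_instance_2008"}.get(
        year, "db_instance_current")

dsid = new_dataset_id()
print(dsid, "->", backend_for(dsid))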
Constraints • In March we decided to centralize the DQ2 services on LCG • as a motivation to understand problems with production simulation transfers • as our Tier-0 -> Tier-1 tests had always been quite successful in using DQ2 • now, 6 months later, we finally start seeing some improvement • many design decisions of the site services were altered to adapt to production simulation behaviour (e.g. many fairly “large” open datasets) • We expect to need to continue operating all LCG DQ2 instances centrally for a while longer • support and operations are now being set up • but there is a significant lack of technical people aware of MySQL/DQ2 internals
Points • Longish threads and the use of Savannah for error reports • e.g. recently we kept getting internal Panda error messages from the worker node for some failed jobs, which were side effects of some failure in the central Panda part • Propose a single (or at least a primary) contact point for central catalogue issues (Tadashi + Pedro?) • For site services, and whenever the problem is clear, please also post a report on Savannah • We have missed minor bug reports because of this • Clarify DQ2’s role on OSG and signal possible contributions