PROOF developments
G. Ganis
CAF meeting, ALICE offline week, 11 July 2008
Overview
• Recent / current developments focus mostly on
  • Solving instabilities and improving error recovery
  • Improving the user interface
  • Resource control in multiuser
• CAF is one of the main sources of feedback to
  • Understand problems
  • Spot missing functionality
G. Ganis, CAF, Alice offline week
Today’s Subjects
• Stability issue
  • New XrdProofd plug-in
  • Related issues
• New Log box
• Monitoring of the memory consumption
• Dataset management
• Scheduling developments
New XrdProofd plug-in (1)
• Addresses stability issues typically observed after a failure and the subsequent attempt to reset the session
• We traced these back to deadlocks caused by concurrent actions that were not well protected
• The new plug-in implements a redesigned interaction between components, significantly reducing the number of locks
• The changes for the user are minimal
  • But the level of asynchronism introduced may confuse people looking at the process tables, as the processes are cleaned up with some delay
New XrdProofd plug-in (2)
New features
• Resilience to xrootd failures/glitches
  • Applications attempt to restore the connections for 10 mins
  • Solves the problem of restarting xrootd to change the configuration
• Directive to define workers in the xrootd config file
  • Get rid of proof.conf
  • Example: on CAF DEV the workers are defined with
      xpd.worker master lxb6043
      xpd.worker worker lxb60[41-42,44]
      xpd.worker worker lxb60[41-42,44]
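
The host pattern in the example above, lxb60[41-42,44], can be read as a compact list of nodes. The following is a minimal sketch of how such a pattern could be expanded. It is not the xrootd parser: the function name expandHostPattern is hypothetical, and the assumed syntax (one bracketed, comma-separated list of numbers or ascending inclusive ranges) is inferred from the slide's example only. Zero-padded numbers are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Expand a pattern such as "lxb60[41-42,44]" into individual host names.
// Hypothetical helper; the syntax is assumed from the slide's example.
std::vector<std::string> expandHostPattern(const std::string& pattern) {
    const std::size_t open = pattern.find('[');
    const std::size_t close = pattern.find(']');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return {pattern};  // no bracket expression: a single host
    }
    const std::string prefix = pattern.substr(0, open);
    const std::string suffix = pattern.substr(close + 1);
    std::istringstream items(pattern.substr(open + 1, close - open - 1));

    std::vector<std::string> hosts;
    std::string item;
    while (std::getline(items, item, ',')) {
        if (item.empty()) continue;  // tolerate empty fields
        const std::size_t dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last =
            (dash == std::string::npos) ? first : std::stoi(item.substr(dash + 1));
        assert(first <= last && "ranges are assumed to be ascending");
        for (int n = first; n <= last; ++n) {
            hosts.push_back(prefix + std::to_string(n) + suffix);
        }
    }
    return hosts;
}
```

Under these assumptions, expandHostPattern("lxb60[41-42,44]") yields lxb6041, lxb6042 and lxb6044.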
Related Improvements
• Automatic shutdown of orphaned sessions
  • Get rid of proofserv processes hanging around
• Improved notification in case of a worker death
New Log Dialog box (A. Kreshuk)
• Using TProof::Mgr(master)->GetSessionLogs()
• Should work even if the session hangs
Memory usage monitoring (A. Kreshuk)
• Worker: RAM vs events processed
• Master: RAM vs objects merged
• Should make it easy to spot memory leaks
• Additional analysis with another tool: TMemStat?
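
One simple way to read a "RAM vs events processed" curve is to fit a straight line and look at its slope. The sketch below illustrates that idea only and is not what PROOF implements: the type MemSample, both function names, and the threshold parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One monitoring sample: events processed so far and memory in use (kB).
struct MemSample {
    double events;
    double memoryKb;
};

// Least-squares slope of memory versus events, in kB per event.
// Returns 0 when fewer than two distinct event counts are available.
double memoryGrowthPerEvent(const std::vector<MemSample>& samples) {
    const std::size_t n = samples.size();
    if (n < 2) return 0.0;
    double meanX = 0.0, meanY = 0.0;
    for (const MemSample& s : samples) {
        meanX += s.events;
        meanY += s.memoryKb;
    }
    meanX /= static_cast<double>(n);
    meanY /= static_cast<double>(n);
    double sxy = 0.0, sxx = 0.0;
    for (const MemSample& s : samples) {
        sxy += (s.events - meanX) * (s.memoryKb - meanY);
        sxx += (s.events - meanX) * (s.events - meanX);
    }
    return (sxx > 0.0) ? sxy / sxx : 0.0;
}

// Flag a possible leak when memory grows faster than a chosen threshold.
bool looksLikeLeak(const std::vector<MemSample>& samples, double thresholdKbPerEvent) {
    assert(thresholdKbPerEvent >= 0.0);
    return memoryGrowthPerEvent(samples) > thresholdKbPerEvent;
}
```

A steady positive slope over a long run is only a hint; caches that fill up early can produce the same signature over a short window.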
Memory consumption monitoring (A. Kreshuk)
• Normal level
  • Workers monitor their memory usage and save the information in the log file
  • The client gets warned of high usage
  • The session may eventually be killed
• Advanced level
  • Possibility to save very detailed information in a dedicated tree (TProofStats), e.g. via an interface to Marian Ivanov’s memstat tool
  • To be run as a second pass when a problem shows up
• First version in SVN in the coming days
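
The "normal level" behaviour above (warn the client, eventually kill the session) can be pictured as a two-threshold check. This is a sketch under stated assumptions: the slide gives no limits, so warnKb and killKb, the enum, and the function name are all hypothetical.

```cpp
#include <cassert>

// Possible reactions to a memory reading (hypothetical names).
enum class MemoryAction { None, WarnClient, KillSession };

// Compare the current usage against a soft and a hard limit.
// Both limits are assumptions for illustration; the slide specifies none.
MemoryAction checkMemory(long usedKb, long warnKb, long killKb) {
    assert(warnKb <= killKb);
    if (usedKb >= killKb) return MemoryAction::KillSession;
    if (usedKb >= warnKb) return MemoryAction::WarnClient;
    return MemoryAction::None;
}
```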
Dataset management (1) (JFGO)
• Hot topic for T2/T3
• Dataset: metadata about a set of files
  • TFileCollection: list of TFileInfo
  • TFileInfo
    • UUID, TUrl’s of the file
    • TFileInfoMeta: one per TTree with name, entries, …
• Datasets are identified by name
• Info may come from different places: catalogs, SQL databases, file systems
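
The containment described above (collection, files, per-tree metadata) can be mirrored with plain structs. These are illustrative stand-ins, not the ROOT classes: the struct and field names are hypothetical and only the fields named on the slide are modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Per-tree metadata (plays the role of TFileInfoMeta).
struct TreeMeta {
    std::string treeName;
    long long entries;
};

// One file: identifier, access URLs, and one TreeMeta per tree
// (plays the role of TFileInfo).
struct FileInfo {
    std::string uuid;
    std::vector<std::string> urls;
    std::vector<TreeMeta> trees;
};

// A named list of files (plays the role of TFileCollection).
struct DataSet {
    std::string name;
    std::vector<FileInfo> files;
};

// Sum the entries of one tree over all files, using metadata only,
// i.e. without opening any file.
long long totalEntries(const DataSet& ds, const std::string& treeName) {
    long long total = 0;
    for (const FileInfo& file : ds.files) {
        for (const TreeMeta& tree : file.trees) {
            assert(tree.entries >= 0);
            if (tree.treeName == treeName) total += tree.entries;
        }
    }
    return total;
}
```

Keeping the entry counts in the metadata is what lets a question such as "how many events are in this dataset" be answered without touching the files.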
Dataset manager (2) (JFGO)
• TProofDataSetManager: abstract interface describing the basic functionality
  • RegisterDataSet, GetDataSet, VerifyDataSet, …
  • VerifyDataSet opens the files, i.e. may trigger staging
• TProofDataSetManagerFile: implementation handling information via ROOT files datasetname.root
  • Stored on the master in a dedicated subdirectory
  • <DatasetDir>/group/user/dataset
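
The split between an abstract manager and a concrete backend can be sketched as follows. This is not the TProofDataSetManager API: the class names, method signatures, and the in-memory backend are hypothetical, and the path layout in dataSetPath (directory, group, user, dataset name plus ".root") is an assumption read off the slide.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using FileList = std::vector<std::string>;  // file URLs only, for brevity

// Hypothetical abstract interface, loosely following the operations
// named on the slide (register and get).
class DataSetStore {
public:
    virtual ~DataSetStore() = default;
    // Returns false if a dataset with this key already exists.
    virtual bool registerDataSet(const std::string& key, const FileList& files) = 0;
    // Returns nullptr if the key is unknown.
    virtual const FileList* getDataSet(const std::string& key) const = 0;
};

// Assumed on-disk layout for a file-based backend.
std::string dataSetPath(const std::string& dir, const std::string& group,
                        const std::string& user, const std::string& name) {
    assert(!name.empty());
    return dir + "/" + group + "/" + user + "/" + name + ".root";
}

// Toy backend keeping everything in memory; a file- or SQL-based
// backend would implement the same two methods.
class InMemoryStore : public DataSetStore {
public:
    bool registerDataSet(const std::string& key, const FileList& files) override {
        return sets_.emplace(key, files).second;
    }
    const FileList* getDataSet(const std::string& key) const override {
        const auto it = sets_.find(key);
        return it == sets_.end() ? nullptr : &it->second;
    }
private:
    std::map<std::string, FileList> sets_;
};
```

Code written against the abstract class does not need to know which backend holds the information, which is the point of the design described on the slide.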
Dataset manager (3) (JFGO)
• TProofDataSetManagerFile is what is used on CAF
  • Users can register, scan, get
  • Verify is disallowed (to avoid staging overload)
    • It is run by a dedicated daemon (JFGO)
• Datasets can be processed by name
  • Provides a way to cache the information needed at the validation step, speeding this up considerably
• TProofDataSetManager can also be used locally to organize your datasets or chains
  • No need for a dedicated macro to create the chain (CreateESDchain)
Dataset manager (4) (JFGO)
• ATLAS is very interested
  • They are oriented towards a MySQL backend and validity tokens for the dataset
  • Will provide TProofDataSetManagerSQL
• Other issues raised by ATLAS
  • Possibility to use multiple dataset sources, e.g. file-based and SQL-based concurrently
  • Problem of the datasets in federated clusters (multi-masters), which is challenging on the PROOF side too
Scheduling developments (J. Iwaszkiewicz)
• Control resources and how they are used
• Improving efficiency
  • Assigning to a job those nodes that hold the data to be analyzed
• Implementing different scheduling policies
  • e.g. fair share, group priorities & quotas
• Efficient use even in case of congestion
Scheduling developments (2)
• Assigning a set of workers to a job based on:
  • The dataset location
  • User priority (quota + historical usage)
    • Can be taken from an external source
  • The current load of the cluster
• Create (priority) queues for queries that cannot be started
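
A priority queue for queries that cannot start immediately can be sketched with the standard library. The names PendingQuery and QueryQueue are hypothetical, and the tie-breaking rule (earlier submission first among equal priorities) is an assumption, not something the slide states.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// A query waiting for resources (hypothetical type).
struct PendingQuery {
    std::string user;
    double priority;         // larger value = served earlier
    unsigned long sequence;  // submission order, used to break ties
};

// Ordering for std::priority_queue: returns true when a should be
// served after b.
struct ServedLater {
    bool operator()(const PendingQuery& a, const PendingQuery& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.sequence > b.sequence;  // assumed: first come, first served
    }
};

class QueryQueue {
public:
    // Park a query that cannot be started right now.
    void hold(const std::string& user, double priority) {
        queue_.push(PendingQuery{user, priority, next_++});
    }
    bool empty() const { return queue_.empty(); }
    // Take out the query that should start next.
    PendingQuery release() {
        assert(!queue_.empty());
        PendingQuery q = queue_.top();
        queue_.pop();
        return q;
    }
private:
    std::priority_queue<PendingQuery, std::vector<PendingQuery>, ServedLater> queue_;
    unsigned long next_ = 0;
};
```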
Scheduling developments (3)
• Implementation exists with:
  • # of Workers ≈ relativePriority * nFreeCPUs
  • Assign least loaded workers first
• Missing pieces
  • Dynamic worker setup (advanced prototype exists)
  • Worker nodes auto-registration
  • Improved load monitoring
  • Support for “put-on-hold” submission (prototype)
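
The two rules of the existing implementation (worker count proportional to relative priority, least loaded workers first) can be combined in a few lines. This is a sketch, not the PROOF scheduler: the names are hypothetical, relativePriority is assumed to lie in [0, 1], and rounding to the nearest integer with a minimum of one worker is an assumption, since the slide only gives an approximate relation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// A free worker slot and the current load of its node (hypothetical type).
struct WorkerNode {
    std::string host;
    double load;
};

// Number of workers ≈ relativePriority * nFreeCPUs.
// Assumed details: round to nearest, at least one worker if any is free.
std::size_t workerCount(double relativePriority, std::size_t nFreeCpus) {
    assert(relativePriority >= 0.0 && relativePriority <= 1.0);
    const std::size_t n = static_cast<std::size_t>(
        std::lround(relativePriority * static_cast<double>(nFreeCpus)));
    return std::min(std::max<std::size_t>(n, 1), nFreeCpus);
}

// Pick the least loaded workers first.
std::vector<WorkerNode> selectWorkers(std::vector<WorkerNode> freeWorkers,
                                      double relativePriority) {
    const std::size_t n = workerCount(relativePriority, freeWorkers.size());
    std::stable_sort(freeWorkers.begin(), freeWorkers.end(),
                     [](const WorkerNode& a, const WorkerNode& b) {
                         return a.load < b.load;
                     });
    freeWorkers.erase(freeWorkers.begin() + static_cast<std::ptrdiff_t>(n),
                      freeWorkers.end());
    return freeWorkers;
}
```

With four free workers and a relative priority of 0.5, this sketch assigns the two workers with the lowest load.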
Scheduling schema
[Figure: scheduling schema. Components: Client, PROOF master, Dataset Lookup, Scheduler (inputs: load, history, policy, …). Labeled steps: 1: Job {dataset, …}; 2: dataset; 3: file locations; 4: Job info; 5: workers; 6: workers; Start workers.]
Other developments
• PROOFLITE
  • Version of PROOF optimized for multicore machines with workers started directly by the ROOT session (no daemon)
  • Useful to quickly test code in a real PROOF environment
  • Will be used to study I/O issues in multicore
  • Almost ready to go into the trunk
• PROOF / Condor integration
  • Possible ATLAS model for T3 farms not dedicated to PROOF
  • Condor provides a mechanism to give high priority to PROOF queries when required by suspending/hibernating batch jobs
Questions?
• Credits
  • G.G., J. Iwaszkiewicz, A. Kreshuk, F. Rademakers
  • M. Meoni, J.F. Grosse-Oetringhaus (ALICE)
  • F. Furano, A. Peters (CERN/IT)
  • A. Hanushevsky (SLAC)