Extension of DIRAC to enable distributed computing using Windows resources 3rd EGEE User Forum 11-14 February 2008, Clermont-Ferrand J. Coles, Y. Y. Li, K. Harrison, A. Tsaregorodtsev, M. A. Parker, V. Lyutsarev
Overview • Why port to Windows and who is involved? • DIRAC overview • Porting process • Client (job creation/submission) • Agents (job processing) • Resources • Successes/usage • Deployment • Summary University of Cambridge
Motivation • Aim: • Enable Windows computing resources in the LHCb workload and data management system DIRAC • Allow what can be done under Linux to be possible under Windows • Motivation: • Increase the number of CPU resources available to LHCb for production and analysis • Offer a service to Windows users • Allow transparent job submission and execution on Linux and Windows • Who's involved: • Cambridge, Cavendish – Ying Ying Li, Karl Harrison, Andy Parker • Marseille, CPPM – Andrei Tsaregorodtsev (DIRAC architect) • Microsoft Research – Vassily Lyutsarev
DIRAC Overview • Distributed Infrastructure with Remote Agent Control • LHCb's distributed production and analysis workload and data management system • Written in Python • Four sections: • Client – user interface • Services – the DIRAC Workload Management System, based on the main Linux server • Agents • Resources – CPU resources and data storage
DISET security module • DIRAC Security Transport module – the underlying security module of DIRAC • Provides grid authentication and encryption (using X509 certificates and grid proxies) between the DIRAC components • Uses OpenSSL with pyOpenSSL (DIRAC's modified version) wrapped around it • Standard: implements Secure Sockets Layer and Transport Layer Security, and contains the cryptographic algorithms • Additional: grid proxy support • Pre-built OpenSSL and pyOpenSSL libraries are shipped with DIRAC • Windows libraries are provided alongside Linux libraries, allowing the appropriate libraries to be loaded at run time • Proxy generation under Windows • Multi-platform command: dirac-proxy-init • Validity of the generated proxy is checked under both Windows and Linux
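The validity check performed on a generated proxy reduces to comparing the certificate's expiry time against the clock. A minimal sketch of that comparison (function and parameter names are illustrative, not the DIRAC API):

```python
from datetime import datetime, timedelta, timezone

def proxy_needs_renewal(not_after, min_lifetime=timedelta(hours=1)):
    """Return True if a proxy expiring at `not_after` (a timezone-aware
    datetime taken from the certificate's notAfter field) has less than
    `min_lifetime` of validity remaining."""
    remaining = not_after - datetime.now(timezone.utc)
    return remaining < min_lifetime
```

In the real system the expiry time would be read from the proxy certificate via pyOpenSSL; here it is passed in directly so the decision logic is platform independent.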
Client – job submissions • Submissions are made with a valid grid proxy • Three ways: • JDL (Job Description Language) • DIRAC API • Ganga – built on DIRAC API commands; currently being ported to Windows • Successful job submission returns a job ID, provided by the Job Monitoring Service

JDL example (submit under Windows with: > dirac-job-submit.py myjob.jdl):
SoftwarePackages = { "DaVinci.v12r15" };
InputSandbox = { "DaVinci.opts" };
InputData = { "LFN:/lhcb/production/DC04/v2/00980000/DST/Presel_00980000_00001212.dst" };
JobName = "DaVinci_1";
Owner = "yingying";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = { "std.out", "std.err", "DaVinci_v12r15.log", "DVhbook.root" };
JobType = "user";

DIRAC API example (under Windows: > myjob.py, or enter the commands directly in Python):
import DIRAC
from DIRAC.Client.Dirac import *
dirac = Dirac()
job = Job()
job.setApplication('DaVinci', 'v12r15')
job.setInputSandbox(['DaVinci.opts'])
job.setInputData(['LFN:/lhcb/production/DC04/v2/00980000/DST/Presel_00980000_00001212.dst'])
job.setOutputSandbox(['DaVinci_v12r15.log', 'DVhbook.root'])
dirac.submit(job)
DIRAC Agent under Windows • Python installation script • Downloads and installs the DIRAC software, and sets up the DIRAC Agent • Agents are initiated on free resources • Agent job retrieval: • Run the DIRAC Agent to see if there are any suitable jobs on the server • The Agent retrieves any matched jobs • The Agent reports the job status to the Job Monitoring Service • The Agent downloads and installs the applications required to run the job • The Agent retrieves any required data (see next slide) • The Agent creates a Job Wrapper to run the job (the wrapper is platform aware) • Output is uploaded to storage if requested • [Diagram: web monitoring of Linux and Windows sites]
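The retrieval sequence above can be sketched as a single agent cycle. The matcher and monitor objects stand in for the DIRAC Matcher and Job Monitoring Service clients; their interfaces here are hypothetical, not the real DIRAC ones:

```python
class AgentCycle:
    """One pass of the agent loop sketched above: ask the server for a
    suitable job, report status to the monitoring service, run the job
    through a (platform-aware) wrapper, report completion."""

    def __init__(self, matcher, monitor):
        self.matcher = matcher   # stand-in for the DIRAC job matcher client
        self.monitor = monitor   # stand-in for the Job Monitoring Service

    def run_once(self):
        job = self.matcher.request_job()            # ask server, retrieve match
        if job is None:
            return None                             # no suitable job this cycle
        self.monitor.report(job["id"], "Matched")   # report status
        # install the required application, stage input data, create the
        # platform-aware Job Wrapper and run it -- elided in this sketch
        self.monitor.report(job["id"], "Done")      # output uploaded
        return job["id"]
```

A standalone agent would call run_once() in a loop, sleeping for a preset interval whenever no job is matched.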
Data access • Access to LHCb's distributed data storage system requires: • Access to the LFC (LCG File Catalogue, which maps LFNs (Logical File Names) to PFNs (Physical File Names)) • Access to the Storage Element • On Windows a catalogue client is provided via the DIRAC portal service • Uses DIRAC's security module DISET and a valid user grid proxy • Authenticates to the Proxy server, which contacts the File Catalogue on the user's behalf with its own credentials • Uses the .NetGridFTP client 1.5.0 provided by the University of Virginia • Based on GridFTP v1; from tests it appears compatible with the GridFTP server used by LHCb (edg uses GridFTP client 1.2.5-1 and Globus GT2) • The client contains the functions needed for file transfers: get, put, mkdir • And a batch tool that mimics the command flags of globus-url-copy • Requirements: .Net v2.0 • .NetGridFTP binaries are shipped with DIRAC • Allows full data registration and transfer to any Storage Element supporting GridFTP
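The portal pattern above — the Windows client never talks to the LFC directly; a proxy server authenticates the user and resolves LFNs on their behalf — can be sketched as follows. All names are hypothetical, and a plain dictionary stands in for the file catalogue:

```python
class CatalogueProxy:
    """Sketch of the DIRAC portal service pattern: the client presents a
    grid proxy, the server checks it (DISET-style), then queries the
    catalogue with its own credentials and returns the replica list."""

    def __init__(self, replicas, authenticate):
        self._replicas = replicas          # LFN -> list of PFNs (stands in for the LFC)
        self._authenticate = authenticate  # callable: proxy -> bool

    def get_pfns(self, user_proxy, lfn):
        """Resolve a Logical File Name to its Physical File Names."""
        if not self._authenticate(user_proxy):
            raise PermissionError("invalid grid proxy")
        return self._replicas.get(lfn, [])
```

Once the PFNs are known, the actual transfer would go through the .NetGridFTP client (get/put), which is outside this sketch.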
DIRAC CE backends • DIRAC provides a variety of Compute Element backends under Linux: • Inprocess (standalone machine), LCG, Condor, etc. • Windows: • Inprocess • The Agent loops at preset intervals, assessing the status of the resource • Microsoft Windows Compute Cluster • An additional Windows-specific CE backend • Requires one shared installation of DIRAC and the applications on the Head node of the cluster • Agents are initiated from the Head node and communicate with the Compute Cluster Services • Job outputs are uploaded to the Sandboxes directly from the worker nodes
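The platform split above amounts to a lookup keyed on the operating system. A sketch (backend names follow the slide; the function itself is illustrative, not DIRAC code):

```python
def available_backends(system):
    """Return the CE backends usable on a given platform, per the list
    above: Inprocess everywhere; LCG and Condor under Linux; the
    Microsoft Compute Cluster backend only under Windows."""
    backends = {"Inprocess"}
    if system == "Linux":
        backends |= {"LCG", "Condor"}
    elif system == "Windows":
        backends.add("MicrosoftComputeCluster")
    return backends
```

In practice the argument would come from platform.system() at agent start-up.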
LHCb applications • Five main LHCb applications (C++: Gauss, Boole, Brunel, DaVinci; Python: Bender) • Gauss – event generation and detector simulation • Boole – digitisation • Brunel – reconstruction • DaVinci, Bender – analysis • Data formats: Sim – simulation data format; RAWmc – RAW Monte Carlo, equivalent to the RAW data format from the detector; DST – Data Storage Tape • [Diagram: an MC production job runs Gauss → Boole → Brunel (Sim → RAWmc → DST), mirroring the data flow from the detector (RAW → DST); analysis jobs (DaVinci, Bender) read DSTs and produce statistics]
Gauss • Most LHCb applications are compiled for both Linux and Windows • For historical reasons, we use Microsoft Visual Studio .NET 2003 • Gauss is the only application that had not previously been compiled under Windows • Gauss relies on three major pieces of software not developed by LHCb: • Pythia6: simulation of particle production – legacy Fortran code • EvtGen: simulation of particle decays – C++ • Geant4: simulation of the detector – C++ • Gauss needs each of the above to run under Windows • Work strongly supported by the LHCb and LCG software teams • All third-party software now successfully built under Windows • Most build errors resulted from the Windows compiler being less tolerant of "risky coding" than gcc • Insists that arguments passed to a function are of the correct type • Stricter about memory management • Good for forcing code improvements! • Able to fully build Gauss under Windows, with both Generator and Simulation parts • We can run full Gauss jobs of BBbar events, with distributions comparable to those produced under Linux • Have installed and tested Gauss v30r4 on the Cambridge cluster • Latest release, Gauss v30r5: • First fully Windows-compatible release • Contains both pre-built Geant4 and Generator Windows binaries
Cross-platform job submissions • The job creation and submission process is the same under Linux and Windows (i.e. it uses the same DIRAC API commands and the same steps) • Two current types of main LHCb grid jobs: • MC production jobs – CPU intensive, no input required; potentially ideal for 'CPU scavenging' • Recent efforts (Y. Y. Li, K. Harrison) allowed Gauss to compile under Windows (see previous slide) • A full MC production chain is still to be demonstrated on Windows • Analysis jobs – require input (data, private algorithms, etc.) • DaVinci, Brunel, Boole • Note: a C++ compiler is required for customised user algorithms • Jobs submitted with libraries are bound to the same platform for processing • Platform requirements can be added during job submission • Bender (Python) • Note: no compiler, linker or private library required • Allows cross-platform analysis jobs to be performed • Results are retrieved to the local computer via:
> dirac_job_get_output.py 1234 (results in the output sandbox)
> dirac-rm-get (LFN) – uses GridFTP to retrieve output data from a Grid SE
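The platform-binding rule above can be sketched as a small helper (hypothetical, not part of the DIRAC API): jobs carrying compiled private libraries are pinned to the platform they were submitted from, while pure-Python Bender jobs carry no platform constraint.

```python
def platform_requirement(submission_platform, has_compiled_user_code):
    """Jobs submitted with compiled libraries (e.g. private DaVinci
    algorithms built with Visual Studio or gcc) must be processed on the
    same platform; Python-only jobs (e.g. Bender) may run anywhere.
    "ANY" is an illustrative wildcard value."""
    return submission_platform if has_compiled_user_code else "ANY"
```

The matcher would then only hand such a job to an agent whose platform satisfies the requirement.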
DIRAC Windows usage • DIRAC is supported on two Windows platforms: • Windows XP • Windows Server 2003 • Use of DIRAC to run LHCb physics analysis under Windows • Comparison between DC04 and DC06 data on the B±→D0(Ksπ+π-)K± channel • 917,000 DC04 events processed under Windows, per selection run • ~48 hours total CPU time on 4 nodes • A further ~200 jobs (totalling ~4.7 million events) submitted from Windows to DIRAC, processed on LCG, and retrieved on Windows • Further selection background studies are currently being carried out with the system • Processing speed comparisons between Linux and Windows • Difficult, as the Windows binaries are currently built in debug mode by default
DIRAC deployment • [Maps: current deployment and future deployment sites]
Windows wrapping • The bulk of the DIRAC Python code was already platform independent • However, not all Python modules are platform independent • Three types of code modifications/additions: • Platform-specific libraries and binaries (e.g. OpenSSL, pyOpenSSL, .NetGridFTP) • Additional Windows-specific code (e.g. the Windows Compute Cluster CE backend, .bat files to match Linux shell scripts) • Minor Python code modifications (e.g. changing process forks to threads) • DIRAC installation: ~60 MB • Per LHCb application: ~7 GB • Windows port modifications by file size of used DIRAC code: unmodified 60%, modified for cross-platform compatibility 34%, Windows specific 6%
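An example of the third kind of modification — replacing a Linux-only os.fork() with a thread, since fork() does not exist on Windows. This is a sketch of the pattern, not the actual DIRAC code:

```python
import threading

# Linux-only original (roughly):
#     pid = os.fork()
#     if pid == 0:
#         run_job(job)
#         os._exit(0)
#
# Cross-platform replacement: run the job in a worker thread instead,
# which works identically under Linux and Windows.
def start_job(run_job, job):
    """Launch `run_job(job)` concurrently and return the worker thread."""
    worker = threading.Thread(target=run_job, args=(job,))
    worker.start()
    return worker
```

The caller can join() the returned thread where the forking version would have waited on the child process.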
Summary • Working DIRAC v2r11, able to integrate both standalone Windows machines and Windows cluster CPUs into the existing Linux system • Porting – replacement of Linux-specific Python code, and provision of Windows equivalents where platform independence was not possible (e.g. pre-compiled libraries, secure file transfers) • Windows platforms tested: • Windows XP • Windows Server 2003 • Cross-platform job submission and retrieval • Little change to syntax for the user • Full analysis job cycle on Windows, from algorithm development to analysis of results (Bender → running (Linux) → getting results) • Continued use for further physics studies • All applications for MC production jobs tested • Deployment extended to three sites so far, totalling 100+ Windows CPUs • Two Windows Compute Cluster sites • Future plans: • Test the full production chain • Deploy on further systems/sites, e.g. Birmingham • Larger scale tests • Continued usage for physics studies • Provide a useful tool when LHC data arrives
Backup slides
Cross-platform compatibility
[Diagram: job submission by a user – the WMS services (Job Matcher, Job Management Service, Sandbox Service, Job Monitoring Service, LFC Service, Proxy Server) communicate via DISET with the DIRAC Agent on the Head Node (steps 1–3); the Agent installs DaVinci from the Software Repository and runs the job under the DIRAC Wrapper with a Job Watch-dog, using the Local SE for storage]