Grid in action: from EasyGrid to LCG testbed and gridification techniques.
James Cunha Werner, University of Manchester
Christmas Meeting - 2005
Conventional way:
• Usual code (your cuts).
• Run BetaMiniApp on several data files, one after the other.
• When all data is done, you have results!
Grid way:
• Same usual code (your cuts).
• Run several copies of BetaMiniApp, each running independently on one data file.
• At the end, join all results!
Going to grid: EasyGrid does it for you!
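The difference between the two ways can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `analyse_file` is a hypothetical stand-in for one BetaMiniApp run, the counts it returns are made up, and a local thread pool stands in for grid worker nodes. It is not how EasyGrid itself submits jobs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def analyse_file(path):
    """Hypothetical stand-in for one BetaMiniApp run on one data file.

    Returns made-up event counts so that results can be joined later.
    """
    return Counter({"processed": 100, "selected": len(path)})


def run_conventional(files):
    """Conventional way: one file after the other, in a single process."""
    total = Counter()
    for path in files:
        total += analyse_file(path)
    return total


def run_grid_way(files, max_workers=4):
    """Grid way: independent copies, one per file, then join all results.

    A thread pool is an assumption standing in for grid worker nodes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(analyse_file, files))
    total = Counter()
    for partial in partials:
        total += partial  # the "join all results" step
    return total


if __name__ == "__main__":
    files = ["run1.root", "run2.root", "run3.root"]  # hypothetical file names
    print(run_conventional(files) == run_grid_way(files))
```

Because each file is analysed independently, the joined result is the same in both cases; only the elapsed time changes.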
General overview
• Users’ software
• EasyGrid for datasets
• Gridification algorithms for generic software
• EasyTau for selected events
• Grid testbed
EasyGrid: an overview
• Prototype for future development. RPA = guarantee of useful software.
• Provides full support for the job submission system:
  • Recovers results into the user's directory.
  • Generates reports for further analysis (aborts and abends) in one history file.
• It is a framework that users can adapt to their own needs and applications.
• Fully operational and integrated with LCG.
Christmas 2004: My goals were…
• Develop a fail-proof submission system.
• Write web pages covering all elementary tasks in HEP/BaBar, to help students and newcomers.
• Understand the q-qbar interaction through Pi0.
What I have achieved in 2005…
Achievements with EasyGrid
• User-friendly framework, flexible and reliable. It provides users with results, or with the information needed for further analysis.
• Tutorial web pages for PhD students and new researchers.
  • http://www.hep.man.ac.uk/u/jamwer
• Pi0 project: analysis of 500 million events and generation of 5 million Monte Carlo events in 5 weeks.
  • http://www.hep.man.ac.uk/u/jamwer/pi0alg5.html
• Anti-deuteron project: 1,500 million events in 1 week, running at several sites in the UK. More than 200 jobs in parallel.
  • http://www.hep.man.ac.uk/u/jamwer/deutdesc.html
LCG installation and debugging
• There are several problems in the LCG grid:
  • a high number of jobs fail when more than 200 jobs are running.
  • installation issues.
  • performance issues.
• Installation of a complete testbed from scratch using 10 obsolete computers: http://www.hep.man.ac.uk/u/jamwer/#sec0
Testbed stress test
• Processing time is zero: BetaMiniApp is replaced by a program that prints the dataset name and waits for some time (e.g. 300 s).
• 1,000 jobs were submitted in each test to the 6-WN testbed.
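A placeholder job of this kind could look like the sketch below. The function name, the output format, and the injectable `sleep` and `out` parameters are assumptions made for illustration; the program actually used in the test is not shown in the slides.

```python
import sys
import time


def placeholder_job(dataset_name, wait_seconds=300.0, sleep=time.sleep, out=sys.stdout):
    """Stand-in for BetaMiniApp: print the dataset name, then just wait.

    No real processing is done, so any failure seen in the test comes from
    the grid middleware and not from the analysis code. `sleep` and `out`
    are injectable (an assumption) so the job can be tried without waiting.
    """
    out.write(f"dataset: {dataset_name}\n")
    out.flush()
    sleep(wait_seconds)
    return 0  # exit status reported back to the batch system
```

Calling `placeholder_job("some-dataset", wait_seconds=1)` is enough to try it locally; the 300 s default mirrors the example wait time quoted above.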
Number of jobs/WN

              T0     T1        T2
Sub Fail       0      0         0
Aborts (1)    84    122         0
bf33         296    144         6
bf34         306    148       161
bf35         314    156       195
bf36           0    165       211
bf37           0    172       213
bf38           0     91 (2)   214

• T0 and T1: time between submissions is zero (continuous flow).
• T0: pbs_mom had not been started on WNs bf36, bf37 and bf38.
• T1: one WN crashed during the test (2).
• T2: time between submissions is 30 s.
• CE (bf32) CPU use was >90%.
• (1) Cannot plan: BrokerHelper: no compatible resources
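The abort rates implied by the table can be worked out with a few lines of Python. The numbers are copied from the table; the 1,000 jobs per test come from the stress-test description, and the helper names are hypothetical.

```python
JOBS_SUBMITTED = 1000  # jobs submitted in each test (from the stress-test description)

# Jobs per outcome or worker node, copied from the table above.
RESULTS = {
    "T0": {"aborts": 84, "bf33": 296, "bf34": 306, "bf35": 314,
           "bf36": 0, "bf37": 0, "bf38": 0},
    "T1": {"aborts": 122, "bf33": 144, "bf34": 148, "bf35": 156,
           "bf36": 165, "bf37": 172, "bf38": 91},
    "T2": {"aborts": 0, "bf33": 6, "bf34": 161, "bf35": 195,
           "bf36": 211, "bf37": 213, "bf38": 214},
}


def abort_rate(test):
    """Fraction of the submitted jobs that aborted in the given test."""
    return RESULTS[test]["aborts"] / JOBS_SUBMITTED


def jobs_on_worker_nodes(test):
    """Total number of jobs that reached a worker node in the given test."""
    return sum(n for name, n in RESULTS[test].items() if name != "aborts")


if __name__ == "__main__":
    for test in RESULTS:
        print(test, f"{abort_rate(test):.1%}", jobs_on_worker_nodes(test))
```

With continuous submission the abort rate is 8.4% (T0) and 12.2% (T1); with 30 s between submissions (T2) it drops to zero.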
Recommendations
CEs are heavily loaded in the grid (>90% CPU load!), and this affects grid performance:
• The number of WNs for each CE can be defined by the minimum value of submission delay and the minimum queue time.
• Running a single CE for a large farm is a limiting factor. More matched CEs per RB would reduce failures and increase performance.
• A file system study will provide more information soon.
Research in gridification technologies for conventional software
• Users spend years developing their source code, and they will not throw it away just to use web services.
• I developed an algorithm that allows users to run their own software on top of a web service layer with LCG middleware.
• Preliminary tests using “fake” web services (simulated with PVM) show it is a viable and flexible approach.
Gridification algorithm
• Creates parallel processes using PVM with ssh as the remote shell.
• There is a central job, which distributes tasks over the parallel processes as the slave processes return results. No need for load balancing!
• Handles slave failures and resubmits tasks to available slaves. There is no checkpoint system (not worth it).
• Transfer time can be a bottleneck. Task streams are implemented. Results with 300 empty processes on one laptop show a transfer time of 185 ms/process.
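The scheduling idea in the bullets above can be sketched as follows. This is a minimal illustration, not the author's implementation: Python threads and a queue are swapped in for PVM processes started over ssh, and `run_master`, `work` and `max_attempts` are hypothetical names. Each slave asks for a new task only when it has finished the previous one, which is why no explicit load balancing is needed; a failed task is simply put back in the queue and restarted from scratch, since there is no checkpointing.

```python
import queue
import threading


def run_master(tasks, work, n_workers=3, max_attempts=3):
    """Central job: hand out tasks to whichever slave becomes free.

    Returns (results, failed): results maps task -> value, failed maps
    task -> last exception for tasks that exhausted max_attempts.
    """
    todo = queue.Queue()
    for task in tasks:
        todo.put((task, 1))  # (task, attempt number)
    results, failed = {}, {}
    lock = threading.Lock()

    def slave():
        while True:
            try:
                task, attempt = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do for this slave
            try:
                value = work(task)
            except Exception as exc:  # slave failure on this task
                if attempt < max_attempts:
                    todo.put((task, attempt + 1))  # resubmit from scratch
                else:
                    with lock:
                        failed[task] = exc
            else:
                with lock:
                    results[task] = value

    threads = [threading.Thread(target=slave) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failed


if __name__ == "__main__":
    done, lost = run_master(range(10), lambda x: x * x)
    print(len(done), len(lost))
```

A slave that requeues a failed task keeps looping, so the task is always picked up again, either by that slave or by another one that is still running.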
Conclusion
• EasyGrid is operational. Benchmarks were a proof of concept under real conditions.
• The LCG testbed is operational, providing results and supporting performance analysis and tuning.
• The gridification algorithm is running on one laptop with Genetic Programming/AI.
New Year resolutions
• Analysis of file server issues related to the Linux kernel.
• LCG performance study and Linux kernel tuning.
• Implementation of EasyTau: a submission module for the TauUser package using EasyGrid (running on ntuples).
• Gridification algorithm running with LCG and commercial applications (WebSphere, Tivoli, Symphony, etc.).
• EasyGrid product development and startup.
• Run the pi0 project again with the EasyGrid product and maybe … publish a paper about gridification!