EDG Retreat, tutorials and Budapest meeting
Steve Fisher / RAL
No details • Much of the obvious material has already been mentioned or will be in the testbed talk • Most of my material is stolen • I will try to fill in the gaps
Project Retreat • Project retreat held 27 & 28 August at Chevannes • ~45 participants • work package managers, architecture group, quality group, applications groups, M-ware experts, representatives from LCG, DataTAG, Globus & Condor • Agenda and material on the web: • http://documents.cern.ch/age?a021130 • 3 sessions addressing the most important aspects of the project's current work: • Software Release Process • Release 1.2 • Testbed 2
EDG Tutorial • The tutorials are aimed at users wishing to "gridify" their applications using EDG software and are organized over 2 full consecutive days • DAY 1 • Tutorial introduction • Introduction to Grid computing and overview of the DataGrid project • Security • Testbed overview • Job Submission • lunch • hands-on exercises: job submission • DAY 2 • Data Management • LCFG, fabric mgmt & sw distribution & installation • Applications and Use cases • Future Directions • lunch • hands-on exercises: data mgmt
Tutorial rehearsal • Rehearsal at CERN, 29 & 30 August • 19 participants to check material & approach • Lessons learnt • Can’t cover as much material as we hoped • Explain why, not just how • Avoid details: participants can read them in the references afterwards • Need as many helpers as possible for hands-on exercises • Participants have difficulties with certificate management • Generated a lot of enthusiasm in the participants and the EDG people doing the hands-on • Found genuine bugs during hands-on exercises • Recommend M-ware WPs send developers to help with hands-on exercises • New project people should follow the tutorial
Tutorial Schedule • CERN School of Computing, Naples, 23-27 September • 80 participants. Hands-on exercises only (presentations by Carl Kesselman & Ian Foster) • CERN, October 3 & 4 • NeSC, Edinburgh, December • Maximum 30 participants (more for the presentations) • Could then accommodate more sites • Sites must provide support and handle logistics • Organisers/helpers must attend tutorial at another site first • The tutorial does represent some load on the testbed • For the future • The material must be kept up to date with each public release of the software
Budapest • 5th EDG Project Conference PILISCSABA • Social event (folklore) - Sun • Cruise and dinner - IBM
Budapest • Monday • General status of the project by Fabrizio Gagliardi • Technical status of the project by Bob Jones • WP meetings • Tuesday • WP meetings and ATF • Wednesday • Dissemination Session • Reports of WP1-5 • Thursday, 5 September • Reports of WP6-10 and Security • Report on GLUE • Report from Globus
Application Status • WP8: High Energy Physics • LHC experiments doing tests now • ATLAS task force • WP9: Earth Observation • Installation of EDG 1.2 at ESA done • Testing to start in September • WP10: Biology • Initial tests made with EDG 1.2 • Overall comments: • General confusion about how best to use data mgmt tools • Software not yet stable enough and insufficient diagnostics information available • Too difficult to configure • Concern that EDG 1.2 in its current configuration will not scale easily to ~40 sites
WP8-10 - General • Deployed Software must be supported • Acceptance criteria are not in place for most WPs • Need real tests - real apps, long jobs, "random" behaviour of users, and > 50 users • Delays with 3rd party patches are a problem • so we have to invent hacks. • Release procedures, though formally in place, were ignored. • Interfaces are too low level for the user • They want efforts in reliability • need defensive programming • need good diagnostics • avoid single points of failure. • Documentation needs revision
Site Management • Sys-admin needs defined tests and procedures. • Installation • Some LCFG objects had never been tested - syntax errors! • Need manual checks, and on each node • many interactions/iterations with many people • Running • No test procedures to locate faulty services • Tracing a problem is hard - log files in odd places with odd formats • Error messages useless • ITeam mailing list is too busy • Need to find a more constructive way of solving problems. • Need to make more use of Bugzilla • Need to be able to cover vacations and conferences
User Support • Since June only 20 questions asked • CRLs and CAs • requests for accounts • commands failing due to firewalls • technical questions about installation and configuration • Q. Why not use an existing solution for user support desk? • Members of the support team must be experts – • Cannot afford to provide dedicated people from the WPs. • Today's reality is that the ITeam list is the only way
The Solution: “Quality”
Existing Software Process • Over-simplification of the current situation: • M-ware groups develop software in isolation • ITeam assembles it as best it can • Site managers are asked to install it • Application groups are asked to test it • Problems: • No place for the M-ware groups to integrate software before delivering it to the ITeam • Inadequate software testing – leads to installation/configuration/execution faults • Running blind – no way to control or reliably plan software delivery
Autobuild etc. • A release manager will be nominated with overall responsibility for ensuring the procedure is followed • Make autobuild tools the basis of the daily work of the M-ware groups and ITeam • Nightly build from CVS repository for all software • Problems must be fixed ASAP – checked by Quality Group reps • M-ware groups give ITeam CVS tags instead of RPMs • Tagged software must be documented • M-ware group must perform and supply unit tests • Integrated with nightly build • Tagged software that fails integration or testing, or is inadequately documented, will be rejected • M-ware group is responsible for fixing it
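The acceptance rule on this slide (a CVS tag is rejected if it fails integration, fails its unit tests, or is inadequately documented) can be sketched as a small check. This is an illustration only, not the project's actual autobuild tooling: the `TagReport` record, its field names, and the package and tag names in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TagReport:
    """Nightly-build outcome for one CVS tag (hypothetical record layout)."""
    package: str
    cvs_tag: str
    build_ok: bool        # did the tag integrate and build in the nightly run?
    unit_tests_ok: bool   # did the supplied unit tests pass?
    documented: bool      # is the tagged software adequately documented?


def rejection_reasons(report: TagReport) -> List[str]:
    """Return every reason the tag would be rejected; empty means accepted."""
    reasons = []
    if not report.build_ok:
        reasons.append("fails integration build")
    if not report.unit_tests_ok:
        reasons.append("fails unit tests")
    if not report.documented:
        reasons.append("inadequately documented")
    return reasons


def triage(reports: List[TagReport]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split tags into accepted ones and rejected ones with their reasons."""
    accepted: List[str] = []
    rejected: Dict[str, List[str]] = {}
    for report in reports:
        key = f"{report.package}@{report.cvs_tag}"
        reasons = rejection_reasons(report)
        if reasons:
            rejected[key] = reasons  # goes back to the M-ware group to fix
        else:
            accepted.append(key)
    return accepted, rejected


# Usage with made-up package and tag names:
reports = [
    TagReport("pkg-a", "v1_0", True, True, True),
    TagReport("pkg-b", "v0_3", True, False, False),
]
accepted, rejected = triage(reports)
```

Collecting all reasons, rather than stopping at the first failure, gives the responsible M-ware group the full list of what to fix in one pass.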
Quality Group • Recently formed Quality Group, convened by Gabriel Zaquine, is responsible for ensuring quality issues are addressed within the WPs • Ensure unit test plans are complete and followed • Follow up on problems reported via Bugzilla & in nightly builds • Organise running of appropriate code checking tools • Agree on adopted project developer guidelines etc. • http://eu-datagrid.web.cern.ch/eu-datagrid/QAG/default.htm
Testing • Strengthen the Testing Group • Identify leader and a small number of full-time testers • Assemble and maintain test suite integrated with autobuild tools • Automate installation and configuration of software releases • be able to auto install & configure a release on a pre-defined small example site • Needs improvements by M-ware WPs to simplify and complete installation & configuration of their software
Technical Management • Architecture group documenting testbed 2 architecture • http://doc.cern.ch/archive/electronic/other/agenda/a021130/a021130s4t1/TB2Arch_v0_1.doc • Project Tech. Board addresses deliverables and relationships with other projects • Meets once per quarter • Need more frequent technical management forum • Authority to make technical & architectural decisions affecting sw development in WPs • This will be done by a refocused weekly WP managers’ meeting.
Testbed Support • Strengthen user support group • People involved need sufficient knowledge of the software • Emphasis on the usefulness of the responses provided • Tools used for support are a secondary issue • Federate with equivalent groups from other projects • Clarify & document procedures • Site Installation (site managers & ITeam) • Steps for system manager and requirements for a site to join the testbed
Autobuild Procedure • On RH6.2 • All but ~5 of 30 packages build and are packaged without errors. • On RH7.2 • Around 10 of 30 packages fail the “make install” step. • All fail “make rpm” because of rpm command change. • Warning: • Won’t integrate packages unless autobuild procedure works.
Globus • Globus 2.1.2 • Has fix from Condor for the GASS cache problem. • Lot of work to apply to beta-21. • Includes many additional changes to the job manager. • WP1 logging patches no longer work. • Whole LRMS backend changes to Perl framework. • Globus 2.2 • Exists, but… • Some question about what is happening with the MDS 2.2 in this. • Will make Globus Release-24c and test.
Releases 1.2 • EDG 1.2 series NOT suitable for widespread deployment. • EDG 1.2.0 • Available now, known limitations. • EDG 1.2.1 • Deploying now: long jobs, low submission rate. • Deploy multiple resource brokers to reduce problem. • EDG 1.2.2 • GDMP replication fix. • Quick upgrade for sites at 1.2.1. • EDG 1.2.x • Other critical fixes, but very high threshold now.
Releases > 1.2 • EDG 1.3 series will be widely deployed. • EDG 1.3.0 • Upgrade Globus—will contain GASS cache patch from Condor. • Hopefully will also have MDS 2.2. • Subject to testing, will be deployed on application testbed. • EDG 1.3.x • Clean up LCFG objects/configuration. • Modify setup to support new developer guidelines. • EDG 1.4 series will begin incremental inclusion of new middleware functionality • EDG 1.4.x
Some possible increments • New LCFG - WP4 • GridFTP server access to MSS - WP5 • Giggle & Reptor – WP2 • LCAS with dynamic plug-in modules – WP4 • NetworkCost Function – WP7 • Integrate mapcentre (nordugrid?) and R-GMA – WP3 • GLUE modified info providers/consumers – WP1,4,5 • Res. Broker – WP1 • LCFG for RH 7.2 – WP4 • Integration with Condor as batch system – WP4
Documentation • Release Notes: • Exist for 1.2.0 (will be updated for 1.2.1). • User’s Guide: • Exists, but should be considered draft. • please use Bugzilla for comments • Installation Guide: • Won’t be rewritten until Globus upgrade. • Tutorial materials also available.