D0 Grid Data Production Initiative: Coordination Mtg 11
Version 1.0 (meeting edition), 20 November 2008
Rob Kennedy and Adam Lyon
Attending: …
Outline
• Summary and News
• Deployment “Feature List”
  • Details filling in on December Deployment
• Task Status (4 slides)
  • Focus on individual task status, what is needed
• Deployment 1 Plan
  • Focus on overall schedule, task order
Summary and News
• Summary
  • Initiative Deployment 1 Planning Mtg 2 held Monday
  • New Station move completed successfully
  • FWD1-3 Upgrade (with FWD4-5 in prd) done Wed, but full service not yet restored (details to come)
  • QUE1 Upgrade planned for Thu (today) if all agreed
  • Focus Today: Resolve FWD1-3, proceed or not with QUE1
• News and Notes
  • ITSM all-day Workshops this Tue-Fri
  • Running ahead, so Rob K. here today after all
Current Deployment “Feature” Lists
• Deployment 1: Split Data/MC Production Services (NO CHANGE)
  • Time frame: November 13-17, with 1 week+ observation before holidays
  1. Config: Basic Splitting of Fwd, Que Services between Data and MC Production, with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
  2. Fwd4 deployed (w/o virtualization)
  3. Fwd5 deployed
  4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
  5. New SAM Station (moved off of FWD1)
  6. Condor 7 via “new” 1.10.1m official release from UWisc
  7. FileMax increase on all Fwd nodes to handle large nJob actions
  8. D0Runjob Upgrade for Data Production: Prerequisite for deploying new SAM-Grid release
• Deployment 2: Optimize Data and MC Production Configurations
  • Time frame: December 8-10, with 1 week+ observation before holidays
  1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
  2. New SAM-Grid Release with support for new Job status value at Queuing node
  3. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  4. Formalize transfer of support for QUE1 (samgrid.fnal.gov) to FEF from FGS
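Item 7 of Deployment 1 (the FileMax increase) lends itself to a quick post-reboot check on each Fwd node. The sketch below is illustrative only and rests on two assumptions not stated in the slides: that “OpenFileMax” corresponds to the Linux system-wide limit exposed at /proc/sys/fs/file-max, and that the “16k” target in the task list means 16 * 1024. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: "OpenFileMax=16k" is read as 16 * 1024 open files.
REQUIRED_FILE_MAX = 16 * 1024


def parse_file_max(text: str) -> int:
    """Parse the contents of /proc/sys/fs/file-max (a single integer)."""
    return int(text.strip())


def file_max_ok(current: int, required: int = REQUIRED_FILE_MAX) -> bool:
    """Return True if the current limit meets or exceeds the required one."""
    return current >= required


def read_file_max(path: str = "/proc/sys/fs/file-max") -> int:
    """Read the system-wide open-file limit (Linux only; assumed location)."""
    return parse_file_max(Path(path).read_text())


# Example use on a node after the reboot:
#     current = read_file_max()
#     print("OK" if file_max_ok(current) else f"too low: {current}")
```

Such a check could be folded into the automated monitoring tasks already planned for the Fwd nodes, so a wipe and re-install that loses the setting is caught early.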
Deployment 1 Schedule B

Mon 17 Nov 2008
• Depl Plan Mtg 2, 9am
• Test FWD4,5, QUE2 w/ new OpenFileMax (MD,…)
• Backup samgrid products area on FWD1-3. Also /etc/grid-security, globus-gatekeeper (AL)
• Request OpenFileMax change on FWD1-3 (AL)
• Plan QUE1 Upgrade in detail (AL,PM)

Tues 18 Nov 2008
• FWD2 certs expire.
• Test FWD4,5, QUE2 w/ new OpenFileMax
• Backup samgrid products area on QUE1. Also /etc/grid-security, globus-gatekeeper; job_queue, job_history (AL,PM)
• Automated administration/monitoring on QUE1,2: put into a product (AL)

Wed 19 Nov 2008
• FWD1-3 wipe/re-install via umbrella package (JB); increase OpenFileMax (FEF/JB)
  • MC Prod uses FWD4 while this happens
  • Data Prod uses FWD5 while this happens
  • Reboot FWD1-3 to pick up OpenFileMax change
  • Any order of FWD work is OK: all at once or sequentially
• If all goes well, then stop, announce, observe. QUE1 work starts next day.
• Fall-back: restore samgrid products from backup
• SAM Station: Move context server (RI) to new sam station host and observe.

Thu 20 Nov 2008
• Coordination Mtg 9am led by Adam
  • Sign off on FWD work, proceed with QUE1 work
• QUE1 upgrade install via umbrella package (JB)
  • QUE1 has brokering, web page not on QUE2
  • AL: Be careful NOT to wipe state of old jobs…
  • Brokering, Web page should not be touched. We have not fully tested the new deployment of these.
  • Production can use QUE2 while this happens. This has the modest complication of using this queuing node for recovery jobs (impacts Data Prod).
• If all goes well, then stop, announce. Observe.
• Fall-back: restore samgrid products from backup
• Validate the overall configuration matches plan
• Check all monitoring, automated tasks.

Fri 21 – Mon 24 Nov 2008
• Observe system in production

Tues 25 Nov 2008
• Sign-off on D0 Grid Production System

Clean-up: Deferred to December Deployment
• SRM client cert with correct host address
• OS upgrades (old nodes on SLF 4.5 to SLF 4.7)
Deployment 1 Configuration (adapted from Oct 6 proposal, tweaked in meeting)
• Reco
  • FWD1: 1250 (now 750)
  • FWD5: 1250
• MC, MC Merge
  • FWD2: 1250 (now 750)
  • FWD4: 1250
• Reco Merge
  • FWD3: 750/300 grid each
• QUE1: Reco, Reco Merge (keep here to maintain history)
• QUE2: MC, MC Merge
• SAM Station: All
• Jim Client: can submit to QUE1 or QUE2 depending on qualifier
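The split above can be written down as a small lookup table, which makes the Data/MC separation easy to check at a glance. This is only a sketch: the node assignments are transcribed from the slide, but the job-type keys and the `route` function are hypothetical names, and the actual Jim Client qualifier mechanism is not described here.

```python
# Node assignments transcribed from the configuration slide.
# The dictionary keys are hypothetical labels for the job types.
FORWARDING_NODES = {
    "reco": ["FWD1", "FWD5"],
    "mc": ["FWD2", "FWD4"],
    "mc_merge": ["FWD2", "FWD4"],
    "reco_merge": ["FWD3"],
}

QUEUING_NODE = {
    "reco": "QUE1",        # kept on QUE1 to maintain history
    "reco_merge": "QUE1",
    "mc": "QUE2",
    "mc_merge": "QUE2",
}


def route(job_type: str):
    """Return (queuing node, forwarding nodes) for a job type."""
    key = job_type.lower()
    if key not in QUEUING_NODE:
        raise ValueError(f"unknown job type: {job_type!r}")
    return QUEUING_NODE[key], FORWARDING_NODES[key]


# Example: route("reco") gives ("QUE1", ["FWD1", "FWD5"])
```

Keeping the mapping in one table means a later tweak (for example, moving a Fwd node between Data and MC) is a one-line change that can be reviewed against the plan.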
Task Status (1 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.1 Forwarding Node 4 (Fwd4)
  • <Snip some completed tasks>
  • 1.1.1.10 Fwd4: Pre-Deployment OpenFileMax=16k Large-Scale Test | AL | JS,MD,JB | Fri 11/14/08 | Tue 11/18/08 | 3d
  • 1.1.1.14 Fwd4: Setup Automated Maintenance, Monitoring | AL | JB | Wed 11/12/08 | Fri 11/14/08 | 3d
  • 1.1.1.11 Milestone: Fwd4 Ready to Deploy | AL | Tue 11/18/08 | Tue 11/18/08 | 0d
• 1.1.2 Forwarding Node 5 (Fwd5)
  • <Snip some completed tasks>
  • 1.1.2.8 Fwd5: Pre-Deployment OpenFileMax=16k Large-Scale Test | AL | JS,MD,JB | Fri 11/14/08 | Tue 11/18/08 | 3d
  • 1.1.2.13 Fwd5: Setup Automated Maintenance, Monitoring | AL | JB | Wed 11/12/08 | Fri 11/14/08 | 3d
  • 1.1.2.9 Milestone: Fwd5 Ready to Deploy | AL | Tue 11/18/08 | Tue 11/18/08 | 0d
• 1.1.3 Queuing Node 2 (Que2)
  • <Snip some completed tasks>
  • 1.1.3.14 Que2: Setup Automated Maintenance, Monitoring | AL | REX | Thu 11/13/08 | Fri 11/14/08 | 2d
  • 1.1.3.9 Que2: Integration Test w/2-QUE Client | AL | JB | Thu 11/13/08 | Fri 11/14/08 | 2d
  • 1.1.3.11 Milestone: Que2 Ready to Deploy | AL | Fri 11/14/08 | Fri 11/14/08 | 0d
• 1.1.5 New Distinct Sam Station
  • <Snip some completed tasks>
  • 1.1.5.7 Milestone: SAM Station Ready to Deploy | AL | Fri 11/14/08 | Fri 11/14/08 | 0d
• JIRA “Figure out what to do with SRMs” contains “Request and Install SRM-related certs”
Deployment 1 Tasks (2 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.6 Deployment Stage 1
  • <Snip some completed tasks>
  • 1.1.6.2 Deployment 1: Execute | AL | REX | Fri 11/14/08 | Thu 11/20/08 | 5d
    • 1.1.6.2.1 SAM Station: Deactivate old station, Activate new station | AL | RI | Fri 11/14/08 | Fri 11/14/08 | 1d
    • 1.1.6.2.2 FWD1 Upgrade (App,Config,OpenFileMax) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
    • 1.1.6.2.3 FWD2 Upgrade (App,Config,OpenFileMax) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
    • 1.1.6.2.4 FWD3 Upgrade (App,Config,OpenFileMax) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
    • 1.1.6.2.5 QUE1 Upgrade (App,Config) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
    • 1.1.6.2.6 Establish Grid Production Configuration | AL | REX | Thu 11/20/08 | Thu 11/20/08 | 1d
    • 1.1.6.2.7 SAM Station: Setup Context Server | AL | RI | Wed 11/19/08 | Thu 11/20/08 | 2d
    • 1.1.6.2.8 Milestone: Deployment 1 Execution done | AL | Thu 11/20/08 | Thu 11/20/08 | 0d
  • 1.1.6.3 Deployment 1: Monitor | AL | REX | Fri 11/21/08 | Mon 11/24/08 | 2d
  • 1.1.6.4 Deployment 1: Sign-off | AL | REX | Tue 11/25/08 | Tue 11/25/08 | 1d
  • 1.1.6.5 MILE 1: Deployment 1 Completed | AL | Tue 11/25/08 | Tue 11/25/08 | 0d
• 1.1.11 Deployment 1 Review | AL | Mon 12/1/08 | 1d
• Not all starts/durations above are sync’d to the latest Monday plan
  • Meeting on Monday 17 November produced the authoritative schedule (Sched B)
• We cannot deploy later than 20 Nov. (Thursday)… no deploy on Friday or holiday week.
• New Condor is in this deployment too, on all FWD, QUE nodes. THIS is a major risk.
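The durations in the task list appear to count working days inclusively (for example, Fri 11/21/08 to Mon 11/24/08 is listed as 2d), with milestones carried as 0d. A minimal sketch for cross-checking a start/finish pair against its listed duration follows. It assumes a Monday-to-Friday working week and ignores holidays; `working_days` is a hypothetical helper, not part of any project tool.

```python
from datetime import date, timedelta


def working_days(start: date, finish: date) -> int:
    """Count Monday-Friday days from start to finish, inclusive.

    Assumption: no holiday calendar is applied.
    """
    if finish < start:
        raise ValueError("finish precedes start")
    count = 0
    day = start
    while day <= finish:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        day += timedelta(days=1)
    return count


# Example: task 1.1.6.2 runs Fri 11/14/08 to Thu 11/20/08
#     working_days(date(2008, 11, 14), date(2008, 11, 20))
```

A check like this is a quick way to spot rows whose dates have drifted from the authoritative Schedule B.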
Task Status (3 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product
  • 1.1.8.16 New FWD Install Proc/Doc hand-off to REX/Ops | AL | JB | Mon 11/17/08 | Mon 11/17/08 | 0d
  • 1.1.8.6 Umbrella Product: Update FWD Installation Procedure | AL | JB | Mon 11/24/08 | Tue 11/25/08 | 2d
  • 1.1.8.14 Add OpenFileMax setting to FWD Installation Procedure | AL | REX | Wed 11/19/08 | Wed 11/19/08 | 1d
  • 1.1.8.15 New QUE Install Proc/Doc hand-off to REX/Ops | AL | JB | Mon 11/17/08 | Mon 11/17/08 | 0d
  • 1.1.8.10 Umbrella Product: Update QUE Installation Procedure | AL | JB | Mon 11/24/08 | Tue 11/25/08 | 2d
  • 1.1.8.13 Umbrella Product: FWD and QUE Install. Proc. archived | AL | REX | Wed 11/26/08 | Wed 11/26/08 | 1d
  • 1.1.8.11 Milestone: FWD and QUE Packaging with Version-Based Umbrella Product done | GG,AL | Wed 11/26/08
• Notes: …
Task Status (4 of 4)
(Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.3.1 SAM-Grid Job Status Info
  • 1.3.1.7 New Job Status Value at QUE Node: Later Work | GG | PM | Tue 11/18/08 | Mon 11/24/08 | 5d
  • 1.3.1.1 Use "Same" Proxy for Gridftps | GG | PM | Thu 11/20/08 | Mon 11/24/08 | 3d
  • 1.3.1.3 SAM-Grid Release with Job Status Info feature | GG | PM | Tue 11/25/08 | Wed 11/26/08 | 2d
  • 1.3.1.6 Pre-deployment test of new SAM-Grid Release | AL | REX | Mon 12/1/08 | Fri 12/5/08 | 5d
  • 1.3.1.4 Upgrade D0Runjob version used by Data Production | AL | MD,AL | Thu 10/30/08 | Fri 10/31/08 | 2d
  • 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Production | AL | REX | Fri 12/5/08 | Fri 12/5/08 | 0d
• 1.3.2 Slow Fwd-CAB Job Transition
  • Note: FileMax change requires a schedd restart (ST). Work into deployment plans.
• 1.3.3 Improved H/w Uptime
• 1.4 Metrics
  • nSubmissions plot for Sep ’08 Mike?
• Post-Deployment topics and tasks covered in the “Deployment 1 Review”
  • Archiving Installation Instructions with all note-worthy comments in JIRA integrated
  • Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
  • Cost-benefit: push FWD, QUE nodes to be appliances
    • spec’d from OS (including OpenFileMax) to applications to grid system configuration
    • rapid wipe and re-install
• Past Notes: At mercy of off-site gridmap updates… need to use the existing automated system to keep all in sync
  • Also: no remote site has new VDT (which has new VOMS)
  • No installation instructions for durable locations server. Considering for Phase 2 of Initiative.