D0 Grid Data Production Initiative: Coordination Mtg 12
Version 1.0 (meeting edition)
04 December 2008
Rob Kennedy and Adam Lyon
Attending: …

Outline
• Summary
  • Deployment 1 completed successfully before Thanksgiving, including Reco/MC Prod split.
  • Some follow-up (to be discussed), but nominally working
  • Deployment 2 in just 2 weeks… not much time.
• News
  • Any to report?
• Deployment 1 Follow-up
• Deployment 2 Task Lists

Deployment 1 Status, Follow-up
• Deployment 1: Split Data/MC Production Services - Completed, with follow-up on:
  • a. Were the Queuing nodes set up with the older installer product (lacking gridmap auto-updates)?
  • b. QUE2 health? Monitoring enabled? (I could not see it on ganglia)
  • c. FWD3 not rebooted yet, so it has not picked up the ulimit-FileMaxPerProcess setting
  • d. New Condor version: any evidence yet that the Periodic Expression is no longer blocking the system?
  • e. Integrate experience into the installation procedures and formalize the hand-off from dev to ops.
  • f. (New issue) fcpd version, monitoring, and restart mechanism
• Notes:

Deployment 1 (2) Configuration (adapted from Oct 6 proposal, tweaked in early Nov.)
• Reco
  • FWD1: GridMgrMaxSubmitJobs/Resource = 1250 (was 750, default 100)
  • FWD5: 1250
  • Future: FWD6 replaces FWD5 here
• MC, MC Merge
  • FWD2: 1250 (was 750)
  • FWD4: 1250
  • Future: MC Merge moved to only FWD5
• Reco Merge
  • FWD3: 750/300 grid each
• QUE1: Reco, Reco Merge - keep here to maintain history
• QUE2: MC, MC Merge - not used by MC Prod at first, now is.
• SAM Station: All job types
• Jim Client: can submit to QUE1 or QUE2 depending on qualifier
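
The split described on this slide can be written down as a small lookup table. The following is only a minimal sketch: the dictionary layout and the helper names `queue_for` and `forwarders_for` are hypothetical and are not part of SAM-Grid or Condor. The FWD3 limit is left unset because the slide gives it only as "750/300 grid each".

```python
# Hypothetical model of the Deployment 1 layout taken from the slide.
# Structure and names are illustrative assumptions, not SAM-Grid code.
FORWARDING_NODES = {
    "FWD1": {"job_types": {"Reco"}, "max_submit_jobs_per_resource": 1250},
    "FWD5": {"job_types": {"Reco"}, "max_submit_jobs_per_resource": 1250},
    "FWD2": {"job_types": {"MC", "MC Merge"}, "max_submit_jobs_per_resource": 1250},
    "FWD4": {"job_types": {"MC", "MC Merge"}, "max_submit_jobs_per_resource": 1250},
    # The slide lists FWD3 as "750/300 grid each"; the value is not modeled here.
    "FWD3": {"job_types": {"Reco Merge"}, "max_submit_jobs_per_resource": None},
}

QUEUING_NODES = {
    "QUE1": {"Reco", "Reco Merge"},
    "QUE2": {"MC", "MC Merge"},
}


def queue_for(job_type):
    """Return the queuing node that accepts the given job type."""
    for node, job_types in QUEUING_NODES.items():
        if job_type in job_types:
            return node
    raise KeyError(f"no queuing node configured for {job_type!r}")


def forwarders_for(job_type):
    """Return the sorted list of forwarding nodes serving the given job type."""
    return sorted(
        node
        for node, cfg in FORWARDING_NODES.items()
        if job_type in cfg["job_types"]
    )
```

For example, `queue_for("MC Merge")` yields `"QUE2"` and `forwarders_for("Reco")` yields `["FWD1", "FWD5"]`, matching the Reco/MC split above.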

Deployment 2 “Feature” List
• Deployment 2: Optimize Data and MC Production Configurations after splitting of services in Deployment 1
  • Time frame: December 8-10, with 1 week+ observation before holidays
  • What is still feasible given the 2 weeks to completion?
• 1. Config: Optimize configurations separately for Data and MC Production, especially to increase Data Production “queue” length to reduce the number of “touches” per day.
• 2. New SAM-Grid Release with support for the new Job status value at the Queuing node
• 3. Deploy FWD6 (Samgfwd06). This takes over the current FWD5 role, and FWD5 becomes the MC Merge FWD node.
  • FWD2 and FWD4 appear CPU limited, so this may help by moving the MC Merge load off them.
• 4. Uniform OSes: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  • Not in project plan yet.
• 5. Formalize transfer of support for QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade)

Task Status (1 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product | GG,AL | Mon 10/27/08 | Fri 12/5/08 | 28d
  • 1.1.8.16 New FWD Install Proc/Doc hand-off to REX/Ops | AL | AL | Tue 11/18/08 | Fri 11/21/08 | 4d
  • 1.1.8.6 Umbrella Product: Update FWD Installation Procedure | AL | JB | Mon 12/1/08 | Tue 12/2/08 | 2d
  • 1.1.8.14 Add ulimitOpenFileMax setting to FWD Installation Procedure | AL | REX | Mon 12/1/08 | Mon 12/1/08 | 1d
  • 1.1.8.15 New QUE Install Proc/Doc hand-off to REX/Ops | AL | AL | Tue 11/18/08 | Fri 11/21/08 | 4d
  • 1.1.8.10 Umbrella Product: Update QUE Installation Procedure | AL | JB | Mon 12/1/08 | Tue 12/2/08 | 2d
  • 1.1.8.13 Umbrella Product: FWD and QUE Installation Procedures archived | AL | REX | Wed 12/3/08 | Wed 12/3/08 | 1d
  • 1.1.8.17 Umbrella Product: FWD, QUE Auto-Maint/Monitoring into a package | AL | AL | Thu 12/4/08 | Fri 12/5/08 | 2d
  • 1.1.8.11 Milestone: FWD, QUE Pkging with Version-Based Umbrella Prod done | GG,AL | Fri 12/5/08 | Fri 12/5/08 | 0d
• Tasks to accomplish some of the follow-up are not relabeled as such above.
• 1.1.14 Forwarding Node 6 (Fwd6) --- NEW --- | AL | Mon 12/1/08 | Fri 12/12/08 | 10d
  • 1.1.14.1 Fwd6: Server Hardware OS Install | AL | FEF | Mon 12/1/08 | Tue 12/2/08 | 2d
  • 1.1.14.2 Fwd6: Increase ulimitOpenFileMax to 16k | AL | FEF | Wed 12/3/08 | Wed 12/3/08 | 1d
  • 1.1.14.3 Fwd6: Server Hardware Burn-in | AL | FEF | Thu 12/4/08 | Fri 12/5/08 | 2d
  • 1.1.14.4 Fwd6: Verify Platform Installation | AL | JB | Thu 12/4/08 | Fri 12/5/08 | 2d
  • 1.1.14.5 Fwd6: Request and Install Grid Certs | AL | JB | Thu 12/4/08 | Mon 12/8/08 | 3d
  • 1.1.14.6 Fwd6: Install with Version-Based FWD Umbrella Product | AL | JB | Tue 12/9/08 | Tue 12/9/08 | 1d
  • 1.1.14.7 Fwd6: Single Job Small-Scale Test | AL | JB | Wed 12/10/08 | Wed 12/10/08 | 1d
  • 1.1.14.8 Fwd6: Large-Scale Tests | AL | JB,MD,JS | Thu 12/11/08 | Fri 12/12/08 | 2d
  • 1.1.14.9 Fwd6: Setup Automated Maintenance, Monitoring | AL | JB | Thu 12/11/08 | Fri 12/12/08 | 2d
  • 1.1.14.10 Milestone: Fwd6 Ready to Deploy | AL | Fri 12/12/08 | Fri 12/12/08 | 0d
• Notes:
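
The durations in the task list read as working days between the start and finish dates, inclusive. A minimal sketch of that arithmetic follows. It assumes that weekends are skipped and that Thanksgiving Thursday and the following Friday (27 and 28 November 2008) were non-working days; that holiday choice is an assumption made here so that the 28d figure for task 1.1.8 works out, and it is not stated in the slides. The function name `working_days` is hypothetical.

```python
from datetime import date, timedelta

# Assumption: these two days were treated as holidays in the project plan.
ASSUMED_HOLIDAYS = frozenset({date(2008, 11, 27), date(2008, 11, 28)})


def working_days(start, finish, holidays=ASSUMED_HOLIDAYS):
    """Count weekdays from start to finish inclusive, skipping holidays."""
    count = 0
    day = start
    while day <= finish:
        # weekday() returns 0 for Monday through 6 for Sunday.
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count
```

Under these assumptions, `working_days(date(2008, 12, 1), date(2008, 12, 12))` gives the 10 days listed for task 1.1.14, and `working_days(date(2008, 12, 4), date(2008, 12, 8))` gives the 3 days listed for task 1.1.14.5.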

Task Status (2 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.3.1 SAM-Grid Job Status Info
  • 1.3.1.7 New Job Status Value at QUE Node: Later Work | GG | PM | Tue 11/18/08 | Mon 11/24/08 | 5d
  • 1.3.1.1 Use "Same" Proxy for Gridftps | GG | PM | Thu 11/20/08 | Mon 11/24/08 | 3d
  • 1.3.1.3 SAM-Grid Release with Job Status Info feature | GG | PM | Tue 11/25/08 | Wed 11/26/08 | 2d
  • 1.3.1.6 Pre-deployment test of new SAM-Grid Release | AL | REX | Mon 12/1/08 | Fri 12/5/08 | 5d
  • 1.3.1.4 Upgrade D0Runjob version used by Data Production | AL | MD,AL | Thu 10/30/08 | Fri 10/31/08 | 2d
  • 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Production | AL | REX | Fri 12/5/08 | Fri 12/5/08 | 0d
• 1.4 Metrics
  • nSubmissions plot for Sep ’08: Mike?
  • CPU/Wall time for d0farm is down to 84%
  • “integrated slots used” percentage Sep = 89% >> 64%... There is more behind the low events/day than just the grid layer downtime.
  • … so I started poking around:
    • d0cabsrv2 is swapping much more than d0cabsrv1, with 25% of swap used. It has only 2 GB of memory compared to 4 GB for d0cabsrv1. Add memory.
    • QUE2 missing from ganglia? No info reported other than that the host is up.
• Notes

Task Status (3 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• Tasks not Planned Yet
  • OS Upgrades - time enough to perform, or too much disruption in a short time?
  • Follow-up Tasks, esp. Monitoring - seem high priority given fcpd downtime
• Post-Deployment topics and tasks to be covered in the “Deployment 1 Review”
  • Archiving Installation Instructions with all note-worthy comments in JIRA integrated
  • Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
  • Cost-benefit: push FWD, QUE nodes to be appliances
    • spec’d from OS (including ulimit-FileMaxPerProcess setting) to applications to grid system configuration
    • rapid wipe and re-install
• 1.1.11 Deployment 1 Review | AL | Tue 12/2/08 | Tue 12/2/08 | 1d
  • When??
• 1.1.7 Deployment Stage 2
  • 1.1.7.1 Deployment 2: Plan: Optimize Data, MC Prod Configurations | AL | ALL | Wed 12/3/08 | Fri 12/5/08 | 3d
  • 1.1.7.2 Deployment 2: Execute | AL | REX | Mon 12/8/08 | Wed 12/10/08 | 3d
  • 1.1.7.3 Deployment 2: Monitor | AL | REX | Thu 12/11/08 | Tue 12/16/08 | 4d
  • 1.1.7.8 Deployment 2: Complete Grid Production Configuration | AL | REX | Wed 12/17/08 | Wed 12/17/08 | 1d
  • 1.1.7.4 Deployment 2: Sign-off | AL | REX | Thu 12/18/08 | Thu 12/18/08 | 1d
  • 1.1.7.5 MILE 2: Deployment 2 Completed | AL | Thu 12/18/08 | Thu 12/18/08 | 0d
• 1.1.13 Deployment 2 Review | AL | Fri 12/19/08 | Fri 12/19/08 | 1d
  • Push to January??