
D0 Grid Data Production Initiative: Coordination Mtg 12

Version 1.0 (meeting edition) 04 December 2008 Rob Kennedy and Adam Lyon Attending: …. D0 Grid Data Production Initiative: Coordination Mtg 12. Outline. Summary Deployment 1 completed successfully before Thanksgiving, including Reco/MC Prod split.


Presentation Transcript


  1. D0 Grid Data Production Initiative: Coordination Mtg 12
     Version 1.0 (meeting edition), 04 December 2008
     Rob Kennedy and Adam Lyon
     Attending: …

  2. D0 Grid Data Production: Outline
     • Summary
       • Deployment 1 completed successfully before Thanksgiving, including Reco/MC Prod split.
       • Some follow-up (to be discussed), but nominally working
       • Deployment 2 in just 2 weeks… not much time.
     • News
       • Any to report?
     • Deployment 1 Follow-up
     • Deployment 2 Task Lists

  3. D0 Grid Data Production: Deployment 1 Status, Follow-up
     • Deployment 1: Split Data/MC Production Services – Completed, with follow-up on:
       • a. Were Queuing nodes set up with the older installer product (lacking gridmap auto-updates)?
       • b. QUE2 health? Monitoring enabled? (I could not see it on ganglia)
       • c. FWD3 not rebooted yet, so it has not picked up ulimit-FileMaxPerProcess
       • d. New Condor version: any evidence yet that Periodic Expression is no longer blocking the system?
       • e. Integrate experience into installation procedures and formalize hand-off from dev to ops.
       • f. (New issue) fcpd version, monitoring, and restart mechanism
     • Notes:

  4. D0 Grid Data Production: Deployment 1 (2) Configuration (adapted from Oct 6 proposal, tweaked in early Nov.)
     • Reco
       • FWD1: GridMgrMaxSubmitJobs/Resource = 1250 (was 750, default 100)
       • FWD5: 1250
       • Future: FWD6 replaces FWD5 here
     • MC, MC Merge
       • FWD2: 1250 (was 750)
       • FWD4: 1250
       • Future: MC Merge moved to only FWD5
     • Reco Merge
       • FWD3: 750/300 grid each
     • QUE1: Reco, Reco Merge – keep here to maintain history
     • QUE2: MC, MC Merge – not used by MC Prod at first, now is.
     • SAM Station: All job types
     • Jim Client: can submit to QUE1 or QUE2 depending on qualifier
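The node assignments on this slide can be written down as a small lookup table. The following is only an illustrative sketch in Python: the dictionary layout and the function names are hypothetical and not part of SAM-Grid or Condor, and the FWD3 entry uses only the first figure of "750/300 grid each" because the slide does not spell out what the second figure means.

```python
# Hypothetical model of the Deployment 1 layout described on the slide.
# All names below are illustrative; none come from SAM-Grid or Condor.
FORWARDING_NODES = {
    "FWD1": {"job_types": ("Reco",), "max_submit_per_resource": 1250},
    "FWD5": {"job_types": ("Reco",), "max_submit_per_resource": 1250},
    "FWD2": {"job_types": ("MC", "MC Merge"), "max_submit_per_resource": 1250},
    "FWD4": {"job_types": ("MC", "MC Merge"), "max_submit_per_resource": 1250},
    # Assumption: only the "750" of "750/300 grid each" is recorded here.
    "FWD3": {"job_types": ("Reco Merge",), "max_submit_per_resource": 750},
}

QUEUING_NODES = {
    "QUE1": ("Reco", "Reco Merge"),
    "QUE2": ("MC", "MC Merge"),
}


def queuing_node_for(job_type):
    """Return the queuing node that handles the given job type."""
    for node, job_types in QUEUING_NODES.items():
        if job_type in job_types:
            return node
    raise KeyError(f"no queuing node configured for {job_type!r}")


def forwarding_nodes_for(job_type):
    """Return the sorted forwarding nodes that serve the given job type."""
    return sorted(
        name for name, cfg in FORWARDING_NODES.items()
        if job_type in cfg["job_types"]
    )


def submit_limit_per_resource(job_type):
    """Sum the per-resource submit limits over the nodes serving a job type."""
    return sum(
        FORWARDING_NODES[name]["max_submit_per_resource"]
        for name in forwarding_nodes_for(job_type)
    )


if __name__ == "__main__":
    for jt in ("Reco", "MC", "MC Merge", "Reco Merge"):
        print(jt, queuing_node_for(jt), forwarding_nodes_for(jt),
              submit_limit_per_resource(jt))
```

Summing the limits over forwarding nodes assumes the nodes submit independently to the same resource; that is a simplification for illustration, not a statement about how the production system accounts for jobs.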

  5. D0 Grid Data Production: Deployment 2 “Feature” List
     • Deployment 2: Optimize Data and MC Production Configurations after splitting of services in deployment 1
     • Time frame: December 8-10, with 1 week+ observation before holidays
     • What is still feasible given the 2 weeks to completion?
       • 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length to reduce number of “touches” per day.
       • 2. New SAM-Grid Release with support for new Job status value at Queuing node
       • 3. Deploy FWD6 (Samgfwd06). This takes over current FWD5 role, and FWD5 becomes MC Merge FWD node.
         • FWD2,4 appear CPU limited, so this may help by moving MC Merge load off.
       • 4. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
         • Not in project plan yet.
       • 5. Formalize transfer of support for QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade)

  6. D0 Grid Data Production: Task Status (1 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product "GG,AL" Mon 10/27/08 Fri 12/5/08 28d
       • 1.1.8.16 New FWD Install Proc/Doc hand-off to REX/Ops AL AL Tue 11/18/08 Fri 11/21/08 4d
       • 1.1.8.6 Umbrella Product: Update FWD Installation Procedure AL JB Mon 12/1/08 Tue 12/2/08 2d
       • 1.1.8.14 Add ulimitOpenFileMax setting to FWD Installation Procedure AL REX Mon 12/1/08 Mon 12/1/08 1d
       • 1.1.8.15 New QUE Install Proc/Doc hand-off to REX/Ops AL AL Tue 11/18/08 Fri 11/21/08 4d
       • 1.1.8.10 Umbrella Product: Update QUE Installation Procedure AL JB Mon 12/1/08 Tue 12/2/08 2d
       • 1.1.8.13 Umbrella Product: FWD and QUE Installation Procedures archived AL REX Wed 12/3/08 Wed 12/3/08 1d
       • 1.1.8.17 "Umbrella Product: FWD, QUE Auto-Maint/Monitoring into a package" AL AL Thu 12/4/08 Fri 12/5/08 2d
       • 1.1.8.11 Milestone: FWD, QUE Pkging with Version-Based Umbrella Prod done "GG,AL" Fri 12/5/08 Fri 12/5/08 0d
     • Tasks to accomplish some of the follow-up are not relabeled as such above.
     • 1.1.14 Forwarding Node 6 (Fwd6) --- NEW --- AL Mon 12/1/08 Fri 12/12/08 10d
       • 1.1.14.1 Fwd6: Server Hardware OS Install AL FEF Mon 12/1/08 Tue 12/2/08 2d
       • 1.1.14.2 Fwd6: Increase ulimitOpenFileMax to 16k AL FEF Wed 12/3/08 Wed 12/3/08 1d
       • 1.1.14.3 Fwd6: Server Hardware Burn-in AL FEF Thu 12/4/08 Fri 12/5/08 2d
       • 1.1.14.4 Fwd6: Verify Platform Installation AL JB Thu 12/4/08 Fri 12/5/08 2d
       • 1.1.14.5 Fwd6: Request and Install Grid Certs AL JB Thu 12/4/08 Mon 12/8/08 3d
       • 1.1.14.6 Fwd6: Install with Version-Based FWD Umbrella Product AL JB Tue 12/9/08 Tue 12/9/08 1d
       • 1.1.14.7 Fwd6: Single Job Small-Scale Test AL JB Wed 12/10/08 Wed 12/10/08 1d
       • 1.1.14.8 Fwd6: Large-Scale Tests AL "JB,MD,JS" Thu 12/11/08 Fri 12/12/08 2d
       • 1.1.14.9 "Fwd6: Setup Automated Maintenance, Monitoring" AL JB Thu 12/11/08 Fri 12/12/08 2d
       • 1.1.14.10 Milestone: Fwd6 Ready to Deploy AL Fri 12/12/08 Fri 12/12/08 0d
     • Notes:
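Tasks 1.1.8.14 and 1.1.14.2 concern the per-process open-file limit. A check along these lines could be run on a node after installation or reboot. This is a minimal sketch that assumes the slide's "ulimitOpenFileMax" corresponds to the POSIX `RLIMIT_NOFILE` limit and that "16k" means 16384; the function names are hypothetical, and Python's standard `resource` module is Unix-only.

```python
import resource

# Assumption: "16k" on the slide is read as 16 * 1024 open files.
REQUIRED_OPEN_FILES = 16 * 1024


def open_file_limit_ok(soft_limit, required=REQUIRED_OPEN_FILES):
    """Return True if a soft open-file limit meets the required value."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # an unlimited setting always satisfies the requirement
    return soft_limit >= required


def check_current_process():
    """Return (soft, hard, ok) for the open-file limit of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard, open_file_limit_ok(soft)


if __name__ == "__main__":
    soft, hard, ok = check_current_process()
    print(f"soft={soft} hard={hard} meets_16k={ok}")
```

The check reports the limit inherited by the Python process itself, so it would need to run in the same environment (user, login shell, init script) as the grid services for the result to be meaningful.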

  7. D0 Grid Data Production: Task Status (2 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • 1.3.1 SAM-Grid Job Status Info
       • 1.3.1.7 New Job Status Value at QUE Node: Later Work GG PM Tue 11/18/08 Mon 11/24/08 5d
       • 1.3.1.1 Use "Same" Proxy for Gridftps GG PM Thu 11/20/08 Mon 11/24/08 3d
       • 1.3.1.3 SAM-Grid Release with Job Status Info feature GG PM Tue 11/25/08 Wed 11/26/08 2d
       • 1.3.1.6 Pre-deployment test of new SAM-Grid Release AL REX Mon 12/1/08 Fri 12/5/08 5d
       • 1.3.1.4 Upgrade D0Runjob version used by Data Production AL "MD,AL" Thu 10/30/08 Fri 10/31/08 2d
       • 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Production AL REX Fri 12/5/08 Fri 12/5/08 0d
     • 1.4 Metrics
       • nSubmissions plot for Sep ’08: Mike?
       • CPU/Wall time for d0farm is down to 84%
       • “integrated slots used” percentage Sep = 89% >> 64%... There is more to the low events/day than just the grid layer downtime.
       • … so I started poking around:
         • d0cabsrv2 is swapping much more than d0cabsrv1, with 25% of swap used. It has only 2 GB memory compared to 4 GB for d0cabsrv1. Add memory.
         • QUE2 missing from ganglia? No info reported other than host is up.
     • Notes
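The "25% of swap used" observation for d0cabsrv2 is the kind of figure that can be derived from the `SwapTotal` and `SwapFree` fields of Linux `/proc/meminfo`. The sketch below shows one way to compute it; the function name and the sample values are made up for illustration and are not measurements from the D0 nodes.

```python
def swap_used_fraction(meminfo_text):
    """Return used swap as a fraction of total swap (0.0 if no swap)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return (total - free) / total


# Illustrative sample only; not real data from any D0 node.
SAMPLE_MEMINFO = """\
MemTotal:        2000000 kB
MemFree:          100000 kB
SwapTotal:       4000000 kB
SwapFree:        3000000 kB
"""

if __name__ == "__main__":
    print(f"{swap_used_fraction(SAMPLE_MEMINFO):.0%} of swap used")
```

On a live Linux host the same function could be applied to the contents of `/proc/meminfo` read as text.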

  8. D0 Grid Data Production: Task Status (3 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
     • Tasks not Planned Yet
       • OS Upgrades – time enough to perform or too much disruption in a short time?
       • Follow-up Tasks, esp. Monitoring – seem high priority given fcpd downtime
     • Post-Deployment topics and tasks to be covered in the “Deployment 1 Review”
       • Archiving Installation Instructions with all note-worthy comments in JIRA integrated
       • Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
       • Cost-benefit: push FWD, QUE nodes to be appliances
         • spec’d from OS (including ulimit-FileMaxPerProcess setting) to applications to grid system configuration
         • rapid wipe and re-install
     • 1.1.11 Deployment 1 Review AL Tue 12/2/08 Tue 12/2/08 1d
       • When??
     • 1.1.7 Deployment Stage 2
       • 1.1.7.1 "Deployment 2: Plan: Optimize Data, MC Prod Configurations" AL ALL Wed 12/3/08 Fri 12/5/08 3d
       • 1.1.7.2 Deployment 2: Execute AL REX Mon 12/8/08 Wed 12/10/08 3d
       • 1.1.7.3 Deployment 2: Monitor AL REX Thu 12/11/08 Tue 12/16/08 4d
       • 1.1.7.8 Deployment 2: Complete Grid Production Configuration AL REX Wed 12/17/08 Wed 12/17/08 1d
       • 1.1.7.4 Deployment 2: Sign-off AL REX Thu 12/18/08 Thu 12/18/08 1d
       • 1.1.7.5 MILE 2: Deployment 2 Completed AL Thu 12/18/08 Thu 12/18/08 0d
     • 1.1.13 Deployment 2 Review AL Fri 12/19/08 Fri 12/19/08 1d
       • Push to January??
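The durations in the task lists appear to count working days between the start and finish dates, inclusive. The sketch below reproduces a few of them under that assumption; the function name is hypothetical, and treating 27 and 28 November 2008 as non-working days (the Thanksgiving break) is a guess that happens to match the 28d figure for task 1.1.8, not something the slides state.

```python
from datetime import date, timedelta


def working_days(start, finish, holidays=()):
    """Count Monday-to-Friday days from start to finish, inclusive."""
    skipped = set(holidays)
    count = 0
    day = start
    while day <= finish:
        if day.weekday() < 5 and day not in skipped:  # 0-4 are Mon-Fri
            count += 1
        day += timedelta(days=1)
    return count


# Assumption: these two days were treated as holidays in the project plan.
ASSUMED_HOLIDAYS = (date(2008, 11, 27), date(2008, 11, 28))

if __name__ == "__main__":
    # 1.1.7.3 Deployment 2: Monitor, Thu 12/11/08 to Tue 12/16/08
    print(working_days(date(2008, 12, 11), date(2008, 12, 16)))
    # 1.1.14 Forwarding Node 6, Mon 12/1/08 to Fri 12/12/08
    print(working_days(date(2008, 12, 1), date(2008, 12, 12)))
    # 1.1.8 FWD and QUE Packaging, Mon 10/27/08 to Fri 12/5/08
    print(working_days(date(2008, 10, 27), date(2008, 12, 5), ASSUMED_HOLIDAYS))
```

A helper like this can be useful when judging whether a slipped task still fits before the holidays, which is the concern behind the "Push to January??" note.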
