D0 Grid Data Production Initiative: Coordination Mtg
Version 1.0 (meeting edition)
16 April 2009
Rob Kennedy and Adam Lyon
Attending: RDK, …
D0 Grid Data Production: Overview
• News and Summary
  • All-CAB2 processing ended 4/13.
  • Backlog reached a minimum of 21 MEvts (end of day 4/13), now at 23 MEvts (eod 4/15).
  • One of the end conditions (“until 1 week backlog”, ~35 MEvts) was achieved EARLY.
  • Ending > 2 weeks before the deadline of 4/30/2009.
  • Request by Qizhong ~noon on 4/13. System redefined to have 1856 slots.
• Post-All-CAB2 Operations
  • Archive script ran amok; samgrid.fnal.gov “out of action”.
  • Some of those slots initially held, then opened up for use.
• Agenda
  • News and Status
  • All-CAB2 Processing summary: Adam
  • Next Steps
  • AOB
D0 Grid Data Production: All-CAB2 Processing Issues (1)
• Issues since the PBS upgrade 3/10 on CAB
  • CAB stability issues and the Torque downgrade plan (JA/LH): Discussion
    • See Jason’s slides at: https://d0srv096/mediawiki/index.php/CAB_Server_Downgrade
    • JA presented a list of issues, including “Completed jobs not always updated on pbs_server”.
      • Even with a restart of PBS, slots are not freed up. It eventually does get back in sync, but...
      • Maui disconnects from Torque: also causes jobs to start slowly overall.
      • A defect in the PBS source code causes pbs_moms to segfault if pbs_server is restarted.
    • Some of these problems are new, many are not. The first alone is a deal breaker. Reported by other PBS users.
    • JA proposes a Torque downgrade. Comparison done with LQCD PBS use.
      • Proposing to downgrade Torque on cabsrv1 and not on cabsrv2 (at first).
    • 4/16: Follow-up on status...
  • New: BlueArc impacts FermiGrid (d0cabosg2), which impacts D0 production
    • Occurred 3/27 and 4/4; leads to jobs no longer being submitted (not the death of running jobs).
    • FermiGrid and Storage have met on this issue and are developing a plan. The vulnerability is not a simple problem.
    • Root cause is a very rapid I/O use pattern, including simple open/close over N files.
  • New: Metadata file write failure
    • MD: About 10 times/day, now distinct from 063-caused failures. Almost as if the directory were gone, but not enough information to draw a conclusion (and the timing makes it unlikely that this is the actual cause).
D0 Grid Data Production: All-CAB2 Processing Issues (2)
• Forwarding node load balancing issue (AL): appears to be mitigated. (Fix to be deployed.)
  • Condor dev found a mistake in the Condor Negotiator implementation: not round robin in the code.
  • Simplify new Condor deployments with separate Condor/VDT layers (part of the Phase 2 plan).
  • Short-term mitigation/work-around: “Randomize rank” still seems to work (see the sketch below).
    • It is not clear that we and the developers have a complete understanding of the ranking implementation, timing, etc.
  • FWD4, FWD5 swapped in Data and MC Production to handle the greater load in All-CAB2 processing.
  • Decrease the frequency of queries? (in Condor): We chose last week to defer, but…
    • High load on FWD4 observed, related to the pings (it certainly would have sunk FWD5… good swap).
    • Pings seem to back up after passing some threshold.
  • Tune max grid jobs per FWD node: FWD1 75, FWD4 115. DONE.
    • FWD4 handled 100 over the weekend. FWD1 is less powerful (4 CPU cores vs. 8 CPU cores).
• D0srv063 repair/replace (FEF via tickets): appears to be mitigated.
  • Crash episodes some weeks ago. Kernel panics in XFS code. None since the kernel upgrade.
  • Services did not restart after reboots… and appeared to be set up to do so. Why not?
    • Perhaps the nfs-mounted UPS area is not coming up in time for the boot process?
  • Data flow reworked to avoid ‘063 use for unmerged TMB… go to ‘065, ‘071, or ‘072. Borrowed/added ‘077 as an additional SAM cache storage location. ‘063 is now used only for merged TMB on the way to tape.
  • Look at uptime to determine the long-term disposition of ‘063. (Deliberately rebooted April 2.)
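The slide above notes that the Condor Negotiator was not actually doing round robin and that the “randomize rank” work-around still seems effective. The following is a minimal, purely illustrative Python sketch (not the SAMGrid/Condor code) of why a randomized tie-break spreads submissions across forwarding nodes while a deterministic tie-break piles them onto one node; the node names are placeholders.

```python
import random
from collections import Counter

# Two hypothetical forwarding nodes that tie on their base rank.
fwd_nodes = ["fwd1", "fwd4"]

def pick_deterministic(nodes):
    # With equal ranks and a fixed tie-break, the same node wins every time.
    return sorted(nodes)[0]

def pick_randomized(nodes):
    # "Randomize rank" idea: add a random component so ties break evenly.
    return max(nodes, key=lambda n: random.random())

for picker in (pick_deterministic, pick_randomized):
    counts = Counter(picker(fwd_nodes) for _ in range(1000))
    print(picker.__name__, dict(counts))
```

Running the sketch shows the deterministic picker sending all 1000 trial submissions to one node, while the randomized picker splits them roughly evenly, which is the behavior the mitigation relies on.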
D0 Grid Data Production: All-CAB2 Processing Issues (3)
• Station and Context Server load issues (AL, RI)
  • Issue 1: Station log > 2 GB causes a station crash. Fixed 19 Mar 2009 by RI.
  • Issue 2: Memory used by the Context Server too great. Fixed by RI.
  • Issue(s) 3: Merge jobs
    • A. Restart of the station seemed to kill a bunch of merge jobs.
    • B. Unable to get the job tarball when ‘072 filled up.
    • New issue: “Random” crash due to a project id value coincidence. Accept the risk for now; development possible to fix.
  • Issue 4: Is Data Handling performing as needed to supply the All-CAB2 system with data? Slow ‘071 transfers?
    • A fiber channel cable was replaced in ‘071. Adam will look at plots to see if this resolves the issue.
• New QUE/FWD issues (fix to be deployed)
  • New issue: samgrid.fnal.gov got “stuck” 3/23. The collector was trying to talk to Condor at luhep; a socket got stuck and hung the collector, which hung the monitoring. Bug fixed in Condor 7.2.1.
• Disk space on ‘071, ‘072 (AL, RI)
  • SAM error: “Project already has delivered files.” Related both to tapes being offline and to disk space issues. The statement means “No more files can be delivered” rather than “no more files to be delivered”.
  • More partition space was assigned to one cache area, which led to oversubscription in SAM. Not supposed to happen, but it did. Could we alarm on 100% full, even though 99% full is commonplace (see the sketch below)? … 3/26: these are resolved (AL).
• Overall: the system is well-matched and “random” little problems dominate the failures.
  • It would be effort-intensive to create a detailed histogram of the remaining failure modes. May get some guidance from XML data.
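The disk-space bullet asks whether the system could alarm on 100% full even though 99% full is routine for the SAM caches. A minimal sketch of such a check, suitable for running from cron, is below; the mount points are hypothetical placeholders, not the actual ‘071/‘072 partition names.

```python
import shutil

# Hypothetical SAM cache / durable-storage mount points (placeholders).
PARTITIONS = ["/sam/cache1", "/sam/cache2"]
ALARM_AT = 100.0  # alarm only when completely full; 99% full is commonplace here

def percent_used(path):
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

for path in PARTITIONS:
    pct = percent_used(path)
    if pct >= ALARM_AT:
        # A real deployment would page or e-mail instead of printing.
        print(f"ALARM: {path} is {pct:.1f}% full")
```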
D0 Grid Data Production: Post-All-CAB2 Tasks
• Configuration of the D0GDP system to return to:
  • Data flow reworked to avoid ‘063 use for unmerged TMB… go to ‘065, ‘071, or ‘072. Borrowed/added ‘077 as an additional SAM cache storage location. ‘063 now used only for merged TMB on the way to tape. Keep mostly as is, except: restore ‘077 to its former role as CAB2 disk cache.
  • Forwarding node assignments: keep as is … might as well be ready for the additional CPU coming in summer ’09.
  • Forwarding node and queue limits remain the same (other than the obvious CAB2 queue values): keep as is.
  • AND take the opportunity to check that all configuration is stored/backed-up and easily recoverable (see the sketch below).
• General tasks to close out All-CAB2 processing
  • Gather summary of processing: Adam
  • Arrange executive summary with D0 Spokes: Rob (Apr 17 or May 1?)
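One close-out item is verifying that all configuration is stored and easily recoverable; Phase 2 slide 2.2.4 names CVS as the repository. A minimal sketch of such a check is below; the configuration file paths are invented examples, and a real list would be gathered from the QUE/FWD/CAB nodes themselves.

```python
import os

# Hypothetical configuration files to verify (placeholders only).
CONFIG_FILES = [
    "/etc/condor/condor_config.local",
    "/opt/samgrid/etc/fwd_node.cfg",
]

def in_cvs_checkout(path):
    """True if the file's directory has CVS bookkeeping and the file is listed in CVS/Entries."""
    entries = os.path.join(os.path.dirname(path), "CVS", "Entries")
    if not os.path.exists(entries):
        return False
    with open(entries) as f:
        # CVS/Entries file lines look like "/name/version/timestamp/options/tag".
        return any(os.path.basename(path) == line.split("/")[1]
                   for line in f if line.startswith("/"))

for cfg in CONFIG_FILES:
    status = "OK" if in_cvs_checkout(cfg) else "NOT under CVS"
    print(cfg, "->", status)
```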
D0 Grid Data Production: High-Level Schedule Proposal
• April 2009
  • All-CAB2 Data Processing: March 12 through April 13… DONE
    • Scale back by end of April; re-establish “normal” operations, albeit with deeper queues, etc.
    • Summarize issues and accomplishments … and follow up with an executive summary.
  • 1. CAB Config Review: early to mid April … slipping to late April
    • Cover topics listed in the backup slide (slide 16, see amber bullet “review”). Some topic bullet points may be moot by this time… it will help for people to consider these from their point of view earlier.
    • Production queue with a minimum allotment and maybe a maximum, then allow “opportunistic” use of “analysis” CPU. Needs string arg handling resolved, which also opens up opportunistic use of other farms.
    • Add a high-priority production queue to allow special requests to move ahead of standard production processing.
  • 2. Monitoring Workshop: late April… slipping to early May
    • Assess what we all have now, where our gaps are, and what would be most cost-effective to address.
    • See Gabriele’s white paper on D0 grid job tracking (includes monitoring, focus on OSG). In draft form now.
• May 2009
  • 3. Release new SAMGrid with the added state feature: early May
  • 3. Upgrade the production release of Condor with fixes: early May … modify the Condor/VDT upgrade procedure now, later, or ever?
  • Initiative close-out processes, including Review and Workshop follow-up: mid May
D0 Grid Data Production: Backup Slides
Not expected to be presented at the meeting.
D0 Grid Data Production: Near-Term Work: Action Items (1)
• Merge processing slowdowns (MD, RI)
  • … suspect an issue in the SAM station, possibly DB server related??? Has not done this in a while, and still no news. (This is the Saturday night fun job for RI – so far so good.) Is this related to the load or usage-pattern change since many analyses have completed?
• Test Data Production with opportunistic usage (MD): no news this week
  • MD: Can submit to a specific cluster; FSG prefers submitting more generally with a constraint string. Working on escaping the string to get it all the way through (‘/’ mangled at the ClassAds stage?). PM helping. (See the sketch below.)
  • Recent: GPFarms and CDF nodes worked, but a recent test of CMS nodes no longer did.
  • The All-CAB2 load has pushed the D0GDP system to the edge, however.
    • We choose to defer large-scale work in this area until after the All-CAB2 processing, realizing that opportunistic resources will be less available by that time.
  • CDF pre-emption is conditional on consistent demand for the slot… only THEN does a job get killed after 48 hours. The timer starts with the consistent demand.
• CDF node deployment (FEF): almost done.
  • Put these last few CDF worker nodes into the system once the PBS head nodes are upgraded. Low priority.
  • JA: Desire to have a PBS test stand for validating new releases and testing off to the side.
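The opportunistic-usage item mentions trouble getting a constraint string through the submission layers, with ‘/’ being mangled at the ClassAds stage. The sketch below is purely illustrative: it shows one way to quote a constraint once for the first (shell) hop; the command name and the attributes in the constraint are placeholders, not the actual samgrid interface.

```python
import shlex

# Hypothetical ClassAd-style constraint selecting opportunistic resources.
constraint = 'GlueSiteName == "FNAL_GPFARM" && Role == "/dzero/Role=production"'

# Quote once for the shell so embedded quotes and slashes survive the first hop;
# each additional layer (wrapper scripts, ClassAd parsing) may need its own escaping.
cmd = "submit_job --constraint " + shlex.quote(constraint)  # placeholder command
print(cmd)
```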
D0 Grid Data Production: Near-Term Work: Action Items (2)
• Condor and/or Globus driven issues
  • Condor: Follow up on the Qedit defect. The Condor 7.2 chain has the fix, tested by PM.
    • The fix will not be released in the 7.0 branch since it is not a security issue.
  • Globus: Address the “stuck state” issue affecting both Data and MC Production. PM has some examples.
    • One instance (MC Prod) is due to the LSF job manager implementation; a GOC ticket has been created. Similar issue in other job managers? We need live “stuck” examples to pass to the Condor developers to figure this out. They lack developers to follow up.
• Condor release procedure: Enable REX to deploy Condor as an overlay on top of VDT? (or equivalent)
  • Speeds up deployment of Condor fixes, since we no longer have to wait for the VDT production validation cycle to complete.
  • Decouple this more frequent task from GRID developer schedules.
  • Timeframe: May 2009. Alternative approaches… binary in place… wait for VDT… etc.
• Phase 1 follow-up
  • Enable auto-update of gridmap files on queuing nodes. AL: #2 done. #1 is more complicated, not done… no downtime required though.
    • ST: Why are we still using gridmap files? Could use callouts now. Development, testing, deployment costs.
  • Enable monitoring on queuing nodes. AL: not all done yet.
• Phase 2 notes
  • WHY were the differences in FWD node config not captured beforehand? (FWD4/5 swap)
  • Suggestion: specializations could be localized to a standard configuration variant, e.g. “Data Prod” versus “MC Prod” forwarding node config.
  • Requirement for the future: change config in one place and propagate from there (see the sketch below).
  • Revive progress on the transfer of QUE1 to FEF from FGS. (Action item for RDK): no action yet.
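The Phase 2 note asks that FWD specializations live in named configuration variants with a single point of change. The sketch below illustrates that layering idea only; the parameter names are invented, and the 75/115 values are the max-grid-jobs limits quoted earlier for FWD1 and FWD4.

```python
# Base settings shared by all forwarding nodes; change once, propagate everywhere.
BASE_FWD_CONFIG = {
    "max_grid_jobs": 75,
    "monitoring_enabled": True,
}

# Small, named variants hold only the deliberate specializations.
VARIANTS = {
    "data_prod": {"max_grid_jobs": 115},  # e.g. the more powerful 8-core node
    "mc_prod":   {},                      # inherits the base values unchanged
}

def render_config(variant):
    """Merge the base config with one named variant's overrides."""
    cfg = dict(BASE_FWD_CONFIG)
    cfg.update(VARIANTS[variant])
    return cfg

print(render_config("data_prod"))
print(render_config("mc_prod"))
```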
D0 Grid Data Production: All-CAB2 Processing Plan
• March 10 (Tue): PBS head node upgrade, hardware and PBS software.
  • Done by FEF. Then get 2 days of experience with the upgraded system to isolate “new PBS” issues from “expanded d0farm” issues.
• March 12 (Thu): Increase d0farm over time from ~1800 slots (5.4 MEvts/day) to ~3200 slots (9.5 MEvts/day).
  • The system needs to be stable with the new version of PBS first, restoring the previous service level. All parties sign off; then begin the expansion to All-CAB2.
  • Mike D. will judge how quickly to push this. Enhanced vigilance: everyone looks for bottlenecks or service problems.
• March 16 (Mon): Goal date for d0farm up to the full ~3200 slots.
  • All parties assess the system and sign off on continued running at full size.
• March 17 (Tue) forward: Work through data until any one end condition is met. Monitor weekly.
  • End conditions: May 1, 2009 –or– down to 1 week of backlog remaining –or– D0 analysis users must have slots back.
  • Assume the backlog starts at ~175 MEvts. Based on the ~1800-slot capability at higher recent luminosity values, one week of backlog is about 35 MEvts (estimate).
  • Processing incoming data + 140 MEvts of backlog would require about 4-5 weeks of All-CAB2 processing (see the worked arithmetic below).
    • March 17 – April 28 = 6 weeks, the max allowed (rounding).
  • Exploit opportunistic CPU usage during this time too. Mitigate the risk of higher luminosity leading to slower processing.
  • Assuming everything runs smoothly and Tevatron luminosity stays about where it is now, we could reach the “backlog down to 1 week” end condition after as little as 4 weeks, though more likely after 4.5-5 weeks. Set the stretch goal internally at 4 weeks (April 14).
• April 14 (Tue): Stretch goal to meet an end condition early (down to 1 week of backlog).
  • Evaluate the end conditions in detail. End now –or– continue for 1 more week –or– continue to a max of 2 more weeks? D0 analysis OK?
• April 28 (Tue): Latest sign-off on reverting to the original ~1800-slot system.
  • Ensure all parties are ready for the necessary changes and potential config tweaks (like FWD nodes able to handle the full load each).
• April 30 (Thu): Latest date for reverting to the original ~1800-slot system.
  • Allows a day or so of running in the normal configuration before the weekend.
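The sizing in this plan rests on simple arithmetic: ~1800 slots process about 5.4 MEvts/day, ~3200 slots about 9.5 MEvts/day, the backlog starts near 175 MEvts, and one week of backlog is roughly 35 MEvts. The small worked calculation below reproduces the “about 4-5 weeks” estimate under the assumption (not stated on the slide, but implied by the keep-up language) that incoming data arrives at roughly the 1800-slot rate.

```python
# Figures taken from the plan slide.
rate_1800_slots = 5.4    # MEvts/day processed with ~1800 slots (roughly keep-up)
rate_3200_slots = 9.5    # MEvts/day processed with all of CAB2 (~3200 slots)
backlog_start   = 175.0  # MEvts at the start of All-CAB2 processing
backlog_target  = 35.0   # MEvts, about one week of backlog

# Assume incoming data arrives at about the keep-up rate of the 1800-slot farm.
incoming = rate_1800_slots
net_drain = rate_3200_slots - incoming          # MEvts/day of backlog removed

days = (backlog_start - backlog_target) / net_drain
print(f"net drain: {net_drain:.1f} MEvts/day, ~{days:.0f} days (~{days / 7:.1f} weeks)")
```

With these inputs the net drain is about 4.1 MEvts/day, giving roughly 34 days, i.e. just under 5 weeks, consistent with the 4-5 week estimate and the 4-week stretch goal.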
D0 Grid Data Production: All-CAB2 Processing Project
• Early March through April 2009: Keep-Up Level + Work-through-Backlog Level
• Temporary expanded CAB2 use by Data Production, 2/20/2009 via email:
  “Regarding temporarily using the whole CAB2 for the production, D0 management has made a decision that from March 10, we will temporarily expand the d0 farm queue to be the whole CAB2. The purpose is to catch up the backlog in data production for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following condition happens:
  - when the backlog has been reduced to be less than one week of data; or
  - May 1, 2009, or
  - when there is an analysis need for more CPUs than CAB1 can provide.
  Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will have 1800+1400 job slots, temporarily.
  Thank you, Qizhong”
D0 Grid Data Production: Phase 2 Work List Outline
• 2.1 Capacity Management: Data Prod is not keeping up with data logging.
  • Capacity Planning: Model nEvents per day; forecast the CPU needed.
  • Capacity Deployment: Procure, acquire, borrow CPU. We believe the infrastructure is capable.
  • Resource Utilization: Use what we have as much as possible. Maintain improvements.
• 2.2 Availability & Continuity Management: The expanded system needs higher reliability.
  • Decoupling: deferred. Phase 1 work has proven sufficient for the near term.
  • Stability, Reduced Effort: Deeper queues. The goal is fewer manual submissions per week.
  • Resilience: Add/improve redundancy at the infrastructure-service and CAB level.
  • Configuration Recovery: Capture configuration and artefacts in CVS consistently.
• 2.3 Operations-Driven Projects
  • Monitoring: Execute a workshop to share what we have, identify gaps and cost/benefits.
  • Issues: Address the “stuck state” issue affecting both Data and MC Production.
  • Features: Add state at the queuing node (from Phase 1). Distribute jobs “evenly” across FWD nodes.
  • Processes: Enable REX/Ops to deploy new Condor… new bug fixes are coming soon.
  • Phase 1 Follow-up: A few minor tasks remain from the rush to deploy… dotting the i’s and crossing the t’s.
  • Deferred Work List: maintain, with reasons for deferring work.
D0 Grid Data Production: 2.1 Capacity Management
• Highest priority: will treat in Phase 2. Refine the work list to be more detailed.
• Mostly covered in “short-term work”…
• 2.1.1 Capacity Planning: Model nEvents per day; improve to handle different Tevatron store profiles, etc.
  • Goal: What is required to keep up with data logging by 01 April 2009?
  • Goal: What is required to reduce the backlog to 1 week’s worth by 01 June 2009? … Use all of CAB2 to catch up in Mar/Apr.
  • Follow-up: Planning meeting Monday?
  • Goal: What infrastructure is required to handle the CPU capacity determined above?
    • Have latencies impacted CPU utilization? MD and Adam are looking into this (consider oversubscription if so).
    • Ling Ho: 8-core servers (2008) have scratch/data on the system disk (only 1 disk). Contention has been observed, slowing jobs down.
    • Data Handling tuning (AL, RI): tune the SAM station; investigate whether CPUs are waiting on data.
• 2.1.2 Capacity Deployment: Goals to be determined by capacity planning
  • Added retired CDF nodes (FEF, REX): 17 Feb 2009
  • Upgrade PBS head nodes (FEF): 10 March 2009
  • Plan and execute the temporary expansion of “d0farm” onto CAB2 to work through the backlog in March and into April.
  • Plan and execute some level of opportunistic use of CPU for data production.
  • Infrastructure capacity: appears sufficient for a 25% CPU increase, probably OK for a larger increase.
  • Fcp config: achieved the optimal config to max out network bandwidth? Done.
• 2.1.3 Resource Utilization: top-down investigation as well as bottom-up investigation. Examples…
  • Goal: > 90% of available CPU used by Data Prod (assuming demand-limited at all times).
  • Goal: > 90% of available job slots used by Data Prod, averaged over time (assuming demand-limited at all times). See the sketch below.
  • Goal: TBD… uptime goal, downtimes limited. Activities to maintain/achieve this overlap with the following…
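Section 2.1.3 sets goals of more than 90% of available CPU and job slots being used by Data Prod while the system is demand-limited. A minimal sketch of how slot utilization could be computed from periodic occupancy samples is below; the sample numbers are invented, with 1856 taken from the redefined slot count quoted earlier in the deck.

```python
# Hypothetical samples: (busy_slots, available_slots) taken at regular intervals.
samples = [(1720, 1856), (1790, 1856), (1610, 1856), (1840, 1856)]

busy = sum(b for b, _ in samples)
avail = sum(a for _, a in samples)
utilization = 100.0 * busy / avail   # time-averaged slot utilization

goal = 90.0
print(f"slot utilization: {utilization:.1f}% (goal > {goal}%)")
```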
D0 Grid Data Production: 2.2 Availability & Continuity Mgmt
• Some tasks should be done, others deferred to after April (post-Initiative).
  • The Initiative will keep a list of these for the historical record.
• 2.2.1 Decoupling
  • All work deferred post-Initiative.
• 2.2.2 Queue Optimization for Stability, High Utilization, Reduced Effort
  • Goal: System runnable day-to-day by a Shifter, supervised by an expert Coordinator.
  • Deeper queues.
  • Increase limits to allow either FWD node to handle the entire load if the other were to go down.
  • Revisit after the PBS head node upgrade and experience with the larger-capacity system.
• 2.2.3 Resilience
  • Eliminate SPOFs; add/improve redundancy (see following slides).
• 2.2.4 Configuration Recovery
  • Capture configuration and artefacts in CVS consistently. (Continue to do.)
D0 Grid Data Production: FGS Recommendations
• Configure d0cabosg[1,2] to have the capability to manage jobs on both d0cabsrv[1,2]. (Yes, do in Phase 2, by FGS, at low priority.)
  • Requires a bit of “hackery” in the jobmanager-pbs script, but is doable. Both d0cabosg[1,2] would have jobmanager-d0cabsrv[1,2]; jobmanager-pbs would become a symlink to the appropriate jobmanager-d0cabsrv[1,2].
  • This would be in addition to the work being performed by FermiGrid on high-availability gatekeepers.
• Open up additional slots for opportunistic use on both clusters (cab1 first).
  • Ideally make all PBS job slots available for opportunistic use by the Grid. Research/develop an automatic “eviction” policy for PBS when a slot is needed by dzero (as with Condor). (Requires D0 CPB input; should go through QL. This is in-kind for D0 opportunistic use.)
• Review Globus errors: Adam’s text table of held reasons.
• Conduct review(s) covering the following topics, and plan for future work:
  • Consider extra queues to segregate D0 MC production (J. Snow) from other VO opportunistic usage.
  • Consider adding extra roles in the dzero VO (such as /Role=mc or /Role=monte-carlo).
  • Investigate whether special-purpose d0farm nodes are still needed. FEF, FGS, Grid Dev. If so, do the coding that should have been done long ago to report them accurately.
  • Review the layout of D0 resources advertising to ReSS, to see if it can be done in a more uniform way as opposed to the special-case hackery done now for CAB. (Related to the above 2 bullets.)
  • Do we want to have VOMS:/dzero/users access rules?
  • Can d0cabsrv1 worker nodes be increased to 10 GB of scratch (like d0cabsrv2 worker nodes already have) instead of 4 GB? Irrelevant due to retirements?
  • A “higher RAM per node” pool for large-memory jobs?
  • Eliminate specialized queues in favor of priorities to allow greater CPU utilization.
  • Load-balancing/combining CAB1/CAB2 (avoid users manually load balancing).
D0 Grid Data Production: 2.3 Operations-Driven Projects
• 2.3.1 Monitoring
  • Workshop to assess what we all have now, where our gaps are, and what would be most cost-effective to address.
  • Can we “see” enough in real time? Collect what we all have, define requirements (within the resources available), and execute.
  • Can we trace jobs “up” as well as down? Enhance the existing script to automate batch-job-to-grid-job “drill-up”? (See the sketch below.)
• 2.3.2 Issues
  • Address the “stuck state” issue affecting both Data and MC Production. PM has some examples. Update?
  • Large job issue from Mike? (RDK to research what this was. Was it 2+ GB memory use? If so, add memory to a few machines to create a small “BIG MEMORY” pool?)
• 2.3.3 Features
  • Add state at the queuing node (from Phase 1 work). Waiting on Condor dev. PM is following up on this; GG to try to push it.
  • FWD load balancing: Distribute jobs “evenly” across FWD nodes… however easiest to do or approximate.
• 2.3.4 Processes
  • Enable REX/Ops to deploy new Condor. In Phase 2, but lower priority. Condor deployments for bug fixes are coming up.
  • Revisit capacity/config twice a year? Continuous Service Improvement Plan: RDK, towards the end of Phase 2.
• 2.3.5 Phase 1 Follow-up
  • Enable auto-update of gridmap files on queuing nodes. Enable monitoring on queuing nodes. AL: partly done.
  • Lessons learned from recent ops experience: <to be discussed> (RDK to revive the list, reconsider towards the end of Phase 2).
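The monitoring item asks whether an existing script could be enhanced to automate the “drill-up” from a batch job to the grid job that submitted it. The sketch below only illustrates the lookup step, assuming a hypothetical text file of “batch_id grid_id” pairs; the real mapping would come from forwarding-node logs or the XML database, not this placeholder file.

```python
import sys

MAPPING_FILE = "/var/log/samgrid/batch_to_grid.map"  # hypothetical location

def load_mapping(path):
    """Read 'batch_id grid_id' pairs, one per line, into a dictionary."""
    mapping = {}
    with open(path) as f:
        for line in f:
            batch_id, grid_id = line.split()
            mapping[batch_id] = grid_id
    return mapping

if __name__ == "__main__":
    batch_id = sys.argv[1]  # e.g. a PBS job id such as 123456.d0cabsrv1
    grid_id = load_mapping(MAPPING_FILE).get(batch_id)
    print(grid_id or f"no grid job recorded for batch job {batch_id}")
```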
D0 Grid Data Production: Phase 2 Deferred Work List (1): Do not do until proven necessary and worthwhile
• Phase 1 follow-up work
  • Uniform OSes: Upgrade FWD1-3 and QUE1 to the latest SLF 4.0, same as FWD4-5.
    • Only a minor OS version difference. Wait until this is needed, to avoid another disruption and risk.
• 2.2.1 Decoupling
  • @ SAM station… no proven value after the station bug fix? DEFER
  • @ Durable Storage… does decoupling this have as much value now? DEFER
  • Virtualization & FWD4 (FWD5 was used for decoupling; would like to retire it). DEFER
    • Estimate for FEF readiness to deploy virtualization? JA: Months away… (starting from 0, using Virtual Iron).
• 2.2.2 Queue Optimization for Stability, High Utilization, Reduced Effort
  • A few long batch jobs holding grid job slots: not an issue since we split QUE/FWD. DEFER
• 2.2.3 Resilience
  • Means to implement or emulate Durable Storage fail-over? (hardwired config in FWD nodes) DEFER
• 2.3.2 Issues
  • Context Server issues? We believe these are cured by RI’s cron’d restart. DEFER
D0 Grid Data Production: Phase 2 Deferred Work List (2): Do not do until proven necessary and worthwhile
• FGS recommendations
  • Add slots for I/O-bound jobs on worker nodes. (Defer. Not as motivated now by the current job mix; may have been motivated by unusual jobs. Needs study.)
    • 1-2 additional slots per system to handle I/O.
  • Balance dzeropro job slots across both clusters. (Defer; reconsider with new worker nodes.)
    • Moving the corresponding worker nodes is probably best, since it would also balance job slots.
    • This mitigates what the next bullet would attempt to provide outright. Could be part of the CAB config review too.
  • Explore mechanisms for redundant pbs masters. (FEF. Long time scale?)
    • Will allow access to the worker nodes to continue even if the “primary” pbs master is down.
  • Monitor the samgrid forwarding nodes, look at the Globus errors going into CAB, and see if there are any patterns.
  • Consider extra queues to segregate D0 MC production (J. Snow) from other VO opportunistic usage.
  • Consider adding extra roles in the dzero VO (such as /Role=mc or /Role=monte-carlo, for instance).
  • Investigate whether special-purpose d0farm nodes are still needed. FEF, FGS, Grid Dev.
    • If so, do the coding that should have been done long ago to report them accurately.
  • Review the layout of D0 resources advertising to ReSS, to see if it can be done in a more uniform way as opposed to the special-case hackery done now for CAB. (Related to the above 2 bullets.)
  • Do we want to have VOMS:/dzero/users access rules?
  • Can d0cabsrv1 worker nodes be increased to 10 GB of scratch (like d0cabsrv2 worker nodes already have) instead of 4 GB?
D0 Grid Data Production: Data Flow for Data Production
[Diagram: data flow for data production. Components shown: Enstore LTO4-G supplying raw data and job tarballs (transfers initiated by the Reco job) to the SAM cache on d0srv071/d0srv072 (which also hosts the 0-bias skim and LCG cache); worker node scratch space; unmerged TMB and other data destined for tape storage, plus IN2P3 remote uploads, written (initiated by the merge job, via gridftp) to the Durable Store / Stager space on d0srv063/d0srv065; merged TMB going to Enstore LTO4-F.]
Notes: Durable Storage and Stager Space are on separate partitions, shared with analysis users. No automated failover between ‘063 and ‘065.