
USCMS Tier-2 in Wisconsin

  1. USCMS Tier-2 in Wisconsin
     • UW High Energy Physics: Dan Bradley, Sridhara Dasu, Vivek Puttabuddhi, Steve Rader, Don Reeder, Wesley Smith
     • UW Computer Science: Miron Livny, Sean Murphy, Erik Paulson, Alain Roy
     • + the Condor Team

  2. Users of Wisconsin Tier-2
     • Focus has been on trigger studies and on datasets more easily produced outside of official production channels.
     • CMS users come from institutions worldwide and log in to use our resources.

  3. Datasets in UW-HEP dCache
     • At this time, only locally simulated data: 11 TB, 39 datasets, 360k files. (A sketch of reading a file out of dCache follows this slide.)
     • http://www.hep.wisc.edu/cgi-bin/cms/CMSJug.cgi
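
  Files registered there live in the dCache namespace and can be read back with the standard dCache copy client; a minimal sketch, where the dcap door host, port, and /pnfs path are illustrative assumptions rather than our actual layout.

     # Copy one locally simulated file out of dCache via a dcap door
     # (door hostname, port, and /pnfs path below are placeholders).
     dccp dcap://dcache.hep.wisc.edu:22125/pnfs/hep.wisc.edu/data/cms/sim/hits0001.root /scratch/hits0001.root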

  4. Production History for Local and Official Digitization
     • [Charts: reconstructed events and digitized local datasets.]

  5. Analysis Objects
     • Primary analysis job is the L1 Trigger Ntuple maker.

  6. Campus Condor Flocks
     • [Diagram of the campus Condor flocks.]

  7. GLOW Equipment @ HEP
     • GLOW equipment in 3 racks:
       • Storage servers: [details not recovered]
     • These are in addition to the older 70-node 2.4 GHz Xeon CPU servers and 10 TB RAID5 servers.

  8. Condor Configuration
     • UW-HEP
       • Peaceful preemption of resource claims is achieved with MaxJobRetirementTime = 4 days. This requires Condor 6.7.x, which is still a development branch. (A configuration sketch follows this slide.)
     • GLOW
       • Each group has highest priority on a fixed set of machines (achieved with machine RANK).
       • Wish list: hierarchical matchmaking so groups can divide resources internally.
       • Idle machines are distributed via Condor's usual fair-sharing algorithm (with preemption).
       • The bulk of our resources are used by others on GLOW.
       • Some groups use long-running job slots so their jobs are suspended instead of being killed during preemption; others use Condor's checkpoint libraries.
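
  A minimal sketch of what the execute-node side of this policy could look like in a Condor 6.7.x configuration file; the owner name in the RANK expression is an illustrative placeholder, not the actual UW-HEP/GLOW setting.

     # Startd policy sketch (the owner name below is illustrative).
     # Give a claimed job up to 4 days to finish before it is evicted,
     # i.e. "peaceful preemption"; requires the 6.7.x development series.
     MAXJOBRETIREMENTTIME = 4 * 24 * 60 * 60

     # Rank jobs from a designated owner highest so they reclaim this node
     # when needed; a real setup would list the group's users or a group attribute.
     RANK = (Owner == "cmsprod")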

  9. Grid Services
     • Currently Grid3 based.
     • Gatekeeper: cmsgrid.hep.wisc.edu (a Condor-G submit-file sketch follows this slide).
     • Overflow jobs flock to GLOW and CS.
     • All compute nodes are currently RHEL3 compatible, but we cannot rely on a common Linux version in the future.
     • Recently solved several stability problems and have been sustaining a modest load (100-200 running jobs) with no problems.
     • AFS provides a cross-campus shared filesystem for grid jobs.
     • Will upgrade to OSG in mid-to-late May.
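
  For reference, a hedged sketch of a Condor-G submit file that sends a job through the gatekeeper above; the jobmanager name and the executable/output file names are assumptions.

     # grid_job.sub -- Condor-G (globus universe) sketch.
     # "jobmanager-condor" and the file names are illustrative assumptions.
     universe        = globus
     globusscheduler = cmsgrid.hep.wisc.edu/jobmanager-condor
     executable      = run_analysis.sh
     output          = job.out
     error           = job.err
     log             = job.log
     queue

     # Submit with:  condor_submit grid_job.sub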

  10. dCache Issues
      • Our SRM service is not yet functional.
        • It used to work with RPM 1.2.2-4, so something went wrong in our upgrade to 1.2.2-7-3.
        • The interim solution has been third-party GridFTP to Fermilab, but this service has degraded compared to what used to be available via cmsgridftp.fnal.gov (20 MB/s sustained vs. negligible). (A sketch of such a transfer follows this slide.)
      • Problems scaling digitization with pile-up (DSTs):
        • CPU usage drops to 0 at around 100 digi jobs and degrades badly at even a fraction of that.
        • xrootd running on all the same hardware, accessing the same files, runs 280 jobs without noticeable CPU fall-off.
        • dCache was moving 70 times as much data per event!? However, we can't get it to scale to a level where pool nodes are load bound anyway.
        • Clearly we have something badly tuned!
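
  The interim transfers are plain third-party GridFTP; a sketch of the kind of command involved, where the source endpoint, all paths, and the tuning flags are illustrative assumptions (only cmsgridftp.fnal.gov is named in the slide).

     # Third-party GridFTP transfer toward Fermilab (endpoints other than
     # cmsgridftp.fnal.gov, all paths, and the tuning flags are placeholders).
     globus-url-copy -p 4 -tcp-bs 1048576 -vb \
         gsiftp://cmsgrid.hep.wisc.edu//scratch/sim/hits0001.root \
         gsiftp://cmsgridftp.fnal.gov//pnfs/fnal.gov/usr/cms/import/hits0001.root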

  11. Internal Services
      • Extensive Nagios- and NRG-based monitoring:
        • http://noc.hep.wisc.edu/nagios/
        • http://noc.hep.wisc.edu/nrg/
        • e.g. alerts about degraded RAIDs, dCache pool services, load on the gatekeeper, temperature in the machine room, etc. (A sample Nagios service definition follows this slide.)
      • System software managed by kickstart + cfengine.
        • Has worked well for us, but it puts us at odds with the dream of a common ROCKS-based solution.
      • JugMaster data production:
        • http://www.hep.wisc.edu/cgi-bin/cms/CMSJug.cgi
        • Basic idea: a "persistent DAG in a database".
        • Provides a highly scalable queue that may in turn submit jobs to multiple Condor schedds in a fault-tolerant way.
        • This service will become less important as MCPS is used.
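
  The alerts above are ordinary Nagios checks; a minimal service-definition sketch, where the host name, service description, and plugin command are illustrative assumptions rather than our actual configuration.

     # Nagios object definition sketch: alert when a RAID on a dCache pool
     # node degrades. Host, template, and command names are placeholders.
     define service {
         use                  generic-service
         host_name            dcache-pool01
         service_description  RAID status
         check_command        check_nrpe!check_raid
     }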

  12. Immediate TODO list
      • We have today only a taste of Tier-2-like operations, with ~17 users.
        • We need to understand why some users are less active. Is it something we can fix?
      • We must solve our SRM deployment issues and get connected to the PhEDEx network.
        • Ability to export/import official data.
      • Scale grid-based simulation production to the levels previously sustained (>500 simultaneous jobs).
      • Resume locally managed production if necessary (i.e. if Craig is too busy).

  13. Tier-2 Recruiting/Purchasing
      • Crucial item is to fully staff the Tier-2.
        • System manager: S. Rader, 50% leveraged from UW DOE and 50% Tier-2 appointment.
          • Use the savings to hire a technical assistant; good experience in the past: Raj, Iyer, Vivek.
        • Operations: new person recruited.
          • We will train this person to lead operations: physics analysis jobs and simulation support.
        • We will begin recruitment of a software developer to work on DISUN issues with the Condor team.
      • Equipment
        • Must demonstrate that we can efficiently and fully utilize the resources we have before making new purchases.
        • Later this year we will commission a new server room for Tier-2 equipment.
        • Expect new equipment purchases in the fall.

  14. Related Activities
      • Trigger Fault Studies
        • Working with the Knowledge Management group: B. Chen, L. Chen, R. Ramakrishnan.
        • Trying to automatically detect when trigger behavior changes unexpectedly; can identify some possible causes.
      • Rapid-response Adaptive Computing Environment (RACE)
        • Support burst computations for high-priority tasks.
        • Want to claim the full UW campus grid on short notice, ~7000K SI2000 in 2007.
        • Provide a glidein-style workspace.

  15. Middleware
      • VDT
        • Foundation for Grid3 and OSG.
        • Integrates Globus, Condor, MonALISA, and many others.
      • NMI (NSF Middleware Initiative)
        • Synergy with VDT.
        • Common build and testing system for many platforms.
      • Condor
        • Condor-G, Condor-C, DAGMan. (A minimal DAGMan sketch follows this slide.)
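
  For illustration, a minimal DAGMan input file of the kind these tools build on; the node names, submit-file names, and retry count are assumptions, not part of the original material.

     # simulate.dag -- minimal DAGMan sketch: run a simulation node, then
     # its digitization node; names and retry count are illustrative.
     JOB    SIM   simulate.sub
     JOB    DIGI  digitize.sub
     PARENT SIM   CHILD DIGI
     RETRY  DIGI  3

     # Run with:  condor_submit_dag simulate.dag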
