USCMS Tier-2 in Wisconsin • UW High Energy Physics • Dan Bradley • Sridhara Dasu • Vivek Puttabuddhi • Steve Rader • Don Reeder • Wesley Smith • UW Computer Science • Miron Livny • Sean Murphy • Erik Paulson • Alain Roy • + The Condor Team
Users of Wisconsin Tier-2 • Focus has been on trigger studies and datasets more easily produced outside of official production channels. • CMS users come from institutions worldwide • They log in to use our resources
Datasets in UW-HEP dCache • At this time, only locally simulated data: 11 TB, 39 datasets, 360k files • http://www.hep.wisc.edu/cgi-bin/cms/CMSJug.cgi
Production History for Local and Official Digitization • [Charts: reconstructed events; digitized local datasets]
Analysis Objects • Primary analysis job is L1 Trigger Ntuple maker
Condor • Campus Condor Flocks • [Diagram of Condor pools across the UW campus]
GLOW Equipment @ HEP • GLOW equipment in 3 racks • Storage servers: these are in addition to the older 70-node 2.4 GHz Xeon CPU servers and 10 TB of RAID5 storage
Condor Configuration • UW-HEP • Peaceful preemption of resource claims is achieved with MaxJobRetirementTime = 4 days • This requires Condor 6.7.x, which is still a development branch • GLOW • Each group has highest priority on a fixed set of machines (achieved with machine RANK) • Wish list: hierarchical matchmaking so groups can divide resources internally • Idle machines are distributed via Condor's usual fair-sharing algorithm (with preemption) • The bulk of our resources are used by others on GLOW • Some groups use long-running job slots so their jobs are suspended instead of killed during preemption; others use Condor's checkpoint libraries • A config sketch follows below
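A minimal sketch of the startd settings described above, assuming Condor 6.7-era knob names; the group attribute and its value are hypothetical placeholders, not our actual names:

  # Let a claimed job run up to 4 days before eviction ("peaceful" preemption).
  # MAXJOBRETIREMENTTIME is evaluated in seconds.
  MAXJOBRETIREMENTTIME = 4 * 24 * 60 * 60

  # Give the owning group top priority on its own machines via machine RANK.
  # "AccountingGroup" and "group_hep" are illustrative placeholders.
  RANK = (TARGET.AccountingGroup =?= "group_hep") * 1000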
Grid Services • Currently Grid3-based • Gatekeeper: cmsgrid.hep.wisc.edu • Overflow jobs flock to GLOW and CS (see the flocking sketch below) • All compute nodes are currently RHEL3-compatible, but we cannot rely on a common Linux version in the future • Recently solved several stability problems and have been sustaining a modest load (100-200 running jobs) with no problems • AFS provides a cross-campus shared filesystem for grid jobs • Will upgrade to OSG in mid-to-late May
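A rough sketch of the flocking configuration implied above; the central-manager hostnames are hypothetical placeholders for the GLOW and CS pools:

  # On the UW-HEP submit machines: overflow to GLOW, then CS,
  # when no local slots are available.
  FLOCK_TO = condor.glow.wisc.edu, condor.cs.wisc.edu

  # On the GLOW and CS central managers: accept flocked jobs from UW-HEP.
  FLOCK_FROM = *.hep.wisc.edu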
dCache Issues • Our SRM service is not yet functional • It used to work in RPM 1.2.2-4, so something went wrong in our upgrade to 1.2.2-7-3 • The interim solution has been 3rd-party gridftp to Fermilab, but this service has degraded compared to what used to be available via cmsgridftp.fnal.gov (20 MB/s sustained vs. negligible now) • Problems scaling digitization with pile-up (DSTs) • CPU usage drops to 0 at around 100 digi jobs and degrades badly at even a fraction of that • xrootd running on the same hardware, accessing the same files, runs 280 jobs without noticeable CPU fall-off • dCache was moving 70 times as much data per event!? However, we can't get it to scale to a level where pool nodes are load-bound anyway • Clearly we have something badly tuned!
Internal Services • Extensive Nagios- and NRG-based monitoring • http://noc.hep.wisc.edu/nagios/ • http://noc.hep.wisc.edu/nrg/ • e.g. alerts about degraded RAIDs, dCache pool services, load on the gatekeeper, temperature in the machine room, etc. • System software managed by kickstart + cfengine • Has worked well for us, but it puts us at odds with the dream of a common ROCKS-based solution • JugMaster data production • http://www.hep.wisc.edu/cgi-bin/cms/CMSJug.cgi • Basic idea: a "persistent DAG in a database" (see the DAG sketch below) • Provides a highly scalable queue that may in turn submit jobs to multiple Condor schedds in a fault-tolerant way • This service will become less important as MCPS is used.
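For context, the DAG abstraction that JugMaster persists in a database is the same one Condor's DAGMan works with; a minimal illustrative DAG file (node names and submit files are hypothetical) would be:

  # production.dag -- hypothetical two-stage chain
  # Run simulation, then digitization; retry digitization up to 3 times.
  JOB SIM simulate.sub
  JOB DIGI digitize.sub
  PARENT SIM CHILD DIGI
  RETRY DIGI 3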
Immediate TODO list • Today we have only a taste of Tier-2-like operations, with ~17 users • We need to understand why some users are less active. Is it something we can fix? • We must solve our SRM deployment issues and get connected to the PhEDEx network • Ability to export/import official data • Scale grid-based simulation production to the levels previously sustained (>500 simultaneous jobs) • Resume locally managed production if necessary (i.e. if Craig is too busy)
Tier-2 Recruiting/Purchasing • The crucial item is to fully staff the Tier-2 • System manager: S. Rader, 50% leveraged from UW DOE and 50% Tier-2 appointment • Use the savings to hire a technical assistant • Good experience in the past: Raj, Iyer, Vivek • Operations: new person recruited • We will train this person to lead operations • Physics analysis jobs and simulation support • We will begin recruitment of a software developer to work on DISUN issues with the Condor team • Equipment • Must demonstrate that we can efficiently and fully utilize the resources we have before new purchases • Later this year we will commission a new server room for Tier-2 equipment • Expect new equipment purchases in Fall
Related Activities • Trigger Fault Studies • Working with the Knowledge Management group: B. Chen, L. Chen, R. Ramakrishnan • Trying to automatically detect when trigger behavior changes unexpectedly • Can identify some possible causes • Rapid-response Adaptive Computing Environment: RACE • Supports burst computations for high-priority tasks • Want to claim the full UW campus grid on short notice, ~7000K SI2000 in 2007 • Provide a Glidein-style workspace
Middleware • VDT • Foundation for Grid3 and OSG • Integrates Globus, Condor, MonALISA, and many others • NMI (NSF Middleware Initiative) • Synergy with VDT • Common build and testing system for many platforms • Condor • Condor-G, Condor-C, DAGMan (a Condor-G submit sketch follows below)
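As an illustration of Condor-G, a minimal submit file targeting our Grid3 gatekeeper might look like the following; the grid-universe syntax shown is the later 6.7-era form (earlier releases used the globus universe), and the executable and jobmanager names are hypothetical:

  # Hypothetical Condor-G submit file for the Grid3 gatekeeper.
  universe      = grid
  grid_resource = gt2 cmsgrid.hep.wisc.edu/jobmanager-condor
  executable    = run_simulation.sh
  output        = job.out
  error         = job.err
  log           = job.log
  queue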