Run II Experiments and the Grid
Amber Boehnlein, Fermilab
September 16, 2005
DØ Status
• DØ is running SAMGrid for MC production and reprocessing
• SAMGrid is a 1st-generation production system
• Typical configuration, installation, and robustness issues are being addressed
• The LCG-SAMGrid interoperability prototype is going well
• An OSG resource selector will be developed to provide functionality similar to that available with LCG (see the sketch below)
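A minimal sketch of what a resource selector does, under assumed toy conditions: the Site fields, the ranking rule, and the site names below are hypothetical and are not the SAMGrid or OSG implementation. The snippet only illustrates the kind of matchmaking such a component performs (prefer sites that already hold the input data and still have free batch slots).

```python
# Illustrative sketch only: a toy "resource selector" in the spirit of the
# LCG-style brokering described above. All names and fields are hypothetical.
from dataclasses import dataclass


@dataclass
class Site:
    name: str             # grid site identifier (hypothetical)
    free_slots: int       # free batch slots currently advertised
    has_input_data: bool  # whether the job's input dataset is already staged


def select_site(sites, min_slots=1):
    """Pick the best candidate site: prefer sites that already hold the
    input data, then the one with the most free slots."""
    candidates = [s for s in sites if s.free_slots >= min_slots]
    if not candidates:
        return None
    return max(candidates, key=lambda s: (s.has_input_data, s.free_slots))


if __name__ == "__main__":
    sites = [
        Site("site_a", free_slots=40, has_input_data=False),
        Site("site_b", free_slots=10, has_input_data=True),
        Site("site_c", free_slots=0, has_input_data=True),
    ]
    chosen = select_site(sites)
    print("selected:", chosen.name if chosen else "no site available")
```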
CDF Status
• CDF has prototype grid job submission based on the CDF Analysis Facility (CAF) using Condor Glide-in (a sketch of the general pilot pattern follows this slide)
• Running well and usefully in "owner/operator" mode on a few sites
• Does not have integrated data handling
• May not be handling tarballs
• Requires installation on a head node, and outbound connectivity from the nodes
• Has some legacy security policies to address
• The CAF is Kerberos based
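To make the Glide-in idea concrete, here is a minimal, hypothetical sketch of the general pilot/glide-in pattern: a placeholder job lands on a remote batch slot, checks the local environment, and then pulls real user jobs from a central queue. The environment check, the queue contents, and all names are assumptions for illustration; this is not the CDF CAF or Condor Glide-in code.

```python
# Illustrative sketch only: the generic pilot/glide-in pattern, not CDF code.
import os
import shutil
import subprocess
from queue import Empty, Queue


def worker_ok() -> bool:
    """Minimal sanity check of the worker node the pilot landed on."""
    return shutil.which("python3") is not None and os.access("/tmp", os.W_OK)


def run_pilot(job_queue: Queue, max_jobs: int = 5) -> None:
    """Pull user jobs from the central queue and run them locally,
    stopping when the queue drains or the job limit is reached."""
    if not worker_ok():
        return  # a real glide-in would exit and release the batch slot
    for _ in range(max_jobs):
        try:
            cmd = job_queue.get_nowait()
        except Empty:
            break
        subprocess.run(cmd, check=False)


if __name__ == "__main__":
    q = Queue()
    q.put(["echo", "user job 1"])
    q.put(["echo", "user job 2"])
    run_pilot(q)
```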
Why?
• Glide-in technology is attractive in many ways
• There is always a certain appeal in the next great thing
• Illustrative of a general tension for the Run II experiments
• Competing agendas make it difficult for CDF to turn down effort; the Italian collaborators support the Glide-CAF
• CDF wants to do analysis on the Grid, and they do not want the user interface to change
• That requirement could probably have been met in other ways, but CDF is also vested in the CAF as a model
• Ultimately, this is probably beneficial to both CDF and DØ
• If Glide-in works in production on a reasonable time scale, DØ might be able to use it as well
• Support for VO-specific services is a motivation for the Edge Services pre-proposal for OSG; Edge Services will almost certainly benefit DØ
Run II Computing in the LHC Era
• The Grid is the strategic direction for FNAL CD to meet commitments to Run II, CMS, and other stakeholders
• The '05 Run II computing review complimented DØ and CDF on moving towards grid models
• The Run II effort task force acknowledges the strategy
• Concerns:
  • Availability of resources, especially disk; urged to make more formal agreements
  • "Expenses" involved in operating a production Grid
  • Heavyweight and nonstandard interfaces on the production system
  • Real-world issues for the prototype
• Mitigations:
  • DØ and FNAL CD are proposing an installation team, supported by the review
  • Move towards standard, more robust interfaces
  • Guest Scientist positions could be used to leverage knowledge and expertise, particularly where physics potential would also be leveraged
OSG Pre-Proposals
• The OSG pre-proposal call was targeted at core functionality
• SAMGrid was built with the support of PPDG funds
• Noted that a service without customers is of limited use
• Some calls to work closely with TeraGrid
• Still working through details for a full proposal
• Encouraged to make a proposal for an OSG that will thrive!
Summary
Run II Department Roles
• Operations: running the systems, standing pager rotations/shifts, researching the latest technologies
  • Purchasing and deploying equipment
  • Tracking down and fixing problems
  • Code management
• Development: exploring use cases, writing code, introducing new features, testing, documenting, exploring technologies
• Integration: testing, more testing, training users, transitioning from development to operations
• Planning: how best to use resources to meet stakeholder needs; facility issues
• Interfacing: serving in experiment management roles, bridging CD and the experiments, CD department to CD department, hosting guest scientists
• Participating in physics analysis as collaboration members; 30% of department FTEs hold scientific positions
Risks, expanded
• Increased calls on FNAL CD as effort and equipment migrate to the LHC
  • Declining equipment and operations budgets are already limiting the data collection rate
  • Over time, limits in the equipment and operating budgets will create delays
• Operational performance of user code
  • DØ reconstruction code performance and release turnaround
  • CDF user code has caused inefficiencies on the CAF
• COTS computing
  • Experiments need the best price/performance, which introduces risk
  • Moore's law
  • A good process is in place for evaluation, purchase, and acceptance
  • Each purchase of worker nodes presents challenges; FNAL CD plays the engineering/integrator role by default
  • Commodity fileservers are maintenance intensive
Risks, expanded (continued)
• Data handling
  • The SAM system, dCache, and hardware are working well
  • User patterns are still evolving; there are sometimes conflicts between getting results out quickly and using standard production
  • Scaling with data sample size might have unanticipated consequences
  • Counting on next-generation tape drives to mitigate tape costs
• Longevity of hardware components and software applications
  • Starting to use a 4-year replacement cycle for worker nodes, so the equipment is off warranty only in its final year
  • 5-year life cycle on major components; replacement will be needed again around 2010, when the budget for Run II will be extremely limited
  • Migrating either experiment away from its existing mode of operation or user interfaces would be time intensive and costly