Review of WLCG Tier-2 Workshop
Duncan Rand, Royal Holloway, University of London and Brunel University
....from the perspective of a Tier-2 system manager
• Workshop: 3 days – lectures from the experiments
• Tutorial: 2 days – parallel programme
• Lots of talks with lots of detail!
• General overview – refer to the original slides for details
• Oriented towards ATLAS (RHUL) and CMS (Brunel)
What did I expect?
• An overview of the future – the big picture
• More details about the experiments – data flows and rates
• How were they going to use the Tier-2 sites?
• What did they expect from us?
• Perhaps a tour of the LHC or an experiment
What do the experiments have in common?
• Large volume of data to analyse (we knew that)
• Need to distribute data to CPUs, keep track of it, analyse it and upload the results
• However, they also need to run lots of Monte Carlo (MC) jobs
• common to all particle physics experiments
• a large fraction of all jobs run (ATLAS: 1/3; CMS: 1/2)
• submitted from a central server – 'production'
• this explains the mysterious 'prd' users, e.g. lhcbprd, running on our Tier-2 now
What do they do in Monte Carlo production?
• Start with a small dataset (KB) with the initial conditions describing the experiment
• Model the experiment from collision to analysis
• Model proton-proton interactions, detector physics etc.
• CPU intensive: about 10 kSI2k hours
• Upload a larger dataset to the Tier-1 at the end
• Relatively low network demands: a steady data flow from Tier-2 to Tier-1 of about 50 Mbit/s (varies for each experiment) – see the back-of-envelope figures below
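As a back-of-envelope illustration (my own sketch, not workshop numbers): assuming a hypothetical worker-node core rated at about 1 kSI2k, the quoted figures translate into something like the following.

# Back-of-envelope figures for a Tier-2 running Monte Carlo production.
# The per-job CPU cost (10 kSI2k-hours) and the 50 Mbit/s Tier-2 -> Tier-1
# flow come from the talks; the node rating is an illustrative assumption.

JOB_CPU_KSI2K_HOURS = 10.0   # quoted CPU cost of one MC job
NODE_RATING_KSI2K   = 1.0    # assumed rating of one worker-node core (hypothetical)
UPLINK_MBIT_S       = 50.0   # quoted steady Tier-2 -> Tier-1 flow

# Wall-clock time for one job on one assumed core
hours_per_job = JOB_CPU_KSI2K_HOURS / NODE_RATING_KSI2K
print(f"One MC job: ~{hours_per_job:.0f} h on a {NODE_RATING_KSI2K:.0f} kSI2k core")

# Daily volume implied by a steady 50 Mbit/s upload to the Tier-1
gb_per_day = UPLINK_MBIT_S / 8 / 1000 * 86400   # Mbit/s -> MB/s -> GB/day
print(f"A steady {UPLINK_MBIT_S:.0f} Mbit/s upload is roughly {gb_per_day:.0f} GB/day to the Tier-1")

Under these assumptions a single job occupies a core for around 10 hours, and the steady upload amounts to roughly 540 GB per day.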
Data Management
• Data is immediately transferred from the Tier-0 to the Tier-1s for backup
• RAW data is first calibrated and reconstructed to give Event Summary Data (ESD) and Analysis Object Data (AOD) suitable for analysis
• AOD datasets are transferred to Tier-2s for analysis – 'bursty', depending on user needs, at ~300 Mbit/s (varies for each experiment) – see the illustration below
• Tier-1s will provide reliable storage of the data
• Tier-2s act more like a dynamic cache
• Tier-1s provide essential services such as file catalogues and FTS to a greater or lesser extent, depending on the experiment
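To put the burst rate into context (my own illustration; the 1 TB dataset size is an assumption, since AOD dataset volumes were not specified):

# What does the 'bursty' ~300 Mbit/s AOD flow into a Tier-2 mean in practice?
# The rate is from the talks; the 1 TB dataset size is purely illustrative.

DATASET_TB  = 1.0     # hypothetical AOD dataset size (not a workshop figure)
RATE_MBIT_S = 300.0   # quoted burst rate into a Tier-2

hours = DATASET_TB * 1e12 * 8 / (RATE_MBIT_S * 1e6) / 3600
print(f"Staging {DATASET_TB:.0f} TB at {RATE_MBIT_S:.0f} Mbit/s takes about {hours:.1f} hours")

That is, of order 7 hours to stage such a dataset onto a site.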
Computing
• The experiments have developed complex software tools to:
• handle all this data transfer and keep track of datasets (CMS: PhEDEx; ATLAS: DDM)
• handle submission of MC production (CMS: ProdManager/ProdAgent)
• direct jobs to where the datasets are (see the sketch below)
• enable a physicist in their office to carry out 'chaotic user analysis' – the term describes the lack of central submission of jobs rather than their mode of work (CMS: CRAB)
• These tools place varying demands on a site
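None of the internals of these tools were shown at the workshop, but the idea of 'directing jobs to where the datasets are' can be sketched generically. The site and dataset names below are purely illustrative, and this is in no way PhEDEx, DDM or CRAB code.

# Minimal sketch of data-location-aware brokering: send each job to a site
# that already hosts the dataset it needs. Names are illustrative only.

# Hypothetical catalogue: dataset -> sites holding a replica
replica_catalogue = {
    "mc/ttbar/AOD": ["UKI-LT2-Brunel", "UKI-SOUTHGRID-RALPP"],
    "data/run123/AOD": ["UKI-LT2-RHUL"],
}

# Hypothetical free job slots per site
free_slots = {"UKI-LT2-Brunel": 40, "UKI-SOUTHGRID-RALPP": 5, "UKI-LT2-RHUL": 12}

def broker(dataset):
    """Pick the replica site with the most free slots, or None if no replica exists."""
    sites = replica_catalogue.get(dataset, [])
    if not sites:
        return None  # dataset not staged anywhere: job cannot be placed yet
    return max(sites, key=lambda s: free_slots.get(s, 0))

for ds in ("mc/ttbar/AOD", "data/run123/AOD"):
    print(ds, "->", broker(ds))

The real systems consult replica catalogues and resource information in far more sophisticated ways; the point is simply that job placement follows the data.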
ALICE
• ALICE is not highly relevant to the UK, as it is only supported by Birmingham at Tier-2 level
• The distinction between Tier-1 and Tier-2 is by quality of service
• Requires an extra VO box installed at a site; unlikely to use non-ALICE Tier-2s opportunistically?
• Developing Parallel ROOT Facility (PROOF) clusters at Tier-2s for faster interactive data analysis
LHCb
• Not going to use Tier-2s for analysis of data – analysis is concentrated at the Tier-1
• Only going to run Monte Carlo jobs at Tier-2s
• This simplifies the data transfer requirements at Tier-2 level
• So, the easiest experiment for a Tier-2 to support
• Low networking demands: 40 Mbit/s aggregated over all Tier-2s
• UKI-LT2-Brunel (100 Mbit/s) was recently in the top 10 providers for LHCb Monte Carlo
ATLAS
• Tier-2s provide 40% of the total computing and storage requirements
• Hierarchical structure between Tier-1s and Tier-2s
• a Tier-1 provides services (FTS, LFC) to a group of Tier-2s
• no extra services required at Tier-2 level
• Tier-2s will carry out MC simulations – results sent back to the Tier-1s for storage, further distribution and processing – a steady 30 Mbit/s from the site
• AOD will be distributed to Tier-2s for analysis: 160 Mbit/s to the site
• SC4 question: how long would it take to analyse 150 TB of data, equivalent to one year of LHC running? (see the estimate below)
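A rough answer to the data-movement part of that question (my own estimate, counting network transfer only and ignoring CPU time for the analysis itself):

# Time to move 150 TB into a single site at the quoted 160 Mbit/s AOD rate.
# 150 TB and 160 Mbit/s are from the talk; CPU time is not included.

DATA_TB     = 150.0
RATE_MBIT_S = 160.0

seconds = DATA_TB * 1e12 * 8 / (RATE_MBIT_S * 1e6)
print(f"Moving {DATA_TB:.0f} TB at {RATE_MBIT_S:.0f} Mbit/s takes about {seconds / 86400:.0f} days")

Roughly three months just to move the data into one site at that rate, which illustrates why the load has to be spread across many Tier-2s.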
CMS
• CPU-intensive processing is mostly carried out at Tier-2s
• Tier-2s run 50% MC production and 50% analysis jobs
• MC production jobs are handled by a central queue called the 'ProductionManager', which submits and tracks jobs and registers the output in CMS databases; jobs are handed to 'ProductionAgents' for processing
• MC job output does not go from the worker node (WN) directly to the Tier-1: data is stored locally and small files are merged together by new jobs (heavy I/O), and a large file (~TB) is returned to the Tier-1 – see the sketch below
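The merge step can be pictured with a toy sketch (the file sizes and the 1 GB target are invented for illustration; real CMS merge jobs operate on event data, not simple byte concatenation):

# Illustrative sketch: many small MC output files are combined into fewer
# large files before transfer to the Tier-1. All sizes are hypothetical.

small_files_mb = [180, 220, 150, 300, 260, 190, 210, 240]  # hypothetical outputs
TARGET_MB = 1000  # hypothetical merged-file size

merged, current, size = [], [], 0
for f in small_files_mb:
    if size + f > TARGET_MB and current:
        merged.append(current)
        current, size = [], 0
    current.append(f)
    size += f
if current:
    merged.append(current)

print(f"{len(small_files_mb)} small files -> {len(merged)} merged files")
# Every byte is read from and written back over the site LAN, which is why
# good WN <-> SE bandwidth matters (next slide).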
CMS
• Good LAN bandwidth from the WNs to the SE is important for this merging of files
• 'CRAB' (CMS Remote Analysis Builder) is used at a UI to analyse data
• the user specifies a dataset; CRAB 'discovers' the data, prepares the job and submits it
• 'Surviving the first years': until the detector is understood, AODs are not that useful – analyses will rely heavily on raw data – large networking demands
CMS: requirements of a Tier-2 site
• Division of labour:
• CMS look after global issues
• the Tier-2 looks after local issues to keep the site running
• What is required:
• a good batch farm with reliable storage
• good LAN and WAN networking
• install PhEDEx, LFC and a Squid cache (for calibration data)
• pass the Site Functional Tests
• a good Tier-2 is 'active, responsive, attentive, proactive'
Support and operations afternoon
• Discovered that WLCG = EGEE + OSG, i.e. we are now working more closely with the US Open Science Grid
• OSG is not too relevant for the average Tier-2 sysadmin in the UK
UKI ROC meeting
• Small room, face-to-face meeting – lots of discussion
• Grumbles about GGUS tickets and the time taken to close a solved ticket
• you can close it yourself by adding 'status=solved' to the first line of the reply
• Highlighted for me the somewhat one-directional flow of information in the workshop itself
• It would have been good for Tier-2s to have been able to present at the workshop
Middleware tutorials
• Popular – lots of discussion
• Understandable, given that Tier-2 system admins are more interested in middleware than in experimental computing models
• Good to be able to hear the roadmap for LFC, DPM, FTS, SFTs etc. from the middleware developers and to ask questions
Tier-2 interaction
• There didn't appear to be much interaction between Tier-2s
• Lack of name badges?
• A missed chance to find out how others do things
• Michel Jouvin from GRIF (Paris) gave a summary of his survey of Tier-2s
• large variation in resources between Tier-2s
• 1 to 8 sites per Tier-2; 1 to 13 FTE!
• What is the difference between distributed and federated Tier-2s?
• The post-workshop survey is an excellent idea
Providing a Service
• We are the users and customers of the middleware
• Tier-2s provide a service for the experiments
• CMS: 'Your customers will be remote users'
• Tier-2s need to develop a customer-service mentality
• Good communication paths are needed to ensure this works well
• CMS have VRVS integration meetings and an email list – sounds promising
• It is not very clear how the other experiments will communicate proactively
Summary
• Learnt a lot about how the experiments intend to use Tier-2s
• Pretty clear about what they need from Tier-2 sites
• There could have been more feedback from Tier-2s
• There could have been more interaction between Tier-2s
• Tier-2s are critical to the success of the LHC: a service mentality is needed
• Communication between the experiments and Tier-2s remains unclear
The LHC juggernaut is changing up a gear!