Fabric Monitor, Accounting, Storage and Reports experience at the INFN Tier1 Felice Rosso on behalf of INFN Tier1 Felice.Rosso@cnaf.infn.it Workshop sul calcolo e reti INFN - Otranto - 8-6-2006
Outline • CNAF-INFN Tier1 • FARM and GRID Monitoring • Local Queues Monitoring • Local and GRID accounting • Storage Monitoring and accounting • Summary
Introduction • Location: INFN-CNAF, Bologna (Italy) • one of the main nodes of the GARR network • Computing facility for the INFN HENP community • Participating in the LCG, EGEE and INFNGRID projects • Multi-experiment Tier1 • LHC experiments (Alice, Atlas, CMS, LHCb) • CDF, BABAR • VIRGO, MAGIC, ARGO, Bio, TheoPhys, Pamela ... • Resources are assigned to the experiments according to a yearly plan.
The Farm in a Nutshell - SLC 3.0.6, LCG 2.7, LSF 6.1 - ~720 WNs in the LSF pool (~1580 KSI2K) • Common LSF pool: 1 job per logical CPU (slot) • At most 1 process running at the same time per job • Both GRID and local submission are allowed • GRID and non-GRID jobs can run on the same WN • GRID and non-GRID jobs can be submitted to the same queue • One or more queues per VO/experiment • Since 24 April 2005, ~2,700,000 jobs have been executed on our LSF pool (~1,600,000 of them GRID jobs) • 3 CEs (main CE: 4 dual-core Opterons, 24 GB RAM) + 1 gLite CE
[Diagram: access to the batch system. Grid jobs enter through the UIs and the CE; "legacy" non-Grid jobs go through an LSF client; both are dispatched by LSF to the WNs (WN1 ... WNn), with the SE providing storage access.]
Farm Monitoring Goals • Scalability to the Tier1 full size • Many parameters for each WN/server • Database and plots on web pages • Data analysis • Problems reported on web page(s) • Share data with GRID tools • RedEye: the INFN-T1 monitoring tool • RedEye runs as a simple local user: no root needed!
Tier1 Fabric Monitoring What do we get? (a collection sketch follows below) • CPU load, status and jiffies • Ethernet I/O (MRTG, by the network team) • Temperatures, fan RPM (IPMI) • Total number and type of active TCP connections • Processes created, running, zombie, etc. • RAM and swap memory • Users logged in • SLC3 and SLC4 compatible
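The metrics above map almost one-to-one onto /proc. As an illustration only (this is not the RedEye sensor, whose code is not shown in the slides), a minimal Python sketch of how a few of them could be read on a Linux WN:

    # Illustrative only: read some of the per-WN metrics listed above from /proc.
    def cpu_jiffies():
        # First line of /proc/stat: "cpu  user nice system idle ..."
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:5]
        return dict(zip(("user", "nice", "system", "idle"), map(int, fields)))

    def memory():
        # RAM and swap figures (kB) from /proc/meminfo
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])
        return {k: info[k] for k in ("MemTotal", "MemFree", "SwapTotal", "SwapFree")}

    def tcp_states():
        # Count active TCP connections per state (4th column of /proc/net/tcp)
        counts = {}
        with open("/proc/net/tcp") as f:
            f.readline()                      # skip header
            for line in f:
                state = line.split()[3]
                counts[state] = counts.get(state, 0) + 1
        return counts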
Local WN Monitoring • On each WN, every 5 minutes (local crontab), the metrics are saved locally (<3 KBytes, i.e. 2-3 TCP packets) • 1 minute later a collector fetches them via socket • The fetch is a tidy parallel fork with timeout control • Getting and saving the data from ~750 WNs takes ~6 s in the best case, 20 s in the worst case (timeout cut-off) • The database is then updated (last day, week, month) • One file per WN (allows cumulative plots) • The monitoring data are analysed • A local thumbnail cache is created (clickable on the web) • A collector sketch is shown below • http://collector.cnaf.infn.it/davide/rack.php • http://collector.cnaf.infn.it/davide/analyzer.html
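A minimal sketch of the collector side, assuming each WN serves its small metrics blob on a plain TCP port. Host names, port number and paths are invented for illustration, and the real RedEye collector uses a parallel fork rather than threads:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    WN_LIST = ["wn%03d.example.cnaf.infn.it" % i for i in range(1, 751)]  # placeholder names
    PORT, TIMEOUT = 9123, 20          # 20 s is the worst-case cut-off mentioned above

    def fetch(host):
        # Pull the whole metrics blob from one WN, or give up after TIMEOUT seconds.
        try:
            data = b""
            with socket.create_connection((host, PORT), timeout=TIMEOUT) as sock:
                sock.settimeout(TIMEOUT)
                while True:
                    chunk = sock.recv(4096)
                    if not chunk:
                        break
                    data += chunk
            with open("/var/spool/monitor/%s.dat" % host, "wb") as out:   # one file per WN
                out.write(data)
            return host, True
        except OSError:
            return host, False        # WN missed this round; reported by the analyser

    with ThreadPoolExecutor(max_workers=100) as pool:
        missing = [host for host, ok in pool.map(fetch, WN_LIST) if not ok]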
Fabric → GRID Monitoring • Effort on exporting relevant fabric metrics to the Grid level, e.g.: • number of active WNs • number of free slots • etc. • GridICE integration • Configuration based on Quattor • Avoid duplication of sensors on the farm
Local Queues Monitoring • Every 5 minutes the queue status is saved on the batch manager (snapshot) • A collector fetches the data and updates the local database (same logic as the farm monitoring; see the sketch below) • Daily / weekly / monthly / yearly DBs • DB covers the total and the single queues • 3 classes of users for each queue • Plot generator: Gnuplot 4.0 • http://tier1.cnaf.infn.it/monitor/LSF/
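A minimal sketch of such a snapshot, assuming the standard `bqueues` column layout; the real collector and its DB format are not shown in the slides, and the output path is invented:

    import subprocess, time

    def snapshot():
        # Append one row per queue (timestamp, queue, NJOBS, PEND, RUN) to a daily flat DB.
        out = subprocess.run(["bqueues"], capture_output=True, text=True, check=True).stdout
        now = int(time.time())
        db_path = "/var/spool/monitor/queues-%s.db" % time.strftime("%Y%m%d")  # invented path
        with open(db_path, "a") as db:
            for line in out.splitlines()[1:]:            # skip the header line
                cols = line.split()
                queue, njobs, pend, run = cols[0], cols[7], cols[8], cols[9]
                db.write("%d %s %s %s %s\n" % (now, queue, njobs, pend, run))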
[Plot: LHCb queue usage by user class] UGRID: general GRID user (lhcb001, lhcb030, …) SGM: Software GRID Manager (lhcbsgm) OTHER: local user
[Plot: BaBar queue usage by user class] UGRID: general GRID user (babar001, babar030, …) SGM: Software GRID Manager (babarsgm) OTHER: local user
RedEye - LSF Monitoring • Real-time slot usage • Fast, low CPU demand, stable, works over the WAN • RedEye runs as a simple user, not root, BUT… • all slots have the same weight (future: Jeep solution) • jobs shorter than 5 minutes can be missed SO: we need something that works for ALL jobs, and we need to know who uses our farm and how. Solution: offline parsing of the LSF log files once per day (Jeep integration)
Job-related metrics From the LSF log file we get the following non-GRID information (a parsing sketch follows below): • LSF JobID, local UID owning the job • all the relevant times (submission time, WCT, etc.) • maximum RSS and virtual memory usage • from which host the job was submitted (GRID CE / local) • on which WN the job was executed • We complete this set with KSI2K and GRID info (Jeep) • DGAS interface http://www.to.infn.it/grid/accounting/main.html • http://tier1.cnaf.infn.it/monitor/LSF/plots/acct/
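A hedged sketch of the offline lsb.acct parsing: JOB_FINISH records are whitespace-separated, quote-delimited fields, but the exact field positions depend on the LSF version, so the indices and the log path below are illustrative, not the real Jeep/DGAS parser:

    import shlex

    def parse_lsb_acct(path="/lsf/work/cluster/logdir/lsb.acct"):   # path is an assumption
        jobs = []
        with open(path) as f:
            for line in f:
                fields = shlex.split(line)
                if not fields or fields[0] != "JOB_FINISH":
                    continue
                jobs.append({
                    "job_id":      int(fields[3]),   # LSF JobID
                    "end_time":    int(fields[2]),   # event time = job end
                    "submit_time": int(fields[7]),   # illustrative positions
                    "start_time":  int(fields[10]),
                    "user":        fields[11],
                    "queue":       fields[12],
                })
        return jobs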
Queues accounting report • KSI2K [WCT] May 2006, All jobs
Queues accounting report • CPUTime [hours] May 2006, GRID jobs
How do we use KSI2K (kilo-SpecInt2000)? • 1 slot → 1 job • http://tier1.cnaf.infn.it/monitor/LSF/plots/ksi/ • For each job:
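The per-job formula itself appears as an image on the original slide and is not reproduced here; under the 1 slot → 1 job model, a common normalization (stated only as an assumption) would be:

    KSI2K_hours(job) = WCT_hours(job) * KSI2K_rating(WN) / slots(WN)

i.e. the job is charged its wall-clock time scaled by the per-slot SpecInt2000 power of the WN it ran on.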
Job Check and Report • lsb.acct had a big bug! • Randomly, CPU-user-time = 0.00 sec • The correct CPUtime can be recovered from bjobs -l <JOBID> • Fixed by Platform on 25 July 2005 • CPUtime > WCT? → possible spawned processes • RAM memory: is the job on the right WN? • Is the WorkerNode a “black hole”? • We have a daily report (web page); a sketch of the checks follows below
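A minimal sketch of the daily checks listed above; record fields and thresholds are assumptions, not the real report code:

    def check_job(job):
        # Flag the anomalies listed above for one finished job.
        problems = []
        wct = job["end_time"] - job["start_time"]
        if wct > 0 and job["cpu_time"] > wct:
            problems.append("CPUtime > WCT: possible spawned processes left running")
        if job["cpu_time"] == 0.0:
            problems.append("CPU-user-time = 0.00 s: recover it with 'bjobs -l <JOBID>'")
        if job["max_rss_kb"] > job["wn_ram_kb"]:
            problems.append("max RSS above the WN RAM: job landed on the wrong WN?")
        return problems

    def black_hole_candidates(jobs, min_jobs=50, max_median_wct=60):
        # A WN that swallows many jobs which all finish in seconds is a "black hole".
        from statistics import median
        wct_by_wn = {}
        for j in jobs:
            wct_by_wn.setdefault(j["exec_host"], []).append(j["end_time"] - j["start_time"])
        return [wn for wn, wcts in wct_by_wn.items()
                if len(wcts) >= min_jobs and median(wcts) <= max_median_wct]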
Fabric and GRID monitoring • Effort on exporting relevant queue and job metrics to the Grid level • Integration with GridICE • Integration with DGAS (done!) • Grid (VO) level view of resource usage • Integration of local job information with Grid-related metrics, e.g.: • DN of the user proxy • VOMS extensions of the user proxy • Grid Job ID
GRID ICE • Dissemination http://grid.infn.it/gridice • GridICE server (development with upcoming features) • http://gridice3.cnaf.infn.it:50080/gridice • GridICE server for EGEE Grid • http://gridice2.cnaf.infn.it:50080/gridice • GridICE server for INFN-Grid • http://gridice4.cnaf.infn.it:50080/gridice
GRID ICE • For each site, the GRID services are checked (RB, BDII, CE, SE, …) • Service check → does the PID exist? • Summary and/or notification • From the GRID servers: summary of the CPU and storage resources available per site and/or per VO • Storage available on the SEs per VO, from the BDII • Downtimes
GRID ICE • GridICE as fabric monitor for “small” sites • Based on LeMon (server and sensors) • Parsing of the LeMon flat-file logs • Plots based on RRDtool • Legnaro: ~70 WorkerNodes
Jeep • General-purpose data collector (push technology) • DB-WNINFO: historical hardware DB (MySQL on the HLR node) • KSI2K used by each single job (DGAS) • Job monitoring (real-time check of RAM usage, efficiency history) • FS-INFO: is there enough available space on the volumes? • AutoFS: are all dynamic mount-points working? • UID/GID → VO match-making (see the sketch below)
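A minimal sketch of the UID/GID → VO match-making, assuming the pool-account naming convention visible in the queue plots (lhcb001 … lhcbsgm); the pattern is an assumption, not the real Jeep mapping:

    import pwd, re

    # Pool accounts look like <vo>NNN (general GRID user) or <vo>sgm (Software GRID Manager).
    POOL_ACCOUNT = re.compile(r"^(?P<vo>[a-z]+?)(?:sgm|\d{3})$")

    def uid_to_vo(uid):
        user = pwd.getpwuid(uid).pw_name
        match = POOL_ACCOUNT.match(user)
        return match.group("vo") if match else "local"   # anything else is a local user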
The Storage in a Nutshell • Different hardware (NAS, SAN, tapes) • More than 300 TB of disk, 130 TB of tape • Different access methods (NFS/RFIO/Xrootd/GridFTP) • Volume filesystems: EXT3, XFS and GPFS • Volumes bigger than 2 TBytes: RAID 50 (EXT3/XFS) or direct (GPFS) • Tape access: CASTOR (50 TB of disk as staging area) • Volume management via a PostgreSQL DB • 60 servers export the filesystems to the WNs
Storage at T1-INFN • Hierarchical Nagios servers check the status of the services • gridftp, srm, rfio, castor, ssh • Local tool to sum the space used by the VOs (see the sketch below) • RRD to plot total and used volume space • Binary, vendor-supplied (IBM/STEK) software to check some of the hardware status • Very difficult to interface the vendor software with the T1 framework • For now: only e-mail reports for bad blocks, disk failures and filesystem failures • Plots: intranet, and on demand by the VOs
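A minimal sketch of the "sum space used by VOs" step feeding RRD, assuming one directory tree and one pre-created RRD file per VO; paths and the RRD layout are invented for illustration:

    import os, subprocess

    VO_DIRS = {"cms": "/storage/cms", "lhcb": "/storage/lhcb"}     # placeholder paths

    def used_bytes(top):
        # Walk a VO's tree and add up the file sizes.
        total = 0
        for dirpath, _, filenames in os.walk(top):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass                                           # file vanished mid-walk
        return total

    for vo, top in VO_DIRS.items():
        # Push the current total into the VO's RRD; the plot is then rendered with RRDtool.
        subprocess.run(["rrdtool", "update", "/var/rrd/%s.rrd" % vo,
                        "N:%d" % used_bytes(top)], check=True)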
Summary • Fabric-level monitoring with smart reporting is needed to ease management • The T1 already has a solution for the next 2 years! • It is not exportable due to man-power limits (no support) • Future at INFN? What is the T2s’ man-power? • LeMon & Oracle? What is the T2s’ man-power? • RedEye? What is the T2s’ man-power? • Real collaboration takes more than mailing lists and phone conferences