
Northgrid


Presentation Transcript


  1. NorthGrid. Alessandra Forti, M. Doidge, S. Jones, A. McNab, E. Korolkova. GridPP26, Brighton, 30 April 2011

  2. Efficiency The ratio of the effective or useful output to the total input in any system.

  3. Pledges

  4. CPU efficiency

  5. Usage. NorthGrid normalised CPU time (HEPSPEC06) by site and VO, top 10 VOs (and other VOs), September 2010 - February 2011.

  6. Successful jobs rate
     • UKI-NORTHGRID-MAN-HEP (494559/35496)
     • UKI-NORTHGRID-LANCS-HEP (252864/33889)
     • UKI-NORTHGRID-LIV-HEP (227395/15185)
     • UKI-NORTHGRID-SHEF-HEP (140804/8525)
     • ANALY_MANC (192046/52963)
     • ANALY_LANCS (155233/61368)
     • ANALY_SHEF (161043/25537)
     • ANALY_LIV (146994/21563)

  7. Lancaster – keeping things smooth
     Our main strategy for efficient running at Lancaster involves comprehensive monitoring and configuration management. Effective monitoring allows us to jump on incidents and spot problems before they bite us on the backside, as well as enabling us to better understand, and therefore tune, our systems. Cfengine on our nodes, and Kusu on the HEC machines, enable us to pre-empt misconfiguration issues on individual nodes, quickly rectify errors and ensure a swift, homogeneous rollout of configs and changes. Whatever the monitoring, e-mail alerts keep us in the know. Among the many tools and tactics we use to keep on top of things are: syslog (with Logwatch mails), Ganglia, Nagios (with e-mail alerts), ATLAS Panda monitoring, Steve's pages, on-board monitoring and e-mail alerts for our Areca RAID arrays, Cacti for our network (and the HEC nodes), plus a whole bunch of hacky scripts and bash one-liners!
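To make the "hacky scripts and e-mail alerts" idea concrete, here is a minimal sketch of such a log-watching script in Python. The log path, error pattern and mail addresses are illustrative assumptions, not Lancaster's actual configuration.

```python
#!/usr/bin/env python
"""Minimal sketch of a log-watching script in the spirit of the Lancaster
approach: scan syslog for disk/RAID errors and e-mail an alert.
Paths, patterns and addresses are assumptions, not the real setup."""

import re
import smtplib
from email.mime.text import MIMEText

LOGFILE = "/var/log/messages"            # assumed syslog location
PATTERN = re.compile(r"raid|I/O error|Hardware Error", re.IGNORECASE)
MAIL_TO = "grid-admins@example.ac.uk"    # hypothetical alert list
MAIL_FROM = "logwatch@example.ac.uk"

def find_problems(path):
    """Return log lines that match the error pattern."""
    with open(path) as log:
        return [line.rstrip() for line in log if PATTERN.search(line)]

def send_alert(lines):
    """E-mail the matching lines to the admin list via the local MTA."""
    msg = MIMEText("\n".join(lines))
    msg["Subject"] = "[logwatch] %d suspicious syslog lines" % len(lines)
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    smtplib.SMTP("localhost").sendmail(MAIL_FROM, [MAIL_TO], msg.as_string())

if __name__ == "__main__":
    problems = find_problems(LOGFILE)
    if problems:
        send_alert(problems)
```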

  8. Lancaster – TODO list
     We'll probably never stop finding things to polish, but some of the things at the top of the wishlist (in that we wish we could get time to implement them!) are: a site dashboard (a huge, beautiful site dashboard); more Ganglia metrics; more in-depth Nagios tests, particularly for batch system monitoring and RAID monitoring (recent storage purchases have 3ware and Adaptec RAIDs); and intelligent syslog monitoring as the number of nodes at our site grows. Increased network and job monitoring would also help: the more detailed the picture we have of what's going on, the better we can tune things. Other ideas for increasing our efficiency include SMS alerts, internal ticket management and introducing a more formalised on-call system.

  9. Liverpool hardware measures
     • Planning, design and testing: storage and node specifications
     • Network design, e.g. minimise contention, bonding
     • Extensive HW and SW soak testing, experimentation, tuning
     • Adjustments and refinement
     • UPS coverage

  10. Liverpool building and monitoring
      • Builds and maintenance: dhcp, kickstart, yum, puppet, yaim, standards
      • Monitoring: Nagios (local and GridPP), Ganglia, Cacti/weathermap, log monitoring, tickets and mailing lists
      • testnodes: local software that checks worker nodes to isolate potential "blackhole" conditions
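A "blackhole" condition is typically a worker node that silently fails every job in a few seconds and so drains the queue. The sketch below illustrates one way to flag such nodes from job walltimes; it is not Liverpool's actual testnodes code, and the input format and thresholds are assumptions.

```python
#!/usr/bin/env python
"""Illustrative blackhole-node detector: a node that "eats" jobs tends to
show many very short job runtimes. Only a sketch of the idea behind
testnodes; the log format and thresholds are assumed."""

from collections import defaultdict

SHORT_JOB_SECS = 120      # jobs shorter than this are suspicious (assumed)
MIN_JOBS = 20             # need a sample this big before judging a node
BLACKHOLE_FRACTION = 0.8  # flag if >80% of a node's jobs are short

def scan(accounting_file):
    """Each input line is assumed to be: '<node> <walltime_seconds>'."""
    short, total = defaultdict(int), defaultdict(int)
    with open(accounting_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            node, walltime = parts
            total[node] += 1
            if int(walltime) < SHORT_JOB_SECS:
                short[node] += 1
    return [node for node in total
            if total[node] >= MIN_JOBS
            and float(short[node]) / total[node] > BLACKHOLE_FRACTION]

if __name__ == "__main__":
    for node in scan("job_walltimes.txt"):   # hypothetical input file
        print("possible blackhole worker node: %s" % node)
```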

  11. Manchester install, config & monitor
      • Have to look after ~550 machines
      • Install: DHCP, Kickstart, YAIM, Yum, Cfengine
      • Monitor: Nagios, Ganglia, Cfengine, weathermap, RAID card monitoring, custom scripts to parse log files, OS tools
      • Each machine has a profile for each tool, which makes it difficult to keep changes consistent
      • With reduced manpower we can't afford such poor tracking

  12. Manchester integration with RT
      • Use Nagios for monitoring nodes and services
      • Both external tests (e.g. ssh to a port) and internal tests (via the node's nrpe daemon)
      • Use RT ("Request Tracker") for tickets
      • Includes Asset Tracker, which has a powerful web interface and links to tickets
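As an illustration of an "external" check, the following Python sketch opens a TCP connection to a host and port (e.g. ssh on 22) and reports using the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL). It is a generic example, not one of the checks actually deployed at Manchester.

```python
#!/usr/bin/env python
"""Sketch of an external Nagios-style check: try to open a TCP connection
and report via the standard plugin exit codes. Host and port are
command-line arguments; defaults are illustrative."""

import socket
import sys

def check_tcp(host, port, timeout=5.0):
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        print("OK - %s:%d is accepting connections" % (host, port))
        return 0
    except (socket.error, socket.timeout) as exc:
        print("CRITICAL - cannot connect to %s:%d (%s)" % (host, port, exc))
        return 2

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 22
    sys.exit(check_tcp(host, port))
```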

  13. Manchester integration with RT (2)
      • Previously maintained lists of hosts and group membership in Nagios cfg files
      • These are now generated from the AT MySQL DB
      • Obvious advantage: services are monitored only where cfengine has installed them
      • Automatic cross-link between AT and Nagios
      • Future extensions to other lists such as dhcp, cfengine, and online/offline nodes
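A minimal sketch of the idea, generating Nagios host definitions from an asset database, might look like the following. The table name, columns, credentials and output path are hypothetical, and the MySQLdb module is assumed to be available; the real Manchester setup is more involved.

```python
#!/usr/bin/env python
"""Sketch of generating Nagios host definitions from an asset database,
in the spirit of the AT -> Nagios integration. Schema, credentials and
paths are assumptions."""

import MySQLdb

NAGIOS_CFG = "/etc/nagios/conf.d/hosts_from_at.cfg"   # assumed output path

HOST_TEMPLATE = """define host{
    use        generic-host
    host_name  %(name)s
    address    %(address)s
    hostgroups %(hostgroup)s
}
"""

def main():
    db = MySQLdb.connect(host="localhost", user="nagiosgen",
                         passwd="secret", db="assets")   # assumed credentials
    cur = db.cursor()
    # Hypothetical schema: one row per machine known to the Asset Tracker.
    cur.execute("SELECT name, address, hostgroup FROM hosts WHERE online = 1")
    with open(NAGIOS_CFG, "w") as out:
        for name, address, hostgroup in cur.fetchall():
            out.write(HOST_TEMPLATE % {"name": name,
                                       "address": address,
                                       "hostgroup": hostgroup})
    db.close()

if __name__ == "__main__":
    main()
```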

  14. Sheffield: efficiency
      • 2 clusters; jobs requiring better network bandwidth are directed to WNs with a better backbone
      • Storage: 90 TB (9 disk pools, software RAID5, without RAID controllers)
      • The absence of RAID controllers increases site efficiency: no common RAID-controller failures leading to unavailable disk servers and data loss
      • 2 TB Seagate Barracuda disks, fast and robust
      • 5 x 16-bay units with 2 fs, 4 x 24-bay units with 2 fs
      • Cold spare unit on standby in each server
      • A simple cluster structure makes it easy to maintain high efficiency and to upgrade to new experiment requirements
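Because the disk pools use Linux software RAID, a basic health check only needs to parse /proc/mdstat. The sketch below flags md arrays with a missing member; it is an illustration, not Sheffield's actual monitoring script.

```python
#!/usr/bin/env python
"""Sketch of a software-RAID health check: parse /proc/mdstat and report
any md array with a failed or missing member. Illustrative only."""

import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return md device names whose status line shows a missing member,
    i.e. an underscore inside the [UU_] style status brackets."""
    bad, current = [], None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            status = re.search(r"\[([U_]+)\]", line)
            if current and status and "_" in status.group(1):
                bad.append(current)
    return bad

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("WARNING: degraded md arrays: %s" % ", ".join(bad))
        sys.exit(1)
    print("OK: all md arrays healthy")
```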

  15. Sheffield: efficiency
      • Monitoring checks are done regularly, several times a day
      • Ganglia: general check of cluster health
      • Regional Nagios, with warnings sent via e-mail from the regional Nagios
      • Logwatch/syslog checks
      • GRIDMAP
      • All the ATLAS monitoring tools: ATLAS SAM test page, ATLAS (and LHCb) Site Status Board, DDM dashboard, PanDA monitor
      • Detailed checks of ATLAS performance (finding the reason for a particular failure of production or analysis jobs)

  16. Sheffield: efficiency
      • Installation: PXE boot, Red Hat kickstart install, bash post-install (includes YAIM)
      • Many cron jobs are used for monitoring:
      • Monitor the temperature in the cluster room (in case of a temperature rise, only some of the worker nodes shut down automatically)
      • Generate a web page of queues and jobs for both grid and local use
      • Check vital services (bdii, srm) and restart them if they are down
      • Generate a warning e-mail in case of disk failure (in any server)
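As an example of the "check and restart vital services" cron job, here is a minimal Python sketch. The service names are taken from the slide (bdii, srm); the init-script interface, the exact daemon names and the mail address are assumptions about a typical node of that era.

```python
#!/usr/bin/env python
"""Sketch of a cron job in the spirit of the Sheffield checks: make sure
vital services are running and restart them if not. Daemon names and the
mail address are assumptions; 'service' and 'mail' commands are assumed
to exist on the node."""

import subprocess

SERVICES = ["bdii", "srm"]               # names taken from the slide
ADMIN = "grid-admin@example.ac.uk"       # hypothetical alert address

def is_running(name):
    """'service <name> status' returns 0 when the daemon is up."""
    return subprocess.call(["service", name, "status"]) == 0

def restart(name):
    return subprocess.call(["service", name, "restart"]) == 0

def notify(message):
    """Send a short mail via the system 'mail' command."""
    p = subprocess.Popen(["mail", "-s", "service restarted", ADMIN],
                         stdin=subprocess.PIPE)
    p.communicate(message.encode())

if __name__ == "__main__":
    for svc in SERVICES:
        if not is_running(svc):
            ok = restart(svc)
            notify("%s was down; restart %s"
                   % (svc, "succeeded" if ok else "FAILED"))
```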
