Liverpool HEP - Site Report June 2008 Robert Fay, John Bland
Staff Status • One member of staff left in the past year: • Paul Trepka, left March 2008 • Two full time HEP system administrators • John Bland, Robert Fay • One full time Grid administrator currently being hired • Closing date for applications was Friday 13th; 15 applications received • One part time hardware technician • Dave Muskett
Current Hardware • Desktops • ~100 Desktops: Scientific Linux 4.3, Windows XP • Minimum spec of 2GHz x86, 1GB RAM + TFT monitor • Laptops • ~60 Laptops: mixed architectures, specs and OSes • Batch Farm • Software repository (0.7TB), storage (1.3TB) • Old ‘batch’ queue has 10 SL3 dual 800MHz P3s with 1GB RAM • ‘medium’ and ‘short’ queues consist of 40 SL4 MAP2 nodes (3GHz P4s) • 5 interactive nodes (dual Xeon 2.4GHz) • Using Torque/PBS (see the sketch below) • Used for general analysis jobs
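The queue layout above is plain Torque/PBS configuration. As a minimal sketch, assuming an illustrative one-hour walltime limit and a hypothetical job script name (neither taken from the site's actual settings), the ‘short’ queue could be defined and used like this:

    qmgr -c "create queue short queue_type=execution"
    qmgr -c "set queue short resources_max.walltime = 01:00:00"   # assumed limit
    qmgr -c "set queue short enabled = true"
    qmgr -c "set queue short started = true"
    qsub -q short analysis_job.sh                                 # submit an analysis job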
Current Hardware – continued • Matrix • 1 dual 2.40GHz Xeon, 1GB RAM • 6TB RAID array • Used for CDF batch analysis and data storage • HEP Servers • 4 core servers • User file store + bulk storage via NFS (Samba front end for Windows) • Web (Apache), email (Sendmail) and database (MySQL) • User authentication via NIS (+Samba for Windows) • Dual Xeon 2.40GHz shell server and ssh server • Core servers have a failover spare
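For the NFS side of the core servers, a minimal /etc/exports sketch of the kind such a file server might carry; the export paths and host wildcard here are hypothetical, not the site's real configuration:

    # /etc/exports (hypothetical paths and host pattern)
    /user        *.ph.liv.ac.uk(rw,sync,no_subtree_check)
    /hepstore    *.ph.liv.ac.uk(rw,sync,no_subtree_check)

followed by exportfs -ra to publish the exports.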
Current Hardware – continued • LCG Servers • CE, SE upgraded to new hardware: • CE now 8-core Xeon 2GHz, 8GB RAM • SE now 4-core Xeon 2.33GHz, 8GB RAM, RAID 10 array • CE, SE, UI all SL4, gLite 3.1 • Mon still SL3, gLite 3.0 • BDII SL4, gLite 3.0
Current Hardware – continued • MAP2 Cluster • 24-rack (960-node) cluster of Dell PowerEdge 650s • 4 racks (280 nodes) shared with other departments • Each node has a 3GHz P4, 1GB RAM, 120GB local storage • 19 racks (680 nodes) primarily for LCG jobs (5 racks currently allocated for local ATLAS/T2K/Cockcroft batch processing) • 1 rack (40 nodes) for general purpose local batch processing • Front end machines for ATLAS, T2K, Cockcroft • Each rack has two 24-port gigabit switches • All racks connected into VLANs via a Force10 managed switch
Storage • RAID • All file stores use at least RAID5; newer servers use RAID6. • All RAID arrays use 3ware 7xxx/9xxx controllers on Scientific Linux 4.3. • Arrays monitored with 3ware 3DM2 software. • File stores • New user and critical software store, RAID6+HS (hot spare), 2.25TB • ~10TB of general purpose ‘hepstores’ for bulk storage • 1.4TB batchstore + 0.7TB batchsoft for the batch farm cluster • 1.4TB hepdata for backups • 37TB RAID6 for LCG storage element
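Alongside the 3DM2 web interface, the same controllers can be checked from a shell. A sketch using 3ware's tw_cli and smartmontools, where the controller and port numbers are assumptions and the device path depends on the driver (/dev/twe0 for 7xxx, /dev/twa0 for 9xxx):

    tw_cli /c0 show                    # units, drives and rebuild status on controller 0
    smartctl -a -d 3ware,0 /dev/twa0   # SMART data for the drive on port 0 (9xxx driver)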
Storage (continued) • 3ware Problems! • 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card. • 3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence. • 3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0. • Leads to total loss of data access until the system is rebooted. • Sometimes leads to data corruption at the array level. • Seen under iozone load, under normal production load, and following a drive failure. • Anyone else seen this?
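Since these resets announce themselves in the kernel log before access is lost, one crude stopgap is to watch for them. A sketch, assuming the messages land in /var/log/messages and mail to root is an acceptable alert path:

    grep -E '3w-9xxx.*(timed out|Microcontroller not ready|Cache synchronization failed)' \
        /var/log/messages && echo "3ware event detected" | mail -s "3ware alert" root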
Network • Topology • [Topology diagram: 2Gbit MAP2 and 2Gbit WAN connections into the central Force10 gigabit switch; departmental firewall; LCG servers, offices and other servers attached via 1Gbit links on separate VLANs]
Network (continued) • Core Force10 E600 managed switch. • Now has 450 gigabit ports (240 at line rate) • Used as central departmental switch, using VLANs • Increased bandwidth to the WAN to 2-3Gbit/s using link aggregation • Increased the departmental backbone to 2Gbit/s • Added departmental firewall/gateway • Network intrusion monitoring with snort (see the sketch below) • Most office PCs and laptops are on an internal private network • Building network infrastructure is creaking: needs rewiring; old cheap hubs and switches need replacing
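For the snort monitoring mentioned above, a typical daemonised invocation on a dedicated monitoring interface might look like the following; the interface name and configuration path are assumptions:

    snort -D -i eth1 -c /etc/snort/snort.conf   # run as daemon, sniffing the monitor port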
Security & Monitoring • Security • Logwatch (looking to develop filters to reduce ‘noise’) • University firewall + local firewall + network monitoring (snort) • Secure server room with swipe card access • Monitoring • Core network traffic usage monitored with ntop and cacti (all traffic to be monitored after network upgrade) • Use sysstat on core servers for recording system statistics • Rolling out system monitoring on all servers and worker nodes, using SNMP, Ganglia, Cacti, and Nagios • Hardware temperature monitors on water cooled racks, to be supplemented by software monitoring on nodes via SNMP. Still investigating other environment monitoring solutions.
System Management • Puppet used for configuration management • dotProject used for the general helpdesk • RT integrated with Nagios for system management • - Nagios automatically creates/updates tickets on acknowledgement • - Each RT ticket serves as a record for an individual system
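A minimal sketch of what Puppet-managed configuration looks like; the ntp module here is a generic example rather than one of the site's actual manifests:

    class ntp {
      package { 'ntp': ensure => installed }
      file { '/etc/ntp.conf':
        source  => 'puppet:///modules/ntp/ntp.conf',   # served from the (example) module
        require => Package['ntp'],
        notify  => Service['ntpd'],                    # restart on config change
      }
      service { 'ntpd': ensure => running, enable => true }
    }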
Plans • Additional storage for the Grid • GridPP3 funded • Will be approx. 60? TB • May switch from dCache to DPM • Upgrades to local batch farm • Plans to purchase several multi-core (most likely 8-core) nodes • Collaboration with local Computing Services Department • Share of their newly commissioned multi-core cluster available