NIKHEF Data Processing Facility
Status Overview per 2004.10.27
David Groep, NIKHEF
A historical view
• Started in 2000 with a dedicated farm for DØ
  • 50 dual P3-800 MHz nodes
  • tower-model Dell Precision 220
  • 800 GByte "3ware" disk array
Many different farms
• 2001: EU DataGrid WP6 'Application' test bed
• 2002: addition of the 'development' test bed
• 2003: LCG-1 production facility
• April 2004: amalgamation of all nodes into LCG-2
• September 2004: addition of
  • EGEE PPS
  • VL-E P4 CTB
  • EGEE JRA1 LTB
Growth of resources
• 2000: Intel Pentium III 800 MHz, 100 CPUs
• 2001: Intel Pentium III 933 MHz, 40 CPUs
• 2002: AMD Athlon MP2000+ (~2 GHz), 132 CPUs
• 2003: Intel XEON 2.8 GHz, 54 CPUs
• 2003: Intel XEON 2.8 GHz, 20 CPUs
• Total WN resources (raw): 353 THz·hr/mo, ~200 kSI2k (rough cross-check below)
• Total on-line disk cache: 7 TByte
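As a rough cross-check of the raw-capacity figure, the sketch below just sums CPUs × clock speed over the farms listed above and scales by wall-clock hours per month; the availability factor is an assumption of mine, not something stated in the slides.

```python
# Rough cross-check of the raw worker-node capacity quoted above.
# CPU counts and clock speeds come from the table; the availability
# factor is an assumption (the slides do not state one).
farms = [
    ("Intel Pentium III", 0.800, 100),   # GHz, CPU count (2000)
    ("Intel Pentium III", 0.933, 40),    # 2001
    ("AMD Athlon MP2000+", 2.0, 132),    # ~2 GHz, 2002
    ("Intel XEON", 2.8, 54),             # 2003
    ("Intel XEON", 2.8, 20),             # 2003
]

raw_ghz = sum(clock * cpus for _, clock, cpus in farms)
hours_per_month = 30 * 24        # ~720 wall-clock hours
availability = 0.85              # assumed duty cycle, not from the slides

thz_hr_per_month = raw_ghz / 1000.0 * hours_per_month * availability
print(f"raw capacity: {raw_ghz:.0f} GHz")
print(f"~{thz_hr_per_month:.0f} THz*hr/month (slide quotes 353 THz hr/mo)")
```

With these assumptions the estimate lands close to the quoted 353 THz·hr/mo, which suggests the figure is simply raw clock-hours scaled by a duty cycle.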
Node types
• 2U "pizza" boxes: PIII 933 MHz, 1 GByte RAM, 43 GByte disk
• 1U GFRC (NCF): AMD MP2000+, 1 GByte RAM, 60 GByte disk ('thermodynamic challenges')
• 1U Halloween: XEON 2.8 GHz, 2 GByte RAM, 80 GByte disk (first GigE nodes)
Connecting things together
• Collapsed backbone strategy
  • Foundry Networks BigIron 15000
  • 14 GigE SX, 2x GigE LX
  • 16 1000BaseTX
  • 48 100BaseTX
• Service nodes directly GigE connected
• Farms connected via local switches
  • typical WN oversubscription 1:5 – 1:7
• Dynamic re-assignment of nodes to facilities (see the sketch after this list)
  • DHCP relay
  • built-in NAT support (for worker nodes)
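The dynamic re-assignment of worker nodes between facilities is essentially bookkeeping that ends in regenerated DHCP (and NAT) configuration. The sketch below is a hypothetical illustration of that idea only; the node names, MAC addresses and subnets are invented for the example, and the actual NDPF tooling is not shown in these slides.

```python
# Hypothetical sketch: regenerate ISC-dhcpd host entries when worker nodes
# are re-assigned to a different facility (LCG-2, EGEE PPS, ...).
# Node names, MAC addresses and subnets are made up for illustration.

FACILITY_SUBNET = {
    "lcg2":     "192.168.10",   # example NAT'ed worker-node ranges
    "egee-pps": "192.168.20",
    "vle-ctb":  "192.168.30",
}

# node name -> (MAC address, facility, host number)
nodes = {
    "wn-001": ("00:11:22:33:44:01", "lcg2", 1),
    "wn-002": ("00:11:22:33:44:02", "egee-pps", 2),
}

def dhcp_host_entry(name, mac, facility, hostnum):
    """Return one dhcpd.conf host stanza placing the node in its facility's subnet."""
    ip = f"{FACILITY_SUBNET[facility]}.{hostnum}"
    return (f"host {name} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {ip};\n"
            f"}}\n")

def reassign(name, new_facility):
    """Move a node to another facility; a dhcpd reload would pick this up."""
    mac, _, hostnum = nodes[name]
    nodes[name] = (mac, new_facility, hostnum)

if __name__ == "__main__":
    reassign("wn-002", "vle-ctb")
    for name, (mac, facility, hostnum) in sorted(nodes.items()):
        print(dhcp_host_entry(name, mac, facility, hostnum))
```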
NIKHEF Farm Network
Network Uplinks
NIKHEF links:
• 1 Gb/s IPv4 & 1 Gb/s IPv6 (SURFnet)
• 2 Gb/s WTCW (to SARA)
SURFnet links: (diagram)
NDPF Usage
• Analyzed production batch logs since May 2002
• Total of 1.94 PHz·hours provided in 306 000 jobs (rough per-job average sketched below)
• Plot annotations: "Halloween" added, LHC Data Challenges, NCF GFRC added
• Experimental use and tests not shown
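For a sense of scale, the quoted totals imply a few GHz·hours per job on average; a minimal back-of-the-envelope sketch, where the ~2 GHz reference clock is my assumption rather than a figure from the slides:

```python
# Back-of-the-envelope: average job size implied by the usage totals above.
total_phz_hours = 1.94          # PHz*hours since May 2002 (from the slide)
total_jobs = 306_000            # jobs (from the slide)

ghz_hours_per_job = total_phz_hours * 1e15 / 1e9 / total_jobs
print(f"average ~{ghz_hours_per_job:.1f} GHz*hours per job")

# On an assumed ~2 GHz worker node that corresponds to roughly:
assumed_clock_ghz = 2.0
print(f"~{ghz_hours_per_job / assumed_clock_ghz:.1f} wall-clock hours per job")
```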
Usage per Virtual Organisation
Real-time web info: www.nikhef.nl/grid/ and www.dutchgrid.nl/Org/Nikhef/farmstats.html
• DØ acts as "background fill"
• Usage doesn't (yet) reflect shares
Usage monitoring
• Live viewgraphs
  • farm occupancy
  • per-VO distribution
  • network loads
• Tools
  • Cricket (network)
  • home-grown scripts + rrdtool (see the sketch after this list)
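The home-grown scripts mentioned above boil down to feeding occupancy numbers into round-robin databases. Below is a minimal sketch of that pattern, assuming the rrdtool command-line utility is installed; the RRD layout, data-source names and the hard-coded sample values are invented for the example and are not the actual NDPF scripts.

```python
# Minimal sketch of an rrdtool-based occupancy monitor: create an RRD once,
# then push the current number of running and queued jobs into it.
# The RRD layout and data-source names are invented for this example.
import os
import subprocess

RRD = "farm-occupancy.rrd"

def create_rrd():
    """One 5-minute GAUGE pair (running, queued), kept for ~31 days."""
    subprocess.run([
        "rrdtool", "create", RRD, "--step", "300",
        "DS:running:GAUGE:600:0:U",
        "DS:queued:GAUGE:600:0:U",
        "RRA:AVERAGE:0.5:1:8928",          # 31 days of 5-minute samples
    ], check=True)

def update_rrd(running, queued):
    """Append one sample at the current time."""
    subprocess.run(
        ["rrdtool", "update", RRD, f"N:{running}:{queued}"],
        check=True)

if __name__ == "__main__":
    if not os.path.exists(RRD):
        create_rrd()
    # In practice the numbers would come from the batch system
    # (e.g. parsing qstat output); hard-coded here for illustration.
    update_rrd(running=120, queued=45)
```

Graphs like the live farm-occupancy viewgraphs can then be rendered from the RRD with `rrdtool graph`.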
Central services
• VO-LDAP services for the LHC VOs (example query sketched below)
• DutchGrid CA
• "edg-testbed-stuff":
  • Torque & Maui distribution
  • installation support components
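As an illustration of how a VO-LDAP service is typically consumed (for example when building a grid-mapfile), the sketch below queries a VO directory with the ldap3 Python library. The server URL, base DN and requested attribute are placeholders, not the actual NIKHEF configuration.

```python
# Hypothetical illustration of querying a VO-LDAP service for its members.
# Server URL, base DN and attribute names are placeholders only.
from ldap3 import Server, Connection, SUBTREE

VO_LDAP_URL = "ldap://vo-ldap.example.org:389"   # placeholder
BASE_DN = "ou=People,o=atlas,dc=example,dc=org"  # placeholder

server = Server(VO_LDAP_URL)
conn = Connection(server, auto_bind=True)        # anonymous bind

# List the distinguished names registered in the VO.
conn.search(BASE_DN, "(objectClass=*)",
            search_scope=SUBTREE, attributes=["description"])
for entry in conn.entries:
    print(entry.entry_dn)

conn.unbind()
```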
Some of the issues
• Data access patterns in Grids
  • jobs tend to clutter $CWD
  • high load when shared over NFS
  • shared homes required for traditional batch & MPI
• Garbage collection for "foreign" jobs
  • OpenPBS & Torque transient $TMPDIR patch (idea sketched after this list)
• Policy management
  • Maui fair-share policies
  • CPU capping
  • max-queued-jobs capping
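The transient $TMPDIR behaviour, and the garbage collection of whatever "foreign" jobs leave behind, can be approximated with a prologue/epilogue pair. The sketch below is a rough Python approximation of the idea only, not the actual OpenPBS/Torque patch; it assumes the standard Torque convention that the job id is the first argument to both scripts, and the scratch path is made up for the example.

```python
#!/usr/bin/env python
# Rough approximation of a per-job transient TMPDIR using a Torque
# prologue/epilogue pair. This is NOT the actual OpenPBS/Torque patch
# mentioned above; it only illustrates the idea. It assumes the job id
# is passed as the first argument to both scripts (standard Torque layout).
import os
import shutil
import sys

SCRATCH_ROOT = "/scratch"   # assumed local scratch area on each worker node

def job_tmpdir(job_id):
    return os.path.join(SCRATCH_ROOT, f"tmpdir.{job_id}")

def prologue(job_id, user):
    """Create a private scratch directory before the job starts."""
    path = job_tmpdir(job_id)
    os.makedirs(path, mode=0o700, exist_ok=True)
    shutil.chown(path, user=user)

def epilogue(job_id):
    """Garbage-collect the scratch directory after the job ends."""
    shutil.rmtree(job_tmpdir(job_id), ignore_errors=True)

if __name__ == "__main__":
    mode = os.path.basename(sys.argv[0])   # installed as 'prologue' or 'epilogue'
    if mode == "prologue":
        prologue(job_id=sys.argv[1], user=sys.argv[2])
    else:
        epilogue(job_id=sys.argv[1])
```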
Developments: work in progress
• Parallel Virtual File Systems
• From LCFGng to Quattor (Jeff)
• Monitoring and 'disaster recovery' (Davide)
Team