Jon Wakelin, Physics & ACRC Bristol
ACRC • Server Rooms • PTR – 48 APC water-cooled racks (hot aisle/cold aisle) • MVB – 12 APC water-cooled racks (hot aisle/cold aisle) • HPC • IBM, ClusterVision, ClearSpeed. • Storage • 2008-2011? • Petabyte-scale facility • 6 Staff • 1 Director, 2 HPC Admins, 1 Research Facilitator • 1 Visualization Specialist, 1 e-Research Specialist • (1 Storage Admin post?)
ACRC Resources • Phase 1 – ~March 07 • 384-core AMD Opteron 2.6 GHz dual-socket, dual-core system, 8GB Mem. • MVB server room • CVOS and SL 4 on WN. GPFS, Torque/Maui, QLogic InfiniPath • Phase 2 – ~May 08 • 3328-core Intel Harpertown 2.8 GHz dual-socket, quad-core, 8GB Mem. • PTR server room – ~600 metres from MVB server room. • CVOS and SL? on WN. GPFS, Torque/Moab, QLogic InfiniPath • Storage Project (2008 - 2011) • Initial purchase of an additional 100 TB for the PP and Climate Modelling groups • PTR server room • Operational by ~Sep 08. • GPFS will be installed on the initial 100 TB.
ACRC Resources • 184 Registered Users • 54 Projects • 5 Faculties • Eng • Science • Social Science • Medicine & Dentistry • Medical & Vet.
PP Resources • Initial LCG/PP setup • SE (DPM), CE and 16-core PP cluster, MON and UI • CE for HPC (and SE and GridFTP servers for use with ACRC facilities) • HPC Phase 1 • PP have a 5% target fair-share and up to 32 concurrent jobs (see the scheduler sketch below) • New CE, but uses existing SE – accessed via NAT (and slow). • Operational since end of Feb 08 • HPC Phase 2 • SL 5 will limit PP exploitation in the short term. • Exploring virtualization – but this is a medium- to long-term solution • PP to negotiate a larger share of the Phase 1 system to compensate • Storage • 50TB to arrive shortly, operational ~Sep 08 • Additional networking necessary for short/medium-term access.
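The 5% fair-share target and the 32-concurrent-job cap would typically be expressed in maui.cfg roughly as below. This is a hedged sketch only: the group name, window settings and weights are illustrative, not the actual ACRC configuration.

GROUPCFG[pp]  FSTARGET=5.0  MAXJOB=32   # 5% fair-share target, at most 32 running jobs for the group
FSPOLICY      DEDICATEDPS               # charge fair-share usage on dedicated processor-seconds
FSDEPTH       7                         # keep seven fair-share windows
FSINTERVAL    24:00:00                  # each window covers one day
FSWEIGHT      100                       # weight of the fair-share component in job priority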
Storage • Storage Cluster • Separate from the HPC cluster • Will run GPFS • Being installed and configured ‘as we speak’ • Running a ‘test’ StoRM SE • This is the second time • Due to changes in the underlying architecture • Passing simple SAM SE tests • But now removed from the BDII • Direct access between storage and WN • Through multi-cluster GPFS (rather than NAT) – sketched below • Test and real systems may differ in the following ways… • Real system will have a separate GridFTP server • Possibly an NFS export for the Physics cluster • 10Gb NICs (Myricom Myri-10G PCI-Express)
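A minimal sketch of the multi-cluster GPFS access mentioned above, assuming hypothetical cluster names (storage.bris.ac.uk, hpc.bris.ac.uk), contact nodes, key file paths and a filesystem called gpfs_pp; none of these names are taken from the talk.

# On the storage (owning) cluster: enable authentication and authorise the HPC cluster
mmauth genkey new && mmauth update . -l AUTHONLY
mmauth add hpc.bris.ac.uk -k /tmp/hpc_id_rsa.pub
mmauth grant hpc.bris.ac.uk -f gpfs_pp
# On the HPC (accessing) cluster: register the remote cluster and mount its filesystem on the WNs
mmremotecluster add storage.bris.ac.uk -n store01,store02 -k /tmp/storage_id_rsa.pub
mmremotefs add gpfs_pp -f gpfs_pp -C storage.bris.ac.uk -T /gpfs/pp
mmmount gpfs_pp -a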
[Network diagrams: the PTR server room houses HPC Phase 2 and the storage servers (IBM x3650 + Myri-10G); the MVB server room houses HPC Phase 1. The two rooms are linked via Nortel 5510-48/5510-24 edge switches, 5530 switches and 8648 GTR / 8683 modules. NB: All components are Nortel.]
SoC • Separation of Concerns • Storage/compute managed independently of grid interfaces • Storage/compute managed by dedicated HPC experts. • Tap into storage/compute in the manner the ‘electricity grid’ analogy suggested • Provide PP with centrally managed compute and storage • Tarball WN install on HPC (see the sketch below) • StoRM writing files to a remote GPFS mount (developers and tests confirm this) • In theory this is a good idea – in practice it is hard to achieve • (Originally) an implicit assumption that the admin has full control over all components • Software now allows for (mainly) non-root installations • Depend on others for some aspects of support • Impact on turn-around times for resolving issues (SLAs?!?!!)
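A rough sketch of the non-root tarball WN approach mentioned above, assuming the tarball is unpacked into a shared GPFS software area; the paths and file names are assumptions for illustration, not the actual Bristol layout.

# Unpack the middleware tarball into a shared software area visible to the WNs (no root needed)
tar -xzf glite-WN-tarball.tar.gz -C /gpfs/shared/glite-wn
# Grid jobs then pick up the middleware by sourcing the environment script from that area
source /gpfs/shared/glite-wn/etc/profile.d/grid-env.sh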
General Issues • Limit the number of tasks that we pass on to HPC admins • Set up user and ‘admin’ accounts (sudo) and shared software areas • Torque – allow a remote submission host (i.e. our CE) • Maui – ADMIN3 access for certain users (all users are A3 anyway) • NAT • Most other issues are solvable with fewer privileges (see the sketch below) • SSH keys • RPM or rsync for certificate updates • WN tarball for software • Other issues • APEL accounting assumes ExecutingCE == SubmitHost (bug report) • Workaround for the Maui client – key embedded in the binaries! (now changed) • Home directory path has to be exactly the same on the CE and the cluster. • Static route into the HPC private network
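Hedged examples of the handful of privileged requests listed above; all hostnames, user names, paths and addresses are placeholders rather than the real Bristol values.

# Torque: allow job submission from the remote CE
qmgr -c "set server submit_hosts += lcgce.phy.bris.ac.uk"
# Maui: grant ADMIN3 (diagnostic) rights to selected users in maui.cfg
#   ADMIN3  ppuser1 ppuser2
# Certificates/CRLs: push updates to a shared area with rsync instead of root RPM installs
rsync -a /etc/grid-security/certificates/ hpc-login:/gpfs/shared/grid-security/certificates/
# CE: static route into the HPC private network
ip route add 10.10.0.0/16 via 192.168.100.1 dev eth1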
Q’s? • Any questions… • https://webpp.phy.bris.ac.uk/wiki/index.php/Grid/HPC_Documentation • http://www.datadirectnet.com/s2a-storage-systems/capacity-optimized-configuration • http://www.datadirectnet.com/direct-raid/direct-raid • hepix.caspur.it/spring2006/TALKS/6apr.dellagnello.gpfs.ppt