Large Scale Computing at PDSF
Iwona Sakrejda, NERSC User Services Group
ISakrejda@lbl.gov
February ??, 2006
Outline
• Role of PDSF in HENP computing
• Integration with other NERSC computational and storage systems
• User management and user-oriented services at NERSC
• PDSF layout
• Workload management (batch systems)
• File system implications of data-intensive computing
• Operating system selection with CHOS
• Grid use at PDSF (Grid3, OSG, ITB)
• Conclusions
PDSF Mission
PDSF (Parallel Distributed Systems Facility) is a networked distributed computing environment used to meet the detector simulation and data analysis requirements of large-scale High Energy Physics (HEP) and Nuclear Science (NS) experiments.
PDSF Principle of Operation
• Multiple groups (e.g. SNFactory, SNO, ALICE) pool their resources together.
• The need for resources varies through the year: conferences and data-taking periods fall at different times (Quark Matter vs. PANIC, for example).
• Peak resource availability is enhanced.
• Idle cycles are minimized by letting groups with small contributions scavenge unused cycles.
• Software installation and license sharing (TotalView, IDL, PGI).
PDSF at NERSC
• PDSF: ~700 processors, ~1.5 TF, 0.7 TB of memory, ~300 TB of shared disk, jumbo-frame 10 Gigabit Ethernet.
• IBM POWER3 – Seaborg: 6,080 processors (peak 9.1 Tflop/s, SSP 1.35 Tflop/s), 7.8 TB memory, 55 TB of shared disk.
• IBM POWER5 – Bassi: 888 processors (peak 6.7 Tflop/s, SSP 0.8 Tflop/s), 2 TB memory, 70 TB disk.
• Opteron Cluster – Jacquard: 640 processors (peak 2.8 Tflop/s), Opteron/InfiniBand 4X/12X, 3.1 TF / 1.2 TB memory, SSP 0.41 Tflop/s, 30 TB disk.
• Analytics Server – DaVinci (SGI): 32 processors, 192 GB memory, 25 TB disk.
• HPSS: IBM AIX servers, 50 TB of cache disk, 8 STK robots, 44,000 tape slots, maximum capacity 9 PB.
• Shared infrastructure: testbeds and servers, FC disk, storage fabric, 10 Gigabit Ethernet, global filesystem.
User Management and Support at NERSC
• With >500 users and >10 projects, a database management system is needed.
  • Active user management (disabling, password expiration, …)
  • Allocation management (especially mass storage accounting)
• PIs are partly responsible for user management within their own projects:
  • Adding users
  • Assigning users to groups
  • Removing users
• Users manage their own info, groups, certificates, …
• Account support
• User support and the trouble ticket system:
  • Call center
  • Trouble ticket system
PDSF Layout
• Interactive nodes (pdsf.nersc.gov)
• Grid gatekeepers
• Batch pool: several generations of Intel and AMD processors, ~1200 × 1 GHz
• Pool of disk vaults
• GPFS file systems
• HPSS
Workload Management (Batch)
• Effective resource sharing via batch workload management.
• The fair-share principle links shares to groups' financial contributions (illustrated in the sketch below).
  • Fairness is applied both between groups and within groups.
  • The concept is at the heart of the PDSF design.
  • Unused resources are split among running users.
  • Group sharing places additional requirements on batch systems.
• Choice of batch system:
  • LSF: good scalability, performance and documentation; met requirements, but costly.
  • Condor: the concept of a group share was not yet implemented when the transition was considered (~2 years ago).
  • SGE: met requirements, scales reasonably; documentation lacking at times.
• Changes for users are minimized by SUMS (the STAR scheduler).
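To make the fair-share idea concrete, here is a minimal toy sketch (not the actual LSF/SGE algorithm) of how a scheduler can rank groups by comparing each group's target share, derived from its financial contribution, against its recent usage. Groups below their share get priority, and unused share is effectively redistributed to whoever has runnable jobs. All names and numbers are illustrative.

```python
# Toy fair-share ranking: illustrative only, not the LSF/SGE implementation.
# target_share: fraction of the cluster bought by each group.
# recent_usage: fraction of CPU actually consumed in the last accounting window.

target_share = {"STAR": 0.70, "KamLAND": 0.09, "SNO": 0.01, "Majorana": 0.00}
recent_usage = {"STAR": 0.55, "KamLAND": 0.02, "SNO": 0.00, "Majorana": 0.01}

def priority(group):
    """Positive when a group is below its target share, negative when above."""
    return target_share[group] - recent_usage[group]

# Dispatch order: the most under-served group gets the next free job slot.
# Groups with no contribution still run, but only when slots would otherwise idle.
for g in sorted(target_share, key=priority, reverse=True):
    print(f"{g:10s} target={target_share[g]:.2f} used={recent_usage[g]:.2f} "
          f"priority={priority(g):+.2f}")
```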
The Shares System at Work
• STAR's 70% share "pushes out" KamLAND (9% share).
• SNO (1% share) and Majorana (no contribution) get time when the big share owners do not use it.
File System Implications of Data-Intensive Computing – NFS
• NFS is a cost-effective solution, but:
  • it scales poorly,
  • data corruption occurs during heavy use,
  • data safety is a concern (RAID sets help, but are not 100%).
• Disk vaults are cheap IDE-based centralized storage.
• dvio: a batch-level "resource" integrated with the batch system (see the sketch below).
  • Defined to limit the number of simultaneous read/write access streams.
  • The load is hard to assess a priori.
• Ganglia facilitates load monitoring and dvio requirement assessment; it is available to the users.
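As an illustration of how a dvio-style countable resource is consumed, the sketch below submits a job that declares one read/write stream against a particular disk vault; the batch system then dispatches only as many such jobs as the configured dvio capacity allows. The resource name `dvio_dv01`, the vault name, and the script path are assumptions for illustration, not the actual PDSF configuration.

```python
# Minimal sketch: request one unit of a consumable "dvio" resource at submission
# time so the batch system caps concurrent I/O streams against a disk vault.
# Resource name, vault name, and script path are hypothetical.
import subprocess

def submit_with_dvio(script, vault="dv01", streams=1):
    """Submit an SGE job that consumes `streams` units of the vault's dvio resource."""
    cmd = [
        "qsub",
        "-l", f"dvio_{vault}={streams}",   # consumable resource request (assumed name)
        script,
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(submit_with_dvio("analysis_job.sh"))
```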
Usage per Discipline
• I/O and data are dominated by Nuclear Physics.
File System Implications of Data-Intensive Computing – Local Storage
• Local storage on batch nodes:
  • Cheap storage (large, inexpensive hard drives).
  • Very good I/O performance.
  • Limited to jobs running on the node.
• The diversity of the user population does not facilitate batch-node sharing; users are wary of Xrootd daemons.
• No redundancy: a drive failure causes data loss.
• A file catalog aids in job submission; SUMS does the rest.
File System Implications of Data-Intensive Computing – GPFS
• NERSC purchased GPFS software licenses for PDSF.
• Reliable (RAID underneath).
• Good performance (striping).
• Self-repairing: even after disengaging under load it comes back online.
  • Compare with NFS "stale file handles", which had to be fixed by an admin or a cron job.
• Expensive.
• PDSF will host several GPFS file systems:
  • 7 are already in place.
  • ~15 TB per file system, since there is not yet enough experience with GPFS on Linux.
File System Implications of Data-Intensive Computing – Beta Testing
• File system testing (open-software version):
  • The file system performed reasonably well under high load.
  • Support and maintenance are manpower-intensive.
• Storage units from commercial vendors made available for beta testing:
  • Support is provided by the vendors.
  • Users get cutting-edge, highly capable storage appliances to use for extended periods of time.
  • Staff are obliged to produce reports, an additional (light) workload.
  • Units too expensive to purchase; extra work related to data uploading.
  • Affordable units from new companies carry uncertainty about support continuity.
Role of Mass Storage in Data Management
• Data-intensive experiments require "smart backup": only $HOME, system and application areas are backed up automatically.
• PDSF storage media are reliable, but not disaster-proof.
• Groups have allocations in mass storage to selectively store their data (see the sketch below).
• Users have individual accounts in mass storage to back up their work.
• Network bandwidth: 10 Gigabit Ethernet to HPSS.
  • A large HPSS cache and a large number of tape movers facilitate quick access to stored data.
  • The number of drives is still an issue.
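A hedged sketch of the kind of "smart backup" described above: a small script that pushes a group's selected data products into its HPSS allocation with the hsi client. hsi is the standard HPSS interface, but the HPSS path, the file names, and the choice of `cput` (copy only if not already archived) are illustrative assumptions.

```python
# Sketch of selective "smart backup" to HPSS using the hsi client.
# HPSS path and file selection are illustrative; $HOME and system areas are
# already backed up automatically, so only chosen data products go to tape here.
import subprocess
from pathlib import Path

HPSS_AREA = "/nersc/projects/star/dst_backup"   # hypothetical group allocation path

def archive(files):
    for f in files:
        f = Path(f)
        # cput: copy to HPSS only if the file is not already archived there.
        cmd = f"cput {f} : {HPSS_AREA}/{f.name}"
        subprocess.run(["hsi", cmd], check=True)

if __name__ == "__main__":
    archive(["example_dst_0001.root", "example_dst_0002.root"])
```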
Operating System Selection with CHOS
• PDSF is a secondary computing facility for most of its user groups:
  • It is not free to independently select an operating system; it is tied to each experiment's Tier0 selection.
  • PDSF projects originated at various times (some in the past, some still to come).
  • The Tier0s embraced different operating systems and evolved differently.
• PDSF accommodates the needs of diverse groups with CHOS:
  • A framework for concurrently running multiple Linux environments (distributions) on a single node.
  • Accomplished through a combination of the chroot system call, a Linux kernel module, and some additional utilities.
  • Can be configured so that users are transparently presented with their selected distribution on login.
Operating System Selection with CHOS (cont.)
• Support for operating systems based on the same kernel version:
  • RH 7.2
  • RH 8
  • RH 9
  • SL 3.0.2
• Base system: SL 3.0.3; it provides security.
• More info about CHOS is available at: http://www.nersc.gov/nusers/resources/PDSF/chos/faq.php
• CHOS protected PDSF from fragmentation of resources, a unique approach to multi-group support: sharing is possible even when diverse operating systems are required. The mechanism is sketched below.
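The core mechanism CHOS builds on can be sketched in a few lines: read the user's requested environment and chroot into the corresponding distribution tree before starting the login shell. This is only a conceptual illustration; the real CHOS uses a kernel module plus utilities so the switch is transparent and does not require root, and the `~/.chos` file name and `/chos/os/...` tree layout are assumptions.

```python
# Conceptual sketch of what CHOS automates: pick a Linux environment per user
# and chroot into it before handing over a shell. This toy version must run as
# root; the preference file and directory layout below are assumptions.
import os
from pathlib import Path

OS_TREES = "/chos/os"          # assumed location of the per-distribution root trees
DEFAULT_ENV = "sl302"

def selected_env():
    """Read the user's requested environment (e.g. 'rh8', 'sl302') from ~/.chos."""
    pref = Path.home() / ".chos"
    return pref.read_text().strip() if pref.exists() else DEFAULT_ENV

def enter_env():
    env = selected_env()
    os.chroot(f"{OS_TREES}/{env}")   # requires root; CHOS avoids this via its kernel module
    os.chdir("/")
    os.execv("/bin/bash", ["bash", "-l"])   # start a login shell inside the chosen distro

if __name__ == "__main__":
    enter_env()
```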
Who Has Used the Grid at NERSC
• PDSF pioneered the introduction of Grid services at NERSC, including participation in the Grid3 project.
• Mostly PDSF (Parallel Distributed Systems Facility) users, who analyze detector data and simulations:
  • STAR detector simulations and data analysis: studies the quark-gluon plasma and proton-proton collisions; 631 collaborators from 52 project institutions, 265 users at NERSC …
  • Simulations for the ALICE experiment at CERN: studies ion-ion collisions; 19 NERSC users from 11 institutions.
  • Simulations for the ATLAS experiment at CERN: studies fundamental particle processes; 56 NERSC users from 17 institutions.
Caveats – Grid Usage Thoughts
• Most NERSC users are not using the Grid:
  • The Office of Science "Massively Parallel Processing" (MPP) user communities have not embraced the grid.
  • Even on PDSF, only a few "production managers" use the grid; most users do not.
• Site policy side effects: ATLAS and CMS stopped using the grid at NERSC due to lack of support for group accounts.
• It is difficult/tedious/confusing to get a Grid certificate.
• Lack of support at NERSC for Virtual Organizations.
• One grid user's opinion: instead of writing the middleware and troubleshooting it, just use a piece of paper to keep track of jobs and pftp for file transfers.
• However, several STAR users have been testing the Grid for user analysis jobs, so interest may be growing.
STAR Grid Computing at NERSC
Grid computing benefits to STAR:
• Bulk data transfer RCF -> NERSC with Storage Resource Management (SRM) technologies (see the sketch below).
  • SRM automates end-to-end transfers: increased throughput and reliability, and less monitoring effort by data managers.
  • The source/destination can be files on disk or in the HPSS mass storage system.
  • 60 TB transferred in CY05, with automatic cataloging.
  • Typical transfers are ~10k files, 1 TB, about 5 days in duration.
• Doubles STAR processing power, since all data are available at both sites.
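The bulk RCF -> NERSC replication described above is the kind of task SRM tools automate. The sketch below shows the general pattern of driving many srmcp transfers with retries and recording successes for later cataloging; the SRM endpoint URLs, retry policy, file name, and log file are illustrative assumptions, not STAR's actual transfer configuration.

```python
# Sketch of automating bulk SRM transfers (RCF -> NERSC) with retries and a
# simple success log for later cataloging. Endpoints and paths are made up.
import subprocess, time

SRC = "srm://srm.rcf.bnl.gov:8443/star/dst"        # hypothetical source endpoint
DST = "srm://garchive.nersc.gov:8443/star/dst"     # hypothetical destination endpoint

def transfer(filename, retries=3, wait=60):
    for attempt in range(1, retries + 1):
        rc = subprocess.run(["srmcp", f"{SRC}/{filename}", f"{DST}/{filename}"]).returncode
        if rc == 0:
            return True
        time.sleep(wait * attempt)       # back off before retrying a failed transfer
    return False

def run(filelist):
    with open("transferred.txt", "a") as log:       # fed to the file catalog afterwards
        for f in filelist:
            if transfer(f):
                log.write(f + "\n")

if __name__ == "__main__":
    run(["example.MuDst.root"])
```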
STAR Grid Computing at NERSC (cont.)
Grid computing benefits to STAR:
• Grid-based job submission with the STAR scheduler (SUMS); production grid jobs run daily from RCF to PDSF.
  • SUMS XML job description ->
  • Condor-G grid job submission ->
  • SGE submission to the PDSF batch system (sketched below).
• Uses SRMs for input and output file transfers.
• Handles catalog queries, job definitions, grid/local job submission, etc.
• The underlying technologies are largely hidden from the user.
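To make the chain above concrete, here is a hedged sketch of the middle step: turning a simplified SUMS-style job description into a Condor-G submit file that targets a gatekeeper whose jobmanager forwards the job to SGE on PDSF. The XML fields, the gatekeeper host name, and the submit-file keywords are illustrative; SUMS's real schema and the site's actual jobmanager configuration may differ.

```python
# Sketch: SUMS-style XML job description -> Condor-G submit file -> SGE on PDSF.
# The XML schema, gatekeeper name, and jobmanager string are assumptions.
import subprocess, xml.etree.ElementTree as ET

JOB_XML = """<job name="example">
  <command>root4star -b -q doMuDst.C</command>
  <stdout URL="file:./example.log"/>
</job>"""

GATEKEEPER = "pdsfgrid.nersc.gov/jobmanager-sge"   # hypothetical gatekeeper + SGE jobmanager

def to_condor_g(xml_text):
    job = ET.fromstring(xml_text)
    exe, *args = job.findtext("command").split()
    return "\n".join([
        "universe = globus",                      # Condor-G grid universe (GT2-era keywords)
        f"globusscheduler = {GATEKEEPER}",
        f"executable = {exe}",
        f"arguments = {' '.join(args)}",
        f"output = {job.find('stdout').get('URL').removeprefix('file:')}",
        "queue",
    ])

if __name__ == "__main__":
    with open("example.condorg", "w") as f:
        f.write(to_condor_g(JOB_XML))
    subprocess.run(["condor_submit", "example.condorg"], check=True)
```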
STAR Grid Computing at NERSC (cont.)
• Goal: use SUMS to run STAR user analysis and data-mining jobs on OSG sites. The issues are:
  • Transparent packaging and distribution of STAR software on OSG sites not dedicated to STAR.
  • SRM services need to be deployed consistently at OSG sites (preferred) or deployed along with the jobs (how?).
  • Inconsistencies of inbound/outbound site policies.
  • SUMS's generic interface is adaptable to other VOs running on OSG; community support is offered.
NERSC Contributions to the Grid
• myproxy.nersc.gov
  • Users don't have to scp their certs to different sites.
  • Safely stores credentials; uses SSL.
  • Anyone can use it from anywhere (workflow sketched below):
    • myproxy-init -s myproxy.nersc.gov
    • myproxy-get-delegation
  • Part of the VDT and OSG software distributions.
• Management of grid-map files
  • NERSC users put their certs into the NERSC Information Management system.
  • The certs are automatically propagated to all NERSC resources.
• garchive.nersc.gov
  • GSI authentication added to the HPSS pftp client and server.
  • Users can log in to HPSS using their grid certs.
  • The software was contributed to the HPSS consortium.
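A minimal sketch of the MyProxy workflow referenced above: store a delegated credential once from a machine that holds your grid certificate, then retrieve a short-lived proxy from any other site. The server name and the two commands come from the slide; the lifetime flags and values are illustrative defaults.

```python
# Sketch of the MyProxy workflow: delegate once, retrieve proxies anywhere,
# instead of scp-ing certificates between sites. Lifetimes are assumptions.
import subprocess

MYPROXY_SERVER = "myproxy.nersc.gov"

def store_credential(days=7):
    # Run where your long-term certificate lives; prompts for the cert passphrase.
    subprocess.run(["myproxy-init", "-s", MYPROXY_SERVER, "-c", str(days * 24)], check=True)

def get_proxy(hours=12):
    # Run on any grid node: fetches a short-lived proxy without copying certs around.
    subprocess.run(["myproxy-get-delegation", "-s", MYPROXY_SERVER, "-t", str(hours)], check=True)

if __name__ == "__main__":
    get_proxy()
```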
Online Certification Services (in development)
• Would allow users to use grid services without having to get a grid cert:
  • myproxy-logon -s myproxy.nersc.gov
  • Generates a proxy cert on the fly.
  • Built on top of PAM and MyProxy.
• Will use a RADIUS server to authenticate users:
  • RADIUS is a protocol to securely send authentication and auditing information between sites.
  • Can authenticate with LDAP, a One-Time Password, or a Grid cert.
  • Could be used to federate sites.
Audit Trail for Group Accounts (proposed development)
• NERSC needs to be able to trace sessions and commands back to individual users.
• Some projects need to set up a production environment managed by multiple users (who can then jointly manage the production jobs and data).
• Build an environment that accepts multiple certs or multiple username/passwords for a single account.
• Keep logs that can associate PIDs/UIDs with the actual user.
• Provide an audit trail that reconstructs the original authentication associated with the PID/UID.
Conclusions
• NERSC/PDSF is a fully resource-sharing facility:
  • Several storage solutions have been evaluated; there are lots of choices and some emerging trends (distributed file systems, I/O-balanced systems, …).
  • CPU is shared based on financial contributions.
  • Fully opportunistic: if a share is not used, it can be taken by others.
• NERSC will base its deployment decisions on science- and user-driven requirements.
• There is a lot of ongoing research in distributed computing technologies.
• NERSC can contribute to STAR/OSG efforts:
  • Auditing and login-tracing tools.
  • Online certification services (integrating LDAP, One-Time Passwords and Grid certs).
  • A testbed for OSG software on HPC architectures.
  • User support.