Large scale, Grid-enabled, dCache-based Storage System and Service Challenge Practice at BNL Zhenping (Jane) Liu RHIC/ATLAS Computing Facility, Physics Department Brookhaven National Lab Nov. 12-18 2005, SUPERCOMPUTING 2005, Seattle
Outline • Background • Large scale, Grid-enabled, dCache-based Disk Storage System at BNL • Features • System architecture • Usage of the system • Long-term plan • Service Challenge Practice
Background • BNL RHIC/ATLAS Computing Facility • The Tier-1 computing center for USATLAS • Goals • Operate a persistent, production-quality Grid capable of marshalling computing and data resources for the USATLAS project. • Challenge: providing storage services • Local and grid-based access to very large datasets in a reliable, cost-efficient and high-performance manner.
Solution for Grid-enabled Storage Element at BNL • Software • dCache (a product of DESY and FNAL) • Free • Hybrid hardware solution – cost efficient • The majority of dCache servers share resources with a large number of worker nodes on the Linux farm. • Utilizes otherwise idle disks on the worker nodes • Each worker node acts as both a storage node and a compute node. • A small number of more critical servers run on dedicated, higher-quality hardware
USATLAS dCache system at BNL • A large, production-quality, grid-enabled storage element • Very large disk cache (dCache) system: • 336 servers, 150 TB of disk space. • Provides services to store and access very large datasets for all local and grid ATLAS users. • In production service since November 2004. • Reliable, cost-efficient and high-performance • Grid-enabled (SRM, GSIFTP) Storage Element in the context of OSG and LCG
USATLAS dCache system at BNL (Cont.) • Features • Distributed disk caching system as a front-end for the Mass Storage System • High performance • Reliability • Support of various access protocols • Cost-efficient solution • Scalability • Flexible system tuning
Distributed disk caching system • Distributed disk caching system as a front-end for the Mass Storage System (BNL HPSS). • Simulates an "infinite" space with tape as the backend • Allows transparent access to a large number of data files distributed across disk pools or stored on tape. • Provides users with one unique namespace for all data files. • A selection mechanism determines whether a requested file is already cached on one or more disk pools or must be staged from tape (see the sketch below).
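To make the single-namespace idea concrete, here is a minimal, purely illustrative Python sketch (the function and variable names are hypothetical, not dCache internals): a request against the namespace path is served from a disk pool if a cached copy exists, and otherwise triggers a stage from the HPSS backend.

```python
# Illustrative sketch only: one namespace, many possible locations.
# `pool_catalog` and `hsm_stage` are hypothetical stand-ins, not dCache APIs.

def locate(pnfs_path, pool_catalog, hsm_stage):
    """Return a pool that can serve pnfs_path, staging from tape if needed."""
    cached = pool_catalog.get(pnfs_path, [])   # pools already holding a copy
    if cached:
        return cached[0]                       # serve directly from disk
    pool = hsm_stage(pnfs_path)                # otherwise stage the HPSS copy to a read pool
    pool_catalog.setdefault(pnfs_path, []).append(pool)
    return pool
```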
High performance • High-performance data I/O throughput. • Direct client–disk (pool) and disk (pool)–tape (HPSS) connections. • High aggregate data I/O • Significantly improves the efficiency of the connected tape storage system through caching, i.e. gather & flush, and scheduled staging techniques (illustrated below). • Optimized backend tape prestage batch system.
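The "gather & flush" idea can be sketched as follows (the threshold value and all names are made up for illustration; this is not dCache code): newly written files are queued per storage class and flushed to tape in bulk, so the tape system sees a few large, schedulable writes instead of many small ones.

```python
# Hypothetical sketch of gather & flush batching for tape writes.
from collections import defaultdict

FLUSH_BYTES = 50 * 1024**3        # flush once ~50 GB is queued (made-up threshold)
pending = defaultdict(list)       # storage class -> [(path, size_in_bytes), ...]

def on_write_complete(storage_class, path, size, flush_to_hpss):
    """Queue a freshly written file; flush the whole class in one batch when large enough."""
    pending[storage_class].append((path, size))
    if sum(s for _, s in pending[storage_class]) >= FLUSH_BYTES:
        batch = pending.pop(storage_class)
        flush_to_hpss(storage_class, [p for p, _ in batch])   # one bulk tape write
```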
Reliability • Load balanced and fault tolerant • Automatic load balancing using a cost metric and inter-pool transfers. • Dynamically replicates files upon detection of a hot spot (see the sketch below). • Allows multiple distributed servers for each service type • e.g., read pools, write pools, access points (doors: DCAP doors, SRM doors, GridFTP doors).
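The hot-spot replication behaviour can be sketched like this (an assumption-laden illustration, not PoolManager code; the threshold and the pool bookkeeping are invented): when one replica's mover load crosses a threshold, the file is copied pool-to-pool to the least loaded pool so subsequent reads are spread out.

```python
# Illustrative hot-spot replication sketch; all names and thresholds are hypothetical.
HOT_THRESHOLD = 20   # concurrent movers on one replica before replicating (made up)

def maybe_replicate(path, replicas, pools, pool_to_pool_copy):
    """replicas: pool names holding `path`; pools: name -> {"active_movers": int}."""
    busiest = max(replicas, key=lambda p: pools[p]["active_movers"])
    if pools[busiest]["active_movers"] >= HOT_THRESHOLD:
        candidates = [p for p in pools if p not in replicas]
        if candidates:
            target = min(candidates, key=lambda p: pools[p]["active_movers"])
            pool_to_pool_copy(path, busiest, target)   # add a second replica on a quiet pool
            replicas.append(target)
```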
Support of various access protocols • Local access protocol: DCAP (POSIX-like) • GsiFTP data transfer protocol • Secure wide-area data transfer protocol • Storage Resource Manager protocol (SRM) – provides an SRM-based storage element
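For orientation, the three access paths look roughly like this from the client side; the door host names, ports and PNFS path below are placeholders, not the actual BNL endpoints, and the commands are simply the standard dccp, globus-url-copy and srmcp clients.

```python
# Placeholder endpoints; only the command-line shapes are meant to be illustrative.
import subprocess

PNFS = "/pnfs/usatlas.bnl.gov/data/example.root"   # hypothetical file in the namespace

# Local POSIX-like access through a DCAP door:
subprocess.run(["dccp",
                f"dcap://dcap-door.example.bnl.gov:22125{PNFS}",
                "/tmp/example.root"], check=True)

# GSI-authenticated wide-area transfer through a GridFTP door:
subprocess.run(["globus-url-copy",
                f"gsiftp://gridftp-door.example.bnl.gov:2811{PNFS}",
                "file:///tmp/example.root"], check=True)

# SRM-negotiated transfer through the SRM door:
subprocess.run(["srmcp",
                f"srm://srm-door.example.bnl.gov:8443{PNFS}",
                "file:///tmp/example.root"], check=True)
```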
Cost efficient • Free software • Hybrid hardware model • The majority of dCache servers share resources with a large number of worker nodes on the Linux farm. • Utilizes low-cost, locally mounted disk space on the computing farm • A small number of more critical servers run on dedicated, higher-quality hardware
Scalability • High Scalability • Distributed Movers and Access Points (Doors) • Highly distributed Storage Pools • Direct client – disk (pool) and disk (pool) – tape (HPSS) connection.
Flexible system tuning • The system determines the source or destination storage pool based on • storage group • network mask of clients • I/O direction • “CPU” load • disk space • configuration
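A rough sketch of how such a selection could work (the rule structure, field names and cost weights are invented for illustration and do not reflect the actual PoolManager configuration): candidate pools are filtered by storage group, client network and I/O direction, then the least loaded pool with the most free space wins.

```python
# Illustrative pool-selection sketch; data layout and weights are hypothetical.
import ipaddress

def select_pool(pools, storage_group, client_ip, direction, w_load=1.0, w_space=1.0):
    """pools: name -> {"storage_groups", "directions", "allowed_networks",
    "active_movers", "free_fraction"} (all hypothetical fields)."""
    candidates = [
        name for name, p in pools.items()
        if storage_group in p["storage_groups"]
        and direction in p["directions"]                      # "read" or "write"
        and any(ipaddress.ip_address(client_ip) in ipaddress.ip_network(net)
                for net in p["allowed_networks"])
    ]
    # Lower cost = fewer active movers and more free space.
    def cost(name):
        p = pools[name]
        return w_load * p["active_movers"] + w_space * (1.0 - p["free_fraction"])
    return min(candidates, key=cost) if candidates else None
```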
System architecture • see the next slide
[Architecture diagram: GridFTP, DCap, and SRM clients; control channels to the GridFTP doors, DCap doors, and SRM door; data channels directly to the write and read pools; Pnfs Manager and Pool Manager coordinating the dCache system; HPSS tape backend; batch system; Oak Ridge]
Usage of the system • Total volume of datasets (production data only) • 110 TB of production data stored (as of 11/03/2005) • The system has exhibited high performance during a series of Service Challenges and US ATLAS production runs.
Users and use pattern • Clients from BNL on-site • Local analysis application • Grid production jobs submitted to BNL • Other on-site users • Off-site grid users • GridFTP clients • Grid production jobs submitted to remote sites • Other grid users • SRM clients
Long-term plan • To build a petabyte-scale grid-enabled storage element • Use petabyte-scale disk space on thousands of farm nodes to hold the most recently used data on disk. • Each year of ATLAS experiment running will generate data volumes on the petabyte scale. • HPSS as the tape backup for all data.
Long-term plan (Cont.) • dCache as the grid-enabled distributed storage element solution. • Issues to be investigated • Is dCache scalable to very large clusters (thousands of nodes)? • Will network I/O become a bottleneck for a very large cluster? • Monitoring and administration of a petabyte-scale disk storage system.
Service challenge • Service Challenge • To test the readiness of the overall computing system to provide the necessary computational and storage resources to exploit the scientific potential of the LHC machine. • SC2 • Disk-to-disk transfer from CERN to BNL • SC3 throughput phase • Disk-to-disk transfer from CERN to BNL • Disk-to-tape transfer from CERN to BNL • Disk-to-disk transfer from BNL to Tier-2 centers
SC2 at BNL • Testbed dCache • Four dCache pool servers with a 1-Gigabit WAN network connection. • Met the performance/throughput challenge (disk-to-disk transfer rate of 70–80 MB/s from CERN to BNL).
SC3 throughput phase • Steering: FTS; control: SRM; transfer protocol: GridFTP • The production dCache system was used, with a network upgrade to 10 Gbps between the USATLAS storage system and the BNL BGP router • Disk-to-disk transfer from CERN to BNL • Achieved rate of 100–120 MB/s with a peak rate of 150 MB/s (sustained for one week) • Disk-to-tape transfer from CERN to BNL HPSS • Achieved rate: 60 MB/s (sustained for one week) • Disk-to-disk transfer testing from BNL to Tier-2 centers • Tier-2 centers: BU, UC, IU, UTA • Aggregate transfer rate of 30–40 MB/s
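As a back-of-the-envelope check, the sustained rates above translate into weekly volumes as follows (illustrative arithmetic only, using the midpoints of the quoted ranges):

```python
# Convert sustained MB/s into approximate TB per week (decimal TB).
SECONDS_PER_WEEK = 7 * 24 * 3600

for label, mb_per_s in [("Disk-to-disk CERN -> BNL", 110),    # midpoint of 100-120 MB/s
                        ("Disk-to-tape CERN -> BNL HPSS", 60),
                        ("BNL -> Tier-2 aggregate", 35)]:      # midpoint of 30-40 MB/s
    tb_per_week = mb_per_s * SECONDS_PER_WEEK / 1e6
    print(f"{label}: ~{tb_per_week:.0f} TB/week")
# -> roughly 67, 36 and 21 TB per week respectively
```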
Links • BNL dCache user guide website • http://www.atlasgrid.bnl.gov/dcache/manuals/ • USATLAS tier-1 & tier-2 dCache systems. • http://www.atlasgrid.bnl.gov/dcache_admin/ • USATLAS dCache workshop • http://agenda.cern.ch/fullAgenda.php?ida=a055146