Large scale, Grid-enabled, dCache-based Storage System and Service Challenge Practice at BNL Zhenping (Jane) Liu RHIC/ATLAS Computing Facility, Physics Department Brookhaven National Lab Nov. 12-18 2005, SUPERCOMPUTING 2005, Seattle
Outline • Background • Large scale, Grid-enabled, dCache-based Disk Storage System at BNL • Features • System architecture • Usage of the system • Long-term plan • Service Challenge Practice
Background • BNL RHIC/ATLAS Computing Facility • The Tier-1 computing center for USATLAS • Goals • Operate a persistent, production-quality Grid capable of marshalling computing and data resources for the USATLAS project. • Challenge: providing storage services • Local and grid-based access to very large datasets in a reliable, cost-efficient and high-performance manner.
Solution for Grid-enabled Storage Element at BNL • Software • dCache (a product of DESY and FNAL) • Free • Hybrid hardware solution – cost efficient • The majority of dCache servers share resources with a large number of worker nodes on the Linux farm. • Utilizes otherwise idle disks on the worker nodes • Each worker node acts as both a storage node and a compute node. • A small number of more critical servers run on dedicated, higher-quality hardware
USATLAS dCache system at BNL • A large, production-quality, grid-enabled storage element • Very large disk cache (dCache) system: • 336 servers, 150 TB of disk space. • Provides services to store and access very large datasets for all local and grid ATLAS users. • In production service since November 2004. • Reliable, cost-efficient and high-performance • Grid-enabled (SRM, GSIFTP) Storage Element in the context of OSG and LCG
USATLAS dCache system at BNL (Cont.) • Features • Distributed disk caching system as a front-end for the Mass Storage System • High performance • Reliability • Support of various access protocols • Cost-efficient solution • Scalability • Flexible system tuning
Distributed disk caching system • Distributed disk caching system as a front-end for the Mass Storage System (BNL HPSS). • Simulates an "infinite" space with tape as the backend • Allows transparent access to a large number of data files distributed across disk pools or stored on tape. • Provides users with one unique namespace for all data files. • A selection mechanism determines whether a requested file is already cached on one or more disk pools or must be staged from tape (see the sketch below).
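To make the single-namespace idea concrete, here is a minimal, purely illustrative Python sketch (the function and variable names are hypothetical, not dCache internals): a request against the namespace path is served from a disk pool if a cached copy exists, and otherwise triggers a stage from the HPSS backend.

```python
# Illustrative sketch only: one namespace, many possible locations.
# `pool_catalog` and `hsm_stage` are hypothetical stand-ins, not dCache APIs.

def locate(pnfs_path, pool_catalog, hsm_stage):
    """Return a pool that can serve pnfs_path, staging from tape if needed."""
    cached = pool_catalog.get(pnfs_path, [])   # pools already holding a copy
    if cached:
        return cached[0]                       # serve directly from disk
    pool = hsm_stage(pnfs_path)                # otherwise stage the HPSS copy to a read pool
    pool_catalog.setdefault(pnfs_path, []).append(pool)
    return pool
```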
High performance • High-performance data I/O throughput. • Direct client–disk (pool) and disk (pool)–tape (HPSS) connections. • High aggregate data I/O • Significantly improves the efficiency of the connected tape storage system through caching, i.e. gather & flush, and scheduled staging techniques (illustrated below). • Optimized backend tape prestage batch system.
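The "gather & flush" idea can be sketched as follows (the threshold value and all names are made up for illustration; this is not dCache code): newly written files are queued per storage class and flushed to tape in bulk, so the tape system sees a few large, schedulable writes instead of many small ones.

```python
# Hypothetical sketch of gather & flush batching for tape writes.
from collections import defaultdict

FLUSH_BYTES = 50 * 1024**3        # flush once ~50 GB is queued (made-up threshold)
pending = defaultdict(list)       # storage class -> [(path, size_in_bytes), ...]

def on_write_complete(storage_class, path, size, flush_to_hpss):
    """Queue a freshly written file; flush the whole class in one batch when large enough."""
    pending[storage_class].append((path, size))
    if sum(s for _, s in pending[storage_class]) >= FLUSH_BYTES:
        batch = pending.pop(storage_class)
        flush_to_hpss(storage_class, [p for p, _ in batch])   # one bulk tape write
```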
Reliability • Load balanced and fault tolerant • Automatic load balancing using a cost metric and inter-pool transfers. • Dynamically replicates files upon detection of a hot spot (see the sketch below). • Allows multiple distributed servers for each service type • e.g., read pools, write pools, access points (doors: DCAP doors, SRM doors, GridFTP doors).
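The hot-spot replication behaviour can be sketched like this (an assumption-laden illustration, not PoolManager code; the threshold and the pool bookkeeping are invented): when one replica's mover load crosses a threshold, the file is copied pool-to-pool to the least loaded pool so subsequent reads are spread out.

```python
# Illustrative hot-spot replication sketch; all names and thresholds are hypothetical.
HOT_THRESHOLD = 20   # concurrent movers on one replica before replicating (made up)

def maybe_replicate(path, replicas, pools, pool_to_pool_copy):
    """replicas: pool names holding `path`; pools: name -> {"active_movers": int}."""
    busiest = max(replicas, key=lambda p: pools[p]["active_movers"])
    if pools[busiest]["active_movers"] >= HOT_THRESHOLD:
        candidates = [p for p in pools if p not in replicas]
        if candidates:
            target = min(candidates, key=lambda p: pools[p]["active_movers"])
            pool_to_pool_copy(path, busiest, target)   # add a second replica on a quiet pool
            replicas.append(target)
```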
Support of various access protocols • Local access protocol: DCAP (POSIX-like) • GsiFTP data transfer protocol • Secure wide-area data transfer protocol • Storage Resource Manager protocol (SRM) – provides an SRM-based storage element
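For orientation, the three access paths look roughly like this from the client side; the door host names, ports and PNFS path below are placeholders, not the actual BNL endpoints, and the commands are simply the standard dccp, globus-url-copy and srmcp clients.

```python
# Placeholder endpoints; only the command-line shapes are meant to be illustrative.
import subprocess

PNFS = "/pnfs/usatlas.bnl.gov/data/example.root"   # hypothetical file in the namespace

# Local POSIX-like access through a DCAP door:
subprocess.run(["dccp",
                f"dcap://dcap-door.example.bnl.gov:22125{PNFS}",
                "/tmp/example.root"], check=True)

# GSI-authenticated wide-area transfer through a GridFTP door:
subprocess.run(["globus-url-copy",
                f"gsiftp://gridftp-door.example.bnl.gov:2811{PNFS}",
                "file:///tmp/example.root"], check=True)

# SRM-negotiated transfer through the SRM door:
subprocess.run(["srmcp",
                f"srm://srm-door.example.bnl.gov:8443{PNFS}",
                "file:///tmp/example.root"], check=True)
```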
Cost efficient • Free software • Hybrid hardware model • The majority of dCache servers share resources with a large number of worker nodes on the Linux farm. • Utilizes low-cost, locally mounted disk space on the computing farm • A small number of more critical servers run on dedicated, higher-quality hardware
Scalability • High Scalability • Distributed Movers and Access Points (Doors) • Highly distributed Storage Pools • Direct client – disk (pool) and disk (pool) – tape (HPSS) connection.
Flexible system tuning • The system determines the source or destination storage pool based on • storage group • network mask of clients • I/O direction • “CPU” load • disk space • configuration
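A rough sketch of how such a selection could work (the rule structure, field names and cost weights are invented for illustration and do not reflect the actual PoolManager configuration): candidate pools are filtered by storage group, client network and I/O direction, then the least loaded pool with the most free space wins.

```python
# Illustrative pool-selection sketch; data layout and weights are hypothetical.
import ipaddress

def select_pool(pools, storage_group, client_ip, direction, w_load=1.0, w_space=1.0):
    """pools: name -> {"storage_groups", "directions", "allowed_networks",
    "active_movers", "free_fraction"} (all hypothetical fields)."""
    candidates = [
        name for name, p in pools.items()
        if storage_group in p["storage_groups"]
        and direction in p["directions"]                      # "read" or "write"
        and any(ipaddress.ip_address(client_ip) in ipaddress.ip_network(net)
                for net in p["allowed_networks"])
    ]
    # Lower cost = fewer active movers and more free space.
    def cost(name):
        p = pools[name]
        return w_load * p["active_movers"] + w_space * (1.0 - p["free_fraction"])
    return min(candidates, key=cost) if candidates else None
```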
System architecture • see the next slide
[Architecture diagram: GridFTP, DCap, and SRM clients; control channels to the GridFTP doors, DCap doors, and SRM door; data channels directly to the write and read pools; Pnfs Manager and Pool Manager coordinating the dCache system; HPSS tape backend; batch system; Oak Ridge]
Usage of the system • Total volume of datasets (production data only) • 110 TB of production data stored (as of 11/03/2005) • The system has exhibited high performance during a series of Service Challenges and US ATLAS production runs.
Users and use pattern • Clients from BNL on-site • Local analysis application • Grid production jobs submitted to BNL • Other on-site users • Off-site grid users • GridFTP clients • Grid production jobs submitted to remote sites • Other grid users • SRM clients
Long-term plan • To build a petabyte-scale grid-enabled storage element • Use petabyte-scale disk space on thousands of farm nodes to hold the most recently used data on disk. • Each year of ATLAS experiment running will generate data volumes on the petabyte scale. • HPSS as the tape backup for all data.
Long-term plan (Cont.) • dCache as the grid-enabled distributed storage element solution. • Issues to be investigated • Is dCache scalable to very large clusters (thousands of nodes)? • Will network I/O become a bottleneck for a very large cluster? • Monitoring and administration of a petabyte-scale disk storage system.
Service challenge • Service Challenge • To test the readiness of the overall computing system to provide the necessary computational and storage resources to exploit the scientific potential of the LHC machine. • SC2 • Disk-to-disk transfer from CERN to BNL • SC3 throughput phase • Disk-to-disk transfer from CERN to BNL • Disk-to-tape transfer from CERN to BNL • Disk-to-disk transfer from BNL to Tier-2 centers
SC2 at BNL • Testbed dCache • Four dCache pool servers with a 1-Gigabit WAN network connection. • Met the performance/throughput challenge (disk-to-disk transfer rate of 70–80 MB/s from CERN to BNL).
SC3 throughput phase • Steering: FTS; control: SRM; transfer protocol: GridFTP • The production dCache system was used, with a network upgrade to 10 Gbps between the USATLAS storage system and the BNL BGP router • Disk-to-disk transfer from CERN to BNL • Achieved rate of 100–120 MB/s with a peak rate of 150 MB/s (sustained for one week) • Disk-to-tape transfer from CERN to BNL HPSS • Achieved rate: 60 MB/s (sustained for one week) • Disk-to-disk transfer testing from BNL to Tier-2 centers • Tier-2 centers: BU, UC, IU, UTA • Aggregate transfer rate of 30–40 MB/s
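As a back-of-the-envelope check, the sustained rates above translate into weekly volumes as follows (illustrative arithmetic only, using the midpoints of the quoted ranges):

```python
# Convert sustained MB/s into approximate TB per week (decimal TB).
SECONDS_PER_WEEK = 7 * 24 * 3600

for label, mb_per_s in [("Disk-to-disk CERN -> BNL", 110),    # midpoint of 100-120 MB/s
                        ("Disk-to-tape CERN -> BNL HPSS", 60),
                        ("BNL -> Tier-2 aggregate", 35)]:      # midpoint of 30-40 MB/s
    tb_per_week = mb_per_s * SECONDS_PER_WEEK / 1e6
    print(f"{label}: ~{tb_per_week:.0f} TB/week")
# -> roughly 67, 36 and 21 TB per week respectively
```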
Links • BNL dCache user guide website • http://www.atlasgrid.bnl.gov/dcache/manuals/ • USATLAS tier-1 & tier-2 dCache systems. • http://www.atlasgrid.bnl.gov/dcache_admin/ • USATLAS dCache workshop • http://agenda.cern.ch/fullAgenda.php?ida=a055146