Magda Distributed Data Manager Prototype
Torre Wenaus, BNL
September 2001
ATLAS PPDG Program
• Principal ATLAS Particle Physics Data Grid deliverables:
  • Year 1: Production distributed data service deployed to users. Will exist between CERN, BNL, and at least four US grid testbed sites (ANL, LBNL, Boston U, Indiana, Michigan, Oklahoma, Arlington)
  • Year 2: Production distributed job management service
  • Year 3: Integration of all distributed services into a 'transparent' distributed processing capability within ATLAS software
• This work is focused on the principal PPDG year 1 deliverable
• It enables us to participate in and benefit from grid middleware development while delivering immediately useful capability to ATLAS
• It also looks beyond data storage to the larger issue of data management, which has received little attention in ATLAS to date
DBYA → Magda
• DBYA: rapid prototyping tool for distributed data cataloging, initiated 3/01 to jump-start data manager development
  • Stable operation cataloging ATLAS files since 5/01
  • Globus integration: gsiftp in use, replica catalog in progress
  • Deployed to ANL, LBNL in addition to BNL, CERN 8/01
• Good basis for developing the distributed data manager to fulfill the main ATLAS PPDG year 1 deliverable
• DBYA now branching into Magda (MAnager for Grid-based DAta), to be developed as this production data manager
  • DBYA itself reverts to a rapid prototyping tool (sandbox)
• Developers are currently T. Wenaus, W. Deng (BNL)
• See http://www.usatlas.bnl.gov/magda/info
Architecture & Schema
• Partly based on the NOVA distributed analysis project used in STAR
• MySQL database at the core of the system
• DB interaction via perl, C++, java, and cgi (perl) scripts
• C++ and Java APIs autogenerated off the MySQL DB schema
• User interaction via web interface and unix commands
• Principal components (a schema sketch follows below):
  • File catalog covering an arbitrary range of file types
  • Data repositories organized into sites and locations
  • Computers with repository access: a host can access a set of sites
  • Logical files can optionally be organized into collections
  • Replication and file access operations organized into tasks
• To serve environments from production (DCs) to personal (laptops)
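As a concrete illustration, here is a minimal perl/DBI sketch of how a catalog along these lines could be laid out. The table and column names (site, location, logicalfile, instance) and the connection parameters are assumptions for illustration only, not the actual Magda schema.

```perl
#!/usr/bin/env perl
# Illustrative sketch only: table and column names are assumptions,
# not the actual Magda schema.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=magda;host=localhost",
                       "magda", "secret", { RaiseError => 1 });

# Sites group one or more locations (paths within a data store); a host
# record would list which sites that machine can reach directly.
$dbh->do(q{CREATE TABLE IF NOT EXISTS site (
    siteID    INT AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(64),      -- e.g. 'BNL HPSS', 'CERN Castor'
    storetype VARCHAR(32)       -- hpss, castor, stage, disk, afs, ...
)});
$dbh->do(q{CREATE TABLE IF NOT EXISTS location (
    locationID INT AUTO_INCREMENT PRIMARY KEY,
    siteID     INT,
    path       VARCHAR(255)
)});

# Logical files and their physical instances; replica 0 marks the master
# instance, other instances carry the locationID as their replica number.
$dbh->do(q{CREATE TABLE IF NOT EXISTS logicalfile (
    lfID INT AUTO_INCREMENT PRIMARY KEY,
    lfn  VARCHAR(255),          -- logical name (filename, usually no path)
    vo   VARCHAR(32)            -- lfn + virtual organization is unique
)});
$dbh->do(q{CREATE TABLE IF NOT EXISTS instance (
    instID     INT AUTO_INCREMENT PRIMARY KEY,
    lfID       INT,
    locationID INT,
    replica    INT,
    usecount   INT DEFAULT 0,
    size       BIGINT
)});

$dbh->disconnect;
```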
Architecture Diagram
[Diagram: a Mass Store site and Disk sites, each holding one or more locations; 'spider' processes at each site register replicas and send catalog updates to the central MySQL catalog; a replication task operates on a collection of logical files to replicate, performing a source-to-cache stagein followed by a source-to-destination transfer (scp, gsiftp) between Host 1 and Host 2, which are synchronized via the database.]
Files and Collections
• Files
  • Logical name is the filename, without path, except for CVS-based files (code) and web files, for which the logical name includes the path within the repository
  • Logical name plus virtual organization defines a unique logical file in the system
  • File instances include a replica number
    • Zero for the master instance; N = locationID for other instances
  • The notion of a master instance is essential for cases where replication must be done from a specific (trusted or assured-current) instance
• Collections: several types supported (lookup sketch below)
  • Logical collections: arbitrary user-defined set of logical files
  • Location collections: all files at a given location
  • Key collections: files associated with a key or SQL query
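A minimal sketch, reusing the hypothetical tables above, of resolving the master instance (replica 0) of a logical file from its logical name and virtual organization; the file name in the example is invented:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=magda;host=localhost",
                       "magda", "secret", { RaiseError => 1 });

# The master copy carries replica number 0; lfn plus virtual organization
# uniquely identifies the logical file it belongs to.
sub master_instance {
    my ($lfn, $vo) = @_;
    my $sth = $dbh->prepare(q{
        SELECT s.name, l.path
          FROM logicalfile f
          JOIN instance i ON i.lfID = f.lfID AND i.replica = 0
          JOIN location l ON l.locationID = i.locationID
          JOIN site     s ON s.siteID = l.siteID
         WHERE f.lfn = ? AND f.vo = ?});
    $sth->execute($lfn, $vo);
    return $sth->fetchrow_hashref;   # undef if no master instance is cataloged
}

# Hypothetical file name, for illustration only.
if (my $m = master_instance("higgs.4mu.ntuple.root", "atlas")) {
    print "master at $m->{name}: $m->{path}\n";
}
```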
Distributed Catalog
• Catalog of ATLAS data at CERN, BNL
• Supported data stores: CERN Castor, CERN stage, BNL HPSS (rftp service), AFS and NFS disk, code repositories, web sites
• Current content: TDR data, test beam data, ntuples, code, ATLAS and US ATLAS web content, …
  • Some content (code, web) was added more to drive up the file count and file type diversity for scalability testing than because it is useful content at present
• About 300k files cataloged, representing >2 TB of data
  • Has also run without problems with ~1.5M files cataloged
• Running stably since May '01
• Coverage recently extended to ANL, LBNL
  • Other US ATLAS testbed sites should follow soon
• 'Spiders' at all sites crawl the data stores to populate and validate the catalogs (sketched below)
• 'MySQL accelerator' implemented to improve catalog loading performance between CERN and BNL by >2 orders of magnitude; 2k files cataloged in <1 sec
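For illustration, a minimal sketch of what a spider for a plain disk location might do, again assuming the hypothetical tables sketched earlier; the location ID, path, and virtual organization are invented, and the real spiders also crawl Castor, stage, and HPSS stores and validate existing entries:

```perl
use strict;
use warnings;
use DBI;
use File::Find;

my $dbh = DBI->connect("DBI:mysql:database=magda;host=localhost",
                       "magda", "secret", { RaiseError => 1 });

# Walk one disk location and register every file found there in the catalog.
sub crawl_location {
    my ($location_id, $path) = @_;
    my $sel = $dbh->prepare(
        "SELECT lfID FROM logicalfile WHERE lfn = ? AND vo = ?");
    my $new = $dbh->prepare(
        "INSERT INTO logicalfile (lfn, vo) VALUES (?, ?)");
    my $ins = $dbh->prepare(
        "INSERT INTO instance (lfID, locationID, replica, size)
         VALUES (?, ?, ?, ?)");

    find(sub {
        return unless -f $_;
        my ($lfn, $size) = ($_, -s _);    # logical name = filename, no path
        $sel->execute($lfn, "atlas");
        my ($lfid) = $sel->fetchrow_array;
        unless ($lfid) {                  # first sighting: new logical file
            $new->execute($lfn, "atlas");
            $lfid = $dbh->{mysql_insertid};
        }
        # a non-master instance uses the locationID as its replica number
        $ins->execute($lfid, $location_id, $location_id, $size);
    }, $path);
}

# Hypothetical location ID and path, for illustration only.
crawl_location(42, "/data/atlas/tdr");
```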
Data (distinct from file) Metadata
• Keys
  • Arbitrary user-defined attributes (strings) associated with a logical file
  • Used to tag physics channels, histogram files, etc. (see the sketch below)
• Logical file versions
  • Version string can be associated with a logical file to distinguish updated versions of a file
  • Currently in use only for source code (version is the CVS version number of the file)
• Data signature, object cataloging
  • Coming R&D; nothing yet
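A minimal sketch of how key tagging and a key-collection query could look, assuming a hypothetical filekey table alongside the tables above; the key name and file ID are invented:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=magda;host=localhost",
                       "magda", "secret", { RaiseError => 1 });

# Attach an arbitrary user-defined key (a string) to a logical file.
sub tag_file {
    my ($lfid, $key) = @_;
    $dbh->do("INSERT INTO filekey (lfID, keyname) VALUES (?, ?)",
             undef, $lfid, $key);
}

# A 'key collection' is simply every logical file carrying a given key,
# e.g. all histogram files for one physics channel.
sub key_collection {
    my ($key) = @_;
    return $dbh->selectcol_arrayref(
        "SELECT f.lfn FROM logicalfile f JOIN filekey k ON k.lfID = f.lfID
          WHERE k.keyname = ?", undef, $key);
}

tag_file(1001, "higgs-4lepton");
print "$_\n" for @{ key_collection("higgs-4lepton") };
```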
File Replication
• Supports multiple replication tools as needed and available
• Automated CERN-BNL replication incorporated 7/01
  • CERN stage → cache → scp → cache → BNL HPSS
  • stagein, transfer, archive scripts coordinated via database (see the sketch below)
  • Transfers user-defined collections keyed by (e.g.) physics channel
• Just extended to US ATLAS testbed using Globus gsiftp
  • Currently supported testbed sites are ANL, LBNL, Boston U
  • BNL HPSS → cache → gsiftp → testbed disk
  • BNL or testbed disk → gsiftp → testbed disk
• gsiftp not usable to CERN; no available grid node there (til '02?!)
• Will try GridFTP, probably also GDMP (flat files) when available
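A minimal sketch of the database-coordinated transfer step, with a hypothetical transfer table, task ID, and destination host; globus-url-copy stands in for the gsiftp transfer, and the real scripts also handle the stagein and archive stages:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=magda;host=localhost",
                       "magda", "secret", { RaiseError => 1 });

# Pull the next file of a replication task that has been staged into the
# source cache, copy it to the destination cache, and record the outcome so
# the stagein/transfer/archive scripts stay coordinated through the database.
sub transfer_one {
    my ($taskid, $dest_host, $dest_cache) = @_;
    my ($xferid, $src) = $dbh->selectrow_array(q{
        SELECT xferID, srcpath FROM transfer
         WHERE taskID = ? AND state = 'staged' LIMIT 1}, undef, $taskid);
    return 0 unless $xferid;

    # globus-url-copy as the gsiftp client; scp is the fallback for hosts
    # without Globus installed.
    my $rc = system("globus-url-copy", "file://$src",
                    "gsiftp://$dest_host$dest_cache/");
    my $state = ($rc == 0) ? 'transferred' : 'failed';
    $dbh->do("UPDATE transfer SET state = ? WHERE xferID = ?",
             undef, $state, $xferid);
    return $rc == 0;
}

# Hypothetical task ID and destination, for illustration only.
1 while transfer_one(7, "testbed.example.org", "/data/magda/cache");
```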
Data Access Services
• Command line tools usable in production jobs: under test (see the sketch below)
  • getfile
    • Retrieve a file via catalog lookup and (as necessary) staging or (still to come) remote replication
    • Local soft link to a cataloged file instance in a cache or location
    • Usage count maintained in the catalog to manage deletion
  • releasefile
    • Removes the local soft link, decrements the usage count in the catalog, and (optionally) deletes the instance if the usage count drops to zero
• Callable APIs for catalog usage and update: to come
• Collaboration with David Malon on Athena integration
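A minimal sketch of the getfile/releasefile bookkeeping, again against the hypothetical tables above; the real tools also drive staging and remote replication when no local instance exists, and the file name in the example is invented:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=magda;host=localhost",
                       "magda", "secret", { RaiseError => 1 });

# getfile: locate a cataloged instance, soft-link it into the working
# directory, and bump its usage count.
sub getfile {
    my ($lfn, $vo) = @_;
    my ($instid, $path) = $dbh->selectrow_array(q{
        SELECT i.instID, CONCAT(l.path, '/', f.lfn)
          FROM logicalfile f
          JOIN instance i ON i.lfID = f.lfID
          JOIN location l ON l.locationID = i.locationID
         WHERE f.lfn = ? AND f.vo = ?
         LIMIT 1}, undef, $lfn, $vo);
    die "no cataloged instance of $lfn\n" unless $instid;
    symlink($path, $lfn) or die "symlink failed: $!\n";
    $dbh->do("UPDATE instance SET usecount = usecount + 1 WHERE instID = ?",
             undef, $instid);
    return ($lfn, $instid);
}

# releasefile: remove the soft link and decrement the usage count; the real
# tool may also delete the cached copy once nothing references it.
sub releasefile {
    my ($lfn, $instid) = @_;
    unlink($lfn);
    $dbh->do("UPDATE instance SET usecount = usecount - 1 WHERE instID = ?",
             undef, $instid);
}

my ($link, $id) = getfile("higgs.4mu.ntuple.root", "atlas");
# ... the job reads through $link ...
releasefile($link, $id);
```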
Near Term Schedule
• Complete and deploy data access services (2 wks)
• Globus integration and feedback
  • Remote command execution (2 wks)
  • Test Globus replica catalog integration (1 mo)
• Incorporate into Saul Youssef's pacman package manager (1 mo)
• ATLAS framework (Athena) integration (with D. Malon) (2 mo)
• Look at GDMP for replication (requires flat file version) (2 mo)
• Application and testing in coming DCs, at least in US
  • Offered to intl ATLAS (Gilbert, Norman) as a DC data management tool
• Design and development of metadata and data signature (6 mo)
• Study scalability of single-DB version, and investigate scalability via multiple databases (9 mo)