Introduction to iRODS Jean-Yves Nief
Talk overview
• Data management context.
• Some data management goals:
  • Storage virtualization.
  • Virtualization of the data management policy.
  • Examples of data management rules.
• Why choose iRODS?
• What is iRODS?
• iRODS usage examples.
• Prospects: scalability.
• Summary.
CC-IN2P3 activities
• Data centers like CC-IN2P3 (Lyon, France) work for international scientific collaborations.
• Examples:
  • High energy physics: CERN (France/Switzerland), SLAC (USA), Fermilab (USA), BNL (USA), etc.
  • Astroparticle physics / astrophysics: Auger (Argentina), HESS (Namibia), AMS (Int. Space Station).
Distributed environment: experimental sites, data centers, and collaborators spread around the world.
Data management context
• Multidisciplinary environment:
  • High energy and nuclear physics.
  • Astroparticle physics, astrophysics.
  • Biology, biomedical applications, arts and humanities.
Various constraints, various needs for data management.
Data management context
• Data stored on various sites.
• Heterogeneous storage:
  • Data formats: flat files, databases, data streams…
  • Storage media, server hardware: disks, tapes.
  • Data access protocols, information systems.
• Heterogeneous OSes on both the client and server side.
Need to federate all of this in a homogeneous way.
Data management context
• Not in scope here:
  • Intensive parallel I/O for data analysis.
• In scope here:
  • Data preservation (replication, consistency…).
  • Data access distributed over different sites.
  • Data life cycle (file format transformation, data workflows, interactions with various info systems).
• Need for virtualization of the storage (see the sketch after this list):
  • Logical view and organization of the data.
  • Data migration to new hardware/software transparent to the end-client tools: no view of the physical location of the data or of the underlying technologies.
  • Logical view of the data common to all users, independently of their location.
• Virtual organization (VO) of the users:
  • Unique ID for each user.
  • Organization by group and role (simple user, sysadmin, etc.).
  • Access rights to the data within the VO.
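To make the logical/physical split concrete, here is a minimal sketch in the iRODS rule language (the same dialect as the rule example later in this talk) that lists the physical replicas behind a single logical path; the zone, path, and rule names are hypothetical:

    # Hypothetical sketch: list the physical replicas behind one logical path.
    # Clients only ever see the logical name; the metacatalog maps it to
    # resource names and physical paths on the storage servers.
    listReplicas {
        msiSplitPath(*LogicalPath, *Coll, *File);
        msiMakeGenQuery("DATA_RESC_NAME, DATA_PATH",
                        "COLL_NAME = '*Coll' AND DATA_NAME = '*File'", *GenQInp);
        msiExecGenQuery(*GenQInp, *GenQOut);
        foreach(*GenQOut) {
            msiGetValByKey(*GenQOut, "DATA_RESC_NAME", *Resc);
            msiGetValByKey(*GenQOut, "DATA_PATH", *Phys);
            writeLine("stdout", "Replica on resource *Resc at physical path *Phys");
        }
    }
    INPUT *LogicalPath = "/myZone/home/alice/run42.dat"
    OUTPUT ruleExecOut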
Some data management goals
• Storage virtualization is not enough.
• For client applications relying on these middlewares:
  • No safeguard.
  • No guarantee that the data preservation policy is strictly applied.
• Real need for a data distribution project to define a coherent and homogeneous policy for:
  • data management.
  • storage resource management.
• Crucial for massive archival projects (digital libraries…).
• No grid/cloud tool had these features until 2006.
Virtualization of the data management policy
• Typical pitfalls:
  • Pre-established rules are not respected.
  • Several data management applications may coexist at the same moment.
  • Several versions of the same application can be used within a project at the same time.
  => potential inconsistency.
• Remove the various site-specific constraints from the client applications.
• Solution:
  • Data management policy virtualization.
  • Policy expressed in terms of rules.
Examples of data management rules
• Customized access rights to the system:
  • Disallow file removal from a particular directory, even by the owner.
• Security and integrity checks of the data:
  • Automatic checksums launched in the background.
  • On-the-fly anonymization of the files, even if it has not been done by the client.
• Metadata registration:
  • Automated registration of metadata associated with objects (inside or outside the iRODS database).
• Aggregation of small files before migration to the MSS.
• Customized transfer parameters:
  • Number of streams, stream size, TCP window as a function of the client or server IP.
• … up to your needs … (two of the above are sketched below).
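To give a feel for how such policies look, here are two hedged sketches in the iRODS rule language, covering the first and last bullets above. The collection path, resource name, and numeric values are hypothetical; acDataDeletePolicy and acSetNumThreads are the standard policy hooks for these actions, but their exact behavior should be checked against the deployed iRODS version:

    # Hypothetical policy: disallow removal of anything under /myZone/rawdata,
    # even by the owner of the file.
    acDataDeletePolicy {
        on($objPath like "/myZone/rawdata/*") {
            msiDeleteDisallowed;
        }
    }

    # Hypothetical policy: customized transfer parameters. Here the maximum
    # number of threads and the TCP window are tuned per storage resource
    # (illustrative values); conditions on the client or server address can
    # be expressed the same way with the corresponding session variables.
    acSetNumThreads {
        on($rescName == "lyonDiskResc") {
            msiSetNumThreads("default", "4", "4");
        }
    }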
Why did we choose iRODS?
• Provides a solution to the above requirements.
• SRB (iRODS' predecessor) had been used so far:
  • Data virtualization.
  • But no rule-based policy mechanisms.
• In 2007, no "grid" tool except iRODS could provide rule-based data management policies.
• Scalable.
• Can be customized to fit a wide variety of use cases.
What is iRODS?
• iRule Oriented Data Systems (DICE team: UNC, San Diego):
  • Started in 2006.
  • Open source.
  • CC-IN2P3: collaborator.
• In a "zone" (administrative domain):
  • One or several servers connected to a centralized metacatalog (RDBMS) holding file metadata, user information, data locations, etc. => logical view of the data in a given zone.
  • Data servers spread geographically within a zone.
  • Possibility to interconnect different zones (separate administrative domains).
• Data management policies expressed with rules in a "C-like" language:
  • Can be triggered automatically for various actions (put, get, list, rename…) (a minimal hook is sketched after this list).
  • Can be run manually.
  • Can be run in batch mode.
  • Rule versioning.
• Client interactions with iRODS:
  • APIs (C, Java, PHP, Python), shell commands, GUIs, web interfaces.
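As a minimal illustration of a rule triggered automatically, here is a hedged sketch of a post-put hook in that same language, which checksums every newly ingested file and logs the result (the log message is illustrative):

    # Hypothetical hook, fired by the server after every successful put:
    # compute and register in the catalog a checksum of the new object,
    # then record it in the server log.
    acPostProcForPut {
        msiDataObjChksum($objPath, "forceChksum=", *Chksum);
        writeLine("serverLog", "Checksum *Chksum registered for $objPath");
    }

The same kind of rule body, saved in a rule file, can also be run manually or scheduled in batch mode, as the next example shows.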
Logical namespace (example of a web interface)

Rule example (e.g.: are all the files in a given collection owned by a given user?)

    integrityFileOwner {
        # Input parameters are:
        #   Name of the collection that will be checked
        #   Owner name that will be verified
        # Output is:
        #   List of all files in the collection that have a different owner

        # Verify that the input path is a collection
        msiIsColl(*Coll, *Result, *Status);
        if(*Result == 0) {
            writeLine("stdout", "Input path *Coll is not a collection");
            fail;
        }
        *ContInxOld = 1;
        *Count = 0;
        # Loop over the files in the collection
        msiMakeGenQuery("DATA_ID, DATA_NAME, DATA_OWNER_NAME", "COLL_NAME = '*Coll'", *GenQInp);
        msiExecGenQuery(*GenQInp, *GenQOut);
        msiGetContInxFromGenQueryOut(*GenQOut, *ContInxNew);
        while(*ContInxOld > 0) {
            foreach(*GenQOut) {
                msiGetValByKey(*GenQOut, "DATA_OWNER_NAME", *Attrname);
                if(*Attrname != *Attr) {
                    msiGetValByKey(*GenQOut, "DATA_NAME", *File);
                    writeLine("stdout", "File *File has owner *Attrname");
                    *Count = *Count + 1;
                }
            }
            *ContInxOld = *ContInxNew;
            if(*ContInxOld > 0) { msiGetMoreRows(*GenQInp, *GenQOut, *ContInxNew); }
        }
        writeLine("stdout", "Number of files in *Coll with owner other than *Attr is *Count");
    }
    INPUT *Coll = "/tempZone/home/rods/sub1", *Attr = "rods"
    OUTPUT ruleExecOut
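Saved in a file (say integrityFileOwner.r; the name is illustrative), such a rule is run manually with the irule client command, e.g. irule -F integrityFileOwner.r; the INPUT line supplies default values for *Coll and *Attr, which can be overridden at invocation.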
iRODS usage @ CC-IN2P3
• Used for data distribution, archiving, and integration into analysis or data life cycle management workflows.
• Interfaced with:
  • Mass Storage Systems: HPSS.
  • External databases or information systems: RDBMS, Fedora Commons.
  • Web servers.
• High energy and nuclear physics:
  • BaBar: management of the entire data set between SLAC and CC-IN2P3: 2 PB foreseen in total.
  • dChooz: neutrino experiment (France, USA, Japan, etc.): 550 TB.
• Astroparticle physics and astrophysics:
  • AMS: cosmic-ray experiment on the International Space Station (600 TB).
  • TREND, BAOradio: radio astronomy (170 TB).
• Biology and biomedical applications (50 TB).
• Arts and humanities: Adonis (70 TB).
iRODS usage example
• Archival at Lyon of the entire BaBar data set (2 PB in total).
• Automatic tape-to-tape transfers: 3-4 TB/day (no intrinsic limitation).
• Automatic recovery of faulty transfers.
• Ability for a SLAC admin to recover files directly from the CC-IN2P3 zone if data is lost at SLAC.
Some rule examples (HEP)
• Mass Storage System integration:
  • Uses compound resources: iRODS disk cache + tapes.
  • Data on the disk cache is replicated into the MSS asynchronously (1 hour later) using a delayExec rule (sketched after this list).
  • Recovery mechanism: retries until success; the delay between retries is doubled at each round.
• ACL management:
  • Rules needed for fine-grained access rights management.
  • E.g., with 3 groups of users (admins, experts, users):
    • ACLs on /<zone-name>/*/rawdata => admins: r/w; experts + users: r.
    • ACLs on all other subcollections => admins + experts: r/w; users: r.
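A hedged sketch of that asynchronous MSS replication, in the same rule language as before; the resource name hpssResc, the one-hour delay, and the exact wording of the retry clause are illustrative and should be checked against the iRODS release in use:

    # Hypothetical hook: one hour after each put into the disk cache,
    # replicate the new object to the HPSS-backed tape resource; on failure,
    # retry with the delay doubled at each round until the copy succeeds.
    acPostProcForPut {
        delay("<PLUSET>1h</PLUSET><EF>1h DOUBLE UNTIL SUCCESS</EF>") {
            msiSysReplDataObj("hpssResc", "null");
        }
    }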
Prospects: scalability
• 5.5 PB managed by iRODS so far.
• 26 million files registered. No scalability issues foreseen.
• Pitfalls:
  • Metadata scalability? (billions of entries in the catalog?).
  • Control of the number of simultaneous connections to be enforced (as for Apache servers): needed in a wide-open environment.
Summary
• iRODS:
  • Lots of features for data management in a distributed environment.
  • Provides the flexibility and the freedom to interface or work directly with an unlimited range of systems.
  • Not the same goal as a parallel file system.
Summary
• Some iRODS user examples:
  • USA: DataNet, NASA, NOAO, iPlant.
  • Canada: Virtual Observatory.
  • France: National Library, Strasbourg Observatory, Grenoble Campus, CINES, etc.
  • Europe: EUDAT, PRACE.
  • Private sector: DDN, etc.
• Enterprise version ramping up (E-iRODS):
  • Wider sources of funding.
Learn more
• https://www.irods.org/index.php/Main_Page