Data Management The European DataGrid Project Team http://www.eu-datagrid.org
Problem Statement: How to connect User/Programs/Data? • User • logged in to a Grid “User Interface” (UI) machine, or • logged in to a “desktop” machine • Programs • On desktop • On UI • On Grid machines “god knows where” (GNW) • Data • May need to supply (Grid or non-Grid) data to GNW programs • A GNW program may generate data that needs to be put somewhere safe • How do you retrieve it from that safe place afterwards?
Common Grid Data Management Tasks • Dealing with Data Your Job Generates • Getting the data back to your desktop • Putting the data “on the Grid” • Getting Data to your Job • Submitting data along with your job • Putting your data onto the Grid (from outside) • Sending your Grid job to your Grid data • Moving Data on the Grid • How to find your data if you don’t remember where you put it • Example scripts and files: ~dgttutor/dm-tests/
Grid Data Management Tools • Data transfer mostly through gsiftp • Like good old FTP, except it uses Grid authentication/authorization • No passwords! • Can also use multiple streams for faster transfer • Resource Broker can send small amounts of data to/from jobs • Replica Catalog keeps track of where the various copies of “grid datasets” are located • edg-replica-manager uses gsiftp & the RC to manage instantiation, registration, and replication of grid datasets • Resource Broker can use the RC to find your data and send your job to it, if you tell the RB about the data you need
Grid Program -> Data on your desktop • You can set up your job for “data pickup” • Job generates data in the current working directory on the WN • At job end, the data files are placed in temporary storage on the RB • You get them back via “dg-job-get-output” • Key items: • You need to know the names of the files you want to get back • OutputSandbox = {"higgs.root", "graviton.HDF"}; • Not intended for large files (more than ~100 MB) – storage is limited on the Resource Broker machine • Example: output-sandbox.{jdl,sh} (a sketch follows below)
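A minimal sketch of what an output-sandbox.jdl-style file might contain; the executable and output file names here are invented for illustration, only OutputSandbox and dg-job-get-output come from the slide above:

Executable    = "run-analysis.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"run-analysis.sh"};
OutputSandbox = {"std.out", "std.err", "higgs.root"};

After the job finishes, dg-job-get-output <jobId> copies the files listed in OutputSandbox from the RB's temporary storage back to a local directory on your machine.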
Putting the data “on the Grid” • Here we talk about a running Grid program, the output of which you want on the Grid. Two cases: • You let the program write output on the WN, and after the program finishes you have the job script move the data to Grid storage • You arrange for the program to write directly to Grid storage • In both cases, data is not really “on the Grid” until it is registered in the “replica catalog”
Grid-generated data to Grid storage I • Your program writes its data to some local file • You have to know (or be able to figure out) what the local file name is • Use the edg-replica-manager commands to • Put the data onto Grid storage • Register the data as a Grid dataset • A few extras are needed • Some idea of where to put the data • A “logical file name” – a location-independent grid file name
GGDGS (I) Cont’d • How to find out where to put data? You need to know which Storage Elements are out there: • ldapsearch -h lxshare0225.cern.ch -p 2170 -x -b "Mds-vo-name=local,o=grid" '(objectclass=StorageElement)' seid • The command which moves your data to the desired location and registers it in the replica catalog is edg-replica-manager-copyAndRegisterFile: • edg-replica-manager-copyAndRegisterFile -s $(hostname)/$(pwd)/$DFILE -l $LFN -d $DEST_SE • See the cr-mov-reg.{sh,jdl} examples (a sketch follows below)
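A rough sketch of what a cr-mov-reg.sh-style job script might do; the data file name, LFN, destination SE and rc.conf path are illustrative placeholders, and my-analysis stands for whatever program actually produces the data:

#!/bin/sh
# Produce some output in the job's working directory on the WN
DFILE=whomp.145
./my-analysis > $DFILE                       # hypothetical data-producing program

# Pick a destination SE (found beforehand with the ldapsearch above) and an LFN
DEST_SE=gppse05.gridpp.rl.ac.uk
LFN=$DFILE                                   # LFN has to match the base file name

# Copy the file to the SE and register it in the replica catalog
edg-replica-manager-copyAndRegisterFile -c /opt/edg/etc/tutor/rc.conf \
  -s $(hostname)/$(pwd)/$DFILE -l $LFN -d $DEST_SE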
Grid-generated data to Grid storage II • Your program generates data directly on a “close SE” • Close means you can use normal file I/O to write to it • You have to use a brokerinfo command to find out what the close SE is (you don’t know in advance where your job will go!) and what the storage directory is • You write the data • Use the edg-replica-manager commands to • Register the data as a Grid dataset • One extra is needed • A “logical file name” – a location-independent grid file name
GGDGS II (cont’d) • Restriction: the “local file name” has to be the same as the logical file name (at least the “base” name) • File on disk: /data/spool/123fred7; LFNs: • 123fred7 is OK • 123fred is not OK • fred7 is not OK • Skippy is not OK • spool/123fred7 is OK • The logical file name must not already be in the catalogue • You also probably want to check that the file doesn’t already exist on disk before you start to write it • Example files: cr-on-se-and-reg.{jdl,sh} (a sketch follows below) • Check whether it was successful: edg-replica-manager-listReplicas -c /opt/edg/etc/tutor/rc.conf -l whomp.119
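A rough sketch of what a cr-on-se-and-reg.sh-style script could look like. The edg-brokerinfo subcommands getCloseSEs and getSEMountPoint, the registration command edg-replica-manager-registerEntry and its options, the VO directory and the LFN are all assumptions made for illustration; the actual example files are the authoritative version:

#!/bin/sh
# Ask the brokerinfo file which SE is "close" to this WN and where its storage area is
CLOSE_SE=$(edg-brokerinfo getCloseSEs | head -1)        # assumed subcommand
SE_DIR=$(edg-brokerinfo getSEMountPoint $CLOSE_SE)      # assumed subcommand

LFN=whomp.119                     # logical file name, must match the base file name
PFN=$SE_DIR/tutor/$LFN            # "tutor" stands in for your VO's storage directory

# Don't overwrite an existing file, then write directly to the close SE with normal file I/O
[ -e "$PFN" ] && { echo "$PFN already exists" >&2; exit 1; }
./my-analysis > "$PFN"            # hypothetical data-producing program

# Register the already-stored file in the replica catalog (no copy needed)
edg-replica-manager-registerEntry -c /opt/edg/etc/tutor/rc.conf \
  -l $LFN -s $CLOSE_SE$PFN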
Submitting Data Along With Your Job • This is fairly easy: use the Input Sandbox • Careful – not a sandbox in the JavaScript sense • InputSandbox = {"input-ntuple.root"}; • Example files: inp-sbox.{jdl,sh} (a sketch follows below)
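A minimal sketch of an inp-sbox.jdl-style file; the executable name is invented for illustration, and the job script itself also has to appear in the InputSandbox so that it is shipped with the job:

Executable    = "process-ntuple.sh";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = {"process-ntuple.sh", "input-ntuple.root"};
OutputSandbox = {"std.out", "std.err"};

The files listed in InputSandbox are copied from the submitting machine together with the job and appear in its working directory on the WN; like the OutputSandbox, this is only meant for small amounts of data.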
Moving Data Onto Grid from Outside • This is almost identical to GGDGS I • Use edg-replica-manager-copyAndRegisterFile • Need to specify the rc.conf file (either with the RC_CONFIG_FILE variable or with the -c option) … defaults in /opt/edg/etc/<vo>/rc.conf • Remember restrictions: • LFN and remote file name have to match • source and destination files must include hostnames • edg-replica-manager-copyAndRegisterFile -c rc.conf -l whomp.145 -s $(hostname)/$(pwd)/gls -d gppse05.gridpp.rl.ac.uk
Having Grid Send Job to Your Data • Need to have data “on the Grid” == listed in the RC • Tell your job (JDL) about the grid data: • InputData = "LF:myfile.dat"; • Resource Broker puts info about data matching in a “brokerinfo” file on the remote execution node • In your job execution script, use the “edg-brokerinfo” command (getSelectedFile) to find the location of the job-local copy • Example files: find-data.{jdl,sh} (a sketch follows below)
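A rough sketch of a find-data.{jdl,sh}-style pair. The extra JDL attribute DataAccessProtocol and the exact argument form of the getSelectedFile subcommand are assumptions for illustration; the real find-data examples are the authoritative version:

find-data.jdl (sketch):
Executable    = "find-data.sh";
InputSandbox  = {"find-data.sh"};
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
InputData     = "LF:myfile.dat";
DataAccessProtocol = "gridftp";

find-data.sh (sketch):
#!/bin/sh
# Ask the brokerinfo file where the replica chosen by the RB actually lives
SFN=$(edg-brokerinfo getSelectedFile LF:myfile.dat gridftp)   # assumed argument form
echo "Selected replica: $SFN"
# the returned name can then be fetched or opened directly, depending on the protocol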
Moving Data Around • edg-replica-manager-replicateFile -c rc.conf -l <lfn> -d <dest-SE-name> -s <source-SE-name> • Try the previous test (with dg-job-list-match) – it should now find a new site willing to accept your job
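For example, to make a second copy of the whomp.119 dataset from the earlier slides on another SE and then check the result (the SE names below are just illustrative; any registered SEs will do):

edg-replica-manager-replicateFile -c /opt/edg/etc/tutor/rc.conf -l whomp.119 \
  -s gppse05.gridpp.rl.ac.uk -d lxshare0222.cern.ch
edg-replica-manager-listReplicas -c /opt/edg/etc/tutor/rc.conf -l whomp.119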
Finding Your Data • ldapsearch -LLL -h grid-vo.nikhef.nl -p 10389 -x -b "rc=EDGtutorialReplicaCatalog,dc=eu-datagrid,dc=org" '(filename=jtdmtest1)' dn • Shows the “dn”s wherever the selected “filename” exists
GDMP • Tool for replication of large sets of files between sites • Can do a lot with it • Easy to get commands wrong • Can’t recover from certain errors • Possible to wreck the GDMP subsystem badly enough that remote sysadmins will have to make manual fixes • Recommend not to use unless you really need it! • Ex: you don’t normally use the “dd” command to copy files!
Gotchas • edg-replica-manager commands • Error messages are not always on target • Be careful not to use commands in ways other than intended – error trapping is not good, and sometimes the command will do something, but not necessarily what you want • Build error checking & trapping into your job scripts • Remember the restrictions on LFN/PFN correspondence • Replica catalog • Leaving out pieces of a command generally neither works nor produces helpful messages – type carefully!
EDG Replica Catalog • Based upon the Globus LDAP Replica Catalog • Stores LFN/PFN mappings and additional information (e.g. file size): • Physical File Name (PFN): host + full path and file name • Logical File Name (LFN): logical name that may be resolved to PFNs • LFN : PFN = 1 : n • Only files on Storage Elements may be registered • Each VO has a specific storage dir on an SE • Example PFN: lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat (host: lxshare0222.cern.ch, storage dir: /flatfiles/SE1/iteam) • LFN must be the path of the file starting from the storage dir; LFN of the above PFN: file1.dat
globus-url-copy • Low-level tool for secure copying: globus-url-copy <protocol>://<source file> <protocol>://<destination file> • Main protocols: • gsiftp – for secure transfer, only available on SE and CE • file – for accessing files stored on the local file system, e.g. on UI or WN • globus-url-copy file://`pwd`/file1.dat gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat
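The same tool works in the opposite direction, e.g. to pull that file from the SE back to the local file system of the machine you are working on (same host and path as in the example above):

globus-url-copy gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat \
  file://`pwd`/file1.dat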
The Replica Manager APIs • (un)registerEntry(LogicalFileName lfn, FileName source) • Replica Catalogue operations only - no file transfer • copyFile(FileName source, FileName destination, String protocol) • allows for third-party transfer • transfer between: • two Storage Elements or • a Computing Element and a Storage Element • Space management policies under development • all tools support parallel streams for file transfers
The Replica Manager APIs (cont’d) • copyAndRegisterFile(LogicalFileName lfn, FileName source, FileName destination, String protocol) • third-party transfer, but: files can only be registered in the Replica Catalogue if the destination PFN contains a valid SE (i.e. one that is registered in the RC)! • replicateFile(LogicalFileName lfn, FileName source, FileName destination, String protocol) • deleteFile(LogicalFileName lfn, FileName source)
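Each of these API calls has a corresponding command-line tool following the edg-replica-manager-<apiName> naming pattern used on the earlier slides (copyAndRegisterFile, replicateFile, listReplicas). Assuming the same pattern and options also hold for deleteFile (which is not guaranteed, so check the man page), removing the replica created on the previous slides might look like:

edg-replica-manager-deleteFile -c /opt/edg/etc/tutor/rc.conf -l whomp.119 \
  -s lxshare0222.cern.ch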