Bulk Copying: Recursive file/dir copying between multiple sources and sinks (potentially a draft straw-man proposal for a ‘Bulk Copy Document’?)
david.meredith@stfc.ac.uk
Bulk Data Copy Description Generalizations (some DMI/JSDL overlap)
Overview
• Some overlap in data copy activity descriptions (JSDL and DMI)
• JSDL data staging and bulk copies
• DMI and bulk copies
• Some new draft proposals for DMI to address bulk data copying
• Reuse of proposed DMI-common element set
• Some other stuff to consider
Some Overlap in Data Copy Activity Descriptions (JSDL and DMI)
• There is some overlap between JSDL Data Staging and DMI.
• The Source/Target <jsdl:DataStaging/> element is roughly similar to the Source/Sink <dmi:DEPR/> element.
• Both capture the source/target URI and credentials.
• At present, neither JSDL Data Staging nor DMI fully captures our requirements (this is not a criticism; each is intended to address its existing use cases, which only partially overlap with the requirements for a bulk data copy activity).
Other
• Condor Stork: based on Condor ClassAds (see supplementary slides).
• Not sure whether Globus has or intends a similar definition in its new developments (e.g. SaaS). Anyone?
JSDL Data Staging and Bulk Copies
Using JSDL Data Staging elements to simulate a bulk data copy activity
Bulk copy: recursive file/dir copying between multiple sources and sinks
JSDL Data Staging and the HPC File Staging Profile for Bulk Data Copying
JSDL Staging 1
    <jsdl:DataStaging>
        <jsdl:FileName>fileA</jsdl:FileName>
        <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
        <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
        <jsdl:Source>
            <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
        </jsdl:Source>
        <jsdl:Target>
            <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
        </jsdl:Target>
        <Credentials> … </Credentials>
    </jsdl:DataStaging>
• Define both the source and target within the same <DataStaging/> element, which is permitted in JSDL.
• However, the HPC File Staging Profile (Wasson et al. 2008) limits each data staging element to a single credential definition, whereas the source and target may each need their own.
• Possibility: profile the use of Credentials within the Source/Target elements?
JSDL Staging 2
    <jsdl:DataStaging>
        <jsdl:FileName>fileA</jsdl:FileName>
        <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
        <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
        <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
        <jsdl:Source>
            <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
        </jsdl:Source>
        <Credentials> e.g. MyProxyToken </Credentials>
    </jsdl:DataStaging>
    <jsdl:DataStaging>
        <jsdl:FileName>fileA</jsdl:FileName>
        <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
        <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
        <jsdl:Target>
            <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
        </jsdl:Target>
        <Credentials> e.g. wsa:Username/password token </Credentials>
    </jsdl:DataStaging>
• A source element stages in fileA and a corresponding target element stages out the same file.
• The <DataStaging/> elements are linked via their common <FileName/> and <FilesystemName/>.
• By specifying that the input file is deleted after the job has executed, staging can be used to perform a data copy from one location to another via the staging host (intermediary).
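The linking rule above can be sketched in a few lines of Python. This is only an illustration, not part of JSDL or the HPC File Staging Profile: it pairs stage-in and stage-out elements that share the same <FileName/> and <FilesystemName/> to recover the implied source-to-target copies. The function name `staged_copies` and the enclosing <jsdl:JobDescription> wrapper around the fragment are assumptions made so the example parses on its own; credentials are omitted.

```python
import xml.etree.ElementTree as ET

# JSDL 1.0 namespace. The <jsdl:JobDescription> wrapper below is added only so
# that the two staging elements parse as a single document.
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
NS = {"jsdl": JSDL}

DOC = f"""
<jsdl:JobDescription xmlns:jsdl="{JSDL}">
  <jsdl:DataStaging>
    <jsdl:FileName>fileA</jsdl:FileName>
    <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
    <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
    <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
    <jsdl:Source>
      <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
    </jsdl:Source>
  </jsdl:DataStaging>
  <jsdl:DataStaging>
    <jsdl:FileName>fileA</jsdl:FileName>
    <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
    <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
    <jsdl:Target>
      <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
    </jsdl:Target>
  </jsdl:DataStaging>
</jsdl:JobDescription>
"""


def staged_copies(xml_text):
    """Pair stage-in and stage-out elements sharing FileName and FilesystemName."""
    root = ET.fromstring(xml_text)
    sources, targets = {}, {}
    for staging in root.findall("jsdl:DataStaging", NS):
        key = (
            staging.findtext("jsdl:FileName", default="", namespaces=NS).strip(),
            staging.findtext("jsdl:FilesystemName", default="", namespaces=NS).strip(),
        )
        src = staging.findtext("jsdl:Source/jsdl:URI", namespaces=NS)
        dst = staging.findtext("jsdl:Target/jsdl:URI", namespaces=NS)
        if src:
            sources[key] = src.strip()
        if dst:
            targets[key] = dst.strip()
    # A copy is implied wherever the same staged file has a source and a target.
    return [(sources[key], targets[key]) for key in sources if key in targets]


copies = staged_copies(DOC)
for source_uri, target_uri in copies:
    print(source_uri, "->", target_uri)
```

Note that the pairing has to be reconstructed by the consumer; nothing in the document states the source-to-sink relationship directly, which is part of the motivation for a dedicated bulk copy description.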
Using Staging to Enact Bulk Copies
• In the context of bulk copying, the file staging host (intermediary) is redundant: there is no need to explicitly name and aggregate (stage) files on a staging host (when copying between a source and sink, the staging host is a hidden implementation detail).
• JSDL has no equivalent of <dmi:DataLocations/> for defining alternative locations for a source and sink (a nice feature of DMI).
• JSDL is designed to describe a single activity which is atomic from the perspective of an external user (staging is part of this atomic activity). In bulk copying, we need to identify and report on the status of each copy operation.
• Some additional elements are required (e.g. <dmi:TransferRequirements/>, <other:FileSelector/>, abstract <URIConnectionProperties/> for connecting to different URI schemes, e.g. iRODS/SRB require ‘McatZone’ and ‘defaultResource’ properties). Are these new elements out of scope (i.e. should they remain proprietary)?
DMI and Bulk Copies
An overview of OGSA DMI and some current limitations for bulk copying
OGSA DMI Overview
• The OGSA Data Movement Interface (DMI) (Antonioletti et al. 2008) defines a number of elements for describing and interacting with a data transfer activity.
• The data source and destination are each described separately with a Data End Point Reference (DEPR), which is a specialized form of WS-Addressing endpoint reference (Box et al. 2004).
• In contrast to the JSDL data staging model, a DEPR allows one or more <Data/> elements to be defined within a <DataLocations/> element. This is used to define alternative locations for the data source and/or sink.
• An implementation can select between its supported protocols and select/retry different source/sink combinations, which improves resilience and the likelihood of performing a successful copy.
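The select/retry behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how any DMI implementation works: the function names `candidate_pairs` and `copy_with_fallback`, the `transfer` callback, and the use of a single `max_attempts` budget across all combinations are all hypothetical.

```python
from itertools import product
from urllib.parse import urlsplit


def candidate_pairs(source_urls, sink_urls, supported_schemes):
    """Yield (source, sink) combinations whose URI schemes are both supported."""
    sources = [u for u in source_urls if urlsplit(u).scheme in supported_schemes]
    sinks = [u for u in sink_urls if urlsplit(u).scheme in supported_schemes]
    yield from product(sources, sinks)


def copy_with_fallback(source_urls, sink_urls, supported_schemes, transfer,
                       max_attempts=3):
    """Try usable combinations in order until one transfer succeeds.

    `transfer` is a caller-supplied function (hypothetical) that returns True
    on success. Returns the successful (source, sink) pair, or None.
    """
    pairs = candidate_pairs(source_urls, sink_urls, supported_schemes)
    for attempt, (src, dst) in enumerate(pairs, start=1):
        if attempt > max_attempts:
            break
        if transfer(src, dst):
            return src, dst
    return None


# Alternative locations for the same data, as in a <DataLocations/> element.
sources = [
    "gsiftp://example.org/name/of/the/dir/",
    "srm://example.org/name/of/the/dir/",
]
sinks = ["ftp://ngs.oerc.ox.ac.uk:2811/myhome/"]


def fake_transfer(src, dst):
    """Stand-in for a real transfer: pretend the GridFTP source is down."""
    return not src.startswith("gsiftp://")


chosen = copy_with_fallback(sources, sinks, {"gsiftp", "srm", "ftp"}, fake_transfer)
print(chosen)
```

Here the first combination fails and the copy falls back to the SRM source, which is the resilience benefit the alternative <Data/> locations are intended to provide.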
DMI DEPR and Transfer Requirements
Source or Sink (wsa:EndpointReference type)
    <dmi:SourceOrSinkDataEPR>
        <wsa:Address>http://www.ogf.org/ogsa/2007/08/addressing/none</wsa:Address>
        <wsa:Metadata>
            <dmi:DataLocations>
                <dmi:Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
                          DataUrl="gsiftp://example.org/name/of/the/dir/">
                    <dmi:Credentials><other:MyProxyToken/></dmi:Credentials>
                    <other:stuff/>
                </dmi:Data>
                <dmi:Data ProtocolUri="urn:my-project:srm"
                          DataUrl="srm://example.org/name/of/the/dir/">
                    <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
                    <other:stuff/>
                </dmi:Data>
            </dmi:DataLocations>
        </wsa:Metadata>
    </dmi:SourceOrSinkDataEPR>
The DEPR defines alternative locations for the data source/sink, and each <Data/> nests its own credentials.
Transfer Requirements (needs some extending)
    <dmi:TransferRequirements>
        <dmi:StartNotBefore/> ?
        <dmi:EndNoLaterThan/> ?
        <dmi:StayAliveTime/> ?
        <dmi:MaxAttempts/> ?
    </dmi:TransferRequirements>
DMI Data Transfer Factory Interface (representation)
    [supported protocols] + [service instance] GetDataTransferInstance([SourceDEPR], [SinkDEPR], [TransferRequirements]);
    [factory attributes] GetFactoryAttributesDocument();
Current DMI Limitations for Bulk Copying (for multiple sources and sinks)
• DMI is intended to describe only a single data copy operation between one source and one sink (this is not a criticism; it is by design, for managing low-level transfers of single data units). To perform several transfers, a client must invoke the DMI service factory multiple times, creating multiple DMI service instances.
• We require a single message packet that wraps multiple transfers into a single ‘atomic’ activity rather than having to repeatedly invoke the DMI service factory (broadly similar to defining multiple JSDL data staging elements).
• Some of the existing functional spec elements require extension or slight modification (in particular, the addition of <xsd:any/> and <xsd:anyAttribute/> extension points to embed proprietary information in suitable locations).
Some New Draft Proposals for DMI to Address Bulk Data Copying
Note: the draft proposals presented here for bulk data copying are only intended for review/discussion/sanity-check/agreement (or not).
Draft Proposal 1: New <BulkDataCopy/> and <DataCopy/> Elements
• Add new elements to describe a bulk copy activity, effectively wrapping multiple source-sink pairs within a single (standalone) document, e.g. <BulkDataCopy/> with nested <DataCopy/>.
    <!-- Draft: TO REVISE/DISCUSS/SANITY-CHECK -->
    <BulkDataCopy id="xsd:ID"?>
        <DataCopy id="xsd:ID"?> + <!-- one-to-many -->
            <SourceDEPR/>
            <SinkDEPR/>
            <DataCopyTransferRequirements/> ? <!-- needed? -->
            <xsd:any##other/> *
        </DataCopy>
        <TransferRequirements/> ?
        <xsd:any##other/> *
    </BulkDataCopy>
• The outer <TransferRequirements/> applies to the whole bulk copy (wrapping elements that span all the sub-copies, e.g. <dmi:MaxAttempts/>, <dmi:StartNotBefore/> and other batch-window properties).
• Define an optional <DataCopyTransferRequirements/> for each <DataCopy/> in order to specify an additional and overriding subset of requirements (e.g. for defining <FileSelector/> elements).
Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc.
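The additional-and-overriding relationship between the outer <TransferRequirements/> and each <DataCopyTransferRequirements/> could be resolved as in the sketch below. It is only one possible reading of the draft: the element values shown (requirements carried as element text, the `*.dat` selector, the date) and the helper names `as_dict` and `effective_requirements` are assumptions for illustration, and the DEPRs are left empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the draft <BulkDataCopy/> document. Element names
# follow the draft; the values and their text encoding are assumed.
DOC = """
<BulkDataCopy id="bulk1">
  <DataCopy id="subcopy1">
    <SourceDEPR/>
    <SinkDEPR/>
    <DataCopyTransferRequirements>
      <MaxAttempts>5</MaxAttempts>
      <FileSelector>*.dat</FileSelector>
    </DataCopyTransferRequirements>
  </DataCopy>
  <DataCopy id="subcopy2">
    <SourceDEPR/>
    <SinkDEPR/>
  </DataCopy>
  <TransferRequirements>
    <MaxAttempts>2</MaxAttempts>
    <StartNotBefore>2010-01-01T00:00:00Z</StartNotBefore>
  </TransferRequirements>
</BulkDataCopy>
"""


def as_dict(element):
    """Flatten an element's children into {tag: text}; {} if element is absent."""
    if element is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in element}


def effective_requirements(xml_text):
    """Return the requirements that apply to each <DataCopy/>, keyed by its id."""
    root = ET.fromstring(xml_text)
    shared = as_dict(root.find("TransferRequirements"))
    resolved = {}
    for data_copy in root.findall("DataCopy"):
        overrides = as_dict(data_copy.find("DataCopyTransferRequirements"))
        # Per-copy values add to, and take precedence over, the bulk-wide ones.
        resolved[data_copy.get("id")] = {**shared, **overrides}
    return resolved


requirements = effective_requirements(DOC)
for copy_id, reqs in requirements.items():
    print(copy_id, reqs)
```

In this reading, subcopy1 overrides MaxAttempts and adds a FileSelector, while subcopy2 simply inherits the bulk-wide requirements.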
Draft Proposal 2: Introduce a New DMI Port Type
• Add a new DMI port type to accept a <BulkDataCopy/> document (the current port type defines separate [SourceDEPR], [SinkDEPR] and [TransferRequirements] arguments).
• This gives a choice of two port types.
• Make some minor changes to the existing functional spec (mostly adding xsd:any extension points and other small changes).
Possible DMI Data Transfer Factory Interface Extension (draft representation)
    [supported protocols] + [service instance] GetDataTransferInstance([BulkDataCopy]);
    [factory attributes] GetFactoryAttributesDocument();
• As per the existing functional spec, completely separate the activity description (BulkDataCopy) from the service interface rendering in order to define a generic and reusable element set.
Draft Proposal 3: Extend <State/> and <InstanceAttributes/> and Describe Usage for Bulk Copying
• Since a bulk copy consists of multiple transfers, we need to provide an optional way to report the status of each sub-copy.
• The (sub) state of each <DataCopy/> could be optionally nested within the <dmi:Detail/> element as part of the parent <dmi:State/> (i.e. in place of the existing <xsd:any/> extension point). In order to specify each sub-copy identifier, <dmi:State/> could be extended by adding an <xsd:anyAttribute/>:
    <!-- Draft: TO REVISE/DISCUSS/SANITY-CHECK -->
    <dmi:State value="Transferring">
        <dmi:Detail>
            <dmi:State dataCopyId="subcopy1" value="Done"/>
            <dmi:State dataCopyId="subcopy3" value="Failed:Unclean"/>
            <dmi:State dataCopyId="subcopy2" value="Transferring"/>
            . . .
        </dmi:Detail>
    </dmi:State>
• Similarly, child <dmi:InstanceAttributes/> could be optionally nested within a parent <dmi:InstanceAttributes/> to represent each sub-copy. But is this actually necessary? (Probably not, since the <dmi:TotalDataSize/> could be calculated across all the sub-copies.)
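A client consuming the nested state document might summarise the sub-copies as sketched below. This is illustrative only: the namespace URI is a placeholder rather than the real DMI namespace, `sub_copy_states` is a hypothetical helper, and the draft does not say how the parent state should be derived from the sub-states, so the sketch only tallies them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Placeholder namespace for illustration only; not the real DMI namespace.
DMI = "urn:example:dmi"
NS = {"dmi": DMI}

DOC = f"""
<dmi:State xmlns:dmi="{DMI}" value="Transferring">
  <dmi:Detail>
    <dmi:State dataCopyId="subcopy1" value="Done"/>
    <dmi:State dataCopyId="subcopy3" value="Failed:Unclean"/>
    <dmi:State dataCopyId="subcopy2" value="Transferring"/>
  </dmi:Detail>
</dmi:State>
"""


def sub_copy_states(xml_text):
    """Map each sub-copy identifier to its reported state value."""
    root = ET.fromstring(xml_text)
    return {
        state.get("dataCopyId"): state.get("value")
        for state in root.findall("dmi:Detail/dmi:State", NS)
    }


states = sub_copy_states(DOC)
summary = Counter(states.values())
failed = sorted(cid for cid, value in states.items() if value.startswith("Failed"))

print(dict(summary))
print("failed sub-copies:", failed)
```

A client that only understands the top-level state can ignore <dmi:Detail/> entirely, which keeps the extension optional.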
Draft Proposal 4: Other Proposed Modifications (possibly some more not listed here)
• Add <xsd:any/> and <xsd:anyAttribute/> extension points to the existing DMI elements, e.g. in the dmi:DataType and dmi:DataLocationsType complex types, anyAttribute in dmi:State, etc.
    <complexType name="DataType">
        <annotation> . . . </annotation>
        <sequence>
            <element name="Credentials" type="dmi:CredentialsType" minOccurs="0" />
            <xsd:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
        <attribute name="ProtocolUri" type="anyURI" use="required" />
        <attribute name="DataUrl" type="anyURI" use="required" />
        <xsd:anyAttribute namespace="##other" processContents="lax"/>
    </complexType>
    <complexType name="DataLocationType">
        <annotation> . . . </annotation>
        <sequence>
            <element name="Data" type="dmi:DataType" maxOccurs="unbounded" />
            <xsd:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
        <xsd:anyAttribute namespace="##other" processContents="lax"/>
    </complexType>
• Move elements referred to in the text of the functional spec into the functional spec schema, such as <FactoryAttributes/> and the fault types (currently defined in the plain WS Rendering schema).
• Some additional elements are required (e.g. <dmi:TransferRequirements/>, <other:FileSelector/>, abstract <URIConnectionProperties/> for connecting to different URI schemes, e.g. iRODS/SRB require ‘McatZone’ and ‘defaultResource’ properties). Are these new elements out of scope (i.e. should they remain proprietary)?
Reuse of Proposed DMI-Common Element Set
As per the existing DMI Functional Spec, the Bulk Copy activity description would be clearly separated from the service interface rendering. This promotes a generic and reusable element set which can be adopted for use within other specs/profiles, e.g. a new bulk copy application definition for the <jsdl:Application/> element.
Draft usage in JSDL 1
    <jsdl:JobDefinition>
        <jsdl:JobDescription>
            <jsdl:JobIdentification ... />
            <jsdl:Application>
                <!-- Possibility? Embed new 'BulkDataCopy' doc as a new Application element
                     akin to POSIXApplication or HPCProfileApplication elems -->
                <other:BulkDataCopyApplication>
                    <dmi:BulkDataCopy> . . . </dmi:BulkDataCopy>
                </other:BulkDataCopyApplication>
            </jsdl:Application>
            <jsdl:Resources/>
        </jsdl:JobDescription>
    </jsdl:JobDefinition>
• JSDL is intended to be a generic compute activity description language (not solely for HPC).
• In this example, a bulk data copy activity document is described as a JSDL application.
• The proposed <BulkDataCopy/> document could be nested within the <jsdl:Application/> element. The <jsdl:Application/> element is a generic wrapper that is intended for this very purpose, e.g. akin to nesting <POSIXApplication/> or <HPCProfileApplication/>.
Draft usage in JSDL 2
    <jsdl:JobDefinition>
        <jsdl:JobDescription>
            <jsdl:JobIdentification ... />
            <jsdl:Application>
                <!-- Possibility? Stage BulkDataCopy doc and explicitly name the copy agent
                     that would enact the copy activity -->
                <jsdl-posix:POSIXApplication>
                    <jsdl-posix:Executable>/usr/bin/datacopyagent.sh</jsdl-posix:Executable>
                    <jsdl-posix:Argument>my_BulkDataCopyDoc.xml</jsdl-posix:Argument>
                </jsdl-posix:POSIXApplication>
            </jsdl:Application>
            <jsdl:Resources/>
            <jsdl:DataStaging>
                <jsdl:FileName>my_BulkDataCopyDoc.xml</jsdl:FileName>
                . . .
            </jsdl:DataStaging>
        </jsdl:JobDescription>
    </jsdl:JobDefinition>
• Stage in the <BulkDataCopy/> document as input for the executable.
• This is a less ‘contract-driven’ approach, but it represents a perfectly valid reuse of the proposed <BulkDataCopy/> document.
Draft DMI sub-state specialisations in BES
• Profile the OGSA BES state model to account for DMI sub-state specializations and DMI lifecycle events.
• Adds optional DMI sub-state specializations. A client/service may recognize only the main BES states if necessary.
• Adds optional DMI lifecycle events (dmi:Suspend, dmi:Resume).
• Add DMI fault types?
[Figure: state diagram. BES states: Pending, Running, Cancelled, Failed, Finished. DMI sub-states: Running:Transferring, Running:Suspended, Failed:Clean, Failed:Unclean, Failed:Unknown. BES and DMI lifecycle events (requests/operations): dmi:Suspend(), dmi:Resume(), bes:TerminateActivities().]
Some other stuff to consider
• JSDL-BES may be a better route for more widespread adoption of a bulk copy document? (e.g. consider existing BES implementations)
• Is orchestration of the proposed <DataCopy/> activities required (e.g. sequential ordering or even a DAG)? As yet, there are no compelling use cases.
• For the proposed bulk copy document, what about using element references rather than defining solely in-line XML, to cut down on element repetition (e.g. akin to the <jsdl:FileSystem/> element, which can be referenced through <jsdl:FilesystemName/> elements)? Abstract elements and substitution groups may also be useful here.
    <BulkDataCopy id="MyBulkTransferA">
        <CopyResources>
            <Credential id="cred1" .../>
            <Credential id="cred2" .../>
            <TransferRequirements id="tr1" .../>
            <TransferRequirements id="tr2" .../>
            <DataEPR id="data1" .../>
            <DataEPR id="data2" .../>
            <DataEPR id="data3" .../>
        </CopyResources>
        <DataCopy id="subTransferA">
            <SourceDEPR idref="data1"/>
            <SinkDEPR idref="data3"/>
            <TransferRequirementsRef idref="tr1"/>
        </DataCopy>
        <DataCopy id="subTransferB">
            <SourceDEPR idref="data2"/>
            <SinkDEPR idref="data3"/>
            <TransferRequirementsRef idref="tr2"/>
        </DataCopy>
    </BulkDataCopy>
• Using element ‘id’ and subsequent ‘idref’ attributes reduces XML repetition, but schema validation does not check that the referenced elements are of the correct type.
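Because schema validation would not catch an ‘idref’ that points at the wrong kind of element, a consumer of such a document would need its own check. The sketch below shows one way this could be done; the `EXPECTED` mapping from referencing element to referenced element type, the function name `check_references`, and the deliberately broken second <DataCopy/> are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed mapping: which element type each referencing element should point at.
EXPECTED = {
    "SourceDEPR": "DataEPR",
    "SinkDEPR": "DataEPR",
    "TransferRequirementsRef": "TransferRequirements",
}

# Hypothetical document; subTransferB contains two deliberate mistakes.
DOC = """
<BulkDataCopy id="MyBulkTransferA">
  <CopyResources>
    <Credential id="cred1"/>
    <TransferRequirements id="tr1"/>
    <DataEPR id="data1"/>
    <DataEPR id="data3"/>
  </CopyResources>
  <DataCopy id="subTransferA">
    <SourceDEPR idref="data1"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr1"/>
  </DataCopy>
  <DataCopy id="subTransferB">
    <SourceDEPR idref="cred1"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr9"/>
  </DataCopy>
</BulkDataCopy>
"""


def check_references(xml_text):
    """Return a list of problems found when resolving idref attributes."""
    root = ET.fromstring(xml_text)
    # Index every identified element by its id, remembering its element type.
    by_id = {el.get("id"): el.tag for el in root.iter() if el.get("id") is not None}
    problems = []
    for ref in root.iter():
        target = ref.get("idref")
        if target is None:
            continue
        expected = EXPECTED.get(ref.tag)
        actual = by_id.get(target)
        if actual is None:
            problems.append(f"{ref.tag} -> {target}: no such id")
        elif expected is not None and actual != expected:
            problems.append(f"{ref.tag} -> {target}: expected {expected}, found {actual}")
    return problems


problems = check_references(DOC)
for problem in problems:
    print(problem)
```

The first mistake (a source pointing at a credential) is the case that plain id/idref validation would let through; the second (a dangling reference) is included only for completeness.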
Supplementary slides
Other stuff / extra slides
Message Model Requirements
• Document Message
    • Bulk Data Copy Activity description
    • Capture all information required to connect to each source URI and sink URI and subsequently enact the data copy activity.
    • Transfer requirements, e.g. additional URI properties, file selectors (regular expressions), scheduling parameters to define a batch window, retry count, source/sink alternatives, checksums?, sequential ordering?, DAG?
    • Serialized user credential definitions for each source and sink.
• Control Messages
    • Interact with a state/lifecycle model (e.g. stop, resume, cancel)
• Event Messages
    • Standard fault types and status updates
• Information Model
    • To advertise the service capabilities / properties / supported protocols
In-Scope
• Job Submission Description Language (JSDL)
    • An activity description language for generic compute applications.
• OGSA Data Movement Interface (DMI)
    • Low-level schema for defining the transfer of bytes between a single source and sink.
• JSDL HPC File Staging Profile (HPCFS)
    • Designed to address file staging, not bulk copying.
• OGSA Basic Execution Service (BES)
    • Defines a basic framework for defining and interacting with generic compute activities: JSDL + extensible state and information models.
• Others that I am sure I have missed (e.g. ByteIO)
• None of these fully captures our requirements (not a criticism; they are designed to address their own use cases, which only partially overlap with the requirements for our bulk data copy activity).
Other
• Condor Stork: based on Condor ClassAds.
• Not sure whether Globus has or intends a similar definition in its new developments (e.g. SaaS). Anyone? I believe Ravi was originally supportive of a DMI for data transfers between multiple sources/sinks.
Stork: Condor ClassAds
Example of a Stork job request:
    [
        dest_url = "gsiftp://eric1.loni.org/scratch/user/";
        arguments = "-p 4 dbg -vb";
        src_url = "file:///home/user/test/";
        dap_type = "transfer";
        verify_checksum = true;
        verify_filesize = true;
        set_permission = "755";
        recursive_copy = true;
        network_check = true;
        checkpoint_transfer = true;
        output = "user.out";
        err = "user.err";
        log = "userjob.log";
    ]
• Purportedly the first batch scheduler for data placement and data movement in a heterogeneous environment. Developed with respect to Condor.
• Uses Condor’s ClassAd job description language and is designed to understand the semantics and characteristics of data placement tasks.
• Recent NSF funding to develop it as a production service.