Towards a loosely coupled and scalable component set for scheduling bulk data copying across different storage resources as fault tolerant batch jobs. http://code.google.com/p/dtsproject/. David Meredith 1 , Stephen Crouch 2 , Peter Turner 3 , Gerson Galang 4 , Ming Jiang 5 , Hung Nguyen 6
Towards a loosely coupled and scalable component set for scheduling bulk data copying across different storage resources as fault tolerant batch jobs.
http://code.google.com/p/dtsproject/
David Meredith1, Stephen Crouch2, Peter Turner3, Gerson Galang4, Ming Jiang5, Hung Nguyen6
1 NGS, Science and Technology Facilities Council, Daresbury Labs, UK, david.meredith@stfc.ac.uk
2 OMII-UK, School of Electronics and Comp Sci, University of Southampton, UK, s.crouch@omii.ac.uk
3 University of Sydney, Sydney, Australia, p.turner@chem.usyd.edu.au
4 Victorian eResearch Strategic Initiative (VeRSI), Victoria, Australia, gerson.galang@arcs.org.au
5 NGS, Science and Technology Facilities Council, Daresbury Labs, UK, ming.jiang@stfc.ac.uk
6 University of Sydney, Sydney, Australia, nguyen_h@chem.usyd.edu.au
Australia (DataMINX), United Kingdom
Overview / Aims • An open-source project developing a set of loosely coupled components for efficiently brokering data copies between a wide range of (potentially incompatible) storage resources as schedulable, fault-tolerant batch jobs (ftp, gridftp, srb, irods, sftp, file, webdav, srm?). • To scale from small embedded deployments to large distributed deployments through an expandable ‘worker-node pool’ controlled through message orientated middleware (MOM, JMS). • To maximize data access and transfer efficiency through the strategic placement and subscription of worker-nodes at or between particular data sources/sinks. • To be inherently asynchronous and side-step the bandwidth, concurrency and scalability concerns for clients in networks with limited capability relative to the direct connectivity between the source and sink. • Aims to address geographical-topological deployment concerns by allowing service hosting to be either centralized (as part of a shared service), or confined to a single institution or domain. • Adoption of established design patterns and open source components which are coupled with a proposal for an open standards based messaging protocol. • Employs a single port-type document-centric model, with service semantics defined solely by the message model.
DTS Features / Intentions 1
• Encourage a common messaging model
  • We are engaging with OGF in the definition of an open standard describing a bulk data copy activity with subsequent control and event messages. The aim is to provide a key foundation in addressing the challenges of data management.
  • Ideally standards based: OGF engagement (DMI, JSDL), and communications with Globus, Unicore and GridSAM developers (a longer term perspective).
• Platform independence
  • Includes the worker agent that manages a bulk data copy activity, the message broker, the message channel adapters that enable the different transports and protocols, and Commons VFS.
• Adopts well recognized Enterprise Integration Patterns
  • Described in Hohpe and Woolf (2003): Competing Consumers, Service Activator, Selective Consumer, Polling Consumer, Message Driven Consumer, Transport Channel Adapter, Header Based Router. http://www.enterpriseintegrationpatterns.com
DTS Features / Intentions 2
• Value in the correct framework choice: deploy out-of-the-box features for remoting, scaling and batching.
• Spring Batch: one of the few open source batch processing frameworks currently available (purportedly the only one?). It provides many functions that are essential in batch processing.
• Spring Integration: supports the Enterprise Integration Patterns identified by Hohpe and Woolf. Importantly, it provides a set of inbound and outbound message channel adapters for different integration options, both polling and message driven (e.g. JMS subscription, file/directory polling, RMI, WS, email).
• Message broker (e.g. Apache ActiveMQ or any JMS 1.1 compliant MOM broker).
Buffering data via an intermediary when copying between incompatible resources / protocols
• Client provides a single interface to different (potentially incompatible) storage resources, e.g. SRB, GsiFTP, FTP, SFTP, iRODS, file, WebDAV.
• Client brokers between storage resources when third-party transfer is not available.
• File operations: list, upload, download, delete, rename.
[Diagram: a client (e.g. Portal/Hermes) holds the authentication tokens (un/pw, x509?) and copies data between SRB/FTP and SFTP/GSIFTP resources using Get and Put, or a memory buffer (bit pipe).]
Client-Side Intermediary
Benefits
• Auth tokens are held only in memory on one computer.
• Self-contained and interactive.
• Extensible for new and emerging resources/protocols.
Challenges
• Software is required that is capable of enacting a data copying activity between a variety of sources and sinks (bit pipe via byte streams, or combined get/put).
• The client must be constantly available throughout the duration of the transfer.
• Buffering large quantities of data introduces bandwidth and concurrency concerns for clients residing on networks with limited capability (e.g. wireless connectivity) relative to the direct connectivity between the source and sink.
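The "bit pipe via byte streams" option above can be illustrated with a minimal Python sketch. It is not DTS code: the function name `bit_pipe`, the chunk size and the in-memory streams standing in for a remote source and sink are all assumptions made for illustration. The point it shows is that the intermediary only ever holds one fixed-size chunk in memory, however large the file is.

```python
import io


def bit_pipe(source, sink, chunk_size=64 * 1024):
    """Copy bytes from a readable stream to a writable stream.

    Only one chunk is buffered at a time, so memory use stays bounded.
    Returns the total number of bytes copied.
    """
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes object signals end of stream
            break
        sink.write(chunk)
        copied += len(chunk)
    return copied


# In-memory streams stand in for the remote source and sink (hypothetical).
source_stream = io.BytesIO(b"x" * 200_000)
sink_stream = io.BytesIO()
total_copied = bit_pipe(source_stream, sink_stream)
```

All bytes still pass through the client's network connection, which is the bandwidth concern listed under Challenges.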
DTS: Remotely Placed Worker Agents
• Aim: strategically place intermediary software agent(s) (e.g. at different institutions, within a network, at a local source/sink) and remotely invoke an appropriate agent using a message router, with a 'Bulk Data Copy Activity' executed as a fault tolerant batch process. Best practice: process data as close to where it resides as possible.
• 3 Core DTS Components:
  • Batch/Worker Agent. Software that will manage a bulk data copy activity. This is a batch operation: automated processing of large volumes of information that is most efficiently processed without user interaction (fire and forget).
  • Common Message format that describes a data copying activity with subsequent control and event messages:
    • Lists data sources and sinks.
    • Transfer requirements.
    • User credentials, so that the recipient worker can access the data on behalf of the user.
  • Message Broker/Router for routing of messages to appropriate workers and scaling via the Competing Consumers pattern.
DTS Architecture (Simplified) Broker between remote sources and sinks Clients Meta-data system or data catalogue (ICAT) that provides list of data URLs and credentials. OR lightweight file operations directly interacting with source/sink (list, delete, rename) Queue Channel Data copy activity message. Data copy: Get/Put or Bit pipe Authentication tokens (un/pw, myproxy details) DTS workers Source Sink
DTS Architecture (Simplified): broker between local source and remote sink (and vice versa)
• A Message Bus is a combination of a messaging infrastructure, a common data model and a command set that allows different systems to communicate through a shared set of interfaces (our message channels). http://www.enterpriseintegrationpatterns.com
[Diagram: clients submit to a Facility Queue; workers are placed at Facility / Department X (source/sink), Facility / Department Y (source/sink) and the Home Lab.]
Deployment Strategies Small– Local or embedded worker agent Med – Single worker pool Large – Multiple worker pools and message router
1) Lightweight local worker deployment. The worker agent is invoked by a script or is integrated into an existing application.
[Diagram: a client (Service Activator) invokes a worker node (WN) that copies between source and sink.]
2) Distributed deployment with a single worker pool.
[Diagram: a worker pool consumes from JobQ and ControlQ and replies on ReplyQ, with a DMQ; the workers copy between source and sink.]
Legend: s = Submit message (bulk copy activity document), c = Control message, e = Event message.
3) Distributed deployment with multiple worker pools.
[Diagram: JobQ and ControlQ routers forward messages (over HTTPS) to worker pool A (AJobQ, AControlQ) and worker pool B (BJobQ, BControlQ), with a ReplyQ and a DMQ. Legend as before: s = Submit, c = Control, e = Event.]
Core Component Message Router / Broker Schedule and route messages to strategically placed worker agents. Scale with multiple agents using competing consumer pattern.
Scaling • How can the architecture scale for increasing loads ? • Scale Out: Competing Consumer Pattern • To scale horizontally (or scale out) means to add more nodes to a system. • Scale Up: Multi-process Service Activator • To scale vertically (or scale up) means to add resources and/or processes to a single node in a system.
Scale Out: Competing Consumer Pattern
• The only requirement is that the JMS client and consumer must be able to access the broker.
• This provides location independence, which enables scaling and clustering of services since multiple workers can be configured to pull messages from the same queue.
• If the service becomes overburdened and falls behind in its processing, all that is needed is to start a few more worker instances listening to the queue.
• Consumers do not have to coordinate with each other, which improves resilience since workers can be added and removed without affecting each other.
• The basic architecture is repeatable: use multiple brokers and queues as required (e.g. broker clusters, master/slave brokers).
[Diagram: JMS client (Producer), Broker (Queue), Worker (Consumer); queue depth ok.]
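The Competing Consumers behaviour can be sketched without a broker. The following Python example is an illustration only: an in-process `queue.Queue` stands in for the JMS queue, threads stand in for remote worker agents, and the names `run_competing_consumers` and `copy-job-N` are hypothetical. It shows that workers pull from one shared queue without coordinating with each other and that each job is taken by exactly one worker.

```python
import queue
import threading


def run_competing_consumers(jobs, worker_count):
    """Let worker_count threads compete for jobs on one shared queue.

    Returns a list of (job, worker_name) pairs recording who processed what.
    """
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    processed = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = job_queue.get_nowait()  # each job is handed to one worker only
            except queue.Empty:
                return  # queue drained, worker stops
            with lock:
                processed.append((job, name))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


submitted = [f"copy-job-{n}" for n in range(20)]
results = run_competing_consumers(submitted, worker_count=4)
```

Raising `worker_count` is the in-process analogue of starting more worker instances against the same queue; no change to the producer is needed.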
Message Routing
How can the appropriate remote worker(s) be invoked?
• How to invoke worker(s) that reside at the data source and/or sink?
• How to invoke worker(s) installed at my institution or within a specific network?
• How to target a specific worker?
Options:
• Multiple Destinations
• Message Selectors
• Hybrid Approach
Message Routing: Multiple Destinations
Multiple static/administered queues can be configured on one broker in order to partition workers into different groupings.
Main Advantages:
• Queue depth is directly related to load. Load balancing can therefore be performed effectively, since each queue carries only the messages for its own worker group.
• New queues can be added to DTS for different groupings (e.g. project queues, separate queues for different facilities).
Main Disadvantages:
• Changes are required on the broker to cater for new worker groupings (configuration of new administered queues).
• This does not provide a high level of decoupling between message producer and consumer, since changes are required to the broker.
Worker groups: in DTS, multiple destinations are used to partition static queue consumer cluster groups, e.g. a Request Q per facility, beam-line, project, institution etc.
[Diagram: JMS clients send to a broker hosting Request Qa for Group A (Facility A), Request Qb for Group B (Project B) and Request Qc for Group C (Institution C).]
Message Routing: Message Selectors
• Workers can be 'Selective Consumers' and clients can be 'Specifying Producers'. A message selector is an expression based on SQL92 conditional syntax, e.g.
  Facility='FacilityX' AND BeamLine='ProteinMX' AND WorkerAccessKey='abcdefadsf_guuid'
• Filtering is performed by the broker: it delivers only those messages that match the selective consumer's criteria.
• Importantly, workers can therefore decide which messages to process depending on their own selector statements.
• The main benefit is that this approach is extensible: it provides a higher level of decoupling between message producer and receiver, since clients and workers can easily be added without change to the broker.
• Selectors are optional. This pattern can also be combined with the multiple destination approach to route messages as required (hybrid approach).
• Selectors can be used to perform fine-grained routing, e.g.
  • Route to the first available worker in a particular group that specifies a common/shared selector value, e.g. a common 'groupID' AND/OR 'networkID' AND/OR 'facilityGroup' AND/OR 'domain' AND/OR 'GB limit' etc. (SQL).
  • Route to a specific worker using a unique and opaque client identifier/access key, e.g. a GUUID. This is acceptable because the broker performs the filtering, so different workers do not see each other's selectors. The specifying producer would need to persist this value between server restarts and different sessions.
[Diagram: Specifying Producers place messages with selection values on the Request Q; Selective Consumers receive only the messages whose values match their selectors.]
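Selector-based delivery can be sketched in Python as follows. This is a deliberate simplification, not the JMS selector language: a selector is modelled here as a dictionary of header values that must all match (a conjunction of equality tests), whereas real selectors are SQL92-style expressions evaluated by the broker. The function names and the sample messages are hypothetical; the header names come from the example selector above.

```python
def matches(selector, headers):
    """Return True if every header named in the selector has the required value."""
    return all(headers.get(name) == value for name, value in selector.items())


def deliver(messages, selector):
    """Broker-side filtering: return only the messages this consumer selected."""
    return [message for message in messages if matches(selector, message["headers"])]


request_queue = [
    {"headers": {"Facility": "FacilityX", "BeamLine": "ProteinMX"}, "body": "copy A"},
    {"headers": {"Facility": "FacilityX", "BeamLine": "Other"}, "body": "copy B"},
    {"headers": {"Facility": "FacilityY", "BeamLine": "ProteinMX"}, "body": "copy C"},
]

# A worker that only wants jobs for one beam line at one facility.
worker_selector = {"Facility": "FacilityX", "BeamLine": "ProteinMX"}
selected = deliver(request_queue, worker_selector)

# A worker with no selector receives every message (selectors are optional).
unfiltered = deliver(request_queue, {})
```

In a real broker, messages that do not match one consumer's selector stay on the queue for other consumers; the sketch only shows the matching step.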
Message Routing: Hybrid Approach
The best approach is to combine message filtering with multiple destinations to suit your service instance requirements. The two approaches are not mutually exclusive and can be used together, provided both patterns are catered for in your system.
[Diagram: Request Qa and Request Qb, each with selective consumers.]
Request Response (Client Worker Conversation) ReplyTo header Application ID exchange with message filtering Temporary queues
Request Response (Conversation)
• The request message contains a Return Address that indicates where to send the reply. The Return Address is added to the message header.
• The consumer does not need to know in advance where to send the reply; it simply reads the address from the request.
Variations of this pattern, depending on client requirements:
• Further expand the message filtering approach to exchange client and worker Application IDs. The client can also selectively consume response messages with its own client ID added to the request header.
• A temporary queue created by the client (lasts only for the duration of the client session).
[Diagram: Specifying Producers (clients) send on the Request Channel to Selective Consumers (workers), which reply on Reply Channel 1 or Reply Channel 2.]
Request Response (Conversation) using Filtering
[Diagram summary]
• DTS clients: DTS Client1; NGS Portal (an application bound to facilityA); GridSAM (an application bound to facilityB).
• DTS workers: queue consumer clusters 'facilityA' and 'facilityB'.
• Queues: JobSubmitQ, InvokeClientQ, InvokeWorkerQ.
1) The client's MDP producer pool, connected to JobSubmitQ, sends a request with JMS message headers MessageID = guuidA, WorkerGroupID = facilityA, ClientID = DTSClient1. The worker's MDP selective consumer pool selects on WorkerGroupID = facilityA.
2) The worker's MDP producer pool, connected to InvokeClientQ, sends a message with headers CorrelationID = guuidA, WorkerID = workerA, ClientID = DTSClient1. The client's MDP selective consumer pool selects on ClientID = DTSClient1.
3) The client's MDP producer pool, connected to InvokeWorkerQ, sends a message with headers CorrelationID = guuidA, WorkerID = workerA, ClientID = DTSClient1. The worker's MDP selective consumer pool selects on WorkerID = workerA.
(Exchange of client and worker Application IDs so that the recipient worker and client can converse.)
Request Response (Conversation) using Filtering
• Each JMS client (worker and client) has a unique instance/application ID (clientID, workerID).
• A client sends a job request and adds its own ClientID to the headers (in conjunction with the other headers used in message selection, e.g. MessageID and WorkerGroupID).
• A worker picks up a message and responds to an administered response queue (not a dynamic queue) via the ReplyTo header. It returns its own WorkerID and forwards the given ClientID in the message header.
• The client receives messages from the response queue and filters on ClientID.
• The client can now converse with the recipient worker, since both the client and worker have their respective IDs and can correlate messages on the original message ID using CorrelationID.
• This approach requires only a limited number of administered queues, e.g. JobSubmitQ, InvokeClientQ, InvokeWorkerQ.
• The main benefit is that this approach is extensible: it provides a higher level of decoupling between message producer and receiver, since clients and workers can easily be added without change to the broker.
• This approach can also be combined with multiple channels as required (hybrid approach).
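The ID exchange described above can be traced with a small in-memory Python sketch. It is an illustration under stated assumptions, not DTS or JMS code: plain deques stand in for the administered queues, header filtering is done by hand rather than by a broker, and the function names are hypothetical. Queue and header names follow the slide.

```python
import uuid
from collections import deque

# Deques stand in for the three administered queues (no real broker involved).
job_submit_q, invoke_client_q, invoke_worker_q = deque(), deque(), deque()


def take_first(q, header, value):
    """Remove and return the first message whose header matches, else None."""
    for message in list(q):
        if message["headers"].get(header) == value:
            q.remove(message)
            return message
    return None


def client_submit(client_id, group_id, body):
    """Step 1: the client submits a job and stamps it with its own ClientID."""
    message_id = str(uuid.uuid4())
    headers = {"MessageID": message_id, "WorkerGroupID": group_id, "ClientID": client_id}
    job_submit_q.append({"headers": headers, "body": body})
    return message_id


def worker_accept(worker_id, group_id):
    """Step 2: a worker selects on its group, then replies with its WorkerID."""
    request = take_first(job_submit_q, "WorkerGroupID", group_id)
    if request is None:
        return None
    reply_headers = {
        "CorrelationID": request["headers"]["MessageID"],
        "WorkerID": worker_id,
        "ClientID": request["headers"]["ClientID"],
    }
    invoke_client_q.append({"headers": reply_headers, "body": "accepted"})
    return request


def client_send_control(reply, control):
    """Step 3: the client addresses a control message to the worker that replied."""
    invoke_worker_q.append({"headers": dict(reply["headers"]), "body": control})


submitted_id = client_submit("DTSClient1", "facilityA", "bulk copy activity")
ignored = worker_accept("workerB", "facilityB")   # wrong group: nothing selected
accepted = worker_accept("workerA", "facilityA")
reply = take_first(invoke_client_q, "ClientID", "DTSClient1")
client_send_control(reply, "stop")
control = take_first(invoke_worker_q, "WorkerID", "workerA")
```

After step 2 both parties hold each other's ID, so every later message can be filtered on ClientID or WorkerID and tied back to the original request through CorrelationID.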
Core Component Batch / Worker Agent Enacts the Bulk Data Copy Activity as a fault tolerant batch job for copying between sources and sinks. Scopes, checkpoints and restarts.
Batch / Worker Agent
• Role is to enact the data copy activity according to the activity document, report status events and respond to control messages.
• The copy activity is a batch processing task (automated processing of large volumes of information that is most efficiently processed without user interaction).
• The DTS worker is based on Spring Batch and Commons VFS (a contract driven approach facilitates different implementations, e.g. scripts / shelling out to a command line client).
• Spring Batch provides a framework for functions that are essential in batch processing, e.g. split/monitor/merge, logging/tracing, tx management, processing statistics, job pause and restart, skip, retry, check-pointing. A Spring Batch implementation deals with breaking apart the business logic and sharing it efficiently between parallel processes or processors as step-jobs. http://static.springsource.org/spring-batch/index.html
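The value of check-pointing, retry and restart for a copy job can be shown with a plain Python sketch. This is not Spring Batch and does not use its API; the function `run_copy_job`, the set-based checkpoint and the deliberately flaky copy function are all assumptions made for illustration. The sketch shows that a restarted job skips files already recorded in the checkpoint and only repeats the work that failed.

```python
def run_copy_job(files, copy_one, checkpoint, max_retries=2):
    """Copy each file once, recording successes in the checkpoint set.

    Files already in the checkpoint are skipped, so the same call can be
    repeated after a failure to restart the job. Returns the files that
    still failed after max_retries retries.
    """
    failed = []
    for name in files:
        if name in checkpoint:
            continue  # completed in an earlier run
        for attempt in range(max_retries + 1):
            try:
                copy_one(name)
                checkpoint.add(name)
                break
            except IOError:
                if attempt == max_retries:
                    failed.append(name)
    return failed


# A stand-in copy function that fails once for one file (hypothetical).
copy_calls = []
failures_left = {"b.dat": 1}


def flaky_copy(name):
    copy_calls.append(name)
    if failures_left.get(name, 0) > 0:
        failures_left[name] -= 1
        raise IOError(f"transient failure copying {name}")


file_list = ["a.dat", "b.dat", "c.dat"]
completed = set()
first_run_failed = run_copy_job(file_list, flaky_copy, completed, max_retries=0)
second_run_failed = run_copy_job(file_list, flaky_copy, completed, max_retries=0)
```

In a real worker the checkpoint would be persisted so that a restart survives a process crash; here it is kept in memory to keep the example short.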
Core Component Message Model Bulk Data Copy Activity Document. Control Messages (stop, start, cancel) Event Messages (faults, status, instance attributes)
Message Model Requirements • Document Message • Bulk Data Copy Activity description • Captures all information required to connect to each source and sink URI and subsequently enact the activity. • Transfer requirements e.g. URI Properties, file selectors (reg-expression), scheduling (batch-window), retry count, source/sink alternatives, checksums?, sequential ordering? DAG? • Serialized user credentials. • Probably adopt/extend the Data End Point Reference (DEPR) construct from DMI. A specialized form of WS-Address element which does not mandate any particular URL/transport scheme, multiple <DataLocations/> • Control Messages • Interact with a state/lifecycle model (e.g. stop, resume, cancel) • Event Messages • Standard fault types and status updates • Information Model • To advertise the service capabilities / properties / supported protocols
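The idea that control messages interact with a state/lifecycle model can be sketched as a small transition table in Python. The state names (RUNNING, STOPPED, CANCELLED) and the allowed transitions are assumptions made for this example; the slides name only the control messages (stop, resume, cancel) and do not define the state model.

```python
# Hypothetical lifecycle: (current state, control message) -> next state.
TRANSITIONS = {
    ("RUNNING", "stop"): "STOPPED",
    ("STOPPED", "resume"): "RUNNING",
    ("RUNNING", "cancel"): "CANCELLED",
    ("STOPPED", "cancel"): "CANCELLED",
}


def apply_control(state, control):
    """Return the next state, or raise ValueError for an invalid control message."""
    try:
        return TRANSITIONS[(state, control)]
    except KeyError:
        raise ValueError(f"control '{control}' is not valid in state {state}") from None


state_after_stop = apply_control("RUNNING", "stop")
state_after_resume = apply_control(state_after_stop, "resume")
state_after_cancel = apply_control(state_after_resume, "cancel")

try:
    apply_control("CANCELLED", "resume")
    rejected = False
except ValueError:
    rejected = True  # a cancelled activity cannot be resumed in this sketch
```

A worker could report each accepted or rejected transition as an event message, which is where standard fault types and status updates would apply.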
Existing/In-Scope Specifications
Related Specifications
• Job Submission Description Language (JSDL)
  • An activity description language for generic compute applications.
• OGSA Data Movement Interface (DMI)
  • Low level schema for defining the transfer of bytes between a single source and sink.
• JSDL HPC File Staging Profile (HPCFS)
  • Designed to address file staging, not bulk copying.
• OGSA Basic Execution Service (BES)
  • Defines a basic framework for defining and interacting with generic compute activities: JSDL + extensible state and information models.
• None of these fully captures our requirements (this is not a criticism of these specs; they are designed to address their existing use-cases, which only partially overlap with the requirements for a bulk data copy activity).
Proprietary
• Condor Stork, based on Condor Class-Ads.
• gLite JDL (again based on Class-Ads).
• Not sure if Globus has/intends a similar definition in its new developments (e.g. SaaS). Anyone?
JSDL Data Staging 1 and the HPC File Staging Profile
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
This example defines both the source and target within the same <DataStaging/> element, which is permitted in JSDL. However, the HPC File Staging Profile (Wasson et al. 2008), which is an extension to JSDL, limits the use of credentials to a single credential definition within a data staging element. Often, different credentials will be required for the source and the target.
JSDL Data Staging 2
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>DL_HOME</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>NGS_HOME</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>
Coupled staging elements: a source data staging element for fileA and a corresponding target element for staging out the same file. By specifying that the input file is deleted after the job has executed, this example simulates the effect of a data copy from one location to another through the staging host.
Limitations:
• No multiple data locations (alternative sources and sinks).
• More elements are required (e.g. transfer requirements, file selectors, URI properties).
• Intended for compute and data staging, not really bulk data copying.
OGSA DMI The OGSA Data Movement Interface (DMI) (Antonioletti et al. 2008) defines a number of XML constructs for describing and interacting with a data transfer activity. The data source and destination are each described separately with a Data End Point Reference (DEPRs), which is a specialized form of WS-Address element (Box et al. 2004). In contrast to the JSDL data staging model, a DEPR facilitates the definition of one or more <Data/> elements within a <DataLocations/> element. This is used to define alternative locations for the data source and/or sink. In doing this, an implementation is then free to select between its supported protocols and retry different source/sink combinations from the available list. This improves resilience and the likelihood of performing a successful data transfer by matching protocols supported by the service.
DEPR Example
<dmi:SourceDataEPR>
  <wsa:Address>http://www.ogf.org/ogsa/2007/08/addressing/none</wsa:Address>
  <wsa:Metadata>
    <dmi:DataLocations>
      <dmi:Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
                DataUrl="gsiftp://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other stuff/>
      </dmi:Data>
      <dmi:Data ProtocolUri="urn:my-project:srm"
                DataUrl="srm://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other stuff/>
      </dmi:Data>
    </dmi:DataLocations>
  </wsa:Metadata>
</dmi:SourceDataEPR>
<dmi:SinkDataEPR>
  . . . Similar to above but for the sink . . .
</dmi:SinkDataEPR>
Defines alternative locations for the data source and/or sink.
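How an implementation might use alternative data locations can be sketched in Python. This is an assumption-laden illustration, not DMI-defined behaviour: the function `transfer_with_alternatives`, the dictionary representation of a `<dmi:Data/>` entry, the sink entry and the stand-in `attempt` function are all hypothetical. The source protocol URIs and data URLs are taken from the DEPR example. The sketch filters the candidates by supported protocol, then tries source/sink combinations in order until one succeeds.

```python
from itertools import product


def transfer_with_alternatives(sources, sinks, supported, attempt):
    """Try source/sink combinations until one transfer succeeds.

    Each location is a dict with 'protocol' and 'url' keys. Locations whose
    protocol is not in 'supported' are skipped. Returns the (source_url,
    sink_url) pair that worked, or None if every combination failed.
    """
    usable_sources = [s for s in sources if s["protocol"] in supported]
    usable_sinks = [s for s in sinks if s["protocol"] in supported]
    for source, sink in product(usable_sources, usable_sinks):
        if attempt(source["url"], sink["url"]):
            return source["url"], sink["url"]
    return None


GRIDFTP = "http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
SRM = "urn:my-project:srm"
FTP = "urn:example:ftp"  # hypothetical protocol URI for the sink

source_locations = [
    {"protocol": GRIDFTP, "url": "gsiftp://example.org/name/of/the/dir/"},
    {"protocol": SRM, "url": "srm://example.org/name/of/the/dir/"},
]
sink_locations = [{"protocol": FTP, "url": "ftp://sink.example.org/dir/"}]

tried = []


def attempt(source_url, sink_url):
    """Stand-in transfer: pretend the GridFTP source is currently unreachable."""
    tried.append((source_url, sink_url))
    return not source_url.startswith("gsiftp://")


chosen = transfer_with_alternatives(
    source_locations, sink_locations, {GRIDFTP, SRM, FTP}, attempt
)
unsupported = transfer_with_alternatives(
    source_locations, sink_locations, {FTP}, attempt
)
```

Here the first combination fails and the second alternative is used, which is the resilience benefit of listing more than one location per end point.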
DMI cont.
There are some limitations:
• DMI is intended to describe only a single data transfer operation between one source and one sink. To do several transfers, multiple invocations of a DMI service factory would be required to create multiple DMI service instances.
• We require a single (atomic) message packet that wraps multiple transfers and whose delivery can be transacted, e.g. through a message router.
• Some of the existing constructs require extension / slight modification.
Therefore: a DMI v2 strawman proposal at OGF to canvass some new extensions and to propose a new bulk-copy doc that builds on DMI.
Bulk Data Copy Doc and JSDL Integration?
<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!-- Option a) Embed BulkDataCopy document -->
      <other:BulkDataCopy ... />
      <!-- If Basic Profile compliance is important -->
      <jsdl-hpcpa:HPCProfileApplication>
        <jsdl-hpcpa:Executable>/usr/bin/datacopyagent.sh</jsdl-hpcpa:Executable>
        <jsdl-hpcpa:Argument>'myBulkDataCopyDoc.xml'</jsdl-hpcpa:Argument>
        ...
      </jsdl-hpcpa:HPCProfileApplication>
    </jsdl:Application>
    <jsdl:Resources>
      <!-- Option b) Stage-in BulkDataCopy document -->
      <jsdl:DataStaging>
        <jsdl:FileName>myBulkDataCopyDoc.xml</jsdl:FileName>
        ...
      </jsdl:DataStaging>
    </jsdl:Resources>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
Possible options for integrating the proposed <BulkDataCopy/> document within JSDL: a) nesting it within the <jsdl:Application/> element, or b) staging in a <BulkDataCopy/> document as input for the named executable. Why not?