Using WSRF to build workflow scripts: Temp “files” and filters in a Grid environment

Michael Grobe
Indiana University

1 Introduction

Programmers frequently build scripts that store data in local temporary files, and sometimes pass the file handles of those files to subroutines or other programs. They also build “filter pipelines” that link simple applications in sequences that perform more complex processes.

This poster demonstrates the use of the same techniques to orchestrate workflow within a grid environment built using the Web Services Resource Framework (WSRF). It includes the source code for a simple WSRF Resource that may be used for temporary storage and for a WSRF service that uses the temporary storage Resource, and it outlines a script that uses those Resources.

This approach to workflow control is demonstrated using the WSRF::Lite Perl module developed at the University of Manchester [8], along with the WSRF container included in that module's distribution as Container.pl. Other WSRF containers, such as the one provided within the Globus Toolkit, could be used as well.

The target Grid-based application for the major example in this paper is the Centralized Life Sciences Data (CLSD) service at Indiana University [3]. CLSD presents a collection of life science data converted to relational form and/or federated into a single relational database managed by an IBM DB2 database management system.

2 The Filespace Resource

Here is a WSRF "Filespace Resource" that has only one property accessible to users, and only one (non-built-in) operation. The property is named "array," and the user-supplied operation is “createFilespaceResource”.

    #!/usr/bin/env perl
    package Filespace;
    use strict;
    use vars qw(@ISA);
    use WSRF::Lite;

    @ISA = qw(WSRF::FileBasedResourceLifetimes);   # Identify superclass.

    # The property "array" is an array of things.
    $WSRF::WSRP::ResourceProperties{array} = [];
    $WSRF::WSRP::PropertyNamespaceMap->{array}{prefix}    = "mmk";
    $WSRF::WSRP::PropertyNamespaceMap->{array}{namespace} = "http://www.sve.man.ac.uk/Filespace";

    # Operation to create a new Filespace resource.
    sub createFilespaceResource {
        my $envelope = pop @_;
        my ($class, @params) = @_;
        my $ser = new WSRF::SimpleSerializer;

        # Get an ID for the Resource.
        my $ID = WSRF::GSutil::CalGSH_ID();

        # Create a WS-Address for the Resource.
        my $wsa = WSRF::GSutil::createWSAddress(
            module => 'Filespace',
            path   => 'Session/Filespace/',
            ID     => $ID
        );

        # Write the properties defined above to file.
        WSRF::File::toFile($ID);

        # Return the WS-Address.
        return WSRF::Header::header( $envelope ),
               SOAP::Data->value($wsa)->type('xml');
    }   # end sub createFilespaceResource

    1;

Filespace.pm is adapted from the variant of Mark McKeown's Counter resource that inherits from the FileBasedResourceLifetimes class, which stores resource properties in a file.

The createFilespaceResource operation returns the address, or EndPoint Reference (EPR), of a newly created Resource. Invocations of createFilespaceResource in particular, and of WSRF Resources in general, take the following form:

    # Define the location and URI of the service.
    # (The host and port values here match the sample EPR shown below.)
    my $WS_host   = "http://host.domain";
    my $WS_port   = "8422";
    my $WS_uri    = "http://host.domain/Filespace";
    my $WS_target = "$WS_host:$WS_port/Session/Filespace/Filespace";

    # Now create a Filespace WS-Resource.
    my $ans = WSRF::Lite
        -> uri( $WS_uri )
        -> wsaddress( WSRF::WS_Address->new()->Address( $WS_target ) )   # Specify address.
        -> createFilespaceResource();                                    # Invoke function.

This usage will return an EPR like:

    http://host.domain:8422/Session/Filespace/Filespace/53101852107163019937

which is a combination of the Internet address of the service and (part of) the name of a file to be used for storage of Resource properties. (WSRF::Lite also allows Resource state to be stored in active processes, in which case the returned EPR is a socket to a process.)
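Because a file-based EPR of this kind is simply the service address followed by the Resource identifier, a client script can split it into those two parts, for example when logging which Resources it has created. The following is a minimal sketch that assumes the EPR layout shown above (an all-digit identifier as the last path segment); split_epr is a hypothetical helper written for this illustration and is not part of WSRF::Lite.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WSRF::Lite): split a file-based EPR
# into the service address and the trailing Resource identifier.
# Assumes the identifier is the final, all-digit path segment.
sub split_epr {
    my ($epr) = @_;
    my ($service, $id) = $epr =~ m{^(.+)/(\d+)$}
        or return;                      # empty list if the EPR does not match
    return ($service, $id);
}

my ($service, $id) = split_epr(
    "http://host.domain:8422/Session/Filespace/Filespace/53101852107163019937"
);
print "service: $service\n";
print "id:      $id\n";
```

The identifier is kept as a string throughout, since it is too long to be handled reliably as a number.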
The identified Resource will have a default "Lifetime," which can be adjusted via the built-in SetTerminationTime operation. (Note, however, that Resource storage may be provided on a scratch disk that is periodically cleaned out without respect to Resource Lifetime settings, and, in fact, may not be large enough for the intended use.)

In general, scripts that use the Filespace Resource will:

- create an instance of the Resource,
- store data in the Resource by using a built-in method like SetResourceProperties,
- reset the termination time (TT) of the Resource by using the built-in method SetTerminationTime,
- fetch data from the Resource by using a method like the built-in GetResourceProperty method, and
- destroy the Resource by using the built-in Destroy method.

3 Using WSRF within workflows

This section describes a WSRF Resource named "CLSDtoResource" that sends an SQL query to CLSD and stores the results in the Filespace Resource whose address, or EPR, is specified in the input parameter list. (This is exactly analogous to sending a filehandle to a local subroutine or function within a standard script.)

CLSDtoResource takes five input parameters:

- an address (EPR) for a Filespace Resource,
- the name of a property within that Resource that can be used to store the results of a CLSD query,
- an SQL query to send to CLSD for processing,
- the starting row to return, and
- the number of rows to return.

Figure 1 shows the source code for the CLSDtoResource package containing the CLSDtoResource subroutine. Note that this subroutine actually interacts with CLSD through a (non-WSRF) Web Service accessed using SOAP::Lite, which returns a single string containing data in Comma-Separated Value (CSV) format.
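Since the query result arrives as one CSV string, a consuming script will usually want to split it into rows and fields. The sketch below does this with the core Text::ParseWords module. It assumes rows are newline-separated and fields are double-quoted, as the sample output in this paper suggests; parse_clsd_csv is a hypothetical helper, and this is not a full CSV parser (for instance, it does not handle doubled quotes inside a field).

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: turn a CSV string into a list of array references,
# one per row. Assumes newline-separated rows with double-quoted fields.
sub parse_clsd_csv {
    my ($csv) = @_;
    my @rows;
    for my $line (split /\r?\n/, $csv) {
        next unless length $line;       # skip blank lines
        # parse_line(delimiter, keep_quotes, text): 0 strips the quotes.
        push @rows, [ parse_line(',', 0, $line) ];
    }
    return @rows;
}

# Sample data in the shape of the CLSD output shown in this paper.
my $sample = join "\n",
    '"TABSCHEMA (VARCHAR)","TABNAME (VARCHAR)"',
    '"BIND    ","BIND_INTERACTION"',
    '"DB2INST2","ADVISE_INDEX"';

my ($header, @data) = parse_clsd_csv($sample);
print join(" | ", @$header), "\n";
print join(" | ", @$_), "\n" for @data;
```

Padding inside the quotes (as in the schema column) is preserved, so a caller that wants trimmed values would need to strip trailing spaces itself.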
A script using the CLSDtoResource and Filespace Resources as part of a workflow would:

- create a Filespace Resource,
- pass the EPR for that Filespace Resource to CLSDtoResource, which will query CLSD and store the result in the Filespace Resource,
- retrieve and print the contents of the Filespace Resource, and
- destroy the Filespace Resource.

Figure 2 shows an abbreviated version of a script that would execute these steps. If this script is located within a file named "use-CLSD.pl", it can be used to send the SQL command

    select tabschema, tabname from syscat.tables

to CLSD and retrieve only the first 4 rows as shown below:

    bash$ perl use-CLSD.pl "select tabschema, tabname from syscat.tables" 1 4
    Sending request to create temporary WS-Resource.
    Successfully created Filespace Resource:
    http://host.subdomain.indiana.edu:8422/Session/Filespace/Filespace/53101852107163019937
    Sending SQL command to the CLSDtoResource Resource:
    select tabschema, tabname from syscat.tables
    Getting the value of the array property...
    Here is the result of the query:
    "TABSCHEMA (VARCHAR)","TABNAME (VARCHAR)"
    "BIND ","BIND_INTERACTION"
    "BIND ","BIND_PATHWAY"
    "DB2INST2","ADVISE_INDEX"
    "DB2INST2","ADVISE_WORKLOAD"
    Destroying WS-Resource.
    Temporary Filespace Resource destroyed.
    bash$

Of course a simple flow like this does not really require a temporary storage resource, but one can easily imagine more complicated scenarios. For example, the data retrieved from CLSD might be passed on to other resources for statistical processing and/or constructing graphs or tables.

In other cases, a resource invoked via this technique might involve batch processing, so that the Filespace resource would have to be polled until process completion. The second element in the array property could be used for this purpose, or some completion flag could be added to the current version of the Resource.
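The polling scenario just described could be sketched as follows. The fetch step is passed in as a code reference so that the sketch runs without a WSRF container; in a real script it would wrap a GetResourceProperty call on the Filespace Resource. The helper name poll_until_done, the "done" flag value, and the max_tries and interval parameters are all assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch repeatedly until it returns a value
# for which $is_done is true, or until $max_tries attempts have been made.
sub poll_until_done {
    my (%arg) = @_;
    my $fetch     = $arg{fetch};
    my $is_done   = $arg{is_done};
    my $max_tries = $arg{max_tries} // 10;
    my $interval  = $arg{interval}  // 5;    # seconds between attempts

    for my $try (1 .. $max_tries) {
        my $value = $fetch->();
        return $value if $is_done->($value);
        # Wait before the next attempt, but not after the last one.
        sleep $interval if $interval > 0 && $try < $max_tries;
    }
    die "Resource not ready after $max_tries attempts\n";
}

# Stand-in for a remote property read: reports "done" on the third call.
my $calls  = 0;
my $result = poll_until_done(
    fetch    => sub { ++$calls < 3 ? "running" : "done" },
    is_done  => sub { $_[0] eq "done" },
    interval => 0,
);
print "status: $result after $calls calls\n";
```

Bounding the number of attempts matters here because the Resource may be destroyed, or its termination time may pass, before the batch job completes.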
Note that WSRF::Lite provides some support for WSRF security [6,9], so that messages may be transmitted securely and authentication may be required when invoking remote services. Note also that the current implementation of file-based Resources is not efficient for large (over ~100 MB) file storage and/or manipulation, but that Resources could be customized for better efficiency.

4 Discussion

This approach to controlling workflows can be used by scripts running on desktops, as CGI scripts, Web Services, etc. It employs the Grid as a network-based computing utility. However, error-checking and failure recovery will add significant complexity to these workflow scripts, so that workflow engines, such as Taverna [7], may prove to be more practicable platforms.

Fig. 1. The CLSDtoResource package.

    package CLSDtoResource;
    use strict;
    use vars qw(@ISA);
    #use WSRF::Lite +trace => debug => sub {};
    use WSRF::Lite;
    use SOAP::Lite;    # Used directly below for the (non-WSRF) CLSD call.

    @ISA = qw(WSRF::FileBasedResourceLifetimes);

    # This sub queries CLSD and stores results in the Filespace
    # Resource whose address is submitted as an input parameter.
    sub CLSDtoResource {
        # Process input parameters sent from the WSRF Container.
        my $envelope = pop @_;
        my ($class, @params) = @_;
        my $Filespace_epr  = $params[0];
        my $property_name  = $params[1];
        my $my_query       = $params[2];
        my $starting_row   = $params[3];
        my $number_of_rows = $params[4];

        # Set up and make the call to CLSD using SOAP::Lite.
        my $host = "host.subdomain.indiana.edu";
        my $CLSD_return_value = SOAP::Lite
            -> service("http://$host:8421/axis/CLSDservice.jws?WSDL")
            -> proxy("http://$host:8421/axis/CLSDservice.jws?wsdl", timeout => 1200)
            -> queryCLSD($my_query, $starting_row, $number_of_rows,
                         "DB2account", "account_password", "csv");

        # Embed the returned information within appropriate XML.
        my $insertTerm = "<wsrp:Update><$property_name>"
                       . $CLSD_return_value
                       . "</$property_name></wsrp:Update>";

        # Now store the results in the Filespace WS-Resource
        # by invoking the built-in SetResourceProperties function.
        my $ans = WSRF::Lite
            -> wsaddress( WSRF::WS_Address->new()->Address( $Filespace_epr ) )
            -> uri( $WSRF::Constants::WSRP )
            -> SetResourceProperties( SOAP::Data->value( $insertTerm )->type( 'xml' ) );

        if ( $ans->fault ) {
            die "ERROR: " . $ans->faultcode . " \n" . $ans->faultstring . "\n";
        }

        # Return envelope.
        return WSRF::Header::header( $envelope ), "ok";
    }   # end sub CLSDtoResource

    1;

Fig. 2. Outline of command-line client to access CLSD using WSRF.

    # Define the location and URI of the Filespace service.
    my $WS_Filespace_host = "http://host.subdomain.indiana.edu";
    my $WS_Filespace_port = "8422";
    my $WS_Filespace_uri  = "http://host.subdomain.indiana.edu/Filespace";

    # 1. Create a Filespace Resource.

    # 2. Send the EPR and SQL query to CLSDtoResource.
    #    CLSDtoResource will relay the SQL command to CLSD
    #    via JDBC, and place the result in the "array"
    #    property of the temporary Filespace Resource.

    # 3. Get the contents of the Filespace Resource.
    #    Get the data from the "array" property of the
    #    Resource created above, and print it.

    # 4. Destroy the Filespace Resource.

(See handout for details.)

Acknowledgments

Thanks to Mark McKeown of the University of Manchester for several fine tutorial slidesets and example applications. Thanks to Stephan Zasada for carefully explained presentations on security within WSRF::Lite. Thanks to Andy Arenson and Scott McCaulay for providing the opportunity to prepare this paper.

References

[1] Foster, Ian, et al., “The Open Grid Services Architecture, Version 1.5”, 2006, http://www.ggf.org/documents/GFD.80.pdf
[2] Globus Alliance, The Globus Grid Toolkit Homepage, http://www.globus.org/toolkit/
[3] Indiana University, The Centralized Life Sciences Data (CLSD) Service, http://rac.uits.iu.edu/clsd/
[4] McKeown, Mark, “Web Services for the Grid - WSRF and WSRF::Lite”, 2005, http://www.sve.man.ac.uk/Research/AtoZ/ILCT/cern.ppt
[5] McKeown, Mark, “Web Services for Grid Computing”, 2006, http://www.sve.man.ac.uk/Research/AtoZ/ILCT/ogsa-workshop
[6] McKeown, Mark and Stephan Zasada, “Build Secure WS-Resource with WS::Lite and WS-Security”, 2006, http://www-128.ibm.com/developerworks/edu/gr-dw-gr-buildsecure.html
[7] Open Middleware Infrastructure Institute, The Taverna project, http://www.omii.ac.uk/projects/display_project.jsp?projectid=76
[8] Open Middleware Infrastructure Institute, “WSRF::Lite – An Implementation of the Web Services Resource Framework”, http://www.sve.man.ac.uk/Research/AtoZ/ILCT
[9] Zasada, Stephan, “Investigating Security in Perl-based Grid Middlewares”, 2004, http://www.sve.man.ac.uk/Research/AtoZ/ILCT/stefans_msc.pdf