Sessions 43 & 44
Accessing data using a common interface: OGSA-DAI as an example
Elias Theocharopoulos and Tilaye Alemu
ISSGC ‘09 – Sophia Antipolis – Tuesday, 14th July 2009
Overview • The problem: Sharing data in a grid • What is OGSA-DAI? • Data-centric workflows • Key OGSA-DAI terms • The OGSA-DAI client toolkit • Use cases and extensibility points • Pros and cons
The problem: Sharing and accessing data in a grid
How about a central server? (Diagram: the client sends an FR query to a central server and receives FR data.)
Central server pros and cons • Access to up-to-date data • Single point of access • Data in common format • Database can handle joins • Initial overhead in terms of time, effort and cost • Keeping data up to date • Loss of control by data providers • Assuming they even let go • Security and trust
How about providing direct access? (Diagram: the client sends separate IA, UK and ES queries to the individual databases, receives IA, UK and ES data, and has to translate and join the results itself.)
Direct access pros and cons • Access to up-to-date data • Fast access • Data providers retain control • Fat clients • Heterogeneity and inconsistency • Data • Databases • Connection • Security • Security overheads for data providers • Manage firewalls and usernames/passwords for multiple clients • Hard to use in grid/web service workflows
How about providing a ZIP on the web? (Diagram: the client issues an HTTP GET to each of the UK, ES and IA data providers, downloads a ZIP from each, and has to unZIP, translate and join the data itself.)
ZIP on the web pros and cons • Fast access • Data providers retain control • Very large downloads even if client only needs subset • Providers have to select and ZIP their data • Client has to install data into a local database • Static snapshot
Sharing distributed heterogeneous resources with OGSA-DAI (Diagram: the client sends an FR query to OGSA-DAI, which issues UK, ES and IA queries to the respective databases, translates and joins the UK, ES and IA data, and returns FR data to the client.)
Motivation • Grid is about sharing resources • Need to share structured data resources
What is OGSA-DAI? • Open Grid Services Architecture Data Access and Integration • A framework that executes workflows • Workflows are data-centric • Workflow components are designed for data access, integration, transformation and delivery • Can access heterogeneous data resources • Web service interface • Intended as a toolkit for building higher-level application-specific data services
OGSA-DAI’s vision • Sharing data resources to enable collaboration • Data access • Structured data in distributed heterogeneous data resources • Data integration • e.g. expose multiple databases to users as a single virtual database • Data transformation • e.g. expose data in schema X to users as data in schema Y • Data delivery • To where it’s needed by the most appropriate means • e.g. web service, e-mail, HTTP, FTP, GridFTP
OGSA-DAI workflow • Executes workflows • Workflows contain activities • Well-defined functional units • Data goes in, something is done, data comes out • Equivalent to programming language methods • Workflows are submitted by clients • To an OGSA-DAI web service
An OGSA-DAI workflow - a simple analogy • Client query (in French): SELECT Pays, Capital FROM Pays • English-language database: convert query from French to English • Run SQL query SELECT Country, Capital FROM Countries • Convert data from English to French • Spanish-language database: convert query from French to Spanish • Run SQL query SELECT País, Capital FROM Países • Convert data from Spanish to French • Join the data
How it appears to the client (Diagram: the client submits workflow(SELECT Pays, Capital FROM Pays) to OGSA-DAI.)
Data integration with OGSA-DAI workflows • Across OGSA-DAI services
OGSA-DAI: Key Term Activity • An activity is a named unit of functionality • A well-defined workflow unit • Pluggable • Composable • An activity can have • 0 or more named inputs • 0 or more named outputs • Blocks of data flow from an activity’s output into another activity’s input
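The idea of an activity as a named, composable unit can be sketched in a few lines of Java. This is only an illustrative model under stated assumptions: `ActivitySketch`, `then`, `upperCase` and `dropEmpty` are hypothetical names, not OGSA-DAI classes, and each activity is simplified to one input and one output rather than 0 or more named ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical model, NOT the OGSA-DAI API: an activity is a named unit that
// consumes blocks from a single input and emits blocks on a single output.
final class ActivitySketch {
    final String name;
    private final Function<List<Object>, List<Object>> body;

    ActivitySketch(String name, Function<List<Object>, List<Object>> body) {
        this.name = name;
        this.body = body;
    }

    // Data goes in, something is done, data comes out.
    List<Object> process(List<Object> inputBlocks) {
        return body.apply(inputBlocks);
    }

    // Composition: this activity's output is connected to the next one's input.
    ActivitySketch then(ActivitySketch next) {
        return new ActivitySketch(name + "->" + next.name,
                blocks -> next.process(process(blocks)));
    }

    // Example activity: upper-cases the text form of every block.
    static ActivitySketch upperCase() {
        return new ActivitySketch("UpperCase", blocks -> {
            List<Object> out = new ArrayList<>();
            for (Object b : blocks) {
                out.add(b.toString().toUpperCase(Locale.ROOT));
            }
            return out;
        });
    }

    // Example activity: drops blocks whose text form is empty.
    static ActivitySketch dropEmpty() {
        return new ActivitySketch("DropEmpty", blocks -> {
            List<Object> out = new ArrayList<>();
            for (Object b : blocks) {
                if (!b.toString().isEmpty()) {
                    out.add(b);
                }
            }
            return out;
        });
    }

    public static void main(String[] args) {
        ActivitySketch chain = dropEmpty().then(upperCase());
        System.out.println(chain.name + " " + chain.process(List.of("paris", "", "madrid")));
    }
}
```

Because both example activities share the same shape, either can be plugged in before or after the other, which is the sense of "pluggable" and "composable" used on the slide.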
OGSA-DAI: Key Term Activity (cont.) • Example activities include • Execute an SQL query • ZIP a batch of data • List the files in a directory • Execute an XSL transform on an XML document • Deliver data to an FTP server
OGSA-DAI: Key Term Activity (cont.) • Activity connections • All required inputs must be connected • All outputs must be connected • Optional inputs • Inputs • Literal • Streamed • Types
Data grouping: Lists (Diagram: ReadFromFileActivity takes the filenames f1, f2 as input and outputs one list of byte arrays per file: [byte[]…], [byte[]…].) • Special blocks are used to mark the beginning and the end of a list. • A list groups related data as one unit. • For example, ReadFromFileActivity can dynamically take any number of filenames as input. • Without a way to group the output byte arrays, we would have no way to differentiate between the binary data of files f1 and f2. • Streaming is preserved, since for each file a number of byte arrays is produced and forwarded to downstream activities.
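The role of the list markers can be sketched as follows. This is a minimal model, assuming hypothetical marker objects `LIST_BEGIN` and `LIST_END` and in-memory byte arrays in place of real files; the slides do not show OGSA-DAI's actual marker types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListMarkerSketch {
    // Hypothetical marker blocks, compared by identity.
    static final Object LIST_BEGIN = new Object();
    static final Object LIST_END = new Object();

    // Producer: emits the content of each "file" as a marked list of chunks,
    // so chunks of different files never run together in the stream.
    static List<Object> readFiles(List<byte[]> fileContents, int chunkSize) {
        List<Object> stream = new ArrayList<>();
        for (byte[] content : fileContents) {
            stream.add(LIST_BEGIN);
            for (int off = 0; off < content.length; off += chunkSize) {
                int end = Math.min(content.length, off + chunkSize);
                stream.add(Arrays.copyOfRange(content, off, end));
            }
            stream.add(LIST_END);
        }
        return stream;
    }

    // Consumer: uses the markers to recover the total size of each file.
    static List<Integer> sizesPerList(List<Object> stream) {
        List<Integer> sizes = new ArrayList<>();
        int current = 0;
        for (Object block : stream) {
            if (block == LIST_BEGIN) {
                current = 0;
            } else if (block == LIST_END) {
                sizes.add(current);
            } else {
                current += ((byte[]) block).length;
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Two stand-in files, f1 with 5 bytes and f2 with 3 bytes.
        List<Object> stream = readFiles(List.of(new byte[5], new byte[3]), 2);
        System.out.println(stream.size() + " blocks, sizes " + sizesPerList(stream));
    }
}
```

The consumer never needs to know how many chunks a file was split into, so the producer can keep streaming small chunks while the grouping is still preserved.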
Passing data internally: OGSA-DAI Tuple (Diagram: an SqlQuery activity running SELECT city, temp FROM weather;) • A special type of data passed between activities • A Tuple is a data representation similar to a row of relational data. Each element of a Tuple represents a column. • Tuples are normally grouped in lists and are preceded by a metadata block.
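A rough sketch of the "metadata block followed by tuples" layout is given below. `TupleSketch` is a hypothetical stand-in, not the OGSA-DAI Tuple type, and the city and temperature values are made up for illustration.

```java
import java.util.List;

// Hypothetical model: column names play the role of the metadata block,
// and each tuple is an ordered list of values, one per column.
final class TupleSketch {
    final List<String> columnNames;
    final List<List<Object>> tuples;

    TupleSketch(List<String> columnNames, List<List<Object>> tuples) {
        for (List<Object> t : tuples) {
            if (t.size() != columnNames.size()) {
                throw new IllegalArgumentException("Tuple width does not match metadata");
            }
        }
        this.columnNames = columnNames;
        this.tuples = tuples;
    }

    // Looks a value up by row index and column name, using the metadata
    // to translate the name into a position within the tuple.
    Object get(int row, String column) {
        int idx = columnNames.indexOf(column);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return tuples.get(row).get(idx);
    }

    public static void main(String[] args) {
        // Illustrative result of SELECT city, temp FROM weather; (values invented).
        TupleSketch result = new TupleSketch(
                List.of("city", "temp"),
                List.of(List.<Object>of("Nice", 28), List.<Object>of("Oslo", 17)));
        System.out.println(result.get(1, "city") + " " + result.get(1, "temp"));
    }
}
```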
An interesting activity: Tee • Some activities operate at the level of blocks and are not concerned with the type or values of the data they handle, e.g. TeeActivity. (Diagram: TeeActivity with the number of outputs set to 2 copies the input [A,B,C,D] to two outputs, each carrying [A,B,C,D].)
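The behaviour of a tee can be sketched generically, since it never inspects the blocks it copies. `TeeSketch.tee` is a hypothetical helper for illustration and says nothing about how the real TeeActivity is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class TeeSketch {
    // Copies every input block, in order, to each of the requested outputs.
    // The method is generic because it never looks at block types or values.
    static <T> List<List<T>> tee(List<T> input, int numberOfOutputs) {
        if (numberOfOutputs < 1) {
            throw new IllegalArgumentException("Need at least one output");
        }
        List<List<T>> outputs = new ArrayList<>();
        for (int i = 0; i < numberOfOutputs; i++) {
            outputs.add(new ArrayList<>());
        }
        for (T block : input) {
            for (List<T> out : outputs) {
                out.add(block);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        System.out.println(tee(List.of("A", "B", "C", "D"), 2));
    }
}
```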
OGSA-DAI: Key Term Resource • Data request execution resource • Data resources • Data sources • Data sinks • Sessions • A state container associated with a set of workflows • One workflow can lodge state • A subsequent workflow can retrieve it • Requests • One per workflow submitted to a DRER • Access request status
OGSA-DAI: Key Term Workflow • A workflow can contain: • Activities • Resource-based: SQLQuery • Non-resource: transformation and delivery • Resources • Targeted by activities • Other workflows • Sub-workflows • Other types of workflow
OGSA-DAI: Key Term Workflow (cont.) • OGSA-DAI can be used as a workflow processing system designed to stream data through a set of activities in a pipelined manner. • In the Query->Transform->Deliver workflow, if the activities are well defined, all three process different portions of the data stream concurrently.
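Pipelined streaming of this kind can be sketched with three threads joined by small bounded queues. This is a sketch under stated assumptions rather than OGSA-DAI's engine: `PipelineSketch`, the queue capacity of 2 and the upper-casing "transform" are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class PipelineSketch {
    // Private end-of-stream sentinel, compared by identity only.
    private static final String END = new String("END");

    static List<String> run(List<String> rows) {
        BlockingQueue<String> queryOut = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> transformOut = new ArrayBlockingQueue<>(2);
        List<String> delivered = new ArrayList<>();

        // "Query": pushes rows downstream as soon as they are available.
        Thread query = new Thread(() -> {
            try {
                for (String r : rows) queryOut.put(r);
                queryOut.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // "Transform": works on each block while the query is still producing.
        Thread transform = new Thread(() -> {
            try {
                for (String b = queryOut.take(); b != END; b = queryOut.take()) {
                    transformOut.put(b.toUpperCase(Locale.ROOT));
                }
                transformOut.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // "Deliver": collects transformed blocks in arrival order.
        Thread deliver = new Thread(() -> {
            try {
                for (String b = transformOut.take(); b != END; b = transformOut.take()) {
                    delivered.add(b);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        query.start();
        transform.start();
        deliver.start();
        try {
            query.join();
            transform.join();
            deliver.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Pipeline interrupted", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("paris", "madrid", "london")));
    }
}
```

The bounded queues mean no stage has to hold the whole data set in memory: a fast producer simply blocks until the next stage has caught up.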
OGSA-DAI: Key Term Workflow (cont.) • Pipeline workflow: a set of chained activities executed in parallel, with data flowing between the activities. • Sequence workflow: all the sub-workflows added to this workflow are executed in sequence. For example, the 1st sub-workflow in a sequence creates a table and the 2nd bulk loads transformed data into this table. • Parallel workflow: all the sub-workflows added to this workflow are executed in parallel.
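The difference between sequence and parallel workflows can be illustrated by treating each sub-workflow as a plain `Runnable`. `WorkflowKindsSketch` and its two methods are hypothetical names chosen for this sketch; they only model the ordering guarantees described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class WorkflowKindsSketch {
    // Sequence: each sub-workflow finishes before the next one starts.
    static void sequence(List<Runnable> subWorkflows) {
        for (Runnable w : subWorkflows) {
            w.run();
        }
    }

    // Parallel: all sub-workflows are started, then all are awaited.
    // No ordering between them is guaranteed.
    static void parallel(List<Runnable> subWorkflows) {
        List<Thread> threads = new ArrayList<>();
        for (Runnable w : subWorkflows) {
            Thread t = new Thread(w);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Parallel workflow interrupted", e);
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        // Stand-ins for "create a table" followed by "bulk load into it".
        Runnable createTable = () -> log.add("create table");
        Runnable bulkLoad = () -> log.add("bulk load");
        sequence(List.of(createTable, bulkLoad));
        System.out.println(log);
    }
}
```

The table example needs a sequence precisely because the bulk load is only valid once the table exists; independent sub-workflows are the natural candidates for the parallel form.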
Getting to the first practical: The OGSA-DAI client toolkit.
OGSA-DAI client toolkit • Construct and submit requests in Java, not XML • Toolkit manages interaction with web services via SOAP over HTTP; it handles SOAP request construction and response parsing. • Provides Java abstractions of • Services • OGSA-DAI resources and properties • Requests • Activities
The client toolkit • The workflow description is sent to the OGSA-DAI server as an XML document. • Application developer does not need to worry about creating this document. • The client toolkit provides ways of assembling activity workflows programmatically. • We will see how to use the client toolkit during the hands-on session.
Service/resource model (Diagram showing: a Client; a Data Request Execution Service with its Data Request Execution Resource, MyDRER; three Data Resources, One, Two and Three, each in front of its Data; Sessions; and a Request Management Service with a Request, MyRequest123456.)
Client Toolkit Activities • One client activity per server activity • Same input and output names • Plus some convenience methods For example: • Retrieve results as a JDBC ResultSet from a TupleToWebRowSet activity. • Retrieve update count as an Integer from a SQLUpdate activity
Step by Step Guide for Writing Clients • Create activities • There’s a corresponding client toolkit activity for each server-side activity:
    DeliverToFTP deliver = new DeliverToFTP();
    ReadFromFile readFile = new ReadFromFile();
Connecting activities • Set inputs for each activity (e.g. parameters) • Every input parameter can be either a literal input or streamed from another activity • Literal inputs, e.g. for constant parameters:
    deliver.addFilename("results1.txt");
    deliver.addHost("anonymous@test.ogsadai.org.uk:21");
• Connect an input to the output of another activity to stream data:
    deliver.connectDataInput(readFile.getDataOutput());
Gaining access to the results • If the output of an activity can be provided in a user-friendly type, then there are methods to access the results • Check whether there are more results to be retrieved:
    boolean hasNext = sqlUpdate.hasNextResult();
• Get the next result in a convenient type:
    int count = sqlUpdate.getNextResult();
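The hasNextResult/getNextResult pair is typically used in a loop. The sketch below shows that pattern with a hypothetical stand-in class, `ResultAccessSketch`, backed by an in-memory list of update counts; it mirrors the method names on the slide but is not the client toolkit's SQLUpdate class and does not contact a server.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a client-side activity that exposes its
// results (here, update counts) one at a time in a convenient type.
final class ResultAccessSketch {
    private final Deque<Integer> pending;

    ResultAccessSketch(List<Integer> updateCounts) {
        this.pending = new ArrayDeque<>(updateCounts);
    }

    // True while at least one result has not yet been retrieved.
    boolean hasNextResult() {
        return !pending.isEmpty();
    }

    // Returns and consumes the next result.
    int getNextResult() {
        if (pending.isEmpty()) {
            throw new NoSuchElementException("No more results");
        }
        return pending.removeFirst();
    }

    // Typical consumer loop: drain all results and add them up.
    static int totalUpdated(ResultAccessSketch sqlUpdate) {
        int total = 0;
        while (sqlUpdate.hasNextResult()) {
            total += sqlUpdate.getNextResult();
        }
        return total;
    }

    public static void main(String[] args) {
        ResultAccessSketch sqlUpdate = new ResultAccessSketch(List.of(2, 0, 5));
        System.out.println("Rows updated: " + totalUpdated(sqlUpdate));
    }
}
```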
Build and execute the Workflow Request • Create a workflow and add activities to it • A data service executes the workflow and returns a response (or an error!) • The response may contain data (depending on the activities) • Each client toolkit activity provides utility methods for retrieving its response data
First hands-on session • Go to: http://homepages.nesc.ac.uk/~elias/issgc09/html/practical.html
Extending OGSA-DAI: What • OGSA-DAI • A framework • Extensible • Out of the box is the basics • Different applications have different needs • New sources of data • New functionality
Extending OGSA-DAI: Overview (Diagram of extensibility points. Presentation layer: GT, Axis, OMII, plus new message frameworks: gLite, embedded, UNICORE, WS-DAI?. OGSA-DAI core: persistence and configuration, workflow execution engine, sessions, activity framework, request. Activities: SQLQuery, XPathQuery, XSLTransform, DeliverToURL, plus new functionality: MyOwnActivity. Resources: data source, data sink, data resources, plus new types of data.)
Extending OGSA-DAI: Activities • Activities do some unit of work • Specific transformation • Data format: SWISS-PROT to format X • Delivery • Deliver to a target service • Data analysis and integration • Combine data from different sources
Extending OGSA-DAI: Resources • New resources – why? • New products • New applications • Specialised access • Required: • DataResource • DataResourceState • ResourceAccessor
Extending OGSA-DAI: Remote Resource • Accessing resources on remote OGSA-DAI • Avoid replication of resources • Security issues • Devolved to local OGSA-DAI • Security between OGSA-DAI deployments
SQL views • Define a drPatient view • CREATE VIEW drPatient AS SELECT patient.id, patient.name, patient.age, patient.sex, doctor.name AS drName FROM patient, doctor WHERE patient.DrID = doctor.ID; • Client runs SELECT * FROM drPatient; • Shorthand for complex query results • Data access control e.g. users of drPatient • Cannot access a patient’s ZIP • Are unaware of the doctor or patient tables
OGSA-DAI SQL views • OGSA-DAI SQL views data resource • Represents a view across a database exposed by an OGSA-DAI relational resource • SQLQuery activity • Parses query • Splices in view definition • Submits transformed query to database • Can define views for read-only databases • Schema transformation • Map a logical schema to a physical schema
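The "splice in view definition" step can be approximated by rewriting a reference to the view into a derived table. The sketch below uses a naive regular-expression substitution in place of real SQL parsing, so it is only an illustration of the idea: `ViewSpliceSketch` and `splice` are hypothetical names, and the approach will not cope with joins, quoting or nested views.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ViewSpliceSketch {
    // Replaces "FROM <viewName>" with "FROM (<definition>) AS <viewName>".
    // Naive textual rewrite; a real implementation would parse the SQL.
    static String splice(String query, Map<String, String> views) {
        String result = query;
        for (Map.Entry<String, String> view : views.entrySet()) {
            Pattern reference = Pattern.compile(
                    "\\bFROM\\s+" + Pattern.quote(view.getKey()) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            String derivedTable = "FROM (" + view.getValue() + ") AS " + view.getKey();
            result = reference.matcher(result)
                    .replaceAll(Matcher.quoteReplacement(derivedTable));
        }
        return result;
    }

    public static void main(String[] args) {
        // View definition modelled on the drPatient example.
        Map<String, String> views = Map.of(
                "drPatient",
                "SELECT patient.id, patient.name, doctor.name AS drName "
                        + "FROM patient, doctor WHERE patient.DrID = doctor.ID");
        System.out.println(splice("SELECT * FROM drPatient;", views));
    }
}
```

Because the rewrite happens before the query reaches the database, nothing has to be created in the database itself, which is why views can be defined even over read-only databases.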
Distributed query processing • OGSA-DQP • Developed by Universities of Manchester and Newcastle • Refactored for OGSA-DAI 3.0 by EPCC as part of the NextGrid project • OGSA-DAI DQP package • Multiple tables on multiple databases are exposed to clients as multiple tables in one “virtual database” • Clients are unaware of the multiple databases • Databases can be exposed • EITHER within one OGSA-DAI server • OR via multiple remote OGSA-DAI servers