FEDERATED QUERY PROCESSOR AND WORKFLOW. KC Workshop 2011 CITIH, BMI, OSU February 2011. Federated Query Processor: Agenda. Overview Why FQP is needed: OPRN use case Federated Query Service Asynchronous query execution Credential Delegation Large results retrieval
FEDERATED QUERY PROCESSOR AND WORKFLOW KC Workshop 2011 CITIH, BMI, OSU February 2011
Federated Query Processor: Agenda • Overview • Why FQP is needed: OPRN use case • Federated Query Service • Asynchronous query execution • Credential Delegation • Large results retrieval • Configurable Query Behavior • Federated Query Workflow • Example workflow • Sample DCQL query for OPRN use case • Federated Query Engine API • Federated Query Processor Client API • Federated Query Results Client API
Use Case: Ohio Perinatal Research Network (OPRN) • DATA CONSIDERATIONS: The clinical data for these groups, discretely available via the NCH PDRM and OSUMC Information Warehouse (IW), respectively, will be identified, and: • Perinatal Research Data Marts (PRDMs) comprised of these data elements will be set up within the NCH data warehouse and OSUMC Information Warehouse • Translational Research Informatics and Data Management Grid (TRIAD) grid node will be instantiated at NCH • PRDM will be modeled and semantically annotated • ACTORS: • Clinician/PIs • Clinical Research Coordinators • Biomedical Informatics Directors • Bio-pathology Directors • EMR Information Systems Personnel • OPRN Program Staff • Figure 1. Overview of currently independent perinatal research efforts at OSUMC and NCH • Figure 2. Overview of use case workflow
Federated Query Processor (FQP): Overview The caGrid Federated Query Processor Infrastructure provides a mechanism to perform: • Basic distributed aggregations • Joins (inner joins) of queries over multiple, extraneous data services using: • DCQL (an extension to CQL): submitted to FQP by a client • A uniform query language, CQL: submitted by FQP to each service being queried • A uniform grid interface enables: • Queries over any combination of caGrid data services • Expression of such concepts as joins against other data services, aggregations, and target data services
Federated Query Service • Asynchronous query execution • Start a DCQL query and immediately return the results context • Results retrieved by the client later • WS-Notification allows client to subscribe to status • Processing complete, exception, querying target data service, etc. • Credential Delegation • FQP service may perform queries on behalf of a client • Leverages Credential Delegation Service (CDS) • Large results retrieval • WS-Enumeration and caGrid Transfer • Configurable Query Behavior • Failure handling, partial results retrieval, etc
Federated Query Service Security • FQP Service may be deployed securely • May employ HTTPS messaging to and from client • Can communicate securely with data services involved in a DCQL query if they employ security • Unless otherwise specified, host credential of the FQP service will be used • Credential Delegation Service (CDS) may be used to query such services on behalf of the client • Required for use of CDS • Caller – only query results resources • Asynchronous query operations create a resource which is only accessible to the original caller of the asynchronous operation • Resource configured to use Grid Map authorization and caller’s ID is the only entry in that Grid Map
Federated Query Service Security • Delegated credentials • A client may choose to delegate his credentials to the FQP service for secure communication with data services involved in a DCQL query • Client makes a delegation request to the CDS • Indicates the FQP service as the delegatee • CDS issues a Delegated Credential Reference • Indicates which CDS holds the credential and a unique key • The FQP service validates those credentials • The caller’s ID must match the identity contained in the credentials obtained from the CDS and be non-null • Subsequent calls to data services are executed securely, utilizing the client’s delegated credential • Secure grid calls now utilize caller’s ID in all cases – no anonymous queries
Federated Query Resource Properties • The FQP Results Service publishes execution status through a resource property • Contains the processing status of the DCQL query • Waiting, Processing, Complete, Complete with error • Indicates which target data services have been contacted and what the status of that query execution was • Completed, Connection problem, other Exception • Identifies the range in the result set which was generated by each target data service • Even in aggregation scenarios, it is possible to poll this value and identify the context for each result
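The per-service result ranges described above let a client trace any aggregated result back to the data service that produced it. The following is a minimal sketch of that lookup, not the caGrid API: `ServiceRange`, its fields, and the host names are hypothetical stand-ins, and the ranges are assumed to be inclusive.

```java
import java.util.Arrays;
import java.util.List;

class ResultRangeLookup {
    // Returns the URL of the service whose inclusive index range contains
    // resultIndex, or null when no range covers it.
    static String sourceOf(List<ServiceRange> ranges, int resultIndex) {
        for (ServiceRange range : ranges) {
            if (resultIndex >= range.startIndex && resultIndex <= range.endIndex) {
                return range.serviceUrl;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<ServiceRange> ranges = Arrays.asList(
                new ServiceRange("https://hostA.example/PersonRegistry", 0, 41),
                new ServiceRange("https://hostB.example/PersonRegistry", 42, 99));
        // Result 57 falls in the second range.
        System.out.println(sourceOf(ranges, 57)); // prints https://hostB.example/PersonRegistry
    }
}

// Hypothetical stand-in for one per-service entry in the status resource property.
class ServiceRange {
    final String serviceUrl;
    final int startIndex; // inclusive (assumption)
    final int endIndex;   // inclusive (assumption)

    ServiceRange(String serviceUrl, int startIndex, int endIndex) {
        this.serviceUrl = serviceUrl;
        this.startIndex = startIndex;
        this.endIndex = endIndex;
    }
}
```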
Federated Query Resource Properties • The processing status resource property supports WS-Notification • Notification allows the service to “push” information to the client, typically in response to some server-side event • Clients may subscribe to this resource property and receive updates when it changes • Allows clients to wait for processing to complete without continuously polling the isProcessingComplete() method of the FederatedQueryResultsClient • Can be used to keep clients apprised of current query status • Progress bar, status box, logging, etc.
Federated Query Workflow Example: • A client wishes to submit a query to the FQP service • Query accesses multiple target data services • One or more of which is transient on the grid and might not be available at query time • Involves secure data services which require authentication from the client • Client wishes to use his own credentials, so the query returns data he is authorized to see • The client will retrieve the results of the query at a later time • The client does not want to busy-wait on the FQP service so it can go do other processing tasks
Federated Query Workflow: • Client delegates credential to FQP using CDS • Client initiates Asynchronous DCQL query • Passes credential reference from CDS • Creates Query Execution Parameters to allow partial results • FQP Contacts CDS for the user’s credential • Credential is validated for same identity as caller • FQP creates a query results resource • Reference to the resource is returned • Query processing begins using the thread pool • Generated CQL is broadcast to involved data services • Client subscribes to the results resource status property • FQP completes processing and notifies the client • Client retrieves results from the resource
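The shape of this workflow (start the query, get a handle back immediately, be notified on completion, then fetch results) can be illustrated with plain JDK concurrency. This is only an analogy under stated assumptions: `AsyncQuerySketch` and `startQuery` are hypothetical names, a `CompletableFuture` stands in for the query results resource, and no caGrid, CDS, or WS-Notification calls are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class AsyncQuerySketch {
    // Starts the (hypothetical) query on the pool and returns a handle at once,
    // much as the asynchronous call returns a reference to the results resource.
    static CompletableFuture<List<String>> startQuery(Supplier<List<String>> query,
                                                      ExecutorService pool) {
        return CompletableFuture.supplyAsync(query, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<List<String>> handle =
                    startQuery(() -> Arrays.asList("row-1", "row-2"), pool);
            // "Subscribe": this callback runs once processing completes.
            CompletableFuture<Void> notified = handle.thenAccept(
                    rows -> System.out.println("Complete: " + rows.size() + " results"));
            // The client is free to do other work here instead of busy-waiting.
            notified.join();
            // Retrieve the results from the handle.
            System.out.println(handle.join());
        } finally {
            pool.shutdown();
        }
    }
}
```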
Federated Query Workflow: Client • In this example, a query is started using the delegated credential and query execution parameters features • The client’s credential is loaded • The delegation parameters are specified • Path length, delegatee, etc • The CDS is contacted using the special delegation user client to delegate the credential according to the parameters • The FQP service client is created using the client’s credential • A DCQL query is created or loaded • Query execution parameters specify 10 retries, 30 seconds apart if a target service fails • The query is executed asynchronously, passing the delegation reference and the query execution parameters
Federated Query Workflow: Client (contd.) • Client loads delegation credentials and configures delegation parameters: • //load client credential • GlobusCredential proxyCredential = new GlobusCredential(userCertFile.getAbsolutePath(), userKeyFile.getAbsolutePath()); • //configure delegation lifetime, path length, policy etc. • ProxyLifetime delegationLifetime = new ProxyLifetime(); • ProxyLifetime issuedCredentialLifetime = new ProxyLifetime(); • //credential may not be further delegated • int issuedCredentialPathLength = 0; • //key length of the credential • int keySize = ClientConstants.DEFAULT_KEY_SIZE; • //path length only needs to be 1 • int delegationPathLength = 1; • IdentityDelegationPolicy policy = Utils.createIdentityDelegationPolicy(Collections.singletonList("fqp identity"));
Federated Query Workflow: Client (contd.) • Client contacts the CDS using the special delegation user client to delegate the credential according to the parameters: • //delegate the credential • DelegationUserClient cdsClient = new DelegationUserClient(cdsUrl, proxyCredential); • DelegatedCredentialReference credentialRef = cdsClient.delegateCredential(delegationLifetime, delegationPathLength, policy, issuedCredentialLifetime, issuedCredentialPathLength, keySize);
Federated Query Workflow: Client (contd.) • The FQP service client is created using the client’s credential: • //create the FQP client • FederatedQueryProcessorClient client = new FederatedQueryProcessorClient(fqpUrl, proxyCredential); • Query execution parameters are specified: • //execution parameters: retry 10 times, 30 seconds between retries • QueryExecutionParameters execParams = new QueryExecutionParameters(); • TargetDataServiceQueryBehavior behavior = new TargetDataServiceQueryBehavior(); • behavior.setFailOnFirstError(Boolean.FALSE); • behavior.setRetries(Integer.valueOf(10)); • behavior.setTimeoutPerRetry(Integer.valueOf(30)); • execParams.setAlwaysAuthenticate(Boolean.TRUE); • execParams.setTargetDataServiceQueryBehavior(behavior);
Federated Query Workflow: Client (contd.) • A DCQL query is created or loaded: • //create a query (more on next slide) • DCQLQuery query = new DCQLQuery(); • //execute the query asynchronously • FederatedQueryResultsClient resultsClient = client.query(query, credentialRef, execParams);
Federated Query Workflow: Client (contd.) • Creating a DCQL query for the OPRN use case:
<?xml version="1.0" encoding="UTF-8"?>
<DCQLQuery xmlns:cql="http://CQL.caBIG/1/gov.nih.nci.cagrid.CQLQuery" xmlns="http://caGrid.caBIG/1.0/gov.nih.nci.cagrid.dcql">
  <TargetObject name="com.example.Person">
    <Group logicRelation="AND">
      <ForeignAssociation targetServiceURL="https://somehost:8443/wsrf/services/cagrid/SpecimenRegistry">
        <JoinCondition localAttributeName="ssn" foreignAttributeName="patientSSN" predicate="EQUAL_TO"/>
        <ForeignObject name="org.example.Specimen">
          <Attribute name="sampleid" value="103" predicate="GREATER_THAN"/>
        </ForeignObject>
      </ForeignAssociation>
      <Attribute name="lastName" value="Foo%" predicate="LIKE"/>
    </Group>
  </TargetObject>
  <targetServiceURL>https://anotherhost:8443/wsrf/services/cagrid/PersonRegistry</targetServiceURL>
</DCQLQuery>
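To make the meaning of this query concrete, the sketch below evaluates the same inner join over in-memory lists: persons whose last name starts with "Foo" and who have at least one specimen with `sampleid` greater than 103, joined on `ssn` = `patientSSN`. It is an illustration only. The `Person` and `Specimen` classes are hypothetical, `sampleid` is assumed to compare numerically, and the two-step order (foreign query first, then target query) is an assumption about how a processor could resolve the join, not a description of the FQP internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DcqlJoinSketch {
    // Hypothetical domain objects mirroring the attributes named in the query.
    static class Person {
        final String ssn;
        final String lastName;
        Person(String ssn, String lastName) { this.ssn = ssn; this.lastName = lastName; }
    }

    static class Specimen {
        final String patientSSN;
        final int sampleid;
        Specimen(String patientSSN, int sampleid) { this.patientSSN = patientSSN; this.sampleid = sampleid; }
    }

    static List<Person> run(List<Person> persons, List<Specimen> specimens) {
        // Step 1: the foreign query, Specimen where sampleid > 103.
        Set<String> joinValues = new HashSet<>();
        for (Specimen s : specimens) {
            if (s.sampleid > 103) {
                joinValues.add(s.patientSSN);
            }
        }
        // Step 2: the target query, Person where lastName LIKE 'Foo%'
        // AND ssn equals one of the patientSSN values from step 1.
        List<Person> matches = new ArrayList<>();
        for (Person p : persons) {
            if (p.lastName.startsWith("Foo") && joinValues.contains(p.ssn)) {
                matches.add(p);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Person> persons = Arrays.asList(
                new Person("111", "Foobar"), new Person("222", "Foote"), new Person("333", "Smith"));
        List<Specimen> specimens = Arrays.asList(
                new Specimen("111", 200), new Specimen("222", 50), new Specimen("333", 500));
        for (Person p : run(persons, specimens)) {
            System.out.println(p.ssn + " " + p.lastName); // prints "111 Foobar"
        }
    }
}
```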
Federated Query Workflow: Handling Results • Three ways to handle results: • Simple SOAP transfer • WS-Enumeration • caGrid Transfer
Federated Query Engine API (FQE API) • Core component of the FQP • Can be used as a • stand-alone API or, • within the context of the FQP grid service
FQE API: Constructing an Instance • There are two constructors for the Federated Query Engine: • public FederatedQueryEngine(GlobusCredential credential, QueryExecutionParameters executionParameters) • public FederatedQueryEngine(GlobusCredential credential, QueryExecutionParameters executionParameters, ExecutorService workExecutor) • The three parameters can be used as follows: • credential • A Globus client credential can be passed along to the Federated Query Engine, and will be used to query secure data services involved in any DCQL queries issued to the engine. • executionParameters • Query Execution Parameters (described later) allow the user to define how they'd like the engine to behave with respect to things like target data service failures, retries, and timeouts. • workExecutor • The Federated Query Engine will perform query related tasks in threads. This has the benefit of potentially greatly speeding up the final stage of query processing, which involves broadcasting the final CQL query generated by the engine to all target data services specified by the DCQL query. The Executor Service passed in through this parameter allows users to control the way those threads are allocated and managed.
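The role of the `workExecutor` parameter can be sketched with the JDK alone: one task per target data service is handed to a caller-supplied `ExecutorService`, so the caller decides how many threads do the broadcasting. This is a rough illustration under assumptions, not the engine's code. `BroadcastSketch`, `broadcast`, and the `queryOneService` function are hypothetical, and results are modeled as lists of strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class BroadcastSketch {
    // Sends the same query to every target URL in parallel and collects the
    // per-service results, keyed by URL in the original order.
    static Map<String, List<String>> broadcast(List<String> targetUrls,
                                               Function<String, List<String>> queryOneService,
                                               ExecutorService workExecutor) {
        Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
        for (String url : targetUrls) {
            Callable<List<String>> task = () -> queryOneService.apply(url);
            pending.put(url, workExecutor.submit(task));
        }
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Future<List<String>>> entry : pending.entrySet()) {
            try {
                results.put(entry.getKey(), entry.getValue().get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("query failed for " + entry.getKey(), e.getCause());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // The caller controls thread allocation by choosing the executor.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Map<String, List<String>> results = broadcast(
                    Arrays.asList("https://hostA.example/svc", "https://hostB.example/svc"),
                    url -> Arrays.asList("result from " + url),
                    pool);
            System.out.println(results);
        } finally {
            pool.shutdown();
        }
    }
}
```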
FQE API: Query Execution Parameters • QueryExecutionParameters: Control various aspects of query execution. • TargetDataServiceQueryBehavior: Controls how query engine handles failure conditions when submitting CQL to target data services specified in DCQL query. • Three properties of TargetDataServiceQueryBehavior data type: • failOnFirstError • Type: Boolean • If this property is set to true, the other two properties are meaningless. • This property controls how the query engine handles failures while querying target data services. • If set to true, the engine will terminate query processing and throw an exception when querying against any target data service fails for any reason. No query results will be returned. • If set to false, the other two parameters are used to determine how to handle the failure, and a partial result set may be returned. • retries • Type: Integer • This property specifies the number of times the query engine will retry a query against a target data service if it fails to execute. • timeoutPerRetry • Type: Integer • This property specifies the number of seconds the query engine will wait before retrying a query against a target data service if it fails to execute.
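How the three properties interact can be sketched as a small retry loop. The code below is a hedged illustration, not the engine's implementation: `RetrySketch`, `Behavior`, and `queryWithRetries` are hypothetical names, and a skipped service is represented here by a `null` result, which is an assumption made for the sketch.

```java
import java.util.function.Supplier;

class RetrySketch {
    // Hypothetical mirror of the three TargetDataServiceQueryBehavior properties.
    static class Behavior {
        boolean failOnFirstError;
        int retries;
        int timeoutPerRetrySeconds;
    }

    // Runs the query against one target service. Returns null when the service
    // is skipped after all retries fail (the partial-results case).
    static <T> T queryWithRetries(Supplier<T> query, Behavior behavior) {
        int retriesUsed = 0;
        while (true) {
            try {
                return query.get();
            } catch (RuntimeException e) {
                if (behavior.failOnFirstError) {
                    throw e; // terminate processing; the other two properties are ignored
                }
                if (retriesUsed >= behavior.retries) {
                    return null; // give up on this service only
                }
                retriesUsed++;
                pause(behavior.timeoutPerRetrySeconds * 1000L);
            }
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Behavior behavior = new Behavior();
        behavior.failOnFirstError = false;
        behavior.retries = 2;
        behavior.timeoutPerRetrySeconds = 0; // no waiting in this demo
        int[] calls = {0};
        // The simulated service fails twice, then answers.
        String result = queryWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service unavailable");
            }
            return "42 rows";
        }, behavior);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "42 rows after 3 attempts"
    }
}
```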
FQE API: Public API Methods • Simple Query Execution The execute method takes a single DCQL query parameter and returns a single DCQLQueryResultsCollection instance. This method may throw a FederatedQueryProcessingException. public DCQLQueryResultsCollection execute(DCQLQuery dcqlQuery) throws FederatedQueryProcessingException • Execute and Aggregate Results The executeAndAggregateResults method takes a single DCQL query parameter and returns a single CQLQueryResults instance. This method may also throw a FederatedQueryProcessingException. public CQLQueryResults executeAndAggregateResults(DCQLQuery dcqlQuery) throws FederatedQueryProcessingException
FQE API: Federated Query Processing Exceptions A FederatedQueryProcessingException may be thrown from either public API method when something goes wrong in the course of processing a DCQL query. Several common causes of this exception are: • Failure of a data service involved in the DCQL query • Failure handling behavior for target data services is controllable by the Query Execution Parameters used to construct the Federated Query Engine • Invalid CQL is passed along to a data service (typically due to invalid DCQL originally) • Bad / unrecognized user certificate
FQE API: Query Processing Status Listeners • addStatusListener • Parameter: A processing status listener to be added to the engine's list of listeners • Returns: none • Adds a status listener instance to the list of listeners which will be notified of various query processing events • getStatusListeners • Parameter: none • Returns: An array of status listeners which are registered to the engine • removeStatusListener • Parameter: A processing status listener to be removed from the engine's list of listeners • Returns: boolean true if the listener was found and removed, false otherwise
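The add / get / remove contract above follows the usual listener-registry pattern. A minimal sketch of that pattern follows. The `StatusListener` interface, its `statusChanged` method, and the `fire` helper are hypothetical, since the slides do not show the real listener type.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class StatusListenerRegistry {
    // CopyOnWriteArrayList allows listeners to be added or removed while
    // another thread is iterating to deliver an event.
    private final List<StatusListener> listeners = new CopyOnWriteArrayList<>();

    void addStatusListener(StatusListener listener) {
        listeners.add(listener);
    }

    StatusListener[] getStatusListeners() {
        return listeners.toArray(new StatusListener[0]);
    }

    // Returns true if the listener was found and removed, false otherwise.
    boolean removeStatusListener(StatusListener listener) {
        return listeners.remove(listener);
    }

    // Hypothetical helper: notifies every registered listener of an event.
    void fire(String status) {
        for (StatusListener listener : listeners) {
            listener.statusChanged(status);
        }
    }

    public static void main(String[] args) {
        StatusListenerRegistry registry = new StatusListenerRegistry();
        StatusListener logger = status -> System.out.println("status: " + status);
        registry.addStatusListener(logger);
        registry.fire("Processing");
        System.out.println(registry.removeStatusListener(logger)); // prints true
        System.out.println(registry.removeStatusListener(logger)); // prints false
    }
}

// Hypothetical listener type; the real engine defines its own.
interface StatusListener {
    void statusChanged(String status);
}
```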
Federated Query Processor Client API (FQPC API) • The Federated Query Processor Client is the client-side API for communicating with the caGrid Federated Query Processor Service.
FQPC API: Constructing an Instance • There are four constructors for the Federated Query Processor Client: • public FederatedQueryProcessorClient(String url) throws MalformedURIException, RemoteException • public FederatedQueryProcessorClient(String url, GlobusCredential proxy) throws MalformedURIException, RemoteException • public FederatedQueryProcessorClient(EndpointReferenceType epr) throws MalformedURIException, RemoteException • public FederatedQueryProcessorClient(EndpointReferenceType epr, GlobusCredential proxy) throws MalformedURIException, RemoteException • Parameters: • The url parameter in the first two constructors is the URL of the Federated Query Processor Service you wish to connect to. • The epr parameter in the last two constructors is an Axis Endpoint Reference which resolves to the FQP service you wish to connect to. • The proxy parameter is a Globus Credential Proxy which you may use to authenticate to and communicate securely with the FQP service.
FQPC API: Connecting to a Secure FQP Service • Introduce-generated client code can use a certificate to communicate with its service securely. • Connection process: • Client makes a call to the service • Client checks security metadata • This tells the client how to configure itself to properly communicate with the service • Notes: • By default, Introduce-generated clients will connect anonymously to methods that allow both anonymous and non-anonymous access. • If the client wants to use its credentials to invoke a method: • Set the client to not connect anonymously: client.setAnonymousPrefered(false); • This forces the client to use its own credentials to communicate with the service • The client will always connect using its credentials unless reset to connect anonymously to methods that allow anonymous access • This matters because methods may change the way they work based on who they are talking to • The client credentials (proxy) can be changed: client.setProxy(newCredentials);
FQPC API: Public API Methods • Simple Query Execution The execute method takes a single DCQL query parameter and returns a single DCQLQueryResultsCollection instance. This method may throw a RemoteException or a FederatedQueryProcessingFault. public DCQLQueryResultsCollection execute(DCQLQuery dcqlQuery) throws RemoteException, FederatedQueryProcessingFault • Execute and Aggregate Results The executeAndAggregateResults method takes a single DCQL query parameter and returns a single CQLQueryResults instance. This method may also throw a RemoteException or a FederatedQueryProcessingFault. public CQLQueryResults executeAndAggregateResults(DCQLQuery dcqlQuery) throws RemoteException, FederatedQueryProcessingFault
FQPC API: Public API Methods • Asynchronous Query Execution The executeAsynchronously method takes a single DCQL query parameter and returns a single Federated Query Results Client instance. This method may also throw a Malformed URI Exception, and a Remote Exception. public FederatedQueryResultsClient executeAsynchronously(DCQLQuery query) throws RemoteException, org.apache.axis.types.URI.MalformedURIException
FQPC API: Public API Methods • Specialized Query Execution The Federated Query Processor Client offers an API to perform specialized DCQL queries in an asynchronous fashion. public FederatedQueryResultsClient query(DCQLQuery query, DelegatedCredentialReference delegatedCredentialReference, QueryExecutionParameters queryExecutionParameters) throws RemoteException, org.apache.axis.types.URI.MalformedURIException, FederatedQueryProcessingFault, InternalErrorFault • It takes three parameters: • query • The DCQL query to execute on the server • delegatedCredentialReference • A reference to a delegated credential. The Federated Query Processor service will execute queries against data services involved in the DCQL query using the delegated credential. This allows clients to perform queries against secure data services using their own credentials for authentication and authorization. • This parameter may be null. • queryExecutionParameters • Parameters which control the behavior of the query processor with respect to various failure and retry conditions, especially for target data services. • This parameter may be null.
Federated Query Results Client (FQRC) • The Federated Query Results Client is a caGrid service client which can retrieve results and information about the current state of query processing from a Federated Query Processor Service which has been previously issued a query. • The Federated Query Results Client has the same constructors that any standard Introduce generated service client would have, but only those constructors which take an Endpoint Reference Type should be used, since EPRs contain the necessary resource key to access the server-side query results resource.
FQRC: Public API Methods • FQRC supplies standard WS-ResourceLifetime methods • Can be used to set resource termination time • Can be used to dispose of the resource immediately • Other methods specific to the FQRC: • public boolean isProcessingComplete() throws RemoteException This method simply returns true if the Federated Query Processor Service has completed execution of the original DCQL query and results are available, or false otherwise. • public DCQLQueryResultsCollection getResults() throws RemoteException, ProcessingNotCompleteFault, FederatedQueryProcessingFault, InternalErrorFault This method gets the DCQL query results from the resource to which the Federated Query Results Client is connected. If processing has not yet completed (as indicated by the isProcessingComplete method), this will throw a Processing Not Complete Fault. Problems encountered while processing the query will cause a Federated Query Processing Fault to be thrown.
FQRC: Public API Methods (contd.) • Other methods specific to the FQRC (contd.): • public CQLQueryResults getAggregateResults() throws RemoteException, FederatedQueryProcessingFault, ProcessingNotCompleteFault, InternalErrorFault This method behaves very similarly to the Federated Query Engine's executeAndAggregateResults method. It gets the DCQL query results as a single, aggregate CQL Query Results instance which can be processed further with the standard data service tools. • public EnumerationResponseContainer enumerate() throws RemoteException, FederatedQueryProcessingFault, ProcessingNotCompleteFault, InternalErrorFault This method allows a client to make use of WS-Enumeration to retrieve results of DCQL query processing via an Enumeration client. • public TransferServiceContextReference transfer() throws RemoteException, FederatedQueryProcessingFault, ProcessingNotCompleteFault, InternalErrorFault This method allows a client to use caGrid's Transfer tools to retrieve results of DCQL query processing via the Transfer client tools. This can help with the speed of large result set retrieval.
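Because the result retrieval methods fault while processing is still under way, a client that does not use notifications would check isProcessingComplete() before asking for results. The sketch below shows that guard as a bounded polling loop. It is an illustration under assumptions: `ResultsHandle` is a hypothetical stand-in for the two FQRC calls, results are modeled as strings, and the poll interval and limit are arbitrary.

```java
import java.util.Collections;
import java.util.List;

class ResultsPollingSketch {
    // Hypothetical stand-in for the two results-client calls used here.
    interface ResultsHandle {
        boolean isProcessingComplete();
        List<String> getResults();
    }

    // Polls until processing completes, then fetches the results. Gives up
    // with an exception after maxPolls checks rather than waiting forever.
    static List<String> awaitResults(ResultsHandle handle, long pollMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (handle.isProcessingComplete()) {
                return handle.getResults();
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalStateException("query still processing after " + maxPolls + " polls");
    }

    public static void main(String[] args) {
        int[] checks = {0};
        // Simulated resource that reports completion on the third check.
        ResultsHandle handle = new ResultsHandle() {
            public boolean isProcessingComplete() { return ++checks[0] >= 3; }
            public List<String> getResults() { return Collections.singletonList("row-1"); }
        };
        System.out.println(awaitResults(handle, 10L, 5) + " after " + checks[0] + " checks");
    }
}
```

Subscribing to the processing status resource property, as described earlier in the deck, avoids this loop entirely; polling is the simpler fallback.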
References • CQL 2: http://cagrid.org/display/dataservices/CQL+2 • DCQL 2: http://cagrid.org/display/fqp/DCQL+2 • FQP APIs: http://cagrid.org/display/fqp13/Developers+Guide#FederatedQueryResultsClient-APIMethods