Programming Gridflows using Matrix
Arun Jagatheesan
Architect, SDSC Matrix
San Diego Supercomputer Center
SDSC Tech Talk
SDSC, UCSD
Talk Outline
• Where do we need this?
• Infrastructure-based Execution logic (Concept?)
• Matrix Project Overview (Who?)
• Data Grid Language and Programming
  • Gridflow Runnable (flowable)
  • Flow
  • Gridflow Metadata
  • ECAA rules
• Other benefits
• What Next: Straight Talk
Data handling pipeline (data information pipeline)
• Pipeline could be triggered by input at the data source or by a data request from a user
• Metadata derivation
• Ingest data
• Ingest metadata
• Determine analysis pipeline
• Initiate automated analysis
• Use the optimal set of resources based on the task, on demand
• Organize result data into distributed data grid collections
• All gridflow activities stored for data flow provenance
Generic Gridflow Scenario
• Application X, Application Y, Application Z
• May be different programming languages, programmers, different execution environments
• May be in different grid domains (sites)
• Pass data between each other during their execution
• SDSC Note: Might use a data grid environment that works!
Example for Generic Gridflow Pattern
• Ingest 1 million URLs into a digital library using a URL ingestor (or harvester: App X)
• For each URL, iterate with 5 parallel executions
  • Do some processing on the file (App Y)
  • Store the output file from App Y in a grid disk resource
  • Replicate a copy of the same file in a grid archive resource
  • Calculate MD5 checksum (App Z) for the file on disk
  • Calculate MD5 checksum (App Z) for the file in the archive
  • If the checksums mismatch, ingest a metadata warning flag
• Slide callouts: late binding; for each; rules (if checksums mismatch); gridflow metadata processing
Traditional way
• Write a customized program
  • Create a common program that can invoke the distributed or localized applications using appropriate client code
  • Hardwire all the apps (X, Y, Z) together
  • Have this customized program as the delegator invoking all other applications
  • Declare the necessary variables and also implement the rules/conditions [like checksum1 == checksum2]
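As a rough illustration of the traditional approach, here is a minimal sketch of such a hard-wired delegator in plain Java. It is not Matrix code: the class name `HardWiredGridflow`, the `process` stand-in for App Y, and the simulated disk and archive copies are all assumptions made for illustration. The point is that the parallelism (5), the order of the steps, and the checksum rule are all compiled into the program.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical hard-wired delegator; every name here is illustrative only.
class HardWiredGridflow {

    // Stand-in for App Z: MD5 checksum of the data as lowercase hex.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for App Y: "some processing" on the file behind a URL.
    static byte[] process(String url) {
        return ("processed:" + url).getBytes(StandardCharsets.UTF_8);
    }

    // The rule "checksum1 == checksum2" is coded directly into the program.
    static boolean checksumsMismatch(byte[] diskCopy, byte[] archiveCopy) {
        return !md5Hex(diskCopy).equals(md5Hex(archiveCopy));
    }

    // Runs the whole pipeline and returns the URLs that need a warning flag.
    static List<String> run(List<String> urls, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> {
                    byte[] disk = process(url);    // store on grid disk (simulated)
                    byte[] archive = disk.clone(); // replicate to archive (simulated)
                    return checksumsMismatch(disk, archive);
                }));
            }
            List<String> flagged = new ArrayList<>();
            for (int i = 0; i < urls.size(); i++) {
                if (pending.get(i).get()) {
                    flagged.add(urls.get(i));
                }
            }
            return flagged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `HardWiredGridflow.run(urls, 5)` would process the URLs five at a time. Changing the degree of parallelism, the order of steps, or the rule means editing and recompiling this program.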
Why take the Gridflow approach?
• What-if scenarios…
  • The infrastructure can run more or fewer things in parallel
  • The cyber-infrastructure has more resources for distribution (an app can be run at multiple places for different parameters: parameter sweep distribution)
  • Different metadata conditions or milestones
    • Run this till the molecule changes from green to red (or yellow)
  • Change in the sequence of execution itself (new app)
  • Process provenance is required
• Anyway, you are not coding/changing your application to fit into the gridflow environment (it’s the other way around). Make simple changes only in the execution logic…
Infrastructure-based Execution Logic
• Each gridflow has different executables
  • App X, App Y, App Z: Runnable or “Flowable”
• How should these flowables be run?
  • Parallel, sequential, for-each input item (pipeline), while, switch
  • Capture this as a Flow
• Is there a condition?
  • Run till exit value = 0 or till the molecule color changes to red
• Are there metadata variables? (color)
• Describe this execution logic separately
  • Loosely coupled, modified without compilation
  • Use an XML-based language
That is why we started the Matrix Project
• Movie break ….
• Language to describe and execute this infrastructure-based execution logic
• Software to design, query, and run this logic
“Flowable”
• Anything that can run in a gridflow
• Not called Runnable, because that name is already taken in Java by the threading paradigm
• Any app (a single execution of App X, Y, Z)
• Any SRB-based data grid step (to handle data)
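“Flowable” in Java
    // A step that runs the "md5" executable
    ExecuteProcessStep executeMD5 = new ExecuteProcessStep("executeMD5-Metadata", "md5");
    // Capture standard output into the gridflow variable $md5Sum
    executeMD5.setStdOut(new StreamData("$md5Sum", false));
    // Pass the value of $locationOfFile as the input parameter
    executeMD5.addParameterAsExpression("$locationOfFile");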
“Flowable” in DGL
    <ns1:Step stepID="executeMD5-Metadata">
      <ns1:Operation>
        <ns1:ExecuteProcessOp>
          <ns1:StdParams>
            <ns1:exeURI>md5</ns1:exeURI>
            <ns1:input name="$locationOfFile">
              <ns1:string>$locationOfFile</ns1:string>
            </ns1:input>
            <ns1:std_out>
              <ns1:StdStreamData>
                <ns1:variable>$md5Sum</ns1:variable>
              </ns1:StdStreamData>
            </ns1:std_out>
          </ns1:StdParams>
        </ns1:ExecuteProcessOp>
      </ns1:Operation>
    </ns1:Step>
Data Grid Language (DGL)
• XML-based gridflow description
  • Describes execution flow logic
• ECA-based rule description for execution
  • ECA = Event, Condition, Action
• Querying of the status of a gridflow
  • XQuery / simple query of a gridflow execution
• Scoped variables and gridflow patterns
  • For control of execution flow logic
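To make the Event, Condition, Action idea concrete, the following is a minimal sketch in plain Java. It does not reflect the actual DGL rule syntax or the Matrix rule handler; the class `EcaSketch`, the event name `afterChecksums`, and the variable names `md5Disk`, `md5Archive`, and `warning` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical ECA rule holder; not the Matrix API.
class EcaSketch {

    // One rule: when `event` fires and `condition` holds, run `action`.
    static final class Rule {
        final String event;
        final Predicate<Map<String, Object>> condition;
        final Consumer<Map<String, Object>> action;

        Rule(String event,
             Predicate<Map<String, Object>> condition,
             Consumer<Map<String, Object>> action) {
            this.event = event;
            this.condition = condition;
            this.action = action;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(Rule rule) {
        rules.add(rule);
    }

    // Fires an event against the gridflow variables and
    // returns how many rules ran their action.
    int fire(String event, Map<String, Object> variables) {
        int fired = 0;
        for (Rule rule : rules) {
            if (rule.event.equals(event) && rule.condition.test(variables)) {
                rule.action.accept(variables);
                fired++;
            }
        }
        return fired;
    }

    // The checksum scenario as a rule: flag a warning when the checksums differ.
    static Rule checksumMismatchRule() {
        return new Rule(
            "afterChecksums",
            vars -> !String.valueOf(vars.get("md5Disk"))
                        .equals(String.valueOf(vars.get("md5Archive"))),
            vars -> vars.put("warning", "checksum mismatch"));
    }
}
```

Because the rule is data rather than compiled control flow, it could in principle be swapped or edited without touching the applications it governs. That is the property the slides attribute to describing execution logic separately.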
Gridflow Patterns
• These basic things can be combined together
  • E.g. execute all 9 flowables in parallel
  • Switch based on color:
    • Red: App X
    • Green: App Y
• Gridflow patterns
  • Sequential, Parallel, For-Each-Parallel, For-Each-Sequential, Switch, While / Milestone processing
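The way such patterns combine can be sketched with a small composite in plain Java. This is only an illustration under assumed names (`PatternSketch`, its `Flowable` interface, and the `sequential`, `parallel`, and `switchOn` helpers are hypothetical); the real Matrix classes may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical pattern combinators; not the Matrix API.
class PatternSketch {

    // Anything that can run in a gridflow.
    interface Flowable {
        void run();
    }

    // Runs the members one after another, in order.
    static Flowable sequential(Flowable... members) {
        return () -> {
            for (Flowable member : members) {
                member.run();
            }
        };
    }

    // Starts all members asynchronously and waits for every one to finish.
    static Flowable parallel(Flowable... members) {
        return () -> {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (Flowable member : members) {
                futures.add(CompletableFuture.runAsync(member::run));
            }
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        };
    }

    // Runs the member registered for the selector's current value, if any.
    static Flowable switchOn(Supplier<String> selector, Map<String, Flowable> cases) {
        return () -> {
            Flowable chosen = cases.get(selector.get());
            if (chosen != null) {
                chosen.run();
            }
        };
    }
}
```

Since each combinator returns another `Flowable`, patterns nest freely. For example, a `sequential` flow could contain a `parallel` block followed by a `switchOn` keyed on a color variable.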
Gridflow Pattern in Java
    // For each file in the collection list, do some processing
    ForEachFlow forEach = new ForEachFlow("forEachFlow", "file", new CollectionList("$collectionList"));
    // Could also specify how many files are to be handled in parallel
    // DGL (XML) code would be generated from this
Flow
• Scoped variables that can control the flow
• Logic used by the sub-members
• Sub-members that are the real execution statements
Gridflow Variable in Java
    /* Create a variable called "collectionList" with an initial value of "empty".
       This variable is a string now, but will later be used to hold a CollectionList.
       This is OK because variables are dynamically typed in DGL. */
    processFilesFlow.addVariable("collectionList", "empty");
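One way to picture scoped, dynamically typed variables is a chain of scopes, where a sub-flow sees its own variables plus those of the enclosing flows. The sketch below is an assumption about how this could work, not the Matrix implementation; the `ScopeSketch` class and its `get` and `set` methods are hypothetical, and only `addVariable` mirrors a name from the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variable scope for a flow; not the Matrix API.
class ScopeSketch {
    private final ScopeSketch parent; // enclosing flow's scope, or null at the top
    private final Map<String, Object> values = new HashMap<>();

    ScopeSketch(ScopeSketch parent) {
        this.parent = parent;
    }

    // Declares a variable in this scope with an initial value of any type.
    void addVariable(String name, Object initialValue) {
        values.put(name, initialValue);
    }

    // Looks the name up in this scope first, then in the enclosing scopes.
    Object get(String name) {
        for (ScopeSketch scope = this; scope != null; scope = scope.parent) {
            if (scope.values.containsKey(name)) {
                return scope.values.get(name);
            }
        }
        throw new IllegalArgumentException("undefined variable: " + name);
    }

    // Assigns to the nearest scope that declares the name.
    // The new value may have a different type than the old one.
    void set(String name, Object value) {
        for (ScopeSketch scope = this; scope != null; scope = scope.parent) {
            if (scope.values.containsKey(name)) {
                scope.values.put(name, value);
                return;
            }
        }
        throw new IllegalArgumentException("undefined variable: " + name);
    }
}
```

In this sketch a variable that starts as the string "empty" can later hold a list, and a variable declared inside a sub-flow is invisible to the enclosing flow.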
Data Grid Request
• Annotations about the Data Grid Request
• Can be either a Flow or a Status Query
DGL Requests
• Data Grid Flow
  • An XML structure that describes the execution logic, associated procedural rules, and grid environment variables
• Status Query
  • An XML structure used to query the execution status of any gridflow or sub-flow at any level of granularity
• A DGL or Matrix client sends either of these to the Matrix Server
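Querying status at any level of granularity can be pictured as a lookup by ID in a tree of flows and steps. The following is a minimal sketch of that idea, not the Matrix status query handler: the class `StatusQuerySketch`, the `Status` values, and the depth-first search are all assumptions, and a real status query is an XML request sent to the server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical status lookup over a tree of flows and steps; not the Matrix API.
class StatusQuerySketch {

    enum Status { PENDING, RUNNING, DONE }

    // A flow or a step, identified by its ID, with optional sub-members.
    static final class Node {
        final String id;
        Status status = Status.PENDING;
        final List<Node> children = new ArrayList<>();

        Node(String id) {
            this.id = id;
        }

        // Adds a sub-member and returns this node to allow chaining.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first search for the given ID; returns null if it is not found.
    static Status query(Node root, String id) {
        if (root.id.equals(id)) {
            return root.status;
        }
        for (Node child : root.children) {
            Status found = query(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }
}
```

The same call works for a whole gridflow, a sub-flow, or a single step, because every level of the tree carries its own ID and status.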
Matrix Gridflow Server Architecture
(Architecture diagram; component labels follow.)
• Event Publish Subscribe, Notification
• JMS Messaging Interface
• JAXM Wrapper
• WSDL Description
• SOAP Service for Matrix Clients
• Matrix Data Grid Request Processor
• Sangam P2P Gridflow Broker and Protocols
• Transaction Handler
• Workflow Query Processor
• Status Query Handler
• Flow Handler and Execution Manager
• XQuery Processor
• Gridflow Metadata Manager
• ECA Rules Handler
• Persistence (Store) Abstraction
• Matrix Agent Abstraction
• SDSC SRB Agents
• Other SDSC Data Services
• Agents for Java, WSDL and other grid executables
• JDBC
• In-Memory Store
Matrix Folks (Emeritus)
(Callout: “Don’t you guys have a group picture?”)
• Jonathan Weinberg
• Daniel Moore
• Allen Ding
• Reena Mathew
• Erik Vandekieft
SRB Java Folks
(Callout: “Hey, the guy on the right is all talk and no walk”)
• Luke, the Jargon man: one-man development team
• Says he works on strategies for SRB Java software
Advantages from SRB Perspective
• Reduces client-server communication
  • The whole execution logic is sent to the server
  • Fewer WAN messages
  • Our experiments show a significant increase in performance
• Datagrid Information Lifecycle Management
  • Autonomic: “Move data at 9:00 PM on weekdays and on weekends”
• Data Grid Administration
• Power users and sophisticated users
  • Data Grid Administrator (rules to manage the data grid)
  • Scientist or Librarian (visualized data flow programming)
Using DG-Modeler
• GUI for dataflow programming
Gridflow Process I (Vision)
• End user using DGBuilder
• Gridflow description in Data Grid Language
Gridflow Process II (Vision)
• Abstract gridflow using Data Grid Language
• Planner
• Concrete gridflow using Data Grid Language
Gridflow Process III (Vision)
• Concrete gridflow using Data Grid Language
• Gridflow Processor
• Gridflow P2P Network
Got ideas/suggestions?
Contact: SDSC Matrix project, arun@sdsc.edu
Google keyword: SDSC Gridflow