Model of a real workflow
And issues to discuss…
PlasmoDB workflow
• P. falciparum Standard genome
• P. vivax Standard genome
• P. yoelii Standard genome
• P. berghei Standard genome
• P. chabaudi Standard genome
• P. knowlesi Standard genome
• P. reichenowi Standard genome
• P. gallinaceum Standard genome
• P. falciparum Non-standard synteny
Standard Genome Workflow
Legend:
• Subflows (double line)
• Global steps (oval)
• Compile time Include/Exclude
Nodes:
• Calculate Translated protein
• Splign
• NRDB
• taxonomy
• SO
• TIGR TGI
• In: Pf, Pk
• In: Pf, Pb, Py, Pv
• Genome
• Extract genomic sequence
• Copy genomic seqs to cluster
• Extract proteins
• Copy proteins to cluster
• Molecular weight
• Min/max molecular weight
• Isoelectric point
• run TMHMM
• blastx Nrdb genome
• psipred
• blastp nrdb proteins
• Load TMHMM
NRDB
• NRDB resource
• Copy from download site
• Shorten defline
• Copy to cluster
• Copy to cluster
Resources
• acquire
• unpack
• ext db
• ext db rls
• insert
Psipred
• create psipred data dir
• fix protein IDs for psipred
• create psipred task dir
• copy data dir to cluster
• copy psipred protein file to cluster
• start psipred on cluster
• wait for cluster
• copy psipred files from cluster
• fix psipred file names
• make Alg Inv
• load psipred
BLAST
• Create similarity dir
• Start blast
• Wait for cluster
• Copy files from cluster
• Extract IDs from blast result
• Optional step (runtime test)
• Load subject subset
• Load result
Splign
• Extract query sequence, alt defline
• Extract subject sequence, alt defline
• runSplign
• insertSplign
Graph file -- features --
• Workflow xml file
  • Subflows
  • Parameters
  • Constants
  • Interpolating variables
• Global steps
  • Steps that are only executed once by the whole workflow, even if in multiple subflows
  • Declare a namespace?
• Include/exclude
  • Compile time inclusion/exclusion
  • If not compiled in, flow passes right through
• Skip-able steps
  • Runtime exclusion, based on a dynamic test
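The include/exclude and global-step behaviour above can be sketched roughly as follows. This is an illustration only, not the workflow system's implementation: the `Step` class, `compile_flow`, the step names, and the choice of which step carries the "In: Pf, Pk" restriction are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for one node of a graph file."""
    name: str
    include_for: frozenset = frozenset()  # empty means "every project"
    exclude_for: frozenset = frozenset()
    is_global: bool = False


def compile_flow(steps, project, done_globals):
    """Return the names of the steps compiled in for `project`.

    A step that is not compiled in is simply dropped, so flow passes
    right through to the next step. A global step is kept only the
    first time any subflow reaches it.
    """
    compiled = []
    for step in steps:
        if step.include_for and project not in step.include_for:
            continue
        if project in step.exclude_for:
            continue
        if step.is_global:
            if step.name in done_globals:
                continue
            done_globals.add(step.name)
        compiled.append(step.name)
    return compiled


# Assumed example flow; the include list on Splign is illustrative.
genome_flow = [
    Step("NRDB", is_global=True),
    Step("extractProteins"),
    Step("Splign", include_for=frozenset({"Pf", "Pk"})),
    Step("runTMHMM"),
]

done = set()  # shared across subflows so global steps run once
pf_steps = compile_flow(genome_flow, "Pf", done)
pv_steps = compile_flow(genome_flow, "Pv", done)
```

Here the second subflow sees neither the global step (already done) nor the step it was not compiled in for. Skip-able steps are a different mechanism: they stay in the compiled flow and are excluded by a test at runtime.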
Graph file -- sharing across projects --
• Live in svn: ApiCommonData/Load/lib/xml/workflow
• Found by system in $GUS_HOME/lib/xml/workflow
• Shared across all projects
  • Use include/exclude to specify project specific functionality
  • Therefore, each build must be on its own branch, to avoid interference
Graph file -- step values --
• Avoid side effects in file system (ok in database)
  • All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles
• Avoid hard-coded values
  • Use Constants
• Avoid hand-coded values that change each build
  • Must be computed by step
  • E.g. the blast Y= value
• External Db Rls values
  • Always pass external db rls spec, e.g.
    • Plasmodium Falciparum Chromosomes:2008-07-13
  • Upgrade steps to conform to this
• Table names
  • Want to be able to reuse these values across steps
  • Always use same format, e.g.:
    • Dots.ExternalNaSequence
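A minimal sketch of what "always use the same format" could mean for the two value types above. It assumes the external db rls spec is `name:version` split on the last colon, and that table names are `Schema.Table`. Both function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class ExtDbRlsSpec(NamedTuple):
    name: str
    version: str


def parse_ext_db_rls_spec(spec: str) -> ExtDbRlsSpec:
    """Split 'name:version' on the last colon (an assumed convention)."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return ExtDbRlsSpec(name.strip(), version.strip())


def parse_table_name(qualified: str) -> Tuple[str, str]:
    """Split 'Schema.Table' and reject any other shape."""
    schema, sep, table = qualified.partition(".")
    if not sep or not schema or not table or "." in table:
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return schema, table


spec = parse_ext_db_rls_spec("Plasmodium Falciparum Chromosomes:2008-07-13")
schema, table = parse_table_name("Dots.ExternalNaSequence")
```

Validating these once, where the graph file is read, would let every step reuse the same values without re-parsing them differently.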
Graph file -- cluster --
• Wait for cluster step
  • Sends email
    • (takes list of email addresses as config. Maybe we should set up mailing list?)
  • Followed by a waitForHuman step.
    • By default is in “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
    • Pilot can turn that off, and it will run
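One way to read "orthogonal to other states and offline status" is that the wait-for-human hold is a separate flag, not another value of the step state. The sketch below assumes that reading. `StepStatus`, the `"READY"` state name, and `new_wait_for_human_step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepStatus:
    state: str = "READY"          # hypothetical state name
    offline: bool = False         # independent of state
    wait_for_human: bool = False  # independent of state and offline

    def runnable(self) -> bool:
        """A step runs only if ready, online, and not held for a human."""
        return (self.state == "READY"
                and not self.offline
                and not self.wait_for_human)


def new_wait_for_human_step() -> StepStatus:
    """waitForHuman steps start out held by default."""
    return StepStatus(wait_for_human=True)


step = new_wait_for_human_step()
held = step.runnable()

# The pilot turns the hold off; nothing else about the step changes.
step.wait_for_human = False
released = step.runnable()
```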
Graph file -- resources pipeline --
• We still use a resources.xml file
  • Needed by the front end
    • Pubmed
    • Descriptions
    • Data sources and attributions
• Handled by a regular subflow
• Only one unpack step
  • Current multiple unpack steps need to be combined into a simple script
• Dedicated step classes:
  • ApiCommonData::Load::Step::AcquireExternalResource
  • ApiCommonData::Load::Step::UnpackExternalResource
  • ApiCommonData::Load::Step::InsertExternalDatabase
  • ApiCommonData::Load::Step::InsertExternalDatabaseRelease
  • ApiCommonData::Load::Step::InsertExternalResource
  • Are subclasses of ApiCommonData::Load::Step::AcquireExternalStep
    • Knows how to parse the resources.xml file
Configuration files
• Steps Configuration
  • Global
    • Commonly used properties
    • Not validated until runtime
  • Static
    • Defined per step class
    • Convenient, often all that is necessary
  • Cascading?
  • Multi-steps file
• Distinguish between stable properties and mutable ones
  • Version numbers often change
  • Svn?
• Pilot configuration?
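If configuration were made cascading, one possible shape is a lookup that tries step-specific values first, then the step class, then the global properties. The sketch below assumes exactly that order. Every property name and value in it is a made-up placeholder, not a real setting.

```python
from collections import ChainMap

# Placeholder values only; none of these are real settings.
global_props = {"clusterServer": "cluster.example.org",
                "notifyEmail": "team@example.org"}
step_class_props = {
    "psipred": {"toolVersion": "placeholder-version"},
}


def step_config(step_class, overrides=None):
    """Most specific first: per-step overrides, step class, then global."""
    return ChainMap(overrides or {},
                    step_class_props.get(step_class, {}),
                    global_props)


def require(config, key):
    """Properties are checked only when a step asks for them (runtime)."""
    if key not in config:
        raise KeyError(f"missing config property: {key}")
    return config[key]


cfg = step_config("psipred", overrides={"notifyEmail": "pilot@example.org"})
email = require(cfg, "notifyEmail")      # per-step override wins
server = require(cfg, "clusterServer")   # falls through to global
version = require(cfg, "toolVersion")    # comes from the step class
```

Keeping mutable values such as version numbers in their own layer would also make it easier to separate them from the stable properties.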
Runtime File & Directory Structure
• Avoid side-effects
  • Use explicit input/output params in xml file
• Move to a nested data directory structure?
  /files/cbil/data/cbil/Plasmodb/5.5/workflow/data/
    Seqfiles/ nrdb.fsa Pvivax/ Seqfiles/ Psipred/ Assembly/ ESTs/ Initial/ Intermediate/
  • Would use the namespace attribute, somehow
  • Use path statement, e.g.:
    • ../
    • ../tmhmm
• Steps directories
  • Use nested structure for subflows?
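A small sketch of how relative path statements such as `../` and `../tmhmm` might resolve inside a nested data directory. The `resolve` helper is hypothetical, and placing `Psipred/` under `Pvivax/` is an assumption about the nesting.

```python
import posixpath

DATA_ROOT = "/files/cbil/data/cbil/Plasmodb/5.5/workflow/data"


def resolve(current_dir: str, path_statement: str) -> str:
    """Resolve a relative path statement against a subflow's data dir."""
    return posixpath.normpath(posixpath.join(current_dir, path_statement))


# Assumed location of one subflow's directory inside the nested tree.
psipred_dir = posixpath.join(DATA_ROOT, "Pvivax", "Psipred")

parent_dir = resolve(psipred_dir, "../")
tmhmm_dir = resolve(psipred_dir, "../tmhmm")
```

Here `parent_dir` is the `Pvivax` directory and `tmhmm_dir` is a sibling of `Psipred`, so a subflow can refer to its neighbours without hard-coding the full path.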
External Files Repository
• Do we need it?
• If so, what needs to be improved?
Documentation of the workflow
• Workflow must be able to run in “documentation” mode
  • Doesn’t run any steps
  • Instead, produces documentation as expected by front end
    • Methods xml file
    • Resources xml file
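A rough sketch of a "documentation" mode: the same step list is walked, but each step contributes a description instead of being executed. `DocStep`, `run_workflow`, and the sample steps are hypothetical. Real output would be the methods and resources xml files rather than plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DocStep:
    name: str
    description: str
    action: Callable[[], None]


def run_workflow(steps: List[DocStep], documentation_mode: bool = False) -> List[str]:
    """Run every step, or only collect descriptions in documentation mode."""
    docs = []
    for step in steps:
        if documentation_mode:
            docs.append(f"{step.name}: {step.description}")
        else:
            step.action()
    return docs


executed = []  # records which steps actually ran
steps = [
    DocStep("runTMHMM", "Predict transmembrane domains",
            lambda: executed.append("runTMHMM")),
    DocStep("loadTMHMM", "Load predictions into the database",
            lambda: executed.append("loadTMHMM")),
]

docs = run_workflow(steps, documentation_mode=True)
ran_in_doc_mode = list(executed)  # snapshot before the real run

run_workflow(steps)
```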
GUI
• Should it run in the web context?
  • Security issues
  • Avoids having to have installed software
  • Would work from home
  • All members of team could see the flow
    • Somehow restrict editability
  • Could be posted on real site as documentation?
    • Overkill? Too detailed?
• Needs to handle subflows
  • Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
    • Gray out paths that are offline
  • Expand/collapse?
Mini-flows
• like mini-pipes, but for workflows…
Standard resources
• taxonomy
• SO
• EnzymeDB
• GO Codes
• GO
• NRDB
• dbEST [tax_id]
• Bibliographic Ref terms
• MO terms
• MO types
• MO
• MO Entry
• InterPro
• Orthomcl phyletic
• orthomcl
Plasmodb resources
• IEDB epitopes
• IEDB dbxrefs
• NA Genbank dbrefs
• AA Genbank dbrefs
• pdb
• Pdb index
P. falciparum resources
Watanabe Pf transcripts Watanabe Pf ESTs Zhang ESTs Florent ESTs Pf plastid Pf mitochon Pf GO Associations Sanger IT SNPs SU SNPs Broad SNPs Combined SNPs Winzeler Genetic Var. array MTC KI Array Winzeler Cell Cycle Winzeler Gametocyte Scripps Array Pfab Array DeRisi Array 7282 Daily Meta data GSE2265 Meta data Cowman Meta data Duraisingh Meta data Baum Meta data Waters Meta data E-MEXP 128 Meta data E-MEXP 439 Meta data E-MEXP 449 Meta data GSE5247 Meta data GSE8099 Meta data Daily Array data GSE2265 Array data Cowman Array data Duraisingh Array data Baum Array data Waters Array data E-MEXP 128 Array data E-MEXP 439 Array data E-MEXP 449 Array data GSE5247 Array data GSE8099 Array data Daily RAD anal GSE2265 RAD anal Cowman RAD anal Duraisingh RAD anal Baum RAD anal Waters RAD anal E-MEXP 128 RAD anal E-MEXP 439 RAD anal E-MEXP 449 RAD anal GSE5247 RAD anal GSE8099 RAD anal Waters Gametocyte Mass spec Waters Female Gametes mass Waters male Gametes mass Waters mixed Gametes mass Plasmodb Gene ids Sage tag Array design y2h Plasmo map interactome TIGR gene indexes Mutual info Pf chr Genbank refs PASA Db refs Hagai EC Winzeler Db refs Winzeler Lit refs Sage tag freqs Predicted Protein structs mr4 Cowman subcellular Haldar subcellular Apicoplast Florens 2002 Florens 2004 Broad SNP coverage Merozoite peptides Lasonder oocysts Lasonder oocysts sporozoites Lasonder salivary sporozoites evigan Broad bar code Broad 3k genotyping Entrez Dbrefs Pubmed dbrefs DeRisi Oligos DeRisi Dd2 DeRisi HB3 DeRisi 3D7
P. vivax resources
• TIGR gene indexes
• Watanabe Pv transcripts
• Watanabe Pv ESTs
• Pv contigs
• Pv dbrefs
• Pv GB dbrefs
• Pv mitochon
• Pv chromosomes
Start Plasmo Toxo start C.hominis C.parvum Synteny End Api End