Model of a real workflow
A subset of the PlasmoDB pipeline (in progress!)
And issues to discuss…
PlasmoDB workflow P. falciparum Standard genome P. vivax Standard genome P. yoelii Standard genome P. berghei Standard genome P. chabaudi Standard genome P. knowlesi Standard genome P. reichenowi Standard genome P. gallinaceum Standard genome P. falciparum Non-standard synteny
Standard Genome Workflow
Legend: subflows (double line), global steps (oval)
• Calculate translated protein
• Splign
• NRDB
• taxonomy
• SO
• TIGR TGI
• In: Pf, Pk
• In: Pf, Pb, Py, Pv
• Genome
• Compile-time include/exclude
• Extract genomic sequence
• Copy genomic seqs to cluster
• Extract proteins
• Copy proteins to cluster
• Molecular weight
• Min/max molecular weight
• Isoelectric point
• Run TMHMM
• blastx NRDB genome
• psipred
• blastp NRDB proteins
• Load TMHMM
Standard Genome Workflow
• Calculate translated protein
• Splign
• NRDB
• taxonomy
• SO
• TIGR TGI
• In: Pf, Pb, Py, Pv
• In: Pf, Pk
• Genome
• Extract genomic sequence
• Copy genomic seqs to cluster
• Extract proteins
• Copy proteins to cluster
• Molecular weight
• Min/max molecular weight
• Isoelectric point
• Run TMHMM
• blastx NRDB genome
• psipred
• blastp NRDB proteins
• Load TMHMM
NRDB
• NRDB resource
• Copy from download site
• Shorten defline
• Copy to cluster
• Copy to cluster
Resources
• acquire
• unpack
• ext db
• ext db rls
• insert
Psipred
• Create psipred data dir
• Fix protein IDs for psipred
• Create psipred task dir
• Copy data dir to cluster
• Copy psipred protein file to cluster
• Start psipred on cluster
• Wait for cluster
• Copy psipred files from cluster
• Fix psipred file names
• Make Alg Inv
• Load psipred
BLAST
• Create similarity dir
• Start blast
• Wait for cluster
• Copy files from cluster
• Extract IDs from blast result
• Optional step (runtime test)
• Load subject subset
• Load result
Splign
• Extract query sequence (alt defline)
• Extract subject sequence (alt defline)
• runSplign
• insertSplign
Steps
• Subflows
• Parameters
• Constants
• Interpolating variables
• Global steps
  • Steps that are executed only once by the whole workflow, even if they appear in multiple subflows
  • Declare a namespace?
• Include/exclude
  • Compile-time inclusion/exclusion
  • If not compiled in, flow passes right through
• Skip-able steps
  • Runtime exclusion, based on a dynamic test
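The three kinds of step control on this slide (global steps, compile-time include/exclude, runtime skipping) can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: `Step`, `compile_flow`, `run_flow`, the tag mechanism, and all step names are hypothetical, and a flat list stands in for the real step graph.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    name: str
    is_global: bool = False                       # run once for the whole workflow
    include_if: Optional[str] = None              # compile-time tag; None = always included
    skip_if: Optional[Callable[[], bool]] = None  # runtime test; True = skip this step


def compile_flow(steps: List[Step], enabled_tags: Set[str]) -> List[Step]:
    """Apply compile-time include/exclude and collapse repeated global steps."""
    compiled, seen_globals = [], set()
    for step in steps:
        if step.include_if is not None and step.include_if not in enabled_tags:
            continue  # not compiled in: the flow passes right through
        if step.is_global:
            if step.name in seen_globals:
                continue  # already scheduled by another subflow
            seen_globals.add(step.name)
        compiled.append(step)
    return compiled


def run_flow(compiled: List[Step]) -> List[str]:
    """Evaluate each step's dynamic test just before it would run."""
    executed = []
    for step in compiled:
        if step.skip_if is not None and step.skip_if():
            continue  # runtime exclusion
        executed.append(step.name)  # a real workflow would invoke the step here
    return executed


# Two subflows reference the same global step; one step is tagged, one is skippable.
steps = [
    Step("NRDB", is_global=True),
    Step("Pf.extractProteins"),
    Step("Pf.splign", include_if="splign"),
    Step("NRDB", is_global=True),
    Step("Pv.extractProteins"),
    Step("Pv.loadSubjectSubset", skip_if=lambda: True),
]

print(run_flow(compile_flow(steps, enabled_tags=set())))
```

Keeping compile-time exclusion and runtime skipping in separate passes mirrors the distinction the slide draws: an excluded step never exists in the compiled flow, while a skippable step exists but may decline to run.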
Step Values
• Avoid side effects in the file system (OK in the database)
  • All files shared by steps must be passed as param values
    • outputFiles
    • inputFiles
• Avoid hard-coded values
  • Use Constants
• Avoid hand-coded values that change each build
  • Must be computed by a step
  • E.g. blast Y= value
• External Db Rls values
  • Always pass the external db rls spec, e.g.
    • Plasmodium Falciparum Chromosomes:2008-07-13
  • Upgrade steps to conform to this
• Table names
  • Want to be able to reuse these values across steps
  • Always use the same format, e.g.:
    • Dots.ExternalNaSequence
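If every step receives these values in one fixed format, a pair of small parsers could validate them at the step boundary. The sketch below is an assumption built only from the two examples on the slide: it treats the text after the last colon as the release version and requires table names to be exactly `Schema.Table`. The function and type names are hypothetical.

```python
from typing import NamedTuple


class ExtDbRlsSpec(NamedTuple):
    name: str
    version: str


class TableName(NamedTuple):
    schema: str
    table: str


def parse_ext_db_rls_spec(spec: str) -> ExtDbRlsSpec:
    """Split 'name:version' on the LAST colon (assumed convention)."""
    name, sep, version = spec.rpartition(":")
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return ExtDbRlsSpec(name.strip(), version.strip())


def parse_table_name(qualified: str) -> TableName:
    """Require exactly one dot separating schema and table (assumed convention)."""
    parts = qualified.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return TableName(parts[0], parts[1])


spec = parse_ext_db_rls_spec("Plasmodium Falciparum Chromosomes:2008-07-13")
table = parse_table_name("Dots.ExternalNaSequence")
print(spec.name, "|", spec.version)
print(table.schema, "|", table.table)
```

Rejecting malformed values early means a step that has not yet been upgraded to the common format fails loudly instead of silently loading against the wrong release.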
Cluster
• Wait for cluster step
  • Sends email
    • (takes a list of email addresses as config; maybe we should set up a mailing list?)
  • Followed by a waitForHuman step
    • By default is in the “WAIT_FOR_HUMAN” state
    • Orthogonal to other states and offline status
    • Pilot can turn that off, and it will run
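One reading of "orthogonal to other states and offline status" is that wait-for-human is a separate flag rather than another value of the step's state. A minimal sketch under that assumption; the class, the state names other than WAIT_FOR_HUMAN, and the `release` method are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepStatus:
    state: str = "READY"          # hypothetical lifecycle state (READY, RUNNING, DONE, ...)
    offline: bool = False         # offline status, independent of state
    wait_for_human: bool = True   # on by default for a waitForHuman step

    def runnable(self) -> bool:
        """A step runs only when all three independent conditions allow it."""
        return self.state == "READY" and not self.offline and not self.wait_for_human

    def release(self) -> None:
        """What the pilot would do to let the step proceed."""
        self.wait_for_human = False


status = StepStatus()
print(status.runnable())   # blocked until a human releases it
status.release()
print(status.runnable())
```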
Configuration
• Steps configuration
  • Global
    • Commonly used properties
    • Not validated until runtime
  • Static
    • Defined per step class
    • Convenient; often all that is necessary
  • Cascading?
  • Multi-steps file
• Distinguish between stable properties and mutable ones
  • Version numbers often change
  • Svn
• Pilot configuration?
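The slide only raises cascading as a question. If it were adopted, one possible lookup order is step-specific, then step class, then global. The sketch below uses Python's standard `collections.ChainMap` to show that order; every property name and value is a made-up placeholder.

```python
from collections import ChainMap

# Hypothetical property sets; none of these names or values come from the real pipeline.
global_props = {"clusterServer": "cluster.example.org", "email": "team@example.org"}
class_props = {"RunPsipred": {"nodeCount": "10"}}
step_props = {"Pvivax.runPsipred": {"email": "pv-owner@example.org"}}


def step_config(step_name: str, step_class: str) -> ChainMap:
    """Most specific mapping first; ChainMap searches them left to right."""
    return ChainMap(
        step_props.get(step_name, {}),
        class_props.get(step_class, {}),
        global_props,
    )


cfg = step_config("Pvivax.runPsipred", "RunPsipred")
print(cfg["email"])          # step-level override
print(cfg["nodeCount"])      # from the step class
print(cfg["clusterServer"])  # falls through to global
```

A missing key still raises `KeyError` only at lookup time, which matches the slide's note that global properties are not validated until runtime.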
File & Directory Structure
• Avoid side effects
  • Use explicit input/output params in the xml file
• Move to a nested data directory structure?
  /files/cbil/data/cbil/Plasmodb/5.5/workflow/data/ Seqfiles/ nrdb.fsa Pvivax/ Seqfiles/ Psipred/ Assembly/ ESTs/ Initial/ Intermediate/
  • Would use the namespace attribute, somehow
  • Use path statement, e.g.:
    • ../
    • ../tmhmm
• Steps directories
  • Use nested structure for subflows?
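If relative path statements such as `../tmhmm` were resolved against a subflow's namespace directory, the resolution might look like the sketch below. The data root is the one shown on the slide; the `resolve` function, the use of `Pvivax/Psipred` as a namespace, and the rule that paths may not escape the data root are assumptions for illustration.

```python
import posixpath

DATA_ROOT = "/files/cbil/data/cbil/Plasmodb/5.5/workflow/data"


def resolve(namespace: str, relative: str) -> str:
    """Resolve a path statement relative to a namespace directory under DATA_ROOT."""
    path = posixpath.normpath(posixpath.join(DATA_ROOT, namespace, relative))
    # Assumed safety rule: a path statement must stay inside the data directory.
    if path != DATA_ROOT and not path.startswith(DATA_ROOT + "/"):
        raise ValueError(f"{relative!r} escapes the data directory")
    return path


print(resolve("Pvivax/Psipred", "../tmhmm"))
print(resolve("Pvivax", "../Seqfiles/nrdb.fsa"))
```

Resolving every path through one function keeps file locations explicit in the step parameters, in line with the "avoid side effects" goal on the slide.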
GUI
• Should it run in the web context?
  • Security issues
  • Avoids having to have installed software
  • Would work from home
  • All members of the team could see the flow
    • Somehow restrict editability
  • Could be posted on the real site as documentation?
    • Overkill? Too detailed?
• Needs to handle subflows
  • Subflow node needs to show a summary of what is going on inside the subflow
    • Multi-colored, to show various states inside it
    • Gray out paths that are offline
  • Expand/collapse?
Resource Pipeline
• Not worked out yet
• Needs to be handled by a regular subflow
• Unpacks will need to be collapsed into a single unpack script
• Resources.xml file as needed by the front end can be produced by a documentation run of the pipeline
• Does it need to be configured in xml, or would a properties file be good enough?
Documentation of the workflow
• Workflow must be able to run in “documentation” mode
  • Doesn’t run any steps
  • Instead, produces documentation as expected by the front end
    • Methods xml file
    • Resources xml file
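A documentation mode can be sketched as a second entry point on each step that describes the step instead of running it. The example below is only an illustration: the `Step` class, the `execute` function, the step names and descriptions, and the `<methods>`/`<method>` element names are hypothetical, since the slides do not show the schema the front end expects.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


class Step:
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self.ran = False

    def run(self) -> None:
        self.ran = True  # stand-in for the real work

    def document(self) -> ET.Element:
        """Describe the step without executing it."""
        element = ET.Element("method", name=self.name)
        element.text = self.description
        return element


def execute(steps: List[Step], documentation_mode: bool = False) -> Optional[str]:
    """Run the steps, or, in documentation mode, return an XML description instead."""
    if not documentation_mode:
        for step in steps:
            step.run()
        return None
    root = ET.Element("methods")
    for step in steps:
        root.append(step.document())
    return ET.tostring(root, encoding="unicode")


steps = [
    Step("runTmhmm", "Predict transmembrane domains"),
    Step("loadTmhmm", "Load TMHMM results into the database"),
]
print(execute(steps, documentation_mode=True))
```

A Resources xml file could be produced the same way, by giving resource steps their own `document` method.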
Standard resources taxonomy SO EnzymeDB GO Codes GO NRDB dbEST [tax_id] Bibliographic Ref terms MO terms MO types MO MO Entry InterPro Orthomcl phyletic orthomcl
Plasmodb resources IEDB epitopes IEDB dbxrefs NA Genbank dbrefs AA Genbank dbrefs pdb Pdb index
P. falciparum resources Watanabe Pf transcripts Watanabe Pf ESTs Zhang ESTs Florent ESTs Pf plastid Pf mitochon Pf GO Associations Sanger IT SNPs SU SNPs Broad SNPs Combined SNPs Winzeler Genetic Var. array MTC KI Array Winzeler Cell Cycle Winzeler Gametocyte Scripps Array Pfab Array DeRisi Array 7282 Daily Meta data GSE2265 Meta data Cowman Meta data Duraisingh Meta data Baum Meta data Waters Meta data E-MEXP 128 Meta data E-MEXP 439 Meta data E-MEXP 449 Meta data GSE5247 Meta data GSE8099 Meta data Daily Array data GSE2265 array data Cowman Array data Duraisingh Array data Baum Array data Waters Array data E-MEXP 128 Array data E-MEXP 439 Array data E-MEXP 449 Array data GSE5247 Array data GSE8099 Array data Daily RAD anal GSE2265 RAD anal Cowman RAD anal Duraisingh RAD anal Baum RAD anal Waters RAD anal E-MEXP 128 RAD anal E-MEXP 439 RAD anal E-MEXP 449 RAD anal GSE5247 RAD anal GSE8099 RAD anal Waters Gametocyte Mass spec Waters Female Gametes mass Waters male Gametes mass Waters mixed Gametes mass Plasmodb Gene ids Sage tag Array design y2h Plasmo map interactome TIGR gene indexes Mutual info Pf chr Genbank refs PASA Db refs Hagai EC Winzeler Db refs Winzeler Lit refs Sage tag freqs Predicted Protein structs mr4 Cowman subcellular Haldar subcellular Apicoplast Florens 2002 Florens 2004 Broad SNP coverage Merozoite peptides Lasonder oocysts Lasonder oocyst sporozoites Lasonder salivary sporozoites evigan Broad bar code Broad 3k genotyping Entrez Dbrefs Pubmed dbrefs DeRisi Oligos DeRisi Dd2 DeRisi HB3 DeRisi 3D7
P. vivax resources TIGR gene indexes Watanabe Pv transcripts Watanabe Pv ESTs Pv contigs Pv dbrefs Pv GB dbrefs Pv mitochon Pv chromosomes
Start Plasmo Toxo start C.hominis C.parvum Synteny End Api End