Dataset Classes
• A dataset class tells us:
  • How to handle a particular type of dataset
  • Exactly how to put it into manual delivery (it specifies the API for manual delivery)
  • How to put it in the database (resource XML)
  • How to process it in the workflow (graph XML)
Human Roles
• Dataset Integrator
  • Puts datasets into manual delivery (conforming to the dataset class API)
  • Provides a specification of each dataset for the workflow
• Workflow Pilot
  • Configures the workflow
  • Runs the workflow
• Workflow Developer
  • Writes dataset classes
  • Writes graph files
  • Writes step classes
  • Writes plugins
• ReFlow Developer
  • Develops the underlying workflow system
Organism Abbrev
• Throughout the workflow system, we use a unique, stable “identifier” for an organism: its organism abbrev
• We do not use things like taxon IDs, scientific names, etc.
• Examples:
  • tgonME49
  • pfal3D7
  • ncanLIV
• It always includes:
  • One letter for the genus
  • Three letters for the species
  • The strain
• Once it is set, it does not change, even if we adjust the name of the organism
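The naming rule for an organism abbrev can be sketched as a small parser. This is an illustration only, not part of the workflow system: the function name `parse_org_abbrev` is hypothetical, and the regular expression assumes the genus and species letters are lowercase and the strain is whatever follows, which matches the three examples but is not stated as a rule.

```python
import re

# Assumption: one lowercase genus letter, three lowercase species letters,
# then a non-empty strain with no whitespace (e.g. "ME49", "3D7", "LIV").
ORG_ABBREV_RE = re.compile(r"([a-z])([a-z]{3})(\S+)")


def parse_org_abbrev(abbrev):
    """Split an organism abbrev into its genus, species, and strain parts."""
    match = ORG_ABBREV_RE.fullmatch(abbrev)
    if match is None:
        raise ValueError(f"not a valid organism abbrev: {abbrev!r}")
    genus, species, strain = match.groups()
    return {"genus": genus, "species": species, "strain": strain}


for example in ("tgonME49", "pfal3D7", "ncanLIV"):
    print(example, parse_org_abbrev(example))
```

A check like this could run when a dataset specification is read, so that a taxon ID or scientific name supplied by mistake is rejected early.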
Manual Delivery
• Manual delivery has a very specific structure:
    manualDelivery/
      project/
        organismAbbrev/
          category/
            datasetName/
              datasetVersion/
                final/
                fromProvider/
                workspace/
                README
• final/ contains standard file names that conform to the dataset class API
  • E.g.: SNPs.gff
  • They never have the name of the provider or any other dataset-specific info
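As a rough sketch of how the manual delivery layout maps to a path, the helper below joins the components in the listed order. It assumes that final/ sits directly under datasetVersion/; the function name `final_dir` and the sample category, dataset name, and version are hypothetical values chosen for illustration.

```python
from pathlib import PurePosixPath


def final_dir(project, org_abbrev, category, dataset_name, dataset_version,
              root="manualDelivery"):
    """Build the path of a dataset's final/ directory in manual delivery."""
    # Assumption: final/ is a direct child of datasetVersion/.
    return PurePosixPath(root, project, org_abbrev, category,
                         dataset_name, dataset_version, "final")


# Hypothetical category, dataset name, and version.
snp_file = final_dir("ToxoDB", "tgonME49", "SNP", "exampleStudy", "1.0") / "SNPs.gff"
print(snp_file)
```

Because the file name inside final/ is fixed by the dataset class, only the directory components vary from one dataset to the next.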
Datasets (myOrg.xml)
    <dataset class="dbxrefs">
      <prop name="orgAbbrev">myOrg</prop>
      <prop name="name">uniprot</prop>
      <prop name="version">2.0</prop>
    </dataset>

Dataset Classes (classes.xml)
    <datasetClass name="dbxrefs">
      <prop name="orgAbbrev"/>
      <prop name="name"/>
      <prop name="version"/>
      <graphPlanFile name="dbXRefs.xml"/>
      <resource name="${orgAbbrev}_${name}_dbxrefs">
        <manualGet/>
        …
      </resource>
    </datasetClass>

Workflow Plan (dbXRefs.xml)
    <workflow>
      <datasetTemplate class="dbxrefs">
        <prop name="orgAbbrev"/>
        <prop name="name"/>
        <subgraph name="${orgAbbrev}_${name}_dbxrefs" xmlFile="loadResources.xml">
          <paramValue name="what">for</paramValue>
        </subgraph>
      </datasetTemplate>
      ..
    </workflow>

Code generator
• Reads the datasets, dataset classes and workflow plan, and writes the generated files

Generated files

Resources (myOrg.xml)
    <resources>
      <resource name="myOrg_uniprot_dbxrefs">
        …
      </resource>
      …
    </resources>

Workflow Graph (myOrg/dbXRefs.xml)
    <workflow>
      <step>
        <subgraph name="myOrg_uniprot_dbxrefs"> … </subgraph>
      </step>
    </workflow>

[Diagram: Top Level Graph, the Workflow Graph, and other graphs]
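The `${orgAbbrev}_${name}_dbxrefs` placeholders suggest that the code generator fills templates with the props of each dataset. The following is a minimal sketch of that substitution using Python's `string.Template`, which happens to use the same `${...}` syntax. It is not the actual generator: the `expand` helper is hypothetical, and the dataset props are copied from the slide's example.

```python
from string import Template


def expand(template_text, props):
    """Replace ${prop} placeholders with values from a dataset's props."""
    # substitute() raises KeyError if the template needs a prop that is missing.
    return Template(template_text).substitute(props)


# Props of the example <dataset class="dbxrefs"> entry.
dataset = {"orgAbbrev": "myOrg", "name": "uniprot", "version": "2.0"}

resource_name = expand("${orgAbbrev}_${name}_dbxrefs", dataset)
subgraph_tag = expand(
    '<subgraph name="${orgAbbrev}_${name}_dbxrefs" xmlFile="loadResources.xml">',
    dataset,
)
print(resource_name)
print(subgraph_tag)
```

Expanding the resource template and the subgraph template from the same props yields the same name in both generated files, which is what ties a graph step to its resource.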
Dataset Files
• ToxoDB.xml
• ToxoDB/tgonME49.xml
• ToxoDB/tgonME49/Einstein.xml

Generates

Resource Files and Graph Files
• ToxoDB.xml
• ToxoDB/project.xml
• ToxoDB/tgonME49.xml
• ToxoDB/tgonME49/ESTs.xml
• ToxoDB/tgonME49/Einstein.xml
• ToxoDB/tgonME49/dbXRefs.xml
• ToxoDB/tgonME49/arrayStudies.xml
• ToxoDB/tgonME49/SNPs.xml
• ToxoDB/tgonME49/Einstein/chipChipSamples.xml
DataSource
• We store simple meta information in the database about each dataset:
  • Provider contact info
  • Descriptions
  • Display names
  • References to WDK searches, tables and attributes that use the data
• The information is stored in two tables:
  • DataSource -- pulled right from the <resource>
  • DataSourceInfo -- provided by a specific file after data loading is completed
• It is available in the WDK as a DataSource record
  • The search and record pages (e.g., Gene) can access this info for display purposes
  • Soon we will support searches for these, e.g., find all searches that involve a certain dataset
DataResource?
• It makes no sense to have two names:
  • <resource>
  • DataSource table, Perl objects, and WDK record
• So, either:
  • Rename <resource> to <datasource>
    • This is a pain to transition to in our code
  • Or, rename DataSource to DataResource and keep <resource> as is
DataResourceInfo
• DatasetClasses do not include meta info about the dataset:
  • Contact info
  • Description
  • Mapping to WDK searches and records
• DatasetClasses describe how to load the data
• But, we can have DatasetClass