The pPOD Core Data Model. The pPOD CDM team: Bill Piel, Shirley Cohen, Tim McPhillips, Shawn Bowers, Sarah Cohen-Boulakia, Val Tannen
Special thanks to Brent Mishler, David Maddison, Jeff Oliver, Rutger Vos, Francois Lutzoni, Martin Ramirez, Jonathan Coddington, Wayne Maddison, Fan Ge, Ashley Green, Jin Ruan, Martin Wu, John Lundberg, John Sullivan
Goals
The Core Data Model (CDM) under development in the pPOD project will serve the following purposes:
1. It will allow experimentation with the modeling of provenance in phylogenetic pipelines.
2. It will serve as a schema for a persistence tool, to work (1) in standalone mode, (2) with our lab notebook suite, and (3) integrated with Mesquite as a module.
3. It will serve as a target for schema mappings used to connect other AToL databases, resources like TreeBASE, etc., using the Orchestra integration engine.
The Role of Provenance
Backwards provenance “query”: Starting from a research “product”, e.g., a tree, a supertree, or a matrix, track backwards through stored objects to all the raw input information that led to this product.
Forwards provenance “query”: Starting from a raw input, e.g., a specimen, an image, or a sequence, track forwards through stored objects to all research products that this input contributed to.
In both cases, navigate biological assumptions, e.g., homology assumptions, in both directions.
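The two provenance queries can be read as reachability in a directed "derived from" graph. The following is a minimal sketch under that assumption, not the pPOD implementation: the class `ProvenanceGraph`, its method names, and the particular links between objects are hypothetical and chosen only for illustration (the object labels are borrowed from the slides).

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Hypothetical store of 'product was derived from source' links."""

    def __init__(self):
        self._derived = defaultdict(set)  # source -> products derived from it
        self._sources = defaultdict(set)  # product -> sources it was derived from

    def add_derivation(self, source, product):
        self._derived[source].add(product)
        self._sources[product].add(source)

    @staticmethod
    def _reach(start, edges):
        # Breadth-first search; returns every node reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def backwards(self, product):
        """Everything the product was derived from, directly or indirectly."""
        return self._reach(product, self._sources)

    def forwards(self, raw_input):
        """Everything the input contributed to, directly or indirectly."""
        return self._reach(raw_input, self._derived)

    def raw_inputs(self, product):
        """Ancestors of the product that have no recorded source themselves."""
        return {n for n in self.backwards(product) if not self._sources.get(n)}


# Illustrative links only; which image backs which cell is an assumption.
g = ProvenanceGraph()
for source, product in [
    ("spec 20", "img 206"),
    ("img 206", "cell(28,23)"),
    ("cell(28,23)", "matrix M456"),
    ("spec 21", "img 211"),
    ("img 211", "cell(28,45)"),
    ("cell(28,45)", "matrix M456"),
    ("matrix M456", "tree T123"),
]:
    g.add_derivation(source, product)

print(sorted(g.raw_inputs("tree T123")))
print(sorted(g.forwards("spec 21")))
```

Keeping both edge directions indexed makes the backwards and forwards queries symmetric, at the cost of storing each link twice.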
[Architecture diagram; labels: store commands; provenance query; query (phylogenetic query language); AToL AAA; schema mappings; TreeBASE; persistence manager; RDBMS; Persistence Tool; CDM (an OO schema); Kepler-based workflow tool; Mesquite module]
AToL Data that needs to be modeled in CDM (not an exhaustive list)
Analyzed data: trees, matrices, cells, (row) segments, operational taxonomic units (OTUs), taxa, standard characters and their states, genes, gene fragments
Raw data: standard views, images, sequences, chromatograms, primers, specimens, samples, collections
CDM: Phylogeny Inference Data
Analyzed data: trees, matrices, operational taxonomic units (OTUs), standard taxa
[Class diagram; labels: Tree, provenance, authority, StdTaxon, Matrix, isA, Set, taxon, OTU, List, StdMatrix, SeqMatrix]
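One possible reading of this class diagram, sketched as Python dataclasses. The slide text does not preserve how the boxes are connected, so every association here is an assumption: that a Tree's provenance points to a Matrix, that a Matrix holds a list of OTUs, that an OTU refers to a StdTaxon carrying an authority, and that StdMatrix and SeqMatrix are kinds of Matrix. Field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StdTaxon:
    name: str
    authority: str  # assumed to be an attribute of the standard taxon


@dataclass
class OTU:
    label: str
    taxon: Optional[StdTaxon] = None  # assumed "taxon" link from OTU to StdTaxon


@dataclass
class Matrix:
    matrix_id: str
    otus: List[OTU] = field(default_factory=list)  # assumed ordered list of OTUs


@dataclass
class StdMatrix(Matrix):
    """Standard (e.g. morphological) matrix; 'isA' Matrix."""


@dataclass
class SeqMatrix(Matrix):
    """Sequence matrix; 'isA' Matrix."""


@dataclass
class Tree:
    tree_id: str
    provenance: Optional[Matrix] = None  # assumed: the matrix the tree came from


# Hypothetical sample objects.
taxon = StdTaxon(name="Taxon A", authority="hypothetical authority")
matrix = StdMatrix(matrix_id="M456", otus=[OTU(label="OTU1", taxon=taxon)])
tree = Tree(tree_id="T123", provenance=matrix)

print(tree.provenance.otus[0].taxon.name)
```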
Modeling Provenance (1)
[Diagram: a provenance link between Tree and Matrix]
…but also… Software (Parameters), Author, Date
Must be modeled and stored explicitly! But it can be provided by automatic workflow tools.
“Kinds” of Provenance
• Relationship between stored objects
  • E.g., tree T123 was obtained from matrix M456 by Joe Bio on 01/31/2001 using PAUP with parameters… (see previous slide)
• Tracking through copy or cut/paste operations, possibly across repositories
• Trace of data moving through a workflow
  • Sequence of timestamps, tool invocations (parameters), authors
• Trace of data through a logically expressed view/query
  • Can be computed automatically as the view/query output is computed
(Slide callouts: “In our CDM tools”, “In our workflow tool”)
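The first kind of provenance, a relationship between stored objects annotated with author, date, and software, could be stored explicitly as a record like the one sketched below. The class `ProvenanceRecord` and its field names are hypothetical, and the parameters are left empty because the slide elides them.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class ProvenanceRecord:
    """Hypothetical explicit record of how one stored object led to another."""
    product: str
    source: str
    author: str
    created: date
    software: str
    # Software parameters; the slide elides the actual values, so none are filled in.
    parameters: Dict[str, str] = field(default_factory=dict)

    def describe(self) -> str:
        return (
            f"{self.product} was obtained from {self.source} "
            f"by {self.author} on {self.created:%m/%d/%Y} using {self.software}"
        )


record = ProvenanceRecord(
    product="tree T123",
    source="matrix M456",
    author="Joe Bio",
    created=date(2001, 1, 31),
    software="PAUP",
)
print(record.describe())
```

A workflow tool could emit such records automatically at each tool invocation, which is one way to obtain the sequence of timestamps, invocations, and authors mentioned above.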
CDM: Morphological Data
Analyzed data: standard matrices, cells, standard characters and their states
Raw data: standard views, images, specimens, collections
[Class diagram; labels: OTU, Specimen, Collection, Matrix, StdMatrix, Cell, code(states), Image, StdChar, StdView, states : List<string>; links labeled prov; collection types List and Set]
Modeling Provenance (2)
[Diagram; labels: tree T123; matrix M456; cell(0,0), cell(28,23), cell(28,45); spec 19, spec 20, spec 21; img 193, img 194, img 204, img 206, img 211]
Example of Phylogenetic Query
Find all standard matrices with some character C whose label contains the substring "elytra" and some OTU whose state for character C contains the substring "transverse"; return all such matrices, together with their characters, OTUs and states satisfying the conditions.
Semi-formalized (OQL) query example
SELECT M, label of C, label of X, label of state encoded in cell E
FROM M over all standard matrices,
     C over all characters of M,
     X over all OTUs of M,
     E is the cell corresponding to C and X in M
WHERE the label of C is like "*elytra*"
  AND the label of the state encoded in cell E is like "*transverse*"
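The same query can be sketched over in-memory objects. This is an illustration under stated assumptions rather than the pPOD query language: the classes `StdChar` and `StdMatrix`, the cell encoding (each cell stores an index into its character's state list), the case-sensitive reading of "like", and all sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class StdChar:
    label: str
    states: List[str]  # state labels; a cell's code indexes into this list


@dataclass
class StdMatrix:
    name: str
    characters: List[StdChar]
    otus: List[str]
    # cells[(otu_index, char_index)] = index into that character's states
    cells: Dict[Tuple[int, int], int] = field(default_factory=dict)


def find_matches(
    matrices: List[StdMatrix], char_substr: str, state_substr: str
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (matrix, character label, OTU label, state label) for each match."""
    for m in matrices:                                   # M over all standard matrices
        for ci, c in enumerate(m.characters):            # C over all characters of M
            if char_substr not in c.label:
                continue
            for oi, otu in enumerate(m.otus):            # X over all OTUs of M
                code = m.cells.get((oi, ci))             # E: cell for C and X
                if code is None:                         # missing cell: no state
                    continue
                state = c.states[code]
                if state_substr in state:
                    yield (m.name, c.label, otu, state)


# Hypothetical sample data.
m1 = StdMatrix(
    name="M1",
    characters=[
        StdChar("elytra shape", ["transverse", "elongate"]),
        StdChar("antenna length", ["short", "long"]),
    ],
    otus=["OTU1", "OTU2"],
    cells={(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0},
)
m2 = StdMatrix(name="M2", characters=[StdChar("wing color", ["dark"])],
               otus=["OTU3"], cells={(0, 0): 0})

results = list(find_matches([m1, m2], "elytra", "transverse"))
print(results)
```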
Molecular Data
Analyzed data: sequence matrices, (row) segments, genes, gene fragments
Raw data: sequences, chromatograms, primers, specimens, samples, collections
[Diagram of a molecular matrix; labels: gene frag 1, gene frag 2; OTU1, OTU2; a row segment (from some sequence); from some contig; from different specimens]
[Class diagram; labels: SeqMatrix, Row Segment, ColumnSeg, OTU, Contig, Raw Sequence, Protein, GeneFragment, Primer, Chromatogram, Collection, Specimen, Sample, isA; attribute endPos : int (shown twice); links labeled prov (one marked "prov???"); collection types List and Set]