dbgg – database for genetical genomics update. Morris Swertz (m.a.swertz@rug.nl). Braunschweig CASIMIR meeting, July 2, 2008.
Objective
• Share genotype/phenotype data and tools
Complicated experiments
[Figure: workflow diagram of a genetical genomics experiment. Legend: main work flow, data dependency, biomaterial/result, lab/analysis process, scale of information, associated data files. Steps and artifacts shown: strains, genome, markers, inbreed, individuals, genotype, genotypes, map, QTL profiles, correlate, microarrays, probes, hybridize, expressions, preprocess, norm. exprs., network. Scale annotations range from 10 to millions of items.]
Barriers to sharing data
• Incompatible data!
• Incomplete data!
[Figure: the workflow diagram from "Complicated experiments", annotated with Collaborator 1, Collaborator 2 and Collaborator 3.]
Barriers to sharing data
• Incomplete and/or incompatible data!
[Figure: three copies of the workflow diagram, labeled Investigation 1, Investigation 2 and Investigation 3.]
Barriers to sharing software tools
[Figure: three copies of the workflow diagram.]
Barriers to sharing software tools
• Hard to find and reuse tools
[Figure: three boxes, each labeled "10,000 QTL profiles".]
Use a standard tool?
[Figure: the workflow diagram from "Complicated experiments".]
More biotechnologies, more protocols
• Yes, if they could be easily adapted! (and they can't)
[Figure: workflow diagram extended with other technologies. Legend: main work flow, data dependency, biomaterial/result, lab/analysis process, scale of information, associated data files. Steps and artifacts shown: strains, genome, SNP arrays, inbreed, individuals, genotype, genotypes, map, QTL profiles, correlate, LC/MS, mass peaks, preprocess, aligned peaks, network.]
Objectives
• Share genotype/phenotype data and tools:
  • Interoperable software
    • Simple flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  • Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps
The software
• Share genotype/phenotype data and tools:
  • Interoperable software
    • Simple flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  • Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps
Software: flat file exchange format
• Raw and processed data in matrix form
  E.g. microarray data: rows = individuals, columns = Affy probes.
Software: flat file exchange format
• Annotation info in tabular form
  E.g. probe annotation data: rows = probes, columns = attributes of each probe.
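The two flat-file shapes described above can be illustrated with a minimal Python sketch. It assumes tab-delimited text with names in the first row and first column; the actual dbGG file layout is not shown on the slides, and the individual, probe and gene names below are invented for illustration.

```python
import csv
import io

# Hypothetical file contents (assumption: tab-delimited, header row first).
MATRIX_TEXT = "\tprobe_1\tprobe_2\nind_1\t7.1\t8.4\nind_2\t6.9\t8.0\n"
ANNOTATION_TEXT = "name\tgene\tchromosome\nprobe_1\tgeneA\t1\nprobe_2\tgeneB\t5\n"


def read_matrix(text):
    """Parse a data matrix: rows = individuals, columns = probes."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    column_names = rows[0][1:]  # skip the empty top-left cell
    matrix = {}
    for row in rows[1:]:
        matrix[row[0]] = {c: float(v) for c, v in zip(column_names, row[1:])}
    return matrix


def read_annotations(text):
    """Parse an annotation table: rows = probes, columns = probe attributes."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {record["name"]: record for record in reader}


expression = read_matrix(MATRIX_TEXT)
probes = read_annotations(ANNOTATION_TEXT)

# A data value and the annotation of its column are joined by probe name.
print(expression["ind_2"]["probe_1"], probes["probe_1"]["gene"])
```

Keeping data and annotations in separate files means the matrix stays purely numeric while probe attributes can grow without touching it.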
Software: exchange an experiment
• Described on http://gbic.biol.rug.nl/dbgg
[Figure labels: annotations, raw and processed data, dbGG import tool, dbGG export tool, dbGG database.]
Software: web user interface
http://gbicserver1.biol.rug.nl:8080/dbgg/molgenis.do
Software: interface to R
source("http://localhost:8080/molgenis4gg/R")
#download data
use.experiment(name="metanetwork") #set default
traits <- get.metabolitedata(name="mytraits")
genotypes <- get.markerdata(name="mygenotypes")
#calculate mQTLs
library("MetaNetwork")
qtls <- qtlMapTwoPart(genotypes=genotypes, traits=traits, spike=4)
#upload results for others to use
add.mqtldata(qtls, name="myqtls")
#inspect
MetaNetwork protocol: Fu, Swertz, Keurentjes, Jansen, Nature Protocols, 2007.
Software: interface to Taverna
• Add dbGG interface
Software: interface to Taverna
• Use data in dbGG
This enables automatic processing (see also CASIMIR 'use case 1')
[Figure label: dbGG]
Smedley, Swertz, Wolstencroft et al., submitted.
Use BioMART and MOLGENIS to access data and Taverna to automate the workflows
[Figure labels: Gene symbols, SNPs, Strain SNP Alleles, Pathways, four web services (ws), Your dbGG.]
Smedley, Swertz, Wolstencroft et al., submitted.
Software: extension procedure (using MOLGENIS)
• Little language (domain-specific language):
<!-- entity organization -->
<entity name="Experiment" label="Experiment">
  <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
  <field name="Medium" type="xref" xref_field="Medium.name"/>
  <field name="Protocol" label="Experiment Protocol"/>
  <field name="Temperature" type="int" …
• + Reusable assets and generator/interpreter
• dbGG v1: for microarrays
• dbGG v2: for mass spectrometry
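To make the "little language plus generator" idea concrete, here is a minimal Python sketch that reads an entity description and emits a SQL table definition. This is not the MOLGENIS generator. The type mapping, the default field type, and the handling of key and xref are assumptions made for illustration, and the Temperature field (cut off on the slide) is closed here only so the snippet parses.

```python
import xml.etree.ElementTree as ET

# Entity description modeled on the slide; Temperature is closed here as an
# assumption because the slide truncates that field.
MODEL_XML = """
<entity name="Experiment" label="Experiment">
  <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
  <field name="Medium" type="xref" xref_field="Medium.name"/>
  <field name="Protocol" label="Experiment Protocol"/>
  <field name="Temperature" type="int"/>
</entity>
"""

# Assumed mapping from model types to SQL types (not taken from MOLGENIS).
SQL_TYPES = {"string": "VARCHAR(255)", "int": "INTEGER", "xref": "VARCHAR(255)"}


def generate_create_table(xml_text):
    """Turn one <entity> description into a CREATE TABLE statement."""
    entity = ET.fromstring(xml_text)
    columns, constraints = [], []
    for field in entity.findall("field"):
        name = field.get("name")
        field_type = field.get("type", "string")  # assumed default type
        columns.append(f"  {name} {SQL_TYPES[field_type]}")
        if field.get("key") == "1":
            constraints.append(f"  PRIMARY KEY ({name})")
        if field_type == "xref":
            table, column = field.get("xref_field").split(".")
            constraints.append(f"  FOREIGN KEY ({name}) REFERENCES {table}({column})")
    body = ",\n".join(columns + constraints)
    return f"CREATE TABLE {entity.get('name')} (\n{body}\n);"


ddl = generate_create_table(MODEL_XML)
print(ddl)
```

The same description could feed other generators (forms, import tools, R functions), which is what makes a new dbGG variant a matter of editing the model rather than the code.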
Website: demos and downloads http://gbic.biol.rug.nl/dbgg
Outline
• To share genotype/phenotype data and tools:
  1. Interoperable software
    • Flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  2. Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps
Data
• Simple and close to current practice: genotype data
[Figure: data matrix indexed by subjects (STRAINS) and traits (MARKERS); each DATA ELEMENT links a TRAIT and a SUBJECT.]
Data
• Simple and close to current practice: genotype data, expression data
[Figure: data matrix indexed by subjects (INDIVIDUALS) and traits (PROBES); each DATA ELEMENT links a TRAIT and a SUBJECT.]
Data
• Simple and close to current practice:
  • Genotype data
  • Expression data
  • Classic phenotype data
  • Metabolite abundance data
  • Protein abundance data
  • And so on…
[Figure labels: TRAIT, SUBJECT.]
Data with any Dimension Type
• SUBJECT: Individual, Strain, Sample, …
• TRAIT: Probe, Marker, Mass Peak, …
[Figure: each DATA ELEMENT links a TRAIT and a SUBJECT.]
Data
• Simple and close to current practice: what about QTL data?
[Figure: data matrix indexed by traits (MARKERS) and traits (PROBES).]
Data
• Simple and close to current practice:
  • What about QTL data?
  • Probe association data?
  • Interaction network data?
[Figure: data matrix indexed by traits (MARKERS) and traits (PROBES); dimension labels shown: TRAIT, TRAIT, SUBJECT, SUBJECT.]
Data with any Dimension Type
• Minimal data model
[Figure: SUBJECT and TRAIT are dimension ELEMENTs; DATA ELEMENTs are indexed by dimension ELEMENTs as rows and columns.]
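The minimal data model above can be sketched in Python as follows. The class and attribute names are illustrative assumptions rather than the dbGG schema, and the choice of which dimension is the row and which is the column is arbitrary here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DimensionElement:
    """Anything that can label a row or a column of a data matrix."""
    name: str


class Subject(DimensionElement):
    """E.g. an individual, strain, or sample."""


class Trait(DimensionElement):
    """E.g. a probe, marker, or mass peak."""


@dataclass
class DataMatrix:
    """Data elements indexed by two lists of dimension elements."""
    name: str
    rows: list
    columns: list
    elements: dict = field(default_factory=dict)

    def set(self, row, column, value):
        if row not in self.rows or column not in self.columns:
            raise KeyError(f"{row} / {column} is not part of matrix {self.name}")
        self.elements[(row, column)] = value

    def get(self, row, column):
        return self.elements[(row, column)]


strain = Subject("strain_1")   # hypothetical names
marker = Trait("marker_1")
probe = Trait("probe_1")

# Genotype data: trait x subject.
genotypes = DataMatrix("genotypes", rows=[marker], columns=[strain])
genotypes.set(marker, strain, "A")

# QTL data: trait x trait, using the same matrix type.
qtl_profiles = DataMatrix("qtl_profiles", rows=[marker], columns=[probe])
qtl_profiles.set(marker, probe, 4.2)
```

Because rows and columns only need to be dimension elements, QTL profiles, probe associations and networks fit the same structure as genotype and expression data.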
The data model
• To share genotype/phenotype data and tools:
  • Extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
Annotations
• Simple and close to current practice: probe annotations
• PROBE is a variant of TRAIT, having:
  • Name
  • Gene
  • Chromosome
  • Locus
Annotation extends Trait or Subject
• SUBJECT
  • STRAIN: Name, Type (CSS, RIL, ..), Parent Strains
  • INDIVIDUAL: Name, Strain, Mother, Father, Sex
  • SAMPLE: Name, Individual, Tissue
  • And so on…
• TRAIT
  • PROBE: Name, Gene, Chromosome, Locus
  • MARKER: Name, Allele, Chromosome, Locus
  • MASSPEAK: Name, MZ, RetentionTime
  • And so on…
[Figure: SUBJECT and TRAIT are dimension ELEMENTs; each DATA ELEMENT has a row and a column.]
Annotation simple in practice
• Extensions are automatic "under the hood": PROBE isa TRAIT isa DIMENSION ELEMENT
[Figure: three data sets built from DATA ELEMENTs. Genotype data: STRAIN and MARKER. Expression data: INDIVIDUAL and PROBE. QTL data: MARKER and PROBE.]
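The "isa" chain above maps naturally onto class inheritance. The following Python sketch uses the attribute names listed on the annotation slides; the attribute types and the example values are assumptions, and this is not the generated dbGG code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimensionElement:
    name: str


@dataclass
class Trait(DimensionElement):
    pass


@dataclass
class Subject(DimensionElement):
    pass


@dataclass
class Probe(Trait):
    gene: str
    chromosome: str
    locus: int  # assumed to be an integer position


@dataclass
class Strain(Subject):
    type: str  # e.g. CSS or RIL
    parent_strains: tuple = ()


@dataclass
class Individual(Subject):
    strain: Strain
    mother: Optional[str] = None
    father: Optional[str] = None
    sex: Optional[str] = None


def isa_chain(element):
    """Spell out the inheritance chain of an annotation."""
    classes = [cls.__name__ for cls in type(element).__mro__ if cls is not object]
    return " isa ".join(classes)


# Hypothetical example values.
probe = Probe(name="probe_1", gene="geneA", chromosome="1", locus=12345)
strain = Strain(name="strain_1", type="RIL", parent_strains=("parent_a", "parent_b"))
individual = Individual(name="ind_1", strain=strain, sex="F")

print(isa_chain(probe))
```

Code that only needs a row or column label can accept any DimensionElement, while code that needs a chromosome position can ask for a Probe or Marker specifically.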
Data and annotations
[Figure labels: DATA ELEMENTS, PROBES.]
The data model
• To share genotype/phenotype data and tools:
  • Extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
Investigation workflow in the lab
[Figure: genotype data, expression data and QTL data shown as DATA sets built from STRAIN, INDIVIDUAL, MARKER and PROBE data elements; the steps that connect them are marked "?".]
Investigation building on FuGE
[Figure: genotype data, expression data and QTL data shown as DATA sets connected by protocol applications. Labels: Affy Array, QTL Mapping, SNP Array, Affy M430 Protocol, Affy M430 platform, Bioconductor Norm., Mapping Protocol, R Software, Illumina Protocol, Illumina Bead Studio. FuGE elements: Protocol application, Protocol, Equipment, Software.]
FuGE: Jones et al., Nature Biotech 25, 1127-1133
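The FuGE-style bookkeeping above (data produced by applying protocols) can be sketched in Python as below. The class names follow the labels on the slide, but the wiring between the example protocols and data sets is an assumption, since the arrows of the figure are not recoverable, and the traversal assumes the chain has no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Protocol:
    name: str
    software: list = field(default_factory=list)
    equipment: list = field(default_factory=list)


@dataclass
class ProtocolApplication:
    """One use of a protocol, turning input data sets into output data sets."""
    protocol: Protocol
    inputs: list
    outputs: list


@dataclass
class Investigation:
    name: str
    applications: list = field(default_factory=list)

    def provenance(self, data_name):
        """List the protocols behind a data set, in discovery order."""
        steps = []
        frontier = [data_name]
        while frontier:
            current = frontier.pop()
            for application in self.applications:
                if current in application.outputs:
                    steps.append(application.protocol.name)
                    frontier.extend(application.inputs)
        return steps


# Assumed wiring, for illustration only.
hybridize = Protocol("Affy M430 Protocol", equipment=["Affy M430 platform"])
normalize = Protocol("Bioconductor Norm.", software=["R Software"])
genotype = Protocol("Illumina Protocol", software=["Illumina Bead Studio"])
mapping = Protocol("Mapping Protocol", software=["R Software"])

investigation = Investigation("example", [
    ProtocolApplication(hybridize, inputs=[], outputs=["raw expression data"]),
    ProtocolApplication(normalize, inputs=["raw expression data"], outputs=["expression data"]),
    ProtocolApplication(genotype, inputs=[], outputs=["genotype data"]),
    ProtocolApplication(mapping, inputs=["expression data", "genotype data"], outputs=["QTL data"]),
])

print(investigation.provenance("QTL data"))
```

Recording each protocol application replaces the "?" steps between data sets with an explicit, queryable trail.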
Summary of data model
[Figure: entities PROBE, MARKER, STRAIN, INDIVIDUAL, …; SUBJECT and TRAIT as dimension ELEMENTs; DATA ELEMENTs with a row and a column, grouped in DATA; INVESTIGATION; PROTOCOL APPLICATION; PROTOCOL; Software; Equipment.]
The data model
• To share genotype/phenotype data and tools:
  • Extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
References for integration
• Ontology references and database references
• Incompatible naming between INVESTIGATION 1 and INVESTIGATION 2: GENE Name = Mip1alpha vs. GENE Name = Mip1a
• Map mouse on human ontologies:
  • ONTOLOGY ENTRY: Id = 0005615, Term = ABC, Ontology = GO
  • ONTOLOGY ENTRY: Id = MP:0005385, Term = cardiovascular, Ontology = MP
• Compatible identifiers:
  • DATABASE REFERENCE: Id = ENSMUS098, Db = ENSEMBL
  • DATABASE REFERENCE: Id = ENSMU0S98, Db = ENSEMBL
  • DATABASE REFERENCE: Id = ENSMUS98, Db = ENSEMBL
  • DATABASE REFERENCE: Id = 1419561_AT, Db = AFFY 430
• Hyperlink …
FuGE: Jones et al., Nature Biotech 25, 1127-1133
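How shared database references reconcile incompatible names can be sketched in a few lines of Python. The gene names come from the slide; the accession values, the assignment of names to investigations, and the matching rule (any shared reference) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseReference:
    db: str
    accession: str


@dataclass
class Gene:
    name: str
    investigation: str
    db_refs: frozenset = frozenset()


def same_entity(a, b):
    """Treat two annotations as the same thing if they share a database reference."""
    return bool(a.db_refs & b.db_refs)


# Placeholder accessions, not real identifiers.
shared = DatabaseReference(db="ENSEMBL", accession="EXAMPLE_ID_1")
other = DatabaseReference(db="ENSEMBL", accession="EXAMPLE_ID_2")

gene_a = Gene("Mip1alpha", "Investigation 1", frozenset({shared}))
gene_b = Gene("Mip1a", "Investigation 2", frozenset({shared}))
gene_c = Gene("OtherGene", "Investigation 2", frozenset({other}))

print(same_entity(gene_a, gene_b), same_entity(gene_a, gene_c))
```

Ontology entries can be matched the same way, with the ontology name and term identifier playing the role of the database and accession.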
Summary of data model
• Extensible to more experiments…
[Figure: entities PROBE, MARKER, STRAIN, INDIVIDUAL, …; SUBJECT and TRAIT as dimension ELEMENTs; DATA ELEMENTs with a row and a column, grouped in DATA; INVESTIGATION; PROTOCOL APPLICATION; PROTOCOL; Software; Equipment; ONTOLOGY ENTRY; DATABASE REFERENCE; Hyperlink …]
Todo
• Publication: submitted
• Building a catalog of tools on top of dbGG
• Experiments: in Braunschweig and Groningen
  • Illumina, Affy, Metabolites
• Tool 'plug-ins'
  • QTL graphs, import of annotations etc.
• Exploit interoperability
  • E.g. integrate mouse & human with ontologies
  • Load annotations from other dbGG/BioMARTs
  • Build on and extend R/Taverna interaction
Summary and questions
• Share genotype/phenotype data and tools:
  • Interoperable software
    • Simple flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  • Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps
Thank you
m.a.swertz@rug.nl
Morris A. Swertz, Bruno M. Tesson, Richard A. Scheltema, Gonzalo Vera, Rudi Alberts, Damian Smedley, Katy Wolstencroft, Andrew R. Jones, Klaus Schughart, John M. Hancock, Helen E. Parkinson, Engbert O. de Brock, Carole Goble, Paul Schofield, Ritsert C. Jansen, the GEN2PHEN consortium, the CASIMIR consortium