dbGG – database for genetical genomics update

Morris Swertz (m.a.swertz@rug.nl)
CASIMIR meeting, Braunschweig, July 2, 2008

Objective • Share genotype/phenotype data and tools:

Complicated experiments
[Figure: genetical genomics workflow. Legend: main work flow, data dependency, biomaterial/result, lab/analysis process, scale of information, associated data files. Nodes: strains, genome, markers, inbreed, individuals, genotype, genotypes, map, QTL profiles, correlate, network, microarrays, probes, hybridize, expressions, preprocess, normalized expressions. Each node carries a scale figure, from tens up to millions.]

Barriers to sharing data
[Figure: the workflow diagram annotated with Collaborator 1, Collaborator 2 and Collaborator 3, and the callouts "Incompatible data!" and "Incomplete data!".]

Barriers to sharing data
[Figure: three copies of the workflow diagram, labelled Investigation 1, Investigation 2 and Investigation 3, with the callout "Incomplete and/or incompatible data!".]

Barriers to sharing software tools
[Figure: three copies of the workflow diagram.]

Barriers to sharing software tools
• Hard to find and reuse tools
[Figure: three separate "QTL profiles" boxes, each marked 10,000.]

Use a standard tool?
[Figure: the workflow diagram.]

More biotechnologies, more protocols
• Yes, if it could be easily adapted! (But standard tools can’t.)
[Figure: the workflow diagram redrawn for other technologies, with SNP arrays, LC/MS, mass peaks and aligned peaks among the nodes.]

Objectives
• Share genotype/phenotype data and tools:
  • Interoperable software
    • Simple flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  • Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps

The software

Software: flat file exchange format
• Raw and processed data in matrix form
  E.g. microarray data: rows = individuals, columns = Affy probes.

Software: flat file exchange format
• Annotation info in tabular form
  E.g. probe annotation data: rows = probes, columns = attributes of each probe.
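A minimal Python sketch of reading the two flat-file shapes described above, assuming tab-delimited text with a header row. The file contents, probe names and column names are invented for illustration and are not the actual dbGG format specification.

```python
import csv
import io

# Hypothetical data matrix: rows = individuals, columns = probes.
MATRIX_TSV = "\tprobe_1\tprobe_2\nind_A\t7.1\t5.3\nind_B\t6.8\t5.9\n"

# Hypothetical annotation table: rows = probes, columns = attributes.
ANNOTATION_TSV = "name\tgene\tchromosome\nprobe_1\tgeneX\t1\nprobe_2\tgeneY\t4\n"


def read_matrix(text):
    """Return {row_name: {column_name: float}} from a tab-delimited matrix."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    column_names = rows[0][1:]  # the first header cell is the empty corner
    return {
        row[0]: {col: float(val) for col, val in zip(column_names, row[1:])}
        for row in rows[1:]
    }


def read_annotations(text):
    """Return {probe_name: {attribute: value}} from a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {record["name"]: dict(record) for record in reader}


matrix = read_matrix(MATRIX_TSV)
annotations = read_annotations(ANNOTATION_TSV)

# Probes that appear as matrix columns but have no annotation row.
unannotated = [p for p in matrix["ind_A"] if p not in annotations]
```

Keeping the matrix and the annotation table in separate files means the same probe annotations can be reused by several data matrices.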

Software: exchange an experiment
Described on http://gbic.biol.rug.nl/dbgg
[Figure: annotations and raw and processed data, the dbGG import tool, the dbGG database and the dbGG export tool.]

Software: web user interface
http://gbicserver1.biol.rug.nl:8080/dbgg/molgenis.do

Software: interface to R

    source("http://localhost:8080/molgenis4gg/R")

    # download data
    use.experiment(name="metanetwork")  # set default
    traits <- get.metabolitedata(name="mytraits")
    genotypes <- get.markerdata(name="mygenotypes")

    # calculate mQTLs
    library("MetaNetwork")
    qtls <- qtlMapTwoPart(genotypes=genotypes, traits=traits, spike=4)

    # upload results for others to use
    add.mqtldata(qtls, name="myqtls")

[Callout on slide: inspect]
MetaNetwork protocol: Fu, Swertz, Keurentjes, Jansen, Nature Protocols, 2007.

Software: interface to Taverna
• Add dbGG interface

Software: interface to Taverna
• Use data in dbGG

This enables automatic processing (see also CASIMIR ‘use case 1’)
[Figure: workflow using dbGG.]
Smedley, Swertz, Wolstencroft et al., submitted.

Use BioMART and MOLGENIS to access data and Taverna to automate the workflows
[Figure: web services (ws) connecting gene symbols, SNPs, strain SNP alleles, pathways and your dbGG.]
Smedley, Swertz, Wolstencroft et al., submitted.

Software: extension procedure (using MOLGENIS)
Little language (domain specific language):

    <entity name="Experiment" label="Experiment">
      <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
      <field name="Medium" type="xref" xref_field="Medium.name"/>
      <field name="Protocol" label="Experiment Protocol"/>
      <field name="Temperature" type="int" ...

Domain specific language + reusable assets and generator/interpreter:
• dbGG v1: for microarrays
• dbGG v2: for mass spectrometry
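To illustrate the generator idea, here is a minimal Python sketch that reads an entity definition in the style of the slide and emits a SQL table definition. It is not the MOLGENIS generator: the closing tags in the model, the type mapping, the default type for untyped fields and the function name are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Model in the style of the slide. The closed Temperature field and the
# closing </entity> tag are assumptions; the slide listing is cut off.
MODEL_XML = """
<entity name="Experiment" label="Experiment">
  <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
  <field name="Medium" type="xref" xref_field="Medium.name"/>
  <field name="Protocol" label="Experiment Protocol"/>
  <field name="Temperature" type="int"/>
</entity>
"""

# Hypothetical mapping from model types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "xref": "INTEGER", "string": "VARCHAR(255)"}


def generate_create_table(xml_text):
    """Generate a CREATE TABLE statement from one <entity> definition."""
    entity = ET.fromstring(xml_text)
    columns = []
    for field in entity.findall("field"):
        # Assumption: a field without a type attribute is a string.
        sql_type = SQL_TYPES[field.get("type", "string")]
        column = f"  {field.get('name')} {sql_type}"
        if field.get("key") == "1":
            column += " PRIMARY KEY"
        columns.append(column)
    return f"CREATE TABLE {entity.get('name')} (\n" + ",\n".join(columns) + "\n);"


ddl = generate_create_table(MODEL_XML)
print(ddl)
```

Adding a field to the model and re-running the generator is then enough to extend the database, which is the point of the extension procedure.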

Software: extension procedure

Website: demos and downloads http://gbic.biol.rug.nl/dbgg

Outline
• To share genotype/phenotype data and tools:
  1. Interoperable software
    • Flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  2. Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps

Data
• Simple and close to current practice: genotype data
[Figure: matrix of DATA ELEMENTS with subjects (STRAINS) on one axis and traits (MARKERS) on the other: TRAIT × SUBJECT.]

Data
• Simple and close to current practice: genotype data, expression data
[Figure: matrix of DATA ELEMENTS with subjects (INDIVIDUALS) on one axis and traits (PROBES) on the other: TRAIT × SUBJECT.]

Data
• Simple and close to current practice:
  • Genotype data
  • Expression data
  • Classic phenotype data
  • Metabolite abundance data
  • Protein abundance data
  • And so on…
• All are TRAIT × SUBJECT

Data with any Dimension Type
• SUBJECT: Individual, Strain, Sample, …
• TRAIT: Probe, Marker, Mass Peak, …
• Each DATA ELEMENT is indexed by TRAIT × SUBJECT

Data
• Simple and close to current practice: what about QTL data?
[Figure: DATA matrix with traits (MARKERS) on one axis and traits (PROBES) on the other.]

Data
• Simple and close to current practice: what about QTL data? Probe association data? Interaction network data?
• These are TRAIT × TRAIT or SUBJECT × SUBJECT
[Figure: DATA matrix with traits (MARKERS) on one axis and traits (PROBES) on the other.]

Data with any Dimension Type
• Minimal data model
[Figure: DATA has rows and columns of dimension ELEMENTs; SUBJECT and TRAIT are dimension elements; each DATA ELEMENT belongs to one row and one column.]
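One way to picture the minimal data model is the Python sketch below, assuming a data matrix whose rows and columns may each hold any kind of dimension element. The class and method names are chosen for this example and are not taken from the dbGG code base.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DimensionElement:
    """Anything that can label a row or a column of a data matrix."""
    name: str


class Subject(DimensionElement):
    """E.g. an individual, strain or sample."""


class Trait(DimensionElement):
    """E.g. a probe, marker or mass peak."""


@dataclass
class DataMatrix:
    rows: list      # DimensionElement instances
    columns: list   # DimensionElement instances
    elements: dict = field(default_factory=dict)

    def set(self, row, column, value):
        if row not in self.rows or column not in self.columns:
            raise KeyError("unknown row or column")
        self.elements[(row, column)] = value

    def get(self, row, column):
        return self.elements[(row, column)]


marker, probe = Trait("marker_1"), Trait("probe_1")
strain = Subject("strain_1")

# Genotype data: TRAIT x SUBJECT.
genotypes = DataMatrix(rows=[marker], columns=[strain])
genotypes.set(marker, strain, "A")

# QTL data: TRAIT x TRAIT, stored in the same structure.
qtl = DataMatrix(rows=[probe], columns=[marker])
qtl.set(probe, marker, 3.2)
```

Because rows and columns only need to be dimension elements, the same structure holds genotype, expression and QTL data without a separate table design for each.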

The data model
• To share genotype/phenotype data and tools:
  • Extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references

Annotations
• Simple and close to current practice: probe annotations
• PROBE IS A VARIANT OF TRAIT, HAVING:
  • Name
  • Gene
  • Chromosome
  • Locus

Annotation extends Trait or Subject
SUBJECT:
• STRAIN: Name; Type (CSS, RIL, …); Parent Strains
• INDIVIDUAL: Name; Strain; Mother; Father; Sex
• SAMPLE: Name; Individual; Tissue
• And so on …
TRAIT:
• PROBE: Name; Gene; Chromosome; Locus
• MARKER: Name; Allele; Chromosome; Locus
• MASSPEAK: Name; MZ; RetentionTime
• And so on …
[Figure: SUBJECT and TRAIT are both dimension ELEMENTs, which label the rows and columns of each DATA ELEMENT.]
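The "extends" relation can be sketched with plain class inheritance. This is an illustrative Python example only: the attribute names follow the slide, while the field types and the helper function are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Trait:
    name: str


@dataclass
class Probe(Trait):
    # Annotation attributes listed on the slide; types are assumed.
    gene: Optional[str] = None
    chromosome: Optional[str] = None
    locus: Optional[float] = None


@dataclass
class MassPeak(Trait):
    mz: Optional[float] = None
    retention_time: Optional[float] = None


def annotation_columns(cls):
    """Column names an annotation table for this trait type would need."""
    return [f.name for f in fields(cls)]


probe = Probe(name="probe_1", gene="geneX", chromosome="1", locus=12.5)
probe_columns = annotation_columns(Probe)
peak_columns = annotation_columns(MassPeak)
```

A new trait type such as a mass peak only adds its own attributes; anything written against `Trait` keeps working on it.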

Annotation simple in practice
• Genotype data: each DATA ELEMENT is labelled by a STRAIN and a MARKER
• Expression data: each DATA ELEMENT is labelled by an INDIVIDUAL and a PROBE
• QTL data: each DATA ELEMENT is labelled by a MARKER and a PROBE
• Extensions are automatic “under the hood”: PROBE isa TRAIT isa DIMENSION ELEMENT

Data and annotations
[Figure: DATA ELEMENTS together with their PROBES annotations.]

Investigation workflow in the lab
[Figure: the genotype data (STRAIN, MARKER), expression data (INDIVIDUAL, PROBE) and QTL data (MARKER, PROBE) sets, each labelled DATA, with question marks on the links between them.]

Investigation building on FuGE
[Figure: genotype data, expression data and QTL data sets (DATA) linked by protocol applications. Labels: Affy Array, Affy M430 Protocol, Affy M430 platform, Bioconductor Norm., SNP Array, Illumina Protocol, Illumina Bead Studio, QTL Mapping, Mapping Protocol, R Software. FuGE concepts shown: Protocol, Protocol application, Equipment, Software.]
FuGE: Jones et al., Nature Biotech 25, 1127-1133

Summary of data model
[Figure: DATA and DATA ELEMENT with row and column dimension ELEMENTs; SUBJECT and TRAIT extended by PROBE, MARKER, STRAIN, INDIVIDUAL, …; INVESTIGATION, PROTOCOL, PROTOCOL APPLICATION, Software and Equipment.]

References for integration
• Ontology references and database references, across INVESTIGATION 1 and INVESTIGATION 2
• Incompatible naming:
  • GENE: Name = Mip1alpha
  • GENE: Name = Mip1a
• Map mouse on human ontologies:
  • ONTOLOGY ENTRY: Id = 0005615, Term = ABC, Ontology = GO
  • ONTOLOGY ENTRY: Id = MP:0005385, Term = cardiovascular, Ontology = MP
• Compatible identifiers:
  • DATABASE REFERENCE: Id = ENSMUS098, Db = ENSEMBL
  • DATABASE REFERENCE: Id = ENSMU0S98, Db = ENSEMBL
  • DATABASE REFERENCE: Id = ENSMUS98, Db = ENSEMBL
  • DATABASE REFERENCE: Id = 1419561_AT, Db = AFFY 430
• Hyperlink …
FuGE: Jones et al., Nature Biotech 25, 1127-1133
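A minimal Python sketch of how shared database references could reconcile records that are named differently in two investigations. The identifiers are placeholders, not real accessions, and the matching rule (at least one shared reference) is an assumption for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatabaseReference:
    db: str
    id: str


@dataclass
class Gene:
    name: str
    investigation: str
    references: set = field(default_factory=set)


def same_entity(a, b):
    """Treat two records as the same if they share a database reference."""
    return bool(a.references & b.references)


# Placeholder identifiers, invented for this example.
shared = DatabaseReference(db="ENSEMBL", id="HYPOTHETICAL_ID_1")
probe_ref = DatabaseReference(db="AFFY 430", id="HYPOTHETICAL_PROBE_1")

gene1 = Gene("Mip1alpha", "Investigation 1", {shared})
gene2 = Gene("Mip1a", "Investigation 2", {shared, probe_ref})
other = Gene("OtherGene", "Investigation 2")

names_differ = gene1.name != gene2.name
matched = same_entity(gene1, gene2)
```

The gene names never have to agree; only the external identifiers do, which is what makes data from separate investigations comparable.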

Summary of data model
[Figure: DATA and DATA ELEMENT with row and column dimension ELEMENTs; SUBJECT and TRAIT extended by PROBE, MARKER, STRAIN, INDIVIDUAL (extensible to more experiments…); INVESTIGATION, PROTOCOL, PROTOCOL APPLICATION, Software and Equipment; ONTOLOGY ENTRY, DATABASE REFERENCE and Hyperlink ….]

What is on the todo

Todo
• Publication: submitted
• Building a catalog of tools on top of dbGG
• Experiments: in Braunschweig and Groningen
  • Illumina, Affy, Metabolites
• Tool ‘plug-ins’
  • QTL graphs, import of annotations etc.
• Exploit interoperability
  • E.g. integrate mouse & human with ontologies
  • Load annotations from other dbGG/BioMARTs
  • Build on and extend R/Taverna interaction

Summary and questions
• Share genotype/phenotype data and tools:
  • Interoperable software
    • Simple flat file exchange format
    • Database server
    • R/web-service interfaces
    • A procedure to extend the software
  • Build on extensible data model
    • Data
    • Annotations
    • Investigations
    • Integration references
• Next steps

Thank you
Morris A. Swertz (m.a.swertz@rug.nl), Bruno M. Tesson, Richard A. Scheltema, Gonzalo Vera, Rudi Alberts, Damian Smedley, Katy Wolstencroft, Andrew R. Jones, Klaus Schughart, John M. Hancock, Helen E. Parkinson, Engbert O. de Brock, Carole Goble, Paul Schofield, Ritsert C. Jansen, the GEN2PHEN consortium, the CASIMIR consortium

Appendix: Procedure to (re)generate a MOLGENIS

MOLGENIS for data
