MOLGENIS – the cyber infrastructure with a “dial” to tune it to your research.

Morris Swertz (m.a.swertz@rug.nl). Utrecht, BioAssist meeting, September 19, 2008

Where do I come from • MSc technology management, specialized in IT • Thesis on federated databases • One-man company • Information systems • PhD bioinformatics • “dynamic software infrastructures for the life sciences” • Now at Medical Genetics, University Medical Center Groningen

Ongoing work • BioBank platform leader • Cohort data, clinical phenotypes • Locus Specific databases • Molecular data (overlap with other platforms) • HTP genotype and phenotype experiments • QTL, GWAS • EU projects CASIMIR and GEN2PHEN • NPC workpackage 1 • Developing a platform for proteomics research • See Martijn. http://www.molgenis.org

Outline of talk • What is MOLGENIS? • Concept • Simple example • Practical examples • How to create a proper data model • Existing databases + Taverna • Hands-on session • Generate your first MOLGENIS • Plug-ins • Import/export

MOLGENIS Concepts and methods

What is MOLGENIS? Quotes: • “It is a holistic bio-database in a box, which can fit any data” • “It is the database that comes with a dial, to tune it to your research” • “It is the database where you program one feature, and then get many features for free”

Cyber infrastructure? [Figure: components of cyber infrastructure that connect researchers and bioinformaticians: user interaction infrastructure, communication infrastructure, data infrastructure, processing infrastructure.] Components of cyber infrastructure, Stein (2008) Nature Reviews Genetics 9: 678-688

Sharing data and reusing tools • I still want to generate my own flavor (incl. existing software) • Have free access to more resources via standard interfaces • … IS “my” + Ontology tools • … IS “my” + processing tools • … IS “my” + workflows

What does this mean in practice…

Large scale biology needs IT support • Large datasets • Dozens of samples • Processing • Complex relationships

A website for experiments Swertz et al (2004) Bioinformatics 20, 2075-83

The challenge: biosoftware is hard [Figure: biologists depend on bioinformaticians and software engineers (€) to support a workflow that runs from strains, individuals, and genome markers (inbreed, genotype) via microarrays and probes (hybridize, preprocess into normalized expressions) to QTL profiles and a network (map, correlate), with data volumes growing from tens of strains to millions of genotypes and expression values.] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

Biologist needs change [Figure: the same workflow after the research changes: SNP arrays replace the genome markers, and LC/MS mass peaks (preprocessed into aligned peaks) replace the microarray expressions.] • Then “we” need to reinvent the wheel … • … but we were lazy Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

Strategy: a flexible platform [Figure: the bioinformatician and the software engineer separate the “What?” (a blueprint model written in a little language) from the “How?” (a software factory), labelled 1x and ∞x. Blueprint model + software factory produce the software for the biologists’ workflow (strains, individuals, genome markers, genotypes, microarrays, probes, expressions, QTL profiles, network).] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
Little language example (blueprint model, truncated on the slide):
    <entity name="Experiment" label="Experiment">
      <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
      <field name="Medium" type="xref" xref_field="Medium.name"/>
      <field name="Protocol" label="Experiment Protocol"/>
      <field name="Temperature" type="int" …

“Dial” to new research [Figure: the same little language blueprint model (the Experiment entity example) and software factory, with the model tuned to the changed workflow: SNP arrays and LC/MS mass peaks (preprocessed into aligned peaks) instead of genome markers and microarrays.] http://www.molgenis.org Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

Upgrade to new software tools [Figure: the same little language blueprint model (the Experiment entity example) is kept while the software factory is upgraded; the generated software still serves the biologists’ workflow with SNP arrays and LC/MS mass peaks.] http://www.molgenis.org Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243


MOLGENIS: the software family [Figure: example process model for microarray work. Array design and production: amplicon design, plate synthesis, array batch production, control scans, with files such as genemap (*gbk), well description, spotter settings, spot description, and chip layout. Array experiments: sampling, RNA extraction, cDNA labeling, hybridisation, array measurement, with quant. file, hires scan, and grid file. Legend: process, biomaterial, data file.] • Family members: Proteomics, Genetical Genomics, Microarray, … • Instances: Illumina arrays on mouse, Affy arrays on mouse, Qiagen arrays on C. elegans, LC-MS on A. thaliana • Sharing components and integration are easier because all MOLGENIS instances have standard generated interfaces

More projects use these concepts: for processing, and for data and UI. Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

Basics

Open source biobase generator • Download for free at http://www.molgenis.org • Works with Java, MySQL, Tomcat, and Eclipse on Windows, Linux, and Mac

Example 1: a MOLGENIS from scratch [Figure: three entities: probes, individuals, expressions]

Little OM language • Object model language: entities, fields, xrefs • It is a ‘contract’, as Machiel explained
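To make the idea of generating software from entities, fields, and xrefs concrete, here is a minimal sketch in Python. It is not the MOLGENIS generator. The attribute names (name, type, key, xref_field, label) are taken from the Experiment example in the slides; the `<molgenis>` root element, the Medium entity, the default field type, and the SQL type mapping are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical blueprint in the style of the slides' little language.
# The <molgenis> root element and the Medium entity are assumptions.
MODEL_XML = """
<molgenis name="demo">
  <entity name="Medium">
    <field name="name" key="1"/>
  </entity>
  <entity name="Experiment" label="Experiment">
    <field name="ExperimentID" key="1" readonly="true"/>
    <field name="Medium" type="xref" xref_field="Medium.name"/>
    <field name="Protocol" label="Experiment Protocol"/>
    <field name="Temperature" type="int"/>
  </entity>
</molgenis>
"""

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "string": "VARCHAR(255)", "xref": "VARCHAR(255)"}


def generate_ddl(model_xml: str) -> list[str]:
    """Return one CREATE TABLE statement per entity in the blueprint."""
    statements = []
    for entity in ET.fromstring(model_xml).findall("entity"):
        columns, constraints = [], []
        for field in entity.findall("field"):
            name = field.get("name")
            ftype = field.get("type", "string")  # assumed default type
            columns.append(f"{name} {SQL_TYPES[ftype]}")
            if field.get("key") == "1":
                constraints.append(f"PRIMARY KEY ({name})")
            if ftype == "xref":
                # An xref points at "Entity.field" and becomes a foreign key.
                target_entity, target_field = field.get("xref_field").split(".")
                constraints.append(
                    f"FOREIGN KEY ({name}) REFERENCES {target_entity}({target_field})"
                )
        body = ", ".join(columns + constraints)
        statements.append(f"CREATE TABLE {entity.get('name')} ({body})")
    return statements


ddl = generate_ddl(MODEL_XML)
for statement in ddl:
    print(statement)
```

The model stays a few lines long while the generator can be reused for every new model, which is the division of labour the slides describe.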

Little OM language

Little UI language

Generate online or in Eclipse http://gbic.biol.rug.nl/supplementary/2007/molgenis_showcase

Result: Java code http://gbic.biol.rug.nl/supplementary/2007/molgenis_showcase

Under the hood • Flow: DSL file, customizing, generate • Layers: MyScript; GUI; APIs in Java, R, Web services and HTTP; DB in MySQL or HSQL • Generators: FormGen, TreeGen, MenuGen, PluginGen, MatrixGen, JDBCMapGen, JTypeGen, JReadCsvGen, JListGen, RListGen, JDatabaseGen, RMatrixGen, HSQLGen, WSGen, MySQLGen • Simple query: Marker.Find() • Complex query: Select id, name, type from Item natural join Trait natural join …

Result: many features for free • These are the ‘implementations of contracts’ that Machiel talked about: Java API, SOAP API, R API, tab-delimited API, database tables
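As an illustration of one such "free" feature, the sketch below shows how a tab-delimited import and export could be driven entirely by a model description. This is a hedged Python sketch, not the generated MOLGENIS code (which the slides say is Java); the `MARKER_FIELDS` model and both function names are hypothetical.

```python
import csv
import io

# Hypothetical model: field name -> Python type, as a generator might
# derive it from an entity description. Names are illustrative only.
MARKER_FIELDS = {"name": str, "chromosome": str, "locus": float}


def read_tab_delimited(text: str, fields: dict) -> list[dict]:
    """Parse tab-delimited text into typed records, checking the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = set(fields) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{name: cast(row[name]) for name, cast in fields.items()} for row in reader]


def write_tab_delimited(records: list[dict], fields: dict) -> str:
    """Write records back to tab-delimited text in model field order."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(fields), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


text = "name\tchromosome\tlocus\nm1\t1\t12.5\nm2\tX\t3.0\n"
markers = read_tab_delimited(text, MARKER_FIELDS)
round_trip = write_tab_delimited(markers, MARKER_FIELDS)
```

Because both functions only look at the field description, the same code serves any entity; adding an entity to the model adds its import and export without extra programming.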

Software: interface to R
    source("http://localhost:8080/molgenis4gg/R")
    # download data
    use.experiment(name = "metanetwork")  # set default
    traits <- get.metabolitedata(name = "mytraits")
    genotypes <- get.markerdata(name = "mygenotypes")
    # calculate mQTLs
    library("MetaNetwork")
    qtls <- qtlMapTwoPart(genotypes = genotypes, traits = traits, spike = 4)
    # upload results for others to use
    add.mqtldata(qtls, name = "myqtls")
MetaNetwork protocol: Fu, Swertz, Keurentjes, Jansen, Nature Protocols, 2007.

Incl documentation 

Applications

Ongoing projects • Long projects: Microarray experiments (MOLGEN group, Groningen); Genotypes and phenotypes (Rudi Alberts, Braunschweig); CILAIR, first NPC pilot (Martijn Dijkstra, Isthiaq Ahmad, Groningen); Animal Observatory (Ate Boerema, Groningen); Peptide and pathways (Arjen Strijkstra, Groningen) • Recently started: FINDIS database (Juha Muilu, Finland); MAGE-TAB (Helen Parkinson, EBI); Human variome + BioSQL (Gudmundur Thorisson, Leicester) • More soon? Metabolomics (Floris Sluiter); Chado (Victor de Jager); NPC pilot 2 (Don de Lange); HTP sequencing…

Case 1: A realistic model. xGaP – the extensible genotype and phenotype database

Objective: integrated genetic study of (molecular) phenotypes • Challenge: various experimental designs • Flavors of QTL, GWAS, knockouts, etc. • Array, MassSpec, Markers, SNPs, etc. • Human, Mouse, Worm, Plant, etc. • “Standard” and extensible data representation • Ontology enabled, and in collaboration with other organizations like FuGE, MIQAS, PaGE-OM, OBO • “Standard” cyber infrastructure • Format for exchange, e.g. XML or TAB formatted • Data management and searching, e.g. using MySQL • Communication, e.g. using web services • Processing, e.g. using R

Integration of data, reuse of algorithms • It’s a genotype and phenotype database ‘in-a-box’ • xGaP + Ontology tools • xGaP + processing tools • xGaP + workflows

Cyber infrastructure [Figure: xGaP placed in the components of cyber infrastructure that connect researchers and bioinformaticians: user interaction infrastructure, communication infrastructure, data infrastructure, processing infrastructure.] Components of cyber infrastructure, Stein (2008) Nature Reviews Genetics 9: 678-688

Towards a real model:

Basic data? • Raw and processed data in matrix form [Figure: genotype data as a matrix of data elements, with traits (MARKERS) on one axis and subjects (STRAINS) on the other: TRAIT × SUBJECT]

Minimal and simple data model: TRAIT × SUBJECT [Figure: a DATA ELEMENT is identified by a TRAIT (rows) and a SUBJECT (columns)]

Too simple? • What about QTL data? • Probe association data? • Interaction network data? [Figure: a data matrix with traits (MARKERS) on one axis and traits (PROBES) on the other: TRAIT × TRAIT! And SUBJECT × SUBJECT?]

Minimal and simple data model: TRAIT × SUBJECT, TRAIT × TRAIT, SUBJECT × SUBJECT [Figure: SUBJECT and TRAIT are both kinds of DIMENSION ELEMENT; a DATA ELEMENT is identified by one dimension element for its row and one for its column]
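The generalisation from TRAIT × SUBJECT to any pair of dimension elements can be sketched as follows. This is an illustrative Python sketch, not the xGaP schema: the class and method names are assumptions, and only the concepts (dimension element, subject, trait, data element, rows, columns) come from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DimensionElement:
    """Anything that can label a row or a column of a data matrix."""
    name: str


class Subject(DimensionElement):
    pass


class Trait(DimensionElement):
    pass


@dataclass
class DataMatrix:
    """Data elements addressed by (row, column) dimension elements."""
    rows: list
    columns: list
    values: dict = field(default_factory=dict)

    def set(self, row, column, value):
        if row not in self.rows or column not in self.columns:
            raise KeyError("row or column is not part of this matrix")
        self.values[(row, column)] = value

    def get(self, row, column):
        return self.values[(row, column)]

    def kind(self):
        """Describe the matrix as e.g. 'TRAIT x SUBJECT'."""
        row_type = type(self.rows[0]).__name__.upper()
        column_type = type(self.columns[0]).__name__.upper()
        return f"{row_type} x {column_type}"


# Genotype data: markers (traits) by strains (subjects).
marker, strain = Trait("marker1"), Subject("strain1")
genotypes = DataMatrix(rows=[marker], columns=[strain])
genotypes.set(marker, strain, "A")

# QTL data: markers (traits) by probes (traits), same matrix class.
probe = Trait("probe1")
qtl = DataMatrix(rows=[marker], columns=[probe])
qtl.set(marker, probe, 3.2)
```

The same matrix class holds genotype data and QTL data; only the types of the row and column labels differ.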

Annotation information… of many types? [Figure: the genetical genomics workflow (strains, individuals, genome markers, genotypes, microarrays, probes, expressions, normalized expressions, QTL profiles, network; steps inbreed, genotype, hybridize, preprocess, map, correlate) annotated with a legend: main work flow, data dependency, biomaterial/result, lab/analysis process, scale of information, associated data files.]

Systematic extension mechanism • SUBJECT (a DIMENSION ELEMENT) is extended by: STRAIN (Name; Type: CSS, RIL..; Parent Strains), INDIVIDUAL (Name, Strain, Mother, Father, Sex), SAMPLE (Name, Individual, Tissue), and so on … • TRAIT (a DIMENSION ELEMENT) is extended by: PROBE (Name, Gene, Chromosome, Locus), MARKER (Name, Allele, Chromosome, Locus), MASSPEAK (Name, MZ, RetentionTime), and so on …
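The extension mechanism maps naturally onto inheritance. The Python sketch below assumes that each extension adds its own fields on top of the shared Name field; the field names follow the slide (only a subset is shown), while the class layout, defaults, and the `describe` helper are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DimensionElement:
    name: str


@dataclass
class Subject(DimensionElement):
    pass


@dataclass
class Trait(DimensionElement):
    pass


@dataclass
class Strain(Subject):
    type: str = ""              # e.g. "CSS" or "RIL"
    parent_strains: tuple = ()


@dataclass
class Individual(Subject):
    strain: Optional[Strain] = None
    sex: str = ""


@dataclass
class Marker(Trait):
    allele: str = ""
    chromosome: str = ""
    locus: float = 0.0


@dataclass
class MassPeak(Trait):
    mz: float = 0.0
    retention_time: float = 0.0


def describe(element) -> list[str]:
    """List the field names of an element, inherited fields first."""
    return [f.name for f in fields(element)]


peak = MassPeak(name="peak1", mz=301.1, retention_time=12.4)
mouse = Individual(name="m42", strain=Strain(name="BXD1", type="RIL"), sex="F")
```

A new kind of research (for example LC/MS instead of microarrays) then only needs a new subclass such as `MassPeak`; code written against `Trait` or `Subject` keeps working.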

What about experimental design? • Using FuGE data elements: [Figure: chain of DATA sets connected by protocol applications. SNP array data (Illumina protocol, Illumina equipment, Bead Studio software) leads to genotype data; Affy array data (Affy M430 protocol, Affy M430 platform, Bioconductor normalization) leads to expression data; QTL mapping (mapping protocol, R software) leads to QTL data.] FuGE: Jones et al Nature Biotech 25, 1127-1133

Ontology enabled • Standard descriptions (semantics) are also essential for integration, next to standard structure (syntax) [Figure: INVESTIGATION 1 and INVESTIGATION 2 connected by hyperlinks. Incompatible naming: GENE Name = Mip1alpha versus GENE Name = Mip1a. Map mouse on human ontologies: ONTOLOGY ENTRY (Id = 0005615, Term = ABC, Ontology = GO) and ONTOLOGY ENTRY (Id = MP:0005385, Term = cardiovascular, Ontology = MP). Compatible identifiers: DATABASE REFERENCE entries with Db = ENSEMBL (Id = ENSMUS098, ENSMU0S98, ENSMUS98) and Db = AFFY 430 (Id = 1419561_AT).] FuGE: Jones et al Nature Biotech 25, 1127-1133
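A small sketch of why compatible identifiers help: two investigations that name the same gene differently can still be linked when both carry the same database reference. The Python code below is illustrative only; the record layout, the function name, and the identifiers (`EXAMPLE01` and so on) are placeholders, not real accessions or the FuGE object model.

```python
from collections import defaultdict

# Hypothetical annotation records. The gene names follow the slide;
# the identifiers are placeholders, not real database accessions.
investigation_1 = [
    {"gene": "Mip1alpha", "db_refs": {("ENSEMBL", "EXAMPLE01")}},
]
investigation_2 = [
    {"gene": "Mip1a", "db_refs": {("ENSEMBL", "EXAMPLE01"), ("AFFY 430", "EXAMPLE_AT")}},
    {"gene": "OtherGene", "db_refs": {("ENSEMBL", "EXAMPLE02")}},
]


def link_by_reference(left: list, right: list) -> list:
    """Pair records from two investigations that share a database reference."""
    index = defaultdict(list)
    for record in right:
        for ref in record["db_refs"]:
            index[ref].append(record["gene"])
    links = []
    for record in left:
        for ref in sorted(record["db_refs"]):
            for other in index.get(ref, []):
                links.append((record["gene"], other, ref))
    return links


links = link_by_reference(investigation_1, investigation_2)
```

Matching on the gene name alone would miss the pair, because "Mip1alpha" and "Mip1a" differ as strings; the shared reference makes the link explicit.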

• Standard structure to ease sharing of data and tools • Standard extension mechanism for new research

Using the generator again… [Figure: the bioinformatician and software engineer feed the little language blueprint model (the Experiment entity example) into the software factory, which generates the software for the biologists’ workflow (strains, individuals, genome markers, genotypes, microarrays, probes, expressions, QTL profiles, network).] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
