MOLGENIS – the cyber infrastructure with a “dial” to tune it to your research. Morris Swertz (m.a.swertz@rug.nl) Utrecht, BioAssist meeting, September 19, 2008
Where do I come from • MSc technology management, specialized in IT • Thesis on federated databases • One-man company • Information systems • PhD bioinformatics • “dynamic software infrastructures for the life sciences” • Now at Medical Genetics, University Medical Center Groningen
Ongoing work • BioBank platform leader • Cohort data, clinical phenotypes • Locus Specific databases • Molecular data (overlap with other platforms) • HTP genotype and phenotype experiments • QTL, GWAS • EU projects CASIMIR and GEN2PHEN • NPC workpackage 1 • Developing a platform for proteomics research • See Martijn. http://www.molgenis.org
Outline of talk • What is MOLGENIS? • Concept • Simple example • Practical examples • How to create a proper data model • Existing databases + Taverna • Hands-on session • Generate your first MOLGENIS • Plug-ins • Import/export
MOLGENIS Concepts and methods
What is MOLGENIS • Quotes: • “It is a holistic bio-database in a box, which can fit any data” • “It is the database that comes with a dial, to tune it to your research” • “It is the database where you program one feature, and then get many features for free”
Cyber infrastructure? [Figure: components of cyber infrastructure: user interaction infrastructure, communication infrastructure, data infrastructure and processing infrastructure, used by researchers and bioinformaticians.] Components of cyber infrastructure, Stein (2008) Nature Reviews Genetics 9: 678-688
Sharing data and reusing tools • I still want to generate my own flavor (incl. existing software) • Have free access to more resources via standard interfaces • … IS “my” + ontology tools • … IS “my” + processing tools • … IS “my” + workflows
Large scale biology needs IT support • Large datasets • Dozens of samples • Processing • Complex relationships
A website for experiments Swertz et al (2004) Bioinformatics 20, 2075-83
Biosoftware is hard • The challenge: [Figure: a genetical genomics experiment that bioinformaticians and software engineers support for biologists. Items: strains, genome markers, inbreed, individuals, genotype, genotypes, map, QTL profiles, microarrays, probes, hybridize, expressions, preprocess, normalized expressions, correlate, network. Scale annotations run from tens to millions of items.] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
… biologist needs change • Then “we” need to: • …reinvent the wheel • …but we were lazy [Figure: the workflow with new technologies. Items: strains, SNP arrays, inbreed, individuals, genotype, genotypes, map, QTL profiles, LC/MS, mass peaks, preprocess, aligned peaks, correlate, network; roles: biologist, bioinformatician, software engineers.] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
Strategy: a flexible platform • What? • How? • 1x • ∞x
Little language / blueprint model (bioinformatician, software engineer), truncated on the slide:
    <!-- entity organization -->
    <entity name="Experiment" label="Experiment">
      <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
      <field name="Medium" type="xref" xref_field="Medium.name"/>
      <field name="Protocol" label="Experiment Protocol"/>
      <field name="Temperature" type="int" …
[Figure: blueprint model + software factory deliver software to the biologists for the workflow: strains, genome markers, individuals, genotypes, QTL profiles, microarrays, probes, expressions, normalized expressions, network.] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
“dial” to new research
Little language / blueprint model (bioinformatician, software engineer), truncated on the slide:
    <!-- entity organization -->
    <entity name="Experiment" label="Experiment">
      <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
      <field name="Medium" type="xref" xref_field="Medium.name"/>
      <field name="Protocol" label="Experiment Protocol"/>
      <field name="Temperature" type="int" …
[Figure: blueprint model + software factory deliver software to the biologists for the new workflow: strains, SNP arrays, individuals, genotypes, QTL profiles, LC/MS, mass peaks, aligned peaks, network.] http://www.molgenis.org Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
Upgrade to new software tools
Little language / blueprint model (bioinformatician, software engineer), truncated on the slide:
    <!-- entity organization -->
    <entity name="Experiment" label="Experiment">
      <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
      <field name="Medium" type="xref" xref_field="Medium.name"/>
      <field name="Protocol" label="Experiment Protocol"/>
      <field name="Temperature" type="int" …
[Figure: blueprint model + software factory deliver software to the biologists for the workflow: strains, SNP arrays, individuals, genotypes, QTL profiles, LC/MS, mass peaks, aligned peaks, network.] http://www.molgenis.org Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
MOLGENIS: the software family … • Families: Proteomics, Genetical Genomics, Microarray • Instances: Illumina arrays on mouse, Affy arrays on mouse, Qiagen arrays on C. elegans, LC-MS on A. thaliana • Software factory + sharing components; easier to integrate because all MOLGENIS instances have standard generated interfaces [Figure: microarray workflow in two parts, Array Production (amplicon design, plate synthesis, array design & production) and Array Experiments (sampling, RNA extraction, cDNA labeling, hybridisation, array measurement), with a legend for process, biomaterial and data file.]
More projects use these concepts • for processing • for data and UI Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
Open source biobase generator • Download for free at http://www.molgenis.org • Works with Java, MySQL, Tomcat and Eclipse, on Windows, Linux and Mac
Example 1: a MOLGENIS from scratch probes individuals expressions
Little OM language • Object model language: • Entities • Fields • Xrefs • It is a ‘contract’, as Machiel explained
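A minimal sketch of how such an object model could be read into entities, fields and xrefs. It assumes a small model file in the style of the blueprint fragment on the “Strategy” slide; the `<molgenis>` root element, the `Medium` entity, and the helpers `parse_model` and `list_xrefs` are illustrative assumptions, and Python is used only for brevity (MOLGENIS itself generates Java).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical model file; the root element and the Medium entity are assumptions.
MODEL_XML = """
<molgenis name="example">
  <entity name="Medium" label="Medium">
    <field name="name" key="1"/>
  </entity>
  <entity name="Experiment" label="Experiment">
    <field name="ExperimentID" key="1" readonly="true"/>
    <field name="Medium" type="xref" xref_field="Medium.name"/>
    <field name="Temperature" type="int"/>
  </entity>
</molgenis>
"""


@dataclass
class Field:
    name: str
    type: str = "string"  # assumed default when the model gives no type
    xref_field: Optional[str] = None  # target "Entity.field" when type is "xref"


@dataclass
class Entity:
    name: str
    fields: List[Field] = field(default_factory=list)


def parse_model(xml_text: str) -> List[Entity]:
    """Read all entities and their fields from the model XML."""
    root = ET.fromstring(xml_text.strip())
    entities = []
    for entity_node in root.findall("entity"):
        fields = [
            Field(f.get("name"), f.get("type", "string"), f.get("xref_field"))
            for f in entity_node.findall("field")
        ]
        entities.append(Entity(entity_node.get("name"), fields))
    return entities


def list_xrefs(entities: List[Entity]) -> List[Tuple[str, str, str]]:
    """Return (entity, field, target) for every cross-reference in the model."""
    return [
        (e.name, f.name, f.xref_field)
        for e in entities
        for f in e.fields
        if f.type == "xref"
    ]
```

Calling `list_xrefs(parse_model(MODEL_XML))` lists the cross-references, which is the part of the ‘contract’ that links one entity to another.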
Generate online or in Eclipse http://gbic.biol.rug.nl/supplementary/2007/molgenis_showcase
Result: Java code http://gbic.biol.rug.nl/supplementary/2007/molgenis_showcase
Under the hood • DSL file, Customizing..., Generate, MyScript • GUI: FormGen, TreeGen, MenuGen, PluginGen, MatrixGen • APIs in Java, R, Web services and HTTP: JDBCMapGen, JTypeGen, JReadCsvGen, JListGen, JDatabaseGen, RListGen, RMatrixGen, WSGen • DB in MySQL or HSQL: MySQLGen, HSQLGen • Simple: Marker.Find() • Complex: Select id, name, type from Item natural join Trait natural join …
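The generator idea can be sketched in a few lines: one model, several small generators, each emitting a different artifact. This is an illustration only, not the MOLGENIS generator code; the in-memory `MODEL`, the type mapping and both functions are assumptions, and Python stands in for the Java, R and SQL outputs named on the slide.

```python
# Hypothetical in-memory model: entity name -> list of (field name, field type).
MODEL = {
    "Medium": [("name", "string")],
    "Experiment": [
        ("ExperimentID", "int"),
        ("Medium", "string"),
        ("Temperature", "int"),
    ],
}

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "string": "VARCHAR(255)"}


def generate_sql(model):
    """Emit one CREATE TABLE statement per entity."""
    statements = []
    for entity, fields in model.items():
        columns = ",\n".join(
            f"  {name} {SQL_TYPES[ftype]}" for name, ftype in fields
        )
        statements.append(f"CREATE TABLE {entity} (\n{columns}\n);")
    return "\n\n".join(statements)


def generate_class(entity, fields):
    """Emit source code for a plain data class with one attribute per field."""
    args = ", ".join(name for name, _ in fields)
    body = "\n".join(f"        self.{name} = {name}" for name, _ in fields)
    return f"class {entity}:\n    def __init__(self, {args}):\n{body}\n"
```

Both generators read the same model, so adding a field to `MODEL` changes the table definition and the class together, which is the point of describing the “what” once.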
Result: many features for free • These are the ‘implementations of contracts’ that Machiel talked about: • Java API • SOAP API • R API • Tab delimited API • Database tables
Software: interface to R
    source("http://localhost:8080/molgenis4gg/R")
    #download data
    use.experiment(name="metanetwork") #set default
    traits <- get.metabolitedata(name="mytraits")
    genotypes <- get.markerdata(name="mygenotypes")
    #calculate mQTLs
    library("MetaNetwork")
    qtls <- qtlMapTwoPart(genotypes=genotypes, traits=traits, spike=4)
    #upload results for others to use
    add.mqtldata(qtls, name="myqtls")
[Figure label: inspect] MetaNetwork protocol: Fu, Swertz, Keurentjes, Jansen, Nature Protocols, 2007.
Ongoing projects • Long projects: Microarray experiments, MOLGEN group, Groningen; Genotypes and phenotypes, Rudi Alberts, Braunschweig; CILAIR, first NPC pilot, Martijn Dijkstra, Isthiaq Ahmad, Groningen; Animal Observatory, Ate Boerema, Groningen; Peptide and pathways, Arjen Strijkstra, Groningen • Recently started: FINDIS database, Juha Muilu, Finland; MAGE-TAB, Helen Parkinson, EBI; Human variome + BioSQL, Gudmundur Thorisson, Leicester • More soon? Metabolomics, Floris Sluiter; Chado, Victor de Jager; NCP pilot 2, Don de Lange; HTP sequencing…
Case 1: A realistic model xGaP – the extensible genotype and phenotype database
Objective • Integrated genetic study of (molecular) phenotypes • Challenge: various experimental designs • Flavors of QTL, GWAS, knockouts, etc. • Array, MassSpec, Markers, SNPs, etc. • Human, Mouse, Worm, Plant, etc. • “Standard” and extensible data representation • Ontology enabled, and in collaboration with other organizations like FuGE, MIQAS, PaGE-OM, OBO • “Standard” cyber infrastructure • Format for exchange, e.g. XML or TAB formatted • Data management and searching, e.g. using MySQL • Communication, e.g. using web services • Processing, e.g. using R
Integration of data, reuse of algorithms • It’s a genotype and phenotype database ‘in-a-box’ xGaP + Ontology tools xGaP + processing tools xGaP + workflows
Cyber infrastructure [Figure: xGaP placed within the components of cyber infrastructure: user interaction infrastructure, communication infrastructure, data infrastructure and processing infrastructure, used by researchers and bioinformaticians.] Components of cyber infrastructure, Stein (2008) Nature Reviews Genetics 9: 678-688
Basic data? • Raw and processed data in matrix form [Figure: genotype data as a matrix of DATA ELEMENTS, labeled “Subjects: STRAINS” and “Traits: MARKERS”; each data element links a TRAIT to a SUBJECT.]
Minimal and simple data model • DATA ELEMENT: one cell of the matrix • columns: SUBJECT • rows: TRAIT
Too simple? • What about QTL data? • Probe association data? • Interaction network data? [Figure: a DATA matrix labeled “Traits: MARKERS” and “Traits: PROBES”, annotated TRAIT × TRAIT! and SUBJECT × SUBJECT?]
Minimal and simple data model • TRAIT and SUBJECT are both kinds of dimension ELEMENT • A DATA ELEMENT refers to its row and its column, each a dimension ELEMENT • So TRAIT × SUBJECT, TRAIT × TRAIT and SUBJECT × SUBJECT matrices all fit the same model
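A minimal sketch of this data model, assuming that every matrix cell is addressed by one row element and one column element, and that traits and subjects differ only in a `kind` label. The class names, the `kind` attribute and the example values are illustrative assumptions, not the xGaP schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DimensionElement:
    name: str
    kind: str  # "trait" or "subject" (assumed labels)


class DataMatrix:
    """Data elements stored per (row, column) pair of dimension elements."""

    def __init__(self, rows: List[DimensionElement], columns: List[DimensionElement]):
        self.rows = list(rows)
        self.columns = list(columns)
        self.values: Dict[Tuple[str, str], object] = {}

    def set(self, row: DimensionElement, column: DimensionElement, value) -> None:
        if row not in self.rows or column not in self.columns:
            raise KeyError("row or column is not part of this matrix")
        self.values[(row.name, column.name)] = value

    def get(self, row: DimensionElement, column: DimensionElement):
        return self.values.get((row.name, column.name))


# Hypothetical elements: a marker and a probe are traits, a strain is a subject.
marker = DimensionElement("marker1", "trait")
probe = DimensionElement("probe1", "trait")
strain = DimensionElement("strain1", "subject")

# Genotype data: TRAIT rows x SUBJECT columns.
genotypes = DataMatrix(rows=[marker], columns=[strain])
genotypes.set(marker, strain, "A")

# QTL data: TRAIT rows x TRAIT columns, stored in the same structure.
qtl = DataMatrix(rows=[probe], columns=[marker])
qtl.set(probe, marker, 3.2)
```

Because rows and columns share one element type, the QTL matrix needs no new structure, which is the answer to the “too simple?” question.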
Annotation information…of many types? [Figure: the experiment workflow with a legend: main work flow, data dependency, biomaterial/result, lab/analysis process, scale of information, associated data files. Items: strains, genome markers, inbreed, individuals, genotype, genotypes, map, QTL profiles, hybridize, microarrays, probes, expressions, preprocess, normalized expressions, correlate, network.]
Systematic extension mechanism • SUBJECT (a dimension ELEMENT) extends to: • STRAIN: Name; Type: CSS, RIL..; Parent Strains • INDIVIDUAL: Name; Strain; Mother; Father; Sex • SAMPLE: Name; Individual; Tissue • And so on … • TRAIT (a dimension ELEMENT) extends to: • PROBE: Name; Gene; Chromosome; Locus • MARKER: Name; Allele; Chromosome; Locus • MASSPEAK: Name; MZ; RetentionTime • And so on … • Each DATA ELEMENT refers to a row and a column dimension ELEMENT
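The extension mechanism maps naturally onto inheritance. The sketch below is an assumption about how it could look, with attribute names taken from the slide and everything else (Python dataclasses, field types, defaults, the `describe` helper) chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DimensionElement:
    name: str


@dataclass
class Subject(DimensionElement):
    pass


@dataclass
class Trait(DimensionElement):
    pass


@dataclass
class Strain(Subject):
    strain_type: str = ""  # e.g. CSS, RIL
    parent_strains: Tuple[str, ...] = ()


@dataclass
class Individual(Subject):
    strain: Optional[str] = None
    mother: Optional[str] = None
    father: Optional[str] = None
    sex: Optional[str] = None


@dataclass
class Sample(Subject):
    individual: Optional[str] = None
    tissue: Optional[str] = None


@dataclass
class Marker(Trait):
    allele: Optional[str] = None
    chromosome: Optional[str] = None
    locus: Optional[int] = None


@dataclass
class MassPeak(Trait):
    mz: float = 0.0
    retention_time: float = 0.0


def describe(element: DimensionElement) -> str:
    """Generic code only needs to know whether an element is a subject or a trait."""
    kind = "subject" if isinstance(element, Subject) else "trait"
    return f"{type(element).__name__} '{element.name}' is a {kind}"
```

New research adds a subclass such as `MassPeak`, while code written against `Subject` and `Trait` keeps working unchanged.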
What about experimental design? • Using FuGE data elements: [Figure: DATA sets linked by protocol applications, each described by a protocol, equipment and software. Labels: Expression data, Affy Array, Affy M430 Protocol, Affy M430 platform, Bioconductor Norm.; Genotype data, SNP Array, Illumina Protocol, Illumina Equipment, Bead Studio Software; QTL data, QTL Mapping, Mapping Protocol, R Software.] FuGE: Jones et al Nature Biotech 25, 1127-1133
Ontology enabled • Standard descriptions (semantics) are also essential for integration, next to standard structure (syntax) [Figure: INVESTIGATION 1 and INVESTIGATION 2, connected by hyperlinks. Incompatible naming: GENE Name = Mip1alpha; GENE Name = Mip1a. Map mouse on human ontologies: ONTOLOGY ENTRY Id = 0005615, Term = ABC, Ontology = GO; ONTOLOGY ENTRY Id = MP:0005385, Term = cardiovascular, Ontology = MP. Compatible identifiers: DATABASE REFERENCE Id = ENSMUS098, Db = ENSEMBL; Id = ENSMU0S98, Db = ENSEMBL; Id = ENSMUS98, Db = ENSEMBL; Id = 1419561_AT, Db = AFFY 430.] FuGE: Jones et al Nature Biotech 25, 1127-1133
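A small sketch of why shared identifiers matter: two investigations name the same gene differently, but a common database reference lets them be matched. The classes, the `group_by_reference` helper and the investigation labels are assumptions for illustration; the gene names and the reference `ENSMUS98` are the placeholder values shown on the slide, not real accessions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DatabaseReference:
    db: str
    id: str


@dataclass
class Gene:
    name: str
    investigation: str
    reference: DatabaseReference


def group_by_reference(genes: List[Gene]) -> Dict[DatabaseReference, List[str]]:
    """Collect the local gene names that point at the same database reference."""
    groups: Dict[DatabaseReference, List[str]] = {}
    for gene in genes:
        groups.setdefault(gene.reference, []).append(gene.name)
    return groups


# Placeholder values from the slide; investigation labels are hypothetical.
shared = DatabaseReference(db="ENSEMBL", id="ENSMUS98")
genes = [
    Gene("Mip1alpha", "investigation 1", shared),
    Gene("Mip1a", "investigation 2", shared),
]
matches = group_by_reference(genes)
```

Matching on names alone would treat `Mip1alpha` and `Mip1a` as different genes; matching on the reference groups them.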
• Standard structure to ease sharing of data and tools • Standard extension mechanism for new research
Using the generator again…
Little language / blueprint model (bioinformatician, software engineer), truncated on the slide:
    <!-- entity organization -->
    <entity name="Experiment" label="Experiment">
      <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
      <field name="Medium" type="xref" xref_field="Medium.name"/>
      <field name="Protocol" label="Experiment Protocol"/>
      <field name="Temperature" type="int" …
[Figure: blueprint model + software factory deliver software to the biologists for the workflow: strains, genome markers, individuals, genotypes, QTL profiles, microarrays, probes, expressions, normalized expressions, network.] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243