SystemsX.ch Data Challenges In Systems Biology Research In Switzerland 26. August 2009, Baltimore
Systems Biology: As understood by a physicist
• OK, so we have the gene sequence (genome) of many organisms, but what does it say? How does it all work?
• Cell biology and molecular biology are producing more and more high-resolution, high-quality data to answer this question.
• Bottom-up approach: understanding the cell from first principles is very difficult.
• Systems biology approach: understand the available data top-down. Study the complex interaction of many levels of biological information to understand how they work together.
The Systems Biology Approach
Systems Biology:
• Biology
• Physics
• Chemistry
• Computer Science
• Mathematics
• Engineering
• Medicine
Diagram labels: Biological System; Qualitative, Wet Lab Biology; Network Theory; Modeling, Simulation; Quantitative Data Sets of Complete Systems; Bio-Engineering
SystemsX.ch
Largest Swiss national research effort to date
SCHWEIZERISCHE EIDGENOSSENSCHAFT / CONFÉDÉRATION SUISSE / CONFEDERAZIONE SVIZZERA / CONFEDERAZIUN SVIZRA
More numbers...
• Funded by the Swiss government with CHF 100 million for 2008-2011
• 11 Swiss universities and research institutions invest a matching 100 million
• Projects approved by the SNSF:
  • 14 large research projects (4-7 MCHF)
  • 27 PhD projects
  • 20 interdisciplinary pilot projects
  • 1 strategic support project (7 MCHF)
IDIES Inauguration
SyBIT Project Motivation
• SystemsX.ch will produce and analyze a large amount of data
• Strong need for coordination among data providers
• Strong need for common semantics and compatible service offerings
• Increased need for professionally supported tools and services
SyBIT Tasks
• Ensure that the RTDs (Research, Technology and Development projects) have all the tools and infrastructure necessary to manage their data
• Ensure that well-described interfaces exist to all data products relevant for future exploration
• Ensure that the services and data produced remain available after the lifetime of the project
• Ensure that the knowledge and know-how to develop and maintain these services is built up and preserved within the participating institutions
• Establish well-coordinated national support and development for future Systems Biology projects
Data at the Center of Work
• Almost everything is data-driven
• Many formats and access patterns
Data Production
• Many different kinds of instruments
• Different data types
• New instruments produce much more data
• Data volume increasing exponentially
Diagram labels: Mass Spec, Microarray, Microscope, HCS/HTS, Simulation, ...; Data; O(TB)/day
Data Validation and Filtering
• Validation, checks
• Conversion into standard formats
• Compression
• This can be very compute-intensive
• This can produce a lot of new data
• Needs clusters to do it in a timely fashion
Diagram label: O(TB)
Data Tracking and Metadata
• Provenance metadata
• File catalogs
• Metadata on initial filtering
Diagram label: O(GB)
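A file catalog with provenance metadata, as listed on this slide, could look roughly like the following. This is only a minimal sketch: the `CatalogEntry` fields, the `register` helper and the choice of SHA-256 checksums as catalog keys are all assumptions for illustration, not something the talk prescribes.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterable, List


@dataclass
class CatalogEntry:
    """One file-catalog record with minimal provenance (hypothetical fields)."""
    path: str
    sha256: str
    size_bytes: int
    instrument: str
    derived_from: List[str] = field(default_factory=list)  # checksums of parent files
    filter_step: str = ""  # name of the initial filtering step, if any


def sha256_of(path: Path) -> str:
    """Checksum a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(
    catalog: Dict[str, CatalogEntry],
    path: Path,
    instrument: str,
    derived_from: Iterable[str] = (),
    filter_step: str = "",
) -> CatalogEntry:
    """Add a file to the catalog, keyed by its checksum, and return the entry."""
    entry = CatalogEntry(
        path=str(path),
        sha256=sha256_of(path),
        size_bytes=path.stat().st_size,
        instrument=instrument,
        derived_from=list(derived_from),
        filter_step=filter_step,
    )
    catalog[entry.sha256] = entry
    return entry
```

Keying entries by checksum means a filtered file can point back at its raw parent through `derived_from` even if files are later moved, which is one simple way to keep provenance independent of storage location.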
Data Exploration
• Interactive exploration of data
• Small-scale analysis
• Planning of large-scale data analysis
Data Analysis and Modeling
• Large-scale analysis
• On as much CPU power as possible
• Production of more data
• Secondary datasets
• Simulation, modeling
• Additional databases
• Metadata
• Result data
Publication and Archiving
• Publication into public databases
• Curation
• Archiving for long-term storage
Diagram labels: O(100TB)/yr, O(TB)/yr
Data Lifecycle
• Some steps might be iterative
• Users are not interested in technology; it simply has to work
• Implementing 'Data' such that all needs are met is a challenge
• All SystemsX.ch projects:
  • O(PB)/yr kept data
  • O(TB)/yr published data
  • O(10^7) CPU hours
  • Several different DBs and formats
Repetitive problem
• Same lifecycle everywhere
• Local policies
• Local infrastructures
• Local services
• Local access control
• Nontrivial coordination effort for RTDs to share data and services
Diagram labels: ETH, EPFL, UniXY, UZH, UniBas, RTDs
So what's the problem?
• Data is not very large but there are many different kinds
• People don't understand it very well yet
• How to evolve, schemas, versioning
• How to navigate the data efficiently; people still use Excel sheets for that
• No strong motivation to annotate data
• Just publish what is necessary and prescribed by the journals
• No recognition yet for producing 'good' datasets
• Data quality as such is not yet very high, not much reuse (Excel sheets are a limiting factor)
• Not many standards exist, there are too many formats
• Instrument vendors often introduce new formats or conventions
• No trust in central databanks
• People think their data is special and may have some value (in $)
• Not-invented-here effect is huge
• Everyone builds their own submission system, LIMS, etc.
• Data loss
• Student / postdoc leaves, nobody understands the data anymore
• Education of senior people concerning all of the above is not easy
Divide and Conquer
• Define pipelines for each instrument type
  • Platform providers already exist and have a lot of experience we can build on
• Define common storage types with storage providers
  • What kinds of storage services do we define?
  • Which kind of service is offered where?
• Define common data retention semantics
  • What kind of data is to be kept on what type of storage, and for how long?
  • How are the costs covered?
• Define interfaces for data sharing
  • Keep responsibility for the data where it was produced (i.e. with the people who know what it is)
  • Extract data into warehouses for each specific problem
• Define final data repositories for public access
  • Some public repositories already exist, also abroad; agree which one to publish to for each project
  • Where no such repository exists, define and build one for SystemsX.ch
Pipeline Concept
• Pipeline: end-to-end support of data flow from instrument to analysis and publication
• Logical grouping of infrastructure and applications
• Build on existing professional services, establish standards for data production, filtering, metadata and validation
• Agree on initial data products
• Agree on provided storage service types
• Agree on data retention policies
• Currently assembling Pipeline Teams
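The idea of a pipeline as an ordered data flow from instrument output through filtering, metadata and validation can be sketched in a few lines. Everything named here (`filter_stage`, `metadata_stage`, `validation_stage`, `MASS_SPEC_PIPELINE`, the record layout) is a hypothetical stand-in chosen for illustration, not a description of the actual SyBIT pipelines.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Stage = Callable[[Record], Record]


def filter_stage(record: Record) -> Record:
    """Drop missing readings (a stand-in for instrument-specific filtering)."""
    readings = [r for r in record["readings"] if r is not None]
    return {**record, "readings": readings}


def metadata_stage(record: Record) -> Record:
    """Attach basic metadata describing the filtered data product."""
    metadata = {
        "instrument": record["instrument"],
        "n_readings": len(record["readings"]),
    }
    return {**record, "metadata": metadata}


def validation_stage(record: Record) -> Record:
    """Reject data products that are empty after filtering."""
    if not record["readings"]:
        raise ValueError("no readings left after filtering")
    return {**record, "validated": True}


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    """Apply each stage in order; every stage returns a new record."""
    for stage in stages:
        record = stage(record)
    return record


# One pipeline per instrument type; the stage order is the agreed standard.
MASS_SPEC_PIPELINE: List[Stage] = [filter_stage, metadata_stage, validation_stage]
```

A call such as `run_pipeline({"instrument": "mass-spec", "readings": [1.0, None, 2.5]}, MASS_SPEC_PIPELINE)` would return a record with two readings, a metadata block and a `validated` flag. Keeping stages as plain functions makes it easy for each Pipeline Team to swap in its own filtering or validation step.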
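Storage and Retention
Storage types and the data classes shown under each (grouping reconstructed from the order of the diagram labels):
• Scratch: Temporary Data, Active Data
• Durable: Temporary Data, Active Data, Result Data
• Transactional: Active Data, Result Data, Permanent Data
• Persistent: Result Data, Permanent Data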
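One way to make such storage and retention semantics machine-checkable is a small lookup table. The sketch below assumes that the diagram labels group as Scratch (Temporary, Active), Durable (Temporary, Active, Result), Transactional (Active, Result, Permanent) and Persistent (Result, Permanent); that grouping is a reading of the label order, and the names `DataClass`, `STORAGE_ACCEPTS` and `storage_options` are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Set


class DataClass(Enum):
    TEMPORARY = "Temporary Data"
    ACTIVE = "Active Data"
    RESULT = "Result Data"
    PERMANENT = "Permanent Data"


# ASSUMPTION: this grouping is inferred from the order of the slide labels.
STORAGE_ACCEPTS: Dict[str, Set[DataClass]] = {
    "Scratch": {DataClass.TEMPORARY, DataClass.ACTIVE},
    "Durable": {DataClass.TEMPORARY, DataClass.ACTIVE, DataClass.RESULT},
    "Transactional": {DataClass.ACTIVE, DataClass.RESULT, DataClass.PERMANENT},
    "Persistent": {DataClass.RESULT, DataClass.PERMANENT},
}


def storage_options(data_class: DataClass) -> List[str]:
    """Return the storage types that accept the given class of data."""
    return [name for name, accepted in STORAGE_ACCEPTS.items() if data_class in accepted]
```

With a table like this, a submission tool could refuse to place permanent data on scratch storage instead of relying on each group remembering the policy.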
Data Sharing: Rule 1
• Data producers are always responsible for their data. They may decide to:
  • Provide access to their data themselves (operate own infrastructure)
  • Host data in a datacenter (outsource infrastructure but remain in control)
  • Send their data to a common and/or public resource (give up control, trust others to do the right thing)
Data Sharing: Rule 2
• Data produced by SystemsX.ch projects has to be shared inside SystemsX.ch and will need to become public eventually.
  • What this means in detail has to be decided for each RTD
  • The responsible data producers have to agree on a common interface to access public data
  • Searches need to be possible using a web interface
  • Data extraction in standard formats needs to be easy
Data Publishing in Practice
• Establishing a common query and access interface is necessary
• Type of publication has to be chosen for each RTD:
  • Local resource
  • Remote national resource
  • Remote international resource
Diagram labels: Managed locally OR Managed remotely
Data Sharing in Practice
• Data access services are the responsibility of the producers
• SyBIT helps to develop and support these
• Access services have a standard common interface
• Data can be extracted directly or into data warehouses for further study, where:
  • Reindexing is possible
  • Optimization for new studies is easy
Diagram labels: Query, Result
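The combination of producer-owned access services behind a standard common interface, with extraction into a warehouse that can be reindexed per study, might be sketched as follows. The class and function names (`DataAccessService`, `InMemoryService`, `build_warehouse`) and the record fields are assumptions made for this example, not the interface SyBIT defined.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

Record = Dict[str, str]


class DataAccessService(ABC):
    """Common interface every data producer exposes (hypothetical)."""

    @abstractmethod
    def query(self, **criteria: str) -> List[Record]:
        """Return all records whose fields match the given criteria."""


class InMemoryService(DataAccessService):
    """Toy producer-side service; a real one would wrap a LIMS or database."""

    def __init__(self, producer: str, records: Iterable[Record]) -> None:
        self.producer = producer
        self._records = list(records)

    def query(self, **criteria: str) -> List[Record]:
        return [
            dict(record, producer=self.producer)  # tag results with their origin
            for record in self._records
            if all(record.get(key) == value for key, value in criteria.items())
        ]


def build_warehouse(
    services: Iterable[DataAccessService], index_key: str, **criteria: str
) -> Dict[str, List[Record]]:
    """Extract matching records from all producers and reindex them by one field."""
    warehouse: Dict[str, List[Record]] = {}
    for service in services:
        for record in service.query(**criteria):
            warehouse.setdefault(record.get(index_key, "unknown"), []).append(record)
    return warehouse
```

The producers keep control of their own services, while a study-specific warehouse is just a derived, reindexed copy that can be rebuilt with a different `index_key` whenever a new question comes up.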
SyBIT Approach
• Hire people and embed them at institutes participating in SystemsX.ch projects
  • Build trust
  • Avoid not-invented-here effect
  • Direct collection of requirements
• Establish expert teams
  • Pipelines, instruments, computing infrastructure
• Solve problems incrementally
  • Joint projects SyBIT-RTD, 3-6 months, results put to immediate use
  • Short-term: infrastructure and data taking standardisation on formats
  • Mid-term: data lifecycle management
  • Longer-term (3 years): introduce service-oriented architectures, application as a service
• QA system: enforce basic 'best practice'
  • Versioning, testing, documentation
• Communications
Status
• Still hiring, most people start Q4/09
• Approach agreed with all partners
• SyBIT QA rules being set up
• First pilot projects, run with the limited number of people available, are finishing now and reuse results of previous projects
• LIMS systems and metadata catalogs for 3 projects up and running
• Follow-up project definitions and planning ongoing
• Building teams for standardization efforts
Thanks for your attention
Please visit www.systemsx.ch for more information