Life Sciences: Data Revolution

Session: 40382 • Life Sciences: Data Revolution • Building Gene Expression Databases • Mahendra Navarange, Microarray Centre, MRC Clinical Sciences Centre and Imperial College, UK

Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets

What is Life Sciences ? • Includes • Biology • BioTechnology • Chemistry • Pharmaceuticals • Agriculture / Plant Science • Environmental Sciences • ???? • Objective • Understand the molecular and evolutionary basis of living organisms

Focus Areas • Genomics • Human Genome Project • Draft published in 2000 • Finished version on 14 April 2003 • Sequencing data doubles every year • Transcriptomics • Study of transcription (gene expression) • Proteomics • Study of translation (protein synthesis) Courtesy F. Hoffmann-La Roche Ltd.

Data…Data…Data • Sanger Centre: 5 TB • Celera: ~100 TB+ (2001)

Data Revolution in Life Sciences • Impact of technology • High throughput platforms (HTP) • Robotics • Miniaturisation • Data driven science • Datawarehousing technologies • Data mining and visualisation software • (Diagram labels: Information Technology, Life Sciences)

Databases • Genomics • Sanger • NCBI • TIGR • KEGG • Transcriptomics • ArrayExpress • Proteomics • Protein Databank (PDB) • SWISSPROT • Entrez

Target Validation Using Life Sciences Data • Identify causes of genetic diseases • Discover new drug compounds • Personalised medicine • Develop new diagnostics • (Diagram: Drug Discovery Pipeline, with labels Target Identification, HTP Screening, Hits, Leads, Clinical Trials, FDA)

Life Sciences: The Future • “…biology is changing from a purely laboratory-based science to an information based science.” Eric Lander, Director, Whitehead Institute, MIT

Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets

Transcriptomics • Comparing gene expression across databases • Collaborate to share expertise • Benefits • Diagnostics • Screen target drug compounds • Identify toxic side effects • Screen patients for clinical trials

Workflow • (Diagram labels: Literature, Experiment design, Data, HTP, Preliminary Analysis, Further Analysis, Local DB, GO, NCBI, Collaboration)

HTP Microarray Platform : Hardware Courtesy Affymetrix Inc., Dell Inc

Microarray Data Acquisition Courtesy Fisher Scientific Courtesy Affymetrix Inc.

Microarray Data • High density microarray • ~ 500,000 spots of ~18 µm size • >20,000 genes • Typical file size 45MB • No. of files produced in typical experiment 10-20. Courtesy Affymetrix Inc.
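The figures on this slide give a quick estimate of raw data volume per experiment. A minimal sketch of that arithmetic, using only the slide's numbers (45 MB per file, 10 to 20 files); the class and method names are made up for illustration:

```java
// Back-of-envelope estimate of raw data volume per experiment,
// using the typical figures quoted on the slide.
final class ExperimentVolume {
    static final int FILE_SIZE_MB = 45; // typical file size from the slide

    // Total raw data in MB for an experiment that produces `files` files.
    static int totalMb(int files) {
        return files * FILE_SIZE_MB;
    }

    public static void main(String[] args) {
        System.out.println("10 files: " + totalMb(10) + " MB");
        System.out.println("20 files: " + totalMb(20) + " MB");
    }
}
```

So a typical experiment contributes roughly 450 to 900 MB of raw files before any annotation is added.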

Life Sciences Data Explosion • Data Characteristics • Image data generated by HTP platforms, annotation by researchers • Large volume and size • Varied data types • Datawarehousing challenges • Non-summarisable • High dimensionality • Limited knowledge of underlying biological processes • No standard industry data models or best practices

Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets

System Requirements • Seamless data integration • Handle wide range of datatypes • Processor intensive and I/O intensive • Exponential growth in data storage • Open architecture, collaboration

System Requirements • Rapid changes – new databases, technologies and instruments • Competitive pressures, quick response, low access times • Plug and play capability • Security

MIcroarray Data MIning Resource • MiMiR – Microarray Datawarehouse • ~250GB. Expected to double in next few months • ~2500 images, over 1500 BioAssays • 52 tables, largest table 15GB • Infrastructure • Oracle 9i Release 1 on Windows 2000 • Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk • 1 TB NAS capacity

Requirements vs. Solutions • Integrate different types of data sources • Use of XML for data exchange • Use of Oracle UltraSearch • Efficient data retrieval • Stringent response time standards on procedures • Index-Organized Tables, Partitioning • Security • Firewall • Single Sign-On servers (in progress) • Rapid change management • BC4J framework, JDeveloper • Extreme programming, prototyping

MiMiR System Architecture • (Diagram labels: Annotation, MAGE-ML, Spot Info, Images, JDeveloper, Ext Ref, Blast, 9iAS, Admin, MiMiR Application Server, XSQL, XSU, XDK, BC4J, JClient, JSP, ArrayExpress, Private)

Oracle Products Used • Oracle 9i Database Server/Client (Release 1) • Partitioning • Join indexing • Oracle 9i JDeveloper (9.0.2) • Oracle 9i Application Server (BC4J) • Oracle XML features • Oracle PL/SQL packages for XML • Oracle XSQL publishing framework • XDK (DOMParser and SAXParser) • XSU • Oracle Data Mining (Future) • Oracle Collaboration Suite (Future)

Why Oracle ? • Readily scalable • Manage wide variety of data types • Integrated development tools • Support XML and Java • High performance middleware • Secure collaboration

Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition and profiling • System requirements • Design issues • Code snippets

Oracle and XML: Design Issues • Storage • Storing XML in tables • Storing XML in CLOBs • Hybrid • Generation • XDK for Java, PL/SQL • XSU • Transformation • XSL Stylesheet • Views • Processing • XDK DOMParser • XDK SAXParser • Searching • XPath • Oracle Text • Publishing • XSQL publishing framework • XSL
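The choice between a DOM parser and a SAX parser under "Processing" is largely a memory question: DOM builds the whole tree, while SAX hands over events one at a time. A minimal sketch of the SAX style, assuming the ROWSET/ROW layout that XSU produces. It uses the JDK's standard JAXP SAX parser as a stand-in for the XDK SAXParser, and the class name and sample data are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Counts <ROW> elements as they stream past; no tree is kept in memory.
final class RowCounter extends DefaultHandler {
    private int rows;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("ROW".equals(qName)) {
            rows++;
        }
    }

    // Parses the given XML text and returns the number of ROW elements.
    static int countRows(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RowCounter handler = new RowCounter();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.rows;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-row sample in the ROWSET/ROW layout.
        String xml = "<ROWSET><ROW num=\"1\"><PAIRS>20</PAIRS></ROW>"
                   + "<ROW num=\"2\"><PAIRS>16</PAIRS></ROW></ROWSET>";
        System.out.println(countRows(xml));
    }
}
```

The same handler shape extends to collecting column values; a DOM parser is usually the simpler choice when the document is small and needs random access.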

Oracle and XML: XSQL Example
    <?xml version="1.0" encoding='windows-1252'?>
    <?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
    <xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
      select * from array
    </xsql:query>
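The xml-stylesheet instruction in the XSQL page asks for the query result to be transformed by mimirArray.xsl before it is published. That stylesheet is not shown in the slides. The sketch below illustrates the transformation step only, assuming the query result arrives in the ROWSET/ROW layout. The stylesheet, the column names (ARRAY_ID, CHIP_TYPE) and the class name are all made up, and it uses the JDK's javax.xml.transform API rather than the XSQL servlet:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Applies a small XSL stylesheet to ROWSET/ROW XML and returns HTML.
final class RowsetToHtml {
    // Hypothetical stylesheet: one table row per ROW, one cell per column.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/ROWSET\">"
      + "<table><xsl:apply-templates select=\"ROW\"/></table>"
      + "</xsl:template>"
      + "<xsl:template match=\"ROW\">"
      + "<tr><xsl:for-each select=\"*\">"
      + "<td><xsl:value-of select=\".\"/></td>"
      + "</xsl:for-each></tr>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms the given ROWSET document with the stylesheet above.
    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical single-row result of "select * from array".
        String xml = "<ROWSET><ROW num=\"1\"><ARRAY_ID>A1</ARRAY_ID>"
                   + "<CHIP_TYPE>MG-U74Av2</CHIP_TYPE></ROW></ROWSET>";
        System.out.println(transform(xml));
    }
}
```

Keeping the presentation in a stylesheet means the SQL in the XSQL page can change without touching the HTML layout, and the reverse.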

Oracle and XML: Design Issues

Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data profiling • System requirements • Design issues • Code snippets

An Example • Creating XML from 500,000 records in the database

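Solution 1 • Using the XSU Java API to get an XML DOM.
    import java.sql.Connection;
    import oracle.xml.parser.v2.XMLDocument;
    import oracle.xml.sql.query.OracleXMLQuery;

    // createConnection is presumably a local helper class (not shown in the slides).
    Connection conn = createConnection.createConnection();
    String query = "SELECT * FROM IMAGE_QUANTITATION i "
                 + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
    OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
    q1.keepCursorState(true);
    // Build the whole result as a DOM tree, then print that instance.
    XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
    xmlDoc.print(System.out);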

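Solution 2 • Using the XSU Java API to get an XML string.
    import java.sql.Connection;
    import oracle.xml.sql.query.OracleXMLQuery;

    // createConnection is presumably a local helper class (not shown in the slides).
    Connection conn = createConnection.createConnection();
    String query = "SELECT * FROM IMAGE_QUANTITATION i "
                 + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
    OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
    q1.keepCursorState(true);
    // The DOM calls from Solution 1 are replaced by a single string call:
    // XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
    // xmlDoc.print(System.out);
    System.out.println(q1.getXMLString());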

Solution 3 • Using the dbms_xmlquery package to get XML output from SQL
    select dbms_xmlquery.getXML('select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa''') from dual
  Output (start of the first row only):
    <?xml version = '1.0'?>
    <ROWSET>
    <ROW num="1">
    <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
    <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
    <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
    <POSITIVE>2</POSITIVE>
    <NEGATIVE>5</NEGATIVE>
    <PAIRS>20</PAIRS>
    <PAIRS_USED>20</PAIRS_USED>
    <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
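Solutions 1 and 2 return the whole result as a DOM tree or as one string, which can become heavy at 500,000 records. One alternative is to write the ROWSET/ROW layout row by row. This is a minimal sketch that uses the JDK's StAX writer rather than XSU; rows are passed in as maps that stand in for a JDBC result set, and the class name is made up:

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Streams rows out in the ROWSET/ROW layout without building a tree.
final class RowsetStreamWriter {
    // Writes one <ROW num="n"> per map, one child element per map entry.
    static void write(Iterable<Map<String, String>> rows, Writer out) throws Exception {
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument();
        w.writeStartElement("ROWSET");
        int num = 0;
        for (Map<String, String> row : rows) {
            w.writeStartElement("ROW");
            w.writeAttribute("num", Integer.toString(++num));
            for (Map.Entry<String, String> col : row.entrySet()) {
                w.writeStartElement(col.getKey());
                w.writeCharacters(col.getValue());
                w.writeEndElement();
            }
            w.writeEndElement(); // ROW
        }
        w.writeEndElement(); // ROWSET
        w.writeEndDocument();
        w.flush();
        w.close();
    }

    public static void main(String[] args) throws Exception {
        // LinkedHashMap keeps the column order; values are from the slide.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CHIP_TYPE", "MG-U74Av2");
        row.put("PAIRS", "20");
        StringWriter out = new StringWriter();
        write(Collections.singletonList(row), out);
        System.out.println(out);
    }
}
```

In a real loader the iterable would wrap a JDBC cursor and the writer would point at a file or response stream, so memory use stays roughly constant as the row count grows.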

Summary • The life sciences are generating enormous amounts of data using HTP platforms • The data is non-summarisable, distributed and has varied data types • Data integration and secure collaboration are key to success • MiMiR

Acknowledgements • Dr. Helen Causton • Prof. Tim Aitman • Dr. Laurence Game • Helen Banks • Nicola Cooley • Vihar Wadekar • Helen Figueira • MGED Data Society (www.mged.org)

What Next • Opportunities for collaboration on the development of Knowledge Management Systems for Drug Discovery • Contact: mahendra.navarange@csc.mrc.ac.uk • http://microarray.csc.mrc.ac.uk
