Session : 40382 Life Sciences: Data Revolution Building Gene Expression Databases Mahendra Navarange Microarray Centre MRC Clinical Sciences Centre and Imperial College, UK
Agenda • What is Life Sciences? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets
What is Life Sciences ? • Includes • Biology • BioTechnology • Chemistry • Pharmaceuticals • Agriculture / Plant Science • Environmental Sciences • ???? • Objective • Understand the molecular and evolutionary basis of living organisms
Focus Areas • Genomics • Human Genome Project • Draft published in 2000 • Finished version on 14 April 2003 • Sequencing data doubles every year • Transcriptomics • Study of transcription (gene expression) • Proteomics • Study of translation (protein synthesis) Courtesy F. Hoffmann-La Roche Ltd.
Data…Data…Data • Sanger Centre 5TB • Celera ~100TB+ (2001)
Data Revolution in Life Sciences • Impact of technology • High throughput platforms (HTP) • Robotics • Miniaturisation • Data driven science • Datawarehousing technologies • Data mining and visualisation software [Figure labels: Information Technology, Life Sciences]
Databases • Genomics • Sanger • NCBI • TIGR • KEGG • Transcriptomics • ArrayExpress • Proteomics • Protein Data Bank (PDB) • Swiss-Prot • Entrez
Target Validation Using Life Sciences Data • Identify causes of genetic diseases • Discover new drug compounds • Personalised medicine • Develop new diagnostics [Figure: Drug Discovery Pipeline. Labels: HTP Screening, Target Identification, Clinical Trials, Hits, Leads, FDA]
Life Sciences : The Future • “… biology is changing from a purely laboratory-based science to an information based science.” Eric Lander, Director, Whitehead Institute, MIT
Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets
Transcriptomics • Comparing gene expression across databases • Collaborate to share expertise • Benefits • Diagnostics • Screen target drug compounds • Identify toxic side effects • Screen patients for clinical trials
Workflow [Figure: workflow diagram. Labels: Literature, Experiment design, Data, HTP, Preliminary Analysis, Further Analysis, Local DB, GO, NCBI, Collaboration]
HTP Microarray Platform : Hardware Courtesy Affymetrix Inc., Dell Inc
Microarray Data Acquisition Courtesy Fisher Scientific Courtesy Affymetrix Inc.
Microarray Data • High density microarray • ~ 500,000 spots of ~18 µm size • >20,000 genes • Typical file size 45MB • No. of files produced in typical experiment 10-20. Courtesy Affymetrix Inc.
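The figures on this slide imply a rough raw-data volume per experiment. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical, and only the 45MB file size and the 10-20 file range come from the slide.

```java
final class ExperimentVolume {
    // Figures quoted on the slide; everything else here is illustrative.
    static final int FILE_SIZE_MB = 45;
    static final int MIN_FILES = 10;
    static final int MAX_FILES = 20;

    /** Raw data volume in MB for an experiment producing the given number of files. */
    static int volumeMb(int files) {
        return files * FILE_SIZE_MB;
    }

    public static void main(String[] args) {
        // 10 files * 45 MB = 450 MB, 20 files * 45 MB = 900 MB
        System.out.println(volumeMb(MIN_FILES) + " MB to " + volumeMb(MAX_FILES) + " MB per experiment");
    }
}
```

On these assumptions a typical experiment produces roughly 450MB to 900MB of raw files before any annotation or derived data is stored.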
Life Sciences Data Explosion • Data Characteristics • Image data generated by HTP platforms, annotation by researchers • Large volume and size • Varied data types • Datawarehousing challenges • Non-summarisable • High dimensionality • Limited knowledge of underlying biological processes • No standard industry data models or best practices
Agenda • What is Life Sciences ? • MiMiR: database for gene expression data • Data acquisition process and data characteristics • System requirements • Design issues • Code snippets
System Requirements • Seamless data integration • Handle wide range of datatypes • Processor intensive and I/O intensive • Exponential growth in data storage • Open architecture, collaboration
System Requirements • Rapid changes – new databases, technologies and instruments • Competitive pressures, quick response, low access times • Plug and play capability • Security
MIcroarray Data MIning Resource • MiMiR – Microarray Datawarehouse • ~250GB. Expected to double in next few months • ~2500 images, over 1500 BioAssays • 52 tables, largest table 15GB • Infrastructure • Oracle 9i Release 1 on Windows 2000 • Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk • 1 TB NAS capacity
Requirements vs. Solutions • Integrate different types of data sources • Use of XML for data exchange • Use of Oracle UltraSearch • Efficient data retrieval • Stringent response time standards on procedures • Index-Organized Tables, Partitioning • Security • Firewall • Single Sign-On servers (in progress) • Rapid change management • BC4J framework, JDeveloper • Extreme programming, prototyping
MiMiR System Architecture [Figure: architecture diagram. Labels: Annotation, MAGE-ML, Spot Info, Images, JDeveloper, Ext Ref, Blast, 9iAS, Admin, MiMiR, Application Server, XSQL, XSU, XDK, BC4J, JClient, JSP, ArrayExpress, Private]
Oracle Products Used • Oracle 9i Database Server/Client (Release 1) • Partitioning • Join indexing • Oracle 9i JDeveloper (9.0.2) • Oracle 9i Application Server (BC4J) • Oracle XML features • Oracle PL/SQL packages for XML • Oracle XSQL publishing framework • XDK (DOMParser and SAXParser) • XSU • Oracle Data Mining (Future) • Oracle Collaboration Suite (Future)
Why Oracle ? • Readily scalable • Manage wide variety of data types • Integrated development tools • Support XML and Java • High performance middleware • Secure collaboration
Agenda • What is Life Sciences ? • MiMiR : database for gene expression data • Data acquisition and profiling • System requirements • Design issues • Code snippets
Oracle and XML : Design Issues • Storage • Storing XML in tables • Storing XML in CLOBs • Hybrid • Generation • XDK for Java, PL/SQL • XSU • Transformation • XSL Stylesheet • Views • Processing • XDK DOMParser • XDK SAXParser • Searching • XPATH • Oracle Text • Publishing • XSQL publishing framework • XSL
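To make the "Searching: XPATH" option concrete, here is a minimal sketch that runs an XPath query over a small ROWSET/ROW document. It uses the JDK's standard JAXP classes as a stand-in, not the Oracle XDK parsers named on the slide. The class name, the sample rows (apart from the first) and the choice of query are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RowsetXPathSearch {
    // Illustrative sample in ROWSET/ROW layout; the second and third rows are made up.
    static final String SAMPLE =
        "<ROWSET>"
      + "<ROW num=\"1\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE><ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME></ROW>"
      + "<ROW num=\"2\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE><ELE_SET_NAME>AFFX-MurIL10_at</ELE_SET_NAME></ROW>"
      + "<ROW num=\"3\"><CHIP_TYPE>HG-U95Av2</CHIP_TYPE><ELE_SET_NAME>AFFX-BioB-5_at</ELE_SET_NAME></ROW>"
      + "</ROWSET>";

    /** Returns the ELE_SET_NAME of every ROW whose CHIP_TYPE equals chipType, in document order. */
    static List<String> elementSetsForChip(String xml, String chipType) {
        try {
            // Parse the whole document into a DOM tree (fine for small inputs).
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the chip type as an XPath variable instead of concatenating it into the expression.
            xpath.setXPathVariableResolver(
                    name -> "chip".equals(name.getLocalPart()) ? chipType : null);
            NodeList hits = (NodeList) xpath.evaluate(
                    "/ROWSET/ROW[CHIP_TYPE = $chip]/ELE_SET_NAME", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not search ROWSET document", e);
        }
    }
}
```

Calling `RowsetXPathSearch.elementSetsForChip(RowsetXPathSearch.SAMPLE, "MG-U74Av2")` returns the two element set names from the first two rows. A DOM-based search like this loads the full document, so for very large exports a SAX-style pass may be the better fit.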
Oracle and XML : XSQL Example
<?xml version="1.0" encoding='windows-1252'?>
<!--
 | Uncomment the following processing instruction and replace
 | the stylesheet name to transform output of your XSQL Page using XSLT
 <?xml-stylesheet type="text/xsl" href="YourStylesheet.xsl" ?>
-->
<?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
<xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
  select * from array
</xsql:query>
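The XSQL page above hands its query result to an XSL stylesheet. The contents of mimirArray.xsl are not shown in the slides, so the sketch below uses a hypothetical stylesheet that renders each ROW as a table row, and applies it with the JDK's standard javax.xml.transform API rather than the XSQL servlet. The class name and the ARRAY_ID column in the usage note are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class RowsetToHtml {
    // Hypothetical stylesheet: one <tr> per ROW, one <td> per column element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/ROWSET'>"
      + "<table><xsl:apply-templates select='ROW'/></table>"
      + "</xsl:template>"
      + "<xsl:template match='ROW'>"
      + "<tr><xsl:for-each select='*'><td><xsl:value-of select='.'/></td></xsl:for-each></tr>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to a ROWSET document and returns the resulting HTML. */
    static String transform(String rowsetXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(rowsetXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not transform ROWSET document", e);
        }
    }
}
```

For example, `RowsetToHtml.transform("<ROWSET><ROW num=\"1\"><ARRAY_ID>1</ARRAY_ID><CHIP_TYPE>MG-U74Av2</CHIP_TYPE></ROW></ROWSET>")` yields an HTML table with one row holding the two column values.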
Agenda • What is Life Sciences ? • MiMiR : database for gene expression data • Data profiling • System requirements • Design issues • Code snippets
An Example • Creating XML from 500,000 records in the database
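Solution 1 • Using XSU Java API to get XMLDOM.
import oracle.xml.parser.v2.XMLDocument;
import oracle.xml.sql.query.OracleXMLQuery;

// createConnection is a helper class that is not shown in the slides
conn = createConnection.createConnection();
String query = "SELECT * FROM IMAGE_QUANTITATION i "
             + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
q1.keepCursorState(true);
// Build the whole result set as an in-memory DOM tree, then serialise it
XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
xmlDoc.print(out);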
Solution 2 • Using XSU Java API to get XMLString.
import oracle.xml.sql.query.OracleXMLQuery;

// createConnection is a helper class that is not shown in the slides
conn = createConnection.createConnection();
String query = "SELECT * FROM IMAGE_QUANTITATION i "
             + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
q1.keepCursorState(true);
// The DOM calls from Solution 1 are replaced by a single String result
// XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
// xmlDoc.print(out);
System.out.println(q1.getXMLString());
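Solutions 1 and 2 both return the complete result at once, as a DOM tree or as a single String. For a result of around 500,000 records it may be worth comparing them with a writer that emits one row at a time. The sketch below is not XSU: it uses the JDK's StAX `XMLStreamWriter` and takes rows as plain maps in place of a JDBC result set. The class name and the row representation are assumptions.

```java
import java.io.Writer;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

final class StreamingRowsetWriter {
    /**
     * Writes rows in ROWSET/ROW layout directly to out, one row at a time,
     * so no full document is held in memory. Returns the number of rows written.
     */
    static int write(Iterable<Map<String, String>> rows, Writer out) {
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            int rowCount = 0;
            xml.writeStartDocument();
            xml.writeStartElement("ROWSET");
            for (Map<String, String> row : rows) {
                rowCount++;
                xml.writeStartElement("ROW");
                xml.writeAttribute("num", Integer.toString(rowCount));
                // One child element per column; keys must be valid XML element names.
                for (Map.Entry<String, String> column : row.entrySet()) {
                    xml.writeStartElement(column.getKey());
                    xml.writeCharacters(column.getValue());
                    xml.writeEndElement();
                }
                xml.writeEndElement(); // ROW
            }
            xml.writeEndElement(); // ROWSET
            xml.writeEndDocument();
            xml.close();
            return rowCount;
        } catch (Exception e) {
            throw new IllegalStateException("could not write ROWSET document", e);
        }
    }
}
```

In real use the `Iterable` would wrap a database cursor, and `out` would be a file or response stream rather than an in-memory buffer.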
Solution 3 • Using dbms_xmlquery package to get XML output from SQL
select dbms_xmlquery.getXML('select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa''') from dual
<?xml version = '1.0'?>
<ROWSET>
  <ROW num="1">
    <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
    <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
    <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
    <POSITIVE>2</POSITIVE>
    <NEGATIVE>5</NEGATIVE>
    <PAIRS>20</PAIRS>
    <PAIRS_USED>20</PAIRS_USED>
    <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
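Output in this ROWSET/ROW layout can be consumed without building a DOM. The sketch below makes a single SAX pass that counts ROW elements and sums PAIRS_USED. It uses the JDK's built-in SAX parser as a stand-in for the XDK SAXParser, and the class name and the choice of statistic are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class RowsetSaxSummary {
    /** Returns {number of ROW elements, sum of PAIRS_USED values} for a ROWSET document. */
    static long[] summarize(String xml) {
        final long[] result = new long[2];
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
                if ("ROW".equals(qName)) {
                    result[0]++;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("PAIRS_USED".equals(qName)) {
                    result[1] += Long.parseLong(text.toString().trim());
                }
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parsable ROWSET document", e);
        }
        return result;
    }
}
```

Because the handler keeps only running totals, memory use stays flat regardless of how many rows the query returns.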
Summary • The life sciences are generating enormous amounts of data using HTP platforms • The data is non-summarisable, distributed and has varied data types • Data integration and secure collaboration are key to success • MiMiR
Acknowledgements • Dr. Helen Causton • Prof. Tim Aitman • Dr. Laurence Game • Helen Banks • Nicola Cooley • Vihar Wadekar • Helen Figueira • MGED Data Society (www.mged.org)
Session : 40382 Life Sciences: Data Revolution Building Gene Expression Databases What Next : Opportunities for collaboration for development of Knowledge Management Systems for Drug Discovery Contact: mahendra.navarange@csc.mrc.ac.uk http://microarray.csc.mrc.ac.uk