440 likes | 575 Views
Databases as Analytical Engines for Drug Discovery. Susie Stephens Principal Product Manager, Life Sciences Oracle Corporation susie.stephens@oracle.com. Outline. Data Challenges Case Studies Summary. Access Distributed Data. External Sites. UltraSearch. Distributed query. MySQL.
E N D
Databases as Analytical Engines for Drug Discovery Susie StephensPrincipal Product Manager, Life Sciences Oracle Corporation susie.stephens@oracle.com
Outline • Data Challenges • Case Studies • Summary
Access Distributed Data External Sites UltraSearch Distributed query MySQL Flat files Sybase SRS DBlinks External Table Generic Connectivity Transparent Gateway Transparent Gateway
CLOBs XML Text Images Video Relational Users Defined Objects Nucleotide Sequences Gene Expression Data Papers Cell Histology Images Protein Folding Video SwissProt KEGG Chemical Structures Integrate a Variety of Data Types XML
Manage Vast Quantities of Data 50TB 40TB 30TB 20TB 10TB 0 • Partitioning • Oracle Data Guard • Real Application Clusters (RAC) • Automated Storage Management • Adaptive Instance Tuning • Automated Application and SQL Tuning • Automated Database Diagnostic Monitor (ADDM) • Scheduling
Collaborate Securely • Integrated communications • Single enterprise search • Flexible access • Fine grained access control • Auditing • Workflow • Personalized portal
Find Patterns and Insights • Oracle Data Mining • Find relationships & clusters • Oracle Discoverer & Oracle OLAP • Interactive query & drill-down • Statistics • mean, stdev, median, correlations, linear regression • Oracle Text • Cluster & Classify documents of interest • Table Functions • Implement complex algorithms within the database
Outline • Data Challenges • Case Studies • Summary
Regular Expression Searches • A powerful method of describing both simple & complex patterns for searching & manipulating • A multilingual regular expression support for SQL & PL/SQL string types • Follows POSIX style Regexp syntax • Support standard Regexp operators • Includes common extensions such as case-insensitive matching, sub-expression back-references, etc. • Compatible with popular Regexp implementations like GNU, Perl, Awk
Case Study: Retrieve Protein Data from SGD using Regular Expressions Case study courtesy of Prolexys Pharmaceuticals, Inc.
HTTP Raw Data </script> </head><body><body bgcolor='#FFFFFF'> <table cellpadding="2" width="100%" cellspacing="0" border="0"><tr><td colspan="4"><hr width="100%" /></td></tr><tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td><th valign="middle" nowrap="1">Quick Search:</th><td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded"> <input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /> </form></td><th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> | <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> | <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> | <a href="http://www.yeastgenome.org/">Home</a></th></tr><tr><td align="left" colspan="4"><table cellpadding="1" width="100%" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite"><td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td></tr></table></td></tr><tr><td colspan="4"><hr width="100%" /></td></tr></table><table cellpadding="0" width="100%" cellspacing="0" border="0"><tr><td width="10%"><br /></td><td valign="middle" align="center" width="80%"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10%"></td></tr></table><p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p /><center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> | <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center><p /><center><hr width="35%" /></center><p /><font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />Other Formats Available: <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a><pre>>YDR099W Chr 4 MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS WRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESK VFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFS VFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK* </pre><hr size="2" width="75%"> <table width="100%"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td><td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion"> <input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" /><a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a> </form></td></tr></table></body></html>
Function to Parse out AA Sequence create or replace function orf2seq ( p_orf in varchar2 ) return varchar2 is v_stream clob; strt number; begin -- Retrieve the HTTP stream: v_stream := httpuritype.getclob(httpuritype.createuri( 'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf|| '&flankl=0&flankr=0&map=p3map') ); -- Trim off the head of the stream: strt := dbms_lob.instr(v_stream, 'Submit', 1, 1); -- Strip out control characters, new lines, etc.: v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt), '[[:cntrl:]]', ''); -- Return the AA sequence: return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt), '[[:upper:]]{10,}')); end;
AA Sequence for ORF ‘YDR099W’ SQL> select orf2seq('YDR099W') from dual; ORF2SEQ('YDR099W') -------------------------------------------------------------------------------- MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISESGQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK Elapsed: 00:00:01.24 SQL> insert into pseq (orf_id, sequence) 2 values ('YDR099W', orf2seq('YDR099W'));
Case Study: Motif Searching in Proteins PROSITE database of protein sequence motifs ID TYR_PHOSPHO_SITE; PATTERN AC PS00007 DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE) DE Tyrosine kinase phosphorylation site PA [RK]-x(2,3)-[DE]-x(2,3)-Y CC /TAXO-RANGE=??E?V; CC /SITE=5,phosphorylation CC /SKIP-FLAG=TRUE DO PDOC00007 Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt • TKP Pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y • R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA • Oracle10g Regular Expression Equivalent • [RK].{2,3}[DE].{2,3}[Y] Case study courtesy of Prolexys Pharmaceuticals, Inc.
SQL to Retrieve All Proteins Interacting with TKP select distinct substr(a.refseq_id, 1, 9) refseq_id, length(a.seq_string_varchar) seq_length, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4 from target_db a, y2h_interaction_p b where a.refseq_id like 'NP%' and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]') and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) = b.prey_refseq) ;
Query Results REFSEQ_ID SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS ------------ ---------- ----------- ----------- ----------- ----------- NP_003961 146514 202 347 537 NP_003968 330 241 0 0 0 NP_003983 490 8 50 62 93 NP_004001 3562 3085 0 0 0 ... MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFGEDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVGNLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLGKNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVADFGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHRTRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEERSFHTSRNPSR
SQL to Retrieve Motif Frequency by Protein select c.refseq_id "Refseq ID", rs2desc(c.refseq_id) "Protein Description", a.cnt "Repetitions", b.ps_ac "Prosite AC", b.descr "Motif Description" from motif_data a, ps_data b, target_dbp c where a.ps_ac = b.ps_ac and a.sequence_id = c.sequence_id order by 3 desc, 1 ;
Query Results Refseq ID Protein Description Repetitions Prosite AC Motif Description --------------- ------------------------------ ----------- ------------ ------------------------------ NP_055995.2 spectrin repeat containing, 145 PS00006 Casein kinase II phosphorylation site nuclear envelope 2 NP_056363.1 bullous pemphigoid antigen 1, 132 PS00006 Casein kinase II phosphorylation site 230/240kDa NP_001139.2 ankyrin 2, neuronal 115 PS00006 Casein kinase II phosphorylation site NP_066267.1 ankyrin 3, node of Ranvier 110 PS00006 Casein kinase II phosphorylation site (ankyrin G) NP_056363.1 bullous pemphigoid antigen 1, 102 PS00005 Protein kinase C phosphorylation site 230/240kDa NP_005520.2 heparan sulfate proteoglycan 2 97 PS00008 N-myristoylation site (perlecan) NP_066267.1 ankyrin 3, node of Ranvier 97 PS00005 Protein kinase C phosphorylation site (ankyrin G) P_001139.2 ankyrin 2, neuronal 96 PS00005 Protein kinase C phosphorylation site NP_115495.1 monogenic, audiogenic seizure 95 PS00006 Casein kinase II phosphorylation site susceptibility 1 homolog (mouse) ...
Regular Expression Searches Quote "Thanks to Oracle 10g's Regular Expressions (RE) query support, it's no longer necessary to export data from the database, process it with a RE enabled tool and then import the data back into the database. Now, RE processing can be handled with a single query." - Marcel Davidson, Head of Database Administration, Myriad Proteomics
C A T G 0 0 1 0 1 Oracle Data Mining BLAST • Implemented using a table function interface • BLAST search functions can be placed in SQL queries • Different functions for match & align • Combination of SQL queries & BLAST is very powerful & flexible
Case Study: BLAST as a Sequence Identification Tool • Identify protein with high sequence similarity and the functional class select function, COUNT(seq_id) f_count from (select t.seq_id, t.score, t.expect, g.function from SwissProt_DB g, Table(BLASTP_MATCH( ‘AEQAERYDDMAAAMKRY’, cursor (select seq_id, sequence from SwissProt_DB), 5)) t/* expect_value */ where t.seq_id = g.seq_id) group by function /* swissprot kw */ order by f_count function, f_count GROUP BY seq_id, function t.seq_id = g.seq_id seq_id, score, expect SwissProt_DB BLASTP_MATCH query_sequence, parameters SwissProt_DB
Case Study: Homology Search between Yeast and Human Data Yeast Protein Interactome Human Protein Interactome Homology Mapping A X Determined experimentally with Y2H C Determined experimentally with Y2H B Y Z Inferred through BLAST Interlogs: (A|X, B|Y) and (A|X, B|Z) Case study courtesy of Prolexys Pharmaceuticals, Inc.
Batch BLAST: Human (query) vs. Yeast (subject) for v1 in c1 loop insert into yeast_human_homolog ( human_refseq, yeast_orf_name, score, expect ) select v1.refseq_id, t.t_seq_id, t.score, t.expect from table ( blastp_match ( v1.sequence_string, cursor ( select a.yeast_acn, a.yeast_seq from yeast_prot_seq a ) ) ) t where t.expect < 0.00001 ; end loop;
BLAST Results Yeast YeastHuman Human Expect 1 Expect 2 Gene 1 Gene 2Refseq 1 Refseq 2 ------- ------- ----------- ----------- -------- -------- YAR018C YIL061C NP_XXXXX1.1 NP_YYYYY1.1 4.79E-12 4.58E-06 YBL016W YDL159W NP_XXXXX2.1 NP_YYYYY2.1 1.11E-08 5.25E-10 YBL016W YDL159W NP_XXXXX3.1 NP_YYYYY3.1 2.63E-10 9.04E-11 YBL016W YDL159W NP_XXXXX4.1 NP_YYYYY4.1 4.57E-07 8.33E-09 YBL016W YDL159W NP_XXXXX5.1 NP_YYYYY5.1 1.57E-22 1.11E-08 YBL063W YIL061C NP_XXXXX6.1 NP_YYYYY6.1 3.17E-64 8.67E-06 YBL063W YIL061C NP_XXXXX7.1 NP_YYYYY7.1 2.30E-06 4.58E-06 YBR109C YDR356W NP_XXXXX8.1 NP_YYYYY8.1 1.78E-07 7.74E-11 YBR109C YDR356W NP_XXXXX9.1 NP_YYYYY9.1 1.24E-08 7.74E-11 YBR109C YDR356W NP_XXXX10.1 NP_YYYY10.1 5.19E-07 2.80E-20 YBR109C YDR356W NP_XXXX11.1 NP_YYYY11.1 3.92E-10 4.39E-11 YBR109C YFR014C NP_XXXX12.1 NP_YYYY12.1 3.67E-48 6.91E-17 YBR109C YOL016C NP_XXXX13.1 NP_YYYY13.1 3.67E-48 1.82E-17 Yeast Interactors Human Interactors Interlogs
BLAST Quote "Oracle 10g's new BLAST feature will enable us to easily integrate multiple types of genomic and proteomic data for complicated queries used in the mining of our proprietary protein-protein interaction and cDNA sequence datasets." - Jake Chen, Principal Bioinformatics Scientist, Myriad Proteomics
Spatial Network Data Model • Data model for managing graph (link-node) structures • Rich graph analysis functions • Supports variety of network structures (hierarchical, directed, undirected, random, scale-free) • Framework for applying network constraints and rules (e.g. path length, cost, minimum bounding rectangle) • Bundled Java visualiser & APIs for 3rd party tools, application development
Case Study: Integration Architecture Native Formats GO NREF EMBL KEGG BIND AFCS Distributed Database layer • Data type determines available routes • Routes can be determined using semantics NDM layer (semantic layer) Nodes Edges Graph Network Route Case study courtesy of Beyond Genomics, Inc.
Network Data Model Quote "Beyond Genomics, Inc., as a leading systems biology company, believes that Oracle 10g's network data model will significantly advance the integration of metabolomic, proteomic, transcriptomic, and clinical data sets and the applications that derive value from these data." – Eric Neumann, Vice President Strategic Informatics, Beyond Genomics, Inc.
Oracle Data Mining • Unsupervised Learning • Hierarchical K-means Cluster • O-Cluster • Non-Negative Matrix Factorization • Apriori • Supervised Learning • Naïve Bayes • Adaptive Bayes Network • Support Vector Machines • PredictorVariance • ODM can mine structured data, text data, or structured and text data
K-Means Clustering • Hierarchical k-means produces tree of clusters • All splits are binary • Each cluster has a centroid & a histogram • Achieves a reliable solution in a single run • Ranked rules that describe attributes for cluster • Cluster assignments are probabilistic using a Bayesian model • Operates on very deep datasets by using a summarization module
Case Study: Brain Tumor Clustering • Collection of 42 Human Brain tumors* and 7,129 gene expression profiles • Clustering of samples according to their gene expression profiles • It is an example of class and taxonomy discovery • Does the data cluster according to the known biological classes? • 42 Tumor Samples: • Normal Cerebellum [MD] (4) • Malignant Gliomas [MGlio] (10) • Medulloblastomas [MD] (10) • Rhabdoid tumors [Rhabdoid] (10) • Primitive Neuroectodermal [PNET] (8) * Pomeroy et al Nature 415, 24, p436 (2002).
MD MGlio Rhabdoid Ncer PNET ODM Hierarchical k-Means Clustering Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Glioblastoma Normal Medulloblastoma Rhabdoid Cluster Cluster Cluster Cluster
MD MGlio Rhabdoid Ncer PNET Literature Results using Hierarchical Clustering From Pomeroy et al Nature 415, 24, p436 (2002).
Association Rules • Captures frequent co-occurrences of items/attribute values (A, B) => C occurrence or A and B together implies C • Can be applied in different scenarios • Market basket analysis • Pattern discovery • Predictive applications • ODM uses SQL-based implementation of Apriori algorithm
Case Study: Analysis of Trends in a Patient Group • Clinical Table of 60 Medulloblastoma Patients • 7 Clinical attributes: • Subtype: classic or desmoplastic medulloblastoma • Size (tumor size): T1-T4 • Stage: M0-M4 • Sex: M, F • Age (range): 0-5, 5-10, 10-15…. • Outcome: S (treatment success), • F (treatment failure) • Chemo (regime type): 0,1,2,3,4,5,6 * Pomeroy et al Nature 415, 24, p436 (2002).
Association Rules Results Over 100 rules reflecting factual or known relationships in data: Age=1 THEN Sex=M (confidence = 0.8) Interpretation: Most 5-10 year-old patients are male Subtype=Desmoplastic THEN Stage=M0 (confidence = 0.79) Interpretation: Most desmoplastic patients in the study have stage M0
Association Rules Results • Other interesting trends: • Stage=M0 THEN Outcome=S • (confidence = 0.74) • Interpretation: Stage M0 vs non-M0 is a predictor of treatment outcome • Stage=M0 AND Size=T3 AND Chemo=1 THEN Outcome=S • (confidence = 0.92) • Interpretation: Most patients with stage M0, size T3 who received chemo regime 1 had good response to treatment
Support Vector Machines • SVM provides a very general multi-purpose and powerful classifier • SVM does not require feature selection and can work well with thousands of input features • SVM is accurate and can approximate complex functional relationships • SVM works in binary, multi-class, sparse (text) classification and regression • SVM is easy to train and apply and can be used in discovery mode or in production automated methodologies
Case Study: Classification of Normal Human Tissue and Tumors • Multiple Examples (14) of normal human tissue and tumors • Could a single model distinguish normal vs cancer? • Train set: 200 samples, test set: 80 samples • Microarrays profiles for 7,129 genes Normal Tissue vs. Cancer S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA 98: 15149-15154 (2001)
Support Vector Machines Results Normal vs. Cancer (Multiple types) SVM Test Set Predictions Predicted Normal Cancer Actual Normal 16 10 Cancer 3 51 Test set accuracy: 83.75% (Naïve Bayes = 75%)
Classification of Multiple Tumor Types DNA Microarray Data for 14 Tumor Classes Published Datasets • S. Ramaswamy et al, Proc. Natl. Acad. Sci. USA 98: 15149-15154 (2001) • C. Yeang et al, Procs. of ISMB 2001. Bioinformatics Discovery Note, 1:1-7, (2001)
Results of Multiple Tumor Type Analysis • Gene expression profiles for 7,129 genes • Datasets tumor type composition: • 9 minutes training time on500MHz Netra • 78.3% accuracy for multi-tumor molecular classification
Outline • Data Challenges • Case Studies • Summary
Summary • Databases have functionality to access and integrate distributed data • There are data management, performance and security benefits to performing analytics in databases • A range of analytical functionality is now available in databases