330 likes | 440 Views
Can New Oracle10 g Search Features Help Bridge the Biological Discovery Gap?. Marcel Davidson Head of Data Management. Jake Y. Chen, Ph.D. Head of Computational Proteomics & Principle Bioinformatics Scientist. Messages. New Informatics Challenges in Protein Interactomics R&D
E N D
Can New Oracle10g Search Features Help Bridge the Biological Discovery Gap? Marcel Davidson Head of Data Management Jake Y. Chen, Ph.D. Head of Computational Proteomics & Principle Bioinformatics Scientist
Messages • New Informatics Challenges in Protein Interactomics R&D • Scale, integration, discovery issues • A data-driven discovery-oriented framework • “Enabling” Features in 10g? • Biological data integration? • Biological data analysis integration?
Outline • Data-driven Discovery-oriented Computational Framework • 10g Regular Expression Case Studies • 10g BLAST Case Studies
Why Myriad Maps Protein-Protein Interactions Conventional Drug Discovery Post-Genomic Drug Discovery GPCR enzyme Nucleus Nucleus hormone receptor novel, more specific targets non-specific targets novel, druggable targets target validation enhanced pre-validation target pool lead discovery, optimization
Principle of the Yeast Two-Hybrid (Y2H) System Scenario A: Human Proteins X and Y do Interact Activation Domain Prey Human Protein Y Human Protein X Reporter mRNA Reporter mRNA Readout: Yeast colonies grow Reporter mRNA DNA Binding Domain Bait Reporter Gene DNA Scenario B: Human Proteins X and Z do not Interact Prey Activation Domain Human Protein Z Human Protein X Readout: No growth of yeast colonies ( No Reporter Gene Activity ) DNA Binding Domain Bait Reporter Gene DNA
Data Collected from Y2H System Perform BLAST Against Human REFSEQ DB
Protein Interaction Network (Snapshot of ~8,000 interactions)
>1.5million sequence fragments • ~250,000 search experiments performed • several TB data storage Experimental Measurement ? • ~1000 relevant interactions for each interested pathways DistilledInformation • 1-10 novel drug targets per disease Marketable Knowledge Knowledge Discovery (KD) Challenges • ~80,000 unique interactions • ~100 biological data sources Protein Interaction Data $$$, drugs, … • Data-driven • Discovery-oriented
Data Cleansing • Statistical Data Analysis • Domain-specific • Data Modeling Genomics/Functional Genomics Data Reduce Data Noise Represent Interactions and Pathways • Data Integration • LIMS Programming E-RDBMS Collect raw sequences and lab condition measurements Organize Data in Regulatory Pathways • DB Querying • Visualization Select and Validate Drug targets • Knowledge Curation KD in Interaction-based Proteomics
Bioinformatics DB Framework Y2H Data Processing and Analysis DB Lab_Seq, Seq_Match, Y2H_Mart • Annotation DB • RefSeq, LocusLink, GO, OMIM, CGAP, Protein Kinase DB, GPCR DB, Ensemble, Curation, … • Y2H Interaction Data Mart • Y2h_Mart
A Schema Fragment to Manage Sequence Similarity Results Jake Yue Chen and John Carlis (2003) Genomic Data Modeling. Information Systems Journal, 28(4), p287-310.
Interaction Matrix using Randomly Ordered Locus IDs • 12,958 unique Interactions • 1955 bait loci • 2766 prey loci Jake Yue Chen, et al (2003) Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA.
Outline • Data-driven Discovery-oriented Computational Framework • 10g Regular Expression Case Studies • 10g BLAST Case Studies
Oracle10g Regular Expressions: Powerful String Processing • RE new tools in Oracle10g • Search and manipulate data strings of arbitrary complexity • Prior database solutions • SQL LIKE operator • Java stored procedures, C external libraries • Prior non-database solutions: AWK, SED, GREP, PERL, etc. • Done now inside database • Facilitates rapid data-centric analysis
Case1: Retrieving Protein data from SGD (Saccharomyces Genome Database) ORF Identifier Associated Amino Acid Sequence
HTTP Raw Data </script> </head><body><body bgcolor='#FFFFFF'> <table cellpadding="2" width="100%" cellspacing="0" border="0"><tr><td colspan="4"><hr width="100%" /></td></tr><tr><td valign="middle" align="right"><a href="http://www.yeastgenome.org/"><img alt="SGD" border="0" src="http://www.yeastgenome.org/images/SGD-to.gif" /></a></td><th valign="middle" nowrap="1">Quick Search:</th><td valign="middle" align="left"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/search/quickSearch" enctype="application/x-www-form-urlencoded"> <input type="text" name="query" size="13" /><input type="submit" name="Submit" value="Submit" /> </form></td><th valign="middle" align="left"><a href="http://www.yeastgenome.org/sitemap.html">Site Map</a> | <a href="http://www.yeastgenome.org/HelpContents.shtml">Help</a> | <a href="http://www.yeastgenome.org/SearchContents.shtml">Full Search</a> | <a href="http://www.yeastgenome.org/">Home</a></th></tr><tr><td align="left" colspan="4"><table cellpadding="1" width="100%" cellspacing="0" border="0"><tr align="center" bgcolor="navajowhite"><td><font size="-1"><a href="http://www.yeastgenome.org/ComContents.shtml">Community Info</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/SubmitContents.shtml">Submit Data</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd">BLAST</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/web-primer">Primers</a></font></td><td><font size="-1"><a href="http://seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch">PatMatch</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/seqTools">Gene/Seq Resources</a></font></td><td><font size="-1"><a href="http://www.yeastgenome.org/Vl-yeast.shtml">Virtual Library</a></font></td><td><font size="-1"><a href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Contact SGD</a></font></td></tr></table></td></tr><tr><td colspan="4"><hr width="100%" /></td></tr></table><table cellpadding="0" width="100%" cellspacing="0" border="0"><tr><td width="10%"><br /></td><td valign="middle" align="center" width="80%"><h1>Sequence for a region of YDR099W/BMH2</h1></td><td valign="middle" align="right" width="10%"></td></tr></table><p /><center><a target="infowin" href="http://db.yeastgenome.org/cgi-bin/SGD/suggestion">Send questions or suggestions to SGD</a></center><p /><p /><center><a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-blast2sgd?name=YDR099W&suffix=prot">BLAST search</a> | <a target="infowin" href="http://seq.yeastgenome.org/cgi-bin/SGD/nph-fastasgd?name=YDR099W&suffix=prot">FASTA search</a></center><p /><center><hr width="35%" /></center><p /><font color="FF0000"><strong>Protein translation of the coding sequence.</strong></font><p /><p />Other Formats Available: <a href="http://db.yeastgenome.org/cgi-bin/SGD/getSeq?map=pmap&seq=YDR099W&flankl=0&flankr=0&rev=">GCG</a><pre>>YDR099W Chr 4 MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS WRIVSSIEQKEESKEKSEHQVELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESK VFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEIATTELPPTHPIRLGLALNFS VFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK* </pre><hr size="2" width="75%"> <table width="100%"><tr><td valign="top" align="left"><a href="http://www.yeastgenome.org/"><img border="0" src="http://www.yeastgenome.org/images/arrow.small.up.gif" />Return to SGD</a></td><td valign="bottom" align="right"><form method="post" action="http://db.yeastgenome.org/cgi-bin/SGD/suggestion" enctype="application/x-www-form-urlencoded" target="infowin" name="suggestion"> <input type="hidden" name="script_name" value="/cgi-bin/SGD/getSeq" /><input type="hidden" name="server_name" value="db.yeastgenome.org" /><input type="hidden" name="query_string" value="seq=YDR099W&flankl=0&flankr=0&map=p3map" /><a href="javascript:document.suggestion.submit()">Send a Message to the SGD Curators<img border="0" src="http://www.yeastgenome.org/images/mail.gif" /></a> </form></td></tr></table></body></html> Need to parse out embedded AA Sequence
Function to Return AA Sequence Given ORF Parameterized ORF Id Web site URL create or replace function orf2seq ( p_orf in varchar2 ) return varchar2 is v_stream clob; strt number; begin -- Retrieve the HTTP stream: v_stream := httpuritype.getclob(httpuritype.createuri( 'http://db.yeastgenome.org/cgi-bin/SGD/getSeq?seq='||p_orf|| '&flankl=0&flankr=0&map=p3map') ); -- Trim off the head of the stream: strt := dbms_lob.instr(v_stream, 'Submit', 1, 1); -- Strip out control characters, new lines, etc.: v_stream := regexp_replace(dbms_lob.substr(v_stream, 4000, strt), '[[:cntrl:]]', ''); -- Return the AA sequence: return(regexp_substr(dbms_lob.substr(v_stream, 4000, strt), '[[:upper:]]{10,}')); end; RegExp to remove control chars from HTTP stream RegExp to extract AA sequence
Amino Acid Sequence for ORF ‘YDR099W’ SQL> select orf2seq('YDR099W') from dual; ORF2SEQ('YDR099W') -------------------------------------------------------------------------------- MSQTREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRASWRIVSSIEQKEESKEKSEHQ VELIRSYRSKIETELTKISDDILSVLDSHLIPSATTGESKVFYYKMKGDYHRYLAEFSSGDAREKATNSSLEAYKTASEI ATTELPPTHPIRLGLALNFSVFYYEIQNSPDKACHLAKQAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDISES GQEDQQQQQQQQQQQQQQQQQAPAEQTQGEPTK Elapsed: 00:00:01.24 Elapsed time <2 sec. (network latency) SQL> insert into pseq (orf_id, sequence) 2 values ('YDR099W', orf2seq('YDR099W'));
Case 2: Motif Searching in Proteins TKP TKP motif pattern PROSITE database of protein sequence motifs ID TYR_PHOSPHO_SITE; PATTERN. AC PS00007; DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE). DE Tyrosine kinase phosphorylation site. PA [RK]-x(2,3)-[DE]-x(2,3)-Y. CC /TAXO-RANGE=??E?V; CC /SITE=5,phosphorylation; CC /SKIP-FLAG=TRUE; DO PDOC00007; Source: http://www.expasy.org/prosite/ps_frequent_patterns.txt • TKP Pattern: [RK]-x(2,3)-[DE]-x(2,3)-Y. • R=Arginine, K=Lysine, D=Aspartate, E=Glutamate, Y=Tyrosine, x=any AA • Oracle10g Regular Expression Equivalent • [RK].{2,3}[DE].{2,3}[Y] 1 Aspartate or Glutamate 2 – 3 Any 2 – 3 Any 1 Arginine or Lysine 1 Tyrosine
SQL Example: Retrieving all Interacting Proteins with TKP select distinct substr(a.refseq_id, 1, 9) refseq_id, length(a.seq_string_varchar) seq_length, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 1) motif_offs1, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 2) motif_offs2, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 3) motif_offs3, regexp_instr(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]', 1, 4) motif_offs4 from target_db a, y2h_interaction_p b where a.refseq_id like 'NP%' and regexp_like(a.seq_string_varchar, '[RK].{2,3}[DE].{2,3}[Y]') and (substr(a.refseq_id,1,9) = b.bait_refseq or substr(a.refseq_id,1,9) = b.prey_refseq) ; Returns first 4 instances of TKP in each sequence Returns all rows with TKP site
SQL Example Output Motif #2 at offset 50 Motif #3 at offset 62 Motif #1 at offset 8 REFSEQ_ID SEQ_LENGTH MOTIF1_OFFS MOTIF2_OFFS MOTIF3_OFFS MOTIF4_OFFS ------------ ---------- ----------- ----------- ----------- ----------- NP_003961 1465 14 202 347 537 NP_003968 330 241 0 0 0 NP_003983 490 8 50 62 93 NP_004001 3562 3085 0 0 0 ... MHHCKRYRSPEPDPYLSYRWKRRRSYSREHEGRLRYPSRREPPPRRSRSRSHDRLPYQRRYRERRDSDTYRCEERSPSFG EDYYGPSRSRHRRRSRERGPYRTRKHAHHCHKRRTRSCSSASSRSQQSSKRTGRSVEDDKEGHLVCRIGDWLQERYEIVG NLGEGTFGKVVECLDHARGKSQVALKIIRNVGKYREAARLEINVLKKIKEKDKENKFLCVLMSDWFNFHGHMCIAFELLG KNTFEFLKENNFQPYPLPHVRHMAYQLCHALRFLHENQLTHTDLKPENILFVNSEFETLYNEHKSCEEKSVKNTSIRVAD FGSATFDHEHHTTIVATRHYRPPEVILELGWAQPCDVWSIGCILFEYYRGFTLFQTHENREHLVMMEKILGPIPSHMIHR TRKQKYFYKGGLVWDENSSDGRYVKENCKPLKSYMLQDSLEHVQLFDLMRRMLEFDPAQRITLAEALLHPFFAGLTPEER SFHTSRNPSR Motif #4 at offset 93 [RK].{2,3}[DE].{2,3}[Y] Result: 702 (56%) interacting proteins with TKP site
Is 56% TKP in interacting proteins significant? • Random sample test of all NP entries • N = 33 random samples • Sample size 7.4% (~1251) • Sample mean = 515 • SD = 17.2 • Significance level < 1E-30
Outline • Data-driven Discovery-oriented Computational Framework • 10g Regular Expression Case Studies • 10g BLAST Case Studies
Similarity Search (Sequence Comparison): A Routine Biology Task k Pair-wise Comparison Results nTarget Sequences A Query Sequence Similarity Search has not been integrated into the DB system.
Using BLAST can be a laborious process & a data-management hell • Custom setup of BLAST target database • Iterate through query sequences: “Batch BLAST” • Export/parse/filter/import data <-> DBMS • Integration of results with external data
Case 1: Oracle 10g BLASTN as a sequence identification tool -- A sequence fragment with a sequence_id = 100 -- Sequence is stored in the query_db table. TACACACCTCGGCGTCGCAGCTCTCGATCATCTCCGGAGCTAACAAGGAAGGCCGGACTGTCCCGTAGAAGCCGCTCTGC SELECT t.t_seq_id, t.expect FROM TABLE ( BLASTN_MATCH ( (select sequence FROM query_db where sequence_id = 100), CURSOR (select refseq_id, sequence_string FROM target_db where refseq_id like 'NM_%') ) ) t WHERE t.expect < 1E-20; T_SEQ_ID EXPECT ----------------- -------------- NM_016333.2 0
Case 2: Discovering “Interlogs” Yeast Protein Interactome Human Protein Interactome Homology Mapping A X C B Y Z Interlogs: (A|X, B|Y) and (A|X, B|Z)
A Computational Intensive Task • Data to use • Yeast Protein-Protein Interaction Data • Yeast Protein Sequences • Human Protein-Protein Interaction Data • Human Protein Sequences & Annotations • Analysis to prepare • Homology search: yeast vs. human proteins • Things to consider • Collect/parse public data from web • Import/export data for BLAST • Connect analysis result to internal data Missing Data Laborious Traditional way? Or inside DBMS?
Pipelining Missing Data Directly into BLASTP Searches insert into yeast_human_homolog select 'YDR099W‘ Yeast_ORF_name, t.t_seq_id Human_refseq, t.expect E_Value from TABLE ( BLASTP_MATCH ( (SELECT orf2seq ('YDR099W') FROM dual), CURSOR (SELECT refseq_id, sequence_string FROM target_db WHERE refseq_id LIKE 'NP_%') ) ) t WHERE t.expect < 0.0001 ; BLAST in DBMS Online Data Integration BLAST Target DB Customization -- Note: Iterate through Yeast ORF Names to perform batch BLAST.
Mission Impossible: Accomplished SELECT a.orf_1, a.orf_2, b.human_refseq, b.e_value , c.human_refseq, c.e_value FROM yeast_interaction a, yeast_human_homolog b, yeast_human_homolog c, y2h_interaction_p d WHERE a.orf_1 = b.yeast_ORF_name and a.orf_2 = c.yeast_ORF_name and ( (b.human_refseq = d.bait_refseq and c.human_refseq = d.prey_refseq) or (b.human_refseq = d.prey_refseq and c.human_refseq = d.bait_refseq) ) ; ORF_1 ORF_2 HUMAN_REFSEQ E_VALUE HUMAN_REFSEQ E_VALUE ------------------------- ------------------------- --------------- ---------- --------------- ---------- YCR002C YHR107C NP_xxxxx1 5.9279E-44 NP_yyyyy1 3.7130E-46 YCR002C YJR076C NP_xxxxx2 5.9279E-44 NP_yyyyy2 1.7807E-48 YJR076C YHR107C NP_xxxxx3 1.9734E-39 NP_yyyyy3 3.7130E-46 YCR002C YHR107C NP_xxxxx4 2.3257E-48 NP_yyyyy4 7.4988E-39 YCR002C YJR076C NP_xxxxx5 2.3257E-48 NP_yyyyy5 1.9734E-39 YJR076C YHR107C NP_xxxxx6 1.7807E-48 NP_yyyyy6 7.4988E-39
Conclusion • Data-driven discovery-oriented bioinformatics framework demands rich bio-specific DBMS support • 10g Regular Expression and BLAST in DBMS features benefit our scientific discovery tasks in interactome studies • Additional enhancements
References • Jake Yue Chen, et al (2003) Initial Large-scale Exploration of Protein-protein Interactions in the Human Brain.Proceedings of the IEEE Computer Science Society Bioinformatics Conference 2003. Stanford University, Stanford, CA. • Sudhir Sahasrabudhe and Chen, Jake Yue (2003) Extracting Biological Information from System-scale Protein Interactome Data.Tutorial at the 11th International Conference on Intelligent Systems in Molecular Biology. Brisbane, Australia. • Jake Yue Chen and John Carlis (2003) Similar_Join: Extending DBMS with a Bio-specific Operator.Proceedings of the 2003 ACM Symposium on Applied Computing. Melbourne, Florida. • Jake Yue Chen and John Carlis (2003) Genomic Data Modeling.Information Systems, Vol 28, issue 4: p287-310.