210 likes | 374 Views
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays. Nazif Cihan Tas ctas@cs CMSC 838 Presentation. Motivation. DNA microarrays techniques are used intensely for identification of biological agents Gene Expression Studies Diagnostic Purposes
E N D
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif Cihan Tas ctas@cs CMSC 838 Presentation
Motivation • DNA microarrays techniques are used intensely for identification of biological agents • Gene Expression Studies • Diagnostic Purposes • Identification of Microorganisms in samples • Item Extraction • Complex Problem • Find the necessary probes and the temperature • Probe sets should be reliably detect and differentiate target sequences • Large Databases • NEW!! Homologous Genes (how to find specific probes) CMSC 838T – Presentation
Talk Overview • Overview of talk • Motivation • Problem Statement • Algorithm • Mathematical Aspects • Experimentation • Discussion CMSC 838T – Presentation
Problem Statement • Positive Probes • Database set S0 • Target S1 • For each sequence in S1, find at least one probe • For S0 - S1 try to avoid it (but do not care if happens) • High Specificity: # of non-target matches are minimized • High Sensitivity: # of covered target seq. is maximized S0 S1 CMSC 838T – Presentation
Problem Statement • Negative Probes • Determine as few as possible probes which together hybridizes with all sequences in S0 - S1 but with NONE in S1. • High Specificity: No seq. in S1 may hybridize • High Sensitivity: Max # of seq. in S0 - S1 be covered S0 S1 CMSC 838T – Presentation
Probe Design Constraints • Sequence Related • Length of probes • Deviation of melting temperature of probe-target hybrids must be low (for physical reasons) • No self complementary regions longer than four nucleotides (not descriptive enough) • Melting temperatures of target and non-target seq. must be larger than a predefined (too close, too hard to identify) • Ensuring a minimum number of mismatches is enough (homologous sequences) • System Related • Execution Time • Usability CMSC 838T – Presentation
Algorithm • Overview • Probe Generation • Hybridization Prediction • Probe Selection CMSC 838T – Presentation
AlgorithmProbe Generation • Subproblem: • Generate probe candidates for the sequences • Keep the set as small as possible without losing any optimal candidate (exclude infeasible ones) • Suffix Tree • Why? • Allows fast recognition of repetitive subsequences • Identifies non-unique probes (i.e. with more than one target) • Efficient for memory and for T computation (reduce time) • How? • Tree is constructed from the sequences • Traversed (Watson-Crick complement) CMSC 838T – Presentation
Suffix Tree • Input: TACTACA • TACTACA • ACTACA • CTACA • TACA • ACA • CA • A • $ denotes end of string • Constructed in linear time CMSC 838T – Presentation
Probe Generation • Further Improvements • Filters applied for cut off • Probe length (predefined) • G-C content (for temperature) • Self-complementarity • Probes should not contain complements as subsequences • Finally, remove highly conserved (non-specific) regions • Insert into hashtables according to their lengths CMSC 838T – Presentation
Algorithm CMSC 838T – Presentation
AlgorithmHybridization Prediction • Subproblem: • Search for the right probe • Search is expensive, Intelligent Hashing used • Design • A frame is moved over target and nontarget seqs. with several lengths • Previous algorithm (Kaderali 2002): Use the suffix tree • At each step, hash values are calculated. If hit, predict melting temperature, store in hybridization matrix. • If there are too many hits for a probe, then it is not unique, remove it • Why intelligent? • Hash time is linear • Allows inexact matching because of hashing (No analysis) • Parallelization • Several threads are searching for probe targets. • Tree and hashtables are fixed. • One thread writes to the final matrix CMSC 838T – Presentation
Hybridization Prediction • Empirical Simulation: • One million random probe-target pairings generated • Four mismatches or one insertion or deletion plus one strong central mismatch chosen • T<20 C for 93% • Complexity is O ( |S0| |S1| ) • Possible probe candidates is |S1| (linear) • Each position of database S0 must be checked CMSC 838T – Presentation
Algorithm CMSC 838T – Presentation
AlgorithmProbe Selection • Subproblem: • Use the hybridization matrix to finalize the probe selection • We have positive probes and negative probes to proceed • Algorithm Analysis: • For each probe candidate • g: #of matches in S1 • b: #of matches in S0 - S1 • t: highest melting point in S1 • Probes for which g or b values is too large, are removed • Sort according to g,b and t. • Apply Depth First Search • Advantages • Performs well (No comparison though) • Guarantees to choose all specific probes if any were found. • Disadvantages • can NOT guarantee optimal selection in terms of coverage CMSC 838T – Presentation
Negative Probe Selection • Let S2 =S0 - S1 and B subset of S2 . The probes that detect S1 also detects some of B elements. • Algorithm for Negative Probes • Apply probe generating and preselection for B. • Conduct hybridization on B U S1 . • Remove the probes which hybridizes with S1 . • Sort the remaining probes according to their hit number. • Successively select the probes which covers most target seq. • Guarantees optimal solution for coverage and probe number usage CMSC 838T – Presentation
AlgorithmProbe Selection CMSC 838T – Presentation
Mathematical Aspects CMSC 838T – Presentation
Experimentation • Parallelized on SMP platform • Classic worker-producer • Intel Dual Pentium III 933 MHz, 1 GB memory • Test data • ssu rRNA of ARB project • 20.282 ssu rRNA sequences • 1.401 < lengths < 4.179 • %97 of them are shorter than 2.000 bases CMSC 838T – Presentation
Discussion • High Performance • Execution is linear with size of database, decreases if longer probes are used • Low Memory Consumption • Depends on the size of the sequence selection, NOT database size • Automatic Design of Group Probes and negative probes • High Quality Probe Design CMSC 838T – Presentation
Discussion • Comparison with previous work • vs. ARB • Not suited for large scale probe design • vs. LCF • Does not consider highly conserved data • Memory consumption is high • Works well with short probes only • vs. others • Mostly can not deal with insertion and deletions • Execution is slow CMSC 838T – Presentation