Efficient Probe Selection in Micro-array Design
Algorithmics Group, Dept. of Computer Science, University of Liverpool
Speaker: Cindy Y. Li
Joint work with: Leszek Gąsieniec, Paul Sant, and Prudence Wong
Special thanks go to: David Peleg
http://www.csc.liv.ac.uk/~cindy
Talk Overview
• Background: Microarrays & Hybridization
• Problem Statement
• Our Approach
• Experimental Work
• Conclusion
Hybridization Process
DNA: 5'...TGTGCTTGACAACATAGTTG...3'
Short DNA fragment: 3'-CTACGGACCGAT-5'
A single-stranded DNA probe (middle panel) is linked to an enzyme and allowed to base pair (hybridize) with the mRNA. After a series of washes, only fragments that are hybridized with the target mRNA remain.
Tool: DNA Microarrays
• A labeled DNA/RNA mixture is flushed over an array of short DNA fragments
• Laser activation of fluorescent labels
Probe concept
• A probe is a substring of a gene, which acts as its fingerprint (a.k.a. signature)
• Probes are relatively short DNA sequences. Usually, a probe is ~20-25 base pairs long.
• For example:
DNA: ...TGTGCTTGGCAACATAGATAGATGC...
Probe: TGCTTGGCAACATAGATAGA
Finding unique probes
[Figure: probes P1-P5 mapped to genes G1-G4]
• We are interested in finding a single unique probe (or a small group of unique probes) for each gene
• The search process should be both time and space efficient
Finding unique probes
• Given a database S of gene sequences
• For each sequence g in S, try to find a single probe P which hybridizes only with g
• If P cross-hybridizes with some other sequences in S (i.e., P has a close occurrence elsewhere in S), then find a small set of probes that uniquely identifies g
• Sometimes multiple probes are required due to the error-prone wet lab environment
The use of probes
• The uniqueness of probes allows us to identify the genes taking part in the experiment in the wet lab
• I.e., by seeing the trace (green color) of a number of probes on the microarray, we can identify precisely which genes were involved in the experiment
Finding Unique Probes - Performance Measure
• Each gene in the database S should be uniquely identified by the smallest possible number of probes
• The search for probes should be time/space efficient
• The search time should be "fairly" independent of the length of the probes
• All probes should be far (in Hamming distance) from each other
• Probes should satisfy some extra conditions (e.g., related to the hybridization process)
Naive approach: scan through the whole length-n genome for every length-m probe and determine whether the Hamming distance is big enough, which takes O(mn^2) time. For example, this takes 72 hours for the S. pombe genome of length 7.1 x 10^6 bps, and is thus impractical for large genomes.
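The naive approach can be sketched as follows. This is a minimal illustration, assuming that "far enough" means a Hamming distance greater than a threshold d to every length-m window of the background sequence; the function names (`hamming`, `is_far_from`, `naive_probes`) are hypothetical and not taken from the talk.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_far_from(probe, background, d):
    """True if every length-m window of `background` differs from `probe`
    in more than d positions (assumed uniqueness test)."""
    m = len(probe)
    return all(
        hamming(probe, background[i:i + m]) > d
        for i in range(len(background) - m + 1)
    )


def naive_probes(gene, background, m, d):
    """Try every length-m window of `gene` against the whole background.

    Each candidate costs O(m * n) for a background of length n, so trying
    all candidates is quadratic in the sequence length, as on the slide.
    """
    return [
        gene[i:i + m]
        for i in range(len(gene) - m + 1)
        if is_far_from(gene[i:i + m], background, d)
    ]
```

Every candidate window triggers a full scan of the background, which is where the quadratic factor comes from.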
Previous Work – Approaches based on
• Suffix array and fast pattern matching [Li F. and Stormo G., 2001]
• BLAST to avoid cross-hybridization [Rouillard J. M., Herbert C. J. and Zuker M., 2002]
• Longest common substrings [Rahmann S., 2002]
• Various filtering techniques [Lockhart D. J. et al., 1996]
• Methods based on the pigeonhole principle [Lee W. H. and Sung W. K., 2003]
• etc.
Previous Work – The probe selection criteria
• No single base exceeds 50% of the probe size
• The length of any contiguous run of As and Ts, or of Cs and Gs, is less than 25% of the probe size
• (G+C)% is between 40% and 60% of the probe
• Sensitivity: no self-complementarity within the probe sequence
• Homogeneity: melting temperature is neither too low nor too high
• Specificity: probes are unique to each gene
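The three composition-based criteria above can be checked with a few lines of Python. This is only a sketch under stated assumptions: a "contiguous run of As and Ts" is read as a maximal stretch containing only A/T (likewise C/G), and the self-complementarity, melting temperature, and specificity criteria are not covered. The function name `passes_composition_criteria` is hypothetical.

```python
import re


def passes_composition_criteria(probe):
    """Check the base-composition criteria for a candidate probe.

    Assumed reading of the slide:
      1. no single base makes up more than 50% of the probe
      2. the longest stretch of only A/T, or only C/G, is < 25% of the probe
      3. G+C content lies between 40% and 60%
    """
    p = probe.upper()
    m = len(p)
    if m == 0 or set(p) - set("ACGT"):
        return False

    # 1. No single base exceeds 50% of the probe size.
    if max(p.count(base) for base in "ACGT") > 0.5 * m:
        return False

    # 2. Longest contiguous A/T-only or C/G-only stretch is below 25%.
    longest_run = max(len(run) for run in re.findall(r"[AT]+|[CG]+", p))
    if longest_run >= 0.25 * m:
        return False

    # 3. G+C fraction between 40% and 60%.
    gc_fraction = (p.count("G") + p.count("C")) / m
    return 0.40 <= gc_fraction <= 0.60
```

A filter like this is cheap (linear in the probe length), so it is natural to apply it before any expensive cross-hybridization check.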
Previous Work – Test data
• Total length: 8,783,280
• Total number of genes: 5,888
Talk Overview
• Background: Microarrays & Hybridization
• Problem Statement
• Our new alternative approach
  - main observations
  - the algorithm
• Experimental Work
• Conclusion
Main Observations
In general, randomness helps!
• 80% of "randomly" chosen probe candidates (chosen by our algorithm) satisfy the probe selection criteria related to the hybridization process [this suggests that random sequences are more likely to hybridize properly]
• The expected Hamming distance between two randomly chosen sequences of length n over a 4-letter alphabet is ~3n/4 [this suggests that randomly chosen probes will be far from each other]
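The 3n/4 figure follows from linearity of expectation. The short derivation below assumes that the two sequences X and Y are independent and that each letter is drawn uniformly from {A, C, G, T}; real genomic sequences are not uniform, so this is an idealization.

```latex
% X, Y: independent length-n sequences, each letter uniform on {A, C, G, T}
\begin{align*}
\Pr[X_i \neq Y_i]
  &= 1 - \sum_{a \in \{A,C,G,T\}} \Pr[X_i = a]\,\Pr[Y_i = a]
   = 1 - 4 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4}
   = \tfrac{3}{4}, \\
\mathbb{E}[H(X,Y)]
  &= \sum_{i=1}^{n} \Pr[X_i \neq Y_i]
   = \tfrac{3n}{4}.
\end{align*}
```

For a probe of length 20, for example, two random candidates are expected to differ in about 15 positions.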
An interesting observation
• In general, fragments of DNA sequences representing genes are more deterministic (contain more organized information) compared to the rest of the sequence.
• In contrast, the best probes (signatures) representing genes are very likely to be random or almost random!
The Algorithm
(*) For every gene g in the database S:
a) Generate a random base-pair sequence R of length m
b) Find the length-m substring P of gene g that is closest to R
c) Check P against the good-probe criteria [80% pass this test]; if P does not pass, go to a)
d) Check P for cross-hybridization [98% pass this test]:
  • Compare P with every length-m substring Q in the other sequences S - {g}
  • If H(P,Q) > d for every such Q, P is chosen as the probe for g; go to (*)
  • Otherwise, P can possibly cross-hybridize and we must generate another random length-m sequence; go to a)
The algorithm
(*) For every gene g in the database S:
a) Generate a random base-pair sequence of length m
b) Find the closest length-m substring P in gene g
c) Check P against the good-probe criteria; if P does not pass, go to a)
[Figure: random sequence R aligned with its closest substring P in gene g]
The algorithm
d) Check P for cross-hybridization:
  • Compare P with every length-m substring Q in the other sequences (S - {g})
  • If H(P,Q) > d for every such Q, P is chosen as the probe for g; go to (*)
  • Otherwise, P can possibly cross-hybridize and we must generate another random length-m sequence; go to a)
[Figure: probe P from gene g is compared with the background sequences g1, g2, ..., gi; P is far from g1 and g2, but a substring Q with H(P,Q) < d is found, so another random length-m sequence is generated]
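Steps a) to d) can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation: it assumes "closest" means smallest Hamming distance, uses brute-force scans where the talk presumably uses something faster, and takes the good-probe test as a caller-supplied function. The names `select_probe`, `closest_substring`, `passes_criteria`, and `max_tries` are hypothetical.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def closest_substring(gene, r):
    """Step b): the length-m window of `gene` nearest to `r` (assumed Hamming)."""
    m = len(r)
    windows = (gene[i:i + m] for i in range(len(gene) - m + 1))
    return min(windows, key=lambda w: hamming(w, r))


def select_probe(gene, others, m, d, passes_criteria, rng=None, max_tries=1000):
    """Return a probe for `gene`, or None if no candidate survives.

    `others` is the list of background sequences S - {g}; `max_tries` is an
    added safeguard so that the loop always terminates.
    """
    if len(gene) < m:
        return None
    rng = rng or random.Random()
    for _ in range(max_tries):
        # a) random sequence of length m
        r = "".join(rng.choice("ACGT") for _ in range(m))
        # b) closest length-m substring of the gene
        p = closest_substring(gene, r)
        # c) good-probe criteria
        if not passes_criteria(p):
            continue
        # d) cross-hybridization: P must be farther than d from every Q
        if all(
            hamming(p, s[i:i + m]) > d
            for s in others
            for i in range(len(s) - m + 1)
        ):
            return p
    return None
```

Calling `select_probe` once per gene corresponds to the outer loop (*). Genes for which it returns None would be the ones that need several probes or have no probe at all.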
Experimental Work
For yeast:
• 1.80% of genes have no probes (duplicated, very similar, or too short)
• Among the remaining genes:
  - 98.0% need only one probe
  - 1.5% need two probes
  - 0.5% need three probes
Similar results were obtained for the E. coli genome.
Conclusion
• Almost all (98%) genes can be uniquely identified by a single probe; the others need at most three probes
• Our method is:
  - Suitable for large-scale probe design
  - Fairly independent of the length of the probes
  - Both time and space efficient
  - Useful in the design of fault-tolerant systems of probes
Ongoing Work
Distinguish multiple targets in a sample
[Figure: probes P1, P1', P2, P2' on genes g1, g2, g3]
Questions?
Thank You!
Presented by Cindy Y. Li
Self-complementarity
Probe 5' TTTCAGTAATAAAAGATTTCTGT 3'
Probe 3' TGTCTTTAGAAAAATTAGACTTT 5'
[Figure: base pairing between the two copies of the probe]
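One rough way to screen for self-complementarity is to look for the longest stretch of the probe whose reverse complement also occurs in the probe. The sketch below does this with a standard longest-common-substring table; it only considers perfect Watson-Crick pairing, ignores loops and mismatches, and is an assumption about how such a check could be done rather than the method used in the talk. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_self_complementary_stretch(probe):
    """Length of the longest substring of `probe` whose reverse complement
    also occurs in `probe` (longest common substring of probe and its
    reverse complement, computed by dynamic programming)."""
    rc = reverse_complement(probe)
    best = 0
    prev = [0] * (len(rc) + 1)
    for a in probe:
        cur = [0] * (len(rc) + 1)
        for j, b in enumerate(rc, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

A probe could then be rejected when this length exceeds some threshold; the threshold itself would have to be chosen experimentally.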
Melting Temperature
• T_M can be used as a parameter to evaluate probe hybridization behavior
• T_M is calculated for each probe from nearest-neighbor thermodynamics (SantaLucia et al., 1996), using the following quantities:
  - ΔH, the sum of the nearest-neighbor enthalpy changes
  - ΔS, the sum of the nearest-neighbor entropy changes
  - R, the gas constant (1.987 cal deg^-1 mol^-1)
  - C_T, the total molar concentration of strands
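For reference, the two-state nearest-neighbor expression usually associated with these quantities is written out below. It is assumed, not confirmed, that this is the form shown in the talk; the factor 4 applies to non-self-complementary duplexes, and any salt correction is left out.

```latex
% Assumed standard form; Delta H in cal/mol, Delta S in cal/(K mol), T_M in kelvin
\[
T_M = \frac{\Delta H}{\Delta S + R \,\ln\!\left( C_T / 4 \right)}
\]
% Subtract 273.15 to express T_M in degrees Celsius.
```

Here ΔH and ΔS are the summed nearest-neighbor contributions over all adjacent base pairs of the probe, so T_M depends on the order of the bases and not only on the G+C content.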
Melting Temperature
• Thermodynamic stability (nearest neighbour)
[Figure: probe TTTCAGTAATTAAAAAGATTTCTGT annotated with nearest-neighbour values -1.2, -1.7, -1.5 kcal/mol]