Random Projection Approach to Motif Finding
Adapted from http://genome.ucsd.edu/classes/be202/ppt/FindingSignals-RandomProjections.ppt
daf-19 Binding Sites in C. elegans (Peter Swoboda)
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
Figure labels: che-2, daf-19, osm-1, osm-6, F02D8.3; position axis from -150 to -1
Algorithmic Techniques
• MEME (Expectation Maximization)
• GibbsDNA (Gibbs Sampling)
• CONSENSUS (greedy multiple alignment)
• WINNOWER (Clique finding in graphs)
• SP-STAR (Sum of pairs scoring)
• MITRA (Mismatch trees to prune exhaustive search space)
The (l,d) Planted Motif Problem (Sagot 1998, Pevzner & Sze 2000)
• Generate a random consensus sequence C of length l.
• Generate 20 instances, each differing from C by d random mutations.
• Plant one instance at a random position in each of N = 20 random sequences of length n = 600.
• Can you find the planted instances?
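The generation procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark code used by the cited authors: the names `mutate` and `plant_motif` and the `seed` parameter are hypothetical, and the sketch assumes each instance receives exactly d substitutions at distinct positions over a uniform A/C/G/T background.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d distinct positions substituted."""
    chars = list(consensus)
    for pos in rng.sample(range(len(chars)), d):
        # Pick a base different from the current one so the position really changes.
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)

def plant_motif(l, d, num_seqs=20, n=600, seed=0):
    """Build one (l, d) instance; returns (consensus, sequences, planted start positions)."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(num_seqs):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)          # 0-based start of the planted instance
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions

consensus, sequences, positions = plant_motif(l=15, d=4)
```

Returning the planted positions alongside the sequences makes it possible to check a motif finder's predictions against the ground truth.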
Planted Motifs
AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACTGGATGACT
GGATATACATGAACACGGTGGGAAAACCCTGAC
• Each instance differs from ACAGGATCA by 2 mutations
• Remaining sequence random
Random Projection Algorithm
• Buhler and Tompa (2001)
• Guiding principle: Some instances of a motif agree on a subset of positions.
• Use information from multiple motif instances to construct model.
Figure: instances of the (7,2) motif M = ATGCGTC in input sequences x(1), x(2), x(5), x(8):
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...
k-Projections
• Choose k positions in a string of length l.
• Concatenate the nucleotides at the chosen k positions to form a k-tuple.
• In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
Example (l = 15, k = 7): P = (2, 4, 5, 7, 11, 12, 13) maps ATGGCATTCAGATTC to TGCTGAT.
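A k-projection is simple enough to write out directly. The sketch below assumes 1-based positions, as in the slide's example; `project` and `random_projection` are illustrative names, not taken from the original slides.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k distinct 1-based positions out of l, uniformly at random."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT, the k-tuple shown on the slide
```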
Random Projection Algorithm
• Choose a projection by selecting k positions uniformly at random.
• For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
• Recover the motif from a bucket containing multiple l-tuples.
Figure: from input sequence x(i) = …TCAATGCACCTAT..., the l-tuple TGCACCT hashes to bucket TGCT.
Example
• l = 7 (motif size), k = 4 (projection size)
• Choose projection (1,2,5,7)
Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Buckets: ATCCGAC → ATGC, GCCTTAC → GCTC
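The hashing step in this example can be reproduced with a dictionary keyed by the projected k-tuple. This is a sketch under assumed conventions: `hash_lmers` is a hypothetical name, projection positions are 1-based, and each bucket stores (sequence index, 0-based start) pairs rather than the l-mers themselves.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map each projected k-tuple to the list of (sequence index, start) that hash to it."""
    buckets = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)  # 1-based projection positions
            buckets[key].append((seq_id, start))
    return buckets

seq = "TAGACATCCGACTTGCCTTACTAC"
buckets = hash_lmers([seq], l=7, positions=(1, 2, 5, 7))
for seq_id, start in buckets["ATGC"]:
    print(seq[start:start + 7])  # l-mers in bucket ATGC, including ATCCGAC
```

Storing positions instead of strings keeps the link back to the input sequences, which the later refinement step needs.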
Hashing and Buckets
• The hash function h(x) is obtained from the k positions of the projection.
• Buckets are labeled by values of h(x).
• Enriched buckets: contain at least s l-tuples, for some parameter s.
Figure: buckets labeled GCTC, CATC, ATTC, ATGC.
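Selecting enriched buckets is a one-line filter once the buckets exist. The sketch below assumes buckets are held in a plain dictionary from hash value to a list of l-tuples; the function name `enriched_buckets` and the toy bucket contents (taken from the l-tuples shown on neighbouring slides) are for illustration only.

```python
def enriched_buckets(buckets, s):
    """Keep only the buckets that contain at least s l-tuples."""
    return {key: members for key, members in buckets.items() if len(members) >= s}

buckets = {
    "ATGC": ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"],
    "GCTC": ["GCCTTAC"],
}
print(enriched_buckets(buckets, s=3))  # only the ATGC bucket survives
```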
Motif Refinement
• How do we recover the motif from the sequences in the enriched buckets?
• k nucleotides are known from the hash value of the bucket.
• Use the information in the other l-k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler.
Figure: bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) → local refinement algorithm → candidate motif ATGCGTC.
Frequency Matrix Model from Bucket
Figure: bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) → frequency matrix W → EM algorithm → refined matrix W*.
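A frequency matrix W can be built from a bucket by counting letters column by column. The sketch below is one plausible construction, not necessarily the one behind the slide: `frequency_matrix` is a hypothetical name, and the optional `pseudocount` parameter is an added assumption that keeps every entry positive when set above zero.

```python
ALPHABET = "ACGT"

def frequency_matrix(lmers, pseudocount=0.0):
    """Return a list of per-position dicts mapping each base to its relative frequency."""
    matrix = []
    for j in range(len(lmers[0])):
        counts = {b: pseudocount for b in ALPHABET}
        for lmer in lmers:
            counts[lmer[j]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in ALPHABET})
    return matrix

bucket = ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"]
W = frequency_matrix(bucket)
print(W[0])  # position 1: all four l-mers carry A
print(W[2])  # position 3: G appears in two of the four l-mers
```

The positions used by the projection (1, 2, 5, 7 here) are unanimous by construction; the remaining l-k columns carry the information that refinement starts from.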
Motif Finding as Global Optimization
• Scoring function (Hamming distance, likelihood ratio, etc.)
• Many existing algorithms (MEME, GibbsDNA) are good local optimization routines.
• Random projection is a procedure for finding good starting points.
EM Motif Refinement
• For each bucket h containing at least s sequences, form a weight matrix W_h.
• Use the EM algorithm with starting point W_h to obtain a refined weight matrix model W_h*.
• For each input sequence x(i), return the l-tuple y(i) that maximizes the likelihood ratio Pr(y(i) | W_h*) / Pr(y(i) | P0).
• T = {y(1), y(2), …, y(N)}
• C(T) = consensus string of T
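The last three steps (pick the best l-tuple per sequence, collect T, take the consensus) can be sketched as follows. The refined matrix here is a made-up stand-in for W_h*, built around ATGCGTC with 0.85 on the motif letter, and the background P0 is assumed uniform; the function names are hypothetical, and the code assumes every matrix entry is positive so the logarithm is defined.

```python
import math
from collections import Counter

def log_likelihood_ratio(y, W, P0):
    """log of Pr(y | W) / Pr(y | P0); assumes every entry of W is positive."""
    return sum(math.log(W[j][c] / P0[c]) for j, c in enumerate(y))

def best_ltuples(sequences, W, P0):
    """For each sequence, return the l-tuple with the highest likelihood ratio."""
    l = len(W)
    best = []
    for seq in sequences:
        windows = [seq[i:i + l] for i in range(len(seq) - l + 1)]
        best.append(max(windows, key=lambda y: log_likelihood_ratio(y, W, P0)))
    return best

def consensus(ltuples):
    """Most frequent letter in each column of the aligned l-tuples."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*ltuples))

# Hypothetical refined model and uniform background (assumptions for the demo).
P0 = {b: 0.25 for b in "ACGT"}
W = [{b: (0.85 if b == m else 0.05) for b in "ACGT"} for m in "ATGCGTC"]
# The four example fragments from the slides, upper-cased.
sequences = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
T = best_ltuples(sequences, W, P0)
print(T, consensus(T))
```

Working in log space avoids underflow for longer motifs and does not change which l-tuple wins.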
Expectation Maximization (EM)
• S = {x(1), …, x(N)}: set of input sequences
• Given:
  • W = an initial probabilistic motif model
  • P0 = background probability distribution
• Find the value W_max that maximizes the likelihood ratio Pr(S | W, P0) / Pr(S | P0).
• EM is a local optimization scheme; it requires a starting value W.
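To make the role of the starting value concrete, here is a minimal sketch of one EM update. It assumes a model the slides do not spell out: exactly one motif occurrence per sequence, a uniform prior over start positions, and a fixed background P0. The name `em_step` and the `pseudocount` parameter are illustrative, and the initial W in the demo is invented.

```python
import math

ALPHABET = "ACGT"

def em_step(sequences, W, P0, pseudocount=0.1):
    """One EM update of the motif matrix W (assumes one occurrence per sequence)."""
    l = len(W)
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(l)]
    for seq in sequences:
        starts = range(len(seq) - l + 1)
        # E-step: posterior over start positions, proportional to the likelihood ratio.
        ratios = [math.prod(W[j][seq[i + j]] / P0[seq[i + j]] for j in range(l))
                  for i in starts]
        total = sum(ratios)
        # M-step accumulation: expected letter counts at each motif position.
        for i, r in zip(starts, ratios):
            z = r / total
            for j in range(l):
                counts[j][seq[i + j]] += z
    # Normalize each column back into a probability distribution.
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]

P0 = {b: 0.25 for b in ALPHABET}
W = [{b: (0.7 if b == m else 0.1) for b in ALPHABET} for m in "ATGCGTC"]
sequences = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
for _ in range(5):
    W = em_step(sequences, W, P0)
```

Because each update only climbs from where it starts, a poor initial W can settle on a spurious pattern, which is why a good starting point matters.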
A Single Iteration
• Choose a random k-projection.
• Hash each l-mer x in the input sequences into the bucket labelled by h(x).
• From each bucket B with at least s sequences, form a weight matrix model and perform EM/Gibbs sampler refinement.
• The candidate motif is the best one found from refinement of all enriched buckets.
What is the best motif?
• Compute score S for each motif:
• Generate W, an initial PSSM from the returned l-mers {y(1), y(2), …, y(N)}
• Return motif with maximal score
Parameter Selection
• Projection size k:
  • Choose k small so that several motif instances hash to the same bucket (k < l - d).
  • Choose k large to avoid contamination by spurious l-mers: E > N(n - l + 1) / 4^k, where N(n - l + 1) / 4^k is the average number of l-mers per bucket.
• Bucket threshold s: s = 3 or s = 4.
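The two constraints on k bound it from both sides, which a short calculation makes explicit. In this sketch `projection_size_bounds` is a hypothetical helper, and the default E = 1 (fewer than one l-mer per bucket on average) is an assumption chosen for the demo, not a value stated on the slide.

```python
def projection_size_bounds(l, d, N, n, E=1.0):
    """Return (k_min, k_max): smallest k with N(n-l+1)/4^k < E, largest k with k < l - d."""
    total_lmers = N * (n - l + 1)
    k_min = 1
    while total_lmers >= E * 4 ** k_min:   # average bucket load still at or above E
        k_min += 1
    k_max = l - d - 1                      # strict inequality k < l - d
    return k_min, k_max

# (15, 4) motif in N = 20 sequences of length n = 600.
print(projection_size_bounds(l=15, d=4, N=20, n=600))
```

If k_min exceeds k_max, no projection size satisfies both conditions for the chosen E, and one of the constraints has to be relaxed.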
How Many Iterations?
• Planted bucket: the bucket with hash value h(M), where M is the motif.
• Choose m = number of iterations such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95.
• The probability is readily computable since the iterations form a sequence of independent Bernoulli trials.
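One way to carry out this computation is sketched below. It rests on simplifying assumptions that go beyond the slide: every instance has exactly d mutated positions, an instance lands in the planted bucket only when the projection avoids all of them, and the N instances are treated as independent. The function names are illustrative.

```python
import math

def planted_hit_probability(l, d, k):
    """Probability that k random positions avoid all d mutated positions of an instance."""
    return math.comb(l - d, k) / math.comb(l, k)

def iterations_needed(l, d, k, s, N, q=0.95):
    """Smallest m with Pr(planted bucket holds >= s instances in some iteration) >= q."""
    p = planted_hit_probability(l, d, k)
    # Binomial probability that fewer than s of the N instances hit the planted bucket.
    fewer_than_s = sum(math.comb(N, i) * p**i * (1 - p)**(N - i) for i in range(s))
    success = 1.0 - fewer_than_s
    if success >= 1.0:
        return 1
    if success <= 0.0:
        raise ValueError("planted bucket can never reach the threshold s")
    # Independent trials: need 1 - (1 - success)^m >= q.
    return math.ceil(math.log(1 - q) / math.log(1 - success))

m = iterations_needed(l=15, d=4, k=7, s=4, N=20)
print(m)
```

Lowering s or k reduces the number of iterations, at the cost of more spurious enriched buckets to refine.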
Examples
• K = set of nucleotides in motif instances.
• P = set of nucleotides in positions predicted by the algorithm.