Random Projection Approach to Motif Finding

Adapted from http://genome.ucsd.edu/classes/be202/ppt/FindingSignals-RandomProjections.ppt

daf-19 Binding Sites in C. elegans (Peter Swoboda)
[Figure: binding sites upstream of five genes, position axis from -150 to -1]
che-2    GTTGTCATGGTGAC
daf-19   GTTTCCATGGAAAC
osm-1    GCTACCATGGCAAC
osm-6    GTTACCATAGTAAC
F02D8.3  GTTTCCATGGTAAC

Algorithmic Techniques
• MEME (Expectation Maximization)
• GibbsDNA (Gibbs Sampling)
• CONSENSUS (greedy multiple alignment)
• WINNOWER (Clique finding in graphs)
• SP-STAR (Sum of pairs scoring)
• MITRA (Mismatch trees to prune exhaustive search space)

The (l,d) Planted Motif Problem (Sagot 1998, Pevzner & Sze 2000)
• Generate a random length-l consensus sequence C.
• Generate 20 instances, each differing from C by d random mutations.
• Plant one instance at a random position in each of N = 20 random sequences of length n = 600.
• Can you find the planted instances?
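The generation procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original slides: the function names `mutate` and `plant_motif`, the uniform background over A, C, G, T, and the choice of exactly d substitutions per instance are all assumptions made for the example.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d positions substituted."""
    letters = list(consensus)
    for p in rng.sample(range(len(consensus)), d):
        # Pick a different letter so the position really changes.
        letters[p] = rng.choice([c for c in ALPHABET if c != letters[p]])
    return "".join(letters)

def plant_motif(l, d, N=20, n=600, seed=0):
    """Build N random sequences of length n, each with one planted instance.

    Returns the consensus, the sequences, and the 0-based start of each
    planted instance (kept only so that results can be checked).
    """
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, starts = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        starts.append(start)
    return consensus, sequences, starts
```

Calling `plant_motif(15, 4)` would produce one synthetic (15,4) problem instance with the default N = 20 and n = 600.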

Planted Motifs
AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACTGGATGACT
GGATATACATGAACACGGTGGGAAAACCCTGAC
• Each instance differs from ACAGGATCA by 2 mutations
• Remaining sequence random

Random Projection Algorithm
• Buhler and Tompa (2001)
• Guiding principle: Some instances of a motif agree on a subset of positions.
• Use information from multiple motif instances to construct a model.
Example: a (7,2) motif M = ATGCGTC with instances in four input sequences:
x(1) ...ccATCCGACca...
x(2) ...ttATGAGGCtc...
x(5) ...ctATAAGTCgc...
x(8) ...tcATGTGACac...

k-Projections
• Choose k positions in a string of length l.
• Concatenate the nucleotides at the chosen k positions to form a k-tuple.
• In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
Example (l = 15, k = 7): the projection P = (2, 4, 5, 7, 11, 12, 13) maps ATGGCATTCAGATTC to TGCTGAT.
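A k-projection is just an index selection, as the short sketch below shows. The helper names `random_projection` and `project` are hypothetical, and positions are kept 1-based to match the slide.

```python
import random

def random_projection(l, k, rng=random):
    """Pick k distinct 1-based positions out of l, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

# The example from the slide: l = 15, k = 7.
print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
```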

Random Projection Algorithm
• Choose a projection by selecting k positions uniformly at random.
• For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
• Recover the motif from a bucket containing multiple l-tuples.
Example: in input sequence x(i) = …TCAATGCACCTAT…, the l-tuple TGCACCT goes to bucket TGCT.

Example
• l = 7 (motif size), k = 4 (projection size)
• Choose projection (1,2,5,7)
• Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
• Buckets: ATCCGAC goes to bucket ATGC; GCCTTAC goes to bucket GCTC

Hashing and Buckets
• Hash function h(x) obtained from the k positions of the projection.
• Buckets are labeled by values of h(x), e.g. GCTC, CATC, ATTC, ATGC.
• Enriched buckets: contain at least s l-tuples, for some parameter s.
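The hashing step can be sketched with a dictionary keyed by h(x). This is an illustrative sketch only: `hash_lmers` and `enriched` are hypothetical names, positions are 1-based as in the slides, and each bucket stores (sequence index, 0-based offset) pairs rather than the l-tuples themselves.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Hash every l-tuple of every sequence into the bucket labeled h(x)."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            lmer = seq[j:j + l]
            key = "".join(lmer[p - 1] for p in positions)  # h(x)
            buckets[key].append((i, j))
    return buckets

def enriched(buckets, s):
    """Keep only the buckets that contain at least s l-tuples."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= s}

# The four (7,2) motif instances used in the slides, with their flanks.
seqs = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
buckets = hash_lmers(seqs, l=7, positions=(1, 2, 5, 7))
print(buckets["ATGC"])
```

With projection (1,2,5,7), all four instances agree on the projected positions, so they meet in bucket ATGC even though no two of them are identical.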

Motif Refinement
• How do we recover the motif from the sequences in the enriched buckets?
• k nucleotides are known from the hash value of the bucket.
• Use the information in the other l - k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler.
Example: bucket ATGC holds ATCCGAC, ATGAGGC, ATAAGTC and ATGTGAC; the local refinement algorithm yields the candidate motif ATGCGTC.

Frequency Matrix Model from Bucket
Bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) → frequency matrix W → EM algorithm → refined matrix W*
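Building the initial frequency matrix W from a bucket amounts to counting letters column by column. The sketch below is one plausible way to do it; the function names and the optional `pseudocount` parameter are assumptions, not details given in the slides.

```python
ALPHABET = "ACGT"

def frequency_matrix(lmers, pseudocount=0.0):
    """Column-wise letter frequencies of equally long l-tuples.

    Returns one dict per motif position, mapping each letter to its
    frequency. A positive pseudocount keeps all entries nonzero.
    """
    l = len(lmers[0])
    W = []
    for pos in range(l):
        counts = {c: pseudocount for c in ALPHABET}
        for lmer in lmers:
            counts[lmer[pos]] += 1
        total = sum(counts.values())
        W.append({c: counts[c] / total for c in ALPHABET})
    return W

def matrix_consensus(W):
    """Most frequent letter in each column."""
    return "".join(max(col, key=col.get) for col in W)

bucket = ["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"]
W = frequency_matrix(bucket)
print(matrix_consensus(W))  # ATGAGAC
```

The raw column-wise consensus of this bucket is ATGAGAC, which still differs from the motif ATGCGTC in two positions; this is why the matrix serves as a starting point for refinement rather than as the final answer.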

Motif Finding as Global Optimization
• Scoring function (Hamming distance, likelihood ratio, etc.)
• Many existing algorithms (MEME, GibbsDNA) are good local optimization routines.
• Random projection is a procedure for finding good starting points.

EM Motif Refinement
• For each bucket h containing at least s sequences, form a weight matrix W_h.
• Use the EM algorithm with starting point W_h to obtain a refined weight matrix model W_h*.
• For each input sequence x(i), return the l-tuple y(i) that maximizes the likelihood ratio Pr(y(i) | W_h*) / Pr(y(i) | P0).
• T = {y(1), y(2), …, y(N)}
• C(T) = consensus string of T
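The last three bullets (pick the best l-tuple per sequence, collect T, take its consensus) can be sketched as follows. All names here are hypothetical, the matrix is smoothed with a pseudocount of 1 so that the logarithm is always defined, and ties in the consensus are broken by first occurrence; none of these choices come from the slides.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def smoothed_matrix(lmers, pseudocount=1.0):
    """Weight matrix with pseudocounts, one dict of letter weights per column."""
    W = []
    for pos in range(len(lmers[0])):
        counts = {c: pseudocount for c in ALPHABET}
        for lmer in lmers:
            counts[lmer[pos]] += 1
        total = sum(counts.values())
        W.append({c: counts[c] / total for c in ALPHABET})
    return W

def log_likelihood_ratio(y, W, P0):
    """log( Pr(y | W) / Pr(y | P0) ) for an l-tuple y."""
    return sum(math.log(W[i][c] / P0[c]) for i, c in enumerate(y))

def best_lmer(seq, W, P0):
    """The l-tuple of seq with the highest likelihood ratio."""
    l = len(W)
    candidates = (seq[j:j + l] for j in range(len(seq) - l + 1))
    return max(candidates, key=lambda y: log_likelihood_ratio(y, W, P0))

def consensus_string(lmers):
    """C(T): most common letter in each column of T."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*lmers))

P0 = {c: 0.25 for c in ALPHABET}  # uniform background, an assumption
W = smoothed_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
T = [best_lmer(x, W, P0) for x in ["CCATCCGACCA", "TTATGAGGCTC"]]
print(T, consensus_string(T))
```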

Expectation Maximization (EM)
• S = {x(1), …, x(N)}: set of input sequences
• Given:
  • W = an initial probabilistic motif model
  • P0 = background probability distribution
• Find the value W_max that maximizes the likelihood ratio of S under the motif model versus the background P0.
• EM is a local optimization scheme. It requires a starting value W.
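To make the local optimization step concrete, here is a small EM sketch. It assumes a "one occurrence per sequence" model with a uniform prior over start positions, a fixed background P0, and a small pseudocount in the M-step; these modelling choices and the name `em_refine` are assumptions for illustration and may differ from what MEME or the original authors use.

```python
import math

ALPHABET = "ACGT"

def ratio(lmer, W, P0):
    """Likelihood ratio Pr(lmer | W) / Pr(lmer | P0)."""
    return math.prod(W[i][c] / P0[c] for i, c in enumerate(lmer))

def em_refine(sequences, W, P0, iterations=10, pseudocount=0.1):
    """Refine weight matrix W, assuming one motif occurrence per sequence.

    W must have strictly positive entries (e.g. built with pseudocounts),
    and sequences must be uppercase strings over A, C, G, T.
    """
    l = len(W)
    for _ in range(iterations):
        counts = [{c: pseudocount for c in ALPHABET} for _ in range(l)]
        for seq in sequences:
            lmers = [seq[j:j + l] for j in range(len(seq) - l + 1)]
            # E-step: posterior probability that the motif starts at each offset.
            ratios = [ratio(y, W, P0) for y in lmers]
            total = sum(ratios)
            for y, r in zip(lmers, ratios):
                z = r / total
                # Accumulate expected letter counts for the M-step.
                for i, c in enumerate(y):
                    counts[i][c] += z
        # M-step: re-estimate each column from the expected counts.
        W = [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]
    return W
```

Because EM only climbs to a nearby local optimum, the result depends strongly on the starting matrix, which is exactly what the enriched buckets are meant to supply.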

A Single Iteration
• Choose a random k-projection.
• Hash each l-mer x in the input sequences into the bucket labeled by h(x).
• From each bucket B with at least s sequences, form a weight matrix model and perform EM/Gibbs sampler refinement.
• The candidate motif is the best one found from refinement of all enriched buckets.

What is the best motif?
• Compute score S for each motif:
  • Generate W, an initial PSSM from the returned l-mers {y(1), y(2), …, y(N)}
• Return motif with maximal score

Parameter Selection
• Projection size k
  • Choose k small so that several motif instances hash to the same bucket (k < l - d).
  • Choose k large to avoid contamination by spurious l-mers: 4^k > N (n - l + 1), so that a bucket holds fewer than one random l-mer on average.
• Bucket threshold s (s = 3, s = 4)
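The two constraints on k can be checked numerically. The sketch below takes the contamination condition to mean that the expected number of random l-mers per bucket, N(n - l + 1) / 4^k, should stay below 1; the threshold `max_load=1.0` and the function names are assumptions for this example.

```python
def expected_bucket_load(N, n, l, k):
    """Average number of l-mers per bucket if l-mers hashed uniformly at random."""
    return N * (n - l + 1) / 4 ** k

def feasible_k(N, n, l, d, max_load=1.0):
    """Projection sizes with k < l - d and expected bucket load below max_load."""
    return [k for k in range(1, l - d)
            if expected_bucket_load(N, n, l, k) < max_load]

# The (15,4) problem with N = 20 sequences of length n = 600.
print(feasible_k(20, 600, 15, 4))  # [7, 8, 9, 10]
```

For the (15,4) problem this leaves k between 7 and 10; the smallest feasible value, k = 7, maximizes the chance that motif instances land in the same bucket.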

How Many Iterations?
• Planted bucket: the bucket with hash value h(M), where M is the motif.
• Choose m = number of iterations such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95.
• The probability is readily computable since the iterations form a sequence of independent Bernoulli trials.
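One way to carry out this computation is sketched below. It assumes that every instance has exactly d mutated positions, so an instance lands in the planted bucket exactly when none of the k projected positions is mutated, with probability C(l-d, k) / C(l, k), and that the N instances behave independently. The function names are hypothetical.

```python
from math import ceil, comb, log

def planted_hit_probability(l, d, k):
    """Pr(a single motif instance hashes to the planted bucket)."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than_s(N, p, s):
    """Pr(fewer than s of N independent instances hit the planted bucket)."""
    return sum(comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(s))

def iterations_needed(l, d, k, s, N=20, q=0.95):
    """Smallest m with Pr(success in at least one of m iterations) >= q."""
    fail = prob_fewer_than_s(N, planted_hit_probability(l, d, k), s)
    if fail <= 0.0:
        return 1            # a single iteration always succeeds
    if fail >= 1.0:
        raise ValueError("planted bucket can never reach s instances")
    # Independent trials: 1 - fail**m >= q  <=>  m >= log(1 - q) / log(fail)
    return max(1, ceil(log(1 - q) / log(fail)))

print(iterations_needed(l=15, d=4, k=7, s=4))
```

The count grows quickly as d increases or as k approaches l - d, because the per-instance hit probability shrinks.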

Examples
• K = set of nucleotides in the motif instances.
• P = set of nucleotides in the positions predicted by the algorithm.
