Regulatory Motif Finding

Regulatory Motif Finding • Statistical Models for Biological Sequence Motif Discovery, Liu J, Gupta, Liu X, Mayerhofere, Lawrence • Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting, Blanchette & Tompa (2002)

“Regulatory Motif Finding” • What is being regulated? • What is a “Motif?” • Why do we want to find them?

Central Dogma of Genetics • It’s “TRUE,” right?! (pict by Andrew Hughes, Rice University) • Yes, but…

Every Protein in Every Cell? • Clearly, there are complicated mechanisms at work • Rhodopsin • But, we have the same DNA in all cells…

Transcriptional Regulation • It is transcription (DNA → RNA) that is being regulated. • RNA Polymerase II, aided by Transcription Factors (TFs) • Where do TFs bind?

Promoter Regions • TATA box – usually ~ 30 bp upstream of gene (pict by Andrew Hughes, Rice University) • But, there are others...Where? What Sequence?

Promoter Sequence • Many different possible locations, sometimes extremely far from the start of transcription! • What Sequence? THAT is the $64k (or $1B) Question…

Motifs • Many different promoter sequences found • Basal: TATA-box (-20), CCAAT-box (-100) • Additional transcriptional regulatory domains • Activators and inhibitors use these domains

Motifs (2) • Not exact sequences – that would be too easy • Strength of binding affects the level of promotion/inhibition (C/G vs A/T) • Often palindromic (GATATC) • Described either probabilistically with motif logos or with extended single-letter nucleotide codes
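"Palindromic" here means that the site reads the same on both strands, i.e. it equals its own reverse complement. A minimal sketch of that check, with function names chosen only for illustration:

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(seq):
    """True if the site is identical to its reverse complement."""
    return seq.upper() == reverse_complement(seq)

print(is_palindromic("GATATC"))
```

GATATC passes because its complement CTATAG, read backwards, is GATATC again.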

Extended Single-Letter Codes • Letters represent possible bases in each position: • TGASTMA – Promoter Sequence for several oncogenes
Symbol   Meaning
A        Adenine
G        Guanine
C        Cytosine
T        Thymine
U        Uracil
Y        pYrimidine (C or T)
R        puRine (A or G)
W        "Weak" (A or T)
S        "Strong" (C or G)
K        "Keto" (T or G)
M        "aMino" (C or A)
B        not A (C or G or T)
D        not C (A or G or T)
H        not G (A or C or T)
V        not T (A or C or G)
X,N,?    unknown (A or C or G or T)
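One way to use the single-letter codes is to expand each letter into a regular-expression character class and scan a sequence for matches. The sketch below assumes DNA input (so U is left out) and uses illustrative names; the test sequence is made up.

```python
import re

# Codes from the table; U is omitted because the search is over DNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",
}

def motif_to_regex(motif):
    """Expand each code into a character class, e.g. S -> [CG]."""
    return "".join("[" + IUPAC[code] + "]" for code in motif.upper())

def find_sites(motif, sequence):
    """Return start positions of all (possibly overlapping) matches."""
    # The lookahead makes each match zero-width, so overlaps are kept.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

print(motif_to_regex("TGASTMA"))
print(find_sites("TGASTMA", "CCTGACTCAGG"))
```

An exact-match scan like this treats every allowed base as equally good; the probabilistic description on the logo slides is what captures graded binding strength.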

Motif Logos • Height of each letter represents its probability of being found at that position in the motif
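In the usual sequence-logo convention, the total height of a column is its information content in bits, and each letter takes a share of that height proportional to its frequency. A minimal sketch, assuming a set of already aligned sites (the four sites below are invented) and ignoring small-sample corrections:

```python
import math
from collections import Counter

def frequency_matrix(sites, alphabet="ACGT"):
    """One {base: frequency} dict per motif position."""
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: counts[b] / len(sites) for b in alphabet})
    return matrix

def logo_heights(matrix):
    """Letter height = frequency * information content of the column."""
    heights = []
    for column in matrix:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        info = math.log2(len(column)) - entropy  # 2 bits max for DNA
        heights.append({b: p * info for b, p in column.items()})
    return heights

sites = ["ACGT", "ACGA", "ACTC", "AAGG"]  # hypothetical aligned sites
heights = logo_heights(frequency_matrix(sites))
for position, column in enumerate(heights, start=1):
    print(position, {b: round(h, 2) for b, h in column.items()})
```

A fully conserved position gets a single letter 2 bits tall, while a position where all four bases are equally likely has height 0.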

Why do we care? • Gene regulation → transcriptional regulation • Can teach us about our complex signaling pathways • Drugs and Money

So…Finding Regulatory Motifs • Statistical Models paper (Liu et al) • Assumes: We have located genes that we expect to be co-regulated (microarrays, co-expression)

So…Finding Regulatory Motifs • Experimental methods of determining TF binding sites (Gel Shift assay, DNA Protection Assay) • Statistical models

Single-Site Model • Assumes: - Each sequence contains 1 motif - Sequences are generated by random draws from {A,C,G,T} with given prior probabilities - Motif has a frequency matrix for each position • Use Gibbs site sampler: Missing Data Problem. Randomly choose motif locations. Then move the motif locations based on P(ak)

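Gibbs Sampling • Sampling: for every k-long word $x_j, \ldots, x_{j+k-1}$ in x: • $Q_j = \Pr[\text{word} \mid \text{motif}] = M(1, x_j) \cdots M(k, x_{j+k-1})$ • $P_j = \Pr[\text{word} \mid \text{background}] = B(x_j) \cdots B(x_{j+k-1})$ • Let $A_j = \dfrac{Q_j / P_j}{\sum_{j'=1}^{|x|-k+1} Q_{j'} / P_{j'}}$ • Sample a random new position $a_i$ according to the probabilities $A_1, \ldots, A_{|x|-k+1}$. [Figure: probability of each start position along x, from 0 to |x|]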
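The sampling step can be sketched in a few lines. This assumes that the probabilities A_j are the likelihood ratios Q_j / P_j normalized to sum to 1 (the usual choice for a Gibbs site sampler); the motif matrix M, background B, and sequence x below are toy values, not taken from the paper.

```python
import random

def site_probabilities(x, M, B):
    """A_j for every start position j of a k-long word in x."""
    k = len(M)
    ratios = []
    for j in range(len(x) - k + 1):
        q = 1.0  # Prob[word | motif]
        p = 1.0  # Prob[word | background]
        for i in range(k):
            q *= M[i][x[j + i]]
            p *= B[x[j + i]]
        ratios.append(q / p)
    total = sum(ratios)
    return [r / total for r in ratios]

def sample_position(x, M, B, rng=random):
    """Draw a new motif start position according to A_1 .. A_{|x|-k+1}."""
    A = site_probabilities(x, M, B)
    return rng.choices(range(len(A)), weights=A, k=1)[0]

def column(consensus):
    # Toy motif column: 0.85 for the consensus base, 0.05 for the others.
    return {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}

M = [column(c) for c in "TAT"]        # hypothetical 3-bp motif
B = {b: 0.25 for b in "ACGT"}         # uniform background
x = "GGCTATGC"
A = site_probabilities(x, M, B)
print([round(a, 3) for a in A])
print(sample_position(x, M, B))
```

In the full sampler this step alternates with re-estimating M from the current motif positions of all the other sequences. Sampling rather than always taking the best position is what lets the sampler escape poor starting locations.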

Repetitive Block-Motif Model • View K sequences as one long sequence of length n. Model probability of a motif starting at each position ‘i’. • Problems: - Lose evolutionary relationship between sequences - Allows multiple copies of motif in each sequence - Total number of occurrences unknown

The Rest of the Statistical Models Paper… • Much math: • Scoring motif candidates • Using potential motif dictionaries • Bayesian Prior Probabilities • Finding motifs with insertions in them (“gapped” motifs) • On to: Phylogenetic Footprinting

Phylogenetic Footprinting • Most of paper spent describing background, results • Methods are brief, not too deep

Let Evolution Be Your Guide • Phylogenetic Footprinting – “Identifying regulatory elements by finding unusually well conserved regions in a set of orthologous noncoding DNA sequences from multiple species”

Orthologs and Paralogs • Gene duplicate within species: Paralog • Same gene in species with common ancestor: Ortholog

Advantages • Doesn’t rely on reliably determining co-regulated genes (single-genome approach, non-trivial!) • Can be used to find regulatory elements specific to one single gene (caveat: conserved across species)

Standard Methods • Usually start with an MSA (ProbCons, ClustalW) • But this can lose signal (short regulatory elements ~20 bp, long promoter regions ~1000 bp) • Also, if species are evolutionarily close, nonfunctional regions may also be well conserved • Can start with general motif discovery algorithms (MEME, Consensus, AlignACE, DIALIGN …) • But these don’t take into account the phylogenetic relationships of the sequences, so they weight closely related sequences too highly

The PF Algorithm Given: • phylogenetic tree T, • set of orthologous sequences at leaves of T, • length k of motif • threshold d Problem: • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

Small Example (merci, CS262) • Size of motif sought: k = 4
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)

Solution • Parsimony score: 1 mutation • k-mers labeling the internal nodes of the tree: ACGT, ACGT, ACGT, ACGG
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...

… ACGG: +ACGT: 0 ... … ACGG:ACGT :0 ... … ACGG:ACGT :0 ... … ACGG:ACGT :0 ... … ACGG: 1 ACGT: 0 ... 4k entries … ACGG: 2ACGT: 1 ... … ACGG: 1ACGT: 1 \... … ACGG: 0ACGT: 2 ... … ACGG: 0 ACGT: + ... An Exhaustive Algorithm Wu[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s. AGTCGTACGTG ACGGGACGTGC ACGTGAGATAC GAACGGAGTAC TCGTGACGGTG

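Simple Recurrence • $W_u[s] = \sum_{v \in \text{children}(u)} \min_t \left( W_v[t] + h(s, t) \right)$, where h(s, t) is the number of positions at which s and t differ • In words: the score of a k-mer s at a node is the sum, over the node's children, of each child's best parsimony score once the mutations between s and the child's k-mer t are counted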
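The recurrence translates fairly directly into a bottom-up table computation. The sketch below is only an illustration, not FootPrinter itself: it assumes leaves score 0 for k-mers they contain and +∞ otherwise, represents the tree as nested tuples, and uses a tree topology for the five example sequences that is an assumption (the slides do not give one).

```python
from itertools import product

def hamming(s, t):
    """h(s, t): number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_table(node, k, alphabet="ACGT"):
    """Return {s: W_u[s]} for every k-mer s at this node.

    A node is either a leaf (a DNA string) or a tuple of child nodes.
    """
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    if isinstance(node, str):
        # Leaf: 0 if the k-mer occurs in the sequence, +inf otherwise.
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: 0 if s in present else float("inf") for s in all_kmers}
    table = dict.fromkeys(all_kmers, 0)
    for child in node:
        child_table = parsimony_table(child, k, alphabet)
        # Infinite entries can never be the minimum, so skip them.
        finite = [(t, w) for t, w in child_table.items() if w != float("inf")]
        for s in all_kmers:
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table

# Assumed topology: ((Human, Chimp), (Rabbit, (Mouse, Rat)))
tree = (("AGTCGTACGTGAC", "AGTAGACGTGCCG"),
        ("ACGTGAGATACGT", ("GAACGGAGTACGT", "TCGTGACGGTGAT")))
root = parsimony_table(tree, k=4)
best = min(root.values())
print(best, root["ACGT"])
```

On the small example the minimum root score is 1, reached with ACGT at the root, which agrees with the one-mutation solution. Recovering the actual k-mers at the leaves would need a traceback through the tables, which is omitted here.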

Running Time • $W_u[s] = \sum_{v \in \text{children}(u)} \min_t \left( W_v[t] + h(s, t) \right)$ • $O(k \cdot 4^{2k})$ time per node • Total time $O(n k (4^{2k} + l))$, where n = number of species, k = motif length, l = average sequence length
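To get a feel for these bounds, it helps to plug in numbers. The values n = 5 and l = 13 below are taken from the small example purely for illustration, and the results are operation counts up to constant factors.

```latex
\begin{align*}
4^{2k}\big|_{k=4} &= 4^{8} = 65{,}536 \\
k \cdot 4^{2k}\big|_{k=4} &= 4 \cdot 65{,}536 = 262{,}144 \\
n k \left(4^{2k} + l\right)\big|_{n=5,\,k=4,\,l=13} &= 5 \cdot 4 \cdot 65{,}549 = 1{,}310{,}980 \\
4^{2k}\big|_{k=10} &= 4^{20} \approx 1.1 \times 10^{12}
\end{align*}
```

The $4^{2k}$ term dominates almost immediately, so the exhaustive recurrence as written is only practical for short motifs.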

FootPrinter • http://bio.cs.washington.edu/software.html • Avoids pitfalls of using MSA or general-purpose motif-finding algorithms • Identifies all DNA motifs that appear to have evolved more slowly than the surrounding sequence • Allows motifs to not appear in all sequences (LexA in gram +/- bacteria)

FootPrinter (2) • “Given n orthologous input sequences and the phylogenetic tree T relating them, [FootPrinter] is guaranteed to produce every set of k-mers, one from each input sequence, that have a parsimony score at most d with respect to T, where k and d are parameters specified by the user.”

Parameters • Can set minimum threshold on fraction of the phylogeny that must be spanned for motifs with each parsimony score ‘s’.

Results • Examined 9 sets of orthologous or paralogous sequences (the method also works for duplicated genes that have since diverged) • Found: many previously known motifs, plus some highly conserved motifs of unknown function (time for the experimentalists!)

One example: Metallothionein Gene Family • Good test family: • Large number of promoter sequences • Wide variety of species • Large number of regulatory elements experimentally verified in several species. • Most binding sites are within 300 bp of start codon (ATG)

Inputs • Sequences: 590 bp upstream of the start codon • Most motifs found were present in multiple isoform families – accuracy was gained by considering the paralogs, not just the orthologs

But, FootPrinter isn’t Perfect • Some known regulatory binding sites were missed. Why? • Ultimately, must be because the motifs were not well-enough conserved to be detected (but we can discuss more…)

FootPrinter Error (1) • Some binding sites not well matched in other species. Example: Thyroid hormone receptor T3R is conserved within rodents, but not beyond. Would need many closely related species to detect this motif.

FootPrinter Error (2-5) • Some motifs are well conserved, but too short • InDels in the middle of a motif – could allow them, but would get many false positives • Some barely fail to meet statistical thresholds (close but no cigar) • Dimeric TFs bind two conserved regions separated by a variable internal sequence
