Transcription Factor Motif / Module Finding Xiaole Shirley Liu STAT115, STAT215
Motif Matrix
Pos      12345678
Sites    ATGGCATG
         AGGGTGCG
         ATCGCATG
         TTGCCACG
         ATGGTATT
         ATTGCACG
         AGGGCGTT
         ATGACATG
         ATGGCATG
         ACTGGATG
Segment ATGCAGCT
score = p(generate ATGCAGCT from motif matrix) / p(generate ATGCAGCT from background)
p(generate ATGCAGCT from background) = p0A p0T p0G p0C p0A p0G p0C p0T
Scoring Motifs
• Information Content (aka relative entropy)
• Suppose you have x aligned segments for the motif
• pb(s1 from mtf) / pb(s1 from bg) * pb(s2 from mtf) / pb(s2 from bg) * … * pb(sx from mtf) / pb(sx from bg)
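The segment score above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names, the uniform background (0.25 per base), and the pseudocount of 0.5 are all assumptions. The pseudocount is needed because the segment ATGCAGCT has an A at position 5, which none of the ten sites contain, so the raw ratio would be zero. The score is reported as a log2 ratio.

```python
from math import log2

BASES = "ACGT"

# The ten aligned sites from the Motif Matrix slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def frequency_matrix(sites, pseudocount=0.5):
    """Per-position base frequencies; the pseudocount is an assumed smoothing value."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix


def log_odds_score(segment, matrix, background):
    """log2 of p(segment | motif matrix) / p(segment | background)."""
    return sum(
        log2(matrix[pos][base] / background[base])
        for pos, base in enumerate(segment)
    )


# Assumption: a uniform, position-independent background.
background = {base: 0.25 for base in BASES}
pwm = frequency_matrix(SITES)

print("ATGCAGCT:", round(log_odds_score("ATGCAGCT", pwm, background), 3))
print("ATGGCATG:", round(log_odds_score("ATGGCATG", pwm, background), 3))
```

A segment that matches the most common base at every position (ATGGCATG) should score higher than ATGCAGCT, which disagrees with the matrix at several positions.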
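Scoring Motifs
pb(s1 from mtf) / pb(s1 from bg) * pb(s2 from mtf) / pb(s2 from bg) * … * pb(sx from mtf) / pb(sx from bg)
$$= (p_{A1}/p_{A0})^{A_1} \, (p_{T1}/p_{T0})^{T_1} \, (p_{T2}/p_{T0})^{T_2} \, (p_{G2}/p_{G0})^{G_2} \, (p_{C2}/p_{C0})^{C_2} \cdots$$
Here $A_1$ is the number of segments with A at position 1, $p_{A1}$ is the frequency of A at position 1 in the motif, and $p_{A0}$ is the background frequency of A.
Take log of this:
$$= A_1 \log (p_{A1}/p_{A0}) + T_1 \log (p_{T1}/p_{T0}) + T_2 \log (p_{T2}/p_{T0}) + G_2 \log (p_{G2}/p_{G0}) + \cdots$$
Divide by the number of segments x (if all the motifs have the same number of segments), so that $A_1 / x = p_{A1}$:
$$= p_{A1} \log (p_{A1}/p_{A0}) + p_{T1} \log (p_{T1}/p_{T0}) + p_{T2} \log (p_{T2}/p_{T0}) + \cdots$$
Pos      12345678
Sites    ATGGCATG
         AGGGTGCG
         ATCGCATG
         TTGCCACG
         ATGGTATT
         ATTGCACG
         AGGGCGTT
         ATGACATG
         ATGGCATG
         ACTGGATG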
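The final sum in this derivation is the relative entropy of the motif against the background, summed over positions. A minimal sketch of that calculation follows, assuming a uniform background and log base 2 (the slide does not state the base), and treating terms with zero frequency as contributing zero. The function name `relative_entropy` is hypothetical.

```python
from math import log2

BASES = "ACGT"

# The ten aligned sites shown on the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def relative_entropy(sites, background):
    """Sum over positions and bases of p * log2(p / p0), where p is the
    observed base frequency at that position and p0 the background frequency."""
    x = len(sites)  # number of aligned segments
    total = 0.0
    for pos in range(len(sites[0])):
        for base in BASES:
            count = sum(1 for site in sites if site[pos] == base)
            if count == 0:
                continue  # a base that never occurs contributes nothing
            p = count / x
            total += p * log2(p / background[base])
    return total


# Assumption: uniform background frequencies.
background = {base: 0.25 for base in BASES}
print("information content (bits):", round(relative_entropy(SITES, background), 3))
```

With a uniform background, a perfectly conserved position contributes 2 bits, so an 8-base motif can score at most 16 bits under these assumptions.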
Scoring Motifs • Original function: Information Content • Motif Conservedness: how likely to see the current aligned segments from this motif model
Bad:  AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA
Good: ATGCA ATGCC ATGCA ATGCA TTGCA ATGGA ATGCA
Scoring Motifs • Original function: Information Content • Motif Specificity: how likely to see the current aligned segments from background
Good: AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC
Bad:  ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA
Scoring Motifs • Original function: Information Content • Which is better? (data = 8 seqs)
Motif 1 AGGCTAAC AGGCTAAC
Motif 2 AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC
Scoring Motifs • Motif scoring function: motif signal = abundant sites + conserved positions + specific (unlikely in genome background) • Prefer: conserved motifs with many sites that are not often seen in the genome background
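Markov Background Increases Motif Specificity
• Prefers motif segments enriched only in data, but not so likely to occur in the background
Segment ATGTA
score = p(generate ATGTA from motif model) / p(generate ATGTA from background model)
• 3rd order Markov dependency: the background probability of each base is conditioned on the three bases before it
Per-base background probabilities:
Segment   Independent (uniform)   Markov
TCAGC     .25 .25 .25 .25 .25     .3 .18 .16 .22 .24
ATATA     .25 .25 .25 .25 .25     .3 .41 .38 .42 .30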
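To make the effect of the Markov background concrete, the per-base probabilities on this slide can be multiplied out. The sketch below uses only the numbers shown on the slide. Reading the first row of each pair as an independent uniform background and the second as the Markov background is an assumption about the slide layout, and the variable names are illustrative.

```python
from math import prod

# Per-base background probabilities copied from the slide.
# Assumption: the all-0.25 row is an independent uniform background,
# the other row is the Markov background.
UNIFORM = [0.25, 0.25, 0.25, 0.25, 0.25]
MARKOV = {
    "TCAGC": [0.30, 0.18, 0.16, 0.22, 0.24],
    "ATATA": [0.30, 0.41, 0.38, 0.42, 0.30],
}

p_uniform = prod(UNIFORM)
ratios = {}
for segment, per_base in MARKOV.items():
    p_markov = prod(per_base)
    ratios[segment] = p_markov / p_uniform
    print(f"{segment}: uniform={p_uniform:.6f} markov={p_markov:.6f} "
          f"markov/uniform={ratios[segment]:.2f}")
```

Under the Markov background ATATA becomes several times more probable than under the uniform one, so the denominator of its score grows and the segment is penalized, while TCAGC becomes less probable in the background and scores higher.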
Position Weight Matrix Update • Advantages: • Can look for motifs of any width • Flexible with base substitutions • Disadvantages: • EM and Gibbs sampling: no guaranteed convergence time • No guaranteed global optimum
Motif Finding in Bacteria • Promoter sequences are short (200-300 bp) • Motifs are usually long (10-20 bases) • Some have two blocks with a gap, some are palindromes • Long motifs are usually very degenerate • A single microarray experiment sometimes already provides enough information to search for TF motifs
Motif Regression with Expression
• For each TF:
         Upstream         Downstream
         Seq Mtf Match    Gene Diff Exp
Gene1    3.2              1.8
Gene2    2.8              0.3
Gene3…
• Upstream sequence X motif matching score measures: • Number of sites • Strength of matching
Bussemaker et al, Nat Genet 2001
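The regression idea on this slide can be sketched as an ordinary least-squares fit of differential expression on the upstream motif match score. This is a minimal illustration rather than the published method: only the first two data rows come from the slide, the remaining rows are made-up values for demonstration, and `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.
    Returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# Gene1 and Gene2 are from the slide; the other four rows are
# hypothetical values added only so the fit has enough points.
motif_match = [3.2, 2.8, 0.4, 0.1, 1.9, 0.0]
diff_exp = [1.8, 0.3, -0.2, 0.1, 1.1, -0.4]

slope, intercept, r_squared = fit_line(motif_match, diff_exp)
print(f"slope={slope:.3f} intercept={intercept:.3f} r^2={r_squared:.3f}")
```

A clearly positive slope with a reasonable fit would suggest that genes with stronger upstream motif matches tend to be up-regulated, which is the kind of signal used to rank candidate motifs for a TF.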
Linear Regression Example
Person  IQ   Age  Education  Height  Eye color  Spend/week  # iDevices
A       120  30   High       171     blue       $4000       30
B       250  41   PhD        155     brown      $1500       18
C       150  8    Grade10    115     black      $100        90
D       180  16   Grade12    140     gray       $200        15
E       90   4    Preschool  88      green      $500        26
F       130  17   High       178     black      $80         500
G       110  21   College    182     blue       $800        220
…
Gene Express        Mtf1  Mtf2  Mtf3  Mtf4  Mtf5  Mtf6
Single Regression   X     X     X     --    --    --
Motif Finding in Lower Eukaryotes • Upstream sequences are longer (500-1000 bp), with some simple repeats • Motif width varies (5-17 bases) • Expression clusters provide input sequences of decent quality for TF motif finding • Motif combination and redundancy appear, although single motifs are usually significant enough for identification
Yeast Promoter Architecture • Co-occurring regulators suggest physical interaction between the regulators
Introducing Sequence Conservation Kellis, et al, Nat 2003
Motif Finding in Higher Eukaryotes • Regulatory sequences are very long, contain repeats, and can be far from target genes (enhancers) • Motifs can be short or long (6-20 bases), and appear in combinations and clusters • Gene expression clusters alone are not good enough input • Need: • Better known motifs: PBM and SELEX • Comparative genomics: PhastCons score • Motif modules: motif clusters • ChIP-chip/seq
Better Motif Matrices I: PBM • Protein Binding Microarrays • In vitro protein-DNA interactions • Better capture motifs Badis et al, Science 2009
Better Motif Matrices II: SELEX Jolma et al, Cell 2013
UCSC PhastCons Conservation • Functional regulatory sequences are under stronger evolutionary constraint • Align orthologous sequences together • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC
Ultra Conserved Elements • > 200bp ultra conserved in vertebrates • Exonic enriched in RNA processing • Non-exonic enriched in TF binding sites for developmental genes Bejerano et al, Science 2004
Conservation vs Functions • Non-conservation ≠ non-function • Human Accelerated Regions enriched in neurodevelopment Prabhakar et al, Science 2008
PreMod: motif clusters in conserved regions Blanchette et al, Genome Res 2006
ChIP-Seq • Sequence millions of 30-50 mer ends of fragments • Map 30-50 mers back to the genome [Figure labels: ChIP-DNA, Noise]
Summary • Motif scores consider entropy and abundance • Markov background avoids redundant seqs • Motif occurrence regressed against differential expression • Better new motifs from PBM and SELEX • Motif conservation: PhastCons, ultra conserved elements, and human accelerated regions • Motif clusters form TF motif modules • ChIP-seq: genome-wide in vivo TF binding sites