Transcription Factor Motif / Module Finding

Xiaole Shirley Liu, STAT115, STAT215

Presentation Transcript


  1. Transcription Factor Motif / Module Finding Xiaole Shirley Liu STAT115, STAT215

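  2. Motif Matrix
     Sites (Pos 12345678): ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
     Segment ATGCAGCT score:
     $$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$
     where the background probability is $p_{0A} \times p_{0T} \times p_{0G} \times p_{0C} \times p_{0A} \times p_{0G} \times p_{0C} \times p_{0T}$.
     Scoring Motifs
     • Information Content (aka relative entropy)
     • Suppose you have x aligned segments for the motif
     • Multiply the likelihood ratios of all x segments:
     $$\frac{pb(s_1 \text{ from mtf})}{pb(s_1 \text{ from bg})} \times \frac{pb(s_2 \text{ from mtf})}{pb(s_2 \text{ from bg})} \times \cdots \times \frac{pb(s_x \text{ from mtf})}{pb(s_x \text{ from bg})}$$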
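The segment score on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the pseudocount of 0.5 and the uniform background (p0A = p0C = p0G = p0T = 0.25) are assumptions, and the function names `build_pwm` and `log_odds_score` are hypothetical. The score is reported on a log scale so the product of ratios becomes a sum.

```python
import math

# The ten aligned sites from the motif matrix slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]
BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5):
    """Per-position base probabilities; the pseudocount (an assumption) avoids zeros."""
    pwm = []
    for column in zip(*sites):
        counts = {base: column.count(base) + pseudocount for base in BASES}
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm


def log_odds_score(segment, pwm, background):
    """Log of p(segment from motif matrix) / p(segment from background)."""
    return sum(
        math.log(pwm[pos][base] / background[base])
        for pos, base in enumerate(segment)
    )


pwm = build_pwm(SITES)
uniform_background = {base: 0.25 for base in BASES}  # assumed background

segment_score = log_odds_score("ATGCAGCT", pwm, uniform_background)
consensus_score = log_odds_score("ATGGCATG", pwm, uniform_background)
print(f"ATGCAGCT: {segment_score:.2f}")
print(f"ATGGCATG: {consensus_score:.2f}")
```

The segment ATGCAGCT disagrees with the most common base at several positions, so it scores lower than the consensus ATGGCATG.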

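  4. Scoring Motifs
     $$\frac{pb(s_1 \text{ from mtf})}{pb(s_1 \text{ from bg})} \times \frac{pb(s_2 \text{ from mtf})}{pb(s_2 \text{ from bg})} \times \cdots \times \frac{pb(s_x \text{ from mtf})}{pb(s_x \text{ from bg})} = \left(\frac{p_{A1}}{p_{A0}}\right)^{A_1} \left(\frac{p_{T1}}{p_{T0}}\right)^{T_1} \left(\frac{p_{T2}}{p_{T0}}\right)^{T_2} \left(\frac{p_{G2}}{p_{G0}}\right)^{G_2} \left(\frac{p_{C2}}{p_{C0}}\right)^{C_2} \cdots$$
     Take log of this:
     $$= A_1 \log\frac{p_{A1}}{p_{A0}} + T_1 \log\frac{p_{T1}}{p_{T0}} + T_2 \log\frac{p_{T2}}{p_{T0}} + G_2 \log\frac{p_{G2}}{p_{G0}} + \cdots$$
     Divide by the number of segments (if all the motifs have the same number of segments):
     $$= p_{A1} \log\frac{p_{A1}}{p_{A0}} + p_{T1} \log\frac{p_{T1}}{p_{T0}} + p_{T2} \log\frac{p_{T2}}{p_{T0}} + \cdots$$
     Here $A_1$ is the number of segments with A at position 1, $p_{A1}$ is the motif probability of A at position 1, and $p_{A0}$ is the background probability of A.
     Sites (Pos 12345678): ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG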
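The derivation above can be checked numerically. The sketch below assumes a uniform background and log base 2 (so the result is in bits), neither of which the slide specifies, and uses hypothetical function names. It estimates the motif probabilities directly from the site counts, then confirms that the log of the likelihood-ratio product, divided by the number of segments, equals the information content.

```python
import math
from collections import Counter

# The ten aligned sites from the slide.
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]
BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumed uniform background


def column_frequencies(sites):
    """Observed base frequencies at each motif position (no pseudocounts)."""
    n = len(sites)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*sites)
    ]


def information_content(sites, background):
    """Sum over positions and bases of p * log2(p / p_background)."""
    return sum(
        p * math.log2(p / background[base])
        for column in column_frequencies(sites)
        for base, p in column.items()
    )


def mean_log_likelihood_ratio(sites, background):
    """Log of the product of per-segment likelihood ratios, divided by the segment count."""
    freqs = column_frequencies(sites)
    total = sum(
        math.log2(freqs[pos][base] / background[base])
        for site in sites
        for pos, base in enumerate(site)
    )
    return total / len(sites)


ic = information_content(SITES, BACKGROUND)
mean_llr = mean_log_likelihood_ratio(SITES, BACKGROUND)
print(f"information content: {ic:.3f} bits")
print(f"mean log likelihood ratio: {mean_llr:.3f} bits")
```

The two printed values agree because each count divided by the number of segments is exactly the estimated motif probability for that base and position.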

  5. Scoring Motifs
     • Original function: Information Content
     Motif Conservedness: How likely to see the current aligned segments from this motif model
     Good: ATGCA ATGCC ATGCA ATGCA TTGCA ATGGA ATGCA
     Bad: AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA

  6. Scoring Motifs
     • Original function: Information Content
     Motif Specificity: How likely to see the current aligned segments from background
     Good: AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC
     Bad: ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA

  7. Scoring Motifs
     • Original function: Information Content
     Which is better? (data = 8 seqs)
     Motif 1 AGGCTAAC AGGCTAAC Motif 2 AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC

  8. Scoring Motifs • Motif scoring function: • Prefer: conserved motifs with many sites that are not often seen in the genome background • Motif Signal: Abundant; Positions Conserved; Specific (unlikely in genome background)

  9. Markov Background Increases Motif Specificity
     Prefers motif segments enriched only in data, but not so likely to occur in the background
     Segment ATGTA score:
     $$\text{score} = \frac{p(\text{generate ATGTA from } \theta)}{p(\text{generate ATGTA from } \theta_0)}$$
     3rd order Markov dependency p( )
     TCAGC = .25 × .25 × .25 × .25 × .25
             .3 × .18 × .16 × .22 × .24
     ATATA = .25 × .25 × .25 × .25 × .25
             .3 × .41 × .38 × .42 × .30
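The two rows of numbers for each segment can be multiplied out to see the effect. This sketch assumes the first row is a uniform zero-order background and the second row holds the per-base conditional probabilities under the 3rd order Markov background. That reading is an interpretation of the slide layout, not something the transcript states. The variable names are hypothetical.

```python
import math

# Per-base background probabilities as listed on the slide.
# Assumed reading: first row uniform background, second row Markov background.
UNIFORM_FACTORS = {
    "TCAGC": [0.25, 0.25, 0.25, 0.25, 0.25],
    "ATATA": [0.25, 0.25, 0.25, 0.25, 0.25],
}
MARKOV_FACTORS = {
    "TCAGC": [0.3, 0.18, 0.16, 0.22, 0.24],
    "ATATA": [0.3, 0.41, 0.38, 0.42, 0.30],
}

uniform_prob = {seg: math.prod(f) for seg, f in UNIFORM_FACTORS.items()}
markov_prob = {seg: math.prod(f) for seg, f in MARKOV_FACTORS.items()}

# Since score = log(p_motif / p_background), swapping the background changes
# the score by log(p_uniform / p_markov), whatever the motif model is.
score_shift = {
    seg: math.log(uniform_prob[seg] / markov_prob[seg]) for seg in uniform_prob
}

for seg in uniform_prob:
    print(
        f"{seg}: uniform={uniform_prob[seg]:.6f} "
        f"markov={markov_prob[seg]:.6f} shift={score_shift[seg]:+.2f}"
    )
```

Under these numbers the Markov background makes ATATA more probable than the uniform background does, so its score drops, while TCAGC becomes less probable in the background and its score rises.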

  10. Position Weight Matrix Update • Advantage • Can look for motifs of any widths • Flexible with base substitutions • Disadvantage: • EM and Gibbs sampling: no guaranteed convergence time • No guaranteed global optimum

  11. Motif Finding in Bacteria • Promoter sequences are short (200-300 bp) • Motifs are usually long (10-20 bases) • Some have two blocks with a gap, some are palindromes • Long motifs are usually very degenerate • A single microarray experiment sometimes already provides enough information to search for TF motifs

  12. Motif Regression with Expression
      • For each TF:
        Gene     Upstream Seq Mtf Match    Downstream Gene Diff Exp
        Gene1    3.2                       1.8
        Gene2    2.8                       0.3
        Gene3…
      • Upstream sequence X motif matching score measures:
        • Number of sites
        • Strength of matching
      Bussemaker et al, Nat Genet 2001

  13. Linear Regression Example
      Person  IQ   Age  Education  Height  Eye color  Spend/week  # iDevices
      A       120  30   High       171     blue       $4000       30
      B       250  41   PhD        155     brown      $1500       18
      C       150  8    Grade10    115     black      $100        90
      D       180  16   Grade12    140     gray       $200        15
      E       90   4    Preschool  88      green      $500        26
      F       130  17   High       178     black      $80         500
      G       110  21   College    182     blue       $800        220
      …
      Gene Express  Mtf1  Mtf2  Mtf3  Mtf4  Mtf5  Mtf6
      Single        X     X     X     --    --    --
      Regression
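A single regression of differential expression on one motif's matching score can be sketched as ordinary least squares. Only the first two data points (3.2, 1.8) and (2.8, 0.3) come from the motif regression slide. The remaining values are made up for illustration, and the function names are hypothetical. This is a plain simple-regression sketch, not the published method of Bussemaker et al.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Upstream motif matching score per gene (first two from the slide, rest hypothetical).
motif_match = [3.2, 2.8, 0.5, 1.9, 0.1, 2.4]
# Differential expression per gene (first two from the slide, rest hypothetical).
diff_exp = [1.8, 0.3, -0.2, 0.9, -0.4, 1.1]

slope, intercept = fit_line(motif_match, diff_exp)
fit_quality = r_squared(motif_match, diff_exp, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={fit_quality:.3f}")
```

A motif whose slope differs clearly from zero across genes is a candidate regulator. Repeating the fit once per motif gives the single-regression row of the table above.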

  14. Motif Finding in Lower Eukaryotes • Upstream sequences longer (500-1000 bp), with some simple repeats • Motif width varies (5 – 17 bases) • Expression clusters provide input sequences of decent quality for TF motif finding • Motif combination and redundancy appear, although single motifs are usually significant enough for identification

  15. Yeast Promoter Architecture • Co-occurring regulators suggest physical interaction between the regulators

  16. Introducing Sequence Conservation Kellis, et al, Nat 2003

  17. Motif Finding in Higher Eukaryotes • Regulatory sequences very long with repeats and far from target genes (enhancers) • Motifs can be short or long (6-20 bases), and appear in combination and clusters • Gene expression clusters are not good enough input • Need: • Better known motifs: PBM and SELEX • Comparative Genomics: PhastCons score • Motif modules: motif clusters • ChIP-chip/seq

  18. Better Motif Matrices I: PBM • Protein Binding Microarrays • In vitro protein-DNA interactions • Better capture motifs Badis et al, Science 2009

  19. Better Motif Matrices II: SELEX Jolma et al, Cell 2013

  20. UCSC PhastCons Conservation • Functional regulatory sequences are under stronger evolutionary constraint • Align orthologous sequences together • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC

  21. Ultra Conserved Elements • > 200bp ultra conserved in vertebrates • Exonic enriched in RNA processing • Non-exonic enriched in TF binding sites for developmental genes Bejerano et al, Science 2004

  22. Conservation vs Functions • Non-conservation ≠ non-function • Human Accelerated Region enriched in neurodevelopment Prabhakar et al, Science 2008

  23. PreMod: motif clusters in conserved regions Blanchette et al, Genome Res 2006

  24. Chromatin ImmunoPrecipitation (ChIP)

  25. TF/DNA Crosslinking in vivo

  26. Sonication (~500bp)

  27. TF-specific Antibody

  28. Immunoprecipitation

  29. Reverse Crosslink and DNA Purification

  30. ChIP-Seq • Sequence millions of 30-50 mer ends of fragments • Map 30-50 mers back to the genome • Figure labels: ChIP-DNA, Noise

  31. Summary • Motif scores consider entropy and abundance • Markov background avoids redundant seqs • Motif occurrence regressing with differential expression • Better new motifs from PBM and SELEX • Motif conservation: PhastCons, ultra conserved, and human accelerated • Motif clusters to form TF motif modules • ChIP-seq: genome-wide in vivo TF binding sites
