410 likes | 629 Views
Lab 6 Motif Analysis. March 5, 2013 Lin Liu Yang Li. EGR-1 (Early growth response protein 1) also known as Zif268 (zinc finger protein 225) or NGFI-A (nerve growth factor-induced protein A) is a protein that in humans is encoded by the EGR1 gene.
E N D
Lab 6Motif Analysis March 5, 2013 Lin Liu Yang Li
EGR-1 (Early growth response protein 1) also known as Zif268 (zinc finger protein 225) or NGFI-A (nerve growth factor-induced protein A) is a protein that in humans is encoded by the EGR1 gene. EGR-1 is a mammalian transcription factor. It was also named Krox-24, TIS8, and ZENK. It was originally discovered in mice.
Motif AnalysisBasics What is a Motif? A pattern common to a set of DNA, RNA or protein sequences that share a common biological property, such as functioning as binding sites for a particular protein.
Motif AnalysisBasics ICk(b) = - pk(b) * log2 [pk(b)/0.25]
Motif AnalysisBasics • Ways of representing a motif • • Consensus sequence • • Regular expression • • Weight matrix/PSPM/PSSM • • More complicated models
Motif AnalysisBasics Where do motifs come from? • Sequences of known common function • Cross-linking/pulldown experiments • SELEX experiments • Multiple sequence alignments
Motif AnalysisBasics Why are they important? • Identify proteins that have a specific property • Can be used to infer which factors regulate which genes • Important for efforts to model gene expression
Motif AnalysisWhich methods have been proposed? • Enumerative (‘dictionary’) - search for a k-mer/set of k-mers/regular expression that is over-represented • Probabilistic Optimization (e.g., Gibbs sampler) - stochastic search of the space of possible PSPMs • Deterministic Optimization (e.g., MEME) - deterministic search of space of possible PSPMs
Related R package • seqLogo • Toy example: • source(“http://bioconductor.org/biocLite.R”) • biocLite(seqLogo) • library(seqLogo) • m <- cbind(c(0.5, 0.2, 0.2, 0.1), c(0.2, 0.6, 0.1, 0.1), c(0.1, 0.05, 0.8, 0.05)) • pwm <- makePWM(m) • seqLogo(pwm)
Decode Gene Regulation Look at genes always expressed together: Upstream RegionsCo-expressed Genes GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATTTACCACCAGTTCAGACACGGACGGC GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Challenges: Base substitutions Sequences do not have to match the motif perfectly, base substitutions are allowed GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC CACATCGCATATGTACCACCAGTTCAGACACGGACGGC GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
De novo Sequence Motif Finding • Goal: look for common sequence patterns enriched in the input data (compared to the genome background) • Regular expression enumeration • Pattern driven approach • Enumerate patterns, check significance in dataset • Oligonucleotide analysis, MobyDick • Position weight matrix update • Data driven approach, use data to refine motifs • Consensus, EM & Gibbs sampling • Motif score and Markov background
Expectation Maximization and Gibbs Sampling Model • Objects: • Seq: sequence data to search for motif • 0: non-motif (genome background) probability • : motif probability matrix parameter • : motif site locations • Problem: P(, | seq, 0) • Approach: alternately estimate • by P( | , seq, 0) • by P( | , seq, 0) • EM and Gibbs differ in the estimation methods
E step: | , seq, 0 TTGACGACTGCACGT TTGAC p1 TGACG p2 GACGA p3 ACGAC p4 CGACT p5 GACTG p6 ACTGC p7 CTGCA p8 ... P1 = likelihood ratio = P(TTGAC| ) P(TTGAC| 0) Expectation Maximization p0T p0T p0G p0A p0C = 0.3 0.3 0.2 0.3 0.2
E step: | , seq, 0 TTGACGACTGCACGT TTGAC p1 TGACG p2 GACGA p3 ACGAC p4 CGACT p5 GACTG p6 ACTGC p7 CTGCA p8 ... M step: | , seq, 0 p1 TTGAC p2 TGACG p3 GACGA p4 ACGAC ... Scale ACGT at each position, reflects weighted average of Expectation Maximization
EM Derivatives • First EM motif finder (C Lawrence) • Deterministic algorithm, guarantee local optimum • MEME (TL Bailey) • Prior probability allows 0-n site / sequence • Parallel running multiple EM with different seed • User friendly results
Gibbs Sampling • Stochastic process, although still may need multiple initializations • Sample from P( | , seq, 0) • Sample from P( | , seq, 0) • Collapsed form: • estimatedwith counts, not sampling from Dirichlet • Sample site from one seq based on sites from other seqs • Converged motif matrix and converged motif sites represent stationary distribution of a Markov Chain
Gibbs Sampler nA1 + sA nA1 + sA +nC1 + sC +nG1 + sG +nT1 + sT estimated with counts pA1 = 1 11 2 21 31 3 4 41 51 5 Initial 1 • Randomly initialize a probability matrix
1 Without 11Segment Gibbs Sampler • Take out one sequence with its sites from current motif 11 21 31 41 51
1 Without 11Segment Gibbs Sampler • Score each possible segment of this sequence Sequence 1 Segment (1-8) 21 31 41 51
Gibbs Sampler 1 Without 11Segment • Score each possible segment of this sequence Sequence 1 Segment (2-9) 21 31 41 51
Motif Matrix Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p0A p0T p0G p0C p0A p0G p0C p0T Sites Segment Score • Use current motif matrix to score a segment
Scoring Segments Motif 1 2 3 4 5 bg A 0.4 0.1 0.3 0.4 0.2 0.3 T 0.2 0.5 0.1 0.2 0.2 0.3 G 0.2 0.2 0.2 0.3 0.4 0.2 C 0.2 0.2 0.4 0.1 0.2 0.2 Ignore pseudo counts for now… Sequence: TTCCATATTAATCAGATTCCG… score TAATC … AATCA 0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383 ATCAG 0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185 TCAGA 0.2/0.3 x 0.2/0.3 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.444444 CAGAT …
12 Modified 1 estimated with counts Gibbs Sampler • Sample site from one seq based on sites from other seqs 21 31 41 51
How to Sample? • Rand(subtotal) = X • Find the first position with subtotal larger than X
1 Without 21Segment Gibbs Sampler • Repeat the process until motif converges 21 12 31 41 51
Gibbs Sampler Intuition • Beginning: • Randomly initialized motif • No preference towards any segment
Gibbs Sampler Intuition • Motif appears: • Motif should have enriched signal (more sites) • By chance some correct sites come to alignment • Sites bias motif to attract other similar sites
Gibbs Sampler Intuition • Motif converges: • All sites come to alignment • Motif totally biased to sample sites every time
Gibbs Sampler 1 2 3 4 5 1i 2i 3i 4i 5i • Column shift • Metropolis algorithm: • Propose * as shifted 1 column to left or right • Calculate motif score u() and u(*) • Accept * with prob = min(1, u(*) / u())
Gibbs Sampling Derivatives • Gibbs Motif Sampler (JS Liu) • Add prior probability to allow 0-n site / seq • Sample motif positions to consider • AlignACE (F Roth) • Look for motifs from both strands • Mask out one motif to find more different motifs • BioProspector (XS Liu) • Use background model with Markov dependencies • Sampling with threshold (0-n sites / seq), new scoring function • Can find two-block motifs with variable gap
Motif Matrix Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p0A p0T p0G p0C p0A p0G p0C p0T Sites Scoring Motifs • Information Content (also known as relative entropy) • Suppose you have x aligned segments for the motif • pb(s1 from mtf) / pb(s1 from bg) * pb(s2 from mtf) / pb(s2 from bg) *… pb(sx from mtf) / pb(sx from bg)
Motif Matrix Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG Segment ATGCAGCT score = p(generate ATGCAGCT from motif matrix) p(generate ATGCAGCT from background) p0A p0T p0G p0C p0A p0G p0C p0T Sites Scoring Motifs • Information Content (also known as relative entropy) • Suppose you have x aligned segments for the motif • pb(s1 from mtf) / pb(s1 from bg) * pb(s2 from mtf) / pb(s2 from bg) *… pb(sx from mtf) / pb(sx from bg)
Scoring Motifs pb(s1 from mtf) / pb(s1 from bg) * pb(s2 from mtf) / pb(s2 from bg) *… pb(sx from mtf) / pb(sx from bg) = (pA1/pA0)A1 (pT1/pT0)T1 (pT2/pT0)T2 (pG2/pG0)G2 (pC2/pC0)C2… Take log of this: = A1 log (pA1/pA0) + T1 log (pT1/pT0) + T2 log (pT2/pT0) + G2 log (pG2/pG0) + … Divide by the number of segments (if all the motifs have same number of segments) = pA1 log (pA1/pA0) + pT1 log (pT1/pT0) + pT2 log (pT2/pT0)… Pos 12345678 ATGGCATG AGGGTGCG ATCGCATG TTGCCACG ATGGTATT ATTGCACG AGGGCGTT ATGACATG ATGGCATG ACTGGATG
= Motif Conservedness: How likely to see the current aligned segments from this motif model Bad AGGCA ATCCC GCGCA CGGTA TGCCA ATGGT TTGAA Good ATGCA ATGCC ATGCA ATGCA TTGCA ATGGA ATGCA Scoring Motifs • Original function: Information Content
= Motif Specificity: How likely to see the current aligned segments from background Scoring Motifs • Original function: Information Content Good AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC AGTCC Bad ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA ATAAA
= Scoring Motifs • Original function: Information Content Which is better? (data = 8 seqs) Motif 1 AGGCTAAC AGGCTAAC Motif 2 AGGCTAAC AGGCTACC AGGCTAAC AGCCTAAC AGGCCAAC AGGCTAAC TGGCTAAC AGGCTTAC AGGCTAAC AGGGTAAC
Specific (unlikely in genome background) Motif Signal Abundant Positions Conserved Scoring Motifs • Motif scoring function: • Prefer: conserved motifs with many sites, but are not often seen in the genome background
Prefers motif segments enriched only in data, but not so likely to occur in the background Segment ATGTA score = p(generate ATGTA from ) p(generate ATGTA from 0) 3rd order Markov dependency p( ) Markov Background Increases Motif Specificity TCAGC = .25 .25 .25 .25 .25 .3 .18 .16 .22 .24 ATATA = .25 .25 .25 .25 .25 .3 .41 .38 .42 .30