Transcription Regulation: Transcription Factor Motif Finding. Xiaole Shirley Liu, STAT115, STAT215
Imagine a Chef • Restaurant dinner • Home lunch • Certain recipes are used to make certain dishes
Each Cell Is Like a Chef • Adult liver, infant skin • Glucose, oxygen, amino acid • Fat, alcohol, nicotine • Certain genes are expressed to make certain proteins • Healthy skin cell state, disease liver cell state
Understanding a Genome • Get the complete sequence (the encoded cookbook) • Observe gene expression at different cell states (meals prepared in different situations) • Decode gene regulation (decode the book, understand the rules)
Information in DNA
• Coding region (2%): what is to be made
• Milk->Yogurt, Egg->Omelet, Fish->Sushi, Flour->Cake, Beef->Burger
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACAGTTCAGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGCCCGTCATCGATCTTGTTCGAATCGCGAATTGCCT
Information in DNA
• Non-coding region (98%): regulation (when, where, amount, other conditions, etc.)
• Labels on the slide: Morning; Japanese Restaurant; Butter 5 oz; Butter 9 oz
• Coding region (2%): Milk->Yogurt, Egg->Omelet, Fish->Sushi, Flour->Cake, Beef->Burger
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACTACGACGGACTAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT
Measure Gene Expression • Microarray or SAGE detects the expression of every gene at a certain cell state • Clustering finds genes that are co-expressed (and potentially share regulation)
Decode Gene Regulation
• Look at genes always expressed together
• Co-expressed genes (Morning): Scrambled Egg, Bacon, Cereal, Hash Brown, Orange Juice
• Upstream regions:
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
STAT115, 04/01/2008
Biology of Transcription Regulation
• Genes: Hemoglobin Beta, Hemoglobin Zeta, Hemoglobin Alpha, Hemoglobin Gamma
• Highlighted upstream elements: atttgctt, ttcact, gcaacct, aactccagt, actca, ccagcgccg
• Transcription Factor (TF); TF binding motif: gcaacct
...acatttgcttctgacacaactgtgttcactagcaacctca...aacagacaccATGGTGCACCTGACTCCTGAGGAGAAGTCT...
...agcaggcccaactccagtgcagctgcaacctgcccactcc...ggcagcgcacATGTCTCTGACCAAGACTGAGAGTGCCGTC...
...cgctcgcgggccggcactcttctggtccccacagactcag...gatacccaccgATGGTGCTGTCTCCTGCCGACAAGACCAA...
...gccccgccagcgccgctaccgccctgcccccgggcgagcg...gatgcgcgagtATGGTGCTGTCTCCTGCCGACAAGACCAA...
• A motif can only be discovered computationally when there are enough cases for machine learning
Computational Motif Finding
• Input data:
  • Upstream sequences of the genes in an expression-profile cluster
  • 20-800 sequences, each 300-5000 bp long
• Output: enriched sequence patterns (motifs)
• Ultimate goals:
  • Which TFs are involved, and what are their binding motifs and effects (enhance / repress gene expression)?
  • Which genes are regulated by this TF, and why is there disease when a TF goes wrong?
  • Are there binding partners / competitors for a TF?
Challenges: Where/What Is the Signal
• The motif should be abundant (slide labels: Water, Water, Water, Water, Water)
• And abundant with significance (slide labels: Coconut, Coconut, Coconut, Coconut, Coconut)
GAAATATGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAAATAAGACACGGACGGC
GCCTCGAAATAGCCATTTACCGTTCAAACCTGACTAAA
TCTCGTATTTACCATATTAAATACCCACATCGAGAGCG
CGCTAGCAAATATACGATTTACCTCGAGAATTGCCTAT
Challenges: Double-Stranded DNA
• The motif appears in both strands
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC

CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
|||||||||||||||||||||||||||||
GTGTAGCGTACCATTTATGGTCAAGTCTG

TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
|||||||||||||||||||||||||||||
AGAGTCCATTTAGTCAGTATGATGGGTGT
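A minimal sketch of scanning both strands, assuming exact matches to the example motif ATTTACC and upper-case A/C/G/T input. The function names are illustrative and not from the lecture.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_on_both_strands(seq, motif):
    """Return (start, strand) for every exact motif hit on either strand.

    Positions are 0-based on the forward strand; a "-" hit means the
    reverse complement of the motif occurs at that forward position.
    """
    rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            hits.append((i, "-"))
    return hits


# Second sequence on the slide: the site sits on the reverse strand.
seq = "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC"
print(find_on_both_strands(seq, "ATTTACC"))
```

Scanning the forward strand for both the motif and its reverse complement avoids materializing the second strand.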
Challenges: Base Substitutions
• Sequences do not have to match the motif perfectly; base substitutions are allowed
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATGTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTGCCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTATCACCCACATCGAGAGCG
CGCTAGCCAATTACCGATCTTGTTCGAGAATTGCCTAT
Challenges: Variable Motif Copies
• Some sequences do not have the motif
• Some have multiple copies of the motif
• Slide labels: Sushi, Hand Roll, Sashimi, Tempura, Sake, with "Fish" repeated a varying number of times
GATGATGCCTCGGACGGATATGCCCTACGACCTCTCGC
CACATCGCAATGCAGCAATGCGTTCAGACACGGACGGC
TCATGCTAATGCCAGTCATGCTACATGCATCGAGAGCG
GCCTCTAGCTAGGCCGGTGAACATCAGACCTGACTAAA
CGCAATATAGCATTAGCAGACAGACGAGAATTGCCTAT
Challenges: Two-Block Motifs
• Some motifs have two parts (slide label: Coconut Milk)
• Or palindromic patterns: AATGCG GCGTAA
GACACATTTACCTATGCTGGCCCTACGACCTCTCGC
CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
TCTCGTTAGATTTACCACCCATGGCCGTATCGAGAGCG
CGCTAGCCATTTACCGAT TGGCGTTCTCGAGAATTGCCTAT
Scan for Known TF Motif Sites
• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
  • Regular expression (binary decision):
    Consensus: CACAAAA
    Degenerate (IUPAC): CRCAAAW, where R = A/G and W = A/T
IUPAC for DNA
Code  Meaning
A     adenosine
C     cytidine
G     guanine
T     thymidine
U     uridine
R     G A (purine)
Y     T C (pyrimidine)
K     G T (keto)
M     A C (amino)
S     G C (strong)
W     A T (weak)
B     C G T (not A)
D     A G T (not C)
H     A C T (not G)
V     A C G (not T)
N     A C G T (any)
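A degenerate IUPAC consensus can be turned into a regular expression and used for the binary match / no-match scan. The sketch below assumes DNA input in upper case; the helper names and the test string are made up for illustration.

```python
import re

# IUPAC DNA codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus such as CRCAAAW into a regex string."""
    return "".join(IUPAC[code] for code in consensus.upper())


def scan(seq, consensus):
    """Return 0-based starts of all (possibly overlapping) matches."""
    # A lookahead lets re.finditer report overlapping sites as well.
    pattern = re.compile("(?=" + iupac_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(seq)]


example = "TTCACAAAAGGCGCAAATCC"  # made-up sequence for illustration
print(scan(example, "CACAAAA"))   # strict consensus
print(scan(example, "CRCAAAW"))   # degenerate consensus
```

The degenerate pattern accepts every site the strict consensus does, plus the variants allowed by R and W.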
Scan for Known TF Motif Sites
• Experimental TF sites: TRANSFAC, JASPAR
• Motif representation:
  • Regular expression (binary decision): consensus CACAAAA, degenerate CRCAAAW
  • Position weight matrix (PWM): needs a score cutoff
• Sites used to build the motif matrix (positions 1-8):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
• Score of segment ATGCAGCT:
$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$
where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$
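A minimal sketch of building a position weight matrix from the ten sites and scoring a segment against a background. The pseudocount of 1 per base and the uniform background are assumptions made for illustration, and the score is reported as a log of the likelihood ratio.

```python
import math
from collections import Counter

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

# Assumed background: all four bases equally likely.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(sites, pseudocount=1.0):
    """Per-position base probabilities estimated from aligned sites."""
    pwm = []
    total = len(sites) + 4 * pseudocount
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return pwm


def log_odds(segment, pwm, background):
    """log [ p(segment | motif matrix) / p(segment | background) ]."""
    return sum(math.log(pwm[j][b] / background[b])
               for j, b in enumerate(segment))


pwm = build_pwm(SITES)
for segment in ("ATGGCATG", "ATGCAGCT"):
    print(segment, round(log_odds(segment, pwm, BACKGROUND), 3))
```

A segment is called a site when its score exceeds a chosen cutoff, which is why the PWM representation needs a threshold where the regular expression does not.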
A Word on Sequence Logo
• A sequence logo consists of stacks of symbols, one stack for each position in the sequence
• The overall height of the stack indicates the sequence conservation at that position
• The height of each symbol within the stack indicates the relative frequency of that nucleotide at that position
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
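The stack heights can be sketched as information content in bits: 2 minus the Shannon entropy of the column, split among the bases by their relative frequency. This is one common convention and it ignores the small-sample correction that logo tools usually apply, so treat it as an approximation.

```python
import math
from collections import Counter

SITES = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]


def column_information(column):
    """Information content in bits of one aligned column (max 2 for DNA)."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy


def logo_heights(sites):
    """For each position, map every observed base to its symbol height."""
    heights = []
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        info = column_information(column)
        counts = Counter(column)
        heights.append({b: info * c / len(column) for b, c in counts.items()})
    return heights


for pos, stack in enumerate(logo_heights(SITES), start=1):
    print(pos, {b: round(h, 2) for b, h in stack.items()})
```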
JASPAR • User-defined cutoff to scan for a particular motif
Drawbacks to Known TF Motif Scans
• Limited number of motifs
• Limited number of sites to represent each motif
  • Low sensitivity and specificity
• Poor description of motif
  • Binding site borders are not clear
  • Binding sites have many mismatches
• Many motifs look very similar
  • E.g. GC-rich motifs, E-box (CACGTG)
De novo Sequence Motif Finding
• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)
• Regular expression enumeration
  • Pattern-driven approach
  • Enumerate k-mers, check significance in dataset
• Position weight matrix update
  • Data-driven approach, use data to refine motifs
  • EM & Gibbs sampling
  • Motif score and Markov background
Regular Expression Enumeration
• Oligonucleotide analysis: check over-representation for every w-mer
• Expected occurrence of w in the data:
$$\text{expected}(w) = p_w \times (\text{size of sequence data})$$
with $p_w$ taken from the genome background
  • Consider genome sequence + current data size
• Observed occurrence of w in the data
• An over-represented w is a potential TF binding motif
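A minimal sketch of the over-representation check, assuming $p_w$ is approximated as a product of independent base frequencies (a genome-derived w-mer table would be the more faithful choice) and that the data size is the number of w-mer start positions. The observed/expected ratio stands in for a proper significance test.

```python
from collections import Counter

SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]


def count_wmers(seqs, w):
    """Observed occurrences of every w-mer in the data."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts


def enrichment(seqs, w, base_freq):
    """Map each w-mer to (observed, expected, observed / expected)."""
    n_positions = sum(max(len(seq) - w + 1, 0) for seq in seqs)
    result = {}
    for word, observed in count_wmers(seqs, w).items():
        p_w = 1.0
        for base in word:
            p_w *= base_freq[base]  # assumed independent-base background
        expected = p_w * n_positions
        result[word] = (observed, expected, observed / expected)
    return result


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
table = enrichment(SEQS, 7, uniform)
top = sorted(table.items(), key=lambda kv: kv[1][2], reverse=True)[:3]
for word, (obs, exp, ratio) in top:
    print(word, obs, round(exp, 4), round(ratio, 1))
```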
Suffix Tree for Fast Search
• Weeder, Pavesi & Pesole 2006
• Construction is linear in time and space in the length of the string S
• Quickly locates a substring while allowing a certain number of mistakes
• Provides the first linear-time solution to the longest common substring problem
• Typically requires significantly more space than storing the string itself
Regular Expression Enumeration
• RE enumeration derivatives:
  • oligo-analysis, spaced dyads w1.ns.w2
  • IUPAC alphabet
  • Markov background (later)
  • 2-bit encoding, fast index access
  • Enumerate limited RE patterns known for a TF protein structure or interaction theme
• Exhaustive, guaranteed to find the global optimum, and can find multiple motifs
• Not as flexible with base substitutions, produces a long list of similar good motifs, and is limited in motif width
Expectation Maximization and Gibbs Sampling Model
• Objects:
  • Seq: sequence data to search for motif
  • $\theta_0$: non-motif (genome background) probability
  • $\theta$: motif probability matrix parameter
  • $\pi$: motif site locations
• Problem: $P(\theta, \pi \mid \text{seq}, \theta_0)$
• Approach: alternately estimate
  • $\pi$ by $P(\pi \mid \theta, \text{seq}, \theta_0)$
  • $\theta$ by $P(\theta \mid \pi, \text{seq}, \theta_0)$
• EM and Gibbs differ in the estimation methods
Expectation Maximization
• E step: $\pi \mid \theta, \text{seq}, \theta_0$
  Sequence: TTGACGACTGCACGT
  Segments and their likelihood ratios: TTGAC $p_1$, TGACG $p_2$, GACGA $p_3$, ACGAC $p_4$, CGACT $p_5$, GACTG $p_6$, ACTGC $p_7$, CTGCA $p_8$, ...
$$p_1 = \text{likelihood ratio} = \frac{P(\text{TTGAC} \mid \theta)}{P(\text{TTGAC} \mid \theta_0)}$$
$$P(\text{TTGAC} \mid \theta_0) = p_{0T}\, p_{0T}\, p_{0G}\, p_{0A}\, p_{0C} = 0.3 \times 0.3 \times 0.2 \times 0.3 \times 0.2$$
• M step: $\theta \mid \pi, \text{seq}, \theta_0$
  Weighted segments: $p_1$ TTGAC, $p_2$ TGACG, $p_3$ GACGA, $p_4$ ACGAC, ...
  Scale A, C, G, T at each position; $\theta$ reflects the weighted average of $\pi$
M Step
Sequence: TTGACGACTGCACGT
Weight  Segment
0.8     TTGAC
0.2     TGACG
0.6     GACGA
0.5     ACGAC
0.3     CGACT
0.7     GACTG
0.4     ACTGC
0.1     CTGCA
0.9     TGCAC
…
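One EM iteration on the example sequence can be sketched as follows. Normalizing the likelihood ratios to sum to 1 (one site per sequence), the pseudocount of 0.1, and the uniform starting matrix are assumptions made here for illustration; the slide weights themselves are not reproduced.

```python
THETA0 = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}  # background from the slide


def segments(seq, w):
    """All overlapping w-long segments of seq, left to right."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def e_step(seq, theta, theta0):
    """Likelihood ratio of every segment, normalized to sum to 1."""
    ratios = []
    for seg in segments(seq, len(theta)):
        ratio = 1.0
        for j, base in enumerate(seg):
            ratio *= theta[j][base] / theta0[base]
        ratios.append(ratio)
    total = sum(ratios)
    return [r / total for r in ratios]


def m_step(seq, weights, w, pseudocount=0.1):
    """Re-estimate the motif matrix as a weighted average of segments."""
    theta = []
    segs = segments(seq, w)
    for j in range(w):
        column = {b: pseudocount for b in "ACGT"}
        for seg, weight in zip(segs, weights):
            column[seg[j]] += weight
        total = sum(column.values())
        theta.append({b: column[b] / total for b in "ACGT"})
    return theta


seq = "TTGACGACTGCACGT"
theta = [{b: 0.25 for b in "ACGT"} for _ in range(5)]  # assumed start
for _ in range(10):
    weights = e_step(seq, theta, THETA0)
    theta = m_step(seq, weights, 5)
print([round(x, 3) for x in weights])
```

Because each step is deterministic, the result depends entirely on the starting matrix, which is why EM is usually restarted from several seeds.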
EM Derivatives
• First EM motif finder (C Lawrence)
  • Deterministic algorithm, guaranteed to reach a local optimum
• MEME (TL Bailey)
  • Prior probability allows 0 to n sites per sequence
  • Runs multiple EMs in parallel with different seeds
  • User-friendly results
Gibbs Sampling
• Stochastic process, although it may still need multiple initializations
  • Sample $\pi$ from $P(\pi \mid \theta, \text{seq}, \theta_0)$
  • Sample $\theta$ from $P(\theta \mid \pi, \text{seq}, \theta_0)$
• Collapsed form:
  • $\theta$ estimated with counts, not sampled from a Dirichlet
  • Sample the site in one seq based on the sites from the other seqs
• The converged motif matrix $\theta$ and converged motif sites $\pi$ represent the stationary distribution of a Markov chain
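Gibbs Sampler
• Randomly initialize a probability matrix (initial $\theta_1$)
• $\theta_1$ is estimated with counts (n: base counts at position 1; s: pseudocounts):
$$p_{A1} = \frac{n_{A1} + s_A}{n_{A1} + s_A + n_{C1} + s_C + n_{G1} + s_G + n_{T1} + s_T}$$
[Figure: sequences 1 to 5 with initial sites $\pi_{11}$, $\pi_{21}$, $\pi_{31}$, $\pi_{41}$, $\pi_{51}$]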
Gibbs Sampler
• Take out one sequence with its sites from the current motif ($\theta_1$ without the $\pi_{11}$ segment; sites $\pi_{21}$, $\pi_{31}$, $\pi_{41}$, $\pi_{51}$ remain)
• Score each possible segment of this sequence: sequence 1 segment (1-8), then segment (2-9), and so on
Segment Score
• Use the current motif matrix to score a segment
• Sites in the motif matrix (positions 1-8):
ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG
• Score of segment ATGCAGCT:
$$\text{score} = \frac{p(\text{generate ATGCAGCT from motif matrix})}{p(\text{generate ATGCAGCT from background})}$$
where the denominator is $p_{0A}\, p_{0T}\, p_{0G}\, p_{0C}\, p_{0A}\, p_{0G}\, p_{0C}\, p_{0T}$
Scoring Segments
Motif  1    2    3    4    5    bg
A      0.4  0.1  0.3  0.4  0.2  0.3
T      0.2  0.5  0.1  0.2  0.2  0.3
G      0.2  0.2  0.2  0.3  0.4  0.2
C      0.2  0.2  0.4  0.1  0.2  0.2
Ignore pseudocounts for now…
Sequence: TTCCATATTAATCAGATTCCG…
Segment  Score
TAATC    …
AATCA    0.4/0.3 x 0.1/0.3 x 0.1/0.3 x 0.1/0.2 x 0.2/0.3 = 0.049383
ATCAG    0.4/0.3 x 0.5/0.3 x 0.4/0.2 x 0.4/0.3 x 0.4/0.2 = 11.85185
TCAGA    0.2/0.3 x 0.2/0.2 x 0.3/0.3 x 0.3/0.2 x 0.2/0.3 = 0.666667
CAGAT    …
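The segment scores can be checked with a few lines of code. The sketch below uses the motif and background values from the table, ignores pseudocounts as the slide does, and uses names that are illustrative only.

```python
# Motif matrix from the table: probability of each base at positions 1-5.
MOTIF = {
    "A": [0.4, 0.1, 0.3, 0.4, 0.2],
    "T": [0.2, 0.5, 0.1, 0.2, 0.2],
    "G": [0.2, 0.2, 0.2, 0.3, 0.4],
    "C": [0.2, 0.2, 0.4, 0.1, 0.2],
}
# Background (bg) column from the same table.
BG = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def score(segment):
    """Likelihood ratio: product over positions of motif / background."""
    ratio = 1.0
    for j, base in enumerate(segment):
        ratio *= MOTIF[base][j] / BG[base]
    return ratio


for segment in ("AATCA", "ATCAG", "TCAGA"):
    print(segment, round(score(segment), 6))
```

Each base is divided by its own background probability, so a C contributes a denominator of 0.2 while an A or T contributes 0.3.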
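Gibbs Sampler
• Sample the site in one seq based on the sites from the other seqs (new site $\pi_{12}$ in sequence 1; sites $\pi_{21}$, $\pi_{31}$, $\pi_{41}$, $\pi_{51}$ unchanged)
• Modified $\theta_1$ estimated with counts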
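Hill Climbing vs Sampling
• Hill climbing: always take the highest-scoring segment
• Sampling: draw X = Rand(total of all segment scores)
• Find the first position whose running subtotal is larger than X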
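The two selection rules can be contrasted in a short sketch. The running-subtotal draw follows the slide; the function names and the fallback for floating-point round-off are additions made here.

```python
import random


def hill_climb(scores):
    """Deterministic choice: index of the highest-scoring segment."""
    return max(range(len(scores)), key=scores.__getitem__)


def sample_position(scores, rng=random):
    """Stochastic choice: index drawn with probability proportional to score."""
    x = rng.random() * sum(scores)  # X = Rand(total)
    subtotal = 0.0
    for i, s in enumerate(scores):
        subtotal += s
        if subtotal > x:            # first position with subtotal > X
            return i
    return len(scores) - 1          # guard against round-off at the end


scores = [0.05, 11.85, 0.44]        # illustrative segment scores
rng = random.Random(0)
print(hill_climb(scores))
print([sample_position(scores, rng) for _ in range(10)])
```

Sampling usually picks the best segment but occasionally picks a weaker one, which is what lets the sampler escape a poor local optimum.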
Gibbs Sampler
• Repeat the process until the motif converges (next: $\theta_1$ without the $\pi_{21}$ segment; current sites $\pi_{12}$, $\pi_{21}$, $\pi_{31}$, $\pi_{41}$, $\pi_{51}$)
Gibbs Sampler Intuition • Beginning: • Randomly initialized motif • No preference towards any segment
Gibbs Sampler Intuition • Motif appears: • Motif should have enriched signal (more sites) • By chance some correct sites come to alignment • Sites bias motif to attract other similar sites
Gibbs Sampler Intuition • Motif converges: • All sites come to alignment • Motif totally biased to sample sites every time
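The whole loop can be sketched compactly. This is a simplified collapsed sampler under stated assumptions: exactly one site per sequence, a fixed motif width, a background estimated from the input sequences, a pseudocount of 0.5, and a fixed number of sweeps in place of a convergence test. The function name and parameters are hypothetical.

```python
import random


def gibbs_motif_search(seqs, w, n_iter=200, pseudocount=0.5, seed=0):
    """Return one sampled site start per sequence after n_iter sweeps."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Assumed background: base composition of the input, lightly smoothed.
    pooled = "".join(seqs)
    theta0 = {b: (pooled.count(b) + 1) / (len(pooled) + 4) for b in bases}
    # Randomly initialize one site per sequence.
    sites = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_iter):
        for k, seq in enumerate(seqs):
            # Motif matrix estimated with counts from all other sequences.
            theta = []
            for j in range(w):
                counts = {b: pseudocount for b in bases}
                for m, other in enumerate(seqs):
                    if m != k:
                        counts[other[sites[m] + j]] += 1
                total = sum(counts.values())
                theta.append({b: counts[b] / total for b in bases})
            # Score every segment of the held-out sequence.
            scores = []
            for i in range(len(seq) - w + 1):
                ratio = 1.0
                for j in range(w):
                    base = seq[i + j]
                    ratio *= theta[j][base] / theta0[base]
                scores.append(ratio)
            # Sample a new site with probability proportional to its score.
            x = rng.random() * sum(scores)
            subtotal = 0.0
            new_site = len(scores) - 1
            for i, s in enumerate(scores):
                subtotal += s
                if subtotal > x:
                    new_site = i
                    break
            sites[k] = new_site
    return sites


SEQS = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATATTTACCACCAGTTCAGACACGGACGGC",
    "GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA",
    "TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG",
    "CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT",
]
sites = gibbs_motif_search(SEQS, 7)
print([seq[s:s + 7] for seq, s in zip(SEQS, sites)])
```

Because the result is a random draw, different seeds can return different sites; in practice several runs are compared.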
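Gibbs Sampler
• Column shift
• Metropolis algorithm:
  • Propose $\theta^*$ as $\theta$ shifted 1 column to the left or right
  • Calculate the motif scores $u(\theta)$ and $u(\theta^*)$
  • Accept $\theta^*$ with prob = $\min(1,\, u(\theta^*) / u(\theta))$
[Figure: sequences 1 to 5 with sites $\pi_{1i}$, $\pi_{2i}$, $\pi_{3i}$, $\pi_{4i}$, $\pi_{5i}$]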
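A column-shift move can be sketched as shifting every site start by one position and applying the Metropolis acceptance rule. The `log_score` callback, which should return the log of the motif score u for a given set of sites, is a hypothetical interface; the lecture does not specify how u is computed. Working in log space avoids overflow when the ratio is large.

```python
import math
import random


def column_shift(sites, seq_lengths, width, log_score, rng=random):
    """One Metropolis column-shift move; returns the resulting site starts."""
    shift = rng.choice((-1, 1))  # left or right with equal probability
    proposal = [s + shift for s in sites]
    # Reject proposals that push any site off the end of its sequence.
    for start, length in zip(proposal, seq_lengths):
        if start < 0 or start + width > length:
            return list(sites)
    log_ratio = log_score(proposal) - log_score(sites)
    # Accept with probability min(1, u(new) / u(old)).
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return list(sites)


# Toy score that peaks when the first site starts at position 5.
def toy_log_score(sites):
    return -abs(sites[0] - 5)


rng = random.Random(3)
sites = [2, 7]
for _ in range(50):
    sites = column_shift(sites, [20, 20], 5, toy_log_score, rng)
print(sites)
```

The proposal is symmetric, so the plain score ratio is a valid acceptance probability without a correction term.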