510 likes | 1.08k Views
DNA Regulatory Binding Motif Search. Dong Xu Computer Science Department 109 Engineering Building West E-mail: xudong@missouri.edu 573-882-7064 http://digbio.missouri.edu. Lecture Outline. Gene regulation Definition of regulatory motif search CONSENSUS (“Greedy” Algorithm)
E N D
DNA Regulatory Binding Motif Search Dong Xu Computer Science Department 109 Engineering Building West E-mail: xudong@missouri.edu 573-882-7064 http://digbio.missouri.edu
Lecture Outline • Gene regulation • Definition of regulatory motif search • CONSENSUS (“Greedy” Algorithm) • Gibbs Sampler
Gene Regulation promoter DNA sequence Start of transcription operator
Key steps in transcription • Initiation • Elongation • Termination DNA RNA +
Initiation RNA Polymerases RNA Pol I Ribosomal RNAs RNA Pol II All protein genes, snRNAs U1,U2 etc RNA Pol III Transfer RNAs, ribosomal RNAs TATA TATA A/T A A/T One of the first sequences to be described was the TATA box consensus
RNA pol II TBP TAF TAF TATA Transcription Initiation Complex A macromolecular assembly of approximately 50 proteins Many conserved from yeast to humans TATA-binding protein (TBP) binds to TATA box
Upstream Regulatory Elements In addition to the TATA box the comparison of many eukaryotic upstream sequences identified addition conserved motifs that were involved with the regulation of gene transcription Some UREs were common to many genes others were found only in genes expressed in specific cells or as a result of specific stimuli TATA URE Promoters are sequences in the DNA just upstream of transcripts (coding sequences) that define the sites of initiation
RNA pol II TAF TAF TBP Transcription faction Transcription factors are the proteins that modulate the rate of gene transcription by specific interactions with DNA and/or other proteins motif TATA
Regulatory Network (1) • Regulatory elements in eukaryotes are frequently arranged in “modules”. • Frequently TFs act as synergistic (cooperative) or antagonistic (competitive) pairs. Endo 16
Lecture Outline • Gene regulation • Definition of regulatory motif search • CONSENSUS (“Greedy” Algorithm) • Gibbs Sampler
AGCCA AGCCA AGCCA AGCCA AGCCA Motif Identification Regulatory regions Motif – Binding site??? AGCCA
What constitutes a motif? • In S.cerevisiae typically 6-10 conserved bases – The motif • Spacers varying in length (1-11bp) • Usually located in the middle ACCNNNNNNGTT
Subproblem #1 • Having a collection of known binding sites • Can we develop a representation to search for new binding sites?
Subproblem # 2 • Given a set of sequences containing binding sites for a common factor • Can we discover their location in each sequence?
Computational Approach • Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes). • Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences. • Find some way to identify conserved elements (ungapped pattern) in these sequences, resulting in a list of potential regulatory sites.
Motif Finding Problem • Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern? • Input: a set of sequences, each one with an unknown pattern at an unknown position. • Output: the pattern and a set of starting positions of the pattern in each sequence.
Why Not Use Multiple Alignment • The motif is short and may appear in different location in different sequences. Most other areas are random. • The problem is made more complicated since not every sequence contains a motif, due to: • The upstream region used may not be long enough to include a regulatory site in every sequence. • Usually, potential co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.
f(b,i)+ p(b) Log() Frequency matrix
Information Content Values The functional constraints on each specific position of the pattern are variable from some sites absolutely conserved (Shannon’s information content Ci ranging between 0 and 1).
Example Data Set Experimentally determined CRP binding sites for 18 genes
CRP Dimer Homo dimeric structure indicates symmetric model
CPR Product Multinomial Model Logo Palidromic Product Multinomial model of sites
Essentially a Multiple Local Alignment Find “best” multiple local alignment . . .
Difficulties • Multiple factors for a single gene • Variability in binding sites • The nature of variability is NOT well understood • Insertions and deletions are uncommon • Location, location, location… • Confidence assessment
Lecture Outline • Gene regulation • Definition of regulatory motif search • CONSENSUS (“Greedy” Algorithm) • Gibbs Sampler
Early Statistical Approaches • CONSENSUS – Use a greedy algorithm to iteratively build up motifs by adding more and more pattern instances. • Gibbs sampler – Start from a random initial solution, use the Gibbs sampling approach to make a series of local moves, trying to get to the solution with the best score. • MEME – Use the expectaion maxmization (EM) algorithm.
CONSENSUS Algorithm • CONSENSUS uses an iterative procedure to add more and more patterns to form potential motifs: • Initialize each l-mer in sequence 1 as a single-pattern motif. • Add each l-mer in sequence 2 to each single-pattern motif, forming motifs consisting of 2 patterns. Keep only the top n motifs. • Repeat the process by adding each l-mer in sequence 3 to the top n motifs from the last round, forming motifs consisting of 3 patterns, and so on until the last sequence. Only the top n motifs are kept each time.
More Details of CONSENSUS • CONSENSUS use the information content score for scoring a motif as a set of ungapped patterns. • Instead of following the sequence order as given in the input sequence set, a randomized ordering is used to avoid dependence on the input set.
CONSENSUS Procedure (1) Cycle 1: For each word W1 in S1 For each word W2 in S2 Create alignment (gap free) of W1, W2 Keep the n best alignments A1,1, …, An,1 : ACGGTTG , CGAACTT , GGGCTCT … ACGCCTG , AGAACTA , GGGGTGT …
CONSENSUS Procedure (2) Cycle t: For each alignment Aj, t-1 from cycle t-1 For each word Wt+1 in St+1 Create alignment (gap free) of Wt+1, Aj, t-1 Keep the n best alignments A1,t, …, An,t ACGGTTG , CGAACTT , GGGCTCT … ACGCCTG , AGAACTA , GGGGTGT … … … … ACGGCTC , AGATCTT , GGCGTCT …
Weight matrix • Probabilistic model: How likely is each letter at each motif position? A C G T
A. K. A. Weight matrices are also known as • Position-specific scoring matrices • Position-specific probability matrices • Position-specific weight matrices Related concepts • Information content • Relative entropy
Scoring a motif model • A motif is interesting if it is very different from the background distribution A C G T less interesting more interesting
Relative entropy • A motif is interesting if it is very different from the background distribution • Use relative entropy as objective function: pi, = probability of in matrix position i b = background frequency (in non-motif sequence)
Computational Complexity n is user-defined heuristic constants Running time: O(N2) + O(k N n) Where N: length of sequence; n: top n selections k: number of sequences
Lecture Outline • Gene regulation • Definition of regulatory motif search • CONSENSUS (“Greedy” Algorithm) • Gibbs Sampler
Gibbs Sampling (1) • Goal: find the best ak to maximize the difference between motif and background base distribution. a1 a2 a3 a4 ak
Gibbs Sampling (2) • Step 1: Pick random start position, compute current motif matrix • Step 2: Iterative update • Take one sequence out, update motif matrix • Calcuate fitness score of each position of out sequence • Pick start position in out sequence based on weight Ax • Take out another sequence, …, until converge • Step 3: Reset starting position Liu, X
Gibbs Sampling (3) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif a1' ????????????????? a2' a3' a4' ak'
Fitness Score Current Motif • Ax = Qx / Px • Qx: probability of generating subsequence x from current motif • Px: probability of generating subsequence x from background X = GGA: Q? P? Background: P(A) = P(T) = 0.4 P(G) = P(C) = 0.1
An example ACAGTGT TAGGCGT ACACCGT ??????? CAGGTTT A C G T ACAGTGT TAGGCGT ACACCGT ACGCCGT CAGGTTT sequence 4
Gibbs pseudocode select sites at random compute the relative entropy for (iter = 0; iter < maxiter; iter++) { shuffle(sequences) foreach sequence in (sequences) { assign score to each site in sequence choose one site probabilistically compute the fitness score if (fitness score is best so far) { store a copy of the current sites } } } print the best scoring set of sites
Computational Complexity • One iteration running time: O(NK) • Usually need < N iterations for convergence, and < N starting points. • Overall complexity: unclear – typically O(N2K) - O(N3K) • EM is a local optimization method • Initial parameters matter
Biological Considerations • In practice, motif finding algorithms have to take into account characteristics of real input samples. These include: • Motifs with unknown length. • Samples with biased nucleotide composition. • Corrupted samples (not every sequence contains a motif). • Regulatory sites can lie on either DNA strand.
Reading Assignments • Suggested reading: • Chapter 10 in “Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press. 2002.” • Optional reading: • Victor Olman, Dong Xu, and Ying Xu. CUBIC: Identifications of Regulatory Binding Sites through Data Clustering. Journal of Bioinformatics and Computational Biology. 1:21-40. 2003.
Develop a program that implement the “greedy” algorithm (CONSENSUS) for motif identification Use an objective function of total mismatches between words. Test the program using the DNA sequence in the next page. Output the motif and location in each sequence. Project Assignment (1)
Project Assignment (2) atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa Test DNA sequence (each line a sequence):