Introduction to Bioinformatics
Introduction to Bioinformatics. LECTURE 10: Identification of regulatory sequences * Chapter 10: A bed-time story
10.1 The circadian clock
* All living beings have a biological clock (remember jet lag), called the circadian rhythm or circadian clock
* Mismatches between the circadian rhythm and the natural day-night cycle lead to various health problems
* The internal clock synchronizes numerous functions such as metabolism, activity/awareness level, and body temperature
* This is especially true for plants, which depend on daylight for photosynthesis
* Plants lead a stressful life: they have many needs (water, sun, nutrients) but are unable to move
* Rather than moving, plants react to external stress by changing their internal condition
* Herbivore? → Chemical repellent (e.g. nicotine)
* Falling temperature? → Anti-freeze proteins
* Plants that can 'anticipate' changes have a competitive advantage → this is why a circadian clock matters
Arabidopsis thaliana (from Wikipedia, the free encyclopedia)
Scientific classification: Kingdom Plantae; Division Magnoliophyta; Class Magnoliopsida; Order Brassicales; Family Brassicaceae; Subfamily Brassicoideae; Genus Arabidopsis; Species A. thaliana
Arabidopsis thaliana, commonly called arabidopsis, thale cress, or mouse-ear cress, is a small flowering plant related to cabbage and mustard. It is one of the model organisms for studying plant sciences, including genetics and plant development. It plays the role for agricultural sciences that mice and fruit flies (Drosophila) play in human biology.
Arabidopsis thaliana: 120 Mbp genome, 5 chromosomes, 29,000 genes
* Arabidopsis thaliana has a cell-autonomous circadian clock: each single cell keeps track of the day-night cycle independently
[Figure labels: awake, asleep]
* If you remove the day-night stimulus and keep A. thaliana in constant light (or dark), the clock loses its periodicity within days
* In contrast, mammals kept in constant light keep the circadian clock running for months
* How do A. thaliana and other organisms run their circadian clocks?
* Three proteins are the key players: LHY, CCA1, and TOC1
* Transcription factor (TF): a protein that regulates transcription. TFs regulate the binding of RNA polymerase and the initiation of transcription. A TF binds upstream or downstream of a gene and either enhances or represses its transcription by assisting or blocking RNA polymerase binding.
* Transcription factor binding site (TFBS): the location on the DNA molecule where a TF can physically attach.
[Figure: transcription factors (TF) attached to their binding sites (TFBS) on the DNA]
* A TFBS has a specific sequence of nucleotides to which the TF attaches. This sequence is called the motif
* Motifs are short (5-15 bp) sequences of nucleotides, e.g. TATAA, TAAAAAAAAAATCTA, TATCTG, …
* Different TFs have different TFBS motifs
* However, there is some freedom in the motif sequence: a given TF may lock to TATACT, but also to TATAACT
10.2 Basic mechanisms of gene expression
* Statistical and algorithmic issues in finding TFBS motifs
* Protein control: gene transcription control, mRNA control, post-translational control of proteins, …
[Figure: TFs acting on a gene; expressed gene → mRNA → protein]
Genetic signposts
* A gene embedded in random DNA is totally inert
* Promoter = regulatory DNA = cis-regulatory DNA
* The promoter is a region of the DNA just before (= upstream of) the gene that indicates where transcription starts …
[Figure: Promoter locking (simplified)]
[Figure: Promoter locking (realistic)]
* A major TFBS is the RNA polymerase binding site
* Eubacteria: rigid motifs at -10 (TATAAT) and at -35 (TTGACA)
* Eukaryota: a different RNA polymerase → different motifs; TATA-box (= TATAA[A/T]) at ~ -40
* Other docking sites around -1000, but also at many other places up to -250,000 (and possibly further)
Computational challenges in finding TFBS
Finding TFBS motifs is complex:
* TFBS are very short and will therefore also appear by chance alone
* There is high variability (ATAATC, ATAATT, ATACTC, …)
* We know neither the TFBS motif nor the TFBS location
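A rough calculation illustrates why short motifs also appear by chance. The sketch below assumes a uniform, independent background (each base with probability 1/4), which real promoter sequences only approximate; the function name and the example numbers (one fixed 6-mer, 1000 bp upstream of each of 100 genes) are illustrative assumptions, not values from the lecture.

```python
def expected_chance_hits(motif_length, region_length, n_regions=1, both_strands=False):
    """Expected number of exact matches to one fixed word under a uniform i.i.d. background."""
    positions = region_length - motif_length + 1  # possible start positions per region
    if positions <= 0:
        return 0.0
    p_match = 0.25 ** motif_length                # probability of a match at one position
    strands = 2 if both_strands else 1
    return positions * p_match * n_regions * strands


# One fixed 6-mer, 1000 bp upstream of each of 100 genes (illustrative numbers)
print(expected_chance_hits(6, 1000, n_regions=100))
```

Under these assumptions about 24 exact matches are expected by chance alone, so a single occurrence of a short word is not evidence of a binding site.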
* Trick 1: look at areas near the gene with high conservation
* Trick 2: co-regulated genes (which share a TF): look for shared motifs upstream
* For Arabidopsis thaliana: look for upstream motifs bound by LHY and CCA1:
  [i] cluster genes with the same day-night oscillatory pattern
  [ii] look in this cluster for shared motifs upstream, up to -1000
10.3 Motif-finding strategies
* Where to look for a TFBS? Within about 1000 bp upstream
* Gapped or ungapped motif
* Fixed- or variable-length motif
* TFBS motifs are variable but highly similar
* Consensus sequence: the most probable sequence
Consensus motif: a useful notation
PSSM / PSWM / profile
* Position Specific Scoring Matrix: PSSM (also called a profile)
* PSSM: multinomial model of a sequence of length L
* PSSM: the multinomial distribution depends on the position in the sequence: P[position, symbol], with symbol in {A, C, G, T} for DNA or one of the 20 amino acids for proteins
Example 10.1: fixed-length, ungapped motif
  A T G C T G A A T G T A
  C T A T A T A G T A A T
  C T G T C A A T A T G T
  A A C C T A A T T G T T
  C A G A T T T C C C A C
  C T C G A C A A A T T T
  A C T C A G A T T C T C
Note: we know neither the place nor the length (6) of the TFBS
Example 10.1: fixed-length, ungapped motif (* marks the start of the TFBS)
  A T G *C T G A A T G T A
  *C T A T A T A G T A A T
  C T G T *C A A T A T G T
  A A C *C T A A T T G T T
  *C A G A T T T C C C A C
  C T C G A *C A A A T T T
  A C T *C A G A T T C T C
Note: we know neither the place (*) nor the length (6) of the TFBS in advance
Example 10.1: fixed-length, ungapped motif
Alignment:
  *C T G A A T
  *C T A T A T
  *C A A T A T
  *C T A A T T
  *C A G A T T
  *C A A A T T
  *C A G A T T
PSSM (counts per position):
       1  2  3  4  5  6
  A    0  5  5  5  4  0
  C    7  0  0  0  0  0
  G    1  0  3  0  0  0
  T    0  3  0  3  4  8
Consensus motif: C A A A T T
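Counting symbols per column of an alignment and reading off the consensus can be sketched as follows. The code uses only the seven aligned rows shown in Example 10.1, so its column totals are 7, whereas the matrix on the slide sums to 8 per column; the function names and the tie-breaking rule are assumptions made for illustration.

```python
ALPHABET = "ACGT"


def count_matrix(aligned):
    """Return {symbol: [count per column]} for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    counts = {a: [0] * length for a in ALPHABET}
    for seq in aligned:
        for i, symbol in enumerate(seq):
            counts[symbol][i] += 1
    return counts


def consensus(counts):
    """Most frequent symbol per column (ties go to the first symbol in ALPHABET)."""
    length = len(counts[ALPHABET[0]])
    return "".join(max(ALPHABET, key=lambda a: counts[a][i]) for i in range(length))


# The seven aligned 6-mers shown in Example 10.1
rows = ["CTGAAT", "CTATAT", "CAATAT", "CTAATT", "CAGATT", "CAAATT", "CAGATT"]
counts = count_matrix(rows)
print(consensus(counts))  # CAAATT
```

The consensus obtained from these rows agrees with the consensus motif C A A A T T given on the slide.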
Identifying motifs
* Start position, motif sequence and motif length are unknown
* PSSM = scoring from a multiple alignment
* What counts as a significant result? Compare the motif with the background model, i.e. the probability, estimated from the current set of sequences, that the motif occurs by pure chance
Identifying motifs [2]
* Finding the motif sequence by optimizing a scoring function is computationally extremely expensive
* Therefore heuristics have been proposed
* An example of a randomized and greedy heuristic is Gibbs sampling
* From here on, focus on an ungapped motif of fixed length, as is the case for the circadian rhythm in A. thaliana
Identifying motifs [3]
Ungapped motif of fixed length, as is the case for the circadian rhythm in A. thaliana
The PSSM from Example 10.1:
  A   0    5/8  5/8  5/8  4/8  0
  C   7/8  0    0    0    0    0
  G   1/8  0    3/8  0    0    0
  T   0    3/8  0    3/8  4/8  8/8
Identifying motifs [4]
* A motif is interesting if it is unlikely under the background distribution: column 6 is more unbalanced than column 1
* Scoring function for imbalance: the Kullback–Leibler divergence (KL divergence) of column i:
  $D_{KL}(p_i \,\|\, q_i) = \sum_k p_i[k] \log \frac{p_i[k]}{q_i[k]}$
* p_i[k] is the probability of observing symbol k at position i
* q_i[k] is the multinomial background model for symbol k at position i
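As a worked example of the KL divergence as an imbalance score, the sketch below compares columns 1 and 6 of the Example 10.1 PSSM. It assumes a uniform background (1/4 per base) and base-2 logarithms; neither choice is stated in the lecture, and the function name is illustrative.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits; terms with p[k] == 0 contribute 0 by convention."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform background
column_1 = {"A": 0.0, "C": 7 / 8, "G": 1 / 8, "T": 0.0}    # column 1 of the PSSM
column_6 = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0}        # column 6 of the PSSM

kl_1 = kl_divergence(column_1, background)
kl_6 = kl_divergence(column_6, background)
print(kl_1, kl_6)
```

With these assumptions column 6 scores 2 bits and column 1 about 1.46 bits, which matches the statement that column 6 is the more unbalanced of the two.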
Identifying motifs [5]
* To avoid zero entries and the resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 to each entry
PSSM:
  A   0  5  5  5  4  0
  C   7  0  0  0  0  0
  G   1  0  3  0  0  0
  T   0  3  0  3  4  8
PSSM + pseudocounts:
  A   1  6  6  6  5  1
  C   8  1  1  1  1  1
  G   2  1  4  1  1  1
  T   1  4  1  4  5  9
Identifying motifs [6]
PSSM with pseudocounts has no zeros:
  A   1/12  6/12  …
  C   8/12  1/12  …
  G   2/12  1/12  …
  T   1/12  4/12  …
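The pseudocount step can be reproduced in a few lines. This is a minimal sketch that starts from the count matrix given on the slides; the function names and the dictionary layout are assumptions made for illustration.

```python
def add_pseudocounts(counts, pseudo=1):
    """Add the same pseudocount to every entry of a count matrix."""
    return {a: [c + pseudo for c in col] for a, col in counts.items()}


def to_frequencies(counts):
    """Divide every entry by its column total so that each column sums to 1."""
    symbols = list(counts)
    length = len(counts[symbols[0]])
    totals = [sum(counts[a][i] for a in symbols) for i in range(length)]
    return {a: [counts[a][i] / totals[i] for i in range(length)] for a in symbols}


# Count matrix as given on the slides (rows A, C, G, T; columns 1 to 6)
counts = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}
pssm = to_frequencies(add_pseudocounts(counts))
print(pssm["C"][0])  # 8/12
```

Every column total rises from 8 to 12, which is where the denominators 12 in the table come from.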
Finding high-scoring motifs
* Sequence s of length n (> L, the length of the PSSM)
* Slide the PSSM along the sequence and compute the likelihood of the window starting at each position j:
  $L(j) = \prod_{i=1}^{L} p_i[s_{j+i-1}]$
* With this algorithm, find the starting position (the j with the highest value) and the most probable motif (the window at the argmax of L(j))
NOTE: in practice use the log-likelihood l(j) = log L(j)
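Sliding the PSSM along a sequence can be sketched as below. The example uses the pseudocount matrix from the slides and the first sequence of Example 10.1; the function name is an assumption, and positions are 0-based here, unlike the 1-based convention usual in formulas.

```python
import math

# Pseudocount matrix from the slides; every column sums to 12
PSSM_COUNTS = {
    "A": [1, 6, 6, 6, 5, 1],
    "C": [8, 1, 1, 1, 1, 1],
    "G": [2, 1, 4, 1, 1, 1],
    "T": [1, 4, 1, 4, 5, 9],
}
COLUMN_TOTAL = 12


def log_likelihoods(sequence, counts=PSSM_COUNTS, total=COLUMN_TOTAL):
    """Log-likelihood l(j) of the window starting at each 0-based position j."""
    width = len(counts["A"])
    scores = []
    for j in range(len(sequence) - width + 1):
        window = sequence[j:j + width]
        scores.append(sum(math.log(counts[s][i] / total) for i, s in enumerate(window)))
    return scores


sequence = "ATGCTGAATGTA"  # first sequence of Example 10.1
scores = log_likelihoods(sequence)
best_j = max(range(len(scores)), key=scores.__getitem__)
print(best_j, sequence[best_j:best_j + 6])
```

For this sequence the highest-scoring window starts at 0-based position 3 (CTGAAT), the same site that is marked in Example 10.1.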
Finding high-scoring motifs [2]
ALGORITHM FOR FINDING TFBS MOTIFS:
0. Start with a random location j and a random PSSM
Iterate:
  1. With fixed j, optimize the PSSM
  2. With fixed PSSM, optimize j
until the result has converged
This is the EM algorithm
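The alternating scheme above can be sketched as follows. This is a simplified hard-assignment variant (each sequence contributes exactly one window), not a full EM implementation with soft assignments; it starts from random locations only and derives the first PSSM from them. All function names, the pseudocount of 1, and the fixed seed are assumptions, and the result depends on the random start.

```python
import math
import random

ALPHABET = "ACGT"


def build_pssm(seqs, starts, width, pseudo=1.0):
    """Column-wise frequencies (with pseudocounts) of the windows at the given starts."""
    counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
    for seq, j in zip(seqs, starts):
        for i in range(width):
            counts[i][seq[j + i]] += 1
    return [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]


def best_start(seq, pssm):
    """Start position with the highest log-likelihood under the PSSM."""
    width = len(pssm)

    def score(j):
        return sum(math.log(pssm[i][seq[j + i]]) for i in range(width))

    return max(range(len(seq) - width + 1), key=score)


def alternate(seqs, width, seed=0, max_iter=50):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # step 0: random locations
    for _ in range(max_iter):
        pssm = build_pssm(seqs, starts, width)              # fixed j: re-estimate the PSSM
        new_starts = [best_start(s, pssm) for s in seqs]    # fixed PSSM: re-optimize j
        if new_starts == starts:                            # converged
            break
        starts = new_starts
    return starts, build_pssm(seqs, starts, width)


# The seven sequences of Example 10.1
seqs = ["ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
        "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC"]
starts, pssm = alternate(seqs, width=6)
print(starts)
```

Because each step only improves on the current state, the procedure can stop in a local optimum; running it from several seeds and keeping the best-scoring result is a common precaution.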
Finding high-scoring motifs [3]
Gibbs sampling to avoid local optima:
* Choose the location at random, as an alternative to always using the location with the highest score
* Use a simple assumption: e.g. there is no variation, so look for a fixed sequence
* Figure 10-1 shows the log-likelihood score, and therefore the optimal locations of the motifs
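A minimal sketch of a Gibbs-style motif sampler is given below. It differs from the greedy update in one place: the new location in a sequence is drawn at random in proportion to its likelihood instead of being set to the highest-scoring one. The leave-one-out profile, the uniform background implied by using raw likelihoods as weights, the pseudocount, the number of sweeps and all names are assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"


def profile(seqs, starts, width, skip, pseudo=1.0):
    """PSSM frequencies built from all current windows except that of sequence `skip`."""
    counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
    for n, (seq, j) in enumerate(zip(seqs, starts)):
        if n == skip:
            continue
        for i in range(width):
            counts[i][seq[j + i]] += 1
    return [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]


def gibbs(seqs, width, sweeps=200, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for n, seq in enumerate(seqs):
            pssm = profile(seqs, starts, width, skip=n)
            weights = []
            for j in range(len(seq) - width + 1):
                w = 1.0
                for i in range(width):
                    w *= pssm[i][seq[j + i]]
                weights.append(w)
            # Sample the new start in proportion to its likelihood (not the argmax)
            starts[n] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# The seven sequences of Example 10.1
seqs = ["ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
        "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC"]
starts = gibbs(seqs, width=6)
print(starts)
```

The sampled locations vary from run to run and with the seed; in practice one would track the best-scoring configuration seen during sampling rather than report only the final state.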
10.4 Case study: the circadian rhythm
* Harmer et al. (2000): the clock-regulated elements of A. thaliana are activated in the evening, hence the name Evening Element (EE)
* Cluster the expression profiles and consider the clusters with appropriate periodicity: they are candidates for containing the EE
METHOD:
* Look only at motifs of fixed length (9): consider all words of length 9 whose frequency in the evening cluster is very different from their frequency in the rest of the data.
* Therefore examine all words of fixed length 9 in both sequences seq1 and seq2 (considering also the reverse complement).
* The motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in clusters 1 and 3). The top ten 9-mers are computed and shown.
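The margin-based word count described above can be sketched as follows. Frequency is taken here as the relative frequency of a word among all k-mer occurrences on both strands, which is one plausible reading of the method rather than a confirmed detail; the function names and the toy sequences are invented for illustration and are not the study's data.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seqs, k):
    """Relative frequency of each k-mer over both strands of all sequences."""
    counts = Counter()
    for seq in seqs:
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def top_margins(cluster, rest, k=9, top=10):
    """Words of the cluster sorted by margin = frequency in cluster - frequency in rest."""
    f_in, f_out = kmer_frequencies(cluster, k), kmer_frequencies(rest, k)
    margins = {w: f - f_out.get(w, 0.0) for w, f in f_in.items()}
    return sorted(margins.items(), key=lambda item: item[1], reverse=True)[:top]


# Toy upstream sequences (invented): two contain AAAATATCT, the rest do not
evening = ["GGCAAAATATCTGC", "TTAAAATATCTCGA"]
others = ["GCGCGCGCGCGCGC", "CCGGTTAACCGGTT"]
for word, margin in top_margins(evening, others, top=3):
    print(word, round(margin, 4))
```

Because both strands are counted, a word and its reverse complement always receive the same count within a cluster, so they appear together near the top of the list.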
METHOD [2]:
* The obtained set of motifs contains a lot of repeats (either of single letters or of 2-mers). They likely have no biological significance and must be filtered out.
* After eliminating the repeats, we observe that the most significant element is the motif AAAATATCT.
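One simple way to filter such repeats is to discard every word that consists entirely of one repeated letter or one repeated 2-mer. The exact filter used in the case study is not given, so the rule and the function name below are assumptions.

```python
def is_simple_repeat(word):
    """True if the word is a repetition of a single letter or of one 2-mer."""
    for unit_len in (1, 2):
        unit = word[:unit_len]
        # Repeat the unit past the word length, then trim to the same length
        repeated = (unit * (len(word) // unit_len + 1))[:len(word)]
        if word == repeated:
            return True
    return False


candidates = ["AAAAAAAAA", "ATATATATA", "AAAATATCT", "TCTCTCTCT"]
kept = [w for w in candidates if not is_simple_repeat(w)]
print(kept)  # ['AAAATATCT']
```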
METHOD [3]:
* The EE element is the motif AAAATATCT.
* We know from the study of Harmer et al. that it corresponds to the evening element (a word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).
* To assess the significance of the margin found for the evening element, we perform 100 random splits of the data and measure the margin of the highest-scoring element.
METHOD [4]:
* In 100 trials we never observe a margin larger than 0.000147462.
* We can look in detail at the frequency of the evening element among all the clock-regulated genes:
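The random-split check can be sketched as below: the sequences are repeatedly shuffled into a pseudo-cluster and a remainder, and the largest margin of any word is recorded each time. For brevity the sketch counts the forward strand only and runs on synthetic random sequences; the function names, the group size, and the word length used in the demonstration are all assumptions.

```python
import random
from collections import Counter


def kmer_freqs(seqs, k):
    """Relative frequency of each k-mer (forward strand only, for brevity)."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def max_margin(group, rest, k):
    """Largest difference between a word's frequency in group and in rest."""
    f_in, f_out = kmer_freqs(group, k), kmer_freqs(rest, k)
    return max((f - f_out.get(w, 0.0) for w, f in f_in.items()), default=0.0)


def random_split_margins(all_seqs, group_size, k=9, trials=100, seed=0):
    """Highest margin observed in each of `trials` random splits of the data."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        shuffled = all_seqs[:]
        rng.shuffle(shuffled)
        results.append(max_margin(shuffled[:group_size], shuffled[group_size:], k))
    return results


# Synthetic data for demonstration only: 20 random sequences of 50 bases
toy_rng = random.Random(1)
toy = ["".join(toy_rng.choice("ACGT") for _ in range(50)) for _ in range(20)]
null = random_split_margins(toy, group_size=5, k=4, trials=100)
print(max(null))
```

An observed margin would then be compared with the largest value in `null`; a margin that random splits never reach is unlikely to be a chance effect of the clustering.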
EE count and circadian rhythm in genes
  Circadian time:    0    4    8   12   16   20
  Number of genes:  78   45  124   67   30   93
  EE count:          5    6   49   27    8    8
METHOD [6]:
* The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor is this motif limited to these genes.
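The point is easier to see as a ratio per time point. The sketch below divides EEcount by Ngenes using the values from the table; treating the ratio as the fraction of genes carrying the element assumes that each EE is counted at most once per gene, which the slides do not state.

```python
circadian_time = [0, 4, 8, 12, 16, 20]
Ngenes = [78, 45, 124, 67, 30, 93]   # number of clock-regulated genes per time point
EEcount = [5, 6, 49, 27, 8, 8]       # evening element count per time point

EEfraction = [e / n for e, n in zip(EEcount, Ngenes)]
for t, e, n, f in zip(circadian_time, EEcount, Ngenes, EEfraction):
    print(f"CT {t:2d}: {e:2d} / {n:3d} = {f:.2f}")
```

The ratio peaks at circadian times 8 and 12 (about 0.40 in both cases) but stays well below 1 there and above 0 elsewhere, in line with the statement above.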
Finding the motif length
* Compare the log-likelihood score relative to the background model for motif length L: