260 likes | 457 Views
Regulatory sequence analysis based on a probabilistic model of evolution. Xin He 04/24/2008. cis-Regulatory Modules. Regulatory genomics It is when and where genes are expressed, not whether the genes exist in the genome per se , that is important for development and evolution
E N D
Regulatory sequence analysis based on a probabilistic model of evolution Xin He 04/24/2008
cis-Regulatory Modules • Regulatory genomics It is when and where genes are expressed, not whether the genes exist in the genome per se, that is important for development and evolution • How does a gene know when and where it should be expressed? The cis-regulatory sequence (CRM) of a gene is the information processing machinery that interprets the environmental signals
A. Expression pattern of Drosophila gene eve in early development B. The cis-regulatory module that controls the expression at stripe 2 C. The stripe 2 CRM interprets the expression patterns of transcription factors involved: only express eve at a narrow band
Properties of CRM • CRM contains a cluster of transcription factor binding sites (TFBS) • CRM is generally more conserved than other non-coding sequences in the genome
Problems Problem 1. alignment of CRM sequences from two or multiple species Problem 2. predict locations of CRM in the genome Problem 3. study the evolution of TFBS, in particular their gain and loss
Problems of the Existing Approach • Generic procedure • Alignment of sequences from multiple species • Analysis of the conservation pattern of sequences • Problems • Fixed alignment from general-purpose alignment programs that do not utilize the special properties of regulatory sequences • Often assume TFBS are always conserved
Ambiguous Alignments Observation: the bases GC could be put at 2 different positions with exactly the same total alignment scores. Alignment methods without knowing TFBS information will always make arbitrary decisions. Idea: use TFBS as anchors and construct the alignment?
Gain and Loss of TFBS • TFBS are not always conserved • Existing TFBS may be lost during evolution • New TFBS may be created during evolution • Evidence • Sequence comparison: yeast, fly, mouse, human, etc. • ChIP-chip analysis of yeast TFs
MORPH Alignment Observation: the only existing tool that does alignment of CRM while allowing species-specific TFBS is MORPH. However, if a TFBS is not conserved, it will not be aligned with its orthologous site.
Our Approach • Main ideas • Simultaneous sequence alignment, TFBS annotation and CRM prediction • Allowing TFBS gain and loss • Based on a probabilistic model of CRM evolution “The study of biological sequence data should not be divorced from the process that created it.” TKF
Model Overview • CRM model: background sequence and TFBS Two-state HMM: if at the background state, emit a single nucleotide from the background frequencies; if at the motif state, emit a binding site from PWM of this motif • Evolution of CRM • Evolution of the background sequences • Evolution of TFBS • Switching between background and TFBS, i.e. TFBS gain and loss
Insertion/Deletion Model • Model: • At each position: rate of λ to insert some sequence and rate of µ to delete some sequence starting from this position • The length of sequence inserted or deleted follows the geometric distribution with parameter r Transition probability: Poisson process with rate λ or µ. Assuming the rates are small, the probability of an indel event in time t is λt or µt. x: # # # # _ _ _ # y: # _ _ # # # # #
Insertion/Deletion Model Joint probability: the ancestor sequences are never observed, thus need to sum over all possible histories. _ # t1 t2 t1 t2 _ _ # # Similarly: For the previous example:
TFBS Evolution Model • Intuition: the more important positions in the motif tend to be more conserved (evolve more slowly) • Model: Halpern-bruno model. The rate of nucleotide substitution a to b in the i-th position depends on: • The mutation rate of a to b, denoted as Q0(a,b); • The equilibrium distribution of a and b, denoted as pi(a) and pi(b). If a is more frequent in the equilibrium (larger in the PWM), then more likely to change from b to a.
Model of TFBS Gain and Loss • Fitness of a site depends on its binding energy • There is an energy threshold below which natural selection will not work • Loss: if during evolution, the energy of a TFBS “accidentally” drops below the threshold, natural selection will no longer be able to perceive it. This TFBS is lost (switch to the background mode of evolution) • Gain: if during evolution, the energy of a background site happens to reach the threshold, this site is immediately “visible” to natural selection, and a new TFBS is born (switch to the TFBS mode of evolution)
Probability Computation for TFBS Gain and Loss x t’ z z’ Loss t - t’ y The energy of z and z’ satisfy: E(z) > E0 and E(z’) < E0 where E0 is the energy threshold. Notation: Ψ,Ψ0 represent TFBS and background models respectively. Q is the transition rate matrix of the TFBS model
Extracting Functional Switching Events Parsimony assumption: suppose z is an intermediate site between x and y, then the symbol at any position of z is either the symbol of x or of y in that position. z could be: {TAATC, T_ACC, T_ATC, TAACC} x = TAATC y = T_ACC • Procedure (for extracting loss events): • Enumerate all possible intermediates between x and y, find those whose energy is larger than the threshold • For any such intermediate functional site z, find its neighbor z’ (related by a single mutation event) that is not functional. • Output all such pairs (z,z’)
Loss Gain Functional Switching in a Two-species Tree
Formulation of the Model Parameters and variables: suppose there are K motifs w1, …, wK : the weight of the motifs in the CRM w0: the weight of the background in the CRM π: the stationary distribution of the background nucleotides Ψ0 : the background substitution model λ, μ: insertion and deletion rate Ψk: the substitution model of the k-th motif • Problem: given a putative CRM sequences S1 and S2 from two species with branch lengths equal to t1 and t2: • Compute the maximum posterior probability alignment; • Compute the probability that S1 and S2 are generated from the CRM evolution model.
Dynamic Programming Algorithm Idea: suppose the problem is to compute the probability of two sequences under CRM evolution model, sum over all possible alignments and TFBS annotations with dynamic programming. Variables: let L(i,j) = P(S1[1..i], S2[1..j]|t1,t2) and define be the probability of S1[1..i] and S2[1..j] where the last site is k-th TFBS or background (if k = 0) with state equal to a (a = 0,1,2) # # # _ _ #
Dynamic Programming Case 1: the last site is a matched background column Case 2, 3: the last site is a gap in the background sequence
Dynamic Programming Case 4: the last site is a conserved TFBS • Case 5,6: the last site is a non-conserved TFBS • Length change in TFBS: the length of the non-functional site may change, thus in the algorithm, need to enumerate all possible lengths; • Indels crossing the background-TFBS boundary: the non-functional site may start with a gap, which should be merged with the previous gap (single event in fact)
CRM Prediction Method: to predict if a sequence is CRM, test the hypothesis about the generation of this sequence and its orthologous sequence: M1: the evolutionary model of CRM, i.e. motif weights > 0 M0: the evolution model of the background sequence, i.e. motif weights = 0 Suppose θ represents all unknown parameters, do log-likelihood ratio test: