Transcription factor binding motifs (part II). 10/22/07. Information from negative control. Motivation: combine information from TF binding and non-binding sequences to identify discriminative information. Methods: REDUCE (Bussemaker et al. 2001), Motif Regressor (Conlon et al. 2003).
Motif Regressor Algorithm • Rank all genes by expression and obtain their upstream sequences • Use MDscan to find motifs from most induced and most repressed genes • Score each upstream sequence for matches to each MDscan reported motif • Perform simple linear regression between motif-matching score and gene expression to remove insignificant motifs • Perform stepwise regression on the significant motifs to find groups acting together to affect expression
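The single-motif filtering step can be sketched as follows. This is a minimal illustration, not the Motif Regressor implementation: the function name `slope_t_statistic`, the toy scores, and the cutoff `|t| > 2` are all assumptions made for the example (the published method uses its own significance criterion).

```python
import math

def slope_t_statistic(scores, expression):
    """Fit expression = a + b * score by least squares; return (b, t-statistic of b)."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid_ss = sum((y - intercept - slope * x) ** 2
                   for x, y in zip(scores, expression))
    std_err = math.sqrt(resid_ss / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / std_err

# Toy data (hypothetical): expression tracks motif_1 but not motif_2.
expression = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
motif_scores = {
    "motif_1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "motif_2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
}

T_CUTOFF = 2.0  # illustrative threshold, not taken from the paper
kept = [name for name, s in motif_scores.items()
        if abs(slope_t_statistic(s, expression)[1]) > T_CUTOFF]
```

Here `kept` holds the motifs whose scores are individually associated with expression; only these would be passed on to the stepwise regression.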
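Motif matching score • Extract upstream sequence $X_{mg}$ (e.g. 800 bp) from each gene. • Define the score (Conlon et al. 2003) $S_{mg} = \log_2 \sum_{x \in X_{mg}} \frac{\Pr(x \mid \theta_m)}{\Pr(x \mid \theta_0)}$, which measures the overall enrichment of motif m upstream of gene g. • The sum runs over all sliding windows x of the motif width; $\theta_m$ is the motif matrix and $\theta_0$ is the background model.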
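A minimal sketch of a motif-matching score of this kind: the log2 of the summed motif-versus-background likelihood ratios over all sliding windows. It assumes a position weight matrix stored as a list of per-position dictionaries and, for brevity, a zero-order (independent-base) background; Conlon et al. use a higher-order Markov background. The toy matrix and sequences are hypothetical.

```python
import math

def motif_match_score(seq, pwm, background):
    """log2 of the sum, over all windows, of P(window | motif) / P(window | background)."""
    width = len(pwm)
    total = 0.0
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for pos, base in enumerate(seq[start:start + width]):
            ratio *= pwm[pos][base] / background[base]
        total += ratio
    # A sequence shorter than the motif has no windows at all.
    return math.log2(total) if total > 0 else float("-inf")

# Toy 3-bp motif that strongly prefers "CGG" (hypothetical values).
pwm = [
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

with_site = motif_match_score("ATCGGTA", pwm, background)
without_site = motif_match_score("ATATATA", pwm, background)
```

Because the ratios are summed before taking the log, one strong site or several weak sites can both raise the score, which is why it reads as overall enrichment rather than a best-hit score.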
Motif Regressor Approach • Look at one expression experiment • Look for candidate motifs with MDscan • Refine motifs • Regress between upstream motif-match score and downstream expression [Figure: genes ranked by expression log ratio]
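Motif Regressor Linear Regression • Multiple regression model: expression explained as the sum of motifs’ effects: $Y_g = \alpha + \sum_m \beta_m S_{mg} + \epsilon_g$ • $Y_g$: expression of gene g • $\alpha$: baseline expression • $\beta_m$: regression coefficient of motif m • $S_{mg}$: upstream motif-match score • $\epsilon_g$: error term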
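To make the multiple regression concrete, here is a small pure-Python least-squares fit of expression on an intercept plus several motif-match scores, solved through the normal equations. The function name `fit_linear_model` and the toy data are assumptions for illustration; in practice a statistics library would be used.

```python
def fit_linear_model(scores, expression):
    """Least-squares fit of Y_g = alpha + sum_m beta_m * S_mg.

    scores[g] is the list of motif-match scores for gene g.
    Returns [alpha, beta_1, ..., beta_M].
    """
    rows = [[1.0] + list(s) for s in scores]  # prepend the intercept column
    k = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, expression))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Toy data (hypothetical): 5 genes, 2 motifs, expression = 0.5 + 2*S_1 - 1*S_2.
scores = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [3.0, 2.0]]
expression = [0.5 + 2.0 * s1 - 1.0 * s2 for s1, s2 in scores]
alpha, beta1, beta2 = fit_linear_model(scores, expression)
```

A positive coefficient suggests the motif is associated with induction and a negative one with repression, under the model's additivity assumption.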
Further motif selection by stepwise regression • Stepwise regression to further select significant motifs. • Step 1: Include only the intercept. • Step 2: Sequentially add new motifs that give the largest reduction in error. • Step 3: Sequentially remove motifs that give the smallest increase in error. • Repeat Steps 2 and 3 until convergence.
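The add/remove loop above can be sketched as below. This is an illustrative version only: the entry and removal criteria (a minimum change in error expressed as a fraction of the total sum of squares, `min_gain`) and the `max_rounds` guard are assumptions of this sketch, not the criteria used by Motif Regressor. Residual error is computed by Gram-Schmidt orthogonalization so the example needs no external libraries.

```python
def rss(columns, y):
    """Residual sum of squares after regressing y on an intercept plus columns."""
    n = len(y)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]      # Step 1: intercept-only residual
    basis = []                           # mutually orthogonal, centered predictors
    for col in columns:
        mean_c = sum(col) / n
        q = [v - mean_c for v in col]
        for b in basis:
            coef = sum(u * v for u, v in zip(q, b)) / sum(v * v for v in b)
            q = [u - coef * v for u, v in zip(q, b)]
        norm2 = sum(v * v for v in q)
        if norm2 < 1e-12:
            continue                     # collinear with earlier columns
        coef = sum(u * v for u, v in zip(resid, q)) / norm2
        resid = [u - coef * v for u, v in zip(resid, q)]
        basis.append(q)
    return sum(v * v for v in resid)

def stepwise_select(motif_scores, y, min_gain=0.01, max_rounds=100):
    """Alternate add and remove steps until neither changes the model."""
    total = rss([], y)
    if total == 0:
        return []
    selected, current = [], total

    def rss_for(names):
        return rss([motif_scores[m] for m in names], y)

    for _ in range(max_rounds):
        changed = False
        # Step 2: add the motif giving the largest reduction in error.
        candidates = [m for m in motif_scores if m not in selected]
        if candidates:
            best = min(candidates, key=lambda m: rss_for(selected + [m]))
            new = rss_for(selected + [best])
            if (current - new) / total > min_gain:
                selected.append(best)
                current, changed = new, True
        # Step 3: remove the motif whose removal costs the least error.
        if selected:
            weakest = min(selected,
                          key=lambda m: rss_for([s for s in selected if s != m]))
            new = rss_for([s for s in selected if s != weakest])
            if (new - current) / total < min_gain / 2:
                selected.remove(weakest)
                current, changed = new, True
        if not changed:
            break
    return selected

# Toy data (hypothetical): expression depends on motifs A and B, not C.
motif_scores = {
    "A": [0, 1, 2, 3, 4, 5, 6, 7],
    "B": [1, 0, 1, 0, 1, 0, 1, 0],
    "C": [2, 1, 2, 2, 1, 2, 1, 1],
}
y = [1 + 2 * a - 3 * b for a, b in zip(motif_scores["A"], motif_scores["B"])]
selected = stepwise_select(motif_scores, y)
```

The removal threshold is set below the entry threshold so that a motif that has just been added is not immediately dropped again, which is a common way to avoid cycling in stepwise procedures.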
Application • Yeast cells are grown under amino acid starvation. • Gene expression (~6000 genes) was measured at 30 minutes after amino acid starvation. • Motif Regressor was applied to identify sequence motifs.
Comparative genomics • Evolutionary tree • Darwin’s principle from evolution • Cross-species sequence alignment • Conservation of genes • Conservation of regulatory sequence • Quantifying sequence conservation • Methods • MCS score (Kellis) • PhyloCon • Results • Yeast (Kellis) • Advantage: no requirement for prior functional information • Drawback: species-specific motifs may not be learned (Fraenkel)
Non-uniform conservation rates • Genes are typically conserved • Intergenic regions are typically not conserved • Why?
Motif finding by using multiple genomes • Basic assumption: functional sequences evolve more slowly than non-functional sequences, as they are subject to selection pressure. • Basic approach: • Identify conserved regions by sequence alignment algorithms • Restrict motif finding to conserved regions.
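As a rough illustration of the basic approach, the sketch below takes an already-computed multiple alignment and returns the column ranges where every species has the same base, so that a motif finder could be run on those ranges only. The function name, the window size, the identity threshold, and the toy alignment are all hypothetical; none of the published methods listed in this lecture is implemented here.

```python
def conserved_regions(alignment, window=5, min_identity=1.0):
    """Return (start, end) column ranges, end exclusive, covered by conserved windows.

    alignment: list of equal-length aligned sequences, '-' marks a gap.
    A column counts as conserved when all species share the same non-gap base.
    """
    identical = [len(set(col)) == 1 and col[0] != "-" for col in zip(*alignment)]
    regions = []
    for start in range(len(identical) - window + 1):
        frac = sum(identical[start:start + window]) / window
        if frac >= min_identity:
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], start + window)
            else:
                regions.append((start, start + window))
    return regions

# Toy alignment of three species (hypothetical sequences).
alignment = [
    "ATGCACGGATTACCGTTAGC",
    "GCATTCGGATTACCGACGTA",
    "TAC-GCGGATTACCGGATCG",
]
regions = conserved_regions(alignment)
search_space = [alignment[0][start:end] for start, end in regions]
```

Only the sequences in `search_space` would then be handed to a motif-finding algorithm, which shrinks the search space and removes much of the non-functional background.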
Gal4 motif is highly conserved Motif: Gal4 – CGGNNNNNNNNNNNCCG
Methods • Wasserman et al. 2000 • MCS (Kellis et al. 2003; Xie et al. 2005) • PhyloCon (Wang and Stormo 2003) • EMnEM (Moses et al. 2004) • OrthoMEME (Prakash et al. 2004) • PhyME (Sinha et al. 2004) • CompareProspector (Liu et al. 2004) • PhyloGibbs (Siddharthan et al. 2005) • Ortholog Sampler (Li and Wong 2005) • MultiModule (Zhou and Wong 2005)
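MCS Basic Idea • Select highly conserved motifs: those whose observed conservation rate is much higher than expected, $p_{obs} \gg p_0$ (Xie et al. 2005) [Figure: frequency versus conservation rate, with $p_0$ and $p_{obs}$ marked]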
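MCS • Definition of MCS: $\mathrm{MCS} = \frac{p_{obs} - p_0}{\sqrt{p_0 (1 - p_0) / N}}$, where $p_{obs}$ is the observed frequency, $p_0$ is the expected frequency, and $N$ is the total number of occurrences. • $p_0$ is estimated by random sampling. • Choose cutoff at MCS = 6 (Xie et al. 2005) [Figure: frequency versus conservation rate, with $p_0$ and $p_{obs}$ marked]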
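A short worked example of a conservation score of this form, assuming MCS is the binomial z-score of the number of conserved occurrences, that is (k − N·p0) / sqrt(N·p0·(1 − p0)) with k = N·p_obs. The counts and the value of p0 below are made up for illustration.

```python
import math

def motif_conservation_score(conserved, total, p0):
    """Standard deviations by which the conserved count exceeds its expectation."""
    expected = total * p0
    std_dev = math.sqrt(total * p0 * (1.0 - p0))
    return (conserved - expected) / std_dev

# Hypothetical motif: 1000 occurrences, 200 conserved, expected rate 10%.
score = motif_conservation_score(conserved=200, total=1000, p0=0.10)
passes_cutoff = score > 6  # cutoff quoted on the slide
```

With these numbers the motif is conserved twice as often as expected, and because it occurs 1000 times that excess is many standard deviations above chance; the same twofold excess over only 50 occurrences would not pass the cutoff.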
PhyloCon Basic Idea (Wang and Stormo 2003) • Both sequence conservation and gene co-regulation information are used for motif finding. • Orthologous regions are viewed as sequence profiles. • Sequence profiles, rather than individual sequences, are aligned. [Figure: orthologous sequences from species 1 to 4 combined into a profile]
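Profile Comparison • Compare two columns first. • $f_b = \{f_A, f_C, f_G, f_T\}$: base frequencies in a column of the profile • $p_b = \{p_A, p_C, p_G, p_T\}$: background base frequencies • $n_b = \{n_A, n_C, n_G, n_T\}$: observed counts at the specified position • Likelihood ratio: $LR = \prod_b (f_b / p_b)^{n_b}$ • Log-likelihood ratio: $LLR = \sum_b n_b \ln (f_b / p_b)$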
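Profile Comparison • Compare two columns first. • ALLR (average log-likelihood ratio) measures the similarity between two columns i and j: $ALLR = \frac{\sum_b n_{bj} \ln (f_{bi} / p_b) + \sum_b n_{bi} \ln (f_{bj} / p_b)}{\sum_b (n_{bi} + n_{bj})}$, where $f$ are the column frequencies, $p_b$ is the background, and the denominator is the total counts. • Sum ALLR over all positions to get a score comparing two profiles.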
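The column and profile comparison can be sketched as follows, using the ALLR form from Wang and Stormo (2003): each column's counts are scored against the other column's frequencies, and the sum is divided by the total counts. The pseudocount used to avoid log(0) and the toy columns are assumptions of this sketch.

```python
import math

BASES = "ACGT"

def column_freqs(counts, pseudocount=1.0):
    """Base frequencies of one profile column, smoothed by a pseudocount."""
    total = sum(counts[b] for b in BASES) + 4 * pseudocount
    return {b: (counts[b] + pseudocount) / total for b in BASES}

def allr(counts_i, counts_j, background):
    """Average log-likelihood ratio between two profile columns."""
    f_i = column_freqs(counts_i)
    f_j = column_freqs(counts_j)
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in BASES
    )
    total_counts = sum(counts_i[b] + counts_j[b] for b in BASES)
    return numerator / total_counts

def profile_score(profile_a, profile_b, background):
    """Sum of ALLR over aligned positions of two equal-length profiles."""
    return sum(allr(a, b, background) for a, b in zip(profile_a, profile_b))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
col_a1 = {"A": 9, "C": 0, "G": 0, "T": 1}   # mostly A
col_a2 = {"A": 8, "C": 1, "G": 1, "T": 0}   # mostly A
col_g = {"A": 0, "C": 0, "G": 10, "T": 0}   # all G

similar = allr(col_a1, col_a2, background)
dissimilar = allr(col_a1, col_g, background)
```

Columns that agree with each other and differ from the background score positive, while columns that disagree score negative, so summing over positions rewards alignments of genuinely similar profiles.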
Profile merging • Iteratively merge non-orthologous groups that have high ALLR scores.
Sampling motifs on phylogenetic trees • Motivation: The alignment-based method does not work well if the species are distant. • Basic idea • Avoid aligning multiple species to gather orthologous gene information. • Directly model the evolution of the genomic sequences. • Assume that motifs evolve more slowly than background sequences.
Evolution model Probability of a nucleotide change
Main Algorithm • Step 1: Building an evolution model. • Motif evolution is modeled by decreasing branch length by a fixed rate, say 50%. • Step 2: Infer model parameters by using a Gibbs sampler.
Limitation of comparative genomics approach • Species-specific motifs cannot be learned from this approach.
Divergence of TF binding Borneman et al. 2007
Divergence of TF binding • Divergent binding can be caused by: • divergence of TF motifs (e.g. Ste12) • or some unknown mechanism (e.g. Tec1) (Borneman et al. 2007)
Other directions • Combining multiple motif finding algorithms (e.g. Harbison et al. 2004; Jensen and Liu 2005). • Directly identify TF binding sites through experiments (ChIP-chip), then apply motif finding algorithms (e.g. MDscan) to the binding data.
Challenge of Specificity • A 7-mer is expected to occur every 4^7 = 16,384 base pairs by chance • In human, this means 3 × 10^9 / 16,384 ≈ 180,000 sites in total • Total number of genes ~ 25,000 • Most predicted binding sites are false positives! • Need other restrictive information to reduce false positives.
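The arithmetic behind these numbers can be checked directly. The sketch assumes equal base frequencies and counts one strand only; the genome size and gene count are the round figures quoted on the slide.

```python
MOTIF_LENGTH = 7
GENOME_SIZE = 3_000_000_000   # approximate human genome size in bp
GENE_COUNT = 25_000           # approximate number of human genes

# With four equally likely bases, a specific k-mer occurs once per 4**k positions.
spacing = 4 ** MOTIF_LENGTH
expected_sites = GENOME_SIZE / spacing
sites_per_gene = expected_sites / GENE_COUNT
```

Under these assumptions there are several chance matches per gene for a single exact 7-mer, before allowing for degenerate positions or the reverse strand, which is why sequence matching alone cannot identify functional sites.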
Some Biological Notes • TF binding does not imply that the binding is functional. • Some TFs always bind to DNA, but they are functional only if they are phosphorylated. • Motif sites contain a large number of false positives. • Motifs are short DNA elements (~10 bp). Higher eukaryotes have large genomes, so these short elements may occur frequently by chance. • Epigenetic factors also play an important role in regulation of TF binding. • Chromatin structure, histone modifications, DNA methylation, etc.
Reading list • Conlon et al. 2003 • Proposed Motif Regressor. Filters out motifs that are not associated with gene expression changes. • Xie et al. 2005 • MCS. Uses a comparative approach to identify human regulatory motifs. Highly biological. • Wang and Stormo 2003 • PhyloCon. An elegant “multi-gene, multi-species” approach for motif finding.
Acknowledgements • X.S.Liu