Transcription factor binding motifs (part II) 10/22/07

Information from negative control • Motivation: combine information from TF binding and non-binding sequences to identify discriminative information. • Methods: • REDUCE (Bussemaker et al. 2001) • Motif Regressor (Conlon et al. 2003)

Motif Regressor Algorithm • Rank all genes by expression and obtain their upstream sequences • Use MDscan to find motifs from most induced and most repressed genes • Score each upstream sequence for matches to each MDscan reported motif • Perform simple linear regression between motif-matching score and gene expression to remove insignificant motifs • Perform stepwise regression on the significant motifs to find groups acting together to affect expression

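Motif matching score • Extract the upstream sequence Xmg (e.g. 800 bp) of each gene g. • Define the motif-match score, as in Conlon et al. 2003, $S_{mg} = \log_2 \sum_{x \in X_{mg}} \frac{\Pr(x \mid \theta_m)}{\Pr(x \mid \theta_0)}$, which measures the overall enrichment of motif m upstream of gene g. • The sum runs over all sliding windows x of the motif's width in Xmg; $\theta_m$ is the motif matrix and $\theta_0$ is the background model.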
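The motif-match score (log2 of the likelihood ratios summed over sliding windows) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `motif_match_score`, the list-of-dicts matrix layout, and the zero-order (independent bases) background are all choices made here, not taken from the slides; Conlon et al. 2003 use a higher-order Markov background.

```python
import math

def motif_match_score(seq, pwm, background):
    """Return log2 of the summed likelihood ratios over all sliding windows.

    seq        : upstream sequence, uppercase A/C/G/T only (assumption)
    pwm        : list of dicts, one per motif position, base -> probability
    background : dict base -> probability (zero-order model, an assumption)
    All probabilities must be > 0, e.g. after adding pseudocounts.
    """
    width = len(pwm)
    total = 0.0
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for pos, base in enumerate(seq[start:start + width]):
            # Likelihood of this base under the motif versus the background.
            ratio *= pwm[pos][base] / background[base]
        total += ratio
    return math.log2(total)
```

A sequence containing a good match gets a higher score than one without, because a single strong window dominates the sum.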

Motif Regressor Approach • Look at one expression experiment. [Figure: genes ranked by expression log ratio; MDscan looks for candidate motifs and refines them; regress between upstream motif-match score and downstream expression.]

Motif Regressor Linear Regression

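Motif Regressor Linear Regression • Multiple regression model: expression explained as the sum of motifs' effects: $Y_g = \alpha + \sum_m \beta_m S_{mg} + \varepsilon_g$ • $Y_g$: expression of gene g • $\alpha$: baseline expression • $\beta_m$: regression coefficient of motif m • $S_{mg}$: upstream motif-match score of motif m for gene g • $\varepsilon_g$: error term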
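As a rough illustration of the multiple regression model, the sketch below fits baseline expression and one coefficient per motif by ordinary least squares, solving the normal equations in plain Python. The helper names (`solve`, `fit_ols`) and the toy data are invented for this example; a real analysis would use a statistics package and would also report significance for each coefficient.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_ols(y, columns):
    """Fit y = alpha + sum_m beta_m * column_m; return ([alpha, betas...], RSS)."""
    rows = [[1.0] + [col[g] for col in columns] for g in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yg for r, yg in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yg - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yg in zip(rows, y))
    return beta, rss

# Toy data (hypothetical): 8 genes, 2 motif-match score columns, small noise.
s1 = [1, 1, 1, 1, -1, -1, -1, -1]
s2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
y = [1 + 2 * a - 1.5 * b + e for a, b, e in zip(s1, s2, noise)]
coefficients, residual_ss = fit_ols(y, [s1, s2])
```

In this toy design the columns are mutually orthogonal, so the fit recovers the baseline and the two motif effects used to generate `y`.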

Further motif selection by stepwise regression • Stepwise regression to further select significant motifs. • Step 1: Include only the intercept. • Step 2: Sequentially add the new motifs that give the largest reduction in error. • Step 3: Sequentially remove the motifs whose removal gives the smallest increase in error. • Repeat Steps 2 and 3 until convergence.
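The three steps can be sketched as a small forward/backward loop. The slides do not say when to stop adding or removing, so this sketch assumes a single hypothetical threshold `min_gain` on the change in residual sum of squares (RSS); real implementations usually use an F-test or an information criterion instead. The function names and toy data are invented here, and collinear score columns are not handled.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def rss(y, columns):
    """Residual sum of squares of the least-squares fit with an intercept."""
    rows = [[1.0] + [col[g] for col in columns] for g in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yg for r, yg in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yg - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yg in zip(rows, y))

def stepwise_select(y, scores, min_gain):
    """scores maps motif name -> list of match scores (one per gene)."""
    def error(names):
        return rss(y, [scores[m] for m in names])

    selected = []                      # Step 1: intercept only
    current = error(selected)
    changed = True
    while changed:                     # repeat Steps 2 and 3 until no change
        changed = False
        while True:                    # Step 2: add the most helpful motif
            candidates = [m for m in scores if m not in selected]
            if not candidates:
                break
            best = min(candidates, key=lambda m: error(selected + [m]))
            new = error(selected + [best])
            if current - new <= min_gain:
                break
            selected.append(best)
            current, changed = new, True
        while selected:                # Step 3: drop the least useful motif
            worst = min(selected,
                        key=lambda m: error([k for k in selected if k != m]))
            new = error([k for k in selected if k != worst])
            if new - current >= min_gain:
                break
            selected.remove(worst)
            current, changed = new, True
    return selected

# Toy data (hypothetical): m1 and m2 affect expression, m3 does not.
toy_scores = {
    "m1": [1, 1, 1, 1, -1, -1, -1, -1],
    "m2": [1, 1, -1, -1, 1, 1, -1, -1],
    "m3": [1, -1, 1, -1, 1, -1, 1, -1],
}
toy_noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
toy_y = [1 + 2 * a - 1.5 * b + e
         for a, b, e in zip(toy_scores["m1"], toy_scores["m2"], toy_noise)]
chosen = stepwise_select(toy_y, toy_scores, min_gain=1.0)
```

Using one threshold for both directions means every accepted move lowers RSS + `min_gain` × (number of motifs), so the loop cannot cycle.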

Application • Yeast cells are grown under amino acid starvation. • Gene expression (~6000 genes) was measured at 30 minutes after amino acid starvation. • Motif Regressor was applied to identify sequence motifs.

Comparative genomics • Evolutionary tree • Darwin's principle of evolution • Cross-species sequence alignment • Conservation of genes • Conservation of regulatory sequence • Quantifying sequence conservation • Methods • MCS score (Kellis) • PhyloCon • Results • Yeast (Kellis) • Advantage: no requirement for prior functional information • Drawback: species-specific motifs may not be learned (Fraenkel)

Non-uniform conservation rates • Genes are typically conserved • Intergenic regions are typically not conserved • Why?

Motif finding using multiple genomes • Basic assumption: functional sequences evolve more slowly than non-functional sequences, because they are subject to selection pressure. • Basic approach: • Identify conserved regions with sequence alignment algorithms. • Restrict motif finding to the conserved regions.

Gal4 motif is highly conserved Motif: Gal4 – CGGNNNNNNNNNNNCCG
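For a consensus as simple as CGG, eleven unspecified bases, then CCG, candidate sites can be located with a regular expression. The sketch below is illustrative only; the function name `find_gal4_sites` is made up here. Because the consensus is its own reverse complement, scanning one strand is enough.

```python
import re

# CGG, any 11 bases, CCG. The lookahead lets overlapping sites be reported.
GAL4_PATTERN = re.compile(r"(?=(CGG[ACGT]{11}CCG))")

def find_gal4_sites(seq):
    """Return (0-based start, matched 17-mer) for each consensus match."""
    return [(m.start(), m.group(1)) for m in GAL4_PATTERN.finditer(seq.upper())]
```

For example, `find_gal4_sites("TT" + "CGG" + "A" * 11 + "CCG" + "TT")` reports a single site starting at position 2.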

Methods • Wasserman et al. 2000 • MCS (Kellis et al. 2003; Xie et al. 2005) • PhyloCon (Wang and Stormo 2003) • EMnEM (Moses et al. 2004) • OrthoMEME (Prakash et al. 2004) • PhyME (Sinha et al. 2004) • CompareProspector (Liu et al. 2004) • PhyloGibbs (Siddharthan et al. 2005) • Ortholog Sampler (Li and Wong 2005) • MultiModule (Zhou and Wong 2005)

MCS Basic Idea • Select the highly conserved motifs: pobs >> p0 (Xie et al. 2005). [Figure: frequency distribution of conservation rate, with p0 and pobs marked.]

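MCS • Definition of MCS (Xie et al. 2005): $\mathrm{MCS} = \frac{(p_{obs} - p_0)\sqrt{n}}{\sqrt{p_0 (1 - p_0)}}$, where $p_{obs}$ is the observed frequency (conservation rate) of the motif, $n$ is its total number of occurrences, and $p_0$ is the expected frequency. • p0 is estimated by random sampling. • Choose the cutoff at MCS = 6. [Figure: frequency versus conservation rate, with p0 and pobs marked.]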
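A worked number can make the score concrete. The sketch below assumes that MCS is the binomial z-score of the observed conservation rate against the expected rate p0, which is how the labels on this slide (observed frequency, total number of occurrences, expected frequency) read; the function name and the example counts are invented for illustration.

```python
import math

def motif_conservation_score(n_total, n_conserved, p0):
    """Z-score of the observed conservation rate against the expected rate p0.

    n_total     : total number of occurrences of the motif
    n_conserved : number of those occurrences that are conserved
    p0          : expected conservation rate (estimated from random motifs)
    """
    p_obs = n_conserved / n_total
    return (p_obs - p0) * math.sqrt(n_total) / math.sqrt(p0 * (1.0 - p0))

# Hypothetical motif: 400 occurrences, 100 conserved, expected rate 10%.
example_mcs = motif_conservation_score(400, 100, 0.1)
```

Here the observed rate is 0.25 against an expected 0.10, giving a score of 10, which would pass the MCS = 6 cutoff.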

Application to human regulatory motifs

Results

Tissue specificity of detected motifs

PhyloCon Basic Idea (Wang and Stormo 2003) • Both sequence conservation and gene co-regulation information are used for motif finding. • Orthologous regions are viewed as sequence profiles. • Sequence profiles are aligned instead of sequences. [Figure: orthologous sequences from species 1 to 4 combined into a profile.]

PhyloCon

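Profile Comparison • Compare two columns first. • $f_b = \{f_A, f_C, f_G, f_T\}$: base frequencies in a column of the profile • $p_b = \{p_A, p_C, p_G, p_T\}$: background base frequencies • $n_b = \{n_A, n_C, n_G, n_T\}$: observed counts at the specified position • Likelihood ratio: $LR = \prod_b \left( f_b / p_b \right)^{n_b}$ • Log-likelihood ratio: $LLR = \sum_b n_b \ln \left( f_b / p_b \right)$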

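Profile Comparison • Compare two columns first. • The ALLR (average log-likelihood ratio) measures the similarity between two columns i and j: $ALLR = \frac{\sum_b n_{bj} \ln (f_{bi} / p_b) + \sum_b n_{bi} \ln (f_{bj} / p_b)}{\sum_b (n_{bi} + n_{bj})}$, where $f_{bi}$ and $f_{bj}$ are the column frequencies, $p_b$ is the background, and the denominator is the total counts. • Sum the ALLR over all positions to get a score comparing two profiles.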
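The column comparison can be sketched in Python following the ALLR of Wang and Stormo 2003: the counts of each column are scored against the frequencies of the other, and the sum is divided by the total counts. The function names, the dict-based column layout, and the pseudocount used to avoid zero frequencies are assumptions made for this example.

```python
import math

BASES = "ACGT"

def column_frequencies(counts, background, pseudocount=1.0):
    """Convert base counts to frequencies, smoothed toward the background."""
    total = sum(counts[b] for b in BASES) + pseudocount
    return {b: (counts[b] + pseudocount * background[b]) / total for b in BASES}

def allr(counts_i, counts_j, background, pseudocount=1.0):
    """Average log-likelihood ratio between two profile columns."""
    f_i = column_frequencies(counts_i, background, pseudocount)
    f_j = column_frequencies(counts_j, background, pseudocount)
    # Counts of column j scored with frequencies of column i, and vice versa.
    numerator = sum(counts_j[b] * math.log(f_i[b] / background[b]) +
                    counts_i[b] * math.log(f_j[b] / background[b])
                    for b in BASES)
    total_counts = sum(counts_i[b] + counts_j[b] for b in BASES)
    return numerator / total_counts

def profile_score(profile_1, profile_2, background):
    """Sum of ALLR over aligned columns of two equal-length profiles."""
    return sum(allr(c1, c2, background) for c1, c2 in zip(profile_1, profile_2))
```

Two columns that agree on a conserved base score positively, while columns conserved for different bases score negatively.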

Profile merging • Iteratively merge non-orthologous groups that have high ALLR scores.

Sampling motifs on phylogenetic trees • Motivation: the alignment-based method does not work well if the species are distant. • Basic idea • Avoid aligning multiple species to gather orthologous gene information. • Directly model the evolution of the genomic sequences. • Assume that motifs evolve more slowly than background sequences.

An evolution model

Evolution model Probability of a nucleotide change

Main Algorithm • Step 1: Build an evolution model. • Motif evolution is modeled by decreasing the branch length by a fixed rate, say 50%. • Step 2: Infer the model parameters by using a Gibbs sampler.
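To see what shortening the branch length does, the sketch below uses the Jukes-Cantor substitution model as a stand-in. This is an assumption: the evolution model actually used in the lecture is not recoverable from the transcript, and the function names and the `shrink` parameter are invented here. Under Jukes-Cantor, a site on a branch of length t (expected substitutions per site) ends up as a different base with probability 3/4 · (1 − exp(−4t/3)).

```python
import math

def jc_change_probability(branch_length):
    """Jukes-Cantor probability that a site differs after the given branch."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

def motif_change_probability(branch_length, shrink=0.5):
    """Motif sites evolve on a branch shortened by a fixed factor (50% here)."""
    return jc_change_probability(shrink * branch_length)
```

With a branch length of 0.4, a background site changes more often than a motif site, which is the signal the sampler exploits.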

Limitation of comparative genomics approach • Species-specific motifs cannot be learned from this approach.

Divergence of TF binding Borneman et al. 2007

Divergence of TF binding • Divergent binding can be caused by: • divergence of TF motifs (e.g. Ste12) • or some unknown mechanism (e.g. Tec1) Borneman et al. 2007

Other directions • Combining multiple motif finding algorithms (e.g. Harbison et al. 2004, Jensen and Liu 2005). • Directly identifying TF binding sites through experiments (ChIP-chip), then applying motif finding algorithms to the experimental binding data (e.g. MDscan).

Challenge of Specificity • A given 7-mer is expected to occur once every 4^7 = 16,384 base pairs by chance. • In human, this means 3 × 10^9 / 16,384 ≈ 180,000 sites in total. • Total number of genes ~ 25,000. • Most predicted binding sites are false positives! • Other restrictive information is needed to reduce false positives.
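The arithmetic on this slide is easy to reproduce. The sketch below assumes uniform base frequencies and counts one strand only, as the slide does; the function name is made up for the example.

```python
def expected_chance_sites(motif_length, genome_size):
    """Return (expected spacing in bp, expected number of chance matches)."""
    spacing = 4 ** motif_length          # one match per 4^k positions
    return spacing, genome_size / spacing

spacing_bp, chance_sites = expected_chance_sites(7, 3e9)
sites_per_gene = chance_sites / 25_000   # roughly 7 chance sites per gene
```

With about 25,000 genes, this is several chance matches per gene for a single 7-mer, which is why sequence matches alone give mostly false positives.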

Some Biological Notes • TF binding does not imply that the binding is functional. • Some TFs always bind to DNA, but they are functional only when phosphorylated. • Predicted motif sites contain a large number of false positives. • Motifs are short DNA elements (~10 bp). Higher eukaryotes have large genomes, so these short elements may occur frequently by chance. • Epigenetic factors also play an important role in the regulation of TF binding. • Chromatin structure, histone modifications, DNA methylation, etc.

Reading list • Conlon et al. 2003 • Proposed Motif Regressor. Filters out motifs that are not associated with gene expression changes. • Xie et al. 2005 • MCS. Uses a comparative approach to identify human regulatory motifs. Highly biological. • Wang and Stormo 2003 • PhyloCon. An elegant "multi-gene, multi-species" approach for motif finding.

Acknowledgements • X.S.Liu
