Comparative genomics to identify DNA binding motifs. Saurabh Sinha Dept. of Computer Science University of Illinois, Urbana-Champaign. Outline. Binding sites and motifs The motif finding problem in one species Comparative genomics and alignment
Comparative genomics to identify DNA binding motifs Saurabh Sinha Dept. of Computer Science University of Illinois, Urbana-Champaign
Outline • Binding sites and motifs • The motif finding problem in one species • Comparative genomics and alignment • The motif finding problem with comparative genomics
Motif finding in multiple species • Footprinter : the approach without alignments • PhyloCon : The use of alignments • PhyME & PhyloGibbs : The use of alignments and an evolutionary model • MCS : Genome-wide motif finding from multiple species
Binding sites • A few binding sites of transcription factor “Bicoid” in the Drosophila (fruitfly) genome, collected experimentally
T A A T C C C Motif http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif
Motif as a “consensus string”: W A A T C C N, where W = T or A and N = A, C, G, or T http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif
Motif • Common sequence “pattern” in the binding sites of a transcription factor • A succinct way of capturing variability among the binding sites
Alternative way to represent motif Position weight matrix (PWM) Or simply, “weight matrix”
Motif representation • Consensus string • May allow “degenerate” symbols in string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C etc. • Tractable search space, enumerative algorithms • Position weight matrix • More powerful representation • Probabilistic treatment, algorithms • More popular
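The two representations can be made concrete with a small sketch. The Python below is only an illustration: the function names and the four example sites are hypothetical (they are not the experimental Bicoid sites), and only the degenerate symbols listed on the slide are supported.

```python
from collections import Counter

# Degenerate symbols named on the slide, mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "N": "ACGT",
}

def matches_consensus(site, consensus):
    """True if each base of `site` is allowed by the consensus symbol at that position."""
    return len(site) == len(consensus) and all(
        base in IUPAC[symbol] for base, symbol in zip(site, consensus)
    )

def frequency_matrix(sites, pseudocount=0.0):
    """Per-position base frequencies for equal-length sites (one dict per column)."""
    total = len(sites) + 4 * pseudocount
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)  # missing bases count as 0
        matrix.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return matrix

# Hypothetical binding sites, invented for illustration only.
sites = ["TAATCCC", "AAATCCG", "TAATCCA", "TAATCCT"]
pwm = frequency_matrix(sites)
```

Here every example site matches the consensus string `WAATCCN`, while the frequency matrix additionally records that T is three times as common as A in the first column, which is the extra information a weight matrix carries.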
The motif finding problem (in one species) • Suppose a transcription factor (TF) regulates five different genes • Each of the five genes should have binding sites for the TF in its promoter region [Figure: promoter regions of Gene 1 to Gene 5, with the binding sites for the TF marked]
The motif finding problem • Now suppose we are given the promoter regions of the five genes G1, G2, … G5 • Can we find the binding sites of TF, without knowing about them a priori ? • Binding sites are similar to each other, but not necessarily identical • This is the motif finding problem • To find a motif that represents binding sites of an unknown TF
Motif finding algorithms • Version 1: Given promoter regions of co-regulated genes, find the motif • Existing algorithms: • Gibbs sampling (MCMC) : Lawrence et al. 1993 • MEME (Expectation-Maximization) : Bailey & Elkan 94 • CONSENSUS (Greedy local search, beam search) : Hertz & Stormo • Word enumeration methods (with emphasis on statistical accuracy) • van Helden et al. 1998, Sinha & Tompa 2000 • And a hundred others
More Data • Genomes of multiple species available [Figure: orthologous sequences from four species, related by an evolutionary tree, with blocks of conservation highlighted] species1 GCGTGATCGAGCTATAACGGAA species2 CTGTGATCGTCGGGTAACGCCC species3 TGGTGATCGGAACCCCTAACGA species4 AAGTGATCGATTATCCTAACGT
Using multiple genomes • Functional parts of the genome evolve more slowly than non-functional parts • Identify conserved parts by sequence alignment algorithms • Look for functional features in conserved regions – this improves the signal Popular Paradigm in Computational Biology
Multiple sequence alignment • Comparative genomics relies upon the ability to detect “similar” (evolutionarily related) regions in different genomes • The problem of multiple species alignment • A hard computational problem (“NP-hard”) • Several fast heuristics exist (MLAGAN, TBA) • Assume this functionality exists …
Back to motif finding [Figure: promoter regions of Gene 1 to Gene 5, with the binding sites for the TF marked]
Motif finding from multiple species data • Version 2: Given promoter regions of the same gene from multiple species, find the motif [Figure: promoter regions of gene G in Species 1 to Species 5, with the binding sites for the TF marked]
One approach • Do multiple sequence alignment of upstream regions of gene G • Look for recurring motifs in conserved blocks [Figure: upstream regions of gene G in Species 1 to Species 5, with blocks of conservation marked]
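As a rough illustration of the “conserved blocks” idea, the sketch below scans an alignment column by column and reports runs of identical columns. It is a deliberate simplification, not any published tool: it assumes the sequences are already aligned without gaps, treats only perfectly identical columns as conserved, and the function name `conserved_blocks` and the `min_len` parameter are hypothetical. The input is the four-species example from the “More Data” slide.

```python
def conserved_blocks(aligned, min_len=4):
    """Return (start, end, text) for maximal runs of identical columns.

    `aligned` is a list of equal-length, ungapped sequences; `end` is exclusive.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must have equal length")
    blocks, start = [], None
    for i in range(length + 1):
        # A column is conserved when all sequences carry the same base there.
        conserved = i < length and len({seq[i] for seq in aligned}) == 1
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                blocks.append((start, i, aligned[0][start:i]))
            start = None
    return blocks

aligned = [
    "GCGTGATCGAGCTATAACGGAA",  # species1
    "CTGTGATCGTCGGGTAACGCCC",  # species2
    "TGGTGATCGGAACCCCTAACGA",  # species3
    "AAGTGATCGATTATCCTAACGT",  # species4
]
blocks = conserved_blocks(aligned)
```

On this input the only block found is `GTGATCG` at columns 2 to 8. The shared `TAACG` sits at different offsets in the four sequences, so an ungapped scan misses it; recovering it needs either a real (gapped) alignment or an alignment-free search.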
Another approach (alignment-free) • What if binding sites are not entirely within conserved blocks? • Look for recurring motifs in entire upstream regions [Figure: upstream regions of gene G in Species 1 to Species 5, with blocks of conservation marked]
Footprinter (Blanchette et al.): The method without alignments
Footprinter • The input sequences are promoter regions of the same gene, but from multiple species. • Such sequences are said to be “orthologous” to each other.
Footprinter Input sequences Related by an evolutionary tree Find motif
A side note: Parsimony • A guiding principle in cross-species comparison • If the data can be explained in multiple ways, prefer the explanation with the fewest events (be parsimonious) • Parsimony score = number of evolutionary events (e.g., substitutions) on the tree • Maximum parsimony principle: minimize parsimony score
Phylogenetic footprinting: formally speaking • Given: a phylogenetic tree T; a set of orthologous sequences at the leaves of T; the length k of the motif; a threshold d • Problem: find a set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
Small Example • Size of motif sought: k = 4 • AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat)
Solution • AGTCGTACGTGAC... AGTAGACGTGCCG... ACGTGAGATACGT... GAACGGAGTACGT... TCGTGACGGTGAT... • k-mers: ACGT ACGT ACGT ACGG • Parsimony score: 1 mutation
An Exact Algorithm (Blanchette’s algorithm) • $W_u[s]$ = best parsimony score for the subtree rooted at node u, if u is labeled with string s • Each node stores a table of $4^k$ entries, one per k-mer s [Figure: example tree over the sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG, with the ACGG and ACGT entries of each node’s table shown; a leaf entry is 0 if the k-mer occurs in that sequence and +∞ otherwise]
Recurrence • $W_u[s] = \sum_{v:\,\text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$ • A post-order traversal algorithm
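Running Time • $W_u[s] = \sum_{v:\,\text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$ • $O(k \cdot 4^{2k})$ time per node: for each child, each of the $4^k$ labels s is compared with each of the $4^k$ labels t, and each distance d(s, t) takes O(k) time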
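The recurrence can be sketched as a direct post-order traversal. This is a minimal illustration under stated assumptions, not the Footprinter implementation: d(s, t) is taken to be Hamming distance, a leaf table holds 0 for k-mers that occur in the leaf sequence and infinity otherwise, and the tree shape for the five-species small example is a guess, since the slides do not give the topology in text. Only the sequence prefixes visible on the slide are used.

```python
import math
from itertools import product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def subtree_table(node, seqs, k, all_kmers):
    """Return {s: W_node[s]}. A node is a leaf name (str) or a tuple of children."""
    if isinstance(node, str):
        seq = seqs[node]
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        return {s: 0 if s in present else math.inf for s in all_kmers}
    # Post-order: children are solved before their parent.
    child_tables = [subtree_table(child, seqs, k, all_kmers) for child in node]
    table = {}
    for s in all_kmers:
        total = 0
        for child in child_tables:
            # Infinite entries can never be the minimum, so skip them.
            finite = [(t, w) for t, w in child.items() if w != math.inf]
            total += min(w + hamming(s, t) for t, w in finite)
        table[s] = total
    return table

def best_parsimony(tree, seqs, k):
    """Smallest parsimony score over all labelings of the root."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return min(subtree_table(tree, seqs, k, all_kmers).values())

seqs = {
    "Human": "AGTCGTACGTGAC", "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT", "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
# Assumed topology, for illustration only.
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))
score = best_parsimony(tree, seqs, 4)
```

For these prefixes the best score is 1, in agreement with the solution slide: no 4-mer occurs in all five sequences, while ACGT occurs in every sequence except Rat, which contains ACGG. Recovering the chosen k-mers themselves would need a traceback from the root, which this sketch omits.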
Footprinter: features • One of the earliest motif-finding algorithms based on comparative genomics • Simple formulation of motif score, algorithm efficient in practice • Cannot combine evolutionary conservation information with overrepresentation information • e.g., cannot distinguish between two equally conserved motifs when only one of them occurs in the promoters of many co-regulated genes
The underlying single-species algorithm: CONSENSUS Final goal: Find a set of substrings, one in each input sequence Set of substrings define a PWM. Goal: This PWM should have high information content. High information content means that the motif “stands out”.
The underlying single-species algorithm: CONSENSUS • Start with a substring in one input sequence • Build the set of substrings incrementally, adding one substring at a time • The current set of substrings defines the current motif • Consider every substring in the next sequence; try adding it to the current motif and score the resulting motif • Pick the best one … • … and repeat
The key: Scoring a motif • Build a PWM from the current motif • Compute the information content of the PWM: for each column, compute the relative entropy with respect to a “background” distribution, then sum over all columns • Key: align the sites of a motif, and score the alignment
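The scoring step, and the greedy loop from the preceding CONSENSUS slides that relies on it, can be sketched together. This is a simplified illustration, not the CONSENSUS program: it keeps only the single best partial motif per seed rather than a list of candidates, assumes a uniform background unless one is passed in, and the `pseudocount` parameter, function names, and toy sequences are all hypothetical.

```python
import math

BASES = "ACGT"

def information_content(sites, background=None, pseudocount=0.5):
    """Sum over columns of the relative entropy (bits) of the column vs. background."""
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    total = 0.0
    for i in range(len(sites[0])):
        for b in BASES:
            count = sum(site[i] == b for site in sites)
            f = (count + pseudocount) / (n + 4 * pseudocount)
            if f > 0:  # a zero frequency contributes nothing to the sum
                total += f * math.log2(f / background[b])
    return total

def greedy_consensus(seqs, width):
    """Try every seed in the first sequence; extend greedily, one sequence at a time."""
    def windows(seq):
        return [seq[i:i + width] for i in range(len(seq) - width + 1)]

    best_sites, best_score = None, -math.inf
    for seed in windows(seqs[0]):
        sites = [seed]
        for seq in seqs[1:]:
            # Add the substring that gives the highest-scoring extended motif.
            best = max(windows(seq), key=lambda w: information_content(sites + [w]))
            sites.append(best)
        score = information_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score

# Toy promoters with the 6-mer TAATCC planted in each (invented data).
promoters = ["GGTAATCCGA", "CATAATCCTT", "TAATCCAGGC", "ACGTTAATCC"]
found_sites, found_score = greedy_consensus(promoters, 6)
```

On the toy input the planted TAATCC is recovered from all four sequences. With a uniform background and no pseudocount, a perfectly conserved column scores 2 bits, so six identical columns score 12 bits; the pseudocount lowers that slightly but keeps sparse columns from producing zero frequencies.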
Extending CONSENSUS to multiple species Final goal: Find a set of substrings, one in each input sequence
Extending CONSENSUS to multiple species Final goal: Find a set of “profiles”, one in each set of orthologous input sequences
Extending CONSENSUS to multiple species “Profiles”
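Aligning two “profiles” • Compare two profiles column by column • Each column of a profile is a vector of base counts $(n_A, n_C, n_G, n_T)$ or, equivalently, base frequencies $(f_A, f_C, f_G, f_T)$ • Probabilistic score to capture whether two columns $\{n_{bi}, f_{bi}\}_b$ and $\{n_{bj}, f_{bj}\}_b$ are from the same distribution (and different from background) • ALLR (Average Log Likelihood Ratio): $\mathrm{ALLR} = \dfrac{\sum_b n_{bj} \ln \frac{f_{bi}}{p_b} + \sum_b n_{bi} \ln \frac{f_{bj}}{p_b}}{\sum_b (n_{bi} + n_{bj})}$, where $p_b$ is the background frequency of base b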
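A column-versus-column ALLR score can be sketched as follows, using the standard definition: the counts of each column are scored under the log-odds of the other column's frequencies, and the sum is divided by the total count. The code is illustrative only and is not taken from PhyloCon: the `pseudocount` used to keep frequencies away from zero, the uniform default background, and the example columns are all assumptions.

```python
import math

BASES = "ACGT"

def allr(counts_i, counts_j, background=None, pseudocount=0.5):
    """Average log likelihood ratio between two profile columns given as base counts."""
    background = background or {b: 0.25 for b in BASES}

    def freqs(counts):
        total = sum(counts[b] for b in BASES) + 4 * pseudocount
        return {b: (counts[b] + pseudocount) / total for b in BASES}

    f_i, f_j = freqs(counts_i), freqs(counts_j)
    # Each column's counts are scored against the other column's frequencies.
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in BASES
    )
    denominator = sum(counts_i[b] + counts_j[b] for b in BASES)
    return numerator / denominator

# Hypothetical columns: two dominated by A, one dominated by G.
col_a1 = {"A": 8, "C": 0, "G": 0, "T": 0}
col_a2 = {"A": 6, "C": 1, "G": 0, "T": 1}
col_g = {"A": 0, "C": 0, "G": 8, "T": 0}
similar = allr(col_a1, col_a2)
different = allr(col_a1, col_g)
```

Two columns that favor the same base give a positive score, while columns that favor different bases give a negative one. A profile alignment score would sum these column scores over the aligned positions.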
One cool feature of ALLR • Its expected value is negative, so very long profiles do not automatically receive large ALLR scores • The “right” motif length can therefore be detected automatically
PhyloCon: features • One of the first algorithms to find motifs that are conserved across species and occur in multiple co-regulated gene promoters • Does not consider the evolutionary relationships among species (all species weighted equally)