Comparative genomics to identify DNA binding motifs

Saurabh Sinha
Dept. of Computer Science, University of Illinois, Urbana-Champaign

Outline
• Binding sites and motifs
• The motif finding problem in one species
• Comparative genomics and alignment
• The motif finding problem with comparative genomics

Motif finding in multiple species
• Footprinter: the approach without alignments
• PhyloCon: the use of alignments
• PhyME & PhyloGibbs: the use of alignments and an evolutionary model
• MCS: genome-wide motif finding from multiple species

Binding sites and motifs

Binding sites
• A few binding sites of the transcription factor “Bicoid” in the Drosophila (fruit fly) genome, collected experimentally

http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif

Motif: T A A T C C C

Motif as a “consensus string”: W A A T C C N
W = T or A
N = A, C, G, or T

Motif
• Common sequence “pattern” in the binding sites of a transcription factor
• A succinct way of capturing variability among the binding sites

Alternative way to represent a motif: position weight matrix (PWM), or simply “weight matrix”

Motif representation
• Consensus string
  • May allow “degenerate” symbols in the string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C
  • Tractable search space, enumerative algorithms
• Position weight matrix
  • More powerful representation
  • Probabilistic treatment, algorithms
  • More popular
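To make the consensus-string representation concrete, here is a minimal Python sketch of matching a degenerate consensus such as W A A T C C N against candidate sites. The function names `matches` and `find_sites` and the scanned example sequence are illustrative assumptions, not part of the original slides; the symbol table is the standard IUPAC subset listed above.

```python
# Degenerate symbols from the slide (a subset of the IUPAC nucleotide codes).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
    "N": "ACGT",
}


def matches(consensus, site):
    """Return True if every base of `site` is allowed by the consensus symbol."""
    if len(consensus) != len(site):
        return False
    return all(base in IUPAC[sym] for sym, base in zip(consensus, site))


def find_sites(consensus, sequence):
    """Return start positions (0-based) of all matches of the consensus."""
    k = len(consensus)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if matches(consensus, sequence[i:i + k])
    ]


# Hypothetical sequence, made up for illustration only.
example = "GGTAATCCCGGAAATCCG"
print(matches("WAATCCN", "TAATCCC"))
print(find_sites("WAATCCN", example))
```

Because a consensus string either matches or does not, scanning is a simple enumeration; this is the "tractable search space" that the enumerative algorithms exploit.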

The motif finding problem (in one species)
• Suppose a transcription factor (TF) regulates five different genes
• Each of the five genes should have binding sites for the TF in its promoter region
[Figure: Genes 1 to 5, each with binding sites for the TF in its promoter region]

The motif finding problem
• Now suppose we are given the promoter regions of the five genes G1, G2, …, G5
• Can we find the binding sites of the TF without knowing about them a priori?
• Binding sites are similar to each other, but not necessarily identical
• This is the motif finding problem: to find a motif that represents the binding sites of an unknown TF

Motif finding algorithms
• Version 1: Given promoter regions of co-regulated genes, find the motif
• Existing algorithms:
  • Gibbs sampling (MCMC): Lawrence et al. 1993
  • MEME (Expectation-Maximization): Bailey & Elkan 1994
  • CONSENSUS (greedy local search, beam search): Hertz & Stormo
  • Word enumeration methods (with emphasis on statistical accuracy): van Helden et al. 1998, Sinha & Tompa 2000
  • And a hundred others

Comparative Genomics

More Data
• Genomes of multiple species are available
[Figure: four sequences related by an evolutionary tree, with blocks of conservation marked]
species1 GCGTGATCGAGCTATAACGGAA
species2 CTGTGATCGTCGGGTAACGCCC
species3 TGGTGATCGGAACCCCTAACGA
species4 AAGTGATCGATTATCCTAACGT

Using multiple genomes (a popular paradigm in computational biology)
• Functional parts of the genome evolve more slowly than non-functional parts
• Identify conserved parts with sequence alignment algorithms
• Look for functional features in conserved regions; this improves the signal

Multiple sequence alignment
• Comparative genomics relies upon the ability to detect “similar” (evolutionarily related) regions in different genomes
• This is the problem of multiple species alignment
• A hard computational problem (“NP-hard”)
• Several fast heuristics exist (MLAGAN, TBA)
• Assume this functionality exists …
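As a rough illustration of "blocks of conservation", the Python sketch below lists the k-mers that occur exactly in all four example sequences shown on the "More Data" slide. This is only a crude, alignment-free stand-in, assuming exact matches; real pipelines would use a multiple alignment tool, and the helper names `kmers` and `shared_kmers` are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """k-mers that occur (exactly) in every sequence, sorted alphabetically."""
    common = kmers(seqs[0], k)
    for seq in seqs[1:]:
        common &= kmers(seq, k)
    return sorted(common)


# The four example sequences from the slide.
species = {
    "species1": "GCGTGATCGAGCTATAACGGAA",
    "species2": "CTGTGATCGTCGGGTAACGCCC",
    "species3": "TGGTGATCGGAACCCCTAACGA",
    "species4": "AAGTGATCGATTATCCTAACGT",
}

conserved = shared_kmers(list(species.values()), 5)
print(conserved)
```

The shared 5-mers fall inside the two conserved blocks (GTGATCG and TAACG). Note that the second block sits at different offsets in different species, which is one reason an alignment step is normally needed.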

Back to motif finding
[Figure: Genes 1 to 5, each with binding sites for the TF in its promoter region]

Motif finding from multiple species data
• Version 2: Given promoter regions of the same gene from multiple species, find the motif
[Figure: upstream regions of gene G in Species 1 to 5, with binding sites for the TF marked]

One approach
• Do a multiple sequence alignment of the upstream regions of the gene
• Look for recurring motifs in conserved blocks
[Figure: upstream regions of gene G in Species 1 to 5, with blocks of conservation marked]

Another approach (alignment-free)
• What if binding sites are not entirely within conserved blocks?
• Look for recurring motifs in the entire upstream regions
[Figure: upstream regions of gene G in Species 1 to 5, with blocks of conservation marked]

Footprinter (Blanchette et al.): the method without alignments

Footprinter
• The input sequences are promoter regions of the same gene, but from multiple species
• Such sequences are said to be “orthologous” to each other

Footprinter
• Input: sequences related by an evolutionary tree
• Task: find the motif

A side note: Parsimony
• A guiding principle in cross-species comparison
• If the data can be explained in multiple ways, prefer the explanation with the fewest events (be parsimonious)
• Parsimony score = number of evolutionary events (e.g., substitutions) on the tree
• Maximum parsimony principle: minimize the parsimony score

Phylogenetic footprinting: formally speaking
Given:
• phylogenetic tree T,
• set of orthologous sequences at the leaves of T,
• length k of the motif,
• threshold d
Problem:
• Find a set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.
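To illustrate what a parsimony score is, here is a small Python sketch that counts the minimum number of substitutions needed to explain a single site on a binary tree. It uses Fitch's algorithm, a standard technique that the slides do not name, and the nested-tuple tree encoding is an assumption made for this example.

```python
def fitch(node):
    """Fitch's small-parsimony pass for one site on a binary tree.

    A leaf is a single base (str); an internal node is a (left, right) tuple.
    Returns (candidate states at this node, minimum substitutions below it).
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        # Children can agree on a state: no extra substitution needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on an edge below.
    return left_states | right_states, left_cost + right_cost + 1


# Hypothetical four-leaf tree: three leaves carry A, one carries G.
states, score = fitch((("A", "A"), ("A", "G")))
print(states, score)
```

Here a single substitution (A to G on one edge) explains the data, so the parsimony score is 1; any explanation with more events would be rejected under the maximum parsimony principle.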

Small Example
Size of motif sought: k = 4
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)

Solution
k-mers selected from the sequences of the small example: ACGT ACGT ACGT ACGG
Parsimony score: 1 mutation

… ACGG: +ACGT: 0 ... … ACGG:ACGT :0 ... … ACGG:ACGT :0 ... … ACGG:ACGT :0 ... … ACGG: 1 ACGT: 0 ... 4k entries AGTCGTACGTG ACGGGACGTGC ACGTGAGATAC GAACGGAGTAC TCGTGACGGTG … ACGG: 2ACGT: 1... … ACGG: 1ACGT: 1... … ACGG: 0ACGT: 2 ... … ACGG: 0 ACGT: +... An Exact Algorithm(Blanchette’s algorithm) Wu [s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.

Wu [s] =  min ( Wv [t] + d(s, t) ) • A post-order traversal algorithm v:child t ofu Recurrence

Wu [s] =  min ( Wv [t] + d(s, t) ) v:child t ofu Running Time O(k 42k )timeper node
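The recurrence can be sketched directly in Python. This is a minimal illustration, not the Footprinter implementation: the tree topology below is an assumption (the slides do not give it), the sequences are only the prefixes shown in the small example, d(s, t) is taken to be Hamming distance, and leaf entries are 0 for k-mers present in the leaf sequence and infinity otherwise.

```python
import math
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def kmers_of(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score_table(node, seqs, k, all_kmers):
    """Post-order computation of W_u[s] for every k-mer s at node u."""
    if isinstance(node, str):  # leaf: named species
        present = kmers_of(seqs[node], k)
        return {s: (0 if s in present else math.inf) for s in all_kmers}
    child_tables = [score_table(c, seqs, k, all_kmers) for c in node]
    table = {}
    for s in all_kmers:
        # W_u[s] = sum over children v of min over t of (W_v[t] + d(s, t))
        table[s] = sum(
            min(w[t] + hamming(s, t) for t in all_kmers) for w in child_tables
        )
    return table


# Sequence prefixes from the small example.
seqs = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
# Assumed topology, for illustration only.
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))

k = 4
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
root_table = score_table(tree, seqs, k, all_kmers)
best_score = min(root_table.values())
best_labels = [s for s in all_kmers if root_table[s] == best_score]
print(best_score, best_labels)
```

Each internal node fills a table of 4^k entries, and each entry scans 4^k candidate child labels at O(k) cost per distance, which is where the O(k 4^{2k}) per-node bound comes from.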

Footprinter: features
• One of the earliest motif-finding algorithms based on comparative genomics
• Simple formulation of the motif score; the algorithm is efficient in practice
• Cannot combine evolutionary conservation information with overrepresentation information: for example, it cannot distinguish between two equally conserved motifs when only one of them also occurs in the promoters of many co-regulated genes

PhyloCon (Stormo lab): the method with alignments

The underlying single-species algorithm: CONSENSUS
• Final goal: find a set of substrings, one in each input sequence
• The set of substrings defines a PWM
• Goal: this PWM should have high information content
• High information content means that the motif “stands out”

The underlying single-species algorithm: CONSENSUS
• Start with a substring in one input sequence
• Build the set of substrings incrementally, adding one substring at a time
• The current set of substrings defines the current motif
• Consider every substring in the next sequence; try adding it to the current motif and score the resulting motif
• Pick the best one …
• … and repeat
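The incremental procedure above can be sketched as a greedy search in Python. This is a simplified illustration, not the actual CONSENSUS program: it keeps only the single best extension at each step (no beam of candidates), assumes a uniform background, and uses a pseudocount of 0.5, all of which are assumptions made for this example. The three input sequences are hypothetical, with TAATCC planted in each.

```python
import math
from collections import Counter


def info_content(sites, bg=0.25, pseudo=0.5):
    """Information content (bits) of the PWM built from aligned sites."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        for base in "ACGT":
            f = (counts[base] + pseudo) / (n + 4 * pseudo)
            total += f * math.log2(f / bg)
    return total


def greedy_consensus(seqs, w):
    """Try every seed in the first sequence; extend greedily through the rest."""
    best_sites, best_score = None, -math.inf
    first = seqs[0]
    for i in range(len(first) - w + 1):
        sites = [first[i:i + w]]  # the current set of substrings
        for seq in seqs[1:]:
            candidates = [seq[j:j + w] for j in range(len(seq) - w + 1)]
            # Add the substring that gives the best-scoring motif.
            sites.append(
                max(candidates, key=lambda c: info_content(sites + [c]))
            )
        score = info_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score


# Hypothetical promoter fragments with TAATCC planted in each.
seqs = ["ACGTTAATCCGA", "TTAATCCCAGTG", "GGCATGTAATCC"]
sites, score = greedy_consensus(seqs, 6)
print(sites, round(score, 3))
```

Keeping several candidate motifs per step, as a beam search does, reduces the risk that an early greedy choice locks the search into a poor motif.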

The key: Scoring a motif
• Build a PWM from the current motif
• Compute the information content of the PWM:
  • For each column, compute the relative entropy with respect to a “background” distribution
  • Sum over all columns
• Key: align the sites of a motif, and score the alignment
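Written out, the score described here is a sum of per-column relative entropies. The notation is an assumption chosen for this sketch: $f_{b,j}$ is the frequency of base $b$ in column $j$ of the PWM, $p_b$ is the background frequency of base $b$, and $w$ is the motif width.

```latex
\mathrm{IC} \;=\; \sum_{j=1}^{w} \; \sum_{b \in \{A,C,G,T\}} f_{b,j} \, \log_2 \frac{f_{b,j}}{p_b}
```

For example, with a uniform background ($p_b = 1/4$), a column in which every site has the same base contributes $1 \cdot \log_2(1 / 0.25) = 2$ bits, while a column that matches the background exactly contributes 0 bits.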

Extending CONSENSUS to multiple species
• Single-species goal: find a set of substrings, one in each input sequence
• Multiple-species goal: find a set of “profiles”, one in each set of orthologous input sequences

Aligning two “profiles”
• Compare two profiles column by column
• Each column of a profile is $(n_A, n_C, n_G, n_T)$ and, equivalently, $(f_A, f_C, f_G, f_T)$
• A probabilistic score captures whether two columns $\{n_{bi}, f_{bi}\}_b$ and $\{n_{bj}, f_{bj}\}_b$ are from the same distribution (and different from background)
• ALLR: Average Log Likelihood Ratio
$$\mathrm{ALLR}(i, j) = \frac{\sum_b n_{bj} \ln \frac{f_{bi}}{p_b} + \sum_b n_{bi} \ln \frac{f_{bj}}{p_b}}{\sum_b \left( n_{bi} + n_{bj} \right)}$$
where $p_b$ is the background frequency of base b

One cool feature of ALLR
• Its expected value is negative, which means that very long profiles will not automatically give large ALLR scores
• Therefore, the “right” motif length can be detected automatically

PhyloCon: features
• One of the first algorithms to find motifs that are conserved across species and occur in multiple co-regulated gene promoters
• Does not consider the evolutionary relationships among species (all species are weighted equally)
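A small Python sketch of the column-by-column ALLR comparison follows. It assumes the standard ALLR definition, in which the counts of each column are scored against the log-ratio of the other column's frequencies to the background: ALLR = [sum_b n_bj ln(f_bi / p_b) + sum_b n_bi ln(f_bj / p_b)] / sum_b (n_bi + n_bj). The pseudocount used to avoid log(0), the helper names, and the example columns are assumptions for illustration.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}


def column(a=0, c=0, g=0, t=0):
    """Build a count column (n_A, n_C, n_G, n_T) as a dict."""
    return {"A": a, "C": c, "G": g, "T": t}


def frequencies(counts, bg, pseudo=1.0):
    """Column frequencies f_b, smoothed toward the background (assumption)."""
    total = sum(counts[b] for b in BASES)
    return {b: (counts[b] + pseudo * bg[b]) / (total + pseudo) for b in BASES}


def allr(col_i, col_j, bg=UNIFORM, pseudo=1.0):
    """Average log likelihood ratio between two profile columns."""
    f_i = frequencies(col_i, bg, pseudo)
    f_j = frequencies(col_j, bg, pseudo)
    numerator = sum(
        col_j[b] * math.log(f_i[b] / bg[b]) + col_i[b] * math.log(f_j[b] / bg[b])
        for b in BASES
    )
    denominator = sum(col_i[b] + col_j[b] for b in BASES)
    return numerator / denominator


def profile_score(profile_i, profile_j, bg=UNIFORM):
    """Score an ungapped alignment of two equal-length profiles."""
    return sum(allr(ci, cj, bg) for ci, cj in zip(profile_i, profile_j))


same = allr(column(a=4), column(a=4))       # two columns that agree
different = allr(column(a=4), column(c=4))  # two columns that disagree
print(same, different)
```

Agreeing, non-background columns score positive and disagreeing columns score negative, so extending an alignment of two profiles into unrelated flanking columns lowers the total rather than raising it.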
