Finding Motifs

Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas

Motif consensus • The consensus is the true underlying motif, which is expressed imperfectly in real genes because of mutations across organisms • A motif instance is a particular realization of the motif consensus in a given gene; it will differ from the consensus in a small number of positions

Motif data example (made up) • Motif instances: • AAAAACAC • CAAAACAA • ACACAAAA • CAAAAAAC • AAAGAACA • GACAAAAA • AAGAGAAA • Motif consensus: AAAAAAAA
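As a minimal sketch of how a consensus can be read off a set of instances, the code below takes the most frequent letter in each column of the made-up instances above. The function name `consensus` is an illustrative choice, and a column-wise majority vote is only one simple way to define a consensus.

```python
from collections import Counter

def consensus(instances):
    """Return the most frequent letter in each column of equal-length instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("all motif instances must have the same length")
    # zip(*instances) yields one tuple of letters per alignment column
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))

# The seven made-up motif instances from the slide
instances = [
    "AAAAACAC",
    "CAAAACAA",
    "ACACAAAA",
    "CAAAAAAC",
    "AAAGAACA",
    "GACAAAAA",
    "AAGAGAAA",
]

print(consensus(instances))  # prints AAAAAAAA
```

Every instance differs from the consensus in two or three positions, yet each column still has a clear majority letter.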

Motif data example (real) • Positions 3-9 (out of about 22) of the cyclic AMP receptor protein transcription factor binding site in 20 samples • TTGTGGC • TTTTGAT • AAGTGTC • ATTTGCA • CTGTGAG • ATGCAAA • GTGTTAA • ATTTGAA • TTGTGAT • ATTTATT • ACGTGAT • ATGTGAG • TTGTGAG • CTGTAAC • CTGTGAA • TTGTGAC • GCCTGAC • TTGTGAT • TTGTGAT • GTGTGAA
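Real sites are noisier than the made-up example, so a per-column count table is often more informative than a single consensus string. The sketch below tabulates base counts for the 20 sites listed above; the name `count_profile` is hypothetical, and no pseudocounts or normalization are applied.

```python
from collections import Counter

def count_profile(instances, alphabet="ACGT"):
    """Return one {base: count} dictionary per column of the aligned instances."""
    profile = []
    for col in zip(*instances):
        counts = Counter(col)
        profile.append({base: counts[base] for base in alphabet})
    return profile

# Positions 3-9 of the 20 binding-site samples from the slide
crp_sites = [
    "TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG",
    "ATGCAAA", "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT",
    "ACGTGAT", "ATGTGAG", "TTGTGAG", "CTGTAAC", "CTGTGAA",
    "TTGTGAC", "GCCTGAC", "TTGTGAT", "TTGTGAT", "GTGTGAA",
]

profile = count_profile(crp_sites)
for position, counts in enumerate(profile, start=3):
    print(position, counts)
```

Some columns are strongly conserved (the fourth column is T in 19 of the 20 samples), while others are split between two bases, which is why a single consensus letter can hide real variation.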

Phylogenetic footprinting • A phylogenetic tree organizes related (orthologous) sequences from different species • The sequences appear as leaves • Internal nodes indicate evolutionary divergence between species • A footprint is a highly conserved region across species

Identifying footprints • Main assumption: Functional DNA changes more slowly than other DNA • Therefore, closely related regions in different species are • more likely to be functional sequences • a basis for grouping species together • Footprints are DNA motifs

AGTCGTACGTGAC...(Human) AGTAGACGTGCCG...(Chimp) ACGTGAGATACAG...(Rabbit) GAACGGAGTACTG...(Mouse) TCGTGACGGTGAT... (Rat) Phylogenetic footprinting example ACGT ACG[TG] ACGG

AGTCGTACGTGAC...(Human) AGTAGACGTGCCG...(Chimp) ACGTGAGATACAG...(Rabbit) GAACGGAGTACTG...(Mouse) TCGTGACGGTGAT... (Rat) Phylogenetic footprinting example ACGT ACGT T→G mutation ACGT ACGG
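The footprint in this example can be located with a simple pattern search. The sketch below uses only the sequence prefixes shown on the slide (the elided remainder is not filled in) and the ambiguous pattern ACG[TG]; the helper name `find_footprint` is an assumption made for illustration.

```python
import re

# Sequence prefixes exactly as shown on the slide (the "..." parts are omitted)
orthologs = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACAG",
    "Mouse": "GAACGGAGTACTG",
    "Rat": "TCGTGACGGTGAT",
}

# ACG followed by either T or G covers both ACGT and ACGG
footprint = re.compile(r"ACG[TG]")

def find_footprint(sequences, pattern):
    """Map each species to (start index, matched text) of the first hit, or None."""
    hits = {}
    for species, seq in sequences.items():
        match = pattern.search(seq)
        hits[species] = (match.start(), match.group()) if match else None
    return hits

for species, hit in find_footprint(orthologs, footprint).items():
    print(species, hit)
```

Human, chimp, and rabbit carry ACGT while mouse and rat carry ACGG, which matches the single T→G mutation placed on the tree.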

Finding motifs • Start with a number of related genes (or proteins) • In regulatory motif finding, the related genes are co-expressed • Recall our discussion of DNA microarrays

Finding motifs: Start The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown) . . .

Finding motifs: Goal The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown) . . .

How does this relate to what we have discussed before? • Motif finding is a clear instance of a data mining problem • Motif finding is equivalent to local alignment across multiple sequences • Typically hundreds of sequences are aligned, sometimes thousands • There are also corresponding biological problems for global alignment of multiple sequences

Multiple sequence alignment • Protein families • Sets of proteins with similar structure (3D shape), function, or evolutionary history • Usually the above properties are correlated • Given several families, where to assign a new protein? • DNA repeating sequences • Alu sequence in humans (300 bp, appears more than 1 million times – 10% of our DNA) • Estimated 60% of the “junk” in human genome consists of such sequences

Optimal alignment • We define the multiple global alignment as an extension of strings S1, S2, ..., Sk to S′1, S′2, ..., S′k that may contain spaces with • |S′1| = |S′2| = ... = |S′k| • Removing all spaces from each S′i leaves Si • No position has a space in all S′i • We need to extend our similarity function to handle multiple strings • The optimal alignment is the one that maximizes the similarity function

Multiple string similarity • Many ways to do so. Most common: Sum of pairwise similarities • Assumes symmetric similarity • We need to account for σ(-,-) (usually 0) • Alternatively, we can use distances between strings and minimize the sum of the pairwise distances
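The sum-of-pairs idea can be sketched as follows: score every column by adding σ over all unordered pairs of rows, with σ(-,-) = 0. The match, mismatch, and gap values below are illustrative assumptions, not values given in the slides.

```python
from itertools import combinations

GAP = "-"

def sigma(a, b, match=1, mismatch=-1, gap=-1):
    """Symmetric pairwise score; the numeric values are illustrative assumptions."""
    if a == GAP and b == GAP:
        return 0  # two spaces facing each other contribute nothing
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum sigma over every pair of rows in every column of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned strings must all have the same length")
    return sum(
        sigma(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )

print(sum_of_pairs(["AC-T", "A-GT", "ACGT"]))  # prints 4
```

In the example, the first and last columns each contribute 3 (three matching pairs), and the two middle columns each contribute -1 (one match, two gap pairs).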

Dynamic programming for multiple sequence alignment • In pairwise alignment, we used a two-dimensional matrix to record three choices at each cell: {01}, {10}, and {11} where 1 means consume a character from the corresponding string

DP for multiple alignment • For k strings we need a k-dimensional table • Each dimension has as many elements as the length of the corresponding string plus one (for gaps at the start) • Assuming the same length n, the matrix has (n+1)^k cells • At each cell, we consider 2^k – 1 choices
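The choices at each cell can be enumerated as the non-zero 0/1 vectors of length k, where 1 means "consume a character from that string". The sketch below lists them and the predecessor cells they lead back to; the names `dp_moves` and `predecessors` are illustrative.

```python
from itertools import product

def dp_moves(k):
    """All 2^k - 1 non-zero 0/1 vectors: which strings consume a character."""
    return [move for move in product((0, 1), repeat=k) if any(move)]

def predecessors(cell):
    """Cells from which `cell` can be reached by one move, staying inside the table."""
    preds = []
    for move in dp_moves(len(cell)):
        prev = tuple(c - m for c, m in zip(cell, move))
        if min(prev) >= 0:
            preds.append(prev)
    return preds

print(dp_moves(2))       # prints [(0, 1), (1, 0), (1, 1)]
print(len(dp_moves(3)))  # prints 7
```

For k = 2 this recovers the three familiar choices of pairwise alignment; cells on the boundary of the table have fewer valid predecessors.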

Multiple alignment complexity • (n+1)^k = O(n^k) entries need to be filled, each in O(2^k) time • Total time O(n^k · 2^k) = O((2n)^k) • Total space O(n^k) • Typically n is a few thousand, k a few hundred, making this approach impractical • Independently of whether DP is used, for the sum of pairwise similarities the problem is provably NP-complete
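To get a feel for these numbers, the sketch below counts table cells and cell-choice evaluations for equal-length strings. The function name `dp_cost` is hypothetical, and the counts ignore the cost of evaluating the similarity function for each choice.

```python
def dp_cost(n, k):
    """Return (table cells, cells times choices per cell) for k strings of length n."""
    cells = (n + 1) ** k
    choices = 2 ** k - 1
    return cells, cells * choices

print(dp_cost(3, 3))     # prints (64, 448)
print(dp_cost(1000, 2))  # prints (1002001, 3006003)

# Even a modest n = 1000 with k = 100 gives a table size with hundreds of digits
cells, _ = dp_cost(1000, 100)
print(len(str(cells)))   # prints 301
```

Pairwise alignment of two 1000-character strings needs about a million cells, while 100 such strings would need a table with more than 10^300 cells.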

What to do for NP-complete problems? • Use exact methods (such as DP) for small inputs only • Use approximate methods with polynomial time and a provable error bound • Use heuristic approaches that follow plausible choices but have no guaranteed error bound • specific to the problem (such as FASTA) • general (optimization, estimation via statistical sampling such as MCMC)

Center star algorithm for multiple sequence global alignment • T is the set of strings that we want to align • Pick S ∈ T that minimizes the sum of its distances to all other strings in T • The initial alignment starts with S (≡ S1) • Suppose we have already aligned S1, S2, ..., Si as S′1, S′2, ..., S′i. Then we add the remaining strings one at a time by aligning Si+1 with S′1, obtaining S′i+1 and S′′1. We replace S′1 with S′′1 and add spaces to S′2, ..., S′i wherever spaces were added to S′1.
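The first step, choosing the center string, can be sketched as below. The sketch assumes unit-cost edit distance as the pairwise distance (the slides do not fix a particular distance function) and covers only the selection of the center, not the subsequent merging of pairwise alignments. The names `edit_distance` and `pick_center` are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (assumed here as the pairwise distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]

def pick_center(strings):
    """Return the string whose summed distance to all the others is smallest."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))

T = ["ACGG", "ACGT", "AGGT", "TCGT"]
print(pick_center(T))  # prints ACGT
```

Here ACGT is one edit away from each of the other three strings (total 3), while every other candidate has a total distance of 5.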
