Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas
Motif consensus • The consensus is the true underlying motif, which is expressed imperfectly in real genes because of mutations across organisms • A motif instance is a particular realization of the motif consensus in a given gene; it differs from the consensus in a small number of positions
Motif data example (made up) • Motif instances: • AAAAACAC • CAAAACAA • ACACAAAA • CAAAAAAC • AAAGAACA • GACAAAAA • AAGAGAAA • Motif consensus: AAAAAAAA
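One simple way to estimate a consensus from instances like these is a per-column majority vote. The sketch below assumes the instances are already aligned and of equal length; the function names `consensus` and `mismatches` are illustrative, not from the slides.

```python
from collections import Counter

def consensus(instances):
    """Most frequent base in each column of equal-length, pre-aligned instances."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))

def mismatches(instance, motif):
    """Number of positions where an instance differs from the motif."""
    return sum(a != b for a, b in zip(instance, motif))

instances = ["AAAAACAC", "CAAAACAA", "ACACAAAA", "CAAAAAAC",
             "AAAGAACA", "GACAAAAA", "AAGAGAAA"]
motif = consensus(instances)
print(motif)                                        # AAAAAAAA
print([mismatches(s, motif) for s in instances])    # [2, 2, 2, 2, 2, 2, 2]
```

No single instance equals the consensus here, yet each one differs from it in only two positions.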
Motif data example (real) • Positions 3-9 (out of about 22) of the cyclic AMP receptor protein transcription factor binding site in 20 samples • TTGTGGC • TTTTGAT • AAGTGTC • ATTTGCA • CTGTGAG • ATGCAAA • GTGTTAA • ATTTGAA • TTGTGAT • ATTTATT • ACGTGAT • ATGTGAG • TTGTGAG • CTGTAAC • CTGTGAA • TTGTGAC • GCCTGAC • TTGTGAT • TTGTGAT • GTGTGAA
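Real sites are noisier than the made-up example, so it can help to tabulate how often each base appears at each position. The following is a minimal sketch using only the 20 samples listed above; the variable names are illustrative.

```python
from collections import Counter

sites = ["TTGTGGC", "TTTTGAT", "AAGTGTC", "ATTTGCA", "CTGTGAG",
         "ATGCAAA", "GTGTTAA", "ATTTGAA", "TTGTGAT", "ATTTATT",
         "ACGTGAT", "ATGTGAG", "TTGTGAG", "CTGTAAC", "CTGTGAA",
         "TTGTGAC", "GCCTGAC", "TTGTGAT", "TTGTGAT", "GTGTGAA"]

# One Counter per column; the columns correspond to positions 3-9 of the site.
counts = [Counter(col) for col in zip(*sites)]
for pos, c in enumerate(counts, start=3):
    print(pos, {base: c[base] for base in "ACGT"})
```

The counts show that conservation varies by position: the fourth listed column is T in 19 of the 20 samples, while the first is split evenly between A and T (7 each).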
Phylogenetic footprinting • A phylogenetic tree organizes related (orthologous) sequences from different species • The sequences appear as leaves • Internal nodes indicate evolutionary divergence between species • A footprint is a highly conserved region across species
Identifying footprints • Main assumption: Functional DNA changes more slowly than other DNA • Therefore, closely related regions in different species are • more likely to be functional sequences • a basis for grouping species together • Footprints are DNA motifs
Phylogenetic footprinting example • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACAG... (Rabbit) • GAACGGAGTACTG... (Mouse) • TCGTGACGGTGAT... (Rat) • The footprint appears as ACGT in Human, Chimp, and Rabbit and as ACGG in Mouse and Rat • Internal node labels (tree figure not preserved), added step by step: ACGT, then ACGG, then ACG[TG] where the last position is ambiguous • Final labeling: the ambiguous node is resolved to ACGT, and a single T→G mutation accounts for ACGG
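The footprint in this example can be checked mechanically by looking, in each species, for the window closest to a candidate motif. The sketch below uses only the sequence prefixes shown on the slide (the originals are truncated) and Hamming distance as the mismatch count; `best_windows` is an illustrative helper, not part of the lecture.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_windows(seq, motif):
    """Return (minimum distance, all windows of len(motif) in seq achieving it)."""
    k = len(motif)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    best = min(hamming(w, motif) for w in windows)
    return best, [w for w in windows if hamming(w, motif) == best]

species = {
    "Human":  "AGTCGTACGTGAC",
    "Chimp":  "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACAG",
    "Mouse":  "GAACGGAGTACTG",
    "Rat":    "TCGTGACGGTGAT",
}
for name, seq in species.items():
    print(name, *best_windows(seq, "ACGT"))
```

Human, Chimp, and Rabbit contain ACGT exactly, while Mouse and Rat only reach distance 1, with ACGG among the closest windows. A window this short also matches by chance (Rat has a second distance-1 window, TCGT), which is why real footprints rely on longer conserved regions.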
Finding motifs • Start with a number of related genes (or proteins) • In regulatory motif finding, the related genes are co-expressed • Recall our discussion of DNA microarrays
Finding motifs: Start The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown) . . .
Finding motifs: Goal The red part is a gene, the green line is the associated regulatory region in non-coding DNA, and the yellow boxes are the motif instances (unknown) . . .
How does this relate to what we have discussed before? • Motif finding is a clear instance of a data mining problem • Motif finding is equivalent to local alignment across multiple sequences • Typically hundreds of sequences are aligned, sometimes thousands • There are also corresponding biological problems for global alignment of multiple sequences
Multiple sequence alignment • Protein families • Sets of proteins with similar structure (3D shape), function, or evolutionary history • Usually the above properties are correlated • Given several families, where to assign a new protein? • DNA repeating sequences • Alu sequence in humans (about 300 bp, appears more than 1 million times, about 10% of our DNA) • An estimated 60% of the “junk” in the human genome consists of such sequences
Optimal alignment • We define the multiple global alignment as an extension of strings $S_1, S_2, \ldots, S_k$ to $S'_1, S'_2, \ldots, S'_k$ that may contain spaces, with • $|S'_1| = |S'_2| = \cdots = |S'_k|$ • Removing all spaces from each $S'_i$ leaves $S_i$ • No position has a space in all $S'_i$ • We need to extend our similarity function to handle multiple strings • The optimal alignment is the one that maximizes the similarity function
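The three conditions in this definition translate directly into a validity check. This is a minimal sketch that assumes `-` denotes a space; `is_valid_alignment` is an illustrative name.

```python
GAP = "-"  # assumed symbol for a space

def is_valid_alignment(strings, aligned):
    """Check the three conditions of a multiple global alignment."""
    if not aligned or len(strings) != len(aligned):
        return False
    # 1. All extended strings have the same length.
    same_length = len({len(row) for row in aligned}) == 1
    # 2. Removing the spaces from each extended string recovers the original.
    restores = all(row.replace(GAP, "") == s for s, row in zip(strings, aligned))
    # 3. No column consists only of spaces.
    no_empty_column = all(any(ch != GAP for ch in col) for col in zip(*aligned))
    return same_length and restores and no_empty_column

print(is_valid_alignment(["ACGT", "AGT"], ["ACGT", "A-GT"]))    # True
print(is_valid_alignment(["ACGT", "AGT"], ["AC-GT", "A--GT"]))  # False: all-space column
```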
Multiple string similarity • There are many ways to define it. Most common: the sum of pairwise similarities, $\sum_{i<j} \sigma(S'_i, S'_j)$ • Assumes a symmetric similarity $\sigma$ • We need to account for $\sigma(-,-)$ (usually 0) • Alternatively, we can use distances between strings and minimize the sum of the pairwise distances
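The sum-of-pairs score can be computed column by column. In this sketch the scoring values (match +1, mismatch -1, space -2) are assumptions chosen for illustration; only the space-against-space score of 0 follows the slide.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for a space

def sigma(x, y, match=1, mismatch=-1, gap=-2):
    """Symmetric per-character similarity; the numeric scores are illustrative."""
    if x == GAP and y == GAP:
        return 0            # space against space, as on the slide
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch

def sum_of_pairs(aligned):
    """Sum sigma over every pair of rows in every column of the alignment."""
    return sum(sigma(x, y)
               for col in zip(*aligned)
               for x, y in combinations(col, 2))

print(sum_of_pairs(["ACGT", "A-GT", "ACGA"]))  # 2
```

Because each column contributes independently, the same function also scores a single DP move once the characters consumed by that move are known.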
Dynamic programming for multiple sequence alignment • In pairwise alignment, we used a two-dimensional matrix to record three choices at each cell: {01}, {10}, and {11} where 1 means consume a character from the corresponding string
DP for multiple alignment • For $k$ strings we need a $k$-dimensional table • Each dimension has as many elements as the length of the corresponding string plus one (for gaps at the start) • Assuming the same length $n$ for every string, the table has $(n+1)^k$ cells • At each cell, we consider $2^k - 1$ choices
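The choices at each cell are the non-zero 0/1 vectors of length k, where 1 means "consume a character from that string". A small sketch that enumerates them and counts the table cells (`moves` and `table_cells` are illustrative names):

```python
from itertools import product

def moves(k):
    """All 0/1 vectors of length k except all zeros: 2**k - 1 choices."""
    return [m for m in product((0, 1), repeat=k) if any(m)]

def table_cells(lengths):
    """Number of cells in the k-dimensional DP table."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

print(moves(2))                 # [(0, 1), (1, 0), (1, 1)]
print(len(moves(5)))            # 31
print(table_cells([10, 10, 10]))  # 1331
```

For k = 2 this reproduces the three pairwise choices {01}, {10}, and {11}.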
Multiple alignment complexity • $(n+1)^k = O(n^k)$ entries need to be filled, each in $O(2^k)$ time • Total time $O(n^k 2^k) = O((2n)^k)$ • Total space $O(n^k)$ • Typically $n$ is a few thousand and $k$ a few hundred, making this approach impractical • Independently of whether DP is used, the problem is provably NP-complete for the sum of pairwise similarities
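To get a feel for these bounds, the sketch below computes the base-10 logarithm of the number of (cell, choice) pairs. The sample sizes n = 1000 and k = 100 are illustrative and sit at the low end of the ranges quoted above.

```python
import math

def log10_work(n, k):
    """log10 of (n+1)**k table cells times (2**k - 1) choices per cell."""
    return k * math.log10(n + 1) + math.log10(2**k - 1)

print(round(log10_work(1000, 100)))  # about 330, i.e. roughly 10**330 steps
```

Even for three strings of length 1000 the table already has about a billion cells, so exact DP is limited to very small k.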
What to do for NP-complete problems? • Use exact methods (such as DP) for small inputs only • Use approximate methods with polynomial time and a provable error bound • Use heuristic approaches that follow plausible choices but have no guaranteed error bound • specific to the problem (such as FASTA) • general (optimization, estimation via statistical sampling such as MCMC)
Center star algorithm for multiple sequence global alignment • $T$ is the set of strings that we want to align • Pick $S \in T$ that minimizes $\sum_{S' \in T \setminus \{S\}} D(S, S')$, the sum of the pairwise distances from $S$ to all other strings • The initial alignment starts with $S$ ($\equiv S_1$) • Suppose we have already aligned $S_1, S_2, \ldots, S_i$ as $S'_1, S'_2, \ldots, S'_i$. We add the remaining strings one at a time: align $S_{i+1}$ with $S'_1$, obtaining $S'_{i+1}$ and $S''_1$; replace $S'_1$ with $S''_1$; and add spaces to $S'_2, \ldots, S'_i$ wherever spaces were added to $S'_1$.
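The steps above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes unit-cost edit distance as the pairwise distance $D$, uses `-` for a space, treats a space already present in the center as free to align against a new space, and the function names are hypothetical.

```python
GAP = "-"  # assumed symbol for a space

def align(a, b):
    """Globally align a (may already contain GAP) with b (no GAP), unit costs.
    Returns (cost, columns); each column is (a_char, b_char, new_gap_in_a)."""
    n, m = len(a), len(b)
    # An existing space in a may face a space in b at no cost.
    skip = [0 if ch == GAP else 1 for ch in a]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + skip[i - 1]
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + skip[i - 1],
                          d[i][j - 1] + 1)
    cols, i, j = [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            cols.append((a[i - 1], b[j - 1], False)); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + skip[i - 1]:
            cols.append((a[i - 1], GAP, False)); i -= 1
        else:
            cols.append((GAP, b[j - 1], True)); j -= 1
    cols.reverse()
    return d[n][m], cols

def center_star(strings):
    """Return aligned rows: the center string first, then the others in input order."""
    totals = [sum(align(s, t)[0] for t in strings) for s in strings]
    c = min(range(len(strings)), key=totals.__getitem__)
    center, others = strings[c], []
    for idx, s in enumerate(strings):
        if idx == c:
            continue
        _, cols = align(center, s)
        center = "".join(x for x, _, _ in cols)
        # Add a space to earlier rows wherever the center received a new one.
        updated = []
        for row in others:
            chars = iter(row)
            updated.append("".join(GAP if new else next(chars) for _, _, new in cols))
        others = updated + ["".join(y for _, y, _ in cols)]
    return [center] + others

for row in center_star(["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]):
    print(row)
```

Choosing the center costs one pairwise alignment per pair of strings, and each later step is a single pairwise alignment against the current center, so the whole procedure runs in polynomial time, in contrast to the exponential DP table.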