Finding Motifs in DNA References: 1. Bioinformatics Algorithms, Jones and Pevzner, Chapter 4. 2. Algorithms on Strings, Gusfield, Section 7.11. 3. Beginning Perl for Bioinformatics, Tisdall, Chapter 9. 4. Wikipedia
Summary • Introduce the Motif Finding Problem • Explain its significance in bioinformatics • Develop a simple model of the problem • Design algorithmic solutions: • Brute Force • Branch and Bound • Greedy • Compare results of each method.
News: October 6, 2009 • IBM Developing Chip to Sequence DNA • Gene Discovery May Advance Head and Neck Cancer Therapy • DNA on bloody clothes matches missing US diplomat • 3 Scientists Share Nobel Chemistry Prize for DNA Work • S1P Gene Regulating Lipid May Help Develop New Drugs against Cancer • Need a New Heart? Grow Your Own • Updated map of human genome to help fight against disease
The Motif Finding Problem • motif noun 1. a recurring subject, theme, idea, etc., esp. in a literary, artistic, or musical work. 2. a distinctive and recurring form, shape, figure, etc., in a design, as in a painting or on wallpaper. 3. a dominant idea or feature: the profit motif of free enterprise.
Example: Fruit Fly • Set of immunity genes. • DNA pattern: TCGGGGATTTCC • Consistently appears upstream of this set of genes. • Regulates timing/magnitude of gene expression. • “Regulatory Motif” • Finding such patterns can be difficult.
Construct an Example: 7 DNA Samples cacgtgaagcgactagctgtactattctgcat cgtccgatctcaggattgtctggggcgacgat gggggcggtgcgggagccagcgctcggcgttt gcaaggcgtcaaattgggaggcgcattctgaa ccacaagcgagcgttcctcgggattggtcacg aggtataatgcgaacagctaaaactccggaaa cccccgcaatttaactagggggcgcttagcgt Pattern acctggcc
Insert Pattern at random locations: cacgtgaacctggccagcgactagctgtactattctgcat cgtccgatctcaggattgtctacctggccggggcgacgat gacctggccggggcggtgcgggagccagcgctcggcgttt gcaaggacctggcccgtcaaattgggaggcgcattctgaa ccacaagcgagcgttcctcgggattggacctggcctcacg aggtataatgcgaaacctggcccagctaaaactccggaaa cccccgcaaacctggcctttaactagggggcgcttagcgt
Add Mutations: cacgtgaacGtggccagcgactagctgtactattctgcat cgtccgatctcaggattgtctacctgAccggggcgacgat gGcctggccggggcggtgcgggagccagcgctcggcgttt gcaaggacctggTccgtcaaattgggaggcgcattctgaa ccacaagcgagcgttcctcgggattggaActggcctcacg aggtataatgcgaaacctTgcccagctaaaactccggaaa cccccgcaaacTtggcctttaactagggggcgcttagcgt
Finally, find the hidden pattern: cacgtgaacgtggccagcgactagctgtactattctgcat cgtccgatctcaggattgtctacctgaccggggcgacgat ggcctggccggggcggtgcgggagccagcgctcggcgttt gcaaggacctggtccgtcaaattgggaggcgcattctgaa ccacaagcgagcgttcctcgggattggaactggcctcacg aggtataatgcgaaaccttgcccagctaaaactccggaaa cccccgcaaacttggcctttaactagggggcgcttagcgt
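The construction above (insert the pattern at a random location in each sample, then mutate it) can be sketched in Python. This is a minimal sketch, not the code used to build the slides: the function name `implant`, the fixed seed, and the choice of exactly one substitution per copy are assumptions made for illustration.

```python
import random

def implant(samples, pattern, mutations=1, seed=0, alphabet="acgt"):
    """Insert a mutated copy of `pattern` into each sample at a random position.

    Returns a list of (new_sequence, start) pairs, where `start` is the index
    at which the mutated copy begins. `seed` only makes the run repeatable.
    """
    rng = random.Random(seed)
    result = []
    for seq in samples:
        copy = list(pattern)
        # Substitute a different letter at `mutations` distinct positions.
        for pos in rng.sample(range(len(copy)), mutations):
            copy[pos] = rng.choice([b for b in alphabet if b != copy[pos]])
        start = rng.randrange(len(seq) + 1)
        result.append((seq[:start] + "".join(copy) + seq[start:], start))
    return result

samples = [
    "cacgtgaagcgactagctgtactattctgcat",
    "cgtccgatctcaggattgtctggggcgacgat",
    "gggggcggtgcgggagccagcgctcggcgttt",
    "gcaaggcgtcaaattgggaggcgcattctgaa",
    "ccacaagcgagcgttcctcgggattggtcacg",
    "aggtataatgcgaacagctaaaactccggaaa",
    "cccccgcaatttaactagggggcgcttagcgt",
]
planted = implant(samples, "acctggcc")
for seq, start in planted:
    print(start, seq)
```

Because the positions and mutations are random, the output will differ from the sequences on the slides; only the construction is the same.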
Three Approaches • Brute Force: • check every possible pattern. • Branch and Bound: • prune away some of the search space. • Greedy: • commit to “nearby” options, never look back.
Brute Force • Assume the pattern has a known length L. • Generate all DNA patterns of length L. • (Called “L-mers”). • Match each one to the DNA samples. • Keep the L-mer with the best match. • “Best” is based on a scoring function.
Scoring: Hamming Distance • An L-mer: gtgtaggt (L = 8). • Figure: the L-mer aligned against a DNA sequence at three starting positions (sequences shown: accgtaccggtaacaagt, accgtacgggtaacaagt, accgtaggtgtaacaagt), with mismatch counts of 8, 4, and 2. • Try all starting positions. • Find the position with the fewest mismatches.
Scoring • Try each possible L-mer. • Figure: t = 8 DNA samples, with fewest-mismatch counts of 3, 2, 1, 0, 3, 2, 0, 1 on the eight samples; total distance = 12. • The score is the sum, over all samples, of the mismatches at the location with the fewest mismatches on each sample. • The L-mer with the lowest such score is the optimal answer.
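The scoring rule can be sketched in Python as follows. The function names (`hamming`, `best_match`, `total_distance`) are illustrative choices, not names from the slides, and the sketch assumes every sample is at least as long as the L-mer. The sample data are the seven mutated sequences from the constructed example.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(u, v))

def best_match(lmer, seq):
    """Return (fewest mismatches, start position) over all windows of seq."""
    L = len(lmer)
    return min((hamming(lmer, seq[s:s + L]), s) for s in range(len(seq) - L + 1))

def total_distance(lmer, samples):
    """Sum of the per-sample fewest-mismatch counts; lower is better."""
    return sum(best_match(lmer, seq)[0] for seq in samples)

mutated_samples = [
    "cacgtgaacgtggccagcgactagctgtactattctgcat",
    "cgtccgatctcaggattgtctacctgaccggggcgacgat",
    "ggcctggccggggcggtgcgggagccagcgctcggcgttt",
    "gcaaggacctggtccgtcaaattgggaggcgcattctgaa",
    "ccacaagcgagcgttcctcgggattggaactggcctcacg",
    "aggtataatgcgaaaccttgcccagctaaaactccggaaa",
    "cccccgcaaacttggcctttaactagggggcgcttagcgt",
]
print(total_distance("acctggcc", mutated_samples))
```

For the hidden pattern acctggcc this comes to 7: each sample contains one copy with a single mutation, so each contributes one mismatch.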
Generating all L-mers • Systematic enumeration of all DNA strings of length L. • DNA has an “alphabet” of 4 letters: { a, c, g, t } • Proteins have an alphabet of 20 letters: • one for each of the 20 standard amino acids. • {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} • Solve the problem for any size alphabet (k) and any size L-mer (L).
Definitions • k = size of alphabet • L = length of strings to be generated • a = vector containing a partial or complete L-mer. • i = number of entries in a already filled in. • Example: k = 4, L = 5, i = 2, a = (2, 4, *, * , * )
Example • Alphabet = {1, 2}, k = 2, L = 4 • Figure: search tree whose leaves run from (1111) to (2222). • i = depth in the tree.
NEXTVERTEX
NEXTVERTEX(a, i, L, k)
    if i < L
        a(i+1) = 1
        return (a, i+1)
    else
        for j = L to j = 1
            if a(j) < k
                a(j) = a(j) + 1
                return (a, j)
    return (a, 0)
Examples from the figure: • Interior node: i = 3, a = (1 3 2) moves down to i = 4, a = (1 3 2 1). • Leaf (i = L): a = (2 3 2 1 2 2) moves to a = (2 3 2 1 2 3); the loop scans from j = L back toward j = 1.
Example: L = 6, k = 3, alphabet = {1, 2, 3}. Successive NEXTVERTEX calls, starting from the leaf (2 3 2 1 2 3):
.....
i = 6: 2 3 2 1 2 3
i = 5: 2 3 2 1 3
i = 6: 2 3 2 1 3 1
i = 6: 2 3 2 1 3 2
i = 6: 2 3 2 1 3 3
i = 4: 2 3 2 2
i = 5: 2 3 2 2 1
i = 6: 2 3 2 2 1 1
i = 6: 2 3 2 2 1 2
i = 6: 2 3 2 2 1 3
.....
When i = L (leaf node), scan from j = L back toward j = 1 for the first entry below k, increment it, and continue from that level.
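A minimal Python sketch of NEXTVERTEX, assuming `a` is a list of length L whose first `i` entries are meaningful (the slides use 1-based indexing, so slot `a[j - 1]` holds level j). The trace loop reproduces the L = 6, k = 3 example.

```python
def next_vertex(a, i, L, k):
    """Advance one step in a depth-first walk of the tree of prefixes.

    a: list of length L with entries in 1..k (only a[:i] is meaningful)
    i: current depth; a returned depth of 0 means the walk is finished
    """
    a = list(a)  # work on a copy so the caller's list is untouched
    if i < L:
        a[i] = 1                  # go down to the first child
        return a, i + 1
    for j in range(L, 0, -1):     # at a leaf: back up until a level can advance
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

a, i = [2, 3, 2, 1, 2, 3], 6
trace = []
for _ in range(9):
    a, i = next_vertex(a, i, L=6, k=3)
    trace.append((i, a[:i]))
for depth, prefix in trace:
    print("i =", depth, *prefix)
```

Entries beyond position i are left stale after backing up; they are overwritten on the next descent, which is why only `a[:i]` is printed.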
Brute Force • Use NEXTVERTEX to generate nodes in the tree. • Translate each numeric value into the corresponding L-mer • (e.g.: 1=a, 2=c, 3=g, 4=t). • Score each L-mer (Hamming distance). • Keep the best L-mer (and where it matched in each DNA sample).
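A minimal sketch of the brute-force search in Python. As a simplification, it uses `itertools.product` to enumerate the leaves in the same lexicographic order that NEXTVERTEX visits them, rather than walking the tree; the function names and the toy samples are illustrative assumptions.

```python
from itertools import product

def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(u, v))

def best_match(lmer, seq):
    """Return (fewest mismatches, start position) over all windows of seq."""
    L = len(lmer)
    return min((hamming(lmer, seq[s:s + L]), s) for s in range(len(seq) - L + 1))

def brute_force_motif(samples, L, alphabet="acgt"):
    """Try all k**L words; return (total distance, best word, match positions)."""
    best = (float("inf"), None, None)
    for letters in product(alphabet, repeat=L):
        word = "".join(letters)
        matches = [best_match(word, seq) for seq in samples]
        distance = sum(d for d, _ in matches)
        if distance < best[0]:            # strict '<' keeps the first word on ties
            best = (distance, word, [pos for _, pos in matches])
    return best

toy_samples = ["ttacgtt", "gacgtgg", "ccacgtc"]   # "acgt" occurs in all three
print(brute_force_motif(toy_samples, 4))
```

The running time grows as k^L times the cost of scoring one word, which is why the later slides look for ways to prune.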
Branch and Bound • Use same structure as the Brute Force method. • Looks for ways to reduce the computation. • Prune branches of the tree that cannot produce anything better than what we have so far.
BYPASS
BYPASS(a, i, L, k)
    for j = i to j = 1
        if a(j) < k
            a(j) = a(j) + 1
            return (a, j)
    return (a, 0)
BRANCHANDBOUND
    a = (1, 1, ..., 1)
    bestDistance = infinity
    i = L
    while (i > 0)
        if i < L
            prefix = translate(a1, a2, ..., ai)
            optimisticDistance = TotalDistance(prefix, DNA)
            if optimisticDistance > bestDistance
                (a, i) = BYPASS(a, i, L, k)
            else
                (a, i) = NEXTVERTEX(a, i, L, k)
        else
            word = translate(a1, a2, ..., aL)
            if TotalDistance(word, DNA) < bestDistance
                bestDistance = TotalDistance(word, DNA)
                bestWord = word
            (a, i) = NEXTVERTEX(a, i, L, k)
    return bestWord
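The pseudocode translates fairly directly into Python. The sketch below is one possible rendering, not the implementation used for the results in this deck; the function names and toy samples are illustrative, and each sample is assumed to be at least L long. The pruning is safe because the total distance of a prefix can never exceed the total distance of any word that extends it.

```python
def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def total_distance(word, samples):
    """Sum over samples of the fewest mismatches of `word` in any window."""
    n = len(word)
    return sum(min(hamming(word, s[p:p + n]) for p in range(len(s) - n + 1))
               for s in samples)

def next_vertex(a, i, L, k):
    """Step to the next tree node (descend if possible, else advance)."""
    if i < L:
        a[i] = 1
        return a, i + 1
    for j in range(L, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def bypass(a, i, L, k):
    """Skip the whole subtree below the current node."""
    for j in range(i, 0, -1):
        if a[j - 1] < k:
            a[j - 1] += 1
            return a, j
    return a, 0

def branch_and_bound(samples, L, alphabet="acgt"):
    k = len(alphabet)
    def translate(nums):
        return "".join(alphabet[x - 1] for x in nums)
    a, i = [1] * L, L                      # start at the first leaf, as on the slide
    best_distance, best_word = float("inf"), None
    while i > 0:
        if i < L:
            optimistic = total_distance(translate(a[:i]), samples)
            if optimistic > best_distance:
                a, i = bypass(a, i, L, k)  # no completion can beat the best so far
            else:
                a, i = next_vertex(a, i, L, k)
        else:
            word = translate(a)
            distance = total_distance(word, samples)
            if distance < best_distance:
                best_distance, best_word = distance, word
            a, i = next_vertex(a, i, L, k)
    return best_word, best_distance

toy_samples = ["ttacgtt", "gacgtgg", "ccacgtc"]
print(branch_and_bound(toy_samples, 4))
```

The answer is the same as brute force would give; only the number of nodes visited changes.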
Greedy Method • Picks a “good” solution. • Avoids backtracking. • Can give good results. • Generally, not the best possible solution. • But: FAST.
Greedy Method • Given t DNA samples (each of length n). • Find the optimal motif for the first two samples. • Lock that choice in place. • For the remainder of the samples: • for each DNA sample in turn • find the L-mer that best fits with the prior choices. • never backtrack.
Greedy Method: Steps (figure: t = 8 DNA samples) • Step 1: Grab the first two samples and find the optimal alignment (consider all starting points s1 and s2, and keep the largest score). • Step 2: Go through each remaining sample, successively finding the starting positions (s3, s4, ..., st) that give the best consensus score for all the choices made so far.
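Scoring (figure) • Alignment: the selected L-mers, stacked one per row. • Profile: the count of each letter (a, t, g, c) in each column of the alignment. • Consensus: a g g c a a c t (the most frequent letter in each column). • Scoring: 3 3 4 3 5 3 5 4 (the count of the consensus letter in each column).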
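The two greedy steps, with the consensus score as the objective, can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function names and toy samples are illustrative, ties are broken in favor of the earliest starting position, and every sample is assumed to be at least L long.

```python
from collections import Counter

def score(lmers):
    """Consensus score: for each column, the count of its most common letter."""
    return sum(max(Counter(column).values()) for column in zip(*lmers))

def windows(seq, L):
    """All (L-mer, start) pairs in seq."""
    return [(seq[p:p + L], p) for p in range(len(seq) - L + 1)]

def greedy_motif_search(samples, L):
    # Step 1: best pair of starting positions in the first two samples.
    best = None
    for w1, p1 in windows(samples[0], L):
        for w2, p2 in windows(samples[1], L):
            s = score([w1, w2])
            if best is None or s > best[0]:
                best = (s, [w1, w2], [p1, p2])
    _, chosen, starts = best
    # Step 2: for each remaining sample, keep the window that scores best
    # together with everything chosen so far; never revisit earlier choices.
    for seq in samples[2:]:
        w, p = max(windows(seq, L), key=lambda wp: score(chosen + [wp[0]]))
        chosen.append(w)
        starts.append(p)
    return starts, chosen

toy_samples = ["ttacgtt", "gacgtgg", "ccacgtc"]
starts, chosen = greedy_motif_search(toy_samples, 4)
print(starts, chosen, score(chosen))
```

Step 1 costs on the order of n² window comparisons and Step 2 about n per remaining sample, which is where the speed comes from; the price is that a poor choice in Step 1 is never undone.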
Motif Finding Example: n = 32, t = 16, L = 5 atgtgaaaaggcccaggctttgttgttctgat aatcagtttgtggctctctactatgtgcgctg catggcgtaagagcaggtgtacaccgatgctg taaatacacagattccttccgactttctgcat caagccttagctttagatctttgtctcccttt gagccatggactgtccgccagtatcttcctag cgccaactgcccgtttcgcagtgccatgttga agttcccagtcccgatcataggaatttgagca tagggatcgaatgagttgtcctagtcaatcct gtagctcctcaagggatacccacctatcgacg agccgcagcgacaacttgctcgctatctaact ccactccctaagcgctgaacaccggagttctg gaagtcttcttgctgacacattacttgctcgc gaatcgtcgtatgttttcgaccttggtggcat tctcaacatgccttcccctccccaggctatgc tgtgtctatcatcccgttagctacctaaatcg
Branch and Bound vs. Greedy (figure: the 16 samples shown once per method, with ***** marking the selected 5-mer in each sample) • Branch and Bound: consensus_string = ctccc, consensus_count = 12 13 12 13 13, final percent score = 78.75 • Greedy: consensus_string = atgtg, consensus_count = 14 10 11 12 10, final percent score = 71.25
Branch and Bound vs. Greedy: selected 5-mers • Branch and Bound: ggccc ctctc caccg cttcc ctccc cttcc ctgcc ttccc gtcct ctcct ctcgc ctccc ctcgc cgacc ctccc atccc • consensus_string = ctccc, count = 12 13 12 13 13, final percent score = 78.75 • Greedy: atgtg atgtg aggtg ttctg atctt atgga atgtt atttg atgag aaggg acttg aagcg aagtc atgtt acatg gtgtc • consensus_string = atgtg, count = 14 10 11 12 10, final percent score = 71.25
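The reported consensus strings, counts, and percent scores can be recomputed from the listed 5-mers. The sketch below assumes the percent score is the sum of the per-column consensus counts divided by t × L, which is consistent with the reported numbers (63/80 and 57/80); the function name `consensus_report` is illustrative.

```python
from collections import Counter

def consensus_report(lmers):
    """Return (consensus string, per-column counts, percent score)."""
    columns = list(zip(*lmers))
    top = [Counter(column).most_common(1)[0] for column in columns]
    consensus = "".join(letter for letter, _ in top)
    counts = [count for _, count in top]
    percent = 100.0 * sum(counts) / (len(lmers) * len(columns))
    return consensus, counts, percent

branch_and_bound_picks = ("ggccc ctctc caccg cttcc ctccc cttcc ctgcc ttccc "
                          "gtcct ctcct ctcgc ctccc ctcgc cgacc ctccc atccc").split()
greedy_picks = ("atgtg atgtg aggtg ttctg atctt atgga atgtt atttg "
                "atgag aaggg acttg aagcg aagtc atgtt acatg gtgtc").split()

print(consensus_report(branch_and_bound_picks))  # ('ctccc', [12, 13, 12, 13, 13], 78.75)
print(consensus_report(greedy_picks))            # ('atgtg', [14, 10, 11, 12, 10], 71.25)
```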
Example 2 n = 64 t = 16 L = 8 gattacttctcgcccccccgctaagtgtatttctctcgctacctactccgctatgcctacaaca tctaccggcattatctatcggcaatgggagcggtggtgatgcacctagcctactcctttgacta tggtccttactggcatcacgcaccgttcttggcggcctgtgcaatatcttgtccctaaataaat aactacggtcattagtgcgtaatcagcacagccgagccggataagcgacttgtaaccatcttcg gagcaagcatgcagtaggtaacgccaagagcggggctttagggagccgcaatcgggacagatct aaaggttctctggatctatagctcacaaatttgcaggggtacgacagagttatagagtgtacca ggcgctttcctcccgagcagagggaacgaacgaccataatgtaagagaatctttatgtccaagc cgtcctgtccatacgtatgttttcaaaactgcgtctagattagtgaggaacagatttaagattc atccagcaacttgtgcattcgtagggagcggacacaaaggacatgatcagacgaaacctatttt cctcaattgaggcccccccccagttgtccgaccgcacgaaccgcttcgcaaaagtgttgcccgc aaccacaccaagtattgctaatgcaccattcttatgtttttgagcagcaaagcgactacgctgt atataggaaaaatcttagtgcaccaagatttaacctgcactttgctttgaaatacaactgtcgg ctttcaataaatgttaattgcgttccctcacttgctcggtcgagtcgtatcgtattcgatcagg tagcgggcacgctcgctcgacgttcatccactcgatagagccggtcatttttcggaactagtaa ggaggaatgagtctacgtcgcgttaagacgaactttacgtgtgtgcaggcttattttcgtccac cctccgggggacgtagactgttcttccacagttctaggcggcgcggtcttggcttgaacaatga
Branch and Bound vs. Greedy (figure: the 16 samples shown once per method, with ******** marking the selected 8-mer in each sample) • Branch and Bound: consensus_string = ccatattt, count = 10 11 11 11 13 10 11 14, final percent score = 71.09375 • Greedy: consensus_string = cgtactcc, count = 11 10 13 11 10 12 10 8, final percent score = 66.40625
Neural Networks for Optimization • Bill Wolfe, California State University Channel Islands • Reference: Wolfe, W. J., “A Fuzzy Hopfield-Tank TSP Model,” INFORMS Journal on Computing, Vol. 11, No. 4, Fall 1999, pp. 329-344.
Neural Models • Simple processing units • Lots of them • Highly interconnected • Exchange excitatory and inhibitory signals • Variety of connection architectures/strengths • “Learning”: changes in connection strengths • “Knowledge”: connection architecture • No central processor: distributed processing
Simple Neural Model • a_i: activation • e_i: external input • w_ij: connection strength • Assume: w_ij = w_ji (“symmetric” network), so W = (w_ij) is a symmetric matrix.
Net Input • $net_i = \sum_j w_{ij} a_j + e_i$ • Vector Format: $net = W a + e$
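Dynamics • Basic idea: da/dt = net, so an activation rises while its net input is positive and falls while it is negative.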
Lower Energy • da/dt = net = -grad(E) seeks lower energy
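The claim da/dt = net = -grad(E) can be checked with a short derivation. The slides do not state the energy function, so this assumes the standard quadratic (Hopfield-style) energy and the usual net input net = W a + e; the symmetry of W is what makes the gradient come out cleanly.

```latex
\begin{aligned}
\mathrm{net}_i &= \sum_j w_{ij}\, a_j + e_i
  \qquad\text{(vector form: } \mathrm{net} = W a + e\text{)} \\
E(a) &= -\tfrac{1}{2}\, a^{\top} W a - e^{\top} a \\
\frac{\partial E}{\partial a_i}
  &= -\tfrac{1}{2} \sum_j \left( w_{ij} + w_{ji} \right) a_j - e_i
   = -\sum_j w_{ij}\, a_j - e_i
   = -\mathrm{net}_i
  \qquad\text{(using } w_{ij} = w_{ji}\text{)} \\
\frac{dE}{dt}
  &= \nabla E \cdot \frac{da}{dt}
   = -\lVert \mathrm{net} \rVert^{2} \le 0
\end{aligned}
```

Under these assumptions the energy never increases along a trajectory, and it stops changing only where net = 0.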
• Keeps the activation vector inside the hypercube boundaries • Encourages convergence to corners
A Neural Model • a_i: activation • e_i: external input • w_ij: connection strength • W symmetric (w_ij = w_ji)
Example: Inhibitory Networks • Completely inhibitory • w_ij = -1 for all i, j • winner-take-all • Inhibitory Grid • neighborhood inhibition • on-center, off-surround
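Winner-take-all behavior in a completely inhibitory network can be illustrated with a small simulation. This is a sketch under several assumptions that are not taken from the slides: net input of the form net = W a + e, a zero diagonal (no self-inhibition), simple Euler steps, hard clipping of each activation to [0, 1] as the way of staying inside the hypercube, and arbitrary external inputs.

```python
def simulate(W, e, a0, dt=0.01, steps=5000):
    """Euler-integrate da/dt = net = W a + e, clipping activations to [0, 1]."""
    a = list(a0)
    n = len(a)
    for _ in range(steps):
        net = [sum(W[i][j] * a[j] for j in range(n)) + e[i] for i in range(n)]
        # Clipping is an assumed stand-in for whatever keeps a in the hypercube.
        a = [min(1.0, max(0.0, a[i] + dt * net[i])) for i in range(n)]
    return a

n = 3
# Completely inhibitory: every pair of distinct units has weight -1.
W = [[0.0 if i == j else -1.0 for j in range(n)] for i in range(n)]
e = [0.5, 0.6, 0.4]            # unit 1 receives the largest external input
final = simulate(W, e, [0.0] * n)
print(final)                   # [0.0, 1.0, 0.0]
```

The unit with the largest external input ends at 1 and suppresses the others to 0, so the state settles at a corner of the hypercube.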