Motif Discovery in Protein Sequences using Messy de Bruijn Graph

Presentation Transcript


  1. Motif Discovery in Protein Sequences using Messy de Bruijn Graph. Rupali Patwardhan. Advisors: Dr. Mehmet Dalkilic, Dr. Haixu Tang. Rupali Patwardhan, Capstone Presentation

  2. Outline of Presentation: • Goal • Background and Motivation • Approach • Results • Future Work

  3. Goal: To develop an algorithm that can take advantage of the properties of the de Bruijn graph to discover motifs in protein sequences

  4. What is a motif? • A repeating pattern • VSKLIPKNRLMISTEWRSLGQQSPGWMHYMP • VMLPKDIAKLVPKTHLMSTEWRNRLGVQQSQG • SGVPRLLTASREWRNLGEPFIDQIHYSPRYAD • YRHVMLPKAMSTEWRSLGLKNPETGTLRILQE • GLGITQSLGWSREWRHTLGEPHILLFKREKDYQ

  5. Why are motifs interesting? • They represent regions that have been conserved through evolution • So those regions are likely to be important for the function of the protein (e.g. an active site) • Motifs can be used to classify proteins into families based on their functions, or to predict the function of a new protein

  6. PS00059, Zinc-containing alcohol dehydrogenases signature: G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC], where H is a zinc ligand
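To make the pattern notation on this slide concrete, here is a minimal Python sketch that translates a simple PROSITE pattern into a regular expression. It is not part of the presentation; the function name `prosite_to_regex` is hypothetical, and the sketch only handles single residues, `x`, `[..]`, `{..}` and repeat counts such as `(2)` or `(2,4)`, not the anchoring symbols `<` and `>`.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as "(2)" or "(2,4)".
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                      # x matches any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..} lists excluded residues
        # [..] and single letters are already valid regex fragments.
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)

# The zinc-containing alcohol dehydrogenase signature from the slide.
regex = prosite_to_regex("G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]")
print(regex)
```

The resulting expression can be passed to `re.search` to test whether a protein sequence contains the signature.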

  7. Motif Discovery Algorithms • There are two main categories • Stochastic algorithms, based on statistical significance (e.g. MEME, GIBBS) • Combinatorial algorithms, based on enumeration (e.g. PRATT, SPLASH)

  8. Then why one more? • Existing algorithms: • Are too slow or computationally expensive for massive inputs (e.g. MEME) • Do not handle gapped motifs effectively • Need the length/number of the motifs to be specified in advance

  9. What is a de Bruijn Graph? • A graph whose nodes are subsequences of the same length (l-tuples) and whose edges indicate that the subsequences of the two connected nodes overlap • E.g. an edge ACAT → CATS represents the sequence “ACATS”
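The definition above can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's Perl code; the function name `de_bruijn_edges` and the default tuple length `l = 4` (taken from the slide examples) are assumptions.

```python
def de_bruijn_edges(sequence: str, l: int = 4):
    """Return the edges (pairs of overlapping l-tuples) spelled by a sequence."""
    # Every window of length l is a node.
    nodes = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    # Consecutive windows overlap in l - 1 residues and are joined by an edge.
    return list(zip(nodes, nodes[1:]))

for left, right in de_bruijn_edges("ABCDEFG"):
    # An edge spells the left node plus the last residue of the right node.
    print(left, "->", right, "spells", left + right[-1])
```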

  10. [Figure: the sequence ABCDEFG represented as the de Bruijn path ABCD → BCDE → CDEF → DEFG]

  11. Applying this to Identify Repeating Subsequences • Given a set of sequences, we keep adding the corresponding nodes and edges to our de Bruijn graph. • If a subsequence is repeated, the corresponding edge will already be present in the graph. • In that case we just increment the weight of that edge. • Eventually the edges corresponding to highly repeated subsequences will have higher weights. • We can then find the motif by following the graph along the edges whose weights are above a specified threshold.
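The steps on this slide can be sketched as follows, using the two example sequences that appear on the slides after it. This is a simplified Python illustration, not the presented implementation: the names `edge_weights` and `heavy_paths` are hypothetical, only the first heavy successor of each node is followed, and paths that form a pure cycle are ignored.

```python
from collections import Counter

def edge_weights(sequences, l=4):
    """Count how often each de Bruijn edge occurs across all sequences."""
    weights = Counter()
    for seq in sequences:
        nodes = [seq[i:i + l] for i in range(len(seq) - l + 1)]
        for edge in zip(nodes, nodes[1:]):
            weights[edge] += 1
    return weights

def heavy_paths(weights, threshold):
    """Spell out paths that use only edges with weight >= threshold."""
    heavy = sorted(e for e, w in weights.items() if w >= threshold)
    nxt = {}
    for left, right in heavy:
        nxt.setdefault(left, right)   # simplification: first successor only
    targets = {right for _, right in heavy}
    paths = []
    for start in nxt:
        if start in targets:
            continue                  # not the beginning of a heavy path
        node, spelled, seen = start, start, {start}
        while node in nxt and nxt[node] not in seen:
            node = nxt[node]
            seen.add(node)
            spelled += node[-1]       # each edge adds one residue
        paths.append(spelled)
    return paths

weights = edge_weights(["PAKARCDEKD", "NARCDEKHKH"])
print(heavy_paths(weights, threshold=2))
```

With a threshold of 2, only the edges shared by both sequences survive, so the path spells the subsequence the two sequences have in common.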

  12. [Figure: de Bruijn graph for sequence 1, PAKARCDEKD, with nodes PAKA, AKAR, KARC, ARCD, RCDE, CDEK, DEKD; every edge has weight 1]

  13. [Figure: the graph after adding sequence 2, NARCDEKHKH, to sequence 1, PAKARCDEKD; nodes PAKA, AKAR, KARC, NARC, ARCD, RCDE, CDEK, DEKD, DEKH, EKHK, KHKH; the shared edges ARCD → RCDE and RCDE → CDEK have weight 2, all other edges have weight 1]

  14. [Figure: the graph from slide 13, repeated]

  15. Making them Messy • In the context of protein sequences, some amino acid residues can be substituted by others without affecting the function of the protein. • So a sequence can be considered 'similar' to an edge even though it is not identical. • Similarity between amino acid residues is determined using standard scoring matrices, such as BLOSUM62. • In that case, we increment the weights of all edges that represent sequences that are ‘similar’ to the one in question.

  16. Example • Consider the same 2 sequences as before, but with K replaced by R in one of them. • PAKARCDERD • NARCDEKHKH • As per BLOSUM62, the K → R substitution has a positive substitution score.

  17. [Figure: de Bruijn graph for PAKARCDERD and NARCDEKHKH before the messy adjustment; nodes PAKA, AKAR, KARC, NARC, ARCD, RCDE, CDER, CDEK, DERD, DEKH, EKHK, KHKH; the shared edge ARCD → RCDE has weight 2, all other edges have weight 1]

  18. [Figure: the same graph after the messy adjustment; two of the weight-1 edges now carry weight 1.4 (by the K/R similarity these would be RCDE → CDER and RCDE → CDEK), and the remaining weights are unchanged]

  19. Adjusting the weights to account for messiness • Suppose edge A is under consideration, and edges B and C, originating from the same node as A, are similar to A. Then $W_A' \leftarrow W_A + W_B \, s(A,B) + W_C \, s(A,C)$
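The update rule on this slide can be sketched in Python as below. Everything here is an assumption made for illustration: the function names are hypothetical, all adjusted weights are computed from the original weights rather than updated in sequence, and `toy_similarity` is a stand-in that returns a made-up value of 0.4 for a K/R swap (chosen only to reproduce the 1.4 shown in the preceding figure). The slides do not state how s is derived from BLOSUM62.

```python
def adjust_weights(weights, similarity):
    """Apply W_A' = W_A + sum of W_B * s(A, B) over edges B leaving A's source node."""
    adjusted = {}
    for a, w_a in weights.items():
        total = w_a
        for b, w_b in weights.items():
            if b != a and b[0] == a[0]:   # B originates from the same node as A
                total += w_b * similarity(a, b)
        adjusted[a] = total
    return adjusted

def toy_similarity(a, b):
    # Hypothetical score: 0.4 when the target nodes differ only by K/R swaps.
    diffs = [(x, y) for x, y in zip(a[1], b[1]) if x != y]
    if diffs and all({x, y} == {"K", "R"} for x, y in diffs):
        return 0.4
    return 0.0

# Three edges from the graph for PAKARCDERD and NARCDEKHKH.
weights = {
    ("ARCD", "RCDE"): 2,
    ("RCDE", "CDER"): 1,
    ("RCDE", "CDEK"): 1,
}
adjusted = adjust_weights(weights, toy_similarity)
for edge, weight in adjusted.items():
    print(edge, weight)
```

In this toy run the two edges leaving RCDE reinforce each other, while the edge ARCD → RCDE has no similar sibling and keeps its weight.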

  20. Limitation of this Approach • The motif must contain at least a few contiguous amino acid residues • So the method may fail if the motif consists of alternating residues • E.g. AxAxCxDxAxGxC (x could be any residue) or AxCDxGxRGxC, since these motifs would not lead to high-weight edges in the de Bruijn graph • The problem stems from the need for overlaps, which is inherent to de Bruijn graphs

  21. Gapped Version • For each node, we also create nodes obtained by applying a gap mask (or “Don't care” mask) to that node • We currently restrict the maximum number of “Don't cares” in a node to 2 • There are 10 such masks with at least one “Don't care”

  22. Gapped Version • Let ‘1’ represent a conserved amino acid and ‘0’ represent a gap or “Don’t care” • Then the 10 gapped masks, together with the ungapped mask 1111, can be represented as: 1111, 0111, 1110, 1011, 1101, 1100, 0011, 1001, 0110, 1010, 0101
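As a quick check of the mask list, the following Python sketch enumerates every mask of length 4 with at most two “Don't care” positions. The function name `gap_masks` and its parameters are hypothetical.

```python
from itertools import product

def gap_masks(l=4, max_gaps=2):
    """Enumerate all 0/1 masks of length l with at most max_gaps zeros."""
    masks = []
    for bits in product("10", repeat=l):
        mask = "".join(bits)
        if mask.count("0") <= max_gaps:
            masks.append(mask)
    return masks

masks = gap_masks()
gapped = [m for m in masks if "0" in m]
print(len(masks), "masks in total,", len(gapped), "with at least one gap")
```

The count follows from C(4,0) + C(4,1) + C(4,2) = 1 + 4 + 6 = 11, of which 10 contain at least one gap.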

  23. Masking Example • If ANCD is the node that we are applying the mask to • ANCD * 1001 = AxxD • ANCD * 1101 = ANxD • ANCD * 1011 = AxCD
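The masking operation in this example amounts to a position-wise substitution. A minimal Python sketch, with the hypothetical helper name `apply_mask`:

```python
def apply_mask(node: str, mask: str) -> str:
    """Keep residues where the mask is '1' and write 'x' where it is '0'."""
    if len(node) != len(mask):
        raise ValueError("node and mask must have the same length")
    return "".join(res if bit == "1" else "x" for res, bit in zip(node, mask))

for mask in ("1001", "1101", "1011"):
    print("ANCD *", mask, "=", apply_mask("ANCD", mask))
```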

  24. [Figure: three sequences, 1. …ARCDM…, 2. …ANCDE…, 3. …ASCDT…, give three separate edges ARCD → RCDM, ANCD → NCDE and ASCD → SCDT, each with weight 1]

  25. [Figure: after masking, each of the three edges becomes AxCD → xCDx, each still shown with weight 1]

  26. [Figure: the three identical masked edges are merged into a single edge AxCD → xCDx with weight 3]
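The merging shown in these three figures can be reproduced with a short Python sketch. The slides do not say how masks on the two ends of an edge are paired, so this sketch assumes that any two masks may be combined as long as they agree on the three overlapping positions; that rule and all function names are assumptions.

```python
from collections import Counter
from itertools import product

def gap_masks(l=4, max_gaps=2):
    """All 0/1 masks of length l with at most max_gaps zeros."""
    all_masks = ("".join(bits) for bits in product("10", repeat=l))
    return [m for m in all_masks if m.count("0") <= max_gaps]

def apply_mask(node, mask):
    """Keep residues where the mask is '1' and write 'x' where it is '0'."""
    return "".join(res if bit == "1" else "x" for res, bit in zip(node, mask))

def masked_edge_weights(sequences, l=4, max_gaps=2):
    """Count edges between masked nodes across all sequences."""
    masks = gap_masks(l, max_gaps)
    weights = Counter()
    for seq in sequences:
        for i in range(len(seq) - l):
            u, v = seq[i:i + l], seq[i + 1:i + 1 + l]
            for mu, mv in product(masks, repeat=2):
                # Assumed rule: the two masks must agree on the overlap.
                if mu[1:] == mv[:-1]:
                    weights[apply_mask(u, mu), apply_mask(v, mv)] += 1
    return weights

weights = masked_edge_weights(["ARCDM", "ANCDE", "ASCDT"])
print(weights[("AxCD", "xCDx")])   # masked edge shared by all three sequences
print(weights[("ARCD", "RCDM")])   # unmasked edge seen in one sequence only
```

Under this assumption the masked edge AxCD → xCDx collects one count from each of the three sequences, while each unmasked edge keeps a weight of 1.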

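  27. [Figure: gapped nodes for the sequence ANCDEFGH, labeled with the motif AxCDxxGH. One row per mask, one column per node; further rows are elided in the original.]
      1111: ANCD NCDE CDEF DEFG EFGH
      1011: AxCD NxDE CxEF DxFG ExGH
      1101: ANxD NCxE CDxF DExG EFxH
      1100: ANxx NCxx CDxx DExx EFxx
      0011: xxCD xxDE xxEF xxFG xxGH
      1001: AxxD NxxE CxxF DxxG ExxH
      0110: xNCx xCDx xDEx xEFx xFGx
      1010: AxCx NxDx CxEx DxFx ExGx
      0101: xNxD xCxE xDxF xExG xFxH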

  28. Implementation • The algorithm is implemented in Perl • Web Interface: http://biokdd.informatics.indiana.edu/rpatward/deBruijn/project.html

  29. Issues in Testing Motif Discovery Algorithms • There is no benchmarking dataset. • It is difficult to compare different algorithms, since they take very different kinds of parameters. • Some motifs are easier to find than others.

  30. Test I • The first 100 PROSITE patterns and their corresponding protein families were used as the test dataset to assess the accuracy of the output. • The output of the program was compared to that of MEME and PRATT.

  31. Results I: For MEME and PRATT, the top 3 motifs were considered.

  32. Test II • We also tested families corresponding to 162 PROSITE patterns that did not have any contiguous conserved amino acid residues, but had at least one occurrence of alternating conserved amino acid residues.

  33. Results II

  34. MEME was run on an IBM SP cluster, on 8 processors in parallel

  35. Future Work • Categorizing easy and difficult motifs. • Extending this approach to consensus-based multiple sequence alignment. • Predicting whether a given protein sequence is likely to belong to a particular family.

  36. Acknowledgements • Dr. Mehmet Dalkilic • Dr. Haixu Tang • Dr. Sun Kim • Bioinformatics Research Group
