Motif Discovery in Protein Sequences using Messy de Bruijn Graph Rupali Patwardhan Advisors: Dr. Mehmet Dalkilic Dr. Haixu Tang Rupali Patwardhan, Capstone Presentation
Outline of Presentation • Goal • Background and Motivation • Approach • Results • Future Work
Goal • To develop an algorithm that takes advantage of the properties of de Bruijn graphs to discover motifs in protein sequences
What is a motif? • A repeating pattern • VSKLIPKNRLMISTEWRSLGQQSPGWMHYMP • VMLPKDIAKLVPKTHLMSTEWRNRLGVQQSQG • SGVPRLLTASREWRNLGEPFIDQIHYSPRYAD • YRHVMLPKAMSTEWRSLGLKNPETGTLRILQE • GLGITQSLGWSREWRHTLGEPHILLFKREKDYQ
Why are motifs interesting? • They represent regions that have been conserved through evolution • Those regions are therefore likely to be important for the function of the protein (e.g. an active site) • Motifs can be used to classify proteins into families based on their functions, or to predict the function of a new protein
PS00059 • Zinc-containing alcohol dehydrogenases signature • G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC] • H is a zinc ligand
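A PROSITE pattern of this kind maps directly onto a regular expression. The sketch below is illustrative only: the helper name `prosite_to_regex` is hypothetical, and it handles just the syntax elements that appear in this pattern (single residues, `x`, repeat counts such as `x(2)`, and bracketed residue classes), not the full PROSITE syntax.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regex string.

    Supported elements only: a residue, x, [ABC], each optionally followed by (n).
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, count = m.groups()
        regex = "." if symbol == "x" else symbol  # x means "any residue"
        if count is not None:
            regex += "{" + count + "}"            # x(5) becomes .{5}
        parts.append(regex)
    return "".join(parts)

adh_regex = prosite_to_regex("G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]")
print(adh_regex)  # GHE.{2}G.{5}[GA].{2}[IVSAC]
```

The resulting string can be passed to `re.search` to scan a protein sequence for the signature.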
Motif Discovery Algorithms • There are two main categories • Stochastic algorithms, based on statistical significance (e.g. MEME, GIBBS) • Combinatorial algorithms, based on enumeration (e.g. PRATT, SPLASH)
Then why one more? • Existing algorithms: • are too slow or computationally expensive for massive inputs (e.g. MEME) • do not handle gapped motifs effectively • need the length or number of motifs to be specified in advance
What is a de Bruijn Graph? • A graph whose nodes are subsequences of the same length (l-tuples) and whose edges indicate that the subsequences of the two connected nodes overlap • E.g. an edge ACAT → CATS represents the sequence “ACATS”
[Figure: the sequence ABCDEFG as a path of 4-tuples: ABCD → BCDE → CDEF → DEFG]
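The construction can be sketched in a few lines. This is a minimal illustration, not the presentation's implementation; the function names `ltuples` and `de_bruijn_edges` are hypothetical, and the tuple length of 4 is taken from the examples.

```python
def ltuples(sequence: str, l: int = 4) -> list:
    """Return every overlapping l-tuple of the sequence, in order."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

def de_bruijn_edges(sequence: str, l: int = 4) -> list:
    """Connect each l-tuple to the next one, which overlaps it by l-1 letters."""
    nodes = ltuples(sequence, l)
    return list(zip(nodes, nodes[1:]))

edges = de_bruijn_edges("ABCDEFG")
for left, right in edges:
    # An edge spells l+1 letters: the left node plus the last letter of the right node.
    print(left, "->", right, "spells", left + right[-1])
```

For ABCDEFG this yields the three edges ABCD → BCDE, BCDE → CDEF and CDEF → DEFG.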
Applying this to Identify Repeating Subsequences • Given a set of sequences, we keep adding the corresponding nodes and edges to our de Bruijn graph. • If a subsequence is repeated, the corresponding edge is already present in the graph. • In that case we just increment the weight of that edge. • Eventually the edges corresponding to highly repeated subsequences have higher weights. • We can then find the motif by following the graph along the edges whose weights are above a specified threshold.
[Figure: graph after adding sequence 1 (PAKARCDEKD): PAKA → AKAR → KARC → ARCD → RCDE → CDEK → DEKD, every edge with weight 1]
[Figure: graph after also adding sequence 2 (NARCDEKHKH): NARC → ARCD → RCDE → CDEK → DEKH → EKHK → KHKH. The edges ARCD → RCDE and RCDE → CDEK are shared by both sequences and have weight 2; all other edges have weight 1]
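The two-sequence example can be reproduced with a small sketch. The names `add_sequence` and `heavy_paths` are hypothetical, and the traversal makes a simplifying assumption that the presentation does not state: the edges above the threshold form simple chains with no branches or cycles.

```python
from collections import Counter

def add_sequence(weights: Counter, sequence: str, l: int = 4) -> None:
    """Increment the weight of every edge (pair of consecutive l-tuples)."""
    nodes = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    for edge in zip(nodes, nodes[1:]):
        weights[edge] += 1

def heavy_paths(weights: Counter, threshold: int) -> list:
    """Spell out the chains formed by edges whose weight meets the threshold.

    Assumes the heavy edges form non-branching, acyclic chains.
    """
    heavy = {a: b for (a, b), w in weights.items() if w >= threshold}
    targets = set(heavy.values())
    motifs = []
    for node in heavy:
        if node in targets:
            continue  # not the start of a chain
        text = node
        while node in heavy:
            node = heavy[node]
            text += node[-1]  # each step adds one new residue
        motifs.append(text)
    return motifs

weights = Counter()
for seq in ["PAKARCDEKD", "NARCDEKHKH"]:
    add_sequence(weights, seq)

print(heavy_paths(weights, threshold=2))  # ['ARCDEK']
```

With a threshold of 2, only the two shared edges survive, and following them spells ARCDEK, the subsequence common to both inputs.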
Making them Messy • In protein sequences, some amino acid residues can be substituted by others without affecting the function of the protein. • So a sequence can be considered ‘similar’ to an edge even though it is not identical. • Similarity between amino acid residues is determined using standard scoring matrices, such as BLOSUM62. • In that case, we increment the weights of all edges that represent sequences ‘similar’ to the one in question.
Example • Consider the same 2 sequences as before, but with K replaced by R in one of them. • PAKARCDERD • NARCDEKHKH • As per BLOSUM62, the K → R substitution has a positive substitution score.
[Figure: graph for PAKARCDERD and NARCDEKHKH before adjustment. The shared edge ARCD → RCDE has weight 2; the path then splits into RCDE → CDER and RCDE → CDEK, each with weight 1, as do all remaining edges]
[Figure: the same graph after adjustment. The two similar edges RCDE → CDER and RCDE → CDEK each have weight 1.4]
Adjusting the weights to account for messiness • Suppose edge A is under consideration, and edges B and C originating from the same node as A are similar to A. • W_A' = W_A + W_B * s(A,B) + W_C * s(A,C)
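The adjustment rule can be checked against the example with a short sketch. The function `adjusted_weights` and the `toy_similarity` score are hypothetical; the value 0.4 is simply the similarity that reproduces the 1.4 shown in the example, since the presentation does not say how s(A,B) is derived from BLOSUM62.

```python
def adjusted_weights(out_edges: dict, similarity) -> dict:
    """Apply W_A' = W_A + sum of W_B * s(A, B) over the other edges B
    leaving the same node.

    out_edges maps each target node of one source node to its raw weight.
    similarity(a, b) is supplied by the caller; 0 means "not similar".
    """
    adjusted = {}
    for a, w_a in out_edges.items():
        bonus = sum(w_b * similarity(a, b)
                    for b, w_b in out_edges.items() if b != a)
        adjusted[a] = w_a + bonus  # raw weights are used, so order does not matter
    return adjusted

# Edges leaving node RCDE in the example, each seen once.
raw = {"CDER": 1, "CDEK": 1}

def toy_similarity(a: str, b: str) -> float:
    # Assumed score for the K/R pair, chosen to match the example.
    return 0.4 if {a, b} == {"CDER", "CDEK"} else 0.0

print(adjusted_weights(raw, toy_similarity))
```

Both edges end up with weight 1 + 1 × 0.4 = 1.4, matching the adjusted graph.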
Limitation of this Approach • The motif must contain at least a few contiguous conserved amino acid residues • So the method may fail if the motif consists of alternating residues • E.g. AxAxCxDxAxGxC (x could be any residue) or AxCDxGxRGxC: these motifs would not lead to high-weight edges in the de Bruijn graph • The problem is due to the need for overlaps, which is inherent to de Bruijn graphs
Gapped Version • For each node, we also create the nodes obtained by applying a gap mask (or “Don't care” mask) to that node • We currently restrict the maximal number of “Don't cares” in a node to 2 • There are 10 such masks
Gapped Version • Let ‘1’ represent a conserved amino acid and ‘0’ represent a gap or “Don't care” • Then the unmasked node is 1111, and the 10 masks can be represented as: 0111, 1110, 1011, 1101, 1100, 0011, 1001, 0110, 1010, 0101
Masking Example • If ANCD is the node that we are applying the mask to: • ANCD * 1001 = AxxD • ANCD * 1101 = ANxD • ANCD * 1011 = AxCD
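The masks and the masking operation can be sketched as follows. The helper names `gap_masks` and `apply_mask` are hypothetical; the tuple length of 4 and the limit of 2 “Don't cares” come from the slides.

```python
from itertools import combinations

def gap_masks(l: int = 4, max_gaps: int = 2) -> list:
    """All masks of length l with between 1 and max_gaps zeros."""
    masks = []
    for n_gaps in range(1, max_gaps + 1):
        for gap_positions in combinations(range(l), n_gaps):
            masks.append("".join("0" if i in gap_positions else "1"
                                 for i in range(l)))
    return masks

def apply_mask(node: str, mask: str) -> str:
    """Keep residues under a '1'; replace residues under a '0' with 'x'."""
    return "".join(res if bit == "1" else "x" for res, bit in zip(node, mask))

masks = gap_masks()
print(len(masks))  # 10: four masks with one gap, six with two
for mask in ["1001", "1101", "1011"]:
    print(mask, apply_mask("ANCD", mask))
```

The count of 10 follows from choosing 1 or 2 gap positions out of 4: 4 + 6 = 10.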
[Figure: three sequence fragments, 1. …ARCDM…, 2. …ANCDE…, 3. …ASCDT…, give three separate unmasked edges, ARCD → RCDM, ANCD → NCDE and ASCD → SCDT, each with weight 1]
[Figure: after masking, each fragment gives the same edge AxCD → xCDx, with weight 1 per fragment]
[Figure: the three identical masked edges merge into a single edge AxCD → xCDx with weight 3]
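The effect of masking on edge weights can be reproduced with a small sketch. The function names are hypothetical, and only one pair of masks is applied here (1011 on the left node, 0110 on the right node), whereas the method described above applies all masks. The two masks are chosen so that they agree on the three overlapping positions.

```python
from collections import Counter

def mask_node(node: str, mask: str) -> str:
    """Replace residues under a '0' in the mask with 'x'."""
    return "".join(r if bit == "1" else "x" for r, bit in zip(node, mask))

def masked_edge_weights(sequences, left_mask: str, right_mask: str) -> Counter:
    """Count edges between consecutive l-tuples after masking both end nodes."""
    weights = Counter()
    l = len(left_mask)
    for seq in sequences:
        nodes = [seq[i:i + l] for i in range(len(seq) - l + 1)]
        for a, b in zip(nodes, nodes[1:]):
            edge = (mask_node(a, left_mask), mask_node(b, right_mask))
            weights[edge] += 1
    return weights

fragments = ["ARCDM", "ANCDE", "ASCDT"]
unmasked = masked_edge_weights(fragments, "1111", "1111")
masked = masked_edge_weights(fragments, "1011", "0110")
print(unmasked)  # three different edges, each with weight 1
print(masked)    # one edge, AxCD -> xCDx, with weight 3
```

Without masking, no edge is repeated; with masking, the three fragments reinforce a single edge, which is what lets a gapped motif rise above the threshold.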
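Gapped nodes for a sequence containing the motif AxCDxxGH (one row per mask, one column per successive 4-tuple)
1111: ANCD NCDE CDEF DEFG EFGH
1011: AxCD NxDE CxEF DxFG ExGH
1101: ANxD NCxE CDxF DExG EFxH
1100: ANxx NCxx CDxx DExx EFxx
0011: xxCD xxDE xxEF xxFG xxGH
1001: AxxD NxxE CxxF DxxG ExxH
0110: xNCx xCDx xDEx xEFx xFGx
1010: AxCx NxDx CxEx DxFx ExGx
0101: xNxD xCxE xDxF xExG xFxH
. . .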
Implementation • The algorithm is implemented in Perl • Web Interface • http://biokdd.informatics.indiana.edu/rpatward/deBruijn/project.html
Issues in Testing Motif Discovery Algorithms • There is no benchmark dataset. • It is difficult to compare different algorithms, since they have very different kinds of parameters. • Some motifs are easier to find than others.
Test I • The first 100 PROSITE patterns and their corresponding protein families were used as the test dataset to assess the accuracy of the output. • The output of the program was compared to that of MEME and PRATT.
Results I • For MEME and PRATT, the top 3 motifs were considered.
Test II • We also tested families corresponding to 162 PROSITE patterns that did not have any contiguous conserved amino acid residues, but had at least one occurrence of alternating conserved amino acid residues.
Results II
MEME was run on an IBM SP cluster, on 8 processors in parallel
Future Work • Categorizing easy and difficult motifs. • Extending this approach to consensus-based multiple sequence alignment. • Predicting whether a given protein sequence is likely to belong to a particular family.
Acknowledgements • Dr. Mehmet Dalkilic • Dr. Haixu Tang • Dr. Sun Kim • Bioinformatics Research Group