1 / 1

Protein Domain Finding Problem

Discover protein domains by identifying statistically over-represented regions in amino acid sequences using a clever and efficient algorithm, the Markov Clustering Algorithm.

nessa
Download Presentation

Protein Domain Finding Problem

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Biological Background Proteins are the basic mechanism through which cell processes are initiated, and understanding their structure and function is key to understanding how cells operate. Proteins consist of protein domains, or sequences of amino acids which are self-stabilizing and which fold independently of the rest of the protein. Protein domains figure prominently in the biological function of the protein they belong to. Furthermore, many domains are not unique to a protein family, but appear in many different proteins. Thus studying individual protein domains provides insight into the geometric structure and function of many different proteins. This information, in turn, gives us a key to understanding and inferring cell behavior. Being able to correctly identify these domains in the amino acid sequence is the first step. Problem Formulation From the computer science point of view, we have several sequences (2 to 100) of length 500 to 50000 amino acids. We are trying to identify the protein domains by looking for regions that are statistically over-represented. These regions can be of length 30 to 3000 letters. The trivial approach of looking for exact sequence matches doesn't work for two reasons. First of all, we don't know the exact length of the domain (and the input size is so large that we can't try out all lengths). Second, domains have mutations and insertions within them, so searching for exact matches won’t yield good results. Thus we have to come up with a more clever and efficient algorithm. Markov Clustering Algorithm This is the key algorithm in our approach. Given an adjacency matrix A of a graph, MCA produces a forest (collection of trees) such that each tree corresponds to a cluster of highly inter-connected vertices. The algorithm works by alternately executing Expansion and Inflation steps on the matrix. Expansion is A = Ax where x is an appropriately chosen coefficient based on graph connectivity. Inflation is raising every entry of A to some power y, followed by column normalization: aijA, aij = aijy aijA, aij = aij / anj n Protein Domain Finding Problem Protein Domain Finding Problem Olga Russakovsky, Eugene Fratkin, Phuong Minh Tu, Serafim Batzoglou Algorithm Step 1: Creating a graph of k-mers First, we break the sequences into k-mers (subsequences of length k) including overlaps. These k-mers will be the vertices of our graph. For each pair of k-mers, we compute their similarity score using the BLOSUM matrix and, if that score is greater than a certain threshold, we add an edge to the graph. Hence we get a sparse graph G = (V, E) which is represented as an adjacency matrix A. Clusters of kmers Amino acid sequences Protein domains to be discovered Amino acid sequences Step 3: Creating a graph of clusters In this step our goal is to identify sets that belong to the same domains. We compute pairwise similarity between clusters and create a corresponding adjacency matrix. This is similar to the first step, but now we are not interested in k-mer similarity but rather structural similarity of sets. Step 4: Clustering the clusters This step is identical to step 2, except the adjacency matrix is for clusters, not individual k-mers. Markov Clustering Algorithm now produces sets of clusters. We assume that clusters within the same set must contain anchoring points in repetitions of the same domain. Step 5: Combining clusters together to form domains By combining the sets of clusters obtained from step 4 and adding the amino acids in between two similar anchoring points, we get an outline of the domain. We can then extend these outlines using Dynamic Programming. The extension has not yet been implemented. Protein domains to be discovered -AABABBABA- 1)AABA 2)ABAB 6) BABA 5)BBAB 3)BABB 4)ABBA A = Step 2: Clustering the k-mers The first step was easy, now comes the main portion of the algorithm. We try to cluster the graph using the Markov Clustering Algorithm. We then assume that each resulting cluster is a set of anchoring points such that there is a unique isomorphism between these points and repetitious occurrences of a domain. In other words, each domain contains an instance of an anchoring point. The number of such clusters is much greater than the number of domains, hence each instance of a domain contains many anchoring points from separate sets. Clusters of kmers Amino acid sequences Protein domains to be discovered Domain outlines References: Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. Many thanks to Marina Sirota for assistance in poster preparation.

More Related