290 likes | 404 Views
Finding Compact Structural Motifs. Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo, Ontario, Canada j3qian@cs.uwaterloo.ca. Outline. Introduction to Structural Motif Related Work
E N D
Finding Compact Structural Motifs Presented By: Xin Gao Authors: Jianbo Qian, Shuai Cheng Li, Dongbo Bu, Ming Li, and Jinbo Xu University of Waterloo, Ontario, Canada j3qian@cs.uwaterloo.ca
Outline • Introduction to Structural Motif • Related Work • Compact Motif-finding Problem Formulation • NP-Hard of the Compact Motif-finding Problem • A Polynomial Time Approximate Scheme
Outline • Introduction to Structural Motif • Related Work • Compact Motif-finding Problem Formulation • NP-Hard of the Compact Motif-finding Problem • A Polynomial Time Approximate Scheme
Introduction • Protein is a sequence of amino acids. • A protein always folds into a specific 3-D shape. • Structures are important to proteins: • The functional properties of proteins depend on their 3-D structures. • Structures are more conserved than sequence during the evolution of proteins.
Structural Motif • Structural motif is a frequently occurring substructure of proteins. • Motifs are thought to be tightly related to protein functions. • Identifying motifs from a set of proteins can help us to know their evolutionary history and functions.
Structural Motif Finding Problem • Given a set of protein structures, to find the frequently occurring substructure. • Informally, to find one substructure from each protein, that exhibit the highest degree of similarity.
How to measure the similarity of two substructures? • Two popular measurements: • dRMSD: measure the root mean square Euclidean distance between the corresponding residues from different protein structures. • cRMSD: calculate the internal distance matrix for each protein, and compare the distance matrices for input structures.
Outline • Introduction to Structural Motif • Related Work • Compact Motif-finding Problem Formulation • NP-Hard of the Compact Motif-finding Problem • A Polynomial Time Approximate Scheme
Related Work • L.P.Chew proposed an iterative algorithm to compute the conserved shape and proved its convergence. (2002) • D. Bandyopadhyay applied graph-based data-mining tools to find the family-specific fingerprints. (2006) • M. Shatsky presented an algorithm to uncover the binding pattern. (2006) • DALI and CE attempt to identify structural alignment with minimal dRMSD. • STRUCTRAL and TM-Align employ heuristics to detect the alignment with minimal cRMSD.
Related Work (continued) • However, these methods are all heuristic; the solutions are not guaranteed to be optimal or near optimal. • The first PTAS for pairwise structural alignment: • R. Kolodny explored the Lipschitz property of the scoring function. (2004) • Though this algorithm can be extended to the case of multiple structure alignment, the simple extension has a time complexity exponential in the number of proteins. • Is there a PTAS to multiple structure motif finding?
Outline • Introduction to Structural Motif • Related Work • Compact Motif-finding Problem Formulation • NP-Hard of the Compact Motif-finding Problem • A Polynomial Time Approximate Scheme
We focus on (R, C)-Compact Motif. • What is (R, C)-compact motif? • A motif is bounded in a minimum ball with radius R. • In this ball, at most C residues do not belong to this motif. • (R,C)-compact motif is biologically meaningful since • We focus on globular proteins. • We allows at most C exceptions.
(R, C)-Compact Motif Finding Problem • Input: protein structures S1…, Sn, and length l • Output: a consensus consists of l 3D points • q=(q1, …, ql ) • a substructure ui from each protein Si • Objective: • min (1 ind2(q, ui))1/2 • Here, we adopt the dRMSD distance function,i.e., • d(q, ui)=min||q- (ui)||2 • consists of a rotation and a translation • ||*||2 is the Euclidean metric.
Outline • Introduction to Structural Motif • Related Work • Compact Motif-finding Problem Formulation • NP-Hard of the Compact Motif-finding Problem • A Polynomial Time Approximate Scheme
(R,C)-compact motif finding is still NP-Hard. • Reduction from the Sequence Consensus Problem • Input: n binary strings S1, …,Sn, each is of length m • Output: A substring ti of length l from each string Si, 1i n, • Objective: minimize 1 i <i’ ndH(ti, ti’), where dH is Hamming distance. • Basic Idea: • Try to find a way of reduction to make: dRMSD=Hamming Distance
(R,C)-compact motif finding is still NP-Hard. • Each l-mer is transformed into 6l 3D points. • 110 110 001 000000 111111 • 0(0, 2i, 0), 1(1, 2i, 0)
(R,C)-compact motif finding is still NP-Hard. • Each l-mer is transformed into 6l 3D points. • 110 110 001 000000 111111 • 0(0, 2i, 0), 1(1, 2i, 0) • The centroid will be (1/2, 2i, 0) (Easy translation) • Large “tail” no rotation • RMSD = Hamming Distance • Small distortion to each point to make it protein-like. • Sequence Consensus Problem (1,0)-Compact Motif Finding Problem
Outline • Introduction to Structural Motif • Related Work • Compact Motif-finding Problem Formulation • NP-Hard of the Compact Motif-finding Problem • A Polynomial Time Approximate Scheme
The Basic Idea of Our PTAS • There are always a few “important” sub-structures, whose consensus holds most of the “secrets” of the true optimal motif. • Therefore, if we can simply do exhaustive search to find these few sub-structures, then the trivial optimal solution for these sub-structures is a good approximation to the real optimal solution.
Technique 1: Sampling • We sample only r proteins, • consider each motif in a sampled protein, • we can say we almost know the optimal solution.
Sampling will introduce only a bit of error. • There is at least one selection schema, whose consensus has a cost value less than (1+1/r)OPT. • So, we can find this schema by simply enumerating operation.
Technique 2: Discretize the Rotation Space • Each rotation is parameterized by three angles • 1, 2, 3[0, 2) • Discretize the angles with step size ’ • we get an ’-rotation net.
Discretized rotation will not introduce a large error, either. • A parameterized algorithm for protein structure alignment. J. Xu, F. Jiao, and B. Berger. RECOMB2006.
Running Time • Each protein contains M motifs • M is a polynomial of protein length • Each motif can adopt W rotations • W depends on the constant • So the number of consensus is less than • O(nr(MW)r)= O((nMW)r)
Conclusion and Future Work • We prove the (R,C)-compact motif finding problem is NP-hard • We obtain a PTAS for this problem. • Future Work: • Further reduce the time complexity • Design some practical algorithms. • Solve a more general case.