An Efficient Index-based Protein Structure Database Searching Method 陳冠宇
Introduction
• More than 18,000 protein structures were stored in the PDB as of September 2002
• 3D structural comparison and database searching: other methods search the database exhaustively
• Their design philosophy:
  • Filter-and-refine
  • Index-based searching
• Result: 16 times faster than DALI
Filter-and-Refine
[Figure: query → ProtDex over a database of 20,000 proteins → top 100 proteins → actual alignment → result]
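The filter-and-refine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `cheap_score` and `expensive_align` are hypothetical placeholders for the fast filter and the actual alignment, and the filter here scores every candidate in turn, whereas ProtDex answers the filter step from its index.

```python
def filter_and_refine(query, database, cheap_score, expensive_align, top_k=100):
    """Rank `database` against `query` in two stages.

    cheap_score and expensive_align are hypothetical callables that take
    (query, candidate) and return a number, higher meaning more similar.
    """
    # Filter: rank all candidates with the cheap score and keep the best top_k.
    shortlist = sorted(database,
                       key=lambda p: cheap_score(query, p),
                       reverse=True)[:top_k]
    # Refine: run the costly alignment on the shortlist only.
    refined = [(p, expensive_align(query, p)) for p in shortlist]
    return sorted(refined, key=lambda pair: pair[1], reverse=True)
```

With a database of 20,000 proteins and `top_k=100`, the expensive step runs 100 times instead of 20,000.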
Problem Definition
• Protein structures
• 3D structural comparison
• Structural database searching
A protein is composed of a sequence of amino acid (AA) residues.
• SSE: secondary structure element (e.g., helices, sheets)
• Loop regions: no specific shape
Sequence Comparison vs. Structural Comparison
• Sequence comparison cannot determine the similarity of two remotely homologous proteins.
• Structural comparison superimposes one protein structure on another to obtain the minimum root-mean-square deviation (RMSD) between them; this costs O(n^4 m^4).
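For reference, the RMSD itself is cheap to compute once a residue correspondence and a superposition are fixed; the expensive part is searching for the superposition that minimizes it. A minimal sketch, assuming both structures are given as equal-length lists of already-superimposed (x, y, z) tuples (the function name `rmsd` is illustrative):

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired coordinate lists.

    Assumes point i of coords_a corresponds to point i of coords_b and that
    the structures are already superimposed; no optimization is done here.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```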
The ProtDex Method
• Step 1: Extracting information from the PDB database
• Step 2: Building intra-molecular distance matrices
  • Design rationale: two protein structures are similar if their distance matrices are similar
• Step 3: Cutting fixed-size matrices and extracting properties
• Step 4: Building the inverted file index
Step 1: Extracting Information
• For each protein chain in a PDB file:
  • PDB id and chain id; number of AA residues; number of SSEs
• For each AA residue:
  • 3D coordinate (x, y, z) of the Cα carbon
• For each SSE:
  • SSE type (helix or sheet); SSE start position; SSE length
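One way to hold the extracted fields is a pair of small record types. The sketch below is an assumption about layout, not the authors' data format: the class and field names are hypothetical, and the type codes "H" (helix) and "E" (sheet) follow the SSE(H)/SSE(E) labels used later in the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SSE:
    sse_type: str  # "H" for helix, "E" for sheet
    start: int     # residue position where the SSE begins (indexing base is an assumption)
    length: int    # number of residues in the SSE

@dataclass
class ChainRecord:
    pdb_id: str
    chain_id: str
    # One (x, y, z) carbon coordinate per AA residue.
    coords: List[Tuple[float, float, float]] = field(default_factory=list)
    sses: List[SSE] = field(default_factory=list)

    @property
    def num_residues(self) -> int:
        return len(self.coords)

    @property
    def num_sses(self) -> int:
        return len(self.sses)
```

The residue and SSE counts are derived from the lists rather than stored separately, so they cannot drift out of sync with the data.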
Step 2: Representation - Building Distance Matrices
[Figure: example protein 9xxxx with 7 AA residues]
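An intra-molecular distance matrix stores the Euclidean distance between every pair of residues in one chain. A minimal sketch, assuming the chain is given as a list of (x, y, z) tuples, one per residue (the function name is illustrative):

```python
import math

def distance_matrix(coords):
    """Return the symmetric n x n matrix of pairwise Euclidean distances."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # requires Python 3.8+
            matrix[i][j] = d
            matrix[j][i] = d  # distances are symmetric
    return matrix
```

For a protein with 7 AA residues this yields a 7×7 matrix with zeros on the diagonal. Because the matrix depends only on distances within the chain, it does not change when the structure is rotated or translated.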
Step 3-1: Contact Patterns & Fixed-Size Matrices
[Figure: SSE(H) and SSE(E) contact patterns; fixed-size matrix]
Step 3-2: Extracting Properties
• For the 2×2 sub-matrix starting at cell (2,2), we store the values: 8, HH, (3,3), (1,1), (1,1)
• For the 2×2 sub-matrix starting at cell (3,6), we store the values: 49, HE, (3,2), (1,2), (2,1)
• And so on for the remaining sub-matrices.
Step 4: Building Inverted File Index
• Implemented as a sorted list
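An inverted file kept as a sorted list can be queried with binary search. The sketch below is a generic illustration, not the ProtDex implementation: the index term is assumed to be a hashable, orderable key such as a (contact pattern, value) tuple, and `build_index` and `lookup` are hypothetical names.

```python
import bisect

def build_index(postings):
    """Sort (term, protein_id) pairs so equal terms sit next to each other."""
    return sorted(postings)

def lookup(index, term):
    """Return the protein ids posted under `term`, found by binary search."""
    # (term,) sorts before every (term, protein_id), so this is the first hit.
    pos = bisect.bisect_left(index, (term,))
    hits = []
    while pos < len(index) and index[pos][0] == term:
        hits.append(index[pos][1])
        pos += 1
    return hits
```

A lookup touches only the entries for the queried term, which is why the search does not have to scan every structure in the database.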
Searching a Protein Structure
$$S(Q,P) = W_{FMCount}(Q,P) \times W_{GSum}(i,j) \times \sum_{match(i,j)} \Big[ W_{Term}(i) \times \max_{match(a,b) \,\wedge\, PdbId_b = P} \big( W_{Area}(a,b) \times W_{ARatio}(a,b) \times W_{Ordinal}(a,b) \big) \Big]$$
• $W_{FMCount}$ compensates for large proteins being matched and scored more often than small ones.
• $W_{Term}$ gives more weight to query index terms that rarely occur in the database.
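The slides do not give the formula for the term weight. A common way to favor rare terms in information retrieval is an inverse-document-frequency weight, shown here purely as an assumed stand-in; `rarity_weight` and its parameters are hypothetical and this is not the ProtDex definition.

```python
import math

def rarity_weight(num_proteins, proteins_with_term):
    """IDF-style weight: larger when fewer proteins contain the term.

    Assumed stand-in for a rarity-based term weight; not taken from ProtDex.
    """
    if not 0 < proteins_with_term <= num_proteins:
        raise ValueError("need 0 < proteins_with_term <= num_proteins")
    return math.log(num_proteins / proteins_with_term)
```

Under this weighting, a term found in every protein gets weight 0 and contributes nothing to the score, while a term found in only a few proteins dominates.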
Discussion
• Design:
  • representation of structures
  • scoring schemes
  • comparison algorithms
  • assessment of the results
• Performance
• Accuracy – the SCOP classification hierarchy has 4 levels: class, fold, superfamily and family
• Pros and cons of ProtDex
Conclusions
• Advantages:
  • Speed (no need to scan through each structure in the database)
• Disadvantages:
  • Cannot provide the actual alignment
  • Storage overhead for the index structure (the entire index: 1.2 GB)
  • Time needed to build and update the index (building the entire index: 30 min 38 s)