Identification of Domains using Structural Data. Niranjan Nagarajan, Department of Computer Science, Cornell University.
Assorted Definitions of Domains • Subsequences that can fold independently into a stable structure. • Structurally compact substructures. • Functionally well-defined building blocks. • Evolutionarily conserved and reused fragments.
Protein Structural Domain Identification (William R. Taylor)
Basic Algorithm • Initial assignment of labels: sequential residue numbering. • Update of labels. • Termination condition: mean squared deviation of average between successive cycles < 10^-6, or number of iterations > (length of protein)/2.
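The initialization and stopping test above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the slide's "mean squared deviation" is read here as the mean squared change in the labels between two successive cycles, which is an assumption.

```python
def initial_labels(n_residues):
    """Sequential residue numbering: residue i starts with label i (1-based)."""
    return [float(i) for i in range(1, n_residues + 1)]


def should_stop(prev_labels, labels, iteration, tol=1e-6):
    """Return True when the labels have settled or the iteration cap is hit.

    Assumption: the deviation is the mean squared difference between the
    label vectors of two successive cycles.
    """
    n = len(labels)
    msd = sum((a - b) ** 2 for a, b in zip(prev_labels, labels)) / n
    # The cap from the slide: more than (length of protein)/2 iterations.
    return msd < tol or iteration > n / 2
```

The label update itself is left out of this sketch; any update rule can be driven by these two helpers in a simple loop.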
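Update Formula • $S_i^{t+1} = S_i^t + \mathrm{step}(t+1) \cdot \mathrm{sign}\big(\sum_j f(S_i^t, S_j^t)\big)$ for all $i$. • $\mathrm{sign}(x) = 1$ if $x > 0$, $-1$ if $x < 0$, $0$ if $x = 0$. • $f(S_i^t, S_j^t)$ = • $r/d_{ij}$ if $S_j^t > S_i^t$ and $d_{ij} < r$. • $-r/d_{ij}$ if $S_j^t < S_i^t$ and $d_{ij} < r$. • $0$ otherwise. • $\mathrm{step}(x)$ = • $1$ if $x < N/2$. • $2(N-x)/N$ if $N/2 \le x < N$. • $0$ otherwise.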
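One cycle of the update can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the representation of residues as 3D coordinate tuples, and the synchronous update (all new labels computed from the old ones) are assumptions. The slide does not define N, so it is passed in as the parameter `n`.

```python
import math


def sign(x):
    """Return 1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def step(x, n):
    """Step size: 1 early on, then a linear ramp down to 0 at x = n."""
    if x < n / 2:
        return 1.0
    if x < n:
        return 2.0 * (n - x) / n
    return 0.0


def f(s_i, s_j, d_ij, r):
    """Inverse-distance-weighted pull of neighbour j on residue i."""
    if d_ij >= r or s_j == s_i:
        return 0.0
    weight = r / d_ij
    return weight if s_j > s_i else -weight


def update_labels(labels, coords, r, t, n):
    """Compute the labels at cycle t+1 from the labels at cycle t."""
    new_labels = []
    for i, (s_i, p_i) in enumerate(zip(labels, coords)):
        total = 0.0
        for j, (s_j, p_j) in enumerate(zip(labels, coords)):
            if i != j:  # assumes distinct residues have distinct coordinates
                total += f(s_i, s_j, math.dist(p_i, p_j), r)
        new_labels.append(s_i + step(t + 1, n) * sign(total))
    return new_labels
```

Because only the sign of the weighted sum is used, each label moves by at most `step(t + 1, n)` per cycle, so the shrinking step size damps the labels as the iteration proceeds.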
Example • Full lines indicate protein backbone. • Neighboring residues within radius r are connected by dashed lines. • Connections between i and i + 2 have been omitted for clarity. • Label evolution is done without inverse distance weighting.
Refinements • Median-based smoothing with a window size of 21 to reclaim short loops of 10 or fewer residues. • Small domains are reassigned using the weighted mean values of their neighbors (weights are given by f). • Domain recalculation is repeated at most five times.
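The median smoothing step can be illustrated with a short sketch. The function name is hypothetical, and truncating the window at the chain ends is an assumption the slide does not address.

```python
from statistics import median


def median_smooth(labels, window=21):
    """Replace each label by the median of a window centred on it.

    Assumption: near the chain ends the window is truncated rather than padded.
    """
    half = window // 2
    n = len(labels)
    return [
        median(labels[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]
```

With a window of 21, an interior run of 10 or fewer residues carrying a different label is always outvoted by its surroundings, which is how short loops are reclaimed by the enclosing domain.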
Preserving β-sheets • Matrix B of possible β-sheet interactions between residues generated based on distance data and heuristics. • Weighted mean heuristic used to generate initial assignment of labels, with the averaging being iterated to convergence. • Post-processing also applied to badly broken β-sheets.
Self-testing with fake homologs • Fake homologs generated by smoothing: the central atom of each triple is replaced by the average, and the process is repeated five times. • Domain assignments compared and similarity evaluated based on overlap score. • r optimized for best overlap score.
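The smoothing used to generate a fake homolog might look like the sketch below. It assumes the chain is a list of 3D coordinates (for example one atom per residue), that "average" means the mean position of three consecutive atoms, and that the two end atoms are left in place. The function name is hypothetical, and the overlap score is not shown because the slides do not define it.

```python
def smooth_chain(coords, rounds=5):
    """Replace each interior atom by the mean of itself and its two neighbours.

    Assumptions: endpoints stay fixed, and every round uses the positions
    from the previous round (not partially updated ones).
    """
    points = [tuple(p) for p in coords]
    for _ in range(rounds):
        smoothed = list(points)
        for i in range(1, len(points) - 1):
            triple = (points[i - 1], points[i], points[i + 1])
            smoothed[i] = tuple(sum(axis) / 3.0 for axis in zip(*triple))
        points = smoothed
    return points
```

Running the domain assignment on both the original and the smoothed chain then gives two assignments whose agreement can be scored for a given r.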
Extension to Multiple Structures • Algorithm is simultaneously run on structures corresponding to a multiple sequence alignment. • Labels are synchronized to the average of the labels at a position after each iteration.
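The synchronization step can be sketched as below. The data layout is an assumption: one row of labels per structure, one column per alignment position, with `None` marking a gap. How gaps are treated is not stated on the slide; here they are skipped when averaging and left as gaps.

```python
def synchronize(aligned_labels):
    """Set every label in an alignment column to the column mean.

    Assumption: gaps (None) do not contribute to the mean and remain gaps.
    """
    n_cols = len(aligned_labels[0])
    out = [list(row) for row in aligned_labels]
    for k in range(n_cols):
        present = [row[k] for row in aligned_labels if row[k] is not None]
        if not present:
            continue  # column made entirely of gaps
        mean = sum(present) / len(present)
        for row in out:
            if row[k] is not None:
                row[k] = mean
    return out
```

In a multi-structure run, this function would be called once after each label update cycle, so that aligned residues in all structures drift toward the same domain label.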