INFORMS 2004 Optimization Approaches to HP Lattice Protein Folding Hyun-suk Yoon Joel Sokol School of Industrial and Systems Engineering Georgia Institute of Technology
Table of contents • Introduction to Protein Folding • Integer Programming (IP) Approach • Introduction to Constraint Programming (CP) • CP Approach • Discussion
Protein • Sequence of amino acids • Size: 30 to 10,000 amino acids, a few hundred on average • Folds quickly into a compact 3-D structure in its minimum-energy state. • The number of possible 3-D structures is exponential.
Problem description How can we find a 3D structure of a protein given a sequence of amino acids?
Motivation 1. Design drugs • Most drugs work by attaching themselves to a protein. • Knowing the 3-D shapes of proteins helps in designing drugs. 2. Detect misfolding • Proteins occasionally fail to take their correct 3-D shapes. • Misfolded proteins are known to cause a number of diseases, e.g., Alzheimer’s disease and Parkinson’s disease.
Protein folding • How to determine a protein's fold • Experimental techniques: X-ray crystallography and NMR spectroscopy • Computational techniques, e.g., Folding@Home • Protein Data Bank (PDB) • http://www.rcsb.org/pdb • Worldwide repository of 3-D structure data for large biological molecules (proteins and nucleic acids).
HP model and Lattice model • HP model • Each amino acid is classified as Hydrophobic (H) or Polar (P). • 20 types of amino acids: 8 H’s and 12 P’s • Lattice model • Locate each amino acid on a point of a cubic lattice. • Parity problem: on a cubic lattice, two amino acids can be adjacent only if their sequence positions differ in parity; triangular or diagonal lattice models are used to address it.
HP lattice model • HP model + Lattice model: the simplest protein model - Advantage: enumeration techniques can be used to place amino acids. - Disadvantages: low resolution, no explicit local interactions, equal bond lengths • Lau and Dill (1989): minimizing total energy in the HP lattice model = maximizing the number of H-H contacts.
Example of HP lattice model • [Figure: an example fold on the lattice. Legend: hydrophobic amino acid, polar amino acid, peptide bond, H-H contact.] • Number of H-H contacts = number of adjacencies between hydrophobic amino acids (except for peptide bonds)
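The contact count defined on this slide can be sketched in a few lines of Python. This is an illustration on the 2-D square lattice, not the authors' code; the function name `count_hh_contacts` and the four-residue example fold are assumptions made for the sketch.

```python
def count_hh_contacts(sequence, coords):
    """Count H-H contacts of a fold on the 2-D square lattice.

    sequence: string over {'H', 'P'}.
    coords:   list of (x, y) lattice points, one per amino acid, in sequence order.
    """
    position = {point: k for k, point in enumerate(coords)}
    contacts = 0
    for k, (x, y) in enumerate(coords):
        if sequence[k] != "H":
            continue
        # Look only right and up so that each adjacent pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            m = position.get(neighbor)
            # abs(m - k) > 1 excludes peptide bonds (sequence neighbors).
            if m is not None and sequence[m] == "H" and abs(m - k) > 1:
                contacts += 1
    return contacts


# Hypothetical example: four amino acids folded around a unit square.
square_fold = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(count_hh_contacts("HPPH", square_fold))
```

In the example only the first and last amino acids are hydrophobic; they sit on adjacent lattice points without being bonded, so they form the single contact of this fold.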
Literature review • Protein topology • Levitt and Chothia (1976) represent the 2-D structural topology of proteins in diagrammatic form. • Richardson (1977) gives the first systematic survey of protein topology. • HP lattice model • Lau and Dill (1989) study an HP model on the square and cubic lattices. • Berger and Leighton (1998) and Crescenzi et al. (1998) prove that folding in the HP lattice model is NP-complete.
Table of contents Introduction to Protein Folding • Integer Programming (IP) Approach • Introduction to Constraint Programming (CP) • CP Approach • Discussion
General model • Maximize the number of H-H contacts, subject to: 1. (Assignment) Each amino acid must occupy one lattice point. 2. (Non-overlapping) No two amino acids may share the same lattice point. 3. (Connectivity) Every two amino acids that are consecutive in the protein's sequence must also occupy adjacent lattice points.
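The three constraints can be checked directly on a candidate fold. The following is a minimal sketch on the 2-D square lattice; the representation (a list holding one point per amino acid) and the name `is_valid_fold` are assumptions for illustration.

```python
def is_valid_fold(coords):
    """Check a candidate fold against the constraints of the general model.

    coords[k] is the lattice point (x, y) of the k-th amino acid, so the
    assignment constraint holds by construction: one point per amino acid.
    """
    # Non-overlapping: no lattice point is used twice.
    if len(set(coords)) != len(coords):
        return False
    # Connectivity: consecutive amino acids occupy adjacent lattice points,
    # i.e. their Manhattan distance is exactly 1.
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            return False
    return True


print(is_valid_fold([(0, 0), (0, 1), (1, 1)]))  # a connected, non-overlapping fold
print(is_valid_fold([(0, 0), (0, 1), (0, 0)]))  # revisits (0, 0)
```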
Two IP models • Model IP-1: Uses the coordinate of each amino acid. • Model IP-2: Uses the direction (Up, Down, Left, Right). • [Figure: amino acids 1, 2, 3 at lattice points (0,0), (0,1), (1,1); the same fold written as directions is Up, Right.]
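The two encodings describe the same fold, and converting between them is mechanical. The sketch below assumes single-letter direction codes (U, D, L, R) and a start at (0,0); both are illustrative choices, not part of the original models.

```python
# Unit steps on the 2-D square lattice (letter codes are an assumption).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
LETTERS = {step: letter for letter, step in MOVES.items()}


def directions_to_coords(directions, start=(0, 0)):
    """Direction encoding (IP-2 style) -> coordinate encoding (IP-1 style)."""
    coords = [start]
    for d in directions:
        dx, dy = MOVES[d]
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def coords_to_directions(coords):
    """Coordinate encoding -> direction encoding; assumes a connected fold."""
    return "".join(
        LETTERS[(x2 - x1, y2 - y1)]
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


print(directions_to_coords("UR"))                      # the figure's example fold
print(coords_to_directions([(0, 0), (0, 1), (1, 1)]))
```

A fold of n amino acids needs n coordinate pairs but only n-1 directions, and connectivity is automatic in the direction encoding while non-overlapping still has to be enforced.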
2-D vs 3-D • A 2-D model is often used instead of 3-D, followed by an attempt to extend it to 3-D. • Our models extend easily from 2-D to 3-D - Model IP-1: (x,y) → (x,y,z) - Model IP-2: add two more directions: forward, backward.
Solving IP models • Defining decision variables: x_ijk = 1 if the k-th amino acid is located at (i,j), 0 otherwise; y_ijd = 1 if the amino acids at (i,j) and at (i,j)+d are adjacent, 0 otherwise. • Formulating the problem: Max (number of H-H contacts) s.t. (Non-overlapping), (Assignment), (Connectivity), (Define y), x and y binary. • Preprocessing • Running it with CPLEX
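The formulas on this slide did not survive as text. One formulation consistent with the variable definitions and the four constraint labels is sketched below; it is an assumption, not necessarily the authors' exact model. It reads y as marking two neighboring lattice points that both hold hydrophobic amino acids, and the symbols n, H, D, D^+ and c are introduced here for illustration.

```latex
% Assumed notation (not from the slides):
%   n   = number of amino acids
%   H   = set of sequence positions k holding a hydrophobic amino acid
%   D   = {(1,0), (-1,0), (0,1), (0,-1)}, the unit steps of the 2-D lattice
%   D^+ = {(1,0), (0,1)}, used so that each adjacent pair is counted once
%   c   = number of sequence-consecutive H-H pairs (a constant; these are
%         peptide bonds and do not count as contacts)
\begin{align*}
\max\quad & \sum_{(i,j)} \sum_{d \in D^{+}} y_{ijd} \;-\; c \\
\text{s.t.}\quad
& \sum_{k=1}^{n} x_{ijk} \le 1
  && \forall (i,j) && \text{(Non-overlapping)} \\
& \sum_{(i,j)} x_{ijk} = 1
  && \forall k && \text{(Assignment)} \\
& x_{ijk} \le \sum_{d \in D} x_{(i,j)+d,\,k+1}
  && \forall (i,j),\; k = 1,\dots,n-1 && \text{(Connectivity)} \\
& y_{ijd} \le \sum_{k \in H} x_{ijk}, \qquad
  y_{ijd} \le \sum_{k \in H} x_{(i,j)+d,\,k}
  && \forall (i,j),\; d \in D^{+} && \text{(Define } y\text{)} \\
& x_{ijk},\; y_{ijd} \in \{0,1\}
\end{align*}
```

Because the objective is maximized, each y is pushed to 1 whenever both of its upper bounds allow it, so the two inequalities suffice to define y in this sketch.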
Computational results • Instance: 1PSV • 28 amino acids: one of the smallest human proteins. • Data obtained from the PDB. • Truncated to sizes 12, 18, 23, and 28. • Optimal solution:
Computational results (cont) • CPLEX running times (seconds) - IP does not work well. - The 23- and 28-amino-acid instances take a long time to solve.
IP did not work well • Why? - High degeneracy: many structures share the same minimum energy. - Symmetry: the IP formulation contains much symmetry. • CP is known to perform better than IP when the IP formulation contains much symmetry. • So we move on to CP.
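Degeneracy and symmetry are easy to observe by brute force on a toy instance. The sketch below enumerates every self-avoiding fold of a short HP string on the 2-D square lattice and collects the folds that reach the maximum number of H-H contacts. It is an illustration only (the enumeration grows as 4^(n-1)), and the sequence "HPPH" is a made-up example, not data from the talk.

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold(directions):
    """Return the lattice points of a fold, or None if it overlaps itself."""
    coords = [(0, 0)]
    for d in directions:
        dx, dy = MOVES[d]
        point = (coords[-1][0] + dx, coords[-1][1] + dy)
        if point in coords:
            return None
        coords.append(point)
    return coords


def hh_contacts(sequence, coords):
    """Adjacent hydrophobic pairs that are not sequence neighbors."""
    hydrophobic = {p: k for k, p in enumerate(coords) if sequence[k] == "H"}
    total = 0
    for (x, y), k in hydrophobic.items():
        for q in ((x + 1, y), (x, y + 1)):  # right and up: count each pair once
            m = hydrophobic.get(q)
            if m is not None and abs(m - k) > 1:
                total += 1
    return total


def optimal_folds(sequence):
    """Enumerate all folds; return the best contact count and every fold reaching it."""
    best, winners = -1, []
    for directions in product("UDLR", repeat=len(sequence) - 1):
        coords = fold(directions)
        if coords is None:
            continue
        score = hh_contacts(sequence, coords)
        if score > best:
            best, winners = score, ["".join(directions)]
        elif score == best:
            winners.append("".join(directions))
    return best, winners


best, winners = optimal_folds("HPPH")
print(best, len(winners), winners)
```

For "HPPH" the optimum is a single contact, reached by eight folds that are rotations and reflections of one another. An IP solver sees these as distinct solutions, which is the symmetry the slide refers to.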
Table of contents Introduction to Protein Folding Integer Programming (IP) Approach • Introduction to Constraint Programming (CP) • CP Approach • Discussion
Concepts of CP • Constraint programming (CP) • Study of modeling and solving a system of logical constraints using search techniques. • Began in the 1980s as part of artificial intelligence research. • Two main procedures: domain reduction and constraint propagation
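Domain reduction and constraint propagation can be illustrated on a toy "all values different" constraint. The sketch below is a deliberately simple propagator written for this illustration; it is not how ILOG Solver or any particular CP engine implements propagation, and the variable names are made up.

```python
def propagate_all_different(domains):
    """Reduce domains under an all-different constraint until nothing changes.

    domains: dict mapping variable name -> iterable of candidate values.
    Returns a new dict of sets; an empty set signals infeasibility.
    """
    domains = {var: set(values) for var, values in domains.items()}
    changed = True
    while changed:  # repeat until a fixed point is reached
        changed = False
        for var, values in domains.items():
            if len(values) != 1:
                continue
            (fixed,) = values
            for other, other_values in domains.items():
                if other != var and fixed in other_values:
                    other_values.discard(fixed)  # domain reduction
                    changed = True               # may trigger further propagation
    return domains


# Fixing a = 1 removes 1 from b, which fixes b = 2, which in turn fixes c = 3.
print(propagate_all_different({"a": {1}, "b": {1, 2}, "c": {1, 2, 3}}))
```

The example shows the chain reaction that gives propagation its power: one reduced domain becomes a singleton and causes reductions elsewhere without any search.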
CP vs IP • Advantages and disadvantages • Unified methodologies combining CP and IP have been designed in recent years.
CP previous research • Smith (1996) shows environments where CP may work better than IP. • Barták (1999), Smith (1995), and the ILOG Solver 5.0 manual (2000) report successful applications of CP in many areas. • Easton (2003) and Milano (2004) deal with combining CP and IP.
Three CP models • Model CP-1, CP-2: Use the direction (Up, Down, Left, Right). • Model CP-3: Uses the combination of coordinates. • [Figure: amino acids 1, 2, 3 at (0,0), (0,1), (1,1), i.e. Up, Right. In CP-3 each point is combined into one number: (0,0) → 0×3+0 = 0, (0,1) → 0×3+1 = 1, (1,1) → 1×3+1 = 4.]
Models Description • Model CP-1: Similar to the IP models, but uses the max function and if-then constraints. • Model CP-2: Similar to CP-1, but simplifies the formulation using Boolean functions and absolute values. • Model CP-3: Uses the alldifferent constraint.
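The point of combining coordinates in CP-3 is that non-overlapping becomes a single alldifferent constraint over one integer per amino acid. A minimal sketch, assuming the combination is x × width + y with width 3 as in the figure's arithmetic (the function names are illustrative):

```python
def encode(point, width):
    """Combine (x, y) into one integer; assumes 0 <= y < width so codes are unique."""
    x, y = point
    return x * width + y


def all_different(values):
    """True when no value repeats, the condition an alldifferent constraint enforces."""
    return len(set(values)) == len(values)


fold = [(0, 0), (0, 1), (1, 1)]
codes = [encode(p, 3) for p in fold]
print(codes, all_different(codes))
```

With this encoding two amino acids overlap exactly when their codes are equal, so one global constraint replaces a pairwise comparison of coordinates.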
How to solve the problem faster • CP strategies to solve the problem faster • Use a known solution. • Fix the direction from the first amino acid to the next. • Any two amino acids at an even distance in the sequence cannot be adjacent on the lattice. • The lattice distance between two amino acids is bounded above by their distance in the sequence. • Variable ordering: choose the variables with the smallest domain first.
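Three of these strategies reduce to small tests that a search procedure can apply. The sketch below is one reading of them for the 2-D square lattice, with 0-based sequence positions and Manhattan distance; the function names and the example domains are assumptions made for illustration.

```python
def can_be_adjacent(k, m):
    """Parity rule: positions an even distance apart in the sequence never touch."""
    return abs(k - m) % 2 == 1


def within_distance_bound(p, q, k, m):
    """Lattice (Manhattan) distance cannot exceed the distance in the sequence."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) <= abs(k - m)


def smallest_domain_first(domains):
    """Variable ordering: pick an undecided variable with the fewest values left."""
    undecided = {var: dom for var, dom in domains.items() if len(dom) > 1}
    if not undecided:
        return None
    return min(undecided, key=lambda var: len(undecided[var]))


print(can_be_adjacent(0, 3), can_be_adjacent(0, 2))
print(within_distance_bound((0, 0), (2, 1), 0, 3))
print(smallest_domain_first({"d1": {"U"}, "d2": {"U", "L", "R"}, "d3": {"U", "R"}}))
```

The parity rule follows from coloring the lattice like a chessboard: each step along the chain changes color, so only positions of opposite parity can land on neighboring points.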
Computational results • Same instance as IP (1PSV): 12, 18, 23, 28 amino acids. • CP models run with ILOG Solver. • [Tables: results for N = 23 and N = 28]
Computational result - IP vs CP • IP vs CP best running times (seconds) - Models used: IP-1 for IP, CP-1 (with strategies) for CP. - CP is faster than IP with our models.
Proposed research 1. Try other CP approaches such as dual modeling and dynamic variable ordering. 2. Consider a unified methodology of IP and CP - Decompose the problem, and apply IP to one part and CP to the other part. 3. Attempt other approaches such as heuristic algorithms to find better bounds.
Contribution 1. Optimization field • Helps show how CP can be an alternative or a complement to IP. 2. Biological field • Success of our research can help in predicting 3-D protein structures, which may assist medical development.
Any questions? Hyun-suk Yoon Industrial and Systems Engineering, Georgia Tech hsyoon@isye.gatech.edu