
L O O P P



  1. LOOPP T. Galor

  2. INDEX
     • The database of the program
     • The database of LOOPP
     • Inserting a new driver to LOOPP
     • Inserting a new model to LOOPP
     • The MAIN module
     • The OPTION module
     • The loopp_interf module
     • The ALIGN module
     • The THREAD module
     • The SEQ module
     • The PDB module
     • The MPS module
     • Developing a new potential (SVM, BPMPD, PCX)

  3. INDEX (continued)
     • TE13 module
     • The global variables
     • The parameter file
     • Installing LOOPP
     • Running LOOPP
     • Interpretation of LOOPP results
     • Reference

  4. The database. In this chapter I will talk about some of the main data structures defined in LOOPP. The definitions are given in the file db.h, and the allocation and de-allocation of these structures is done in the file db.c.

  5. The protein

  6. Coordinates. The options geometric_chain, C_alpha and C_beta define which of the coordinate sets are loaded into memory. The vectors are allocated in the file db.c: the yellow vector (in the figure) is allocated by zalloc_coord(), and the red vectors are allocated by the routine init_coord(). The coordinates are read into memory by read_xyz_loopp_format(), defined in the file loop_interf.c. By default, LOOPP reads the geometric side-chain coordinates.

  7. Contact map (CM). The contact map vector (the red vector in the figure) is generated during the allocation of the protein in alloc_info() if the option compute_CM is set on. The red vector has one entry per residue of the protein. The yellow vectors, each of size MAX_CONTACT, are allocated in Get_CM_for_a_prot(), defined in the file cm.c. This routine reads the CM if the file exists; otherwise it generates the file and loads the CM into memory. First shell neighbors: the vectors "first shell neighbor g" and "first shell neighbor l" contain, for each site, the number of contacts with sites whose index is greater than or less than the site index, respectively. [Figure labels: Site 1, Site 4, NULL]
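The greater/less bookkeeping described on this slide can be sketched as follows. This is only an illustration: the type Point3, the helper in_contact(), the function count_first_shell() and the distance-cutoff criterion are all assumptions, not LOOPP's actual code in cm.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coordinate type; LOOPP's own structures in db.h may differ. */
typedef struct { double x, y, z; } Point3;

/* Hypothetical contact test: two sites are in contact when they lie
 * within `cutoff` of each other. */
static int in_contact(const Point3 *a, const Point3 *b, double cutoff)
{
    double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
    return dx * dx + dy * dy + dz * dz <= cutoff * cutoff;
}

/* For every site i, count the contacts with sites of greater index
 * (n_greater[i]) and with sites of smaller index (n_less[i]). */
void count_first_shell(const Point3 *site, size_t n, double cutoff,
                       int *n_greater, int *n_less)
{
    assert(site != NULL && n_greater != NULL && n_less != NULL);
    for (size_t i = 0; i < n; i++) {
        n_greater[i] = 0;
        n_less[i] = 0;
    }
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (in_contact(&site[i], &site[j], cutoff)) {
                n_greater[i]++;   /* seen from i, j has the greater index */
                n_less[j]++;      /* seen from j, i has the smaller index */
            }
        }
    }
}
```

Each contact is visited once and recorded from both sides, so the two vectors always sum to the same total.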

  8. Count_2nd_shell_contact and Id_2nd_shell_contact. Each cell of Id_2nd_shell_contact contains the type of a structural site in contact with site 1. For example, THOM2 has 16 different types of structural sites, numbered from 0 to 15. Each cell of Count_2nd_shell_contact contains the multiplicity of the corresponding structural site in the vector Id_2nd_shell_contact. In the example, site 1 has 2 contacts of type 2 and 1 contact of type 3. The red vectors are allocated in alloc_info() if the option read_CM is set on. The yellow vectors are generated by get_thom2_env_per_site(), defined in the file env.c. One can also imagine a structural site definition different from that of THOM2.
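A minimal sketch of how such an id/count pair could be filled for one site, assuming the types of its neighbours are already known. The function tally_contact_types() and the constant N_TYPES are hypothetical names, and the ascending ordering of the output is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define N_TYPES 16   /* THOM2 has 16 site types, numbered 0 to 15 */

/* Collapse a list of neighbour site types into (id, count) pairs.
 * id[k] receives a site type and count[k] its multiplicity, in
 * ascending order of type. Returns the number of distinct types. */
int tally_contact_types(const int *neighbor_type, int n, int *id, int *count)
{
    int hist[N_TYPES] = {0};
    int k = 0;

    assert(neighbor_type != NULL && id != NULL && count != NULL);
    for (int i = 0; i < n; i++) {
        assert(neighbor_type[i] >= 0 && neighbor_type[i] < N_TYPES);
        hist[neighbor_type[i]]++;
    }
    for (int t = 0; t < N_TYPES; t++) {
        if (hist[t] > 0) {
            id[k] = t;
            count[k] = hist[t];
            k++;
        }
    }
    return k;
}
```

With neighbours of types 2, 3 and 2, the result is the slide's example: two contacts of type 2 and one contact of type 3.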

  9. The model is a set of information that describes the rules for calculating the structural environment of a protein site, the cost of an alignment, and the constraints. Example from the figure, the model HP_M and its energy model: Alphabet_HP[2]={HYD,POL}; M_env_HP[2]={15,15}; Base_score[2]={0,15}. The full alphabet is defined in db.c: alphabet={ALA,ARG,ASN,ASP,CYS,GLN,GLU,GLY,HIS,ILE,LEU,LYS,MET,PHE,PRO,SER,THR,TRP,TYR,VAL,GAP,GINS,GDEL,HYD,POL,GLX,ASX,CHG,CHN,CST,HST,USR1,ACE,MSE,UNK}. There may be more than one energy model per model; in this case we have a mix model.

  10. The cost matrix. The program stores the values of the potential in the energy model during the call of the routine set_****_attributes(…). The potential values are read by the routine read_scoring_matrixes(…). Matrix is a 2 by 2 vector that contains the potential of the current energy model, with dim1=2; dim2=2; dim3=0; symmetric=NO; with_gap_score=NO. A value in the matrix is accessed using the macro INDEX_POTEN, defined in the file db.h: index = INDEX_POTEN[res,env_x,base_score] = base_score[res] + env_x.
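The index formula can be tried out with the HP values quoted on the model slide (M_env_HP = {15,15}, Base_score = {0,15}). The function below is a stand-in for the INDEX_POTEN macro, whose real definition in db.h is not shown in the slides; the enum names and the range checks are assumptions.

```c
#include <assert.h>

/* Positions in the two-letter HP alphabet (hypothetical enum). */
enum { HYD = 0, POL = 1, N_HP = 2 };

/* Values quoted on the model slide for the HP example. */
static const int m_env_hp[N_HP] = {15, 15};  /* environments per residue type */
const int base_score[N_HP] = {0, 15};        /* offset of each residue block  */

/* Stand-in for INDEX_POTEN: flat index of (residue, environment). */
int index_poten(int res, int env_x, const int *base)
{
    assert(res >= 0 && res < N_HP);
    assert(env_x >= 0 && env_x < m_env_hp[res]);
    return base[res] + env_x;
}
```

Under these assumptions, environment 3 of a HYD residue maps to index 3 and environment 3 of a POL residue maps to index 18, so each residue type owns a contiguous block of the flat potential vector.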

  11. The model (continued). More than one model may be trained simultaneously. Fields recoverable from the figure:
     • TRAIN_INFO: the training information.
     • include_model: a vector of flags indicating which models are included in the training.
     • to_train: indicates which alphabet is trained.
     • get_constraints_coef_of_site: calculates the constraint coefficients per site.
     • define_ineq: calculates a constraint.
     • dig2env: converts a feature to an environment name.
     • cost: the score matrix.

  12. The alignment. [Figure: the ALIGNMENT structure (variable mvs) and its components. Recoverable labels follow.]
     • ALIGN: TRACE (M = match, D = delete, I = insert), align_len (alignment length), begin1 and begin2 (a local alignment starts at a different location in each of the two proteins).
     • Alignment input: prot_col (protein column), prot_row (protein row), use_loop_indx (use LOOPP index), pdb2loopp index 1, pdb2loopp index 2, compute Zscore, alignment type (global/local), alignment id (thread/seq/struc/).
     • Alignment assessment: ene (energy), score (average energy), zscore, post ene, post score, post zscore, #ins, #del, #match, identity, hydrophobic, polarity, charge, number of gap segments, number of mismatches, rms.

  13. The database. The database is used when all proteins are stored in memory. The database list is used when only one protein at a time is stored in memory. F_xxxx stands for the pointer to a file and f_xxxx stands for the file name. The database list is initialized with Init_read_db(), and each new protein is read into memory with read_nxt_prot(). After all proteins are processed, the list is cleaned up with the routine finish_reading_db(). The database is allocated in the file db.c with zalloc_db(), and it is read into memory with the routine Build_protein_db_from_file(), defined in loop_interf.c. The proteins are read from a file containing a list of PDB names, including chains.

  14. The decoy. A decoy is a set of two proteins and their alignment method. The alignment can be an identity alignment of SN into XN, a threading alignment of SN into XD, or a sequence alignment. The alignment energy: given an initial guess for the score, we calculate the total LHS energy, the RHS energy and the coefficient vector. The coefficient vector C counts the number of times an amino acid a_i is assigned to a structural site x_j.

  15. The constraint to train. A pseudo protein is defined by a decoy, where a decoy is a set of two proteins and their alignment. The equation is defined as the information of decoy 1 subtracted from that of decoy 2. LOOPP outputs three files for training: the RHS file, the LHS file and the Log file. The Log file stores the norm of the two coefficient vectors, the distance and the energy. The LHS file stores the left-hand side of each constraint, and the RHS file stores the right-hand side value.

  16. The database of LOOPP. LOOPP has a set of about 3888 proteins that span the known folds of the PDB. The folds are 6 Å apart, as found by LOOPP v1 structural alignment, and are updated from time to time using the CE program. The database is stored at H:\\CBSU\LOOPP\DB\DB_jm on the theory center cluster. The list of the proteins (jm_list) is given in H:\\CBSU\LOOPP\LIST\jm_list. So far the database holds four types of data files. Each file starts with a header containing the name and the chain of the protein, together with the number of residues.
     • ****.seq contains the list of amino acids.
     • ****.xyz contains the coordinates, in 9 columns. The first three columns are the (x,y,z) of the geometric side chain, the next triplet holds the C-alpha coordinates, and the last triplet holds the C-beta coordinates. Missing coordinates are designated by 999.9.
     • ****.2nd contains the secondary structure produced by the DSSP program, in 5 columns. The first column has the name of the amino acid; the second column contains the secondary structure (A for alpha helix, B for beta sheet and X for the others); the last three columns are the dihedral angles. The number 3600 is used for an unknown angle.
     • ****.surf contains the surface exposure.
     Updating the database: the database is updated using the Perl script DB.pl, found in H:\users\galor\loopp\perl. To run the script, the user has to set some of the parameters inside it.
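As an illustration of the nine-column layout, here is a sketch of reading one coordinate line and testing a triplet for the 999.9 marker. The XyzRecord type and both function names are hypothetical, and whitespace-separated columns are an assumption; LOOPP's own reader lives in loop_interf.c and may differ.

```c
#include <assert.h>
#include <stdio.h>

#define MISSING_COORD 999.9   /* marker for a missing coordinate (from the text) */

/* Hypothetical record for one line of a ****.xyz file. */
typedef struct {
    double side_chain[3];   /* geometric side chain (x,y,z) */
    double c_alpha[3];      /* C-alpha (x,y,z)              */
    double c_beta[3];       /* C-beta (x,y,z)               */
} XyzRecord;

/* Parse nine whitespace-separated numbers. Returns 1 on success, 0 otherwise. */
int parse_xyz_line(const char *line, XyzRecord *rec)
{
    int n;

    assert(line != NULL && rec != NULL);
    n = sscanf(line, "%lf %lf %lf %lf %lf %lf %lf %lf %lf",
               &rec->side_chain[0], &rec->side_chain[1], &rec->side_chain[2],
               &rec->c_alpha[0], &rec->c_alpha[1], &rec->c_alpha[2],
               &rec->c_beta[0], &rec->c_beta[1], &rec->c_beta[2]);
    return n == 9;
}

/* A triplet counts as missing if any component carries the marker value. */
int triplet_is_missing(const double xyz[3])
{
    for (int i = 0; i < 3; i++) {
        double d = xyz[i] - MISSING_COORD;
        if (d < 0.0)
            d = -d;
        if (d < 1e-6)
            return 1;
    }
    return 0;
}
```

A line such as "1.0 2.0 3.0 4.0 5.0 6.0 999.9 999.9 999.9" would then parse as a residue whose side-chain and C-alpha positions are known and whose C-beta position is missing.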

  17. Inserting a new driver to LOOPP

  18. In this section we explain how to insert a new driver into LOOPP. A driver is a function that a user can choose from the startup menu; an example of a driver is threading a list of sequences against the database. As one can see from the figure above, a driver consists of several components. We start with the first component, setting the options. Default options are set at the beginning of the LOOPP program in main.c. Some of these options are set according to the choices of the user, and the programmer sets the rest. Some of the options are driver dependent and are set by the programmer in the driver. Let's return to our example of threading: these options translate to an alignment type of threading, and to not computing the contact map (CM) of TE13. The next step is to set the model, which defines the energy function for LOOPP. In the first example the model is set according to the user's wish, and in the second example:

  19. The programmer can set the model type, the potential type and the alphabet of the model. The programmer can also decide whether the model is to be trained; when the variable train is set to YES, space is allocated for the training information. Next, we read the database of structures into the memory of the program. To this end, we allocate the space with alloc_db(). We attach the model to the database and, with set_struc_option(), set the options that make the program read all files connected to structures. Because the database can be divided among several processors, we next prepare to load only a portion of it in case LOOPP is run on several processors. A sublist of structures is created with take_portion_of_db_based_on_number_of_processes(op, io_in). Finally, we build the database of structures with the routine build_protein_db_from_pdb_list(db_structure, io_in, io_out, op). Next, we load the list of sequences into memory in the same manner.
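The slides do not show how the list is divided among processors. One simple possibility, given here purely as an assumption, is a contiguous block split with zero-based process ids; sublist_bounds() is a hypothetical helper and not the body of take_portion_of_db_based_on_number_of_processes().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: compute the half-open range [*first, *last) of
 * list entries handled by process `proc_id` out of `n_proc` processes.
 * When the list does not divide evenly, the lowest-numbered processes
 * each take one extra entry. */
void sublist_bounds(size_t n_items, size_t n_proc, size_t proc_id,
                    size_t *first, size_t *last)
{
    size_t base, extra;

    assert(n_proc > 0 && proc_id < n_proc);
    assert(first != NULL && last != NULL);
    base = n_items / n_proc;
    extra = n_items % n_proc;
    *first = proc_id * base + (proc_id < extra ? proc_id : extra);
    *last = *first + base + (proc_id < extra ? 1 : 0);
}
```

For a list of 10 structures on 3 processes this gives the ranges [0,4), [4,7) and [7,10), so every structure is loaded by exactly one process.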

  20. We again allocate memory for the database and assign it to the variable db_sequence. We set the appropriate model for db_sequence, then instruct the program to load only the information relevant to sequences with set_option_seq(op, model). Next, we copy the list of sequences defined by the user in op->list_pdb_file into the variable Io_in->f_current_list. Finally, the sequences are read in LOOPP format or FASTA format, according to the user's setting.

  21. Inserting a new model to LOOPP. We start with the smallest component of a MODEL, the ENERGY_MODEL_TEMPLET. The energy template contains the definitions of the protein and of its operations. In addition, it contains the cost function and its parameters for calculating an alignment of two proteins belonging to the same model. The name of the model is stored in the variable model_type. A protein is defined by its list of residues, the ALPHABET, and by its structural sites, *env. An example of a valid alphabet is alphabet_20_ins_del, which has the twenty usual amino acid types and two gaps, namely insertion and deletion. The size of the alphabet is stored in n_alphabet. The *env vector holds the number of different environments per site. In the example below, each amino acid (ac) has SEQZ sites and each gap has THOM2Z1 sites. In this particular model gaps are treated differently than amino acids: gaps are assigned to THOM2 structural sites.

  22. Next we list the operations that can be used on a protein. These are routines that must be programmed for every new model in order for it to function smoothly in LOOPP:
     • Get_gap_per_site(): If the gap depends on the structural site in a particular model, then the total gap cost for each site can be computed a priori. This means that the gap costs are computed automatically at the time the protein features are loaded into memory.
     • Copy_struc_feature(): Every new protein feature besides the protein coordinates or residues has to have a copy routine for that feature. Examples are secondary structure, surface exposure, or any new feature that will be added in the future.
     • Res2pos(): Converts a residue number to its position in the ALPHABET vector.
     • Get_contact_types_for_multiple_env_per_site(): Computes the number of contacts in the case of multiple environments per site. So far it has been used only for the THOM2 structural site.
     • Std_residue(): Checks whether an amino acid name is standard for that model.
     Recall that ENERGY_MODEL_TEMPLET also contains the cost function and its parameters. The parameters are stored in the variable cost and are loaded from the file whose name is stored in f_potn at the time set_option_align() is called. The cost function is divided into two parts, one for the amino acids and one for the gaps. These routines must also be written for any new model introduced to LOOPP:
     • Get_energy_cost(): Calculates the energy for assigning an amino acid to a structural site.
     • Get_gap_cost(): Calculates the energy for assigning a gap to a structural site.

  23. As an example of a new model we have here a sequence model with gap depending on structural site of THOM2.

  24. Some of the models in the file model.c are experimental and should be used with caution. Below is the list of available models for LOOPP:
     • TE13: Set_model_te13_regular20();
     • PDB: Set_model_clean_pdb();
     • SEQ: Set_model_seq_alignment();
     • THOM2: Set_model_thom2_regular_20_gap();
     • Secondary structure: Set_model_2nd_struc();
     • Surface exposure: Set_model_surf_regular_20();
     A model can be a mix of several models. As an example we will use the mix model of OT. This section is plugged into the file model.c in set_model().

  25. The main module. The main routine has the following functions:
     • Decipher the command line for LOOPP:
         Loopp.exe -> interactive mode
         Loopp.exe x.x loopp.par -> batch mode
         Loopp.exe x.x loopp.par #proc proc_Id proc_Id -> batch mode, multiple processors
     • Set the options chosen by the user with the function set_option().
     • Print interactively the available command options, each with a short explanation.
     • Call the driver that corresponds to the command option.
     • Print the end message of LOOPP.

  26. The option module.
     • Set_option(): Reads loopp.par and sets the values of the structure OPTION. Sets the pointer F_stdout (a global variable) for redirecting the output to the screen or to an output file.
     • Set_option_seq(): Sets options before reading sequence information.
     • Set_option_struc(): Sets options before reading the structural information of a protein.
     Set_option() parses the parameter file loopp.par. Every line in loopp.par starts either with a pound sign (#) for a comment or with an at sign (@) for an option definition:
         #comment                  comment line
         @USR_PARAMETER value      option definition
     The same option definition can appear several times in loopp.par with different values, but only the last definition counts.
     Adding a new option to LOOPP:
     • Add the appropriate new option field to the structure OPTION in the file db.h.
     • Add lines to set_option() to parse the new option. As an example, we add a new option field called parameter, which accepts a real-valued number:
         if (strcmp("USR_PARAMETER",operator) == EQ){
             sscanf(line,"%s%s%f",crd_opening,operator,&fval);
             fprintf(F_stdout,"%s\t\t\t%f\n",operator,fval);
             op->parameter = fval;
         }
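The comment/option convention and the "last definition counts" rule can be sketched in isolation. The reduced Option structure and apply_par_line() below are hypothetical; the real set_option() in LOOPP handles many options and echoes them to F_stdout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced option structure with a single real-valued field. */
typedef struct { float parameter; } Option;

/* Apply one line of a parameter file to `op`.
 * Returns 1 if the line set an option, 0 if it was ignored. */
int apply_par_line(const char *line, Option *op)
{
    char name[64];
    float fval;

    assert(line != NULL && op != NULL);
    if (line[0] != '@')                  /* '#' comments and other lines */
        return 0;
    /* Accepts both "@NAME value" and "@ NAME value". */
    if (sscanf(line + 1, "%63s %f", name, &fval) != 2)
        return 0;
    if (strcmp(name, "USR_PARAMETER") == 0) {
        op->parameter = fval;            /* a later line overwrites an earlier one */
        return 1;
    }
    return 0;
}
```

Feeding the lines of a file to this function in order gives the behaviour described on the slide: if USR_PARAMETER appears twice, the field ends up holding the second value.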

  27. The module loop_interf. The major task of this module is to add protein information to the memory of the program.
     • build_protein_db_from_file(): Builds the protein database from the old LOOPP format.
     • read_a_pdb_in_loop_format(): Stores protein information given in the new LOOPP format.
     • get_prot_name(), read_header_loop_forma(), read_log_loop_format(), read_seq_loop_format(), read_xyz_loop_format(), read_surf_loop_format(), read_2ndstruc_loop_format(): Read the different files of LOOPP.
     • build_protein_db_from_pdb_list() and related routines: rm_path(), get_prot_len(), read_a_pdb_in_loop_format(), get_db_TE13_CM_from_pdb_list(), get_gap_per_site_for_db(), get_list_env_per_site_for_db().

  28. check_if_missing_coord(): Computes the percentage of missing coordinates. If the percentage is greater than the threshold set by op->check_percent_missing, the protein is diagnosed as corrupted and is not loaded into memory. The routine also computes the size and the edge indices of the reliable chunk of a protein. Usually both edges of a protein contain many missing coordinates; these edges are trimmed and not used for training a new potential.
     • Conversion between the old and new LOOPP formats, and printing routines: prn_db_in_loop_format(), prn_db_in_old_loop_format(), drv_transform_nloopp_to_oloopp_format(), drv_get_list_from_old_loop_format().
     • define_a_sublist(): Used when LOOPP is run on several processors. This routine calculates the portion of the database list to load into memory for a specific processor.
     • FASTA format: read_seq_list_fasta2loop_format(), read_seq_prot_fasta2loop_format().
     • Loading one protein at a time into memory when memory is insufficient (LOOPP database): init_read_db(), read_nxt_prot(), finish_reading_db(), LM_read_a_pdb_in_loop_format(), LM_read_xyz_loop_format().
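The two computations attributed to check_if_missing_coord() can be sketched as below, assuming a per-residue flag array that marks missing coordinates. Both function names are hypothetical, and trimming only the two ends (leaving interior gaps in place) is an assumption about what "reliable chunk" means.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of residues whose coordinates are flagged as missing
 * (missing[i] != 0). */
double percent_missing(const int *missing, size_t n)
{
    size_t count = 0;

    assert(missing != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        if (missing[i])
            count++;
    return 100.0 * (double)count / (double)n;
}

/* Trim missing residues from both ends of the chain. On return,
 * [*begin, *end) is the remaining chunk; the return value is its
 * length (0 if every residue is missing). */
size_t reliable_chunk(const int *missing, size_t n, size_t *begin, size_t *end)
{
    size_t b = 0, e = n;

    assert(missing != NULL && begin != NULL && end != NULL);
    while (b < n && missing[b])
        b++;
    while (e > b && missing[e - 1])
        e--;
    *begin = b;
    *end = e;
    return e - b;
}
```

A caller would compare percent_missing() with the threshold in op->check_percent_missing before deciding whether to load the protein, then use the chunk bounds when collecting training data.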

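  29. The align module. How to use the align module (a flowchart in the original slide; only these steps are recoverable):
     • Start with len_list=0.
     • Set the alignment attributes with set_***_attributes(&align,model,op,io_in,mvs,indx1,indx2); where *** can be seq, thread, etc.
     • Clean up with clean_align_list(alignment_list, len_list); and reset len_list=0;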

  30. The dynamic matrix. The dynamic matrix is allocated dynamically; its size depends on the sizes of the query and the structure. LOOPP has local and global algorithms implemented in align.c. The dynamic matrix is computed with the routines scoring_energy and gap_energy. Below one can see that the cost is the sum over all existing energy models that are not post-energy models (TE13 is considered a post-energy model):
         float scoring_energy(OPTION *op, MODEL *model, PROTEIN *prot_col,
                              PROTEIN *prot_row, int pos1, int pos2 ){
             int k;
             float ret_score = 0.0;
             ENERGY_MODEL *m;
             for (k=0; k<model->n_ene_models;k++){
                 m = &model->ene_models[k];
                 if (m->get_energy_cost != NULL && !m->post_ene){
                     ret_score += m->unit_conversion *
                                  m->get_energy_cost(op,m,prot_col,prot_row,pos1,pos2);
                 }
             }
             return(ret_score);
         }
     [Figure: the matrices T and S, with Prot_row and Prot_col along the axes; the alignment places Prot_col against Prot_row.]
     There are two routines for debugging the dynamic matrix. The first prints the dynamic table to the screen; the size of the window is given by the last four parameters. It must be inserted in local_align or global_align before those routines exit. The second routine can be called after align(..) to see the energy of the alignment path.
         DEBUG(1, dbg_align_window(seq1_length,seq2_length,S,T,align_info->trace,0,prot1,prot2,0,20,0,20));
         DEBUG(1, dbg_align(seq1_length, seq2_length, S,T,align_info->trace,align_info->align_len));
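To make the role of the dynamic matrix concrete, here is a small, generic global alignment that minimizes energy with a constant gap penalty. It is a textbook-style sketch, not the code of align.c: the callback type PairEnergy, the SeqPair context, the toy identity_energy() function and the assumption that lower energy is better are all illustrative choices.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pair-energy callback: cost of placing position i of the
 * row protein on position j of the column protein. */
typedef float (*PairEnergy)(int i, int j, void *ctx);

/* Toy context and energy: -1 for identical letters, +1 otherwise. */
typedef struct { const char *row; const char *col; } SeqPair;

float identity_energy(int i, int j, void *ctx)
{
    const SeqPair *p = ctx;
    return p->row[i] == p->col[j] ? -1.0f : 1.0f;
}

static float min3(float a, float b, float c)
{
    float m = a < b ? a : b;
    return m < c ? m : c;
}

/* Minimum-energy global alignment with a constant gap penalty.
 * Returns the energy stored in the last cell of the dynamic matrix. */
float global_align_energy(int n_row, int n_col, float gap,
                          PairEnergy pair, void *ctx)
{
    int w = n_col + 1;
    float *T, ene;

    assert(n_row >= 0 && n_col >= 0 && pair != NULL);
    T = malloc((size_t)(n_row + 1) * (size_t)w * sizeof *T);
    assert(T != NULL);

    for (int j = 0; j <= n_col; j++)
        T[j] = gap * (float)j;                 /* first row: leading gaps */
    for (int i = 1; i <= n_row; i++) {
        T[i * w] = gap * (float)i;             /* first column: leading gaps */
        for (int j = 1; j <= n_col; j++) {
            float match = T[(i - 1) * w + (j - 1)] + pair(i - 1, j - 1, ctx);
            float del   = T[(i - 1) * w + j] + gap;
            float ins   = T[i * w + (j - 1)] + gap;
            T[i * w + j] = min3(match, del, ins);
        }
    }
    ene = T[n_row * w + n_col];
    free(T);
    return ene;
}
```

In LOOPP the per-cell cost would come from scoring_energy() and gap_energy() rather than from a toy callback, and a trace matrix would be kept alongside so that the alignment path can be recovered.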

  31. Ene.dbg output file example. Columns: index, align=M/D/I, Native, Structure, Ene, cost, (count structural site, structural site).
         Align protein1.seq.1 ---> seq.2
         0 M TYR GLU: ene = -1.112 score=-1.112 ( 4 4)
         1 M PHE GLU: ene = -1.112 score=0.000
         2 M GLN ASP: ene = -1.376 score=-0.264 ( 1 0) ( 1 1)
         3 M GLY GLU: ene = -1.264 score=0.111 ( 3 4)
         4 M HIS GLU: ene = -1.163 score=0.102 ( 1 0) ( 1 1)
         5 M MET GLU: ene = -1.456 score=-0.294 ( 2 1)
         6 M ASN PHE: ene = -0.866 score=0.591 ( 1 6) ( 4 7) ( 1 8)
     An example for the dynamic matrix:
         align: 8fab_B---->8atc_A total_ene=405.881042 align_length=310 prot2=224 prot1=310
         index of window printing prot2=[214 224] prot1=[300 310]
         TRACE ALIGN
         D D D D D D D D D D D D D D D D D D D D D D D D D D m m m m m D D D D D D m m D D m m m m m m m m m D m m m m m D m m m m m m D m m m m D m m m m D D m m m D m m m m m D m D m m D m m m m m m D m m m m m m m m m m D m m m m m m m m D m m m D m m m m m m m m m m m m m m m m m m m m m m m m m m m m D m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m m D m m D m D m D m m D m m D D m D m D m m D m m D m D m m m D m m m m m m m m m m m D D m m D D D m D D m m D m D m m m m m D m m D m m D D m m m m m m m m m m D m m m m D D m m D m D m m m m D D m m D m m m D m m m m m m m m m m m m D m D m
         DYNAMIC MATRIX FOR GLOBAL ALIGNMENT
                 300    301    302    303    304    305    306    307    308    309    310
                 LEU    ALA    LEU    VAL    LEU    ASN    ARG    ASP    LEU    VAL    LEU
         LYS   411.4  421.6  435.3  444.7  445.9  459.5  470.0  471.3  479.1  480.3  490.0
         VAL   398.5  411.1  426.0  433.2  444.2  445.7  459.4  460.7  470.3  471.6  479.7
         ASP   392.1  400.6  418.4  426.9  435.6  444.2  445.8  447.1  461.5  462.7  472.3
         LYS   386.6  395.5  409.1  420.0  430.5  436.0  443.8  445.1  447.8  449.0  463.5
         LYS   383.2  389.9  403.9  410.7  423.5  430.8  435.6  436.8  445.8  447.1  449.8
         VAL   368.6  382.8  394.3  401.8  410.2  423.4  430.7  431.9  435.8  437.1  446.5
         GLU   356.5  370.9  389.9  395.5  404.3  410.6  423.5  424.8  432.6  433.8  437.9
         PRO   348.7  358.3  377.7  391.0  397.6  404.7  411.0  412.3  425.8  427.0  434.4
         LYS   351.4  352.0  366.7  379.3  394.6  397.9  404.2  405.5  413.0  414.2  427.8
         SER   340.3  352.7  358.3  367.6  381.0  394.8  398.1  399.4  406.5  407.8  415.6
         CYS   324.1  338.7  355.9  355.6  366.1  379.4  393.9  395.1  396.9  398.2  405.9

  32. Computing the Z score. The Z score measures the homology of prot2 to prot1 with respect to random noise. The Z score is computed in the routine align(...) in the file align.c:
         if (input->compute_zscore == YES){
             srand(RANDOM_SEED);
             shuffled_prot = alloc_prot(op,MAX_SEQ);
             for ( k=0; k<n_rnd_alignments; k++ ){
                 /* Shuffle the sequence residues of prot2 */
                 shuffle_sequence(prot2,shuffled_prot);
                 /* Add protein to align.input structure */
                 rnd_input->prot_row = prot1;
                 rnd_input->prot_col = shuffled_prot;
                 /* Compute random energy of aligning the random sequence into prot1 */
                 if (align_data->input.alignment_type == GLOBAL){
                     if (op->strucAlignment)
                         rnd_ene = struc_global_align(&rnd_align,do_trace_back,op,model);
                     else
                         rnd_ene = global_align(&rnd_align,do_trace_back,op,model);
                 }
                 else if (align_data->input.alignment_type == LOCAL)
                     rnd_ene = local_align(&rnd_align,do_trace_back,op,model);
                 sumT += rnd_ene;
                 sumT2 += rnd_ene*rnd_ene;
             }
             /* Compute average energy of aligning the random sequence into prot1 */
             avT = sumT/n_rnd_alignments;
             avT2 = sumT2/n_rnd_alignments;
             norm = fabs(avT2 - avT*avT);
             align_data->assess.score = avT;
             if (norm == 0)
                 align_data->assess.zscore = -999.9;
             else
                 align_data->assess.zscore = -(align_data->assess.ene - avT)/sqrt(norm);
             if (align_data->assess.zscore < -999.9)
                 align_data->assess.zscore = -999.9;
         }

  33. Printing the statistics of aligning the query to the LOOPP database.
         #Mon Jul 07 10:56:28 2003
         #LOOPP V2: ALIGNMENT INFORMATION
         #======================================================
         #This file contains statistics of sequence to sequence alignment
         #with constant gap penalty 8.000000
         #and the potential is multiplied with the factor scale 1.000000
         #Alignment type : GLOBAL
         #The following models were used:
         #Potential : NHseq_gte_thom2.pot
         #Model : SEQ_M with mixing parameter: 1.000000
         #The model produced the alignment : YES
         #
         #Data Base : H:\users\galor\LISTS\test
         #The difference in length between the query sequence and the data base sequence is less then 30.00 percent
         #
         #The number of random sequence to compute zscore was set to 100
         #Only prints zscore above threshold 0.00
         # ========================================================
         # 1 matches to 1dbt_A
         zscore ene identity te_ene te_zscore length align_len
         7tim_A 0.11 -89.00 5.40 999.00 999.90 247 278

  34. The Threading module. What is threading? Sequence information is used for the probe protein (S1: AWGHKI), and structural information is used for the target (X2: s1 s0 s2 s3 s0 s3). [Figure labels: G K I H]

  35. List of functions:
     • drv_threading_a_list_of_seq_against_the_db(): Threads a list of sequences against the database.
     • drv_threading_a_seq_against_the_db(): Threads one sequence against the database.
     • LM_drv_threading_a_list_against_the_db(): Threads a list of sequences against the database (low memory).
     • drv_threading_a_seq_against_a_struc(): Threads a sequence against one structure.
     • drv_threading_a_db_against_itself(): Threads the database against itself; used for recognizing the native.
     • set_threading_attributes(): Sets the attributes for the alignment of a sequence to a structure.
     • thom2_gapless_threading_gap_penalty(): Computes gaps for Thom2 model, REJM model.
     • thom2_threading_scoring_energy(): Computes the scoring energy for Thom2.

  36. The seq module.
     • drv_seq_alignment_of_a_list_of_seq_against_the_db(): Aligns a list of sequences against the database.
     • LM_drv_seq_alignment_a_list_against_the_db(): Aligns a list of sequences against the database (low memory).
     • drv_seq_alignment_of_db_against_the_db(): Aligns the database against itself (for recognizing the native).
     • drv_seq_alignment_of_seq_against_seq(): Aligns one sequence against one sequence.
     • set_seq_attributes(): Sets the attributes for sequence alignment.
     • seq_alignment_gap_penalty(): Computes the gap penalty that depends on the structural site.
     • seq_alignment_constant_gap_penalty(): Computes the constant gap penalty.
     • constant_seq_alignment_gap_penalty_for_pdb_seq_to_atom(): Computes the gap penalty for aligning the SEQRES section to the ATOM section of a PDB file.
     • seq_alignment_scoring_energy(): Computes the scoring energy using BLOSUM 50.

  37. The PDB module. The main task of this module is to create the database for LOOPP. The database is created in two stages: in the first step the PDB files are cleaned, and in the second step the LOOPP files are created.
     The first step: a PDB file pdb****.ent is converted into 2 files, pdb****.ent.log and pdb****.ent.new:
     • pdb****.ent.new: a clean PDB derived from the original PDB.
     • pdb****.ent.log: a log file that contains information on the clean PDB.
     The latter file contains lines of the form "tag resName resSeqNum atomCounter gapIndicator CA-distance", which describe how the file *.new was derived from the original PDB file *.ent:
     • <tag>: one of the characters +, -, = or *:
         + : an NTER or CTER card was added in *.new as a chain designator
         - : the residue was deleted in *.new
         = : the residue was copied to *.new
         * : the residue was copied, but some of its atoms are missing in *.new
     • <atomCounter>: the number of atoms found for the current residue.
     • <gapIndicator>: the index in a chain. A chain starts with index 1, and terminates with index 0 if no CA is found at the current residue.
     • <CA-distance>: the C-alpha distance between the previous and the current residue.
     The new files *.new and *.log are created in the location defined by the option USR_PDB_PATH in the parameter file loopp.par.
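A record of the *.log file could be read along the following lines. The LogRecord type and parse_log_line() are hypothetical, and the sketch assumes whitespace-separated fields with plain integer residue numbers; the real files may deviate from this, for instance on the '+' lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of a pdb****.ent.log file. */
typedef struct {
    char tag;             /* one of + - = *                       */
    char res_name[8];     /* residue name, e.g. ALA               */
    int res_seq_num;      /* residue sequence number              */
    int atom_counter;     /* atoms found for this residue         */
    int gap_indicator;    /* index in the chain                   */
    double ca_distance;   /* C-alpha distance to previous residue */
} LogRecord;

/* Parse "tag resName resSeqNum atomCounter gapIndicator CA-distance".
 * Returns 1 on success, 0 on a malformed line or an unknown tag. */
int parse_log_line(const char *line, LogRecord *rec)
{
    int n;

    assert(line != NULL && rec != NULL);
    n = sscanf(line, " %c %7s %d %d %d %lf",
               &rec->tag, rec->res_name, &rec->res_seq_num,
               &rec->atom_counter, &rec->gap_indicator, &rec->ca_distance);
    if (n != 6)
        return 0;
    return strchr("+-=*", rec->tag) != NULL;
}
```

Counting the '-' and '*' records of a chain would then give a quick measure of how much of the original PDB entry was dropped or is incomplete.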

  38. The routine responsible for cleaning the PDB is drv_clean_pdb_from_a_list_of_pdb_names(). It calls the interface routine openInterfaceToCleanPDB() in the file PDBparser.c. Step 2: generating the LOOPP database. The routine structure for parsing a PDB file containing all sections as defined by the RCSB database:

  39. The main database in the file pdb.c, PDB_INFO:
     • code_name: PDB acronym.
     • Chain_code_file: chain identity (extracted from the file name).
     • Res_atom: a protein whose sequence is taken from the SEQRES section.
     • Res_atom: a protein whose sequence and coordinates are taken from the ATOM section.
     • N_card: number of cards in the ATOM section.
     • N_atom_list[]: the number of atoms for the current residue.
     • Trace_pass2: pass 2 alignment trace.
     • Trace_original: pass 1 alignment trace.
     • Trace_final: final alignment trace.
     • gapMarkbond: saves gapIndicator from the *.log file.
     • JumpMarkBond: gap marker according to C-alpha bond length.
     • jumpMarkPdb: gap marker according to bond residue index.
     • Align_info: alignment of the SEQRES section onto the ATOM section.
     • Current_chain_id: the current chain, in case there are several chains in the PDB file.
     • DiscrepancyInJump: flag indicating that JumpMarkBond and JumpMarkPdb disagree.
     • MatchNotIdetical: in the alignment of SEQRES onto ATOM there is a match, but the residues are not identical (an error in the PDB file).

  40. Algorithm outline. The PDB files from the RCSB database are full of discrepancies. One way to fish out these problems is to align the SEQRES section to the residues of the ATOM section. The program uses a sequence alignment with a constant gap penalty and a constant match score. After the first pass, the alignment trace (the alignment path) must be checked to see whether the alignment makes sense: that gaps are concentrated in distinct areas, that gaps according to the C-alpha distance correspond to jumps in the PDB index, and, for matched segments, that the residue in the SEQRES section coincides with that in the ATOM section. In pass 2 the program tries to correct some of the errors by shifting gaps or by using a different alignment that is not based on dynamic programming. The user can choose which alignment path makes more sense, based on reading the comments in the PDB file, or can manually make his own version if both alignments fail. When the program fails, only a small portion of the alignment needs to be corrected; in this case the program prints one section of the alignment at a time and waits for approval or correction. Unfortunately, on rare occasions the algorithm may fail completely.

  41. MPS module. Converts LOOPP-format output to MPS format. The design of a new potential eventually leads to solving a set of linear equations: LOOPP generates LHS and RHS files, which contain the coefficients of the inequalities. One option for solving this set is the BPMPD software, which requires input in MPS format. Here is a simple example of an MPS file:
         NAME example2.mps
         ROWS
          N obj
          L c1
          L c2
         COLUMNS
          x1 obj -1 c1 -1
          x1 c2 1
          x2 obj -2 c1 1
          x2 c2 -3
          x3 obj -3 c1 1
          x3 c2 1
         RHS
          rhs c1 20 c2 30
         BOUNDS
          UP BOUND x1 40
         ENDATA
     Steps of the conversion (labels from the slide): read the LOOPP LHS and RHS files into memory; write the bilinear objective or the linear objective; define the spacing of the fields; define the variable names.
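A minimal writer for a file of this shape is sketched below. It assumes a dense row-major coefficient matrix, "L" rows only, one entry per COLUMNS line and no BOUNDS section; write_mps() is a hypothetical function and does not reproduce loopp2mps_nobj() or the fixed field widths that a particular solver may require.

```c
#include <assert.h>
#include <stdio.h>

/* Write an LP with m "L" rows and n columns in a simple MPS layout.
 * `a` is the m x n coefficient matrix in row-major order, `obj` holds
 * the n objective coefficients and `rhs` the m right-hand sides.
 * Zero coefficients are skipped. Returns 0 on success, -1 on I/O error. */
int write_mps(FILE *out, const char *name, int m, int n,
              const double *obj, const double *a, const double *rhs)
{
    assert(out != NULL && name != NULL && m > 0 && n > 0);
    assert(obj != NULL && a != NULL && rhs != NULL);

    fprintf(out, "NAME %s\nROWS\n N obj\n", name);
    for (int i = 0; i < m; i++)
        fprintf(out, " L c%d\n", i + 1);

    fprintf(out, "COLUMNS\n");
    for (int j = 0; j < n; j++) {
        if (obj[j] != 0.0)
            fprintf(out, " x%d obj %g\n", j + 1, obj[j]);
        for (int i = 0; i < m; i++)
            if (a[i * n + j] != 0.0)
                fprintf(out, " x%d c%d %g\n", j + 1, i + 1, a[i * n + j]);
    }

    fprintf(out, "RHS\n");
    for (int i = 0; i < m; i++)
        fprintf(out, " rhs c%d %g\n", i + 1, rhs[i]);

    fprintf(out, "ENDATA\n");
    return ferror(out) ? -1 : 0;
}
```

Called with the coefficients of the example above (objective -1, -2, -3; rows c1 and c2; right-hand sides 20 and 30), it emits the same rows and columns, apart from the upper bound on x1.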

  42. The driver that converts LOOPP output to MPS format is drv_loopp2mps(). This routine calls different routines depending on the parameters in the *.par file. The most common case is solving the linear set of inequalities without an objective: loopp2mps_nobj().

  43. Design of a new potential: gapless threading. Create equations of the type a_j[0]x[0]+...+a_j[n]x[n] = r_j, where the coefficients a_j[0],...,a_j[n] (j=1...m) are stored in the file op->train_lhs_file_nobj and the right-hand sides r_j (j=1...m) are stored in op->train_rhs_file_nobj. The equations are generated by the gapless threading method, which assigns a sequence to a structure without gaps. The pair (seq_i, struc_j) constructs a pseudo protein, denoted a decoy. The equation is defined as the energy difference between assigning a native sequence to a decoy structure (a non-native structure) and assigning the native sequence to its native structure: E(N->D)-E(N->N)=A*X > 0. The energy definition depends on the model chosen by the user in USR_MODEL_TYPE. The sequence should be shorter than the structure. The sequence is slid along the structure (N_struc - n_seq + 1) times, or fewer, depending on USR_GAPLESS_THREADING_WINDOW. There are two main routines for gapless threading:
     • drv_compute_fix_threading_constrains(): Generates the LHS, RHS and LOG files from a LOOPP database.
     • drv_compute_fix_threading_constraints_where_the_db_is_based_on_abintio_decoys(): Generates the LHS, RHS and LOG files for an ab initio database. The difference is that the native is gaplessly threaded onto its own family of decoys. Examples of decoy sets are the Skolnick set, the TB set and the Baker set. The length of a decoy equals that of its native.
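How one row of A might be assembled can be sketched under a strong simplifying assumption: the energy is linear in the parameters x, with one parameter per (residue type, site type) pair, so that E equals the dot product of an assignment-count vector with x. The sizes N_AA and N_SITE and all three function names are hypothetical and do not come from LOOPP.

```c
#include <assert.h>
#include <string.h>

enum { N_AA = 20, N_SITE = 16 };       /* assumed: 20 residue types, 16 site types */
enum { N_PARAM = N_AA * N_SITE };

/* Number of gapless placements of a sequence on a structure. */
int n_placements(int n_struc, int n_seq)
{
    assert(n_seq > 0);
    return n_struc >= n_seq ? n_struc - n_seq + 1 : 0;
}

/* Count how often residue type a lands on site type x when the sequence
 * is laid on the structure starting at `offset`, without gaps. */
void count_assignments(const int *seq, int n_seq, const int *site,
                       int offset, int *count)
{
    assert(seq != NULL && site != NULL && count != NULL && offset >= 0);
    memset(count, 0, N_PARAM * sizeof *count);
    for (int i = 0; i < n_seq; i++) {
        assert(seq[i] >= 0 && seq[i] < N_AA);
        assert(site[offset + i] >= 0 && site[offset + i] < N_SITE);
        count[seq[i] * N_SITE + site[offset + i]]++;
    }
}

/* One LHS row: decoy counts minus native counts, so that the dot
 * product of the row with x equals E(N->D) - E(N->N). */
void constraint_row(const int *decoy_count, const int *native_count, int *row)
{
    assert(decoy_count != NULL && native_count != NULL && row != NULL);
    for (int k = 0; k < N_PARAM; k++)
        row[k] = decoy_count[k] - native_count[k];
}
```

Looping the offset from 0 to n_placements() - 1 over every decoy structure yields one such row per placement; requiring each row's dot product with x to be positive is the inequality A*X > 0 from the slide.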
