Protein Secondary Structure Prediction Using BLAST and Exhaustive RT-RICO (Relaxed Threshold Rule Induction from Coverings). Leong Lee, Ph.D., Department of Computer Science, Austin Peay State University, Clarksville, Tennessee, USA
Protein Secondary Structure Prediction Using BLAST and Exhaustive RT-RICO (Relaxed Threshold Rule Induction from Coverings) Leong Lee, Ph.D., Department of Computer Science, Austin Peay State University, Clarksville, Tennessee, USA; Jennifer L. Leopold, Ph.D., Department of Computer Science, and Ronald L. Frank, Ph.D., Department of Biological Sciences, Missouri University of Science and Technology, Rolla, Missouri, USA
Introduction Central Dogma of Biology Protein Structure Prediction: A Brief Introduction Protein Secondary Structure Prediction Problem Related Work BLAST-ERT-RICO Exhaustive RT-RICO Rule Generation Algorithm References
What is life made of? What are living organisms made of?
Molecular Biology: A Brief Introduction • What is life made of? • Organisms are made of cells • A great diversity of cells exist in nature, but they have some common features (Jones and Pevzner, 2004) • Born, eat, replicate, and die • A cell would be roughly analogous to a car factory
Molecular Biology: A Brief Introduction All life on this planet depends mainly on three types of molecules: DNA, RNA, and proteins A cell’s DNA holds a library describing how the cell works RNA acts to transfer short pieces of information to different places in the cell, smaller volumes of information are used as templates to synthesize proteins Proteins perform biochemical reactions, send signals to other cells, form body’s components, and do the actual work of the cell. (Jones and Pevzner, 2004)
Central Dogma of Biology DNA --> transcription --> RNA --> translation --> protein Is referred to as the central dogma in molecular biology (Jones and Pevzner, 2004) DNA sequence determines protein sequence Protein sequence determines protein structure Protein structure determines protein function Regulatory mechanisms deliver the right amount of the right function to the right place at the right time (Lesk, 2008)
Molecular Biology: A Brief Introduction DNA: its structure and four genomic letters code for all living organisms; it has a double helix structure and can replicate. The four bases are Adenine, Guanine, Thymine, and Cytosine, which pair A-T and C-G on complementary strands (chemically attached) (Jones and Pevzner, 2004)
Molecular Biology: A Brief Introduction Cell Information: instruction book of life DNA/RNA: strings written in four-letter nucleotide (A C G T/U) Protein: strings written in 20-letter amino acid Example, the transcription of DNA into RNA, and the translation of RNA into a protein (Jones and Pevzner, 2004) DNA: TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
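The DNA, RNA, and protein strings in this example can be reproduced in a few lines. This is a minimal sketch and not part of the original slides: the function names `transcribe` and `translate` are illustrative, the codon table is the standard genetic code, and, like the slide, the sketch pairs the DNA template strand base by base without modelling strand direction.

```python
# Standard genetic code, with codons ordered U, C, A, G at each position.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Base pairing used when RNA is copied from the DNA template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna: str) -> str:
    """Return the RNA complementary to a DNA template (spaces ignored)."""
    dna = template_dna.replace(" ", "").upper()
    return "".join(DNA_TO_RNA[base] for base in dna)


def translate(rna: str) -> str:
    """Translate RNA codon by codon (one-letter codes) until a stop codon."""
    protein = []
    for pos in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


rna = transcribe("TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT")
print(rna)             # AUGGCGCCGAUAAUGACGGUCCUUCCUUGA
print(translate(rna))  # MAPIMTVLP
```

In one-letter codes, `MAPIMTVLP` is Met Ala Pro Ile Met Thr Val Leu Pro, and the final UGA codon is the stop.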
Molecular Biology: A Brief Introduction Image courtesy of Griffiths et al. Genetic code, from the perspective of mRNA. AUG also acts as a “start” codon
Protein Structure Prediction: A Brief Introduction >1PSN:A|PDBID|CHAIN|SEQUENCE VDEQPLENYLDMEYFGTIGIGTPAQDFTVVFDTGSSNLWVPSVYCSSLACTNHNRFNPEDSSTYQSTSETVSITYGTGSMTGILGYDTVQVGGISDTNQIFGLSETEPGSFLYYAPFDGILGLAYPSISSSGATPVFDNIWNQGLVSQDLFSVYLSADDQSGSVVIFGGIDSSYYTGSLNWVPVTVEGYWQITVDSITMNGEAIACAEGCQAIVDTGTSLLTGPTSPIANIQSDIGASENSDGDMVVSCSAISSLPDIVFTINGVQYPVPPSAYILQSEGSCISGFQGMNLPTESGELWILGDVFIRQYFTVFDRANNQVGLAPVA Image courtesy of RCSB Protein Data Bank (http://www.pdb.org) 3D structure of pepsin (PDB ID: 1PSN)
Protein Structure Prediction: A Brief Introduction Genomic projects provide us with the linear amino acid sequence of hundreds of thousands of proteins If only we could learn how each and every one of these folds in 3D… Malfunctioning of proteins is the most common cause of endogenous diseases Most life-saving drugs act by interfering with the action of a foreign protein So far, most drugs have been discovered by trial and error Because we lack understanding of the complex interplay of proteins, drugs might not be aimed at the best target, hence side effects (Tramontano, 2006)
Protein Structure Prediction: A Brief Introduction • Experimental methods can provide us with the precise arrangement of every atom of a protein. • X-ray crystallography and NMR spectroscopy • X-ray crystallography requires the protein or complex to form a reasonably well-ordered crystal, a feature that is not universally shared by proteins. • NMR spectroscopy needs proteins to be soluble, and there is a limit to the size of protein that can be studied. • Both are time-consuming techniques, so we cannot hope to use them to solve the structures of all proteins in the universe in the near future. • Problem: How to relate the amino acid sequence of a protein to its 3D structure. It is estimated that the human body may contain over two million proteins, coded for by only 20,000 - 25,000 genes. The total number of proteins found in terran biological organisms is likely to exceed ten million, but nobody knows for sure. Data is available on just over a million proteins. …wisegeek.com
Background – Protein Primary Structure Image courtesy of National Human Genome Research Institute (NHGRI) • Protein primary structures are chains of amino acids • 20 amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} • 1san:A • MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG
Background - Protein Secondary Structure • Secondary structure is normally defined by hydrogen bonding patterns • Amino acids vary in ability to form various secondary structure elements • 8 types of secondary structure defined: {G, H, I, T, E, B, S, -} Image courtesy of Carl Fürstenberg. Alpha helices are shown in color and random coil in white; no beta sheets are shown.
>1SAN:A:sequence
MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG
>1SAN:A:secstr
----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--
Protein Secondary Structure Prediction - Motivation Important research problem in bioinformatics / biochemistry High importance for design of drugs and novel enzymes Determination of protein structures by experimental methods is lagging far behind discovery of protein sequences Predicting protein tertiary structure is an extremely challenging problem, but tractable if using simpler secondary structure definitions; focus for current research (tertiary structure of a protein is its three-dimensional structure, as defined by the atomic coordinates)
Protein Secondary Structure Prediction Problem Description • Input (Baldi et al., 2000) • Amino acid sequence, A = a1, a2, … aN • Data for comparison, D = d1, d2, … dN • ai is an element of a set of 20 amino acids, {A,R,N…V} • di is an element of a set of secondary structures, {H,E,C}, which represents helix H, sheet E, and coil C. • Output • Prediction result: X = x1, x2, … xN • xi is an element of a set of secondary structures, {H,E,C} • 3-Class Prediction (Zhang and Zhang, 2003) • Multi-class prediction problem with 3 classes {H,E,C} in which one obtains a 3 x 3 confusion matrix Z = (zij)
Protein Secondary Structure Prediction Problem Description • 3 x 3 matrix (3 classes)

                  Prediction
                  H      E      C
Reality     H     Z11    Z12    Z13
            E     Z21    Z22    Z23
            C     Z31    Z32    Z33

Zij: input predicted to be in class j while in reality belonging to class i
$Q_{total} = 100 \sum_i Z_{ii} / N$ (percentage)
Q3 Score • Q3 = Wαα + Wββ + Wcc Wαα = % of all residues correctly predicted as helix Wββ = % of all residues correctly predicted as sheet Wcc = % of all residues correctly predicted as coil • Example of Q3 calculation Protein: 10% helices, 10% sheets, 80% coils Prediction: 100% coils Q3 = 0% + 0% + 80% = 80% (0.80)
Q3 Score • Q3 = Wαα + Wββ + Wcc Wαα = % of all residues correctly predicted as helix Wββ = % of all residues correctly predicted as sheet Wcc = % of all residues correctly predicted as coil • Example of Q3 calculation, length 10 Amino acid (primary structure) sequence (A): MTYTRYQTLE (Secondary structure) data for comparison (D): HHHEEECCCC (Secondary structure) prediction (X): HHEEECCCCC Q3 = 2/10 + 2/10 + 4/10 = 0.80 (80%)
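The confusion matrix and the Q3 score for the length-10 example can be checked with a short script. This is a minimal sketch; the function names `confusion_matrix` and `q3` are illustrative and do not come from the original slides.

```python
from collections import Counter

STATES = "HEC"  # helix, sheet, coil


def confusion_matrix(actual: str, predicted: str) -> list[list[int]]:
    """Return Z where Z[i][j] counts residues of class i predicted as class j."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in STATES] for a in STATES]


def q3(actual: str, predicted: str) -> float:
    """Percentage of residues on the diagonal of the confusion matrix."""
    z = confusion_matrix(actual, predicted)
    correct = sum(z[i][i] for i in range(len(STATES)))
    return 100.0 * correct / len(actual)


print(confusion_matrix("HHHEEECCCC", "HHEEECCCCC"))  # [[2, 1, 0], [0, 2, 1], [0, 0, 4]]
print(q3("HHHEEECCCC", "HHEEECCCCC"))                # 80.0

# The all-coil prediction from the earlier example also scores 80%,
# although it finds no helix or sheet at all.
print(q3("H" + "E" + "C" * 8, "C" * 10))             # 80.0
```

The second case shows why Q3 alone can be misleading for proteins dominated by one class.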
Related Work Not easy to evaluate the performance of a protein secondary structure prediction method (e.g., different datasets used for training and testing) Rost and Sander (1993a) selected a list of 126 protein domains (RS126); now constitutes comparative standard Cuff and Barton (1999) described development of non-redundant test set of 396 protein domains (CB396) PHD, one of the first methods surpassing the 70% accuracy threshold, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)
Related Work In evolutionary biology, homology refers to any similarity between characteristics of organisms that is due to their shared ancestry. Homology among proteins and DNA is often concluded on the basis of sequence similarity, especially in bioinformatics. For example, in general, if two or more genes have highly similar DNA sequences, it is likely that they are homologous. But sequence similarity may also arise without common ancestry. PHD effectively utilizes evolutionary information by exploiting the well-known fact that homologous proteins have similar 3D structures Random mutations in DNA sequence can lead to different amino acids in the protein sequences Mutations resulting in a structural change are not likely to retain protein function; thus, structure is more conserved than sequence (Rost, 2003) Rost (2003) has also stated that a value of around 88% will likely be the operational upper limit for prediction accuracy
History of Prediction Accuracy The secondary structure prediction problem was first defined in the 1960s Before the 1990s, the prediction accuracy was only around 60% for most methods Recently, some methods have reached or even surpassed 80% accuracy (Q3 score), by utilizing evolutionary information of proteins, large databases, and various machine learning approaches such as artificial neural networks and support vector machines. How did we reach/surpass this 80% threshold?
Rost’s Neural Network (Rost and Sander 1993a) Image courtesy of Rost and Sander
Rost’s Neural Network (Rost and Sander 1993a) PHD, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)
BLAST-ERT-RICO Given input protein A (amino acid sequence, A = a1, a2, … aN), protein BLAST search (Web-based) performed using A as query sequence BLAST returns a list of proteins with significant sequence alignments Suitable proteins chosen to form training dataset for A RT-RICO algorithm generates rules from the training dataset; rules used to predict the secondary structure for protein A Output is predicted secondary structure sequence X
BLAST-ERT-RICO Step 1 Online BLAST and PDB Data Match BLAST search (Web crawler program) performed using A as query sequence, say A = APAFSVSPASGASDGQSVSVSVAAAGETYYI… Returns list of proteins with significant sequence alignments and corresponding BLAST scores; proteins with score ≤ 30 removed from list (test protein A also removed) Some of these proteins may have corresponding secondary structure records in PDB (Berman et al., 2000) Those retrieved records become inputs to the next step, data preparation If a protein from the list does not have a known secondary structure record in PDB, we will require data from offline preprocessing
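The filtering rule in this step (drop hits with a BLAST score of 30 or less, and drop the test protein itself) can be sketched as follows. The `BlastHit` record, the function name `select_training_hits`, and the identifiers and scores in the sample list are all hypothetical; the slides do not describe how the Web crawler stores its results.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    protein_id: str  # hypothetical identifier of an aligned protein
    score: float     # BLAST score reported for the alignment


def select_training_hits(hits, query_id, score_cutoff=30.0):
    """Keep hits scoring above the cutoff, excluding the query protein."""
    return [
        hit for hit in hits
        if hit.score > score_cutoff and hit.protein_id != query_id
    ]


# Made-up search result for a query protein labelled "QUERY".
hits = [
    BlastHit("QUERY", 410.0),  # the test protein matches itself
    BlastHit("HIT_A", 152.0),
    BlastHit("HIT_B", 30.0),   # score <= 30, removed
    BlastHit("HIT_C", 31.5),
]
print([hit.protein_id for hit in select_training_hits(hits, "QUERY")])
# ['HIT_A', 'HIT_C']
```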
BLAST-ERT-RICO Step 2 Data Preparation (Math content, skip) For test protein A, there is a set of protein primary structure sequences $B_i$ and a set of corresponding secondary structure sequences $C_i$, where $B_i \in \{B_1, B_2, B_3, B_4, \dots, B_y\}$ and $C_i \in \{C_1, C_2, C_3, C_4, \dots, C_y\}$ Primary structure sequence is $B_i = b_{i,1}, b_{i,2}, b_{i,3}, \dots, b_{i,w_i}$ Corresponding secondary structure sequence is $C_i = c_{i,1}, c_{i,2}, c_{i,3}, \dots, c_{i,w_i}$ $B_1$ to $B_y$ are not necessarily of the same length, because they represent different proteins Each $b_{i,j}$ is an element of a set of 20 amino acids, {A,R,N…V} $c_{i,j}$ is an element of the set of 8-state secondary structures, {H, G, I, E, B, T, S, -} (PDB); it is converted to an element of a set of 4-state secondary structures, {H, E, C, -}
BLAST-ERT-RICO Step 2 Data Preparation (Math content, skip) If $B_i$ is a primary structure sequence, $C_i$ is the corresponding secondary structure sequence, and the length of both sequences is $w_i$, then each n-residue segment is of the form: $b_{i,j-\lfloor n/2 \rfloor}, \dots, b_{i,j-1}, b_{i,j}, b_{i,j+1}, \dots, b_{i,j+\lfloor n/2 \rfloor}, c_{i,j}$; and $j$ takes values from $\lceil n/2 \rceil$ to $w_i - \lfloor n/2 \rfloor$ This data preparation step is performed for all $B_i$ and $C_i$ pairs, where $i$ is from 1 to $y$ These n-residue segments are the main inputs to the ERT-RICO rule generation algorithm
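The segment construction can be illustrated with a sliding window. This is a minimal sketch: the function name `make_segments` is illustrative, indices are 0-based (the slide uses 1-based positions), and the sample input is the first 13 residues of 1SAN chain A with the matching secondary structure characters from the earlier slide.

```python
def make_segments(primary: str, secondary: str, n: int = 9):
    """Pair each n-residue window with the secondary structure of its centre."""
    if len(primary) != len(secondary):
        raise ValueError("sequences must have the same length")
    if n % 2 == 0:
        raise ValueError("n must be odd so that a window has a centre residue")
    half = n // 2
    segments = []
    # The centre j runs over every position with `half` residues on each side.
    for j in range(half, len(primary) - half):
        window = primary[j - half:j + half + 1]
        segments.append((window, secondary[j]))
    return segments


segments = make_segments("MTYTRYQTLELEK", "----HHHHHHHHH", n=9)
for window, state in segments:
    print(window, state)
# MTYTRYQTL H
# TYTRYQTLE H
# YTRYQTLEL H
# TRYQTLELE H
# RYQTLELEK H
```

A sequence of length w yields w - n + 1 segments, which is why the number of rows m grows quickly as more aligned proteins are added to the training set.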
BLAST-ERT-RICO Step 2 Data Preparation • Protein primary structure n-residue segments and related secondary structure elements representation (n=9)
BLAST-ERT-RICO Step 3 Rule Generation
• Sample rules generated by ERT-RICO (n=9, m=1708)
+,+,+,L,+,+,+,+,S,E,84.21,19,16,0.93676815
+,+,+,T,V,+,+,+,+,E,76.47,51,39,2.28337237
Q,A,+,+,+,+,+,+,G,E,100.00,7,7,0.40983607
……
(3,L)(8,S) -> (9,E), 84.21%, occurrences of ((3,L)(8,S)) = 19, occurrences of ((3,L)(8,S) -> (9,E)) = 16, Support % = 0.93676815
(3,T)(4,V) -> (9,E), 76.47%, occurrences of ((3,T)(4,V)) = 51, occurrences of ((3,T)(4,V) -> (9,E)) = 39, Support % = 2.28337237
(0,Q)(1,A)(8,G) -> (9,E), 100.00%, occurrences of ((0,Q)(1,A)(8,G)) = 7, occurrences of ((0,Q)(1,A)(8,G) -> (9,E)) = 7, Support % = 0.40983607
……
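The relationship between the comma-separated rows and the decoded rules can be made explicit with a small parser. This is a sketch under assumptions: the field order (n pattern symbols, decision, confidence %, antecedent count, rule count, support %) is inferred from the sample rows, and the names `parse_rule_row` and `format_rule` are illustrative.

```python
def parse_rule_row(row: str, n: int = 9) -> dict:
    """Split one comma-separated rule row; '+' means "any residue"."""
    fields = row.split(",")
    pattern = fields[:n]
    return {
        "antecedent": [(pos, aa) for pos, aa in enumerate(pattern) if aa != "+"],
        "decision": fields[n],
        "confidence": float(fields[n + 1]),      # percent
        "antecedent_count": int(fields[n + 2]),
        "rule_count": int(fields[n + 3]),
        "support": float(fields[n + 4]),         # percent
    }


def format_rule(rule: dict, n: int = 9) -> str:
    """Render a parsed rule in the (position,letter) notation of the slide."""
    lhs = "".join(f"({pos},{aa})" for pos, aa in rule["antecedent"])
    return f"{lhs} -> ({n},{rule['decision']}), {rule['confidence']:.2f}%"


rule = parse_rule_row("+,+,+,L,+,+,+,+,S,E,84.21,19,16,0.93676815")
print(format_rule(rule))  # (3,L)(8,S) -> (9,E), 84.21%

# The reported percentages agree with the counts when m = 1708 segments.
m = 1708
print(round(100 * rule["rule_count"] / rule["antecedent_count"], 2))  # 84.21
print(round(100 * rule["rule_count"] / m, 8))                         # 0.93676815
```

In other words, the sample values are consistent with confidence = occurrences of the rule / occurrences of its left-hand side, and support = occurrences of the rule / m.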
BLAST-ERT-RICO Step 4 Prediction • Protein primary structure n-residue segments and related secondary structure elements prediction (n=9) • Here xi is an element of the set {H,E,C,-}. It is then converted to an element of the set {H, E, C}.
BLAST-ERT-RICO Step 4 Prediction (may skip) The prediction algorithm is also dependent on the selection of the threshold value Suppose that a threshold value t = 0.8 (80%) is chosen The algorithm first searches for matching rules with 100% confidence value. The secondary structure element with the highest total support value (among 100% confidence value rules) is selected If no matching rule exists among 100% confidence value rules, the algorithm then searches for other matching rules (with confidence values greater than or equal to 90%) If no matching rule exists among those with confidence value greater than or equal to 90%, the algorithm searches for matching rules with confidence values greater than or equal to 80%
BLAST-ERT-RICO Step 4 Prediction (may skip) This lowering of the confidence level (in steps of 10 percentage points) stops at the threshold value t, in this case 80% (threshold = 0.8); it can go lower if a smaller t is chosen The secondary structure element with the highest total support value among these rules is selected as the predicted secondary structure element for that specific position If no matching rule is found for the segment at all, the secondary structure of the previous position is used as the predicted secondary structure
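The tiered search described in these two slides can be sketched as below. Several details are assumptions rather than part of the original method description: rules are represented as (pattern, state, confidence %, support %) tuples with '+' as a wildcard, the function names are illustrative, and the `previous` argument defaults to coil for the very first position, a case the slides do not cover.

```python
from collections import defaultdict


def matches(pattern: str, segment: str) -> bool:
    """A pattern matches when every non-wildcard position equals the segment."""
    return all(p == "+" or p == s for p, s in zip(pattern, segment))


def predict_position(segment, rules, t=0.8, previous="C"):
    """Try confidence levels 100%, 90%, ... down to t; fall back to `previous`."""
    t_pct = round(t * 100)
    levels = list(range(100, t_pct - 1, -10))
    if not levels or levels[-1] != t_pct:
        levels.append(t_pct)  # make sure the search ends exactly at t
    for level in levels:
        support_by_state = defaultdict(float)
        for pattern, state, confidence, support in rules:
            if confidence >= level and matches(pattern, segment):
                support_by_state[state] += support
        if support_by_state:
            # Pick the state with the highest total support at this level.
            return max(support_by_state, key=support_by_state.get)
    return previous


# Toy rules with n = 3 to keep the example short.
rules = [
    ("A++", "H", 100.0, 1.0),
    ("A+G", "E", 100.0, 0.5),
    ("+L+", "E", 90.0, 3.0),
]
print(predict_position("ALG", rules))                # H (decided at 100%)
print(predict_position("QLG", rules))                # E (decided at 90%)
print(predict_position("QQQ", rules, previous="H"))  # H (no rule matches)
```

Note that the first call returns H even though the 90% rule carries more support for E: lower-confidence rules are consulted only when no rule matches at a higher level.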
BLAST-ERT-RICO, Offline Preprocessing (future work needed here) If no protein with significant sequence alignments has a corresponding known secondary structure sequence from PDB (answer is “no” in Fig. 1.), prediction for the test protein needs to be handled slightly differently All proteins and corresponding secondary structure sequences from PDB downloaded to form initial dataset; test datasets (RS126 or CB396) removed; protein domains from different protein families selected to form training datasets Now we have a set of protein primary structure sequences Bi and corresponding secondary structure sequences Ci; same data preparation, rule generation, and prediction steps applied
Exhaustive RT-RICO (ERT-RICO) Rule Generation Algorithm Rule generation is the most computationally intensive step Previously, this research team presented a prediction method, BLAST-RT-RICO Some areas of the algorithm were in need of improvement; most importantly, the time complexity for the rule generation step needed to be reduced RT-RICO has a time complexity of $O(m^2 2^n)$, where m is the number of all entities (the number of rows of n-residue segments), and n = |S| (the number of attributes). $m^2$ dominates the time complexity because n is a small value (9 for this case)
Exhaustive RT-RICO (ERT-RICO) Rule Generation Algorithm Sometimes a very large m can cause running time issues When we ran datasets with different n value and t (threshold) value combinations to find the optimal segment length and threshold value, we faced the challenge of running several datasets in a reasonable period of time We developed the Exhaustive RT-RICO algorithm (ERT-RICO), which is a modified version of the old RT-RICO algorithm, and has an improved time complexity of $O(m \log(m) \, 2^n)$. $m \log(m)$ dominates the time complexity ERT-RICO has a space complexity of $O((2^n - 1)(20^n)(4))$; in practice the space required is much smaller than that, because different segments generate a large number of duplicate rules
ERT-RICO Rule Generation Algorithm Space complexity could be an issue; for n = 9, we need $(2^9 - 1)(20^9)(4)$, around $1.04653 \times 10^{15}$ counters; we made data structure adjustments (different segments generate lots of duplicate rules) We know all possible values for each position in a segment (hence all possible rules) For an m×(n+1) matrix, each row (segment) is of length n+1 The first n elements are made up of letters from a set of 20 amino acid residues, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, and the last element is a letter from a set of four secondary structure states {H, E, C, -} Convert a rule to a numeric value, and convert the number back to the original rule
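The counter bound and the idea of a reversible rule-to-number conversion can be illustrated as follows. The slides do not specify the actual encoding, so this sketch assumes one possible scheme: each of the n positions is a base-21 digit (20 amino acids plus the wildcard '+'), followed by a base-4 digit for the secondary structure state. The names `rule_to_key`, `key_to_rule`, and `worst_case_counters` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SYMBOLS = "+" + AMINO_ACIDS  # index 0 is the wildcard (assumed encoding)
STATES = "HEC-"


def rule_to_key(pattern: str, state: str) -> int:
    """Encode a rule as an integer using mixed-radix digits."""
    key = 0
    for symbol in pattern:
        key = key * len(SYMBOLS) + SYMBOLS.index(symbol)
    return key * len(STATES) + STATES.index(state)


def key_to_rule(key: int, n: int = 9):
    """Invert rule_to_key for a pattern of length n."""
    key, state_index = divmod(key, len(STATES))
    symbols = []
    for _ in range(n):
        key, symbol_index = divmod(key, len(SYMBOLS))
        symbols.append(SYMBOLS[symbol_index])
    return "".join(reversed(symbols)), STATES[state_index]


def worst_case_counters(n: int) -> int:
    """The bound quoted on the slide: (2^n - 1)(20^n)(4)."""
    return (2**n - 1) * 20**n * 4


key = rule_to_key("+++L++++S", "E")
print(key_to_rule(key))        # ('+++L++++S', 'E')
print(worst_case_counters(9))  # 1046528000000000
```

Because only rules that actually occur in the data need a key in the hash table, the table stays far below this bound in practice.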
ERT-RICO Rule Generation Algorithm
• The ERT-RICO rule generation algorithm finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t (0 < t ≤ 1), where S is the set of all attributes, and R is the set of all decisions.
• The input to ERT-RICO is in the form of an m×(n+1) matrix, where m is the number of all entities (the number of segments, each made of n residues plus one secondary structure element), and n = |S| (the number of attributes).
Algorithm 2: ERT-RICO
begin
    for each segment (each row of matrix)
        for each of the 2^n - 1 rules that can be generated from segment
            generate unique hash key which is a numeric index
            if hash index does not exist in the hash table then
                add hash index and hash value (1) to the hash table
                (hash value = number of occurrences of each rule)
            else
                update hash value in the hash table (hash value = hash value + 1)
            end-if
        end-for
    end-for
    for each key in the hash table
        generate rule from key (in amino acid and secondary structure letters)
        calculate confidence and support using hash value and related keys
        if confidence > t then
            add rule, confidence, and support to output file
        end-if
    end-for
end-algorithm.
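A compact Python rendering of Algorithm 2 is given below as a minimal sketch, not the authors' Perl implementation. It assumes segments are (window, state) pairs, uses (pattern, state) tuples as hash keys instead of a numeric index, and treats the "related keys" as the rules sharing the same left-hand side; the function name `generate_rules` is illustrative.

```python
from collections import Counter
from itertools import combinations


def generate_rules(segments, t=0.8):
    """Count every rule derivable from each segment, then keep confident ones."""
    rule_counts = Counter()  # (pattern, state) -> number of occurrences
    for window, state in segments:
        n = len(window)
        # Every non-empty subset of positions gives one rule: 2^n - 1 in total.
        for size in range(1, n + 1):
            for positions in combinations(range(n), size):
                pattern = "".join(
                    window[i] if i in positions else "+" for i in range(n)
                )
                rule_counts[(pattern, state)] += 1

    # Occurrences of each left-hand side, summed over all decisions.
    antecedent_counts = Counter()
    for (pattern, _), count in rule_counts.items():
        antecedent_counts[pattern] += count

    m = len(segments)
    rules = []
    for (pattern, state), count in rule_counts.items():
        confidence = count / antecedent_counts[pattern]
        if confidence > t:
            rules.append((pattern, state, 100 * confidence, 100 * count / m))
    return rules


# Toy input with n = 3; real runs use n = 9 and thousands of segments.
segments = [("ALG", "H"), ("ALG", "H"), ("ALG", "E"), ("QLG", "E")]
for rule in sorted(generate_rules(segments, t=0.8)):
    print(rule)
# ('Q++', 'E', 100.0, 25.0)
# ('Q+G', 'E', 100.0, 25.0)
# ('QL+', 'E', 100.0, 25.0)
# ('QLG', 'E', 100.0, 25.0)
```

With t = 0.8 only the rules anchored on Q survive, since the patterns shared by ALG and QLG point to more than one state.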
Conclusion ERT-RICO has an improved time complexity of $O(m \log(m) \, 2^n)$ This improvement over RT-RICO’s $O(m^2 2^n)$ has enabled the research team to run much larger test datasets with different choices of segment length and threshold value Preliminary test results showed that BLAST-ERT-RICO achieved a Q3 score of 92.19% on the standard test dataset RS126 Current optimal segment length: n = 9 Current optimal threshold: t = 0.8 The adoption of the ERT-RICO algorithm also resolves the space complexity issues of our earlier implementations (the hash table design eliminates the need for counters for all entries and for individual counters for duplicate entries; maximum hash table size is around 47 million entries, which fits in RAM)
Conclusion The test programs (rule-generation and prediction for RS126 set, n=9) were written in PERL and executed on a computer with Intel Dual-Core processor, 32 GB of RAM, and Windows 7 OS The total program running time was approximately 21 days (which definitely can be improved in the future) Even with the use of standard test datasets, it is still difficult to compare the accuracies of prediction methods RS126 set is a very representative test dataset; all test proteins can generate a number of significant alignments through BLAST
Future Work – 71,000 proteins It is still difficult to compare the accuracies of prediction methods In early 2011, there were around 71,000 proteins (unique PDB IDs) with known secondary structure in the Protein Data Bank (PDB) database Most test datasets use only around 100 to 500 protein domains If all these 71,000 proteins can be used to evaluate a particular method, the resulting Q3 score should be well representative
Future Work – homologous protein selection So far (last few years), I used all proteins with significant sequence alignments (certain BLAST scores) to generate rules Result? Long rule generation time This could be improved by developing a different algorithm for selecting proteins for rule generation Some work (algorithm design and programming) has already been done in this area