This paper proposes an optimized numerical mapping scheme for accurately locating exons in DNA sequences. The scheme uses a quasi-Newton algorithm and a filter-based technique to identify the period-3 property of exons. The accuracy of the technique is evaluated using Receiver Operating Characteristic (ROC) curves.
Optimized Numerical Mapping Scheme for Filter-Based Exon Location in DNA Using a Quasi-Newton Algorithm. P. Ramachandran, W.-S. Lu, and A. Antoniou, Department of Electrical Engineering, University of Victoria, BC, Canada. ISCAS 2010, Paris.
DNA • The instructions to build and maintain a living organism are encoded in its DNA. • DNA is composed of smaller components called nucleotides, namely, adenine, thymine, guanine, and cytosine (A, T, G, and C). • DNA comprises a pair of strands.
DNA (cont’d) • Nucleotides pair up across the two strands. • A always pairs with T and G always pairs with C. Symbolic representation of a DNA sequence.
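The pairing rule above can be illustrated with a minimal sketch. The names `PAIR` and `complement` are hypothetical, chosen only for this example; the function returns the paired base at each position, in the same left-to-right order as the input.

```python
# Pairing rule: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base paired with each nucleotide of the given strand."""
    return "".join(PAIR[base] for base in strand.upper())

print(complement("ATGC"))  # prints TACG
```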
Genes • Regions in a genome that code for proteins are called genes.
Exons and Introns • Genes are further split into coding regions called exons and noncoding regions called introns.
Location of Exons • Accurate location of exons in genomes is very important for understanding life processes. • The power spectra of DNA segments corresponding to exons exhibit a relatively strong component at the frequency $\omega = 2\pi/3$. This is known as the period-3 property. • Thus, exons can be located by mapping the DNA characters into numbers and then tracking the strength of the period-3 component along the length of the DNA sequence of interest.
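As a rough illustration of the period-3 property, the sketch below maps a sequence to numbers and measures its spectral power at $\omega = 2\pi/3$. It is not the authors' code. The function name `period3_power` is hypothetical, and the EIIP table holds the values commonly quoted in the literature (A 0.1260, C 0.1340, G 0.0806, T 0.1335), which are assumed here and are consistent with the sum of 0.4741 quoted later in the slides.

```python
import cmath
import math

# Assumed EIIP values (commonly quoted in the literature; they sum to 0.4741).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def period3_power(seq, weights=EIIP):
    """Power of the mean-removed mapped sequence at w = 2*pi/3, divided by length."""
    x = [weights[b] for b in seq.upper()]
    mean = sum(x) / len(x)
    w = 2.0 * math.pi / 3.0
    # Single-frequency DFT evaluated at the period-3 frequency.
    s = sum((v - mean) * cmath.exp(-1j * w * n) for n, v in enumerate(x))
    return abs(s) ** 2 / len(x)

strong = period3_power("ATG" * 30)   # repeats every 3 bases
weak = period3_power("AT" * 45)      # repeats every 2 bases
print(strong > weak)
```

A sequence that repeats every three bases concentrates its power at this frequency, whereas a period-2 sequence of the same length contributes essentially none.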
EIIP Values • Earlier, we used electron-ion interaction potential (EIIP) values in conjunction with a filtering technique for exon location. • Here, we propose the use of an optimized set of nucleotide weights, which we refer to as pseudo-EIIP values, that significantly improves the accuracy of our exon-location technique.
Filter-Based Exon Location Technique • The DNA character sequence of interest is mapped onto a numerical sequence using EIIP values. • A narrowband bandpass digital filter with its passband centered at the period-3 frequency is used to filter the resulting numerical sequence. Table: EIIP values.
Filter-Based Technique (cont’d) • The filtered output is an amplitude-modulated signal, which is demodulated by filtering its power using a lowpass filter. • The exon locations are identified as distinct peaks in the demodulated output. Figure: Exon location system.
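The processing chain described above (map, bandpass filter, power, lowpass filter) can be sketched as follows. This is only an illustration under stated assumptions: the slides do not specify the filter designs, so a two-pole resonator centered at $2\pi/3$ stands in for the narrowband bandpass filter and a moving average stands in for the lowpass filter. The name `exon_indicator`, the pole radius `r`, the window length, and the EIIP table (values commonly quoted in the literature) are all assumptions.

```python
import math

# Assumed EIIP values (commonly quoted in the literature; they sum to 0.4741).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def exon_indicator(seq, weights=EIIP, r=0.99, window=51):
    """Return a smoothed period-3 power signal, one value per nucleotide."""
    x = [weights[b] for b in seq.upper()]

    # Stand-in bandpass filter: two-pole resonator with poles at r*exp(+-j*2*pi/3).
    w0 = 2.0 * math.pi / 3.0
    a1, a2 = 2.0 * r * math.cos(w0), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for v in x:
        y0 = v + a1 * y1 + a2 * y2
        y.append(y0)
        y1, y2 = y0, y1

    # Demodulation: instantaneous power followed by a stand-in lowpass filter
    # (centered moving average, shortened at the two ends of the sequence).
    power = [v * v for v in y]
    half = window // 2
    out = []
    for n in range(len(power)):
        lo, hi = max(0, n - half), min(len(power), n + half + 1)
        out.append(sum(power[lo:hi]) / (hi - lo))
    return out

# Synthetic test sequence: a period-3 stretch flanked by period-2 stretches.
seq = "AT" * 150 + "ATG" * 100 + "AT" * 150
indicator = exon_indicator(seq)
```

On the synthetic sequence, the indicator is much larger inside the period-3 stretch than in the flanking regions, which is the peak behavior the technique relies on.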
Receiver Operating Characteristic (ROC) Technique • The ROC technique is a tool for evaluating prediction techniques in terms of their performance. • It is based on metrics known as the true positive rate (TPR) and the false positive rate (FPR): $\mathrm{TPR} = \dfrac{TP}{TP + FN}$ and $\mathrm{FPR} = \dfrac{FP}{FP + TN}$, where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively, of the predicted exon locations relative to a set of known true locations.
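A minimal sketch of the two rates follows. It assumes that predictions and known locations are compared position by position as 0/1 labels; the slides do not state the counting granularity, and the name `rates` is hypothetical.

```python
def rates(predicted, actual):
    """Return (TPR, FPR) for equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp / (tp + fn), fp / (fp + tn)

# TP = 2, FP = 1, FN = 1, TN = 2
tpr, fpr = rates([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```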
ROC Technique (cont’d) • The TPR is plotted versus the FPR to obtain a point in the ROC plane as illustrated. • Since the TPR and FPR range from 0 to 1, the total area of the ROC plane is unity. ROC plane
ROC Technique (cont’d) • The northwest corner, (0, 1), represents perfect prediction, and the goal of any prediction technique is to reach this point. • The area under the ROC curve (AUC) is a good indicator of the overall performance of an exon-location technique. The greater the AUC, the better the performance. ROC plane
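One way to obtain an ROC curve and its AUC is to sweep a decision threshold over a score signal and integrate the resulting points with the trapezoidal rule. The sketch below assumes that approach; the slides do not say how the curve is traced, and the names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, actual):
    """Sweep a threshold over the scores; return a list of (FPR, TPR) points."""
    pos = sum(1 for a in actual if a)
    neg = len(actual) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, a in zip(scores, actual) if s >= t and a)
        fp = sum(1 for s, a in zip(scores, actual) if s >= t and not a)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

perfect = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A scorer that ranks every true position above every false one reaches the point (0, 1) and gives an AUC of 1.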
Proposed Training Procedure • A better set of nucleotide weights can be obtained by maximizing the AUC corresponding to a training set of DNA sequences or, equivalently, by minimizing the quantity 1−AUC. • A quasi-Newton algorithm based on the BFGS updating formula was found to give good results. • Closed-form expressions for the objective function and gradient are not possible for this problem and, therefore, they are evaluated numerically.
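Since the gradient is evaluated numerically, a finite-difference approximation is one possible approach. The sketch below uses central differences, which is an assumption, as the slides do not name the differencing scheme. Here `f` stands for any objective of the four variables, such as 1−AUC, and the name `numerical_gradient` and the step `h` are hypothetical. The resulting gradient would then be passed to a BFGS quasi-Newton routine.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

# Check on a simple quadratic whose exact gradient is 2*x.
point = [1.0, -2.0, 0.5, 3.0]
g = numerical_gradient(lambda v: sum(t * t for t in v), point)
```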
Training Procedure (cont’d) • For consistency between the optimized nucleotide weights and the EIIP values, we need to ensure that • the four variables are always positive and • their numerical values are normalized at the end of each iteration such that their sum is always equal to the sum of the EIIP values.
Training Procedure (cont’d) • Positive values can be achieved by replacing each variable by its square in the objective function. • The normalization can be achieved by using the following scaling factor in each iteration: $\lambda = \dfrac{0.4741}{w_A + w_T + w_G + w_C}$. The constant 0.4741 is the sum of the actual EIIP values, and the denominator variables $w_A$, $w_T$, $w_G$, and $w_C$ are the current optimized nucleotide weights.
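The two constraints can be sketched together as follows. This is one reading of the slide, in which the squared variables are taken as the weights and then rescaled; the name `weights_from_variables` is hypothetical.

```python
EIIP_SUM = 0.4741  # sum of the actual EIIP values, as quoted in the slides

def weights_from_variables(v):
    """Map four unconstrained variables to positive weights summing to EIIP_SUM."""
    squared = [x * x for x in v]       # squaring keeps every weight positive
    scale = EIIP_SUM / sum(squared)    # scaling factor applied in each iteration
    return [scale * s for s in squared]

w = weights_from_variables([-1.0, 2.0, 0.5, 1.5])
```

Negative variables are allowed during the search because only their squares enter the mapping.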
Model for ROC Curves • ROC curves are not continuous but can be approximated using an exponential model of the form Parameters and can be determined by minimizing the error function where and are points in the ROC plane.
Training Procedure (cont’d) The minimization can be performed using a quasi-Newton algorithm as before. Sample ROC curve and its approximation.
Results • Simulations were performed to optimize the nucleotide weights using a specific data set and then test the optimized weights on a nonoverlapping test set. • The data sets were chosen from the popular HMR195 database. • Of the 195 sequences in the database, we selected the 160 sequences that have been verified experimentally and divided them into two sets of 80 sequences each: the initial training set and a test set.
Results (cont’d) • Termination tolerance: $10^{-6}$ • Iterations for minimization of 1−AUC: 42 • Iterations for exponential model: 20
Results (cont’d) Figure: ROC curves corresponding to the actual and pseudo-EIIP values, obtained using the training set. Legend: pseudo-EIIP values; EIIP values.
Results (cont’d) Figure: ROC curves corresponding to the actual and pseudo-EIIP values, obtained using a test set with no overlap with the training set. Legend: pseudo-EIIP values; EIIP values.
Conclusions • A method for obtaining optimized nucleotide weights, referred to as pseudo-EIIP values, has been proposed for use in filter-based exon location in DNA sequences. • The pseudo-EIIP values were found to yield improved exon location with respect to the training set as well as a nonoverlapping set of DNA sequences. • The pseudo-EIIP values render the filter-based exon location technique a more useful computational technique that can be used by biologists as an alternative to expensive and laborious wet experimental techniques.