10 likes | 162 Views
UCD(x,y ) =. CD(x,y ) =. Distance measure 1: d(S, Q) = max{c(SQ) - c(S), c(QS) - c(Q)} Distance measure 2: d*(S, Q) = Distance Measure 3: d1(S, Q) = c(SQ) - c(S) + c(QS) - c(Q ) Distance Measure 4: d1*(S, Q) =. NCD 1 (x,y) =. (B). (A).
E N D
UCD(x,y) = CD(x,y) = Distance measure 1: d(S, Q) = max{c(SQ) - c(S), c(QS) - c(Q)} Distance measure 2: d*(S, Q) = Distance Measure 3: d1(S, Q) = c(SQ) - c(S) + c(QS) - c(Q) Distance Measure 4: d1*(S, Q) = NCD1(x,y) = (B) (A) Fig.3 Phylogenetic tree showing consistency of tree topology between (A) with no mutation and (B) with 1, 5 and 10 % of random mutation (simulated) of the mitochondrial genome sequences. The distance matrix obtained with the distance measure 2 was used in constructing the Neighor-Joining tree. (B) (C) (A) Fig.4 Phylogenetic trees obtained with (A) Kolmogorov (B) Lempel-Ziv and (C) Sequence alignment techniques using the Chew-Kedem data set of 36 protein as in table 1. On Evaluating the Performance of Compression Based Techniques for Sequence Comparison RAMEZ MINA†DHUNDY BASTOLA†,* ANDHESHAM ALI†,* †College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182-0116 *Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE 68198-6495 rmina@unomaha.edu, dkbastola@unomaha.edu, hali@unomaha.edu ABSTRACT Comparing biological sequences remains one of the most important problems in bioinformatics. Sequence alignment has been the method of choice specially when comparing DNA and Protein sequences. This approach of comparative analysis has been used in detecting functional and structural similarities and for classification purposes. While the alignment method has proven to be a reliable approach for sequence comparison, it fails to produce accurate results in many cases, particularly when the input sequences are incomplete, have a high degree of dissimilarity, or contain large number of repeats or mobile subsequences. We conducted a study to evaluate the performance of compression based algorithms in detecting similarity among biological sequences. These algorithms use data compression techniques to generate a dictionary of non-redundant and non-overlapping words. Subsequently dissimilarity measures (complexities) are obtained by comparing the dictionaries. We implemented different compression algorithms including LZW and Huffman [1] and evaluated the distance between input sequences using Lempel-Ziv and Kolmogorov complexities [1, 2]. Using different datasets [1], we compared the results obtained from using Kolmogorov and Lempel-Ziv complexities against the gold standard trees. Additionally, the trees obtained through alignment and compression based approaches were compared. Our preliminary results show that compression based algorithms out perform alignment techniques in several datasets that contain highly dissimilar sequences. Fig 2. Steps involved in generating an exhaustive library of a given sequence ATGTGAATGCAT. The exhaustive history of a sequence is collection of a unique set of subsequence (words). The exhaustive history for these sequences would be as shown HE (S) = A.T.G.TGA.ATGC.AT HE (R) = C.T.A.G.GGA.CTT.AT HE (Q) = A.C.G.GT.CA.CC.AA HE (SQ) = A.T.G.TGA.ATGC.ATA. C.GG.TC.AC.CAA HE (RQ) = C.T.A.G.GGA.CCT.AT.ACG.GT.CA.CC.AA LZ complexity of a sequence S , as c(S) is equal to the number of resulting words of its exhaustive history. c(S)=6, c(R)=7 , c(Q)=7 , c(SQ) = 10 and c(RQ) =12 Distance is calculated in the following manner To determine the performance of various compression methods including the classical alignment based approach, Chew-Kedem data set were analyzed (Fig 4). This data set consists of 36 protein domains drawn from PDB entries of three classes (alpha-beta, mainly-alpha, mainly-beta). Implementation of the F-Measure[3] for cluster validation allowed for a comparison of the trees against the gold standard from NCBI (Table 1). A tree highly similar to the gold standard tree is expected to have a value closer to 1. MOTIVATION Classification of organism and construction of phylogenies are major activities in bioinformatics. Mutational events leading to the changes in nucleotide composition is the basis for molecular evolutionary studies and molecular phylogenetic studies include the use of genetic sequences and its comparisons. The most commonly used computation approach to classification and phylogentic studies using genetic sequence similarity involves pairwise local or multiple sequence alignment. Although sequence alignment has been the method of choice in comparitive sequence analysis and has also been used in detecting functional and structural similarities, this approach is not suitable for whole genomic sequence comparisons or sequence with long length or multiple targets. Literature review show many compression based methods being evaluated that would allow one to overcome the limitations associated with alignment based techniques. In this study, we implement alignment and compression based algorithms, including the dictionary based technique, on mitochondrial genome form 15 mammals and protein . Where c(S) is the LZ complexity of a sequence S, and c(SQ) is the LZ complexity of sequence S added to sequences Q. These distances would result in the distance matrices that would be used in building phylogenetic tree. Kolmogorov complexity Given two sequences x and y, we define the conditional Kolmogorov complexity as K(x|y) as the shortest binary program that computes x in terms of y. We also define the information distance ID between two sequences x and y as: ID(x, y) = max {K(x|y), K(y|x)} Kolomogrov complexity is not based on the number words but on the length of compressed sequences. It is a concept described as Universal Similarity Metric (USM), which can be approximated as Universal Compression Distance (UCD), Normalized Compression Distance (NCD) and Compression Distance (CD) as shown Then NCD(x,y) = min {NCD1(x,y), NCD1(y,x)} Where C(x) is the length of a compressed sequence x and C(xy) is the length of a compressed sequence xy (sequence y is added to the end of sequence x). These resulting measures of these formulas are used to build the distance matrices that will be used to construct the phylogenetic trees. EVALUATION and RESULTS To assess the feasibility of compression techniques in the comparison of sequences, phylogenetic trees were obtained (Fig 3 ) with mitochondrial genome sequence data from 15 mammals. The result obtained form original sequence data provided a baseline for comparison of the trees obtained with the data set that included simulated nucleotide mutation of the original sequences. Clustering of the mutated sequences with its parent was indicative of dependable comparison method TABLE 1. Evaluation of compression based (Kolmogorov, LZ) and Sequence alignment approaches using Chew-Kedem data set of 36 protein. The values are reflective of how good the tree topology is compared to the gold standard using F-measure of comparing trees. Fig.1 A flow diagram showing major steps used in data collection, analysis and evaluation. Two of the compression based and the classical alignment methods were used in sequence analysis. The relatedness between sequences was contained in distance matrices that were used in producing phylogenetic trees. METHODOLOGY Data compression is a process of minimizing the size of data using specific encoding schema. It is a highly useful and popular technique in reducing the use of expensive resources including hard disk space or transmission bandwidth. Many compression techniques are based on finding similar structures. Therefore, compression techniques have found its application in the analysis of genetic sequence where the data is comprised of long strings of deoxyribo-nucleotides (ATG and C) and consists of repeated structure. Additionally, when comparing two or more sequences degree of compression could be a reflection of the relatedness between two sequences. This concept is known as the compression complexity. Two know example of compression complexities are Lempel-Ziv (LZ) [2] and Kolmogorov complexities [1] where the former produces dictionary of words and the later is more flexible in choosing the compression technique. • CONCLUSIONS • A set of closely related organisms can be compared using DNA or protein sequences. • Compression based techniques for sequence comparison are competitive alternative to the widely used alignment based methods. • The Dictionary based comparison methods outperform other compression or alignment based techniques. REFERENCES [1] Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini and Gabriel Valiente ”Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment” BMC Bioinformatics. 2007, 8:252. [2] Hasan H. Out and Khalid Sayood ”A new sequence measure for phylogenetic tree construction”, Bioinformatics. 2003, 19:2122. [3] Handl J, Knowles J, Kell DB: “Computational Cluster Validation in Post-Genomic Data Analysis. Bioinformatics 2005”, 21(15):3201-3212. LZ complexity: LZ complexity is defined as the least exhaustive history of a sequence and noted as c(sequence). With following sequences as example: S= ATGTGAATGCAT R= CTAGGGACTTAT Q= ACGGTCACCAA