SURVEY PROJECT “A Clustering Method for Repeat Analysis in DNA Sequences”

SURVEY PROJECT “A Clustering Method for Repeat Analysis in DNA Sequences” Natalia Volfovsky et al. (2001) By: Paola Gabriela Pesántez Cabrera. Survey Project 04/30 Spring 2014

Outline • Introduction • Problem Statement • Principal Contributions • Related Work • Methods • Results • Conclusions

Introduction and Motivation (figure labels: DNA sequence, bioinformatics) Volfovsky et al. (2001) proposed an algorithm to cluster the repeats present in a DNA sequence.

Problem Statement • Cluster all located exact or approximate forward and palindromic repeats that appear within a genomic DNA sequence. • Input: • A complete or partial genome sequence. • Output: • Clusters of merged repeats. • These clusters are called classes. • A merged repeat can contain one or more repeats. • The prototype of each repeat class: • The most representative repeat of the class. • The union of all prototypes provides a database of repeats for the complete genome given as input.

Principal Contributions The principal contributions of Volfovsky et al. (2001) are: • The RepeatFinder algorithm, which clusters exact or approximate repeats present in a complete or partial genomic DNA sequence into classes [1]. • The associated RepeatFinder software tool [7]. • Repeat databases for small and large genomes, in particular the Arabidopsis genome and the rice genome (BAC sequences). Repeats in these databases were found to match gene segments and annotated repeat sequences, which allows further analysis of the input genome sequence.

Related Work • When the paper was published, most work focused only on finding tandem repeats or variants of them, for example: • “An algorithm for finding tandem repeats of unspecified pattern size” [9]. • “Tandem repeats finder: a program to analyze DNA sequences” [10]. • “An efficient algorithm for finding short approximate non-tandem repeats” [11]. • Even later work had a narrower scope: • “Beyond tandem repeats: complex pattern structures and distant regions of similarity” [12] aimed only to locate two complex pattern structures, variable length tandem repeats (VLTRs) and multiperiod tandem repeats (MPTRs), without clustering them. • “Piler: identification and classification of genomic repeats” [13] is based on local alignments only, which implies that the time and space complexity of RepeatFinder was still better.

Related Work • By now there are many new methods and tools for particular parts of the problem, for example: • Algorithms using the Burrows-Wheeler transform (BWT) to create: • Enhanced suffix arrays • Compressed suffix arrays, which occupy less space than suffix trees [14], [15], [16] and can therefore be used to find the repeats [17], [18]. • New alternatives to BLAST such as: • HMMER (profile hidden Markov models) [19]. • CS-BLAST (sequence context-specific BLAST) [20]. • Alternatives for fast alignment such as: • Burrows-Wheeler Aligner [21]. • BarraCUDA, which uses GPU parallelism to accelerate the inexact alignment of short sequence reads to a particular location on a reference genome [22]. • Bowtie, based on BWT indexing [23], among others.

Methods – Algorithm

Preprocessing (I) REPuter [5], [6] RepeatMatch – MUMmer [4]

Preprocessing (II)

Preprocessing (III) (23,139,6) (67,153,6) (118,128,6) (16,126,6) (82,151,6) (77,116,6) (38,47,8)

Preprocessing (III) Partition Points Repeats
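The repeat triples listed on the previous slide can be turned into a sorted list of repeat copies. The following is a minimal sketch, assuming that each triple is (start of first copy, start of second copy, length) and that the partition points are simply the sorted start positions of all copies; the function name `partition_points` is illustrative and does not come from the paper.

```python
# Repeat triples from the slide, read as (first start, second start, length).
# This reading of the triples is an assumption.
repeats = [(23, 139, 6), (67, 153, 6), (118, 128, 6), (16, 126, 6),
           (82, 151, 6), (77, 116, 6), (38, 47, 8)]

def partition_points(repeats):
    """Return every repeat copy as (start, end, length), sorted by start.

    `end` is exclusive, i.e. end = start + length.
    """
    copies = []
    for first, second, length in repeats:
        copies.append((first, first + length, length))
        copies.append((second, second + length, length))
    return sorted(copies)

points = partition_points(repeats)
print([start for start, _, _ in points])
```

Sorting the copies by position is what makes the later merging step a single left-to-right pass over neighbouring copies.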

Merging Procedure Merge Information:

Merging Procedure Partition Points, Repeat Map. Conditions: G = 1, op = 0.75. Gap: d(16,23) = 23 - 22 + 1 = 2. Gap: d(77,82) = 82 - 83 + 1 = 0. Overlap: d(126,128) = 132 - 128 + 1 = 5, compared against op(6) = 0.75 × 6 = 4.5.
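The merging conditions can be sketched as a single pass over the sorted repeat copies. This is an illustrative sketch, not the RepeatFinder implementation. It assumes the slide's convention that a copy ends at start + length, that the gap to the next copy is next start - end + 1, that the overlap is end - next start + 1, and that the overlap threshold is op times the length of the incoming copy. The function name `merge_copies` and the `mode` switch are hypothetical.

```python
def merge_copies(copies, mode="gap", G=1, op=0.75):
    """Merge neighbouring repeat copies given as (start, length) pairs.

    mode="gap":     merge when next_start - end + 1 <= G
    mode="overlap": merge when end - next_start + 1 >= op * length
    Both formulas are assumptions read off the worked values on the slide.
    """
    regions = []
    for start, length in sorted(copies):
        end = start + length  # exclusive end of this copy
        if regions:
            last = regions[-1]
            if mode == "gap":
                close = start - last["end"] + 1 <= G
            else:
                close = last["end"] - start + 1 >= op * length
            if close:
                # Extend the current merged region and record the absorbed copy.
                last["end"] = max(last["end"], end)
                last["members"].append(start)
                continue
        regions.append({"start": start, "end": end, "members": [start]})
    return regions

copies = [(16, 6), (23, 6), (38, 8), (47, 8), (67, 6), (77, 6), (82, 6),
          (116, 6), (118, 6), (126, 6), (128, 6), (139, 6), (151, 6), (153, 6)]

for region in merge_copies(copies, mode="gap", G=1):
    print(region)
```

With G = 1 the copies at 16 and 23 stay separate (gap 2), while 77 and 82 are merged (gap 0), which agrees with the two gap values worked out on the slide.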

Classification Process Repeat Map. Class 1: 16, 77, 116, 126. Class 2: 23, 139. Class 3: 38, 47. Class 4: 67. Class 5: 77, 116. Class 1 (final): 16, 67, 77, 116, 151, 126.
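One way to picture the classification step is as finding connected components: two merged repeats belong to the same class when some original repeat pair links them. The sketch below uses a union-find structure, which is an illustrative choice and not necessarily how RepeatFinder is implemented. The `absorbed` mapping (copies at 82, 118, 128 and 153 folded into the merged repeats starting at 77, 116, 126 and 151) is an assumed outcome of the merging step for the slide's example.

```python
# Repeat pairs from the example: (start of first copy, start of second copy).
pairs = [(23, 139), (67, 153), (118, 128), (16, 126),
         (82, 151), (77, 116), (38, 47)]

# Assumed result of merging: absorbed copy -> start of its merged repeat.
absorbed = {82: 77, 118: 116, 128: 126, 153: 151}

def classify(pairs, absorbed):
    """Group merged repeats into classes (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra = find(absorbed.get(a, a))
        rb = find(absorbed.get(b, b))
        if ra != rb:
            parent[rb] = ra  # link the two classes

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in classes.values())

print(classify(pairs, absorbed))
# [[16, 67, 77, 116, 126, 151], [23, 139], [38, 47]]
```

The three groups correspond to the three classes shown in the repeat map of the example.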

BLAST Search and Further Merging • BLAST: Basic Local Alignment Search Tool [8] • The sequences obtained from the merging process are the input to an all-against-all BLAST search for exact matches. • Two classes are merged if the E-value (the significance of the high-scoring alignment) returned by BLAST for any sequence of one class, compared with any sequence of the other class, is below a threshold. • If a class appears in multiple similarity pairs, all of these classes are merged.
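The class-merging rule can be sketched without running BLAST at all, by starting from a list of already computed similarity hits. In the sketch below, each hit is a hypothetical (class_a, class_b, evalue) tuple, not a real BLAST output record, and the threshold value 1e-10 is an arbitrary placeholder, since the slides do not state the threshold used.

```python
def merge_classes(class_ids, hits, threshold=1e-10):
    """Merge classes linked by a hit whose E-value is below `threshold`.

    hits: iterable of (class_a, class_b, evalue) tuples (hypothetical format).
    Merging is transitive: if A matches B and B matches C, all three merge.
    """
    groups = {c: {c} for c in class_ids}
    for a, b, evalue in hits:
        if evalue < threshold and groups[a] is not groups[b]:
            union = groups[a] | groups[b]
            for c in union:
                groups[c] = union  # every member points to the merged set
    unique = {frozenset(g) for g in groups.values()}
    return sorted(sorted(g) for g in unique)

hits = [(1, 2, 1e-30), (2, 3, 1e-20), (3, 4, 0.5)]
print(merge_classes([1, 2, 3, 4], hits))
# [[1, 2, 3], [4]]
```

Class 2 appears in two similarity pairs here, so classes 1, 2 and 3 end up in one merged class, while the weak hit between 3 and 4 is ignored.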

Defining the Prototype for a Repeat Class Repeat Map. Class 1: 16, 67, 77, 116, 151, 126. Class 2: 23, 139. Class 3: 38, 47. For merging with gaps: the prototype should have the maximum length and the maximum number of subrepeats. For merging with overlaps: the prototype should have a length closest to that of the shortest merged repeat in the class and the maximum number of subrepeats.
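The two prototype rules can be expressed as a selection over the merged repeats of a class. This is a rough sketch under stated assumptions: the slides do not say how the two criteria are combined, so the code applies length first and uses the number of subrepeats as a tie-breaker. The `MergedRepeat` record and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MergedRepeat:
    start: int       # start position of the merged repeat
    length: int      # length of the merged repeat
    subrepeats: int  # number of original repeats merged into it

def prototype(members, mode="gap"):
    """Pick the prototype of a class.

    mode="gap":     longest member, ties broken by most subrepeats.
    mode="overlap": member whose length is closest to the shortest member,
                    ties broken by most subrepeats.
    The ordering of the two criteria is an assumption.
    """
    if mode == "gap":
        return max(members, key=lambda m: (m.length, m.subrepeats))
    shortest = min(m.length for m in members)
    return min(members, key=lambda m: (m.length - shortest, -m.subrepeats))

# Hypothetical class with three merged repeats.
members = [MergedRepeat(16, 6, 1), MergedRepeat(77, 11, 2), MergedRepeat(126, 8, 2)]
print(prototype(members, mode="gap").start)
print(prototype(members, mode="overlap").start)
```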

Results (I) The algorithm was applied to two different genomes: • Arabidopsis genome: It consists of 5 chromosome sequences. Between 100,000 and 400,000 exact repeats of 25 bp were found in each sequence. Gap merging with G = 25 and classification were then applied, yielding 5,000 to 7,000 classes per sequence. Once the classes were defined, the prototype of each class was identified, and the prototypes of all sequences were combined into one database. Finally, an all-against-all BLAST search was performed on the prototype database, generating a final classification of 5,000 classes with 3 or more merged repeats. The total number of repeat sequences was 105,434, of which 2,214 matched an annotated repeat sequence and 25,149 matched a segment of an Arabidopsis gene. The largest repeat class contained 30,975 sequences, of which 6,505 matched gene segments and 1,723 matched annotated repeats. • Rice Bacterial Artificial Chromosome (BAC) genome: A rice repeat database containing 215 sequences was reported by Yuan et al. [24]. The RepeatFinder algorithm proposed by Volfovsky et al. [1], [7] enlarged this set. The input was a collection of 101,562 BAC sequences with lengths ranging from 400 to 700 bp. A single sequence was generated by joining them, with each original sequence represented by its coordinates in the new one. The system found 5,208,206 exact repeats, where the maximum length of each repeat was bounded by the length of the sequence in which it was found. Overlap merging with op = 0.95 and classification were then applied, resulting in 48,768 repeat classes. The final database contains the prototype of each class. Finally, a BLAST search of annotated repeats against the rice repeat prototype database was performed to validate the accuracy of the generated database.
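The coordinate bookkeeping used for the rice BAC sequences can be illustrated with a short sketch: many short sequences are joined into one, and a position in the joined sequence is mapped back to the original sequence with a binary search over the start offsets. The function names and the toy sequences are illustrative assumptions, not part of RepeatFinder.

```python
from bisect import bisect_right

def concatenate(seqs):
    """Join sequences and record the start offset of each one."""
    offsets, pos = [], 0
    for s in seqs:
        offsets.append(pos)
        pos += len(s)
    return "".join(seqs), offsets

def locate(offsets, position):
    """Map a position in the joined sequence to (sequence index, local position)."""
    index = bisect_right(offsets, position) - 1
    return index, position - offsets[index]

# Toy input standing in for the BAC sequences.
joined, offsets = concatenate(["ACGT", "GG", "TTTAA"])
print(offsets)
print(locate(offsets, 5))
```

Keeping the offsets also makes it possible to discard any candidate repeat that would cross a boundary between two original sequences, in line with the length bound described above.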

Results (II)

Conclusions • RepeatFinder is a simple algorithm that accomplishes the goal of identifying and returning clusters (classes) of repeats found within a DNA sequence. • These repeats form a de novo database for the input genome, which biologists and researchers can use to validate new methods and to gain a deeper understanding of the perturbations present in a genome. • Although the computational complexity of the algorithm is reasonable, the newer methods developed for its individual steps would likely improve its performance.


References (I)
[1] N. Volfovsky, B. J. Haas, and S. L. Salzberg, “A clustering method for repeat analysis in DNA sequences,” Genome Biol, vol. 2, no. 8, pp. 0027–1, 2001.
[2] M.-Y. Leung, B. E. Blaisdell, C. Burge, and S. Karlin, “An efficient algorithm for identifying matches with errors in multiple long molecular sequences,” Journal of Molecular Biology, vol. 221, no. 4, pp. 1367–1378, 1991.
[3] S. Kurtz, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich, “Computation and visualization of degenerate repeats in complete genomes,” in ISMB. Citeseer, 2000, pp. 228–238. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36.3811&rep=rep1&type=pdf
[4] “The MUMmer 3 manual.” [Online]. Available: http://mummer.sourceforge.net/manual/#repeat
[5] S. Kurtz and C. Schleiermacher, “REPuter: fast computation of maximal repeats in complete genomes,” Bioinformatics, vol. 15, no. 5, pp. 426–427, 1999. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/15/5/426.abstract
[6] “BiBiServ2 - REPuter.” [Online]. Available: http://bibiserv2.cebitec.uni-bielefeld.de/reputer
[7] “RepeatFinder home page.” [Online]. Available: http://www.cbcb.umd.edu/software/RepeatFinder/
[8] “BLAST: basic local alignment search tool.” [Online]. Available: http://blast.ncbi.nlm.nih.gov/Blast.cgi
[9] G. Benson, “An algorithm for finding tandem repeats of unspecified pattern size,” in Proceedings of the second annual international conference on Computational molecular biology. ACM, 1998, pp. 20–29.
[10] G. Benson, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic acids research, vol. 27, no. 2, p. 573, 1999.
[11] E. F. Adebiyi, T. Jiang, and M. Kaufmann, “An efficient algorithm for finding short approximate non-tandem repeats,” Bioinformatics, vol. 17, no. suppl 1, pp. S5–S12, 2001.
[12] A. M. Hauth and D. A. Joseph, “Beyond tandem repeats: complex pattern structures and distant regions of similarity,” Bioinformatics, vol. 18, no. suppl 1, pp. S31–S37, 2002. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/18/suppl_1/S31.abstract
[13] R. C. Edgar and E. W. Myers, “Piler: identification and classification of genomic repeats,” Bioinformatics, vol. 21, no. suppl 1, pp. i152–i158, 2005. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/21/suppl_1/i152.abstract

References (II)
[14] V. Makinen, G. Navarro, and K. Sadakane, “Advantages of backward searching efficient secondary memory and distributed implementation of compressed suffix arrays,” in Algorithms and Computation. Springer, 2005, pp. 681–692.
[15] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “The enhanced suffix array and its applications to genome analysis,” in Algorithms in Bioinformatics. Springer, 2002, pp. 449–463.
[16] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, “Replacing suffix trees with enhanced suffix arrays,” Journal of Discrete Algorithms, vol. 2, no. 1, pp. 53–86, 2004.
[17] F. Franek, W. F. Smyth, and Y. Tang, “Computing all repeats using suffix arrays,” J. Autom. Lang. Comb., vol. 8, no. 4, pp. 579–591, Jul. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=998223.998227
[18] M. O. Kulekci, J. S. Vitter, and B. Xu, “Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree,” IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 9, no. 2, pp. 421–429, Mar. 2012. [Online]. Available: http://dx.doi.org/10.1109/TCBB.2011.127
[19] “HMMER.” [Online]. Available: http://hmmer.janelia.org/
[20] “CS-BLAST/CSI-BLAST.” [Online]. Available: http://toolkit.lmb.uni-muenchen.de/cs_blast
[21] “Burrows-Wheeler Aligner.” [Online]. Available: http://bio-bwa.sourceforge.net/
[22] “BarraCUDA - a GPU accelerated DNA sequence alignment software.” [Online]. Available: http://seqbarracuda.sourceforge.net/
[23] “Bowtie: An ultrafast, memory-efficient short read aligner.” [Online]. Available: http://bowtie-bio.sourceforge.net/index.shtml
[24] Q. Yuan, F. Liang, J. Hsiao, V. Zismann, M.-I. Benito, J. Quackenbush, R. Wing, and R. Buell, “Anchoring of rice BAC clones to the rice genetic map in silico,” Nucleic acids research, vol. 28, no. 18, pp. 3636–3641, 2000.
