Suffix Trees and Derived Applications • Carl Bergenhem and Michael Smith
SimpleScalar Suite • Linux-based cache simulator: allows simulation of predefined cache environments • Cross-compiles code for simulation: using GCC on Linux, Fortran or C code can be compiled specifically for SimpleScalar, which then executes the code in full and collects statistics
Sim-cache • General sim-cache • Code run through sim-cache uses the following parameters • Number of sets in the structure • Block size • Associativity • Replacement policy • What this lets us do • Simulate how well a program will perform on different types of CPUs with respect to cache behavior.
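To make the four parameters concrete, here is a minimal toy model of a set-associative cache in Python. It is not SimpleScalar code and does not reproduce sim-cache's behavior or options; the class name, the `miss_rate` helper, the LRU policy choice, and the address trace are all assumptions made for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache: `nsets` sets, `assoc` lines per set, LRU replacement (assumed)."""

    def __init__(self, nsets, block_size, assoc):
        self.nsets = nsets
        self.block_size = block_size
        self.assoc = assoc
        # One OrderedDict per set; keys are tags, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(nsets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size   # which memory block is touched
        index = block % self.nsets           # which set the block maps to
        tag = block // self.nsets            # identifies the block within its set
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)           # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.assoc:
            lines.popitem(last=False)        # evict the least recently used line
        lines[tag] = True
        return False


def miss_rate(cache, addresses):
    """Replay an address trace and return the fraction of accesses that missed."""
    for address in addresses:
        cache.access(address)
    return cache.misses / (cache.hits + cache.misses)


# Hypothetical trace: two passes over 64 consecutive 4-byte words.
trace = [4 * i for i in range(64)] * 2
small = SetAssociativeCache(nsets=2, block_size=16, assoc=1)
large = SetAssociativeCache(nsets=8, block_size=16, assoc=2)
print(miss_rate(small, trace))  # 0.25
print(miss_rate(large, trace))  # 0.125
```

The larger configuration holds the whole 256-byte working set, so the second pass hits every time; the smaller one has to reload every block. Comparing runs like this across configurations is the kind of question the parameters above are meant to answer.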
Idea of a Suffix Tree • A suffix tree is a data structure that contains a path from the root to a leaf for each suffix of the input string. • Ex: A seven-letter string will have seven leaves
Idea of a Suffix Tree • The internal nodes of the tree are created when the start of one suffix is the same as the start of another suffix • Ex: From “banana”, the suffixes “anana” and “ana” both start with “ana”, so they share the same path from the root until the point where they diverge
Building a Tree • Starting from an empty root, and building the suffix tree for “banana” • The first step...
Recap • As seen, building the suffix tree is a simple process that takes a number of iterations equal to the length of the input string
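The one-suffix-per-iteration idea can be sketched in a few lines of Python. This builds an uncompressed suffix trie (one node per character) rather than a true suffix tree with merged edges, so it uses quadratic space and is meant only as an illustration. The `$` terminator, the nested-dict representation, and the function names are assumptions, not taken from the slides.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    text += terminator  # the terminator makes every suffix end at its own leaf
    root = {}
    for start in range(len(text)):          # one iteration per suffix
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # reuse a shared path or extend it
    return root


def count_leaves(node):
    """A node with no children is a leaf; otherwise sum over the children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_suffix_trie("banana")
print(sorted(trie))        # ['$', 'a', 'b', 'n']
print(count_leaves(trie))  # 7, one leaf per suffix of "banana$"
```

Suffixes that begin with the same characters, such as "anana$" and "ana$", reuse the same nodes until they diverge, which is where the internal branching nodes come from.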
Use • Fast string comparisons • Can be made in a number of character comparisons at most equal to the length of the second string in the comparison.
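As a rough illustration of why the bound depends only on the second string, the sketch below checks whether a pattern occurs in a text by walking down from the root one character at a time. It uses a simple uncompressed suffix trie stored as nested dicts, with a `$` terminator; those choices and the function names are assumptions made for the example.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow the pattern from the root; at most len(pattern) lookups are made."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False   # the path breaks, so the pattern is not a substring
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
print(contains(trie, "nan"))  # True
print(contains(trie, "nab"))  # False
```

Once the structure is built, the cost of each query is independent of the length of the indexed text.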
REPuter • The REPuter algorithm is a genomic algorithm that uses the suffix tree to efficiently find maximal repeats
Maximal Repeats • A maximal repeat is a substring that occurs at least twice within a string and whose length is at least a set threshold.
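Example • With a threshold value of 2, the word “banana” has the following maximal repeats • “ana” appears twice • “an” appears twice • “na” appears twice • Under the stricter definition, in which a repeat is maximal only if it cannot be extended to the left or right without losing an occurrence, only “ana” qualifies: “an” is always followed by “a” and “na” is always preceded by “a”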
Use • Scientists use the REPuter algorithm to find common substrings within a genome sequence that are of a certain length. • A useful extension of this algorithm is to find similar substrings that can account for mutations in the DNA
How It Works • The REPuter algorithm uses the suffix tree structure by traversing the entire tree. Whenever it reaches a node that represents a string at least as long as the threshold, that string is a valid maximal repeat so long as the node has 2 or more child nodes
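The traversal described above can be sketched as follows. This is not the REPuter implementation: it walks an uncompressed suffix trie (nested dicts with a `$` terminator, both assumptions) and reports every string whose node has at least two children and whose length meets the threshold. The function names are made up for the example.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def branching_repeats(text, threshold):
    """Return strings of length >= threshold whose trie node has 2+ children."""
    found = []
    stack = [("", build_suffix_trie(text))]
    while stack:                              # iterative depth-first traversal
        prefix, node = stack.pop()
        if len(prefix) >= threshold and len(node) >= 2:
            found.append(prefix)              # branching node: occurs at least twice
        for ch, child in node.items():
            stack.append((prefix + ch, child))
    return sorted(found)


print(branching_repeats("banana", 2))  # ['ana', 'na']
```

Note that "an" from the example slide is not reported by this traversal: every occurrence of "an" in "banana" is followed by "a", so its node has a single child and the repeat is reported as the longer "ana" instead.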
PSP Algorithm • Probe Selection Problem (PSP) Algorithm • Relies upon the suffix tree to function. • Takes as input a set S of genomic sequences. • In order to find an oligonucleotide (probe) for each sequence, a suffix tree of all the sequences is used. • Allows the probe to be identified in such a way that hybridization can occur for a specific sequence and that sequence only • Also yields the temperature at which the hybridization can occur
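The core idea of using one tree over all sequences can be illustrated with a heavily simplified sketch. It treats a probe as a fixed-length substring that occurs in exactly one sequence, and it ignores hybridization chemistry, near matches, and temperature entirely, so it is not the PSP algorithm itself. The depth-limited trie, the function names, and the example sequences are all assumptions.

```python
def build_generalized_trie(sequences, max_depth):
    """Trie over all suffixes of all sequences, truncated at max_depth.

    Each node records the ids of the sequences whose substrings pass through it.
    """
    root = {"ids": set(), "children": {}}
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_depth]:
                node = node["children"].setdefault(
                    ch, {"ids": set(), "children": {}}
                )
                node["ids"].add(seq_id)
    return root


def unique_probes(sequences, length):
    """Map each sequence id to the length-`length` substrings found only in it."""
    root = build_generalized_trie(sequences, length)
    probes = {i: [] for i in range(len(sequences))}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        if len(prefix) == length:
            if len(node["ids"]) == 1:         # seen in exactly one sequence
                (owner,) = node["ids"]
                probes[owner].append(prefix)
            continue
        for ch, child in node["children"].items():
            stack.append((prefix + ch, child))
    return {i: sorted(found) for i, found in probes.items()}


# Hypothetical input sequences, not real genomic data.
print(unique_probes(["ACGTAC", "CGTTGA"], 3))
# {0: ['ACG', 'GTA', 'TAC'], 1: ['GTT', 'TGA', 'TTG']}
```

The substring "CGT" occurs in both sequences, so it is excluded as a candidate for either one.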