Mapping Genomes onto each other – Synteny detection CS 374 Aswath Manohar
Necessity is the mother of invention • Genome sequencing has produced voluminous amounts of genomic data. • The human genome has been completely sequenced. The rat and mouse genomes have also been completed. • What do we do with all this data?
Necessity… • We need to analyze all this data meaningfully. • This need has given rise to the field of comparative genomics. • Identification of functional DNA through comparative methods. • A large set of functional elements in the rat, human and mouse genomes remains uncharacterized. (Pash: Kalafus et al)
Analysis Methods • Standard dynamic programming alignment algorithms – Needleman-Wunsch, Smith-Waterman. • Highly sensitive aligners. • Computationally prohibitive – impractical to apply to the analysis of multiple mammalian genomes.
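To make the cost concrete, here is a minimal sketch of the Needleman-Wunsch global alignment score. The scoring values (match, mismatch, gap) are illustrative assumptions, not parameters taken from the slides. The two nested loops visit every cell of the table, which is why the method takes time proportional to the product of the two sequence lengths.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of s and t.

    Conceptually fills a (len(s)+1) x (len(t)+1) table, keeping only
    two rows in memory. Scoring values are illustrative assumptions.
    """
    m, n = len(s), len(t)
    # Row 0: aligning an empty prefix of s against t[:j] costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        # Column 0: aligning s[:i] against an empty prefix costs i gaps.
        curr = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]
```

For two mammalian genomes of a few billion bases each, the number of table cells is on the order of 10^18, which is the reason the slide calls the approach prohibitive.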
Methods… • Faster implementations of dynamic programming such as LAGAN (Brudno et al 2003). • Works well on a megabase level, but requires prior information (‘anchors’) on a genomic scale. • Seed and extend methods – a ‘seed’ (hotspot) is determined, then extended on either side. • Again, the extension step is computationally expensive.
Pash • So what is the solution? • Use Positional Hashing!!! • Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing • Authors: Ken Kalafus, Andrew Jackson and Aleksandar Milosavljevic
More formally… • The sequences S, T are conceptually divided into sub-sequences of length L: • Si = [i*L+1,..., (i+1)*L] • Ti’ = [i’*L+1,..., (i’+1)*L]
Hashing • The single scoring matrix is divided into L diagonal matrices. • These are further divided into L ‘diagonal segment’ matrices. • We have L² ‘diagonal segment’ matrices. • We use a hash table for each ‘diagonal segment’ matrix. • Therefore Total #Hash tables = L²
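One way to picture the L² tables is sketched below. The layout is an assumption made for illustration and may differ from Pash's actual scheme: k-mer start positions in each sequence are grouped by their residue modulo L, and every pair of residues (a, b) gets its own table. The function name `partition_positions` and the parameters `m` and `n` (number of start positions in S and T) are hypothetical.

```python
def partition_positions(m, n, L):
    """Assign start positions 0..m-1 of S and 0..n-1 of T to L*L tables.

    Table (a, b) holds the S positions with i % L == a and the
    T positions with j % L == b. This residue-based layout is an
    illustrative assumption, not necessarily Pash's own.
    """
    tables = {}
    for a in range(L):
        s_positions = list(range(a, m, L))
        for b in range(L):
            t_positions = list(range(b, n, L))
            tables[(a, b)] = (s_positions, t_positions)
    return tables


tables = partition_positions(m=12, n=8, L=4)
```

Under this layout each pair of positions (i, j) falls into exactly one table, namely (i % L, j % L), and each table holds about (m + n) / L positions, so the tables can be processed independently.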
Hashing… • Each k-mer is mapped to a bin in the hash table. • The indices of the k-mer are stored in one of two linked lists (one for each sequence). • We assume an efficient hash function.
Hashing… • If both lists in a bin are non-empty, then the kmer corresponding to that bin is a matching kmer! • Collation of matching kmers involves a single traversal of each list.
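The bin-and-two-lists idea can be sketched as follows. This is a simplified illustration, not Pash's implementation: a Python dict stands in for the hash table, Python lists stand in for the linked lists, and the function name `matching_kmers` is hypothetical.

```python
from collections import defaultdict


def matching_kmers(s, t, k):
    """Return {kmer: (positions_in_s, positions_in_t)} for shared k-mers."""
    # Each bin holds two position lists, one per sequence.
    bins = defaultdict(lambda: ([], []))
    for i in range(len(s) - k + 1):
        bins[s[i:i + k]][0].append(i)
    for j in range(len(t) - k + 1):
        bins[t[j:j + k]][1].append(j)
    # A k-mer matches only if both of its lists are non-empty.
    return {kmer: lists for kmer, lists in bins.items()
            if lists[0] and lists[1]}
```

For example, `matching_kmers("ACGTAC", "TTACGA", 3)` reports the two shared 3-mers `ACG` and `TAC` together with their start positions in each sequence.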
Running time • Worst case?? • When you have to perform an all against all comparison • O(M*N) • Highly unrealistic
Running time… • In practical applications, output size is O(M+N). • If k-mers of sufficient length are used, each of the L² hash tables is populated with (M+N)/L k-mers. • Hence total running time = O((M+N)*L), since L² tables * (M+N)/L k-mers per table = (M+N)*L. • If the work is spread over L nodes, running time per node = O(M+N).
Significance of Similarities • For each sequence found, Pash reports both the number of matching bases and a bit score that indicates significance. • The bit score is calculated according to the Algorithmic Significance method.
Significance of Similarities… • Based on the number of bits saved in a minimal encoding of the target sequence X=T given that the source is known. • d = Io(X) – I(X) • Io(X) = 2 * n bits, where n is the number of bases in X (2 bits per base).
Kmer encoding… • To compute I(X), one of two encodings is used on a case by case basis. • A 1 bit flag is used to denote which method is used. • Let w be the number of matching kmers. • Let W be the maximum possible number of kmers in a match. • Conceptually, W corresponds to the length of the diagonal and is constant.
Kmer encoding… • There are C(W,w) possible lists of matching kmers. • To uniquely identify a kmer set we need log2C(W,w) bits • Therefore Kmer encoding of Iw(X): Iw(X) = 1 + log2W + log2C(W,w) bits
Base encoding • Base encoding is very similar to kmer encoding. • Let b be the number of bases defined in a match. • Let B be the maximum possible number of bases contained in a match. • Ib(X) = 1 + log2B + log2C(B,b) bits.
Significance of Similarities • Therefore Imin(X) = min(Iw(X), Ib(X)) • I(X) = Imin(X) + 2*(n-b) bits • Therefore, after combining and simplifying, d = 2 * b – Imin(X)
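The formulas on these slides can be evaluated directly. The sketch below transcribes them as written, with hypothetical function names; it is not taken from the Pash source code. It relies on `math.comb`, available in Python 3.8 and later.

```python
from math import comb, log2


def kmer_encoding_bits(W, w):
    """Iw(X) = 1 + log2(W) + log2(C(W, w)), as on the kmer encoding slide."""
    return 1 + log2(W) + log2(comb(W, w))


def base_encoding_bits(B, b):
    """Ib(X) = 1 + log2(B) + log2(C(B, b)), as on the base encoding slide."""
    return 1 + log2(B) + log2(comb(B, b))


def bit_score(b, W, w, B):
    """d = 2*b - Imin(X), where Imin(X) = min(Iw(X), Ib(X))."""
    i_min = min(kmer_encoding_bits(W, w), base_encoding_bits(B, b))
    return 2 * b - i_min


# Hypothetical values chosen only to exercise the formulas.
d = bit_score(b=32, W=16, w=4, B=64)
```

The simplification on the slide follows from d = Io(X) – I(X) = 2*n – (Imin(X) + 2*(n-b)) = 2*b – Imin(X): the n unmatched-base terms cancel, so the score depends only on the matched bases and the cost of describing where they are.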
Results • Used to compare the latest assembly of the rat genome to the human and mouse assemblies. • Each pair-wise comparison took 4 days on 6 CPUs = 24 CPU days • Computers were running on 750 MHz Pentium III processors • Peak RAM usage = 500 MB (approx)
Discussion • In contrast to seed and extend methods, Pash represents sequences as short kmers, rather than bases. • Efficiently parallelizable. • For applications requiring base-pair-level alignments, Pash can be used as an anchoring module. • Its output can in turn be post-processed by programs like LAGAN, AVID or BLASTZ.
Availability • Available free of charge for academic use. • http://www.br1.bcm.tmc.edu