An analysis of “Using sequence compression to speed up probabilistic profile matching” by Valerio Freschi and Alessandro Bogliolo
Cory Tobin
Probabilistic Profiles
• Ex: Hydrophobicity sliding window program
• Scores for all characters within the window are added together and assigned to the character in question
• Characters in the window that are farther from the character in question are weighted less
Ex: A T T C G G C T C 52.521
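The sliding-window idea above can be sketched in a few lines. This is only an illustration: the per-character values in `SCORES`, the window half-width, and the `1 / (1 + distance)` weighting are all assumptions made for the example, not values from the slides or the paper.

```python
# Hypothetical per-character scores (a real hydrophobicity scale would
# assign measured values to amino acids instead).
SCORES = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def window_score(seq, center, half_width=2):
    """Weighted sum of the scores in the window centered on seq[center]."""
    total = 0.0
    for offset in range(-half_width, half_width + 1):
        pos = center + offset
        if 0 <= pos < len(seq):  # ignore window positions past either end
            # Assumed weighting: characters farther from the center count less.
            weight = 1.0 / (1 + abs(offset))
            total += weight * SCORES[seq[pos]]
    return total


def sliding_profile(seq, half_width=2):
    """Assign a window score to every character of the sequence."""
    return [window_score(seq, i, half_width) for i in range(len(seq))]
```

Calling `sliding_profile("ATTCGGCTC")` returns one score per character; a different weighting scheme would only change the `weight` line.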
Overview
• Time complexity of the brute-force method is O(NP), where N = length of sequence and P = length of profile
• Looking for a more efficient way to score sequences with probabilistic profiles
• The authors made algorithms that work directly on compressed sequences
• Use run-length encoding and LZ78 compression
• Decompressing sequences prior to scoring is not necessary
• Test the algorithms on real sequences
Run-Length Encoding
• Lossless compression method
• Sequential repeats are saved as a single character and an integer representing the number of repeats
• Only works well when there are lots of repetitive characters
• Better compression ratio with nucleotides than with amino acids
Ex: A T T T G C G C A A A A A T A T T C T C T C T G T G G A A A A A A C G
→ A (T,3) G C G C (A,5) T A (T,2) C T C T C T G T (G,2) (A,6) C G
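A minimal sketch of run-length encoding as described on this slide, using the slide's own example sequence. The function names and the `(character, count)` tuple representation are choices made for this illustration.

```python
from itertools import groupby


def rle_encode(seq):
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(seq)]


def rle_decode(pairs):
    """Expand (character, count) pairs back into the original sequence."""
    return "".join(ch * count for ch, count in pairs)


def format_runs(pairs):
    """Render pairs in the slide's notation: single characters stay bare."""
    return " ".join(ch if n == 1 else f"({ch},{n})" for ch, n in pairs)


SLIDE_EXAMPLE = "ATTTGCGCAAAAATATTCTCTCTGTGGAAAAAACG"
```

`format_runs(rle_encode(SLIDE_EXAMPLE))` reproduces the encoded form shown on the slide, and `rle_decode` confirms the encoding is lossless.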
LZ78 Compression
• Lossless compression method
• Lempel-Ziv 1978
• Stores the data in a tree structure
• Uses repeated patterns rather than sequentially repeated characters
• Better compression ratio than run-length
• Compression algorithm is more complex than run-length
Ex: A T A A T C A A C T C G
[Figure: LZ78 tree for the example sequence; node labels A, C, G, A, T, C]
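The textbook LZ78 parse can be sketched as follows. This is the standard dictionary-building scheme, not necessarily the exact data structure used in the paper; each output pair `(index, character)` corresponds to one edge of the tree, where `index` names the parent phrase (0 is the empty root).

```python
def lz78_parse(seq):
    """Split seq into LZ78 phrases, returned as (parent_index, character) pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty root phrase
    output = []
    phrase = ""
    for ch in seq:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending while the phrase is already known
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover input that matches an existing phrase exactly.
        output.append((dictionary[phrase], ""))
    return output


def lz78_phrases(pairs):
    """Rebuild the phrase list from the pairs; joining it restores the input."""
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return phrases[1:]
```

For the example sequence `ATAATCAACTCG` the parse yields the six phrases A, T, AA, TC, AAC, TCG, so twelve characters are covered by six tree nodes.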
Brute-Force Scoring Algorithm
A T A A T C A A C T C G
36 steps
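A rough sketch of brute-force profile scoring, assuming the profile is a list of P columns that each map a character to a probability. The `PROFILE` values are invented for illustration, and summing log-probabilities is one common scoring choice, not necessarily the paper's. Every window costs P lookups, which is where the O(NP) bound comes from.

```python
import math

# Hypothetical 3-column profile: each column gives a probability per character.
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.4, "C": 0.4, "G": 0.1, "T": 0.1},
]


def brute_force_scores(seq, profile):
    """Score every length-P window of seq against the profile.

    Returns (scores, steps), where steps counts individual table lookups.
    """
    n, p = len(seq), len(profile)
    scores, steps = [], 0
    for start in range(n - p + 1):
        log_score = 0.0
        for j in range(p):
            log_score += math.log(profile[j][seq[start + j]])
            steps += 1
        scores.append(log_score)
    return scores, steps
```

This sketch only scores complete windows, so it performs (N - P + 1) x P lookups. The slide's count of 36 would correspond to N x P with N = 12 and P = 3 if every position is counted, but that reading is an inference.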
Run-Length Scoring Algorithm
A T A A T C A A C T C G
30 steps
LZ78 Scoring Algorithm
A T A A T C A A C T C G
21 steps
Complexities
• Brute force: O(NP)
• Run-length: O((N / l_avg) P)
• LZ78: O((N / log N) P)
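To make the three expressions concrete, the sketch below evaluates them for the example sequence from the scoring slides. Two assumptions are made here: `l_avg` is read as the average run length (so N / l_avg is simply the number of runs), and the profile length is taken as P = 3, inferred from 36 = 12 x 3. The base-2 logarithm is also an arbitrary choice, since the base does not matter inside big-O.

```python
import math
from itertools import groupby


def cost_estimates(seq, profile_len):
    """Evaluate the three complexity expressions for one concrete sequence."""
    n = len(seq)
    # N / l_avg equals the number of runs when l_avg is the average run length.
    runs = sum(1 for _ in groupby(seq))
    return {
        "brute_force": n * profile_len,
        "run_length": runs * profile_len,
        # Asymptotic bound only; not an exact step count.
        "lz78_bound": (n / math.log2(n)) * profile_len,
    }
```

With `cost_estimates("ATAATCAACTCG", 3)` the brute-force and run-length entries come out to 36 and 30, matching the step counts on the scoring slides. The LZ78 entry is a bound, so it should not be expected to equal the slide's 21.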
Implications of Complexities
• Complexities are based on the compression ratios of the sequences
• If the compression ratio is 1:1 there is no reward for using the non-brute-force algorithms
• For sequences of equal length, a higher compression ratio gives a lower running time
64 Dollar Question
How do these algorithms stack up against real sequences?
Methods
• Randomly pick human DNA and protein sequences of varying lengths
• Calculate the compression ratio for each method (brute force, run-length, and LZ78)
• Run the algorithms on those sequences
Characters in original sequence vs. characters in compressed sequence
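A small sketch of how such a ratio could be computed. It assumes the ratio is the original length divided by the number of compressed units (runs for run-length, phrases for LZ78); the slides do not say exactly how compressed characters are counted. The `random_sequence` helper generates synthetic uniform sequences purely for demonstration and is not a substitute for the human sequences used in the study.

```python
import random
from itertools import groupby

DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


def rle_units(seq):
    """Number of runs in the run-length encoding of seq."""
    return sum(1 for _ in groupby(seq))


def lz78_units(seq):
    """Number of phrases in the LZ78 parse of seq."""
    seen, phrase, count = set(), "", 0
    for ch in seq:
        phrase += ch
        if phrase not in seen:  # first time this phrase appears: close it
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)


def compression_ratio(seq, units):
    """Original length divided by the number of compressed units."""
    return len(seq) / units(seq)


def random_sequence(alphabet, length, seed=0):
    """Synthetic uniform sequence (demonstration only, not real data)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))
```

Comparing `compression_ratio(random_sequence(DNA, 5000), lz78_units)` with the same call on `PROTEIN` illustrates why a four-letter alphabet compresses better than a twenty-letter one, even on random input.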
Results
• Run-length did not provide much advantage over brute force
• LZ78 provided a great advantage over both brute force and run-length
• Longer sequences yield better LZ78 performance compared to brute force
• Both run-length and LZ78 have lower complexities, and therefore better performance, on DNA sequences than on protein sequences
Pros and Cons
• Less time is needed to perform probabilistic profile matching
• Databases such as GenBank do not store their sequences in LZ78 or run-length format
• One would need to retrieve the sequence, compress it, then run the algorithm
• This is probably worse than just using brute force on an uncompressed sequence