An analysis of “Using sequence compression to speed up probabilistic profile matching”

An analysis of “Using sequence compression to speed up probabilistic profile matching” by Valerio Freschi and Alessandro Bogliolo
Cory Tobin

Probabilistic Profiles
• Ex: Hydrophobicity sliding window program
• Scores for all characters within the window are added together and assigned to the character in question
• Characters in the window that are further away from the character in question are weighted less
Ex: A T T C G G C T C 52.521
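The sliding window idea above can be sketched as follows. This is a minimal illustration, not the program from the slide: the per-character scores in `CHAR_SCORE`, the distance weights in `WEIGHTS`, and the function name `window_scores` are all hypothetical.

```python
def window_scores(sequence, char_score, weights):
    """Score each position as a weighted sum over a centered window.

    weights[k] is the weight at distance k from the center, so weights[0]
    applies to the character itself. Window positions that fall past either
    end of the sequence are skipped.
    """
    reach = len(weights) - 1
    scores = []
    for center in range(len(sequence)):
        total = 0.0
        for offset in range(-reach, reach + 1):
            pos = center + offset
            if 0 <= pos < len(sequence):
                total += weights[abs(offset)] * char_score[sequence[pos]]
        scores.append(total)
    return scores


# Hypothetical scores and weights, chosen only for illustration.
CHAR_SCORE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}
WEIGHTS = [1.0, 0.5]  # the center counts fully, direct neighbors count half

print(window_scores("ATTCGGCTC", CHAR_SCORE, WEIGHTS))
```

Each position costs one pass over the window, which is where the N times P cost of the brute-force method comes from.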

Overview
• Time complexity of the brute force method is O(NP)
• N = length of sequence, P = length of profile
• The authors were looking for a more efficient way to score sequences with probabilistic profiles
• They developed algorithms that work directly on compressed sequences
• The algorithms use Run-length encoding and LZ78 compression
• Decompressing sequences prior to scoring is not necessary
• The algorithms were tested on real sequences

Run-Length Encoding
• Lossless compression method
• Sequential repeats are saved as a single character and an integer representing the number of repeats
• Only works well when there are lots of repetitive characters
• Better compression ratio with nucleotides than with amino acids
Ex:
A T T T G C G C A A A A A T A T T C T C T C T G T G G A A A A A A C G
A (T,3) G C G C (A,5) T A (T,2) C T C T C T G T (G,2) (A,6) C G
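A minimal sketch of run-length encoding, using the example sequence from this slide. The function names are illustrative, and the `(char, count)` pair representation is an assumption about how the runs might be stored.

```python
from itertools import groupby


def run_length_encode(sequence):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(sequence)]


def run_length_decode(pairs):
    """Expand (char, count) pairs back into the original sequence."""
    return "".join(char * count for char, count in pairs)


def format_runs(pairs):
    """Render runs in the slide's notation: single characters stay bare."""
    return " ".join(c if n == 1 else f"({c},{n})" for c, n in pairs)


sequence = "ATTTGCGCAAAAATATTCTCTCTGTGGAAAAAACG"
encoded = run_length_encode(sequence)
print(format_runs(encoded))
print(len(sequence), "characters ->", len(encoded), "runs")
```

Decoding the pairs gives back the original sequence, which is what makes the method lossless.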

LZ78 Compression
• Lossless compression method
• Lempel-Ziv, 1978
• Stores the data in a tree structure
• Uses repeated patterns rather than sequentially repeated characters
• Better compression ratio than run-length
• Compression algorithm is more complex than run-length
Ex: ATA AT CA A CT C G (tree nodes: A C G A T C)
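A minimal sketch of the textbook LZ78 parse, in which each new phrase is the longest phrase already seen plus one new character. This is the standard formulation and an assumption here: the variant used in the paper may split a sequence differently from this sketch. The function names are illustrative.

```python
def lz78_parse(sequence):
    """Split a sequence into LZ78 phrases.

    Returns (prefix_index, char) pairs, where prefix_index points at an
    earlier phrase (0 is the empty phrase) and char extends it by one.
    """
    dictionary = {"": 0}
    output = []
    current = ""
    for char in sequence:
        if current + char in dictionary:
            current += char  # keep extending a known phrase
        else:
            output.append((dictionary[current], char))
            dictionary[current + char] = len(dictionary)
            current = ""
    if current:
        # The input ended inside a known phrase: emit it with no new character.
        output.append((dictionary[current], ""))
    return output


def lz78_phrases(pairs):
    """Rebuild the phrase strings from the (prefix_index, char) pairs."""
    phrases = [""]
    for index, char in pairs:
        phrases.append(phrases[index] + char)
    return phrases[1:]


pairs = lz78_parse("ATAATCAACTCG")
print(pairs)
print(lz78_phrases(pairs))
```

Because every phrase points back to an earlier one, the phrases form a tree rooted at the empty phrase, which matches the tree structure mentioned on the slide.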

Brute-Force Scoring Algorithm
Ex: A T A A T C A A C T C G
36 steps

Run-Length Scoring Algorithm
Ex: A T A A T C A A C T C G
30 steps

LZ78 Scoring Algorithm
Ex: A T A A T C A A C T C G
21 steps
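The step counts on these three slides can be reproduced under one assumption: a profile length of P = 3, which is not stated on the slides but fits all three numbers. With N = 12 characters, brute force takes 12 × 3 = 36 steps; the sequence has 10 runs, giving 10 × 3 = 30; and 21 would correspond to 7 compressed phrases × 3. The sketch below checks the first two counts; the function names are illustrative.

```python
from itertools import groupby


def brute_force_steps(sequence, profile_length):
    """One step per (sequence character, profile position) pair."""
    return len(sequence) * profile_length


def run_length_steps(sequence, profile_length):
    """One step per (run, profile position) pair."""
    runs = sum(1 for _ in groupby(sequence))
    return runs * profile_length


sequence = "ATAATCAACTCG"
PROFILE_LENGTH = 3  # assumption: inferred from the step counts, not stated

print(brute_force_steps(sequence, PROFILE_LENGTH))  # 36
print(run_length_steps(sequence, PROFILE_LENGTH))   # 30
```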

Complexities
Brute Force: O(NP)
Run-Length: O((N / l_avg) P), where l_avg is the average run length
LZ78: O((N / log N) P)

Implications of Complexities
• Complexities are based on the compression ratios of the sequences
• If the compression ratio is 1:1 there is no reward for using the non-brute force algorithms
• For sequences of equal length, a higher compression ratio means fewer scoring steps
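To make these implications concrete, the sketch below plugs numbers into the three complexity expressions. It gives order-of-magnitude estimates only: constant factors are ignored, and taking the logarithm in base alphabet size is an assumption, as are the function and parameter names.

```python
import math


def estimated_steps(n, p, alphabet_size=4, avg_run_length=1.0):
    """Rough step estimates from the three complexity expressions."""
    return {
        "brute_force": n * p,
        "run_length": (n / avg_run_length) * p,
        "lz78": (n / math.log(n, alphabet_size)) * p,
    }


# With no repeated characters (average run length 1), run-length scoring
# costs the same as brute force.
print(estimated_steps(n=4096, p=10, avg_run_length=1.0))

# Longer runs shrink the run-length estimate proportionally.
print(estimated_steps(n=4096, p=10, avg_run_length=2.0))
```

The LZ78 estimate improves relative to brute force as N grows, because the divisor log N keeps increasing.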

64 Dollar Question
How do these algorithms stack up against real sequences?

Methods
• Randomly pick human DNA and protein sequences of varying lengths
• Calculate the compression ratio of each sequence under Run-length and LZ78 encoding (brute force works on the uncompressed sequence)
• Run the brute force, Run-length, and LZ78 algorithms on those sequences
• Compression ratio: characters in original sequence vs. characters in compressed sequence

Results
• Run-length did not provide much advantage over brute force
• LZ78 provided a great advantage over both brute force and run-length
• Longer sequences yield better LZ78 performance compared to brute force
• Both Run-length and LZ78 have lower complexities, and therefore better performance, on DNA sequences than on protein sequences

Pros and Cons
• Less time is needed to perform probabilistic profile matching
• Databases such as GenBank do not store their sequences in LZ78 or Run-length format
• One would need to retrieve the sequence, compress it, then run the algorithm
• With the extra compression step, this is probably slower than just using brute force on an uncompressed sequence

End
