A Parallel Solution to Global Sequence Comparisons
CSC 583 – Parallel Programming
By: Nnamdi Ihuegbu
12/19/03
Brief Introduction
• The Human Genome Project (and similar efforts) has produced a vast amount of biological data.
• Joint venture of computer science and biology (BCB) -> genetic databases (map, genomic, proteomic).
• Expected completion date for the map of the human genome: end of 2003.
• Next stage: sequence comparison and relating sequence to protein function.
• Useful to pharmaceutical companies for computer-aided drug design (CADD), e.g. GlaxoSmithKline’s Relenza.
Results - Sequence
• Current sequence generation technologies
  • Maxam-Gilbert: uses chemicals to cleave DNA at a specific base, giving fragments of a known length
  • Sanger: uses enzymatic synthesis that stops at a specific base, giving fragments of a known length
Sequence Comparison Methods
• Types of sequence comparisons/alignments
  • Global (“How similar are these two sequences?”)
    • Finds the best overall alignment between two sequences
    • 1970: Needleman and Wunsch (global, dynamic programming)
    • Shortcoming: can miss small regions of similarity within two subsequences
  • Local (“What sequences in a database are most similar to this sequence?”)
    • Finds the best subsequence match between two sequences
    • 1981: Smith and Waterman (local, dynamic programming)
    • Shortcoming: computationally expensive and slow
Results - Sequence
• Heuristic search (quick, approximate)
  • Quickly searches for “words” that match the sequence, then repeatedly performs a local search around each matched word until no further matches are found
  • FASTA (1988), BLAST (1990)
  • Shortcomings: results are approximate, not exact; significance is judged by the E-value (significant if < 0.05)
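The word-matching idea above can be sketched in a few lines of Python. This is a simplified illustration, not the FASTA or BLAST algorithm: the function names `find_word_seeds` and `extend_seed`, the word length `k=3`, and the exact-match, ungapped extension are all assumptions made for the example. Real tools use scoring matrices, drop-off thresholds, and statistics such as the E-value.

```python
from collections import defaultdict

def find_word_seeds(query, target, k=3):
    """Return (query_pos, target_pos) pairs where a k-letter word matches exactly."""
    # Index every k-letter word of the target by its start position.
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)
    # Look up every k-letter word of the query in that index.
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            seeds.append((q, t))
    return seeds

def extend_seed(query, target, q, t, k=3):
    """Grow a seed left and right while characters keep matching (no gaps).

    Returns (start_in_query, start_in_target, length) of the extended match.
    """
    left = 0
    while (q - left - 1 >= 0 and t - left - 1 >= 0
           and query[q - left - 1] == target[t - left - 1]):
        left += 1
    right = k
    while (q + right < len(query) and t + right < len(target)
           and query[q + right] == target[t + right]):
        right += 1
    return q - left, t - left, left + right

if __name__ == "__main__":
    query, target = "ACGTTGCA", "TTACGTTGG"
    for q, t in find_word_seeds(query, target):
        print((q, t), extend_seed(query, target, q, t))
```

Several seeds can extend to the same region, so a fuller version would merge overlapping hits before reporting them.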
Results – Sequence (CSC Implementation)
• Sequence alignment can be represented with matrices and graphs (using rules and costs).
• When the problem is converted into a directed acyclic graph, the best alignment is the path with the maximum total score (a maximum-path problem).
Sequencing (CSC Implementation)
• Can be solved with dynamic programming as a ‘running max score’ (RMS).
• For each D(i,j), best RMS = max(west + gap1, north + gap2, NW + current_score)
• Replace D(i,j) with this maximum
• Needleman-Wunsch dynamic program: diagonal edge = characters matched; down edge = gap in string 2; across edge = gap in string 1
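A minimal Python sketch of the running-max recurrence follows. It assumes string 1 runs down the rows and string 2 across the columns, so that a down edge is a gap in string 2 and an across edge is a gap in string 1, as on the slide. The function name and the score values (`match=1`, `mismatch=-1`, `gap1=gap2=-2`) are illustrative assumptions, not the values used in the original project.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap1=-2, gap2=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_s1, aligned_s2). gap1 is the cost of a gap in
    string 1 (across edge), gap2 the cost of a gap in string 2 (down edge).
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):          # first row: only across edges
        D[0][j] = j * gap1
    for i in range(1, rows):          # first column: only down edges
        D[i][0] = i * gap2

    def score(a, b):
        return match if a == b else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            west = D[i][j - 1] + gap1
            north = D[i - 1][j] + gap2
            nw = D[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])
            D[i][j] = max(west, north, nw)   # running max score

    # Trace back from the bottom-right corner to recover one best path.
    a1, a2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap2:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1   # gap in string 2
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1   # gap in string 1
    return D[len(s1)][len(s2)], "".join(reversed(a1)), "".join(reversed(a2))

if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATACA"))
```

When several paths tie for the best score, this traceback prefers the diagonal edge, then the down edge; other tie-breaking rules give different but equally scored alignments.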
Parallel Solution
• Work is allocated to the slave processes in stripes
Parallel Solution (Cont’d)
• Allocating strips in the submatrix

          A    T    T
T        -1    3    3
G        -1   -3   -3
[Gap]    -2   -6   -6
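The strip allocation can be illustrated with a sequential Python sketch. It assumes the strips are contiguous blocks of columns and that each strip stands in for one slave process; the slides do not say whether strips run by row or by column, so that choice, the function name `fill_in_strips`, and the score values are all assumptions. The loop runs the strips one after another only to show the data dependency: a strip needs the right-most column of its left neighbor for the same row, which is what a real slave would receive as a message.

```python
def fill_in_strips(s1, s2, n_strips=2, match=1, mismatch=-1, gap=-2):
    """Fill the global-alignment matrix one column strip at a time.

    Returns the full matrix D; D[-1][-1] is the global alignment score.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        D[0][j] = j * gap
    for i in range(rows):
        D[i][0] = i * gap

    # Split columns 1..cols-1 into contiguous strips of (almost) equal width.
    width = max(1, -(-(cols - 1) // n_strips))   # ceiling division
    strips = [range(start, min(start + width, cols))
              for start in range(1, cols, width)]

    for strip in strips:              # each strip would be one slave process
        for i in range(1, rows):      # row i can start once the left strip finished row i
            for j in strip:
                nw = D[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
                D[i][j] = max(D[i][j - 1] + gap, D[i - 1][j] + gap, nw)
    return D

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, fill_in_strips("GATTACA", "GATACA", n_strips=n)[-1][-1])
```

The result does not depend on the number of strips; only the order in which cells become available changes, which is what makes a pipelined schedule possible.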
Parallel Results
• Each cell in each strip computes the maximum over its neighbors (running max)
• Path: T/A = -1, G/T = -3, _/T = -6; total = -10
[Figure: example matrix with columns A, T, T and rows T, G, [Gap]]
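The total for the reported path is just the sum of the scores of its steps. The short Python check below uses only the three pair scores that appear in the path listing; the dictionary layout and the name `path_score` are illustrative assumptions.

```python
# Pair scores as given in the path listing: (first symbol, second symbol) -> score.
PAIR_SCORE = {("T", "A"): -1, ("G", "T"): -3, ("_", "T"): -6}

def path_score(path, pair_score=PAIR_SCORE):
    """Sum the scores of the aligned pairs along a path."""
    return sum(pair_score[pair] for pair in path)

if __name__ == "__main__":
    path = [("T", "A"), ("G", "T"), ("_", "T")]
    print(path_score(path))
```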
Improvements
• Parallel Smith-Waterman (local: start an alignment, continue while the score is > 0, then end), e.g. BLAZE (Stanford).
• Pipelined implementation on an actual mesh topology
• Other possible data structures for traversing the data in search of the best path (e.g. specialized trees)
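For comparison with the global method, here is a minimal sequential sketch of the Smith-Waterman idea in Python: cell values are never allowed to drop below 0, and the best local alignment ends at the highest-scoring cell. It is not the BLAZE implementation and is not parallel; the function name and the score values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions chosen for the example.

```python
def smith_waterman_score(s1, s2, match=2, mismatch=-1, gap=-2):
    """Return (best_score, (i, j)) for the best local alignment.

    (i, j) is the matrix cell where the best local alignment ends,
    i.e. it covers s1[:i] and s2[:j] up to that point.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            nw = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # Clamping at 0 lets a new local alignment start anywhere.
            H[i][j] = max(0, nw, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

if __name__ == "__main__":
    print(smith_waterman_score("TTTACGTCCC", "GGACGTGG"))
```

The inner recurrence has the same west, north, and NW dependencies as the global version, so the same strip allocation would apply to it.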
Improvements (Cont’d)
• Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a protein family).