ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research 2005. By Serafim Batzoglou et al., Stanford University. Majid Kazemian.
Multiple Sequence Alignment
• Biologists need accurate tools for multiple sequence alignment to study evolution
• Conserved stretches of amino acids often indicate preserved 3D protein structure
• Obtaining an accurate alignment is difficult
  • High computational cost
  • Lack of a proper objective function for measuring alignment quality
• A variety of heuristic strategies have been proposed, including genetic algorithms (GA), simulated annealing (SA), dynamic programming (DP), and greedy approaches

ClustalW
• The most popular heuristic strategies involve tree-based progressive alignment (e.g. ClustalW)
  • Construct pairwise alignments
  • Build a guide tree
  • Progressively align according to the guide tree
  • Post-process with iterative refinement
• Ad hoc sum-of-pairs scheme for scoring
• Errors in the early stages of alignment propagate to the final alignment

Consistency: prevention is the best medicine
• Every multiple alignment induces pairwise alignments, which are necessarily consistent
• We want to incorporate multiple-sequence information to guide pairwise alignment
• E.g. adjust the score for an xi-yj residue pairing according to support from residues zk that align to both xi and yj

Algorithm overview
• Step 1: Computation of posterior-probability matrices
• Step 2: Computation of expected accuracies
• Step 3: Probabilistic consistency transformation
  • Re-estimate the posterior probabilities (PP) by incorporating the other sequences
• Step 4: Computation of the guide tree
• Step 5: Progressive alignment
• Post-processing: iterative refinement

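Step 1: Computation of posterior-probability matrices
• Uses a pair-HMM to model the alignment of two sequences x and y
  • M emits two letters, one from each sequence
  • Ix emits a letter from x that aligns to a gap
  • Iy emits a letter from y that aligns to a gap
• The posterior probability that xi and yj are matched is computed for every residue pair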
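
As an illustration of Step 1, the posterior probability that xi and yj are matched can be obtained with the forward and backward algorithms on the three-state pair-HMM. The sketch below is not the ProbCons implementation: the function name, the toy 4-letter emission values, and the transition parameters `delta` (gap open) and `eps` (gap extend) are assumptions chosen only to make the example run.

```python
def pair_hmm_posteriors(x, y, delta=0.1, eps=0.3,
                        p_match=0.19, p_mismatch=0.02, q=0.25):
    """Return post[i][j] = P(x[i] matched to y[j] | x, y) for a toy pair-HMM.

    All parameter values are illustrative assumptions (4-letter alphabet):
    p_match / p_mismatch are joint emissions in state M, q is the
    single-letter emission in the insert states Ix and Iy.
    """
    n, m = len(x), len(y)

    def p(a, b):
        return p_match if a == b else p_mismatch

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass. fM[0][0] = 1 acts as a virtual start that behaves like M.
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - eps) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])

    # Backward pass. Any state may end the alignment at (n, m).
    bM, bX, bY = grid(), grid(), grid()
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                bM[i][j] = bX[i][j] = bY[i][j] = 1.0
                continue
            to_m = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            to_x = q * bX[i + 1][j] if i < n else 0.0
            to_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * to_m + delta * (to_x + to_y)
            bX[i][j] = (1 - eps) * to_m + eps * to_x
            bY[i][j] = (1 - eps) * to_m + eps * to_y

    # Total probability of the two sequences over all alignments.
    z = fM[n][m] + fX[n][m] + fY[n][m]
    return [[fM[i][j] * bM[i][j] / z for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

Each row of the returned matrix sums to at most 1, because a residue of x is either matched to exactly one residue of y or aligned to a gap.
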
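Step 2: Computation of expected accuracies
• Computes the maximum expected accuracy alignment for each pair of sequences
• The reported alignment is the one that maximizes expected accuracy under the posterior probabilities from Step 1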
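
One way to picture Step 2 is as a Needleman-Wunsch-style dynamic program in which the posterior probabilities are the match scores and gaps cost nothing. The sketch below follows that idea; the function name, the normalization by the shorter sequence length, and the tie-breaking in the traceback are assumptions of this example.

```python
def max_expected_accuracy(post):
    """Return (expected_accuracy, matched_pairs) for a posterior matrix.

    post[i][j] is the posterior probability that residue i of x is matched
    to residue j of y. Gaps are free, so the DP simply picks the chain of
    matches with the largest total posterior.
    """
    n = len(post)
    m = len(post[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + post[i - 1][j - 1],
                             best[i - 1][j],
                             best[i][j - 1])

    # Traceback: prefer gap moves on ties so zero-posterior pairs are skipped.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()

    # Normalizing by the shorter length is an assumption of this sketch.
    accuracy = best[n][m] / min(n, m) if n and m else 0.0
    return accuracy, pairs
```

The returned accuracy can then serve as the pairwise similarity used when building the guide tree in Step 4.
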
Step 3: Probabilistic consistency transformation
• Re-estimates the match quality scores (PP) by applying the probabilistic consistency transformation, which incorporates the alignments of x and y to the other sequences
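
The transformation can be read as a matrix product: the new score for xi-yj averages, over every sequence z in the set S, the support sum_k P(xi~zk) P(zk~yj), with the identity matrix standing in when z is x or y itself. The sketch below assumes that reading; the function and helper names and the dictionary layout are hypothetical, and no sparsity tricks are applied.

```python
def transpose(mat):
    return [list(col) for col in zip(*mat)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(post, lengths):
    """Apply one round of the consistency transformation.

    post maps an ordered pair of sequence names (a, b) to the matrix
    P_ab[i][j]; each unordered pair is stored once. lengths maps each
    sequence name to its length.
    """
    names = sorted(lengths)

    def get(a, b):
        if a == b:
            return identity(lengths[a])  # a sequence aligns to itself
        return post[(a, b)] if (a, b) in post else transpose(post[(b, a)])

    new_post = {}
    for (a, b) in post:
        total = [[0.0] * lengths[b] for _ in range(lengths[a])]
        for z in names:
            # Support for each (i, j) pairing routed through sequence z.
            prod = matmul(get(a, z), get(z, b))
            for i in range(lengths[a]):
                for j in range(lengths[b]):
                    total[i][j] += prod[i][j]
        new_post[(a, b)] = [[v / len(names) for v in row] for row in total]
    return new_post
```

With three single-residue sequences and P_xy = 0.5, P_xz = 0.9, P_yz = 0.8, the transformed P_xy is (0.5 + 0.5 + 0.9 * 0.8) / 3, so strong agreement through z raises the x-y score.
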
Step 4: Computation of the guide tree
• Constructs a guide tree for the set of sequences S by hierarchical clustering, similar to UPGMA
• Uses expected accuracy as the measure of similarity between two sequences
• Defines the similarity of two clusters as a weighted average of the pairwise similarities between the sequences of the clusters
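
The clustering described above can be sketched as a greedy loop that repeatedly merges the two most similar clusters. The slide does not state the weights, so this example assumes the size-weighted average used by standard UPGMA; the function name and the nested-tuple tree representation are also assumptions.

```python
from itertools import combinations


def build_guide_tree(similarity):
    """Build a guide tree as nested tuples from pairwise similarities.

    similarity maps a pair of sequence names (a, b) to a score such as
    the expected accuracy of their pairwise alignment (higher = closer).
    """
    sim = {frozenset(pair): value for pair, value in similarity.items()}
    size = {}  # cluster (name or nested tuple) -> number of sequences
    for a, b in similarity:
        size[a] = 1
        size[b] = 1

    while len(size) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(combinations(size, 2), key=lambda p: sim[frozenset(p)])
        merged = (a, b)
        na, nb = size.pop(a), size.pop(b)
        for c in size:
            # Size-weighted average (an assumption; standard UPGMA rule).
            sim[frozenset((merged, c))] = (
                na * sim[frozenset((a, c))] + nb * sim[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb

    return next(iter(size))
```

The nested tuple gives the merge order directly: inner tuples are aligned first, and the outermost tuple is the final merge.
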
Step 5: Progressive alignment & Iterative refinement
• Aligns sequences hierarchically, in the order given by the guide tree, using the transformed match quality scores
• Refinement: randomly partitions the alignment into two groups of sequences and realigns them
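
A single refinement pass might look like the sketch below. The group-to-group aligner is passed in as a callback named `realign`, which is a hypothetical placeholder rather than a ProbCons function; only the random partition and the removal of all-gap columns are actually implemented here.

```python
import random


def strip_gap_columns(rows):
    """Drop columns that contain only gaps after a group is split off."""
    keep = [c for c in range(len(rows[0])) if any(r[c] != "-" for r in rows)]
    return ["".join(r[c] for c in keep) for r in rows]


def refine_once(alignment, realign, rng):
    """Run one refinement pass over an alignment with at least two rows.

    alignment maps sequence names to gapped rows of equal length.
    realign(group_a, group_b) is a hypothetical callback that aligns two
    sub-alignments and returns the merged alignment.
    """
    names = list(alignment)
    rng.shuffle(names)
    cut = rng.randint(1, len(names) - 1)  # both groups are non-empty
    group_a, group_b = names[:cut], names[cut:]
    sub_a = dict(zip(group_a,
                     strip_gap_columns([alignment[n] for n in group_a])))
    sub_b = dict(zip(group_b,
                     strip_gap_columns([alignment[n] for n in group_b])))
    return realign(sub_a, sub_b)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.
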
Some additional features
• Estimates the reliability of alignment columns based on pairwise posterior probabilities
• ProbCons-ext
  • Extra pair of insertion states to model long or terminal insertions

Results & performance of ProbCons
• Benchmark alignment databases
  • BAliBASE 2.01 (Thompson et al. 1999a)
  • PREFAB 3.0 (Edgar 2004)
    • 1932 alignments averaging 49 sequences of length 240
  • SABMARK 1.63 (Van Walle et al. 2004)
• Measures of alignment accuracy
  • SP (sum-of-pairs score)
  • CS (column score)
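
The two accuracy measures can be made concrete with a small sketch. It assumes the usual benchmark definitions: SP is the fraction of aligned residue pairs in the reference that also appear in the test alignment, and CS is the fraction of reference columns reproduced exactly. The function names are hypothetical, and every reference column is counted, whereas real benchmarks may restrict scoring to annotated core regions.

```python
def residue_columns(rows):
    """Describe each column by the residue index of every sequence.

    A gap is recorded as None, so two alignments of the same sequences
    can be compared column by column.
    """
    counters = [0] * len(rows)
    columns = []
    for c in range(len(rows[0])):
        col = []
        for s, row in enumerate(rows):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        columns.append(tuple(col))
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for col in columns:
        for s in range(len(col)):
            for t in range(s + 1, len(col)):
                if col[s] is not None and col[t] is not None:
                    pairs.add((s, col[s], t, col[t]))
    return pairs


def sp_score(test, ref):
    ref_pairs = aligned_pairs(residue_columns(ref))
    test_pairs = aligned_pairs(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def cs_score(test, ref):
    test_cols = set(residue_columns(test))
    ref_cols = residue_columns(ref)
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)
```

Both functions take alignments as lists of gapped rows in the same sequence order, for example `sp_score(["ACG-", "ACTG"], ["AC-G", "ACTG"])`.
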
Discussion & Conclusion
• Dramatically improves alignment accuracy
• Uses a simple model of sequence similarity (HMM)
• Doesn't incorporate biological knowledge such as position-specific gap scoring or evolutionary tree construction; it could be refined further by incorporating these features
• Competitive, but the computational cost is still high for long sequences
• Can be used for RNA structure alignment and prediction