ProbCons : Probabilistic consistency based multiple sequence alignment

ProbCons : Probabilistic consistency based multiple sequence alignment
Genome Research 2005
By Serafim Batzoglou et al., Stanford University
Majid Kazemian

Multiple Sequence Alignment
• Biologists need accurate tools for multiple sequence alignment to study evolution
• Conserved stretches of amino acids often indicate preserved 3D protein structure
• Obtaining an accurate alignment is difficult
  • High computational cost
  • Lack of a proper objective function for measuring alignment quality
• A variety of heuristic strategies have been proposed, including genetic algorithms (GA), simulated annealing (SA), dynamic programming (DP), and greedy approaches

ClustalW
• The most popular heuristic strategies involve tree-based progressive alignment (e.g. ClustalW)
  • Construct pairwise alignments
  • Build a guide tree
  • Progressive alignment based on the guide tree
  • Post-processing for iterative refinement
• Ad hoc sum-of-pairs scheme for scoring
• Errors in the early stages of alignment propagate to the final alignment

Consistency: prevention is the best medicine
• Every multiple alignment induces pairwise alignments, which are necessarily consistent with one another
• We want to incorporate multiple-sequence information to guide pairwise alignment
• E.g. adjust the score for an xi-yj residue pairing according to support from residues zk that align to both xi and yj

Algorithm overview
• Step 1: Computation of posterior-probability matrices
• Step 2: Computation of expected accuracies
• Step 3: Probabilistic consistency transformation
  • Re-estimate the posterior probabilities (PP) by incorporating the other sequences
• Step 4: Computation of guide tree
• Step 5: Progressive alignment
• Post-processing: iterative refinement

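Step 1: Computation of posterior probabilities
• Uses a pair-HMM to model the alignment of two sequences x and y
  • M emits two letters, one from each sequence
  • Ix emits a letter from x that aligns to a gap
  • Iy emits a letter from y that aligns to a gap
• The posterior probability that xi and yj are matched is obtained with the forward and backward algorithms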
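As a rough illustration of Step 1, the sketch below runs the forward and backward recursions on a three-state pair-HMM (M, Ix, Iy) and returns the match posteriors. The function name `pair_hmm_posteriors`, the parameter values (`delta`, `epsilon`, `identity`, `alphabet_size`), the toy emission model, and the choice to start in a match-like state with no explicit end state are all assumptions made for this example; they are not the trained ProbCons parameters.

```python
def pair_hmm_posteriors(x, y, delta=0.02, epsilon=0.4, identity=0.9, alphabet_size=4):
    """Return post[i][j] = P(x[i] is matched to y[j] | x, y) under a toy pair-HMM.

    delta: gap-open probability (M -> Ix or M -> Iy), assumed value.
    epsilon: gap-extension probability (Ix -> Ix, Iy -> Iy), assumed value.
    """
    n, m = len(x), len(y)

    def p(a, b):
        # Toy match emission: identical letters share `identity` of the mass.
        if a == b:
            return identity / alphabet_size
        return (1.0 - identity) / (alphabet_size * (alphabet_size - 1))

    q = 1.0 / alphabet_size  # toy insertion emission (uniform background)

    def grid():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass: f*[i][j] covers x[:i], y[:j] and ends in the given state.
    fM, fX, fY = grid(), grid(), grid()
    fM[0][0] = 1.0  # assumption: start behaves like the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * fM[i - 1][j - 1]
                    + (1 - epsilon) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + epsilon * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + epsilon * fY[i][j - 1])
    total = fM[n][m] + fX[n][m] + fY[n][m]

    # Backward pass: b*[i][j] is the probability of emitting the remaining
    # suffixes x[i:], y[j:] given the current state.
    bM, bX, bY = grid(), grid(), grid()
    bM[n][m] = bX[n][m] = bY[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = p(x[i], y[j]) * bM[i + 1][j + 1] if i < n and j < m else 0.0
            ins_x = q * bX[i + 1][j] if i < n else 0.0
            ins_y = q * bY[i][j + 1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta) * match + delta * (ins_x + ins_y)
            bX[i][j] = (1 - epsilon) * match + epsilon * ins_x
            bY[i][j] = (1 - epsilon) * match + epsilon * ins_y

    # Posterior of a match at (i, j): paths through M at (i, j) over all paths.
    return [[fM[i][j] * bM[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
```

Calling `pair_hmm_posteriors("ACGT", "ACGT")` gives a 4 x 4 matrix whose diagonal entries dominate, since the gap-free alignment carries almost all of the probability mass under these toy parameters.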

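Step 2: Computation of expected accuracies
• Compute the maximum expected accuracy alignment for each pair of sequences
• That is, report the alignment that maximizes expected accuracy (the expected number of correctly aligned residue pairs, normalized by the length of the shorter sequence) rather than the single most probable alignment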
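The maximization in Step 2 can be sketched as a small dynamic program over a posterior matrix: matches are rewarded with their posterior probability and gaps score zero. The function name `max_expected_accuracy` and the input layout (a list of rows, `post[i][j]` for residues `x[i]` and `y[j]`) are assumptions for this example.

```python
def max_expected_accuracy(post):
    """Return (expected accuracy, matched index pairs) for a posterior matrix."""
    n = len(post)
    m = len(post[0]) if n else 0
    # score[i][j]: best total posterior over alignments of x[:i] and y[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + post[i - 1][j - 1],
                              score[i - 1][j],
                              score[i][j - 1])

    # Traceback to recover which residue pairs were matched.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()

    shorter = min(n, m)
    accuracy = score[n][m] / shorter if shorter else 0.0
    return accuracy, pairs
```

For `post = [[0.9, 0.1], [0.2, 0.7]]` the sketch matches the two diagonal pairs, giving an expected accuracy of about 0.8.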

Step 3: Probabilistic consistency transformation
• Re-estimate the match quality scores (posterior probabilities, PP) by applying the probabilistic consistency transformation
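A minimal sketch of the transformation, assuming it takes the form P'(xi~yj) = (1/|S|) Σ_z Σ_k P(xi~zk) P(zk~yj), with the identity matrix used when z is x or y itself. The data layout (`post` keyed by pairs of sequence names, `lengths` mapping names to sequence lengths) and all function names are hypothetical choices for this example.

```python
def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def get_posterior(post, lengths, a, b):
    """Posterior matrix for (a, b); identity for a == b, transposed if stored as (b, a)."""
    if a == b:
        return identity_matrix(lengths[a])
    if (a, b) in post:
        return post[(a, b)]
    return transpose(post[(b, a)])

def consistency_transform(post, lengths, x, y):
    """Re-estimate the x-y posterior matrix using every sequence z as an intermediate."""
    names = list(lengths)
    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in names:
        through_z = matmul(get_posterior(post, lengths, x, z),
                           get_posterior(post, lengths, z, y))
        for i in range(lengths[x]):
            for j in range(lengths[y]):
                total[i][j] += through_z[i][j]
    return [[value / len(names) for value in row] for row in total]
```

In this form, a pairing xi-yj gains support whenever some residue zk is likely to align to both xi and yj, and loses relative weight when no third sequence supports it.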

Step 4: Computation of guide tree
• Constructs a guide tree for the sequence set S by hierarchical clustering, similar to UPGMA
• Uses expected accuracy as the measure of similarity between two sequences
• Defines the similarity of two clusters as a weighted average of the pairwise similarities between the sequences of the clusters
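The clustering in Step 4 might look roughly like the following greedy procedure, which repeatedly merges the two most similar clusters. The size-weighted average used here is the plain UPGMA rule and is an assumption; the exact weighting used by ProbCons may differ. The function name `build_guide_tree` and the nested-tuple tree representation are also hypothetical.

```python
def build_guide_tree(names, pair_similarity):
    """Greedy UPGMA-style clustering; returns the tree as nested tuples of names.

    pair_similarity maps (a, b) name pairs (either order) to a similarity,
    e.g. the expected accuracy of the pairwise alignment. All pairs must be present.
    """
    trees = {i: name for i, name in enumerate(names)}  # cluster id -> subtree
    sizes = {i: 1 for i in trees}                      # cluster id -> leaf count
    sim = {}                                           # (id_a, id_b), id_a < id_b
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            b = names[j]
            sim[(i, j)] = pair_similarity.get((a, b), pair_similarity.get((b, a)))

    next_id = len(names)
    while len(trees) > 1:
        a, b = max(sim, key=sim.get)  # most similar pair of clusters
        merged_size = sizes[a] + sizes[b]
        for c in trees:
            if c in (a, b):
                continue
            s_ac = sim.pop((min(a, c), max(a, c)))
            s_bc = sim.pop((min(b, c), max(b, c)))
            # Size-weighted average of the similarities to the two merged clusters.
            sim[(c, next_id)] = (sizes[a] * s_ac + sizes[b] * s_bc) / merged_size
        del sim[(a, b)]
        trees[next_id] = (trees.pop(a), trees.pop(b))
        sizes[next_id] = merged_size
        del sizes[a], sizes[b]
        next_id += 1
    return next(iter(trees.values()))
```

With similarities x-y = 0.9, x-z = 0.4 and y-z = 0.6, the sketch first joins x and y, then attaches z to that cluster.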

Step 5: Progressive alignment & iterative refinement
• Aligns sequences hierarchically, following the order given by the guide tree and using the transformed match quality scores
• Iterative refinement: randomly partitions the alignment into two groups of sequences and realigns them
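The partitioning half of the refinement step can be sketched as below; the group-to-group realignment itself is left out. The helper names `random_bipartition` and `project` are hypothetical, and representing an alignment as a list of equal-length gapped strings with "-" as the gap character is an assumption for this example.

```python
import random

def random_bipartition(num_rows, rng):
    """Split row indices 0..num_rows-1 into two non-empty groups at random."""
    if num_rows < 2:
        raise ValueError("need at least two sequences to partition")
    while True:
        group_a = [r for r in range(num_rows) if rng.random() < 0.5]
        if 0 < len(group_a) < num_rows:
            break
    group_b = [r for r in range(num_rows) if r not in group_a]
    return group_a, group_b

def project(alignment, rows):
    """Keep only the given rows and drop columns that became all-gap."""
    sub = [alignment[r] for r in rows]
    keep = [c for c in range(len(sub[0])) if any(row[c] != "-" for row in sub)]
    return ["".join(row[c] for c in keep) for row in sub]
```

A refinement round would call `random_bipartition`, project the current alignment onto each group, and then realign the two projected groups against each other with the same group-alignment routine used in the progressive step.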

Some additional features
• Estimates the reliability of alignment columns based on pairwise posterior probabilities
• ProbCons-ext
  • Extra pair of insertion states to model long or terminal insertions

Results & performance of ProbCons
• Benchmark alignment databases
  • BAliBASE 2.01 (Thompson et al. 1999a)
  • PREFAB 3.0 (Edgar 2004)
    • 1932 alignments averaging 49 sequences of length 240
  • SABMARK 1.63 (Van Walle et al. 2004)
• Measures of alignment accuracy
  • SP (sum-of-pairs score)
  • CS (column score)
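The two accuracy measures can be illustrated with a short sketch, assuming the usual benchmark definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and CS is the fraction of reference columns reproduced exactly. Alignments are assumed to be lists of equal-length gapped strings in the same sequence order, with "-" as the gap character; the function names are hypothetical, and benchmark scoring tools may treat gapped or non-core columns differently.

```python
from itertools import combinations

def columns(alignment):
    """Convert gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols

def aligned_pairs(alignment):
    """Set of (row1, index1, row2, index2) residue pairs placed in the same column."""
    pairs = set()
    for col in columns(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(col), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs

def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref) if ref else 0.0

def cs_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    if not ref_cols:
        return 0.0
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)
```

For the reference `["AC-G", "ACTG"]` and the test alignment `["ACG-", "ACTG"]`, two of the three reference residue pairs and two of the four reference columns are recovered, so SP is 2/3 and CS is 0.5.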

Results on BAliBASE

Estimated Column reliabilities

PREFAB results

Discussion & Conclusion
• Dramatic improvement in alignment accuracy
• Uses a simple model of sequence similarity (pair-HMM)
• Does not incorporate biological knowledge such as position-specific gap scoring or evolutionary tree construction; it could be refined further by incorporating these features
• Competitive, but computational cost is still high for long sequences
• Can be used in RNA structure alignment and prediction
