MAVID: Constrained Ancestral Alignment of Multiple Sequences Authors: Nicholas Bray and Lior Pachter
Outline • AVID • MAVID • Progressive alignment • Constraints • Tree Building • Experimental Results
AVID: A Global Alignment Program • Fast • Memory efficient • Practical for alignments of large genomic regions • Sensitive in finding homologous regions • Specific, avoiding false-positive problems
Algorithm • Repeat Masking (Optional) • Finding Matches Using Suffix Trees • Anchor Selection • Recursion
[Figure: AVID flowchart with the steps Repeat Masking, Match finding, Anchor selection, Enough anchors?, Split sequences using anchors, Recursion, Base pair alignment]
Repeat Masking (Optional) • RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) • Repeat matches • Clean matches
Finding Matches Using Suffix Trees • Maximal repeated substring (Match) • No longer substring that contains it is repeated in the string • Maximal matches between two sequences • Pairs of matching subsequences whose flanking bases are mismatches • Transform
[Figure: maximal repeated substring; maximal matches between two sequences; transform]
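The definition of a maximal match between two sequences can be illustrated with a short sketch. It uses a naive quadratic scan rather than the suffix tree that AVID relies on, so it only shows what is being searched for, not how AVID finds it efficiently. The function name `maximal_matches` and the `min_len` parameter are hypothetical.

```python
def maximal_matches(s, t, min_len=2):
    """Return (i, j, length) for each maximal exact match s[i:i+length] == t[j:j+length].

    Naive O(len(s) * len(t)) illustration; AVID itself uses suffix trees.
    """
    matches = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Left-maximal: the bases just before the match must differ (or not exist).
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of either sequence.
            length = 0
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches


print(maximal_matches("ACGTAC", "TACGTT"))
```

Each reported match cannot be extended in either direction, which is the "flanking bases are mismatches" condition from the slide.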
Anchor Selection • Eliminate noisy matches (those less than half the length of the longest match) • The remaining matches are ordered by • Long clean -> short clean -> long repeat -> short repeat
Anchor Selection • A variant of the Smith-Waterman algorithm (no overlapping anchors) • Gap score: 0 • Mismatch score: −∞ (mismatches are not allowed) • Match score: 10 bp
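The two anchor-selection slides can be approximated by the following sketch. It is not the Smith-Waterman variant itself: it drops matches shorter than half the longest one, then uses a simple chaining dynamic program to pick a non-overlapping, collinear subset with the largest total length. The clean/repeat ordering is not modeled, and the name `select_anchors` is hypothetical.

```python
def select_anchors(matches):
    """matches: list of (i, j, length). Return a collinear, non-overlapping subset."""
    if not matches:
        return []
    longest = max(m[2] for m in matches)
    # Drop "noisy" matches shorter than half the length of the longest match.
    kept = sorted(m for m in matches if 2 * m[2] >= longest)
    best = [m[2] for m in kept]   # best total length of a chain ending at each match
    prev = [-1] * len(kept)       # predecessor in that chain
    for b, (ib, jb, lb) in enumerate(kept):
        for a in range(b):
            ia, ja, la = kept[a]
            # a may precede b only if it ends before b starts in both sequences.
            if ia + la <= ib and ja + la <= jb and best[a] + lb > best[b]:
                best[b] = best[a] + lb
                prev[b] = a
    # Trace back from the best-scoring chain end.
    end = max(range(len(kept)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(kept[end])
        end = prev[end]
    return chain[::-1]


candidates = [(0, 0, 10), (12, 30, 6), (15, 14, 8), (30, 25, 9), (5, 5, 2)]
print(select_anchors(candidates))
```

In the example, the match of length 2 is discarded as noise, and (12, 30, 6) is left out because it cannot be ordered consistently with the longer matches.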
Condition • There are still significant matches • The anchor set is >50% of the length of the sequence -> recursion • Otherwise -> Needleman-Wunsch algorithm • No significant matches • Short sequence (<4 kb) -> Needleman-Wunsch algorithm • Long sequence -> trivial alignment (gap)
MAVID • Rapidly aligning multiple large genomic regions • Incorporating biologically meaningful heuristics • Sound alignment strategies
Method • Core: progressive ancestral alignment, which incorporates preprocessed constraints • Terminology • Match • Similar (not necessarily identical) regions in two sequences • Constraint • The order of positions in the alignment
Standard progressive alignment • Compute the distance matrix by aligning all pairs of sequences • Build a phylogenetic tree (guide tree) from the distance matrix • Cluster • Midpoint method • Progressively align the sequences according to the branching order in the guide tree • Aligning two alignments • An alignment is viewed as a sequence
Key difference • Instead of aligning alignments, we first infer ancestral sequences of alignments using maximum-likelihood estimation within a probabilistic evolutionary model • maximum-likelihood estimation • a popular statistical method used to make inferences about parameters of the underlying probability distribution of a given data set
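As a small illustration of maximum-likelihood ancestral inference, the sketch below picks the most likely ancestral base for a single column with two descendants. The Jukes-Cantor (JC69) substitution model, the uniform prior on the ancestral base, and the names `jc69` and `ml_ancestor` are assumptions made for this example; the slides do not say which evolutionary model MAVID uses, and gaps are not handled here.

```python
import math

BASES = "ACGT"


def jc69(a, b, t):
    """Probability of observing base b after branch length t, starting from base a (JC69)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def ml_ancestor(x, y, t1, t2):
    """Most likely ancestral base given child bases x, y and their branch lengths."""
    return max(BASES, key=lambda a: jc69(a, x, t1) * jc69(a, y, t2))


print(ml_ancestor("A", "A", 0.1, 0.1))   # both children agree
print(ml_ancestor("A", "G", 0.05, 0.5))  # the child on the shorter branch dominates
```

When the children disagree, the base on the shorter branch is favored because less change is expected along that branch.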
Key difference • The ancestral sequences are then aligned with AVID • The scores of the Smith-Waterman step are assigned according to the branch length of the two alignments • The alignment of the ancestral sequences is then used to glue two alignments. Gaps in the ancestral sequences lead to gaps in the multiple alignment
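[Figure: Alignment A with its ancestral sequence (Ancestral A) and Alignment B with its ancestral sequence (Ancestral B); the two ancestral sequences are aligned with AVID]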
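The gluing step can be sketched as follows, under the simplifying assumption that each ancestral sequence has exactly one character per column of its alignment. Given the pairwise alignment of the two ancestral sequences (with "-" for gaps), a gap in one ancestral sequence becomes a column of gaps in every row of the corresponding alignment. The function name `glue` is hypothetical.

```python
def glue(aln_a, aln_b, anc_a, anc_b):
    """Merge two alignments using the pairwise alignment of their ancestral sequences.

    aln_a, aln_b: lists of equal-length rows.
    anc_a, anc_b: gapped ancestral sequences of equal length; each non-gap
    character stands for the next column of its alignment (an assumption).
    """
    rows_a = [[] for _ in aln_a]
    rows_b = [[] for _ in aln_b]
    col_a = col_b = 0
    for x, y in zip(anc_a, anc_b):
        for out, row in zip(rows_a, aln_a):
            out.append("-" if x == "-" else row[col_a])
        for out, row in zip(rows_b, aln_b):
            out.append("-" if y == "-" else row[col_b])
        # Advance only in the alignment whose ancestral sequence had a base here.
        if x != "-":
            col_a += 1
        if y != "-":
            col_b += 1
    return ["".join(r) for r in rows_a + rows_b]


print(glue(["ACG", "A-G"], ["AG", "AG"], "ACG", "A-G"))
```

Here the gap in the second ancestral sequence opens a gap column in both rows of Alignment B.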
AVID with preprocessed data • Gene predictions using GENSCAN • Protein alignments using BLAT • Finding exon matches without using a suffix tree • In addition, the exon matches can be used to shape the final multiple alignment
MAVID(Constraints, Tree building, and Experimental results) Speaker: 羅正偉 2005/12/07
Constraints(1/3) • Notation: ai ≤ bj. This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.
Constraints(2/3) • [Figure: positions ai, cx, cy, and bj marked on sequences a, c, and b] • If x ≤ y, then ai ≤ cx ≤ cy ≤ bj, and so ai ≤ bj by transitivity.
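The transitivity argument can be written as a few lines of code. This sketch assumes that the relations ai ≤ cx and cy ≤ bj come from known matches between a and c and between c and b; the function name `implied_constraints` and the tuple layout are hypothetical.

```python
def implied_constraints(matches_ac, matches_cb):
    """Derive constraints a_i <= b_j through a third sequence c.

    matches_ac: (i, x) pairs, position i in a related to position x in c.
    matches_cb: (y, j) pairs, position y in c related to position j in b.
    Returns the set of (i, j) pairs with x <= y, i.e. a_i <= b_j.
    """
    return {(i, j)
            for i, x in matches_ac
            for y, j in matches_cb
            if x <= y}


print(sorted(implied_constraints([(5, 10), (20, 40)], [(15, 7), (50, 30)])))
```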
Constraints(3/3) • The above information can be used in the alignment of the ancestral sequences by requiring potential anchors between the sequences to satisfy the constraints.
Prime Constraints(1/4) • Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x. • Every triplet can provide potential constraints for the alignment. • If there are n sequences, there are O(n^3) such triplets. Too many constraints! • [Figure: tree node x with subtrees u and v]
Prime Constraints(2/4) • Actually, we don’t need to find all possible constraints, many of which will be redundant. • Instead, we wish to find a set of prime constraints • In this set, no constraint is implied by the others. • Such a set can be inferred from the homology map.
Prime Constraints(3/4) • If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints. • The set of all prime constraints can be found in O(mk^2) time, where k is the number of leaves below x.
Prime Constraints(4/4) • Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in time O(N log m), where N is the total number of matches. • For typical values of m and k, the time taken to compute and use the constraints is negligible.
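One way to reach an O(N log m) filter is sketched below for a single pair of sequences. It assumes each constraint (i, j) means that position i in a must come strictly before position j in b, and that a match (p, q) places positions p and q in the same column. After redundant constraints are pruned, the remaining ones form a "staircase" (both coordinates increasing), so one binary search per match is enough. This is an illustration of the idea, not MAVID's actual procedure; `prime_staircase` and `filter_matches` are hypothetical names.

```python
from bisect import bisect_left


def prime_staircase(constraints):
    """Drop constraints implied by others; return the rest sorted by increasing i and j."""
    prime = []
    # Scan by decreasing i; keep a constraint only if its j is below every j kept so far.
    for i, j in sorted(constraints, key=lambda c: (-c[0], c[1])):
        if not prime or j < prime[-1][1]:
            prime.append((i, j))
    return prime[::-1]


def filter_matches(matches, prime):
    """Keep matches (p, q) that do not force some a_i to land at or after its b_j."""
    starts = [i for i, _ in prime]
    kept = []
    for p, q in matches:
        k = bisect_left(starts, p)  # first prime constraint with i >= p
        # That constraint has the smallest j among all constraints with i >= p.
        if k < len(prime) and prime[k][1] <= q:
            continue  # inconsistent: a_i would be placed at or after b_j
        kept.append((p, q))
    return kept


prime = prime_staircase([(10, 20), (5, 25), (30, 50), (12, 18)])
print(prime)
print(filter_matches([(8, 19), (8, 10), (20, 40), (25, 60), (40, 5)], prime))
```

Each match costs one binary search over at most m prime constraints, which gives the O(N log m) bound for N matches.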
Tree Building(1/3) • Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree (a quadratic number of sequence alignments). • We use an iterative method to obtain a guide tree with only a linear number of alignments.
Tree Building(2/3) • The initial guide tree is selected randomly from the set of complete binary trees. • The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment. • The above process is iterated until the alignment and tree are satisfactory.
Tree Building(3/3) • Instead of computing all pairwise alignments, only O(nk) alignments are necessary to perform n iterations with k sequences. • We found that for typical alignment problems, only a small number of iterations were necessary.
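The saving can be checked with simple arithmetic. The sketch relies on the fact that a progressive alignment along a binary guide tree with k leaves performs one alignment per internal node, that is, k − 1 alignments per iteration; the function names are hypothetical.

```python
def all_pairs_alignments(k):
    """Pairwise alignments needed to fill a full distance matrix for k sequences."""
    return k * (k - 1) // 2


def iterative_alignments(k, n):
    """Alignments needed for n iterations of progressive alignment on k sequences.

    A binary guide tree on k leaves has k - 1 internal nodes, one alignment each.
    """
    return n * (k - 1)


k = 21  # number of sequences in the 21-organism experiment
print(all_pairs_alignments(k))     # 210
print(iterative_alignments(k, 3))  # 60
```

With 21 sequences, even three full iterations need fewer alignments than a single all-pairs pass, as long as the number of iterations stays small.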
Experimental Results 1 • A human, mouse, and rat whole-genome multiple alignment. • A homology map for the genomes was built by C. Dewey, and was used to generate gene anchors and constraints. • Chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2.
Experimental Results 1 (cont.) Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.
Experimental Results 2 • Alignment of 21 Organisms • We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence. • Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.
Experimental Results 2 (cont.) • The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003). • MLAGAN is the only other program we know of that is able to align the 21 sequences in a reasonable period of time.
Experimental Results 2 (cont.) • MAVID and MLAGAN both aligned sequences correctly. • MAVID took 40 min, while MLAGAN took roughly 6 h.