Aligning Multiple Genome Sequences With the Threaded Blockset Aligner. Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F.A., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., Haussler, D., and Miller, W. Genome Research 2004.
Outline • Introduction • TBA • MULTIZ • How TBA was built • Evaluation of alignment accuracy • Accuracy of the multiple alignments • Experimental results
Introduction • Reference Sequence Idea • A sequence is fixed as the reference to which all other sequences are compared.
Example sequences:
S1: A T G C T C
S2: A G A G C
S3: T T C T G
S4: A T T G C A T G C
Sum of pairwise distances for each candidate reference:
S1: D(S1,S2) + D(S1,S3) + D(S1,S4) = 9
S2: D(S2,S1) + D(S2,S3) + D(S2,S4) = 12
S3: D(S3,S1) + D(S3,S2) + D(S3,S4) = 12
S4: D(S4,S1) + D(S4,S2) + D(S4,S3) = 11
S1 has the smallest sum, so it is fixed as the reference.
Step 1, align S2 to S1:
S1: A T G C T C
S2: A - G A G C
Step 2, add S3:
S1: A T G C T C
S2: A - G A G C
S3: - T T C T G
Step 3, add S4 (final alignment):
S1: A T - G C - T - C
S2: A - - G A - G - C
S3: - T - T C - T - G
S4: A T T G C A T G C
Efficient methods for multiple sequence alignment with guaranteed error bounds, Gusfield, D., Bull. Math. Biol., 1993, Vol. 55, pp. 141-54.
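The choice of reference in the example above can be sketched in a few lines. The slide does not define D, so this sketch assumes D is the unit-cost edit distance (one unit per substitution, insertion, or deletion); the function names `edit_distance`, `distance_sums`, and `choose_reference` are made up for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b (assumed meaning of D)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def distance_sums(seqs):
    """For each sequence, sum its distances to all the other sequences."""
    return {
        name: sum(edit_distance(s, t) for other, t in seqs.items() if other != name)
        for name, s in seqs.items()
    }


def choose_reference(seqs):
    """Pick the sequence with the smallest distance sum as the reference."""
    sums = distance_sums(seqs)
    return min(sums, key=sums.get)


seqs = {"S1": "ATGCTC", "S2": "AGAGC", "S3": "TTCTG", "S4": "ATTGCATGC"}
print(distance_sums(seqs))
print(choose_reference(seqs))
```

Under that assumption the sums come out as 9, 12, 12, and 11, matching the slide, and S1 is selected.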
Benefit • Simplicity • Drawbacks • Regions conserved in a subset of the species, but absent from the reference sequence, are not identified. • Alignments generated with different reference sequences may be inconsistent. • Inconsistent: two positions that are aligned to each other using one reference sequence might be aligned to different positions when another reference sequence is chosen.
Example sequences:
S1: A T G C T C
S2: A G A G C
S3: T T C T G
S4: A T T G C A T G C
A multiple alignment of them:
S1: A T - G C - T - C
S2: A - - G A - G - C
S3: - T - T C - T - G
S4: A T T G C A T G C
TBA • Threaded Blockset Aligner • Block: • A local alignment of the sequences • Blockset: • A set of Blocks
TBA [Figure: a block and a blockset drawn over three sequences. h: human (400 bp), m: mouse (400 bp), r: rat (350 bp).]
TBA • Thread: • A sequence S threads a blockset if every position in S appears exactly once in some block of the blockset. • Threaded blockset: • A blockset that is threaded by each of the original sequences.
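The threading condition can be checked mechanically. The sketch below uses a simplified, hypothetical representation: each block is a dict mapping a sequence name to one half-open interval `(start, end)`, with no strand information and at most one segment per sequence per block. The example blockset is invented; it borrows only the sequence lengths (400, 400, 350) from the slides.

```python
def threads(name, length, blockset):
    """True if every position 0..length-1 of `name` lies in exactly one block."""
    intervals = sorted(block[name] for block in blockset if name in block)
    expected = 0
    for start, end in intervals:
        if start != expected:  # a gap (start > expected) or an overlap (start < expected)
            return False
        expected = end
    return expected == length


def is_threaded(lengths, blockset):
    """True if the blockset is threaded by each of the original sequences."""
    return all(threads(name, n, blockset) for name, n in lengths.items())


lengths = {"h": 400, "m": 400, "r": 350}
blockset = [
    {"h": (0, 100), "m": (0, 100), "r": (0, 50)},
    {"h": (100, 200), "m": (100, 200)},
    {"h": (200, 400), "r": (50, 350)},
    {"m": (200, 400)},  # single-row block covering otherwise unaligned mouse sequence
]
print(is_threaded(lengths, blockset))
```

Dropping the last block leaves mouse positions 200 to 399 uncovered, so the mouse sequence no longer threads the blockset.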
TBA [Figure: a block, a blockset, and a threaded blockset drawn over three sequences. h: human (400 bp), m: mouse (400 bp), r: rat (350 bp).]
Threaded Blockset and Ref-blockset • Ref-blockset: • A blockset in which every block has a row from a particular sequence, which is designated as the reference for that ref-blockset. • Projection: • Given a threaded blockset, generate an S-ref-blockset for any sequence S.
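As a rough illustration of projection, the sketch below keeps the blocks that contain a row for the chosen sequence, orders them along that sequence, and lists the reference row first. It assumes the same hypothetical representation as before (a block is a dict from sequence name to a half-open interval) and ignores strand and orientation, which a real projection would have to handle; `project` is an illustrative name, not a function of the TBA software.

```python
def project(blockset, ref):
    """Return a ref-blockset: blocks with a `ref` row, sorted along `ref`, ref row first."""
    kept = sorted((b for b in blockset if ref in b), key=lambda b: b[ref][0])
    return [
        {ref: b[ref], **{name: iv for name, iv in b.items() if name != ref}}
        for b in kept
    ]


blockset = [
    {"h": (200, 400), "r": (50, 350)},
    {"m": (200, 400)},
    {"h": (0, 100), "m": (0, 100), "r": (0, 50)},
    {"h": (100, 200), "m": (100, 200)},
]
for block in project(blockset, "m"):
    print(block)
```

Because both an h-ref-blockset and an m-ref-blockset are read off the same underlying blocks, a pair of positions aligned in one is aligned in the other.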
TBA • Any two ref-blocksets generated by projection from the same threaded blockset are consistent. [Figure: a threaded blockset projected to an m-ref-blockset and an h-ref-blockset.]
TBA • Threaded Blockset Aligner • TBA produces a set of blocks in which each position in the given sequences to be aligned appears once and only once. • Any detected match among some or all of the sequences is represented among the blocks, and mutually consistent reference-sequence alignments can be extracted at will.
Alignment between the chloroplast genomes of Arabidopsis thaliana (thale cress) and Oenothera elata (evening primrose) by PipMaker. • Blocks of a threaded blockset for the chloroplast genomes of Arabidopsis thaliana (a) and Oenothera elata (p).
Applying TBA to vertebrate HOX clusters [Figure labels: Tilapia; Mammals; Fish.]
Applying TBA to vertebrate HOX clusters [Figure labels: Human; Mammals; Fish.]
Assumption • The matching regions occur in the same order and orientation in all species. • Partial order • If, for some sequence S, S’s segment in block A precedes S’s segment in block B, we say that block A precedes block B. • Local alignment • Pairwise alignment: BLASTZ • Alignment of three or more sequences: MULTIZ
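The precedence relation between blocks can be sketched as a graph problem. The code below again assumes a hypothetical block representation (a dict from sequence name to a half-open interval), derives "A precedes B" from any shared sequence, and asks Python's standard `graphlib` for an ordering that respects every precedence. If the same-order assumption is violated, the relation contains a cycle and no such ordering exists.

```python
from graphlib import CycleError, TopologicalSorter


def precedes(a, b):
    """True if, for some shared sequence, a's segment ends before b's begins."""
    return any(a[s][1] <= b[s][0] for s in a.keys() & b.keys())


def block_order(blocks):
    """Return block indices in an order consistent with `precedes`.

    Raises graphlib.CycleError when the blocks cannot be consistently ordered.
    """
    predecessors = {i: set() for i in range(len(blocks))}
    for i, a in enumerate(blocks):
        for j, b in enumerate(blocks):
            if i != j and precedes(a, b):
                predecessors[j].add(i)
    return list(TopologicalSorter(predecessors).static_order())


blocks = [
    {"h": (0, 100), "m": (0, 90)},
    {"h": (100, 200)},
    {"m": (90, 180), "r": (0, 80)},
    {"h": (200, 300), "m": (180, 260), "r": (80, 150)},
]
print(block_order(blocks))

# Two blocks whose human and mouse segments appear in opposite orders.
rearranged = [{"h": (0, 10), "m": (10, 20)}, {"h": (10, 20), "m": (0, 10)}]
try:
    block_order(rearranged)
except CycleError:
    print("no consistent order")
```

Blocks 1 and 2 share no sequence, so neither precedes the other; this is why the order is only partial.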
MULTIZ • Deals with alignments of three or more sequences. • MULTIZ • Merges two blocksets with the assistance of a third, guiding blockset. • HUMOR • A specialized version of MULTIZ used in “The Rat Genome Sequencing Consortium 2003.”
How does it work? (cont.) • Proceed in order along S (the reference for G, M, and the output). • Access the portion of N that corresponds to the current position in S, according to G. • Collect each aligned column.
HUMOR • Stands for Human-Mouse-Rat • Starts with pairwise human-ref blocksets for human-mouse and for human-rat. • Trims columns from the ends of the blocks to make the human components identical. • Aligns the mouse and rat intervals to each other. • Aligns the human interval to the resulting mouse-rat block.
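The trimming step can be illustrated with a small sketch. It assumes a hypothetical block layout, a dict with the human start coordinate and gapped rows keyed by species name, and it only trims two overlapping pairwise blocks to their common human interval; the later mouse-rat and human alignment steps are not shown. The function names and example sequences are invented for illustration.

```python
def human_span(block):
    """Half-open human interval covered by a block."""
    bases = sum(ch != "-" for ch in block["rows"]["human"])
    return block["start"], block["start"] + bases


def trim(block, lo, hi):
    """Keep only the columns spanning human positions lo..hi-1 (None if no overlap)."""
    pos = block["start"]
    first = last = None
    for col, ch in enumerate(block["rows"]["human"]):
        if ch == "-":
            continue  # gap column: consumes no human position
        if lo <= pos < hi:
            if first is None:
                first = col
            last = col
        pos += 1
    if first is None:
        return None
    rows = {name: row[first:last + 1] for name, row in block["rows"].items()}
    return {"start": max(lo, block["start"]), "rows": rows}


def trim_pair(hm, hr):
    """Trim a human-mouse and a human-rat block to their shared human interval."""
    lo = max(human_span(hm)[0], human_span(hr)[0])
    hi = min(human_span(hm)[1], human_span(hr)[1])
    return trim(hm, lo, hi), trim(hr, lo, hi)


hm = {"start": 10, "rows": {"human": "ACGT-ACGT", "mouse": "ACGTTAC-T"}}
hr = {"start": 12, "rows": {"human": "GTAC-GTAA", "rat": "GTACCGTAA"}}
t_hm, t_hr = trim_pair(hm, hr)
print(t_hm)
print(t_hr)
```

After trimming, both blocks cover the same human interval, so their human rows are identical once gaps are removed.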
Evaluation of Alignment Accuracy • Simulate sequence evolution, starting with some ancestral sequence and applying mutations along the branches of a predetermined phylogenetic tree. • Use the agreement between the true alignment and the computed alignment as the score.
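One simple way to turn "agreement with the truth" into a number is sketched below for a pair of sequences. It assumes the score is the fraction of truly aligned position pairs that the computed alignment also aligns; the exact measure used in the paper may differ, and `aligned_pairs` and `agreement` are illustrative names.

```python
def aligned_pairs(row_a, row_b):
    """Set of (i, j) pairs where position i of A is aligned to position j of B."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def agreement(true_rows, predicted_rows):
    """Fraction of truly aligned pairs that the predicted alignment recovers."""
    truth = aligned_pairs(*true_rows)
    predicted = aligned_pairs(*predicted_rows)
    return len(truth & predicted) / len(truth) if truth else 1.0


true_rows = ("ACGT-A", "AC-TTA")  # known from the simulation
predicted_rows = ("ACGTA", "ACTTA")  # output of an aligner, no gaps
print(agreement(true_rows, predicted_rows))  # 0.75
```

Here the true alignment has four aligned pairs and the gap-free prediction recovers three of them.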
Experimental results • Accuracy is higher for closely related sequences than for more diverged ones. • TBA uniformly stands out for the more diverged pairs. • For most programs, accuracy increases when fewer species are aligned, which indicates room for improvement, since more species should provide more information.
Experimental results • MULTIZ suffers on the mouse-rat alignment. • Human-rat is also slightly worse than the human-mouse alignment because rat is aligned to human only through mouse. • A score of 1.0 may be impossible to achieve, because some information is lost during sequence evolution. • A score of 1.0 is usually not necessary, since some errors are inconsequential.
Experimental results • Running Time • Only the four programs actually designed for aligning large regions (MULTIZ, TBA, MAVID, MLAGAN) run fast enough. • MAVID is extremely fast.