Algorithms for Biological Sequence Analysis ─ Class Presentation. Human-Mouse Alignments with BLASTZ Galaxy: A Platform for Interactive Large-scale Genome Analysis. 許秉慧、陳怡靜、鄭智懷、宋建均 2005.11.30.
• S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, 2003; 13: 103–107. • B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: A Platform for Interactive Large-scale Genome Analysis,” Genome Research, 2005; 15: 1451–1455.
Methods • S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, 2003; 13: 103–107 • 陳怡靜、許秉慧 2005.11.30
Outline • Motivation and results • BLASTZ and modified BLASTZ • Implementation issues and hardware environment • Software evaluation
Motivation and Results 陳怡靜 2005.11.30
Motivation • Several existing programs sacrifice sensitivity to attain very short running times. • A program called BLASTZ attains an appropriate level of sensitivity and specificity. • A modified BLASTZ program is efficient enough to align entire mammalian genomes and also has higher specificity.
Results • The BLASTZ alignment program, which is used by the PipMaker web server (Schwartz et al. 2000), was modified. • The modified BLASTZ was used to compare all of the human sequence with all of the mouse sequence efficiently.
BLASTZ and Modified BLASTZ 陳怡靜 2005.11.30
Homology • Two proteins are orthologous if they belong to different species, descend from a common ancestral gene by speciation, and retain the same function in the course of evolution. • Two proteins are paralogous if they arise by duplication within a genome and evolve new functions.
Human-Mouse Alignments • To find orthologous alignments • Natural consequence • We obtain the single best alignment by applying a program called axtBest, which filters out all but the best alignment within a sliding window of 10,000 bases. [Figure: Step 1, mouse aligned to human]
BLASTZ • BLASTZ follows the three-step strategy used by Gapped BLAST. 1) Find short near-exact matches 2) Extend each short match without allowing gaps 3) Extend each gap-free match that exceeds a certain threshold by a DP procedure that permits gaps
BLASTZ • Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments. • BLASTZ has an option to require that the matching regions it reports occur in the same order and orientation in both sequences. [Figure: Sequence 1 and Sequence 2]
BLASTZ • Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments. • BLASTZ uses an alignment-scoring scheme derived and evaluated by Chiaromonte et al. (2000). Nucleotide substitutions are scored by a substitution matrix (shown as a figure on the original slide), and a gap of length k is penalized by subtracting 400 + 30k from the score.
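The scoring scheme above can be sketched as follows. The slide text does not preserve the substitution matrix, so the values below are an assumption: they are the default BLASTZ substitution scores as commonly published and should be checked against the paper. The function names are hypothetical, and the gap handling is simplified (a gap in one row directly followed by a gap in the other row is counted as a single gap).

```python
# Assumed substitution scores (not recoverable from the slide text).
BASES = "ACGT"
MATRIX = [
    [91, -114, -31, -123],   # A vs A, C, G, T
    [-114, 100, -125, -31],  # C
    [-31, -125, 100, -114],  # G
    [-123, -31, -114, 91],   # T
]
SCORE = {(a, b): MATRIX[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}


def gap_penalty(k):
    """Penalty for a gap of length k, as stated on the slide: 400 + 30k."""
    return 400 + 30 * k


def alignment_score(row1, row2):
    """Score two equal-length alignment rows; '-' marks a gap position."""
    total = 0
    gap_len = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_len += 1          # still inside a run of gap columns
        else:
            if gap_len:           # a gap run just ended: charge it once
                total -= gap_penalty(gap_len)
                gap_len = 0
            total += SCORE[(a, b)]
    if gap_len:                   # gap run reaching the end of the alignment
        total -= gap_penalty(gap_len)
    return total
```

Under these assumed values, transitions (A-G, C-T) cost less than transversions, and a single long gap is cheaper than several short ones of the same total length.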
Modified BLASTZ • The modified BLASTZ algorithm 1) Remove recent repeated elements 2) Run BLASTZ 3) Adjust positions in the alignment to refer to the original sequences 4) Filter the alignments
Modified BLASTZ • Step 1 (an addition to BLASTZ) • I. Y. Lee, D. Westaway, A. F. Smit, K. Wang, J. Seto, L. Chen, C. Acharya, M. Ankener, D. Baskin, C. Cooper, et al., “Complete Genomic Sequence and Analysis of the Prion Protein Gene Region from Three Mammalian Species,” Genome Research, 1998; 8: 1022–1037. • Why? [Figure: Sequence 1 and Sequence 2]
Modified BLASTZ • Step 2 (a modification of BLASTZ) • Each 12-mer match may contain a transition (A-G, G-A, C-T, or T-C) in any one of the 12 positions. • Extend the induced alignment in each direction, not allowing gaps. • Stop extending when the score decreases by more than some threshold. [Figure: matching 12-mers in Sequence 1 and Sequence 2]
Modified BLASTZ • Step 2 (a modification of BLASTZ) • If the gap-free alignment scores more than 3000, then: • Repeat the extension step, but allow for gaps. • Retain the alignment if it scores above 5000. [Figure: matching 12-mers in Sequence 1 and Sequence 2]
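A minimal sketch of the seeding and gap-free extension in Step 2, assuming sequences contain only A, C, G, and T. All function names and the `score` and `x_drop` parameters are hypothetical, and the gapped dynamic-programming extension is left out; this illustrates the idea and is not the BLASTZ implementation.

```python
from collections import defaultdict

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def seed_variants(word):
    """Yield the word itself plus every copy with exactly one transition."""
    yield word
    for i, base in enumerate(word):
        yield word[:i] + TRANSITION[base] + word[i + 1:]


def find_seeds(seq1, seq2, w=12):
    """Return (i, j) pairs where the w-mers match up to one transition."""
    index = defaultdict(list)
    for j in range(len(seq2) - w + 1):
        index[seq2[j:j + w]].append(j)
    hits = []
    for i in range(len(seq1) - w + 1):
        for variant in seed_variants(seq1[i:i + w]):
            for j in index.get(variant, ()):
                hits.append((i, j))
    return hits


def extend_one_way(seq1, seq2, i, j, step, score, x_drop):
    """Extend without gaps from (i, j); stop once the running score has
    fallen more than x_drop below the best score seen so far."""
    best = total = 0
    best_len = length = 0
    while 0 <= i < len(seq1) and 0 <= j < len(seq2):
        total += score(seq1[i], seq2[j])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > x_drop:
            break
        i += step
        j += step
    return best, best_len


def ungapped_extend(seq1, seq2, i, j, w, score, x_drop):
    """Return (score, start1, start2, length) of the gap-free alignment
    grown from the seed at (i, j)."""
    seed_score = sum(score(seq1[i + k], seq2[j + k]) for k in range(w))
    right, rlen = extend_one_way(seq1, seq2, i + w, j + w, 1, score, x_drop)
    left, llen = extend_one_way(seq1, seq2, i - 1, j - 1, -1, score, x_drop)
    return seed_score + left + right, i - llen, j - llen, w + llen + rlen


def passes_gap_free_threshold(hsp_score, threshold=3000):
    """Only gap-free alignments scoring above the threshold go on to the
    gapped extension (not shown here)."""
    return hsp_score > threshold
```

Allowing one transition per seed raises sensitivity at a modest cost: each 12-mer is looked up 13 times (the word plus 12 one-transition variants) instead of once.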
Modified BLASTZ • Step 3 • If l < 50 kb, where l is the separation between adjacent alignments from Step 2, repeat Step 2 using a more sensitive seeding procedure (e.g., 7-mer exact matches) and lower score thresholds both for gap-free alignments (e.g., 2200 instead of 3000) and for gapped alignments (e.g., 2200 instead of 5000). [Figure: Sequence 1 and Sequence 2 with adjacent alignments separated by l]
Modified BLASTZ • Step 4: Adjust sequence positions in the resulting alignments so that they refer to the original sequences. • Step 5: Filter the alignments as appropriate for particular purposes. • Apply axtBest to find the best way to align each aligned human position. [Figure: several alignments between Sequence 1 and Sequence 2; choose the best one]
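The filtering idea in Step 5 can be sketched as below. This is a deliberate simplification and not the axtBest algorithm: axtBest compares alignments within a sliding window of 10,000 bases, whereas this sketch compares whole-alignment scores. The tuple layout and the name `best_cover` are assumptions made for illustration.

```python
def best_cover(alignments):
    """Pick, for every covered human position, the highest-scoring alignment.

    alignments: list of (start, end, score, name) tuples with half-open
    human coordinates [start, end).
    Returns a list of (start, end, name) segments in coordinate order.
    """
    # Every alignment boundary is a point where the winner may change.
    points = sorted({p for a in alignments for p in (a[0], a[1])})
    segments = []
    for left, right in zip(points, points[1:]):
        covering = [a for a in alignments if a[0] <= left and right <= a[1]]
        if not covering:
            continue  # no alignment covers this stretch of human sequence
        best = max(covering, key=lambda a: a[2])
        if segments and segments[-1][2] == best[3] and segments[-1][1] == left:
            # Same winner as the previous stretch: grow that segment.
            segments[-1] = (segments[-1][0], right, best[3])
        else:
            segments.append((left, right, best[3]))
    return segments
```

For example, with alignments `a` on [0, 100) scoring 5000 and `b` on [50, 150) scoring 8000, positions 50 to 99 are assigned to `b`, so each human base ends up in at most one alignment.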
Modified BLASTZ • Two changes to BLASTZ significantly improved its execution speed for aligning entire genomes. • When the program detects that many regions of the mouse genome align to the same human segment, that segment is dynamically masked. (Step 1 of the modified BLASTZ) • BLASTZ seeds alignments with 8-mers, whereas the modified BLASTZ seeds them with 12-mers. (Step 2 of the modified BLASTZ)
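Dynamic masking can be pictured with a small sketch like the following. The slide gives no threshold, so `max_hits` is a hypothetical parameter, and the per-base counter array is an assumption chosen for clarity rather than the data structure BLASTZ uses.

```python
def align_with_dynamic_masking(hits, human_length, max_hits):
    """Keep hits until a human position has been hit max_hits times,
    then mask that position so later hits touching it are skipped.

    hits: iterable of (human_start, human_end) half-open intervals in the
    order in which the aligner discovers them.
    Returns (kept_hits, masked), where masked is a list of booleans.
    """
    counts = [0] * human_length
    masked = [False] * human_length
    kept = []
    for start, end in hits:
        if any(masked[start:end]):
            continue  # overlaps a masked segment: do not align here again
        kept.append((start, end))
        for pos in range(start, end):
            counts[pos] += 1
            if counts[pos] >= max_hits:
                masked[pos] = True
    return kept, masked
```

The payoff is that a human segment resembling a high-copy repeat stops generating new work once it has accumulated enough mouse hits.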
Implementation Issues and Hardware Environment 許秉慧 2005.11.30
Implementation Issues [Figure: human sequence (labels: Base 1, Base 2, Base 3; 10 kb, 10 kb, 1.01 Mb) aligned against the mouse sequence; gap-free segment score > 3000]
Implementation Issues and Hardware Environment • Input • 2.8 Gb of human sequence vs. 2.5 Gb of mouse sequence • Hardware • A cluster of 1024 833-MHz Pentium III CPUs • Time • 481 days of CPU time • Half a day of wall-clock time
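As a quick consistency check on the figures above, the CPU time divided evenly over the cluster gives the ideal wall-clock time. This assumes perfect load balancing with no overhead, which a real run will not achieve exactly; the function name is hypothetical.

```python
def ideal_wall_clock_hours(cpu_days, cpus):
    """Wall-clock hours if cpu_days of work is spread evenly over cpus."""
    return cpu_days / cpus * 24


hours = ideal_wall_clock_hours(481, 1024)
print(f"{hours:.1f} hours")  # about 11.3 hours, i.e. roughly half a day
```

The result agrees with the reported half day of wall-clock time.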
Software Evaluation 許秉慧 2005.11.30
Software Evaluation • Different classes of parameters and thresholds might be best tested in different ways. • Reverse the mouse sequence to measure specificity.
Reverse Mouse Sequence [Figure: the human sequence aligned to the mouse sequence (true match) and to the reversed mouse sequence (spurious match); a microsatellite sequence such as cacaca reads acacac when reversed]
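The reversal test can be illustrated with a short sketch, assuming that "reverse" means reversing the base order without complementing. Any alignment to the reversed sequence is then presumed spurious, with simple repeats such as microsatellites as the expected exception. The function names and the k-mer comparison are hypothetical illustrations, not part of BLASTZ.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_only(seq):
    """Reverse the base order without complementing (the specificity test)."""
    return seq[::-1]


def reverse_complement(seq):
    """Opposite strand, shown for contrast; this is NOT used in the test."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def shared_kmers(a, b, k):
    """Return the set of k-mers that occur in both sequences."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return kmers_a & kmers_b


# A microsatellite still matches itself after reversal...
print(shared_kmers("cacaca", reverse_only("cacaca"), 5))
# ...while a non-repetitive sequence does not.
print(shared_kmers("gattacagc", reverse_only("gattacagc"), 5))
```

This is why matches in microsatellite regions need separate treatment when counting alignments to the reversed sequence as false positives.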
Coverage by Outer Alignment 39.154% 0.164% -0.918% 0.037% -0.221% 0.075%
Coverage by Outer Alignment DNA sequence geno
Resources • B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: A Platform for Interactive Large-scale Genome Analysis,” Genome Research, 2005; 15: 1451–1455 • 宋建均、鄭智懷 2005.11.30
What is Galaxy? • It is a tool that allows users to gather and manipulate data from existing resources in a variety of ways. • Galaxy contains three major classes of data manipulation: • Query operations • Sequence analysis tools • Output displays
Why do we need Galaxy? • Galaxy differs from existing systems in its specificity for access to, and comparative analysis of, genomic sequences and alignments. • Programming experience is not required. • Galaxy is web-based software that can handle large sequence data sets.
Query Operations • Complement: compiles a list of regions that do not overlap with the current query (requires UCSC library). • Restrict: filters data based on chromosome name and region size (requires UCSC library). • Merge overlapping regions: overlapping regions within a single query are consolidated into fewer, larger regions (requires UCSC library). • Intersect: finds overlapping regions between two queries (requires UCSC library). • Union: finds all regions that are covered by either of the two queries, and returns either merged regions or the original regions from one of the queries (requires UCSC library).
Query Operations • Join Lists: joins two queries side by side to allow performing statistical analyses (requires UCSC library). • Cluster: finds clusters of regions within specified distance of each other (requires UCSC library). • Proximity: finds regions of one query within a specified distance of regions from another query (requires UCSC library). • Subtract: subtracts regions of one query from another query (requires UCSC library). • Join Same Coordinates Region: joins two queries, which have the same coordinates, side by side to allow performing statistical analyses (requires UCSC library).
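The region-based query operations can be sketched on plain lists of (start, end) intervals. This assumes a single chromosome and half-open coordinates as in BED files; the function names mirror the operation names on the slides but are not Galaxy's API, and Galaxy itself relies on the UCSC library for these operations.

```python
def merge(regions):
    """Consolidate overlapping or touching regions into fewer, larger ones."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a, b):
    """Regions covered by either query, returned as merged regions."""
    return merge(list(a) + list(b))


def intersect(a, b):
    """Regions covered by both queries."""
    out = []
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return out


def complement(regions, chrom_length):
    """Regions of the chromosome that the query does not cover."""
    out = []
    pos = 0
    for start, end in merge(regions):
        if pos < start:
            out.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_length:
        out.append((pos, chrom_length))
    return out


def subtract(a, b, chrom_length):
    """Parts of query a that are not covered by query b."""
    return intersect(a, complement(b, chrom_length))
```

Expressing Subtract as Intersect with a Complement keeps the sketch short; a production implementation would use a single sorted sweep instead of the nested loop in `intersect`.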
Sequence Analysis Tools • Extract sequences: uses a Perl wrapper around fasta-subseq to extract the sequences corresponding to BED file coordinates. Uses the alignseq.loc file to locate genomic sequences. Requires PATH to include the fasta-subseq location (requires Perl). • Extract BLASTZ alignments: uses a Perl wrapper for extractAxt (developed by Rico) to extract the genomic alignments corresponding to BED file coordinates. Uses alignseq.loc to find axt files. Requires PATH to include the extractAxt location (requires Perl).
Output Displays • UCSC and Ensembl Genome Browsers • EncodeDB at NHGRI • EnsMart at the Sanger Centre
Language • CGI: Perl • Core: C • Database: SQL
Other Features • Asynchronous query • User identity: cookies & assigning a sequential ID number to each terminal
Demo Thank you!