
Algorithms for Biological Sequence Analysis ─ Class Presentation

Algorithms for Biological Sequence Analysis ─ Class Presentation. Papers presented: “Human-Mouse Alignments with BLASTZ” and “Galaxy: A Platform for Interactive Large-scale Genome Analysis.” 許秉慧、陳怡靜、鄭智懷、宋建均, 2005.11.30.

Presentation Transcript


  1. Algorithms for Biological Sequence Analysis ─ Class Presentation • Human-Mouse Alignments with BLASTZ • Galaxy: A Platform for Interactive Large-scale Genome Analysis • 許秉慧、陳怡靜、鄭智懷、宋建均 2005.11.30

  2. S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, 2003; 13: 103–107. • B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: A Platform for Interactive Large-scale Genome Analysis,” Genome Research, 2005; 15: 1451–1455.

  3. Methods • S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, 2003; 13: 103–107 • 陳怡靜、許秉慧 2005.11.30

  4. Outline • Motivation and results • BLASTZ and modified BLASTZ • Implementation issues and hardware environment • Software evaluation

  5. Motivation and Results 陳怡靜 2005.11.30

  6. Motivation • Several existing programs sacrifice sensitivity to attain very short running times. • A program called BLASTZ attains an appropriate level of sensitivity and specificity. • A modified BLASTZ program is efficient enough to align entire mammalian genomes and has higher specificity.

  7. Results • The BLASTZ alignment program, which is used by the PipMaker web server (Schwartz et al. 2000), was modified. • The modified BLASTZ was used to compare all of the human sequence with all of the mouse sequence efficiently.

  8. BLASTZ and Modified BLASTZ 陳怡靜 2005.11.30

  9. Homologous • Two proteins are orthologous if they belong to different species, evolved from a common ancestral gene by speciation, and have retained the same function in the course of evolution. • Two proteins are paralogous if they arose by duplication within a genome and have evolved new functions.

  10. Human-Mouse Alignments • To find orthologous alignments • Natural consequence • We obtain the single best alignment by applying a program called axtBest, which filters out all but the best alignment within a sliding window of 10,000 bases. • [Figure: mouse aligned to human, Step 1]

  11. BLASTZ • BLASTZ follows the three-step strategy used by Gapped BLAST. 1) Find short near-exact matches 2) Extend each short match without allowing gaps 3) Extend each gap-free match that exceeds a certain threshold by a DP procedure that permits gaps

  12. BLASTZ • Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments. • First, BLASTZ has an option to require that the matching regions it reports occur in the same order and orientation in both sequences. • [Figure: Sequence 1 and Sequence 2]

  13. BLASTZ • Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments. • Second, BLASTZ uses an alignment-scoring scheme derived and evaluated by Chiaromonte et al. (2000). Nucleotide substitutions are scored by a fixed substitution matrix, and a gap of length k is penalized by subtracting 400 + 30k from the score.
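
The scoring scheme on this slide can be sketched in a few lines of Python. The substitution values below are the ones published with BLASTZ (Schwartz et al. 2003); the function names and the `-` gap symbol are illustrative assumptions, not part of any BLASTZ interface.

```python
BASES = "ACGT"

# Substitution scores published with BLASTZ (rows and columns in A, C, G, T order).
SCORES = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]


def substitution_score(a, b):
    """Score for aligning nucleotide a against nucleotide b."""
    return SCORES[BASES.index(a)][BASES.index(b)]


def gap_penalty(k):
    """Penalty for a gap of length k, as stated on the slide: 400 + 30k."""
    return 400 + 30 * k


def alignment_score(row1, row2):
    """Score two equal-length alignment rows; '-' marks a gap position."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    score = 0
    run = 0          # length of the gap run currently open
    run_side = None  # 1 if the gap is in row1, 2 if in row2, None otherwise
    for a, b in zip(row1, row2):
        side = 1 if a == "-" else 2 if b == "-" else None
        if side != run_side and run:
            score -= gap_penalty(run)  # close the previous gap run
            run = 0
        run_side = side
        if side is None:
            score += substitution_score(a, b)
        else:
            run += 1
    if run:
        score -= gap_penalty(run)
    return score


print(alignment_score("ACGT", "ACGT"))    # four identical columns
print(alignment_score("AC-GT", "ACTGT"))  # same columns plus one gap of length 1
```

Because every gap costs 400 before its length is counted, one long gap is cheaper than several short ones of the same total length.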

  14. Modified BLASTZ • The modified BLASTZ algorithm 1) Remove recent repeated elements 2) Run BLASTZ 3) Adjust positions in the alignment to refer to the original sequences 4) Filter the alignments

  15. Modified BLASTZ • Step 1 (an addition to BLASTZ): remove recent repeated elements. Why? • I. Y. Lee, D. Westaway, A. F. Smit, K. Wang, J. Seto, L. Chen, C. Acharya, M. Ankener, D. Baskin, C. Cooper, et al., “Complete Genomic Sequence and Analysis of the Prion Protein Gene Region from Three Mammalian Species,” Genome Research, 1998; 8: 1022–1037. • [Figure: Sequence 1 and Sequence 2]

  16. Modified BLASTZ • Step 2 (a modification of BLASTZ) • Find matching pairs of 12-mers; each 12-mer match allows a transition (A-G, G-A, C-T, or T-C) in any one of the 12 positions. • Extend the induced alignment in each direction, not allowing gaps. • Stop extending when the score decreases by more than some threshold. • [Figure: matching 12-mers in Sequence 1 and Sequence 2]
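
A minimal sketch of the seeding rule, assuming contiguous 12-mers as the slide describes and uppercase A/C/G/T input. The helper names (`seed_matches`, `find_seeds`, `one_transition_variants`) are hypothetical, and the dictionary lookup is a simplification chosen for readability rather than the indexing BLASTZ actually uses.

```python
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def seed_matches(word1, word2):
    """True if the words are identical or differ by exactly one transition."""
    if len(word1) != len(word2):
        return False
    diffs = [(a, b) for a, b in zip(word1, word2) if a != b]
    if not diffs:
        return True
    return len(diffs) == 1 and TRANSITION.get(diffs[0][0]) == diffs[0][1]


def one_transition_variants(word):
    """Yield the word itself and every variant with one transition applied."""
    yield word
    for pos, base in enumerate(word):
        if base in TRANSITION:
            yield word[:pos] + TRANSITION[base] + word[pos + 1:]


def find_seeds(seq1, seq2, k=12):
    """Return sorted (i, j) pairs where seq1[i:i+k] seeds against seq2[j:j+k]."""
    index = {}
    for i in range(len(seq1) - k + 1):
        index.setdefault(seq1[i:i + k], []).append(i)
    hits = []
    for j in range(len(seq2) - k + 1):
        for variant in one_transition_variants(seq2[j:j + k]):
            for i in index.get(variant, []):
                hits.append((i, j))
    return sorted(hits)


seq1 = "TTACGTACGTACGTTT"
seq2 = "GGGACGTACATACGTCC"  # same 12-mer with one G-to-A transition
print(find_seeds(seq1, seq2))
```

Allowing one transition makes the seed more tolerant of the most common kind of substitution without lowering the word length.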

  17. Modified BLASTZ • Step 2 (a modification of BLASTZ) • If the gap-free alignment scores more than 3000, repeat the extension step, but allow gaps. • Retain the alignment if it scores above 5000. • [Figure: matching 12-mers in Sequence 1 and Sequence 2]
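
The gap-free extension and the two thresholds can be sketched as follows. This is an illustration under stated assumptions: `simple_score` (+100 for a match, -100 for a mismatch) is a hypothetical stand-in for the real substitution matrix, `x_drop` stands for the unspecified "some threshold" at which extension stops, and the gapped dynamic-programming step itself is not shown.

```python
GAP_FREE_THRESHOLD = 3000  # from the slide
GAPPED_THRESHOLD = 5000    # from the slide


def simple_score(a, b):
    """Hypothetical stand-in for the real substitution matrix."""
    return 100 if a == b else -100


def _extend(seq1, seq2, p, q, step, x_drop):
    """Walk from (p, q) in one direction; return (best score gain, its length)."""
    best = running = 0
    best_len = length = 0
    while 0 <= p < len(seq1) and 0 <= q < len(seq2):
        running += simple_score(seq1[p], seq2[q])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score fell too far below the best seen so far
        p += step
        q += step
    return best, best_len


def extend_gap_free(seq1, seq2, i, j, k, x_drop):
    """Extend the seed seq1[i:i+k] / seq2[j:j+k] in both directions, no gaps.

    Returns (score, start1, start2, length) of the best gap-free segment.
    """
    seed = sum(simple_score(seq1[i + t], seq2[j + t]) for t in range(k))
    right, right_len = _extend(seq1, seq2, i + k, j + k, +1, x_drop)
    left, left_len = _extend(seq1, seq2, i - 1, j - 1, -1, x_drop)
    return seed + left + right, i - left_len, j - left_len, left_len + k + right_len


def needs_gapped_extension(gap_free_score):
    """Only gap-free segments scoring above 3000 go on to gapped extension."""
    return gap_free_score > GAP_FREE_THRESHOLD


def keep_alignment(gapped_score):
    """Gapped alignments are retained only if they score above 5000."""
    return gapped_score > GAPPED_THRESHOLD


seq1 = "CCCCACGTACGTACGTAAAA"
seq2 = "GGGGACGTACGTACGTTTTT"
print(extend_gap_free(seq1, seq2, 6, 6, 4, x_drop=150))
```

The extension keeps the best-scoring prefix in each direction, so trailing mismatches examined before the drop-off are not included in the reported segment.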

  18. Modified BLASTZ • Step 3 • If the distance l between two adjacent alignments is less than 50 kb, repeat Step 2 in that region, but using a more sensitive seeding procedure (e.g., 7-mer exact matches) and lower score thresholds both for gap-free alignments (e.g., 2200 instead of 3000) and for gapped alignments (e.g., 2200 instead of 5000). • [Figure: two adjacent alignments between Sequence 1 and Sequence 2, separated by distance l]

  19. Modified BLASTZ • Step 4: Adjust sequence positions in the resulting alignments to make them refer to the original sequences. • Step 5: Filter the alignments as appropriate for particular purposes. • Apply axtBest to find the best way to align each aligned human position. • [Figure: of several alignments between Sequence 1 and Sequence 2, the best one is chosen]
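
Step 4 is bookkeeping: positions found in the repeat-stripped sequences have to be shifted back to positions in the original sequences. A minimal sketch, assuming the removed repeats are recorded as sorted, non-overlapping, 0-based half-open intervals in original coordinates. The function names are hypothetical and this is not the BLASTZ implementation.

```python
def strip_repeats(seq, repeats):
    """Return seq with every (start, end) interval in repeats removed."""
    kept = []
    prev = 0
    for start, end in repeats:
        kept.append(seq[prev:start])
        prev = end
    kept.append(seq[prev:])
    return "".join(kept)


def to_original(pos, repeats):
    """Map a position in the stripped sequence back to the original sequence."""
    shift = 0
    for start, end in repeats:
        if start - shift <= pos:
            # This repeat was removed at or before pos, so skip over it.
            shift += end - start
        else:
            break
    return pos + shift


original = "abXXcdYYYef"
repeats = [(2, 4), (6, 9)]  # the X and Y runs play the role of repeats
stripped = strip_repeats(original, repeats)
print(stripped)
print([to_original(p, repeats) for p in range(len(stripped))])
```

Every stripped position maps to the original position holding the same base, which is the property an alignment needs in order to be reported against the original sequences.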

  20. Modified BLASTZ • Two changes to BLASTZ significantly improved its execution speed for aligning entire genomes. • When the program detects that many regions of the mouse genome align to the same human segment, that segment is dynamically masked. (Step 1 of the modified BLASTZ) • BLASTZ uses an 8-mer seeding procedure, whereas the modified BLASTZ uses a 12-mer seeding procedure. (Step 2 of the modified BLASTZ)

  21. Implementation Issues and Hardware Environment 許秉慧 2005.11.30

  22. Implementation Issues • [Figure: the human sequence is cut into segments (Base 1, Base 2, Base 3) of 1.01 Mb with 10-kb overlaps and compared with the mouse sequence; gap-free segments scoring > 3000 are marked]

  23. Implementation Issues and Hardware Environment • Input • 2.8 Gb of human sequence vs. 2.5 Gb of mouse sequence • Hardware • A cluster of 1024 833-MHz Pentium III processors • Time • 481 days of CPU time • About half a day of wall-clock time
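
The two timing figures are consistent with each other, as a quick calculation shows. This assumes the work divides perfectly across all 1024 processors, which is an idealization.

```python
def ideal_wall_clock_days(cpu_days, n_cpus):
    """Wall-clock time if the work divided perfectly across all CPUs."""
    return cpu_days / n_cpus


days = ideal_wall_clock_days(481, 1024)
hours = days * 24
print(f"{days:.2f} days = {hours:.1f} hours")
```

The result is a little under half a day, roughly 11 hours, which matches the wall-clock figure on the slide.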

  24. Software Evaluation 許秉慧 2005.11.30

  25. Software Evaluation • Different classes of parameters and thresholds might be best tested in different ways. • To measure specificity, align the human sequence to the reversed mouse sequence: any match found there is spurious.

  26. Reverse Mouse Sequence • [Figure: human, mouse, and reversed mouse sequences around a microsatellite (cacaca in the human and mouse sequences, acacac in the reversed mouse sequence), contrasting a true match with a spurious match]
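
A short sketch of why microsatellites can still produce matches against the reversed mouse sequence. Reversal here means reading the string backwards without complementing; the function names are illustrative.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reverse(seq):
    """Reverse a sequence without complementing it."""
    return seq[::-1]


def reverse_complement(seq):
    """Reverse complement, shown only for contrast with plain reversal."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


microsatellite = "cacaca"
reversed_seq = reverse(microsatellite)
print(reversed_seq)
print("caca" in reversed_seq)
print(reverse_complement(microsatellite))
```

A simple repeat such as cacaca reads as acacac when reversed, so it still contains runs of the original repeat unit and can align by chance even though the reversed sequence has no biological relationship to the human sequence.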

  27. Coverage by Outer Alignment • [Figure; values shown: 39.154%, 0.164%, -0.918%, 0.037%, -0.221%, 0.075%]

  28. Coverage by Outer Alignment DNA sequence geno

  29. Comparison of Genome Coverage

  30. Comparison of Covered Region

  31. Resources • B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: A Platform for Interactive Large-scale Genome Analysis,” Genome Research, 2005; 15: 1451–1455 • 宋建均、鄭智懷 2005.11.30

  32. What is Galaxy? • It is a tool that allows users to gather and manipulate data from existing resources in a variety of ways. • Galaxy contains three major classes of data manipulation: • Query operations • Sequence analysis tools • Output displays

  33. Why Do We Need Galaxy? • Galaxy differs from existing systems in its specificity for access to, and comparative analysis of, genomic sequences and alignments. • Programming experience is not required. • Galaxy is web-based software that can handle large sequence data sets.

  34. Query Operations • Complement: compiles a list of regions that do not overlap with the current query (requires UCSC library). • Restrict: filters data based on chromosome name and region size (requires UCSC library). • Merge overlapping regions: overlapping regions within a single query are consolidated into fewer, larger regions (requires UCSC library). • Intersect: finds overlapping regions between two queries (requires UCSC library). • Union: finds all regions that are covered by at least one of the two queries, and returns either merged regions or the original regions from one of the queries (requires UCSC library).
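
These operations are interval arithmetic on genomic coordinates. The sketch below shows one plausible reading of merge, intersect, union, and complement for regions on a single chromosome, using 0-based half-open `(start, end)` tuples. It is not Galaxy's code; the function names and the `chrom_length` parameter are assumptions made for illustration.

```python
def merge(regions):
    """Consolidate overlapping (or directly adjacent) regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Return the stretches covered by both query a and query b."""
    a, b = merge(a), merge(b)
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever region ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Return the merged regions covered by at least one of the two queries."""
    return merge(list(a) + list(b))


def complement(regions, chrom_length):
    """Return the stretches of [0, chrom_length) not covered by any region."""
    out = []
    prev = 0
    for start, end in merge(regions):
        if start > prev:
            out.append((prev, start))
        prev = max(prev, end)
    if prev < chrom_length:
        out.append((prev, chrom_length))
    return out


query = [(1, 5), (3, 8), (10, 12)]
print(merge(query))
print(intersect(query, [(4, 11)]))
print(complement(query, 15))
```

Sorting first keeps each operation to a single pass over the regions, which matters for the large data sets Galaxy is meant to handle.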

  35. Query Operations • Join Lists: joins two queries side by side so that statistical analyses can be performed (requires UCSC library). • Cluster: finds clusters of regions within a specified distance of each other (requires UCSC library). • Proximity: finds regions of one query within a specified distance of regions from another query (requires UCSC library). • Subtract: subtracts regions of one query from another query (requires UCSC library). • Join Same Coordinates Region: joins two queries that have the same coordinates side by side so that statistical analyses can be performed (requires UCSC library).
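
A companion sketch for subtract, proximity, and cluster, again on 0-based half-open `(start, end)` tuples for a single chromosome. Subtract is interpreted here at the base level (only the covered parts are removed); whether Galaxy removes partial or whole regions is not stated on the slide, so treat that as an assumption. All function names are hypothetical.

```python
def subtract(a, b):
    """Return the parts of regions in a that no region in b covers."""
    out = []
    b = sorted(b)
    for start, end in sorted(a):
        cur = start
        for b_start, b_end in b:
            if b_end <= cur:
                continue  # lies entirely before the remaining part
            if b_start >= end:
                break     # lies entirely after this region
            if b_start > cur:
                out.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


def distance(r1, r2):
    """Gap between two regions; 0 if they overlap or touch."""
    return max(0, max(r1[0], r2[0]) - min(r1[1], r2[1]))


def proximity(a, b, max_distance):
    """Return regions of a lying within max_distance of some region of b."""
    return [r for r in a if any(distance(r, s) <= max_distance for s in b)]


def cluster(regions, max_distance):
    """Group regions whose gap to the current cluster is at most max_distance."""
    clusters = []
    for region in sorted(regions):
        if clusters and region[0] - max(e for _, e in clusters[-1]) <= max_distance:
            clusters[-1].append(region)
        else:
            clusters.append([region])
    return clusters


print(subtract([(0, 10)], [(2, 4), (6, 8)]))
print(proximity([(0, 5), (100, 110)], [(8, 10)], max_distance=5))
print(cluster([(0, 5), (7, 9), (50, 60)], max_distance=3))
```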

  36. Sequence Analysis Tools • Extract sequences: uses a Perl wrapper written around fasta-subseq to extract sequences corresponding to BED file coordinates. Uses the alignseq.loc file to locate genomic sequences. Requires PATH to include the fasta-subseq location (requires Perl). • Extract BLASTZ alignments: uses a Perl wrapper for extractAxt (developed by Rico) to extract genomic alignments corresponding to BED file coordinates. Uses alignseq.loc to find axt files. Requires PATH to include the extractAxt location (requires Perl).
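
The idea behind "Extract sequences" can be illustrated in Python: read BED coordinates (0-based start, end exclusive) and slice the matching stretch out of each chromosome. This is not the Perl wrapper or fasta-subseq; the in-memory `genome` dictionary and the function names are assumptions made to keep the example self-contained.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    fields = line.split()
    return fields[0], int(fields[1]), int(fields[2])


def extract_sequences(bed_lines, genome):
    """Return (chrom, start, end, sequence) for each BED record.

    genome maps a chromosome name to its sequence as a string.
    """
    records = []
    for line in bed_lines:
        # Skip blank lines, comments, and BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = parse_bed_line(line)
        records.append((chrom, start, end, genome[chrom][start:end]))
    return records


genome = {"chr1": "ACGTACGTAC"}  # toy data, not a real chromosome
bed_lines = ["track name=example", "chr1\t2\t6", "chr1\t0\t4"]
for record in extract_sequences(bed_lines, genome):
    print(record)
```

Because BED intervals are half-open, they map directly onto Python slices with no off-by-one adjustment.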

  37. Output Displays • UCSC and Ensembl genome browsers • EncodeDB at NHGRI • EnsMart at the Sanger Centre

  38. Language • [Diagram labels: CGI, Perl, C core, SQL database]

  39. Other Features • Asynchronous query • User identity: cookies & assigning a sequential ID number to each terminal

  40. Demo Thank you!
